

### Parallel and Distributed Computing (B4B36PDV)

Matěj Kafka, Michal Jakob

kafkamat@fel.cvut.cz <u>https://pdv.pages.fel.cvut.cz</u>

### **Parallel and Distributed Computing**

What is the difference?

Parallel computing

Computing in distributed systems

## **Parallel and Distributed Computing**

What is the difference?

#### Parallel computing

Computing in distributed systems

Utilize multiple computation units to get the result faster.

"single computer" (shared memory) Utilize a network of separate computers to either get the result faster, or more reliably.

"multiple computers" (message passing)



More robust system

## PARALLEL COMPUTING

Making programs run faster using parallelization

#### **Motivation** End of frequency scaling

50 Years of Microprocessor Trend Data



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2021 by K. Rupp

#### Source: <u>https://github.com/karlrupp/microprocessor-trend-data</u>

### **Contemporary hardware**





#### **Threadripper PRO 7995WX**

- 96 cores (192 hyperthreads)
- 12.6 TFLOPS

#### **NVIDIA RTX 4090**

- 16384 shader units
- 178.8 TFLOPS

#### **Contemporary laptop hardware** Apple MacBook Pro (2024), 10 CPU cores, 1280 shader units



## Apple Silicon

**X4 28 billion transistors** N3E TSMC (2n gen. 3nm) Die zise : 1.28mm x1,21mm





#### **Contemporary laptop hardware** Surface Laptop Studio 2, 14 CPU cores, 20 threads, 3072 shader units



Source: <u>https://www.anandtech.com/</u>



#### **Compilers will not help us...** ...yet

- For single-threaded programs, compilers work hard to make our programs fast (and tend to be good at it).
- Contemporary compilers will **not** magically make our programs multi-threaded.
- Libraries can often help, but we still need to know where and how we want to run things in parallel.
- Parallelism does not easily compose.

## **DEMO 1: PARALLEL FOREACH**

**CPU-bound computation** 

Brain-bound computation?

## **DEMO 2: STUDENT SUM**

-----

#### Speedup is limited by serialized execution Amdahl's Law



S = speedup s = serial part P = core count

https://www.desmos.com/calculator/7hday0d3ne

## ORGANIZATION

B4B36PDV

### Who are we?

#### Lecturers





#### Tutors





Peter Macejko

Jakub Dupák

Jáchym Herynek Max Hollmann Adéla Kubíková







á David Milec

### What will we use?

#### Parallel computing

- C++20, OpenMP
- Linux / Mac / WSL
- JetBrains CLion / VS Code

 Knowledge from APO, OSY and ALG

#### Distributed computing

- Java 17
- Linux / Mac / Windows
- IntelliJ IDEA

 Knowledge from LGR and OSY

### How do we evaluate?

- Assignments (50%)
  - 7 small assignments
  - 2 large assignments
- Implementation exam (20%)
- Theoretical exam (30%)

You need to achieve at least 50% from each part to pass.

### How to succeed in PDV?

- **Review PRP, APO and OSY.** It will make the parallel part much easier.
  - C knowledge, pipelining, caches, threads, mutexes, race conditions

• Learn to combine high-level algorithmic **decomposition** with low-level **understanding of hardware**.

• Think while debugging. Randomly throwing code at the wall rarely fixes multithreading issues.

• Use "AI" chatbots wisely.

## QUICK REVISION OF CPU ARCHITECTURE

## Why is CPU architecture relevant?

Matrix-vector multiplication

```
float x[SIZE];
float y[SIZE];
float A[SIZE * SIZE];
```

Which version will be faster?

// VERSION 1
for (size\_t i = 0; i < SIZE; i++)
 for (size\_t j = 0; j < SIZE; j++)
 y[i] += A[i \* SIZE + j] \* x[j];</pre>

// VERSION 2
for (size\_t j = 0; j < SIZE; j++)
 for (size\_t i = 0; i < SIZE; i++)
 y[i] += A[i \* SIZE + j] \* x[j];</pre>

#### **CPU** structure

AMD Zen 5 annotated die shot



Source: <u>https://nemez.net/die/</u>

#### **CPU** structure

#### AMD Zen 5 annotated die shot



Source: <u>https://nemez.net/die/</u>

#### **CPU cache latency** Intel i7-10750H



Source: <a href="https://curiouscoding.nl/posts/cpu-benchmarks/">https://curiouscoding.nl/posts/cpu-benchmarks/</a>

### Hardware threads (MIMD)

Multiple instructions, multiple data

| Register | Register  | Register |
|----------|-----------|----------|
| Counter  | Counter   | Counter  |
| Stack    | Stack     | Stack    |
|          |           |          |
| Da       | ita Files | •        |
| Da       | rta Files |          |
|          |           |          |

Multiple threads of execution, each one with separate control unit and data

Multi-core CPUs (you should already know from OSY), hyper-threading

Single Process P with three threads

Source: <a href="https://www.tutorialspoint.com/operating\_system/os\_multi\_threading.htm">https://www.tutorialspoint.com/operating\_system/os\_multi\_threading.htm</a>

#### **SIMD** Single instruction, multiple data



Single pipeline, single control unit, multiple ALUs

"data parallelism"

GPUs, vector ALUs in CPUs, various parallel accelerators

Source: <a href="https://commons.wikimedia.org/wiki/File:SIMD2.svg">https://commons.wikimedia.org/wiki/File:SIMD2.svg</a>

#### Why is CPU architecture relevant? Array sum ("reduction")

# float array[SIZE]; float sum = 0.0f;

// split parts of the for loop
// between multiple threads
# pragma omp parallel for
for (size\_t i = 0; i < SIZE; i++) {
 sum += array[i];
}</pre>

## Why is CPU architecture relevant?

Array sum ("reduction"), improved version

# float array[SIZE]; float sums[THREAD\_COUNT] = {0.0f};

// split parts of the for loop
// between multiple threads
# pragma omp parallel for
for (size\_t i = 0; i < SIZE; i++) {
 sums[THREAD\_ID] += array[i];
}</pre>

#### Resources

Interesting articles used in this lecture (not mandatory)

- <u>https://www.cs.cmu.edu/~15418/schedule.html</u> course from Carnegie Mellon University, great slides, similar area but more in-depth
- <u>https://www.youtube.com/watch?v=eanvgGt-D1o</u> old recording of the first lecture from the course above
- <u>https://curiouscoding.nl/posts/cpu-benchmarks/</u> very well-done benchmarks of CPU cache latency
- <u>http://gotw.ca/publications/concurrency-ddj.htm</u>
   "The Free Lunch Is Over" by Herb Sutter article explaining why parallelism is now the answer to improving performance