Parallel Computing: shared memory
Alec Bills
A Walking Race Condition
1
1
1
1
Landscape of parallel computing
Shared Memory
(this week)
Distributed Memory
(next week)
MPI
Memory
thread
thread
thread
Memory
process
process
process
Memory
Memory
This week will be Julia.Threads, OpenMP is very similar
2
2
2
2
Fork-Join Parallelism
serial
thread0
thread1
thread2
thread3
thread4
serial
thread0
thread1
thread2
thread3
thread4
serial
Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020
3
3
3
3
How much speedup can we get from parallel computing (ignore cache effects, etc.)?
Speedup
Serial 1
Parallel 1
Serial 2
Parallel 2
Serial 3
4
4
4
4
Total Speedup: N serial regions, M parallel regions, P processors
Total time to run a program on P Processors
Speedup
5
5
5
5
Amdahl’s law: Parallel Programs almost never scale linearly!
Speedup
Serial Fraction
Embarrassingly parallel
6
6
6
6
In practice, programs don’t parallelize perfectly
serial
thread5
thread1
thread2
thread3
thread4
serial
thread1
thread2
thread3
thread4
thread5
serial
7
7
7
7
Starting threads and synchronization: two options
Options to start threads
@spawn
@threads [schedule] for …
Synchronization
8
8
8
8
First Parallel Program
9
9
9
9
Our sample problem
Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020
10
10
10
10
Homework solution (sample problem)
All benchmarks on h001
11
11
11
11
Pi Threading? (thanks, copilot)
12
12
12
12
Unit Testing: Make sure your code is correct
13
13
13
13
Race conditions: when 2 or more threads are simultaneously reading and writing the same data
var
thread0
thread1
…
var
14
14
14
14
Options to avoid race conditions
15
15
15
15
Task based parallelism
Spawn is not deterministic
16
16
16
16
Common Parallelism Pattern: Single Program Multiple Data (SPMD)
thread1
thread2
thread3
thread4
arr[1]
arr[2]
arr[3]
arr[4]
serial
serial
arr[1]
arr[2]
arr[3]
arr[4]
17
17
17
17
Task Based Concurrency
18
18
18
18
Multithreaded Loop (most Julian way to do this)
19
19
19
19
Static vs Dynamic Thread Scheduling
Static: Work is evenly divided among threads when they are spawned
Dynamic: Work is divided as it is completed (i.e. when thread N is done, it gets a new task)
In very small and uniform workflows, the overhead from dynamic outweighs its benefits. In cases of nonuniform workflow, use dynamic (default)
20
20
20
20
It takes time to schedule threads (but utilization can improve)
21
21
21
21
Atomics and Locks
22
22
22
22
htop: check where your code is actually running
23
23
23
23
Scaling up past 64 cores (N=1,000,000,000)…
24
24
24
24
Why doesn’t it scale well? Access cost is different on different cores
UMA
NUMA
25
25
25
25
Why doesn’t it scale well? False Sharing
Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020
26
26
26
26
Why doesn’t it scale well? False Sharing
Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020
27
27
27
27
Why doesn’t it scale well? It takes time to schedule threads
28
28
28
28
Why doesn’t it scale well? The scheduler may not be doing that great a job anyway
29
29
29
29
Check out these Packages for more features and less overhead
30
30
30
30
Conclusions
31
31
31
31
Further References (things I used to either learn this or make this deck)
32
32
32
32
Homework (will be released on Monday)
33
33
33
33