1 of 33

Parallel Computing: shared memory

Alec Bills

A Walking Race Condition

2 of 33

Landscape of parallel computing

Shared Memory

(this week)

Distributed Memory

(next week)

MPI

Memory

thread

Memory

process

Memory

This week will be Julia.Threads, OpenMP is very similar

3 of 33

Fork-Join Parallelism

serial

thread0

thread1

thread2

thread3

thread4

serial

thread0

thread1

thread2

thread3

thread4

serial

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

4 of 33

How much speedup can we get from parallel computing (ignore cache effects, etc.)?

Speedup

Serial 1

Parallel 1

Serial 2

Parallel 2

Serial 3

5 of 33

Total Speedup: N serial regions, M parallel regions, P processors

Total time to run a program on P Processors

Speedup

6 of 33

Amdahl’s law: Parallel Programs almost never scale linearly!

Speedup

Serial Fraction

Embarrassingly parallel

7 of 33

In practice, programs don’t parallelize perfectly

serial

thread5

thread1

thread2

thread3

thread4

serial

thread1

thread2

thread3

thread4

thread5

serial

8 of 33

Starting threads and synchronization: two options

Options to start threads

@spawn

Creates task and schedules it to run on any thread

@threads [schedule] for …

Run each iteration of the for loop in parallel

Synchronization

@sync expr: waits until all @spawn, etc. in expr are finished.

9 of 33

First Parallel Program

10 of 33

Our sample problem

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

11 of 33

Homework solution (sample problem)

All benchmarks on h001

12 of 33

Pi Threading? (thanks, copilot)

13 of 33

Unit Testing: Make sure your code is correct

Don’t write code without writing tests to check it
Return true for pass and false for fail
Can use to check for race conditions

Conservation laws
Analytical solutions

Julia provides Test package, python provides unittest module

@test
isapprox

14 of 33

Race conditions: when 2 or more threads are simultaneously reading and writing the same data

var

thread0

thread1

…

var

15 of 33

Options to avoid race conditions

Use thread-local variables
Operate on Array Elements
Synchronization
Atomics/Locks
Reductions (no built in threading reductions in Julia)

16 of 33

Task based parallelism

Spawn is not deterministic

17 of 33

Common Parallelism Pattern: Single Program Multiple Data (SPMD)

thread1

thread2

thread3

thread4

arr[1]

arr[2]

arr[3]

arr[4]

serial

arr[1]

arr[2]

arr[3]

arr[4]

18 of 33

Task Based Concurrency

19 of 33

Multithreaded Loop (most Julian way to do this)

20 of 33

Static vs Dynamic Thread Scheduling

Static: Work is evenly divided among threads when they are spawned

Dynamic: Work is divided as it is completed (i.e. when thread N is done, it gets a new task)

In very small and uniform workflows, the overhead from dynamic outweighs its benefits. In cases of nonuniform workflow, use dynamic (default)

21 of 33

It takes time to schedule threads (but utilization can improve)

22 of 33

Atomics and Locks

23 of 33

htop: check where your code is actually running

24 of 33

Scaling up past 64 cores (N=1,000,000,000)…

25 of 33

Why doesn’t it scale well? Access cost is different on different cores

UMA

NUMA

https://www.intel.com/content/dam/develop/external/us/en/documents/3-5-memmgt-optimizing-applications-for-numa-184398.pdf

26 of 33

Why doesn’t it scale well? False Sharing

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

27 of 33

Why doesn’t it scale well? False Sharing

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

https://stackoverflow.com/questions/52593588/julia-why-doesnt-shared-memory-multi-threading-give-me-a-speedup

28 of 33

Why doesn’t it scale well? It takes time to schedule threads

https://www.intel.com/content/dam/develop/external/us/en/documents/3-5-memmgt-optimizing-applications-for-numa-184398.pdf

29 of 33

Why doesn’t it scale well? The scheduler may not be doing that great a job anyway

30 of 33

Check out these Packages for more features and less overhead

Floops
Loopvectorization.jl

31 of 33

Conclusions

Use multithreading to accomplish “small-scale parallelism”
Parallel code doesn’t scale linearly
Test your code to avoid inaccuracies
Avoid using locks whenever possible
If you’re using a package for research, use its parallelism first

32 of 33

Further References (things I used to either learn this or make this deck)

https://book.sciml.ai/
Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

33 of 33

Homework (will be released on Monday)

Build multi-threaded heat equation solver (serial code will be given)