1 of 33

Parallel Computing: shared memory

Alec Bills

A Walking Race Condition

1

1

1

1

2 of 33

Landscape of parallel computing

Shared Memory

(this week)

Distributed Memory

(next week)

MPI

Memory

thread

thread

thread

Memory

process

process

process

Memory

Memory

This week will be Julia.Threads, OpenMP is very similar

2

2

2

2

3 of 33

Fork-Join Parallelism

serial

thread0

thread1

thread2

thread3

thread4

serial

thread0

thread1

thread2

thread3

thread4

serial

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

3

3

3

3

4 of 33

How much speedup can we get from parallel computing (ignore cache effects, etc.)?

 

Speedup

Serial 1

Parallel 1

Serial 2

Parallel 2

Serial 3

 

 

 

 

 

4

4

4

4

5 of 33

Total Speedup: N serial regions, M parallel regions, P processors

 

 

Total time to run a program on P Processors

Speedup

5

5

5

5

6 of 33

Amdahl’s law: Parallel Programs almost never scale linearly!

 

 

Speedup

Serial Fraction

 

Embarrassingly parallel

6

6

6

6

7 of 33

In practice, programs don’t parallelize perfectly

serial

thread5

thread1

thread2

thread3

thread4

serial

thread1

thread2

thread3

thread4

thread5

serial

7

7

7

7

8 of 33

Starting threads and synchronization: two options

Options to start threads

@spawn

  • Creates task and schedules it to run on any thread

@threads [schedule] for …

  • Run each iteration of the for loop in parallel

Synchronization

  • @sync expr: waits until all @spawn, etc. in expr are finished.

8

8

8

8

9 of 33

First Parallel Program

9

9

9

9

10 of 33

Our sample problem

 

 

 

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

 

10

10

10

10

11 of 33

Homework solution (sample problem)

All benchmarks on h001

11

11

11

11

12 of 33

Pi Threading? (thanks, copilot)

12

12

12

12

13 of 33

Unit Testing: Make sure your code is correct

  • Don’t write code without writing tests to check it
  • Return true for pass and false for fail
  • Can use to check for race conditions
    • Conservation laws
    • Analytical solutions
  • Julia provides Test package, python provides unittest module
    • @test
    • isapprox

13

13

13

13

14 of 33

Race conditions: when 2 or more threads are simultaneously reading and writing the same data

var

thread0

thread1

var

14

14

14

14

15 of 33

Options to avoid race conditions

  • Use thread-local variables
  • Operate on Array Elements
  • Synchronization
  • Atomics/Locks
  • Reductions (no built in threading reductions in Julia)

15

15

15

15

16 of 33

Task based parallelism

Spawn is not deterministic

16

16

16

16

17 of 33

Common Parallelism Pattern: Single Program Multiple Data (SPMD)

thread1

thread2

thread3

thread4

arr[1]

arr[2]

arr[3]

arr[4]

serial

serial

arr[1]

arr[2]

arr[3]

arr[4]

17

17

17

17

18 of 33

Task Based Concurrency

18

18

18

18

19 of 33

Multithreaded Loop (most Julian way to do this)

19

19

19

19

20 of 33

Static vs Dynamic Thread Scheduling

Static: Work is evenly divided among threads when they are spawned

Dynamic: Work is divided as it is completed (i.e. when thread N is done, it gets a new task)

In very small and uniform workflows, the overhead from dynamic outweighs its benefits. In cases of nonuniform workflow, use dynamic (default)

20

20

20

20

21 of 33

It takes time to schedule threads (but utilization can improve)

21

21

21

21

22 of 33

Atomics and Locks

22

22

22

22

23 of 33

htop: check where your code is actually running

23

23

23

23

24 of 33

Scaling up past 64 cores (N=1,000,000,000)…

24

24

24

24

25 of 33

Why doesn’t it scale well? Access cost is different on different cores

UMA

NUMA

25

25

25

25

26 of 33

Why doesn’t it scale well? False Sharing

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

26

26

26

26

27 of 33

Why doesn’t it scale well? False Sharing

Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

https://stackoverflow.com/questions/52593588/julia-why-doesnt-shared-memory-multi-threading-give-me-a-speedup

27

27

27

27

28 of 33

Why doesn’t it scale well? It takes time to schedule threads

28

28

28

28

29 of 33

Why doesn’t it scale well? The scheduler may not be doing that great a job anyway

29

29

29

29

30 of 33

Check out these Packages for more features and less overhead

  • Floops
  • Loopvectorization.jl

30

30

30

30

31 of 33

Conclusions

  • Use multithreading to accomplish “small-scale parallelism”
  • Parallel code doesn’t scale linearly
  • Test your code to avoid inaccuracies
  • Avoid using locks whenever possible
  • If you’re using a package for research, use its parallelism first

31

31

31

31

32 of 33

Further References (things I used to either learn this or make this deck)

  1. https://book.sciml.ai/
  2. Mattesson, Tim. “OpenMP and the fundamental design patterns of parallel programming.” ATPESC, 2020

32

32

32

32

33 of 33

Homework (will be released on Monday)

  • Build multi-threaded heat equation solver (serial code will be given)

33

33

33

33