1 of 15

Parallelism in Julia

Trying to get the best/easiest parallelism:

Warning: This isn’t easy or straightforward

2 of 15

MapReduce

Paradigm Example:

MAP: Generate Random Samples of Your Favorite Something (Data Parallel 101)

REDUCE: Take an average, sum, or histogram (Data Parallel 201)

3 of 15

MapReduce: A classic data parallel primitive

Paradigm Example:

MAP: Generate Random Samples of Your Favorite Something

REDUCE: Take an average, sum, or histogram

The lesson of parallelism:

You don’t have to follow the sequence as it is presented (big vector then reduce)

e.g. You can have a running sum/histogram saving memory/allocations (reduce as you go)

Big vector then reduce

reduce as you go

reduce(+, map(f, A))

mapreduce(f, +, A) # simple syntax

using Statistics # To get mean

mean([rand() for i=1:N])

using Statistics # To get mean

mean(rand() for i=1:N)

reduce(+, map(_->rand(), 1:N))/N

mapreduce(_->rand(), +, 1:N)/N
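As a sanity check, the two styles compute the same answer; here is a minimal sketch that substitutes a deterministic function (a hypothetical stand-in for rand(), so the two sides compare exactly):

```julia
f(_) = 0.5            # hypothetical stand-in for rand(), deterministic on purpose
N = 1_000

big_vector = reduce(+, map(f, 1:N)) / N   # materializes an N-element array first
as_you_go  = mapreduce(f, +, 1:N) / N     # no intermediate array allocated

big_vector == as_you_go                   # same answer, different memory behavior
```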

4 of 15

Example 1: Shooting Darts

using Distributions

function throw_dart()

x = rand(Uniform(-1, 1))

y = rand(Uniform(-1, 1))

return (x^2 + y^2) <= 1

end

Circle Area = π

Square Area = 2 x 2 = 4

r=1

julia> N=1000000; 4*mapreduce(_->throw_dart(), +, 1:N)/N

(Circle)/(Square) =

π/4
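A self-contained variant is also possible; this sketch avoids the Distributions dependency, since 2rand() - 1 is already uniform on (-1, 1):

```julia
# 2rand() - 1 draws uniformly from (-1, 1), so no Distributions needed
throw_dart() = (2rand() - 1)^2 + (2rand() - 1)^2 <= 1

N = 1_000_000
estimate = 4 * count(_ -> throw_dart(), 1:N) / N   # ≈ π
```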

5 of 15

Observation: Vectorization (really) doesn’t matter!

(and in this case is harder to read)

julia> @benchmark 4*mean(throw_dart() for _ in 1:10000000)

BenchmarkTools.Trial: 40 samples with 1 evaluation.

Range (min … max): 120.871 ms … 133.568 ms ┊ GC (min … max): 0.00% … 0.00%

Time (median): 124.788 ms ┊ GC (median): 0.00%

Time (mean ± σ): 125.968 ms ± 3.521 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

▂ █ ▂

█▁▁▁▅▅▅▅▅▅▁▁▅█▅▁█▅█▁▅▁▁▁▁▁▁█▁█▁██▅▁▁▁▁▁▅▁▁▁▁▁▅▅▁▅▁▁▁▁▁▁▅▅▁▁▅▅ ▁

121 ms Histogram: frequency by time 134 ms <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark 4*mean(sum(rand(Uniform(-1, 1), 10000000, 2).^2, dims=2) .<= 1)

BenchmarkTools.Trial: 38 samples with 1 evaluation.

Range (min … max): 122.799 ms … 164.660 ms ┊ GC (min … max): 2.01% … 1.98%

Time (median): 126.778 ms ┊ GC (median): 2.14%

Time (mean ± σ): 133.407 ms ± 13.121 ms ┊ GC (mean ± σ): 2.09% ± 0.35%

█▁▁▁

▇████▇▇▇▁▁▄▄▁▁▄▄▁▇▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄▁▁▁▁▄▁▁▄▁▁▄▁▁▄▄▁▁▁▄ ▁

123 ms Histogram: frequency by time 165 ms <

Memory estimate: 382.67 MiB, allocs estimate: 14.

6 of 15

Serial: For small matrices use StaticArrays

using LinearAlgebra

function throw_det(k)

det(randn(k, k))^2 # array on heap

end

using LinearAlgebra, StaticArrays

function throw_det_2(::Val{k}) where k

det(randn(SMatrix{k, k}))^2 # array on stack

end

throw_det_2(Val(4)) # Val(4) has type Val{4}, so the compiler specializes on k = 4

  • We want to avoid allocations as much as possible when multi-threading (ordinary arrays are allocated on the heap)
  • StaticArrays is ideal for small matrices and provides a fast determinant
  • Julia uses types to communicate with the compiler – hence Val{k}
    • Val{3} is a different type from Val{4}, etc.
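The stack-vs-heap claim can be verified directly with @allocated; this sketch assumes the throw_det_2 definition above:

```julia
using LinearAlgebra, StaticArrays

# k is a compile-time constant, so SMatrix{k, k} lives on the stack
throw_det_2(::Val{k}) where {k} = det(randn(SMatrix{k, k}))^2

throw_det_2(Val(4))                # warm up: compile the k = 4 specialization
@allocated(throw_det_2(Val(4)))    # 0 bytes: no heap traffic at all
```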

7 of 15

julia> @benchmark mean(throw_det(4) for _ in 1:1000000)

BenchmarkTools.Trial: 10 samples with 1 evaluation.

Range (min … max): 528.478 ms … 585.884 ms ┊ GC (min … max): 2.84% … 3.25%

Time (median): 558.212 ms ┊ GC (median): 2.91%

Time (mean ± σ): 555.475 ms ± 17.402 ms ┊ GC (mean ± σ): 2.97% ± 0.17%

█ █ ██ ██ █ ██ █

█▁▁▁▁▁▁▁▁▁█▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁█▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁

528 ms Histogram: frequency by time 586 ms <

Memory estimate: 457.76 MiB, allocs estimate: 3000000.

julia> @benchmark mean(throw_det_2(Val(4)) for _ in 1:1000000)

BenchmarkTools.Trial: 50 samples with 1 evaluation.

Range (min … max): 93.034 ms … 115.309 ms ┊ GC (min … max): 0.00% … 0.00%

Time (median): 98.659 ms ┊ GC (median): 0.00%

Time (mean ± σ): 100.321 ms ± 5.205 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

▁ ▁▁▁▁ ▁█▄▄ ▁ ▁▁ ▁ ▁

▆▆█▁▆▆▁████▁████▆▁█▆██▆▆▆▆▁▆▆▁▁█▆▁▁▁▁▆█▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▆▆▁▁▁▁▆ ▁

93 ms Histogram: frequency by time 115 ms <

Memory estimate: 0 bytes, allocs estimate: 0.

8 of 15

We think we are as fast as possible in serial, so now and only now do we consider parallel!

We start with threads

  • Shared Memory Parallelism
  • Works on all your laptops/desktops
  • Issues:
    • Every thread has its own stack
    • Contention for allocating data on the heap
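One common pattern for sidestepping contention (a sketch, distinct from the Folds approach shown later): give each task its own chunk and its own private partial sum, so there is no shared accumulator to fight over:

```julia
using Base.Threads

# Each task reduces its own chunk into a private partial sum;
# the partials are combined at the end, so no state is shared.
function chunked_sum(f, N)
    chunks = collect(Iterators.partition(1:N, cld(N, nthreads())))
    tasks  = map(c -> Threads.@spawn(sum(f, c)), chunks)
    return sum(fetch, tasks)
end

chunked_sum(identity, 100)   # 5050, same as sum(1:100)
```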

9 of 15

How we wish to write this multithreaded code:

(but alas not today)

Can be multithreaded with the Folds package

Uses all available threads by default

mean(throw_det_2(Val(4)) for _ in 1:1000000)

using Folds

N = 1000000

Folds.sum(throw_det_2(Val(4)) for _ in 1:N)/N

Folds.sum((throw_det_2(Val(4)) for _ in 1:N), ThreadedEx())/N # ThreadedEx is the default executor

using FoldsCUDA

Folds.sum((throw_det_2(Val(4)) for _ in 1:N), CUDAEx())/N

10 of 15

Why avoiding contention matters

11 of 15

For larger problems: Less contention (6 cores/12 HT)

When granularity goes up, contention may go down.

julia> @benchmark sum(_->throw_det(128), 1:100)

BenchmarkTools.Trial: 294 samples with 1 evaluation.

Range (min … max): 14.185 ms … 24.868 ms ┊ GC (min … max): 0.00% … 6.07%

Time (median): 16.500 ms ┊ GC (median): 4.97%

Time (mean ± σ): 16.966 ms ± 1.861 ms ┊ GC (mean ± σ): 3.65% ± 3.21%

▃▂ ▂▂▄▁▅█▇▄ ▅▁▄

▆▆▃▆██▇▆████████▆███▇█▇█▆▅▇▆▄▇▆▇▅▅▄▆▅▅▄▆▃▄▅▃▄▅▃▃▆▆▁▁▄▁▃▄▁▁▃ ▄

14.2 ms Histogram: frequency by time 22.1 ms <

Memory estimate: 25.11 MiB, allocs estimate: 500.

julia> @benchmark Folds.sum(_->throw_det(128), 1:100)

BenchmarkTools.Trial: 815 samples with 1 evaluation.

Range (min … max): 1.860 ms … 29.514 ms ┊ GC (min … max): 0.00% … 0.00%

Time (median): 4.404 ms ┊ GC (median): 0.00%

Time (mean ± σ): 6.106 ms ± 4.243 ms ┊ GC (mean ± σ): 4.51% ± 10.91%

▁▇█▆▆▆▃▁

▆████████▇▇▇▆▇▅▅▆▅▄▄▄▄▄▄▄▃▃▃▃▃▃▃▄▃▄▃▂▃▂▃▃▃▄▂▃▃▁▃▃▂▂▂▂▂▁▁▁▃ ▄

1.86 ms Histogram: frequency by time 20 ms <

Memory estimate: 25.12 MiB, allocs estimate: 651.

Speedup: 7.6x (comparing minimum times)

12 of 15

Contention in libraries: We need composability

  • Everyone works in their own silo
  • To avoid contention in LAPACK/BLAS, we will shut down its parallelism
  • It may only make sense to parallelize inside the linear algebra for much, much larger matrices, and this gets very hard

julia> @benchmark Folds.sum(_->throw_det(128), 1:100)/100

BenchmarkTools.Trial: 1 sample with 1 evaluation.

Single result which took 5.177 s (0.00% GC) to evaluate,

with a memory estimate of 25.12 MiB, over 655 allocations.

julia> LinearAlgebra.BLAS.set_num_threads(1)

julia> @benchmark Folds.sum(_->throw_det(128), 1:100)/100

BenchmarkTools.Trial: 815 samples with 1 evaluation.

Range (min … max): 1.860 ms … 29.514 ms ┊ GC (min … max): 0.00% … 0.00%

Time (median): 4.404 ms ┊ GC (median): 0.00%

Time (mean ± σ): 6.106 ms ± 4.243 ms ┊ GC (mean ± σ): 4.51% ± 10.91%

▁▇█▆▆▆▃▁

▆████████▇▇▇▆▇▅▅▆▅▄▄▄▄▄▄▄▃▃▃▃▃▃▃▄▃▄▃▂▃▂▃▃▃▄▂▃▃▁▃▃▂▂▂▂▂▁▁▁▃ ▄

1.86 ms Histogram: frequency by time 20 ms <

Memory estimate: 25.12 MiB, allocs estimate: 651.

13 of 15

Granularity Ranges (K = Problem Size, N = # Samples)

With threads:

  K < 16                 Parallelize Samples    StaticArrays
  16 <= K <= 5000?       Parallelize Samples    Serial Problem
  5000? <= K <= 10000?   Hybrid?                Some threads per Problem
  K >> 10000             Serialize Samples      Parallel Problem

With GPUs:

  K < 16, N > 1000       Parallelize Samples    StaticArrays on GPU
  16 <= K <= 5000?       Hybrid???              Unsolved
  5000? <= K <= 10000?   Hybrid                 Solve problem on GPU
  K >> 10000             Serialize Samples      Parallel Problem

14 of 15

Hybrid solution on GPU

julia> problems = [CUDA.randn(2000, 2000) for i in 1:100];

julia> @sync for p in problems
           Threads.@spawn begin
               CUDA.@allowscalar det(p)
               CUDA.synchronize()
           end
       end

15 of 15

Parallel prefix
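A prefix (scan) keeps every intermediate result of a reduction rather than just the final one; in Julia the serial reference is accumulate, and a parallel-prefix algorithm must produce the same array in O(log N) parallel steps. A one-line serial reference:

```julia
# Serial scan: out[i] = op(out[i-1], v[i]); a parallel prefix must match this
accumulate(+, [1, 2, 3, 4])   # [1, 3, 6, 10]
```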