Parallelism in Julia
Trying to get the best/easiest parallelism:
Warning: This isn’t easy or straightforward
MapReduce: A classic data parallel primitive
Paradigm Example:
MAP: Generate Random Samples of Your Favorite Something (Data Parallel 101)
REDUCE: Take an average, sum, or histogram (Data Parallel 201)
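In plain Julia the two steps look like this (a minimal sketch; the sampled quantity rand()^2 is just a placeholder):

using Statistics
# MAP: generate independent random samples of your quantity of interest
samples = map(_ -> rand()^2, 1:1000)
# REDUCE: combine the samples into a single summary
mean(samples)   # an average; sum(samples) or a histogram work the same way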
The lesson of parallelism:
You don’t have to follow the sequence as it is presented (build a big vector, then reduce)
e.g. you can keep a running sum/histogram, saving memory/allocations (reduce as you go)
| Big vector, then reduce | Reduce as you go |
| --- | --- |
| reduce(+, map(f, A)) | mapreduce(f, +, A) # simple syntax |
| using Statistics; mean([rand() for i=1:N]) | using Statistics; mean(rand() for i=1:N) |
| reduce(+, map(_->rand(), 1:N))/N | mapreduce(_->rand(), +, 1:N)/N |
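The allocation difference is easy to see with standard tools (a sketch; the function names are ours, and each function is called once first so compilation is not measured):

using Statistics
big_then_reduce(N) = mean([rand() for _ in 1:N])  # materializes an N-element vector
reduce_as_you_go(N) = mean(rand() for _ in 1:N)   # generator: nothing materialized
big_then_reduce(10); reduce_as_you_go(10)         # warm up / compile
@allocated big_then_reduce(10^7)                  # ≈ 8 × 10^7 bytes for the Float64 vector
@allocated reduce_as_you_go(10^7)                 # ≈ 0 bytes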
Example 1: Shooting Darts
using Distributions
# One dart: sample (x, y) uniformly on the square [-1,1]×[-1,1]
# and report whether it lands inside the unit circle.
function throw_dart()
    x = rand(Uniform(-1, 1))
    y = rand(Uniform(-1, 1))
    return (x^2 + y^2) ≤ 1
end
With r = 1: circle area = π, square area = 2 × 2 = 4, so (circle)/(square) = π/4.
The fraction of darts that land inside the circle therefore estimates π/4, and multiplying by 4 estimates π:
julia> N = 1000000; 4*mapreduce(_->throw_dart(), +, 1:N)/N
Observation: Vectorization (really) doesn’t matter!
(and in this case it is harder to read)
julia> @benchmark 4*mean(throw_dart() for _ in 1:10000000)
BenchmarkTools.Trial: 40 samples with 1 evaluation.
Range (min … max): 120.871 ms … 133.568 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 124.788 ms ┊ GC (median): 0.00%
Time (mean ± σ): 125.968 ms ± 3.521 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ █ ▂
█▁▁▁▅▅▅▅▅▅▁▁▅█▅▁█▅█▁▅▁▁▁▁▁▁█▁█▁██▅▁▁▁▁▁▅▁▁▁▁▁▅▅▁▅▁▁▁▁▁▁▅▅▁▁▅▅ ▁
121 ms Histogram: frequency by time 134 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark 4*mean(sum(rand(Uniform(-1, 1), 10000000, 2).^2, dims=2) .<= 1)
BenchmarkTools.Trial: 38 samples with 1 evaluation.
Range (min … max): 122.799 ms … 164.660 ms ┊ GC (min … max): 2.01% … 1.98%
Time (median): 126.778 ms ┊ GC (median): 2.14%
Time (mean ± σ): 133.407 ms ± 13.121 ms ┊ GC (mean ± σ): 2.09% ± 0.35%
█▁▁▁
▇████▇▇▇▁▁▄▄▁▁▄▄▁▇▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄▁▁▁▁▄▁▁▄▁▁▄▁▁▄▄▁▁▁▄ ▁
123 ms Histogram: frequency by time 165 ms <
Memory estimate: 382.67 MiB, allocs estimate: 14.
Serial: For small matrices use StaticArrays
using LinearAlgebra
function throw_det(k)
    det(randn(k, k))^2   # k×k Matrix allocated on the heap
end

using LinearAlgebra, StaticArrays
function throw_det_2(::Val{k}) where k
    det(randn(SMatrix{k, k}))^2   # k×k SMatrix lives on the stack
end

throw_det_2(Val(4))   # the size travels in the type, so it is a compile-time constant
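Why the Val indirection? An SMatrix is only stack-allocated when its size is part of its type, i.e. known at compile time; Val(4) lifts the size from a runtime value into the type domain. A quick sanity check with standard tools (call each function once first so compilation is not measured):

throw_det(4); throw_det_2(Val(4))   # warm up / compile
@allocated throw_det(4)             # > 0 bytes: the 4×4 Matrix lives on the heap
@allocated throw_det_2(Val(4))      # 0 bytes: the SMatrix never touches the heap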
julia> @benchmark mean(throw_det(4) for _ in 1:1000000)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 528.478 ms … 585.884 ms ┊ GC (min … max): 2.84% … 3.25%
Time (median): 558.212 ms ┊ GC (median): 2.91%
Time (mean ± σ): 555.475 ms ± 17.402 ms ┊ GC (mean ± σ): 2.97% ± 0.17%
█ █ ██ ██ █ ██ █
█▁▁▁▁▁▁▁▁▁█▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁█▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
528 ms Histogram: frequency by time 586 ms <
Memory estimate: 457.76 MiB, allocs estimate: 3000000.
julia> @benchmark mean(throw_det_2(Val(4)) for _ in 1:1000000)
BenchmarkTools.Trial: 50 samples with 1 evaluation.
Range (min … max): 93.034 ms … 115.309 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 98.659 ms ┊ GC (median): 0.00%
Time (mean ± σ): 100.321 ms ± 5.205 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁▁▁▁ ▁█▄▄ ▁ ▁▁ ▁ ▁
▆▆█▁▆▆▁████▁████▆▁█▆██▆▆▆▆▁▆▆▁▁█▆▁▁▁▁▆█▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▆▆▁▁▁▁▆ ▁
93 ms Histogram: frequency by time 115 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
We think we are as fast as possible in serial, so now (and only now) do we consider parallelism!
We start with threads.
How we wish we could write this multithreaded code (but alas, not today):
mean(throw_det_2(Val(4)) for _ in 1:1000000)
It can be multithreaded with the Folds package (threads = number available by default):
using Folds
N = 1000000
Folds.sum(throw_det_2(Val(4)) for _ in 1:N)/N
Folds.sum((throw_det_2(Val(4)) for _ in 1:N), ThreadedEx())/N # explicit form of the default
using FoldsCUDA
Folds.sum((throw_det_2(Val(4)) for _ in 1:N), CUDAEx())/N
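The same one-line change parallelizes the dart example from earlier. A sketch, assuming Julia was started with several threads (e.g. julia -t auto); throw_dart01 is a hypothetical variant of throw_dart sampling in (0,1)², which by symmetry has the same π/4 hit rate:

using Folds
throw_dart01() = rand()^2 + rand()^2 ≤ 1      # hit the quarter circle
N = 10^7
4 * Folds.sum(_ -> throw_dart01(), 1:N) / N   # ≈ π, reduced across all threads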
Why avoiding contention matters
For larger problems: less contention (measured on 6 cores / 12 hardware threads).
When granularity goes up, contention may go down.
julia> @benchmark sum(_->throw_det(128), 1:100)
BenchmarkTools.Trial: 294 samples with 1 evaluation.
Range (min … max): 14.185 ms … 24.868 ms ┊ GC (min … max): 0.00% … 6.07%
Time (median): 16.500 ms ┊ GC (median): 4.97%
Time (mean ± σ): 16.966 ms ± 1.861 ms ┊ GC (mean ± σ): 3.65% ± 3.21%
▃▂ ▂▂▄▁▅█▇▄ ▅▁▄
▆▆▃▆██▇▆████████▆███▇█▇█▆▅▇▆▄▇▆▇▅▅▄▆▅▅▄▆▃▄▅▃▄▅▃▃▆▆▁▁▄▁▃▄▁▁▃ ▄
14.2 ms Histogram: frequency by time 22.1 ms <
Memory estimate: 25.11 MiB, allocs estimate: 500.
julia> @benchmark Folds.sum(_->throw_det(128), 1:100)
BenchmarkTools.Trial: 815 samples with 1 evaluation.
Range (min … max): 1.860 ms … 29.514 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.404 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.106 ms ± 4.243 ms ┊ GC (mean ± σ): 4.51% ± 10.91%
▁▇█▆▆▆▃▁
▆████████▇▇▇▆▇▅▅▆▅▄▄▄▄▄▄▄▃▃▃▃▃▃▃▄▃▄▃▂▃▂▃▃▃▄▂▃▃▁▃▃▂▂▂▂▂▁▁▁▃ ▄
1.86 ms Histogram: frequency by time 20 ms <
Memory estimate: 25.12 MiB, allocs estimate: 651.
Speedup: 7.6x (comparing minimum times, 14.2 ms vs 1.86 ms)
Contention in libraries: We need composability
By default BLAS runs its own threads inside det; they contend with Julia’s threads, and the Folds speedup disappears until BLAS is restricted to a single thread:
julia> @benchmark Folds.sum(_->throw_det(128), 1:100)/100
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 5.177 s (0.00% GC) to evaluate,
with a memory estimate of 25.12 MiB, over 655 allocations.
julia> LinearAlgebra.BLAS.set_num_threads(1)
julia> @benchmark Folds.sum(_->throw_det(128), 1:100)/100
BenchmarkTools.Trial: 815 samples with 1 evaluation.
Range (min … max): 1.860 ms … 29.514 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.404 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.106 ms ± 4.243 ms ┊ GC (mean ± σ): 4.51% ± 10.91%
▁▇█▆▆▆▃▁
▆████████▇▇▇▆▇▅▅▆▅▄▄▄▄▄▄▄▃▃▃▃▃▃▃▄▃▄▃▂▃▂▃▃▃▄▂▃▃▁▃▃▂▂▂▂▂▁▁▁▃ ▄
1.86 ms Histogram: frequency by time 20 ms <
Memory estimate: 25.12 MiB, allocs estimate: 651.
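If BLAS threading is needed again afterwards, the setting can be saved and restored (a sketch; LinearAlgebra.BLAS.get_num_threads is available from Julia 1.6 on):

using LinearAlgebra
nblas = LinearAlgebra.BLAS.get_num_threads()  # remember the current setting
LinearAlgebra.BLAS.set_num_threads(1)         # single-threaded BLAS: no contention with Folds
Folds.sum(_->throw_det(128), 1:100)/100
LinearAlgebra.BLAS.set_num_threads(nblas)     # restore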
Granularity Ranges (K = problem size, N = # samples)

On the CPU:
| K < 16 | 16 <= K <= 5000? | 5000? <= K <= 10000? | K >> 10000 |
| --- | --- | --- | --- |
| Parallelize Samples | Hybrid? | Serialize Samples | |
| StaticArrays | Serial Problem | Some threads per Problem | Parallel Problem |

With GPUs:
| K < 16, N > 1000 | 16 <= K <= 5000? | 5000? <= K <= 10000? | K >> 10000 |
| --- | --- | --- | --- |
| Parallelize Samples | Hybrid??? | Hybrid | Serialize Samples |
| StaticArrays on GPU | Unsolved | Solve problem on GPU | Parallel Problem |
Hybrid solution on GPU
julia> using CUDA, LinearAlgebra

julia> problems = [CUDA.randn(2000, 2000) for i in 1:100];

julia> @sync for p in problems
           Threads.@spawn begin
               CUDA.@allowscalar det(p)  # det needs scalar reads of the diagonal
               CUDA.synchronize()        # wait for this problem's GPU work
           end
       end
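The point of the hybrid: each spawned CPU task drives the GPU work for one 2000×2000 problem, so the launches for several determinants can overlap rather than queueing strictly one after another. CUDA.@allowscalar is needed because det reads the diagonal of the factorization element by element, and CUDA.synchronize() keeps a task from returning before its GPU work has finished.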
Parallel prefix