GPU Acceleration
Rabab Alomairy & Evelyne Ringoot
GPU vs. CPU
CPU: latency-oriented design. GPU: bandwidth-oriented design.
Heterogeneous architecture
Overview of GPU Architecture
Streaming multiprocessor (SM)
How to program a GPU
Mapping Software to Hardware
Memory Hierarchy
SIMT Programming model
(Figure: threads are grouped into blocks, and blocks into a grid, indexed along the x and y dimensions.)
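Jumping ahead to the CUDA.jl syntax introduced later, here is a minimal sketch that makes the thread/block hierarchy concrete; the kernel name is illustrative and not from the slides.

using CUDA

# Each thread writes its block index and its local thread index into one row of `out`.
function index_kernel!(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x   # global thread index
    out[i, 1] = blockIdx().x
    out[i, 2] = threadIdx().x
    return
end

out = CUDA.zeros(Int32, 8, 2)
@cuda threads=4 blocks=2 index_kernel!(out)
Array(out)   # 8 rows: blocks 1-2, with threads 1-4 inside each block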
Existing GPUs and their Software
Hardware | NVIDIA | AMD  | Intel  | Apple
Software | CUDA   | ROCm | oneAPI | Metal
Typical Execution Dataflow
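The slide's dataflow figure is not reproduced here; the sketch below, assuming CUDA.jl, shows the usual pattern of copying inputs to the device, computing there, and copying the result back.

using CUDA

a  = rand(Float32, 1024)   # input data in host (CPU) memory
da = CuArray(a)            # 1. copy host -> device
db = 2 .* da .+ 1.0f0      # 2. compute on the device (a fused broadcast kernel)
b  = Array(db)             # 3. copy the result back to the host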
NVIDIA Ampere A100
ASCI White supercomputer
Lawrence Livermore National Laboratory
Ranked #1 on the TOP500 in 2000-2001, 7.9 TFLOPS
NVIDIA H100: offers up to 30 TFLOPS of FP64 performance, over 3x the FP64 performance of the A100
GB200 Grace Blackwell Superchip: provides 90 TFLOPS of FP64 performance
NVIDIA CUDA: CUDA.jl
AMD ROCm: AMDGPU.jl
Intel oneAPI: oneAPI.jl
Performance portability layer: KernelAbstractions.jl
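As a sketch of what the portability layer looks like: the same kernel, written once with KernelAbstractions.jl, can target any of the backends above. The exact launch API differs slightly across package versions, so treat this as an illustration rather than the package's definitive interface.

using CUDA, KernelAbstractions

# A backend-agnostic kernel: the same code can target CUDA, ROCm, oneAPI, or the CPU.
@kernel function vadd_ka!(c, a, b)
    i = @index(Global)        # global linear index of this work-item
    c[i] = a[i] + b[i]
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

backend = get_backend(a)                         # here: the CUDA backend
vadd_ka!(backend)(c, a, b; ndrange = length(c))  # launch over the whole array
KernelAbstractions.synchronize(backend)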
Installation
pkg> add CUDA

julia> using CUDA

julia> CUDA.version()
Downloading artifact: CUDA10.2
Downloading artifact: CUDNN+CUDA10.2
Downloading artifact: CUTENSOR+CUDA10.2
v"10.2.89"

julia> CUDA.functional()
true
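Once CUDA.functional() returns true, a quick way to see which GPU is in use (a sketch; the output depends on the machine):

using CUDA

dev = CUDA.device()          # the currently active GPU
CUDA.name(dev)               # device name string (machine-dependent)
length(CUDA.devices())       # number of visible GPUs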
Array programming
julia> CuArray{Float32,2}(undef, 2, 2)
2×2 CuArray{Float32,2}:
 0.0  0.0
 0.0  0.0

julia> a = CuArray([1 2 3])
1×3 CuArray{Int64,2}:
 1  2  3

julia> similar(a)
1×3 CuArray{Int64,2}:
 0  0  0

julia> b = Array(a)
1×3 Array{Int64,2}:
 1  2  3
Goal: API compatibility with Base.Array
Array programming
julia> CUDA.ones(2)
2-element CuArray{Float32,1}:
 1.0
 1.0

julia> CUDA.zeros(Float32, 2)
2-element CuArray{Float32,1}:
 0.0
 0.0

julia> CUDA.fill(42, (3,4))
3×4 CuArray{Int64,2}:
 42  42  42  42
 42  42  42  42
 42  42  42  42

julia> CUDA.rand(2, 2)
2×2 CuArray{Float32,2}:
 0.73055   0.843176
 0.939997  0.61159
Array programming
julia> a = CuArray{Float32}(undef, (2,2));
CURAND
julia> rand!(a)
2×2 CuArray{Float32,2}:
 0.73055   0.843176
 0.939997  0.61159

CUBLAS
julia> a * a
2×2 CuArray{Float32,2}:
 1.32629  1.13166
 1.26161  1.16663

CUSOLVER
julia> LinearAlgebra.qr!(a)
CuQR{Float32,CuArray{Float32,2}} with factors Q and R:
Float32[-0.613648 -0.78958; -0.78958 0.613648]
Float32[-1.1905 -1.00031; 0.0 -0.290454]

CUFFT
julia> CUFFT.plan_fft(a) * a
2×2 CuArray{Complex{Float32},2}:
 -1.99196+0.0im   0.589576+0.0im
 -2.38968+0.0im  -0.969958+0.0im

CUDNN
julia> softmax(real(ans))
2×2 CuArray{Float32,2}:
 0.15712  0.32963
 0.84288  0.67037

CUSPARSE
julia> sparse(a)
2×2 CuSparseMatrixCSR{Float32,Int32} with 4 stored entries:
  [1, 1]  =  -1.1905
  [2, 1]  =  0.489313
  [1, 2]  =  -1.00031
  [2, 2]  =  -0.290454
Array programming
julia> a = CuArray([1 2 3])
julia> b = CuArray([4 5 6])

julia> map(a) do x
           x + 1
       end
1×3 CuArray{Int64,2}:
 2  3  4

julia> a .+ 2b
1×3 CuArray{Int64,2}:
 9  12  15

julia> reduce(+, a)
6

julia> accumulate(+, b; dims=2)
1×3 CuArray{Int64,2}:
 4  9  15

julia> findfirst(isequal(2), a)
CartesianIndex(1, 2)
Powerful array language: obviates the need for custom kernels and makes it possible to write generic code.
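As a sketch of what "generic code" means here (the helper name below is illustrative, not from the slides): a function written purely in terms of array operations runs unchanged on the CPU or the GPU, depending on the array type it receives.

using CUDA

# Illustrative helper: normalize each column of a matrix.
# Written only with array operations, so the same method works for Array and CuArray alike.
function normalize_columns(A)
    s = sqrt.(sum(abs2, A; dims=1))   # column norms, computed wherever A lives
    return A ./ s
end

A = rand(Float32, 4, 4)
normalize_columns(A)            # runs on the CPU
normalize_columns(CuArray(A))   # identical code, runs on the GPU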
Vector Addition Example
vector_size = 1024
a = rand(1:4, vector_size)
b = rand(1:4, vector_size)
c = zeros(Int, vector_size)
function vadd(c, a, b)
    Threads.@threads for i in 1:vector_size
        c[i] = a[i] + b[i]
    end
    return
end
CPU code
da = CuArray(a)
db = CuArray(b)
dc = CUDA.zeros(Int, size(a))
function vadd(c, a, b)
    i = threadIdx().x
    c[i] = a[i] + b[i]
    return
end

@cuda threads=length(a) vadd(dc, da, db)
GPU (CUDA) kernel
Notice: only the local thread index is used. Any ideas what the limitation is here?
Enhanced GPU Vector Addition
da = CuArray(a)
db = CuArray(b)
dc = CUDA.zeros(Int, size(a))
function vadd(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    c[i] = a[i] + b[i]
    return
end

@cuda threads=1024 blocks=cld(length(da), 1024) vadd(dc, da, db)
This computes the global index of the thread within the grid from its block index and its local thread index.
(Figure: a 1-D grid with gridDim().x = 2048 blocks; blockIdx().x identifies the block, from 1 to 2048, and threadIdx().x runs from 1 to 256 within each block.)
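One caveat worth making explicit (an addition, not on the slides): with blocks=cld(length(da), 1024), the grid can contain more threads than array elements whenever the length is not a multiple of 1024, so a bounds check keeps the extra threads from writing out of range.

using CUDA

# Same kernel with a bounds check, so the launch is safe for any vector length.
function vadd_safe(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

da = CUDA.rand(Float32, 1500)   # deliberately not a multiple of 1024
db = CUDA.rand(Float32, 1500)
dc = CUDA.zeros(Float32, 1500)
@cuda threads=1024 blocks=cld(length(dc), 1024) vadd_safe(dc, da, db)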