1 of 23

GPU Acceleration

Rabab Alomairy & Evelyne Ringoot

2 of 23

GPU vs. CPU

[Figure: CPU design is latency-oriented; GPU design is bandwidth-oriented]


3 of 23

Heterogeneous architecture

[Figure: a heterogeneous node pairs a latency-oriented CPU with a bandwidth-oriented GPU]


4 of 23

Overview of GPU Architecture: NVIDIA A100 (Ampere)


  • Transistor Count: 54.2 billion

  • Die Area: 826 mm²

  • Architecture: Ampere

  • Memory: Up to 80 GB HBM2e

  • Peak FP64 Performance: ~19.5 TFLOP/s (A100 80GB SXM)

  • Interconnect: NVLink (up to 600 GB/s), PCIe Gen4

  • 128 SMs on the full GA100 die (108 enabled on the A100 product)

5 of 23

Overview of GPU Architecture: NVIDIA H100 (Hopper)


  • Transistor Count: 80 billion

  • Die Area: 814 mm²

  • Architecture: Hopper

  • Memory: Up to 96 GB HBM2e

  • Peak FP64 Performance: ~34 TFLOP/s (H100 96GB SXM)

  • Interconnect: NVLink (up to 900 GB/s), PCIe Gen5

  • 132 SMs (H100 SXM; the full GH100 die has 144)
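As a sanity check on the peak FP64 figure, a back-of-the-envelope estimate reproduces it; the per-SM FP64 unit count (64) and boost clock (~1.98 GHz) used below are assumptions not stated on this slide:

# Rough peak-FP64 estimate for an H100 SXM (assumed: 64 FP64 units per SM,
# ~1.98 GHz boost clock, 2 FLOPs per fused multiply-add).
sms           = 132
fp64_units    = 64
clock_hz      = 1.98e9
flops_per_fma = 2
peak = sms * fp64_units * clock_hz * flops_per_fma
println(peak / 1e12, " TFLOP/s")   # ≈ 33.4, i.e. the ~34 TFLOP/s quoted above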

6 of 23

Streaming multiprocessor (SM)


7 of 23

Streaming multiprocessor (SM)

  • Each SM contains four processing blocks that share an L1 cache for data caching.
  • Each processing block has:
    • a warp scheduler
    • 16 INT32 CUDA cores
    • 16 FP32 CUDA cores
    • 8 FP64 CUDA cores
    • 8 load/store units
    • a Tensor Core for matrix multiplication
    • a 16K 32-bit register file
  • The maximum number of thread blocks per SM is 32.


8 of 23

How to program a GPU

  • Single-Instruction, Multiple-Threads (SIMT).
  • SIMT allows GPUs to execute the same instruction across multiple threads.
  • The programming perspective is that of a single thread.
  • Threads are grouped into blocks.
  • Blocks are grouped into grids.
  • Threads within a block can access shared memory and synchronize.
  • GPUs are designed to maximize throughput.
  • Transferring data from the CPU to the GPU can be a bottleneck (see the sketch below).
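As a rough illustration of that transfer cost, a minimal sketch (array size chosen arbitrarily) that times a host-to-device copy with CUDA.jl:

using CUDA

x  = rand(Float32, 2^20)             # host array
dx = CuArray{Float32}(undef, 2^20)   # device array

t = CUDA.@elapsed copyto!(dx, x)     # time the host-to-device copy
println(sizeof(x) / t / 1e9, " GB/s effective transfer rate")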


9 of 23

Mapping Software to Hardware

  • Thread: Smallest unit of execution, executes a kernel function.
  • Scalar Processor (core): Executes one thread. Multiple cores live inside an SM.
  • Thread Block: Group of threads executed on a single SM. Threads in a block can share memory and synchronize.
  • SM (Streaming Multiprocessor): Executes thread blocks. Contains cores, Tensor Cores, registers, shared memory, etc.
  • Grid: Collection of thread blocks. Can span across multiple SMs.
  • GPU Device: The full hardware accelerator; includes many SMs and memory hierarchies (L2 cache, global memory, etc.).
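A small sketch of how these hardware limits can be queried from Julia; the attribute names below follow CUDA.jl's driver-API wrappers and are given as an assumption, not an authoritative listing:

using CUDA

dev = device()
println("device:            ", name(dev))
println("SM count:          ", attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT))
println("max threads/block: ", attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))
println("warp size:         ", warpsize(dev))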


10 of 23

Memory Hierarchy

  • Each thread has private local memory.

  • Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block.

  • Thread blocks in a thread block cluster can perform read, write, and atomic operations on each other's shared memory.

  • All threads have access to the same global memory.
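A minimal sketch of per-block shared memory in a CUDA.jl kernel (assuming CuStaticSharedArray and sync_threads; the block-local reversal is purely illustrative):

using CUDA

# Each block of 256 threads reverses its own chunk via shared memory.
function reverse_block!(out, inp)
    tid    = threadIdx().x
    offset = (blockIdx().x - 1) * blockDim().x
    tile   = CuStaticSharedArray(Float32, 256)   # visible to the whole block
    tile[tid] = inp[offset + tid]
    sync_threads()                               # wait until the tile is filled
    out[offset + tid] = tile[blockDim().x - tid + 1]
    return
end

inp = CUDA.rand(Float32, 1024)
out = similar(inp)
@cuda threads=256 blocks=4 reverse_block!(out, inp)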


11 of 23

SIMT Programming model


[Figure: a two-dimensional grid of blocks; each block is a two-dimensional arrangement of threads along x and y]

  • Global thread ID: computed from the thread's position within its block (threadIdx), the block's position within the grid (blockIdx), and the block dimensions (blockDim).
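In CUDA.jl (1-based indexing) the global x index is typically computed as in this one-line sketch, the same pattern used in the vector-addition example later:

i = threadIdx().x + (blockIdx().x - 1) * blockDim().x   # global thread index along x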

12 of 23

Existing GPUs and their Software


Hardware:  NVIDIA | AMD  | Intel  | Apple
Software:  CUDA   | ROCm | oneAPI | Metal

13 of 23

Typical Execution Dataflow


14 of 23

NVIDIA Ampere A100

  • The NVIDIA A100 GPU:
  • It has a 40 MB L2 cache, which helps reduce latency.
  • It supports up to 80 GB of HBM2e memory with a maximum memory bandwidth of 2,039 GB/s, essential for handling large data and reducing data-transfer overhead.
  • In total it contains 6912 FP32/INT32 CUDA cores and 3456 FP64 CUDA cores.
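A rough, unofficial sketch for estimating effective device-memory bandwidth with a timed device-to-device copy (array size arbitrary; results vary with GPU and clocks):

using CUDA

n   = 2^26                          # 64 Mi Float32 values (256 MB)
src = CUDA.rand(Float32, n)
dst = similar(src)

copyto!(dst, src)                   # warm-up
t = CUDA.@elapsed copyto!(dst, src) # timed copy: reads n and writes n values
println(2 * sizeof(src) / t / 1e9, " GB/s effective bandwidth")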


ASCI White supercomputer (Lawrence Livermore National Laboratory): #1 on the TOP500 from 2000 to 2001, at roughly 7.2 TFLOP/s (LINPACK)

NVIDIA H100: offers up to ~34 TFLOP/s of FP64 performance (SXM), over 3x the FP64 performance of the A100

GB200 Grace Blackwell Superchip: provides 90 TFLOP/s of FP64 performance

15 of 23

NVIDIA CUDA: CUDA.jl

AMD ROCm: AMDGPU.jl

Intel oneAPI: oneAPI.jl

Performance portability layer: KernelAbstractions.jl
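A minimal portability sketch with KernelAbstractions.jl (assuming its @kernel/@index API; the same kernel runs on another vendor's backend by swapping the array type):

using KernelAbstractions, CUDA

@kernel function vadd_kernel!(c, a, b)
    i = @index(Global)
    c[i] = a[i] + b[i]
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

backend = get_backend(c)                         # CUDABackend() for CuArrays
vadd_kernel!(backend, 256)(c, a, b; ndrange=length(c))
KernelAbstractions.synchronize(backend)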

16 of 23

Installation

pkg> add CUDA

julia> using CUDA

julia> CUDA.version()
Downloading artifact: CUDA10.2
Downloading artifact: CUDNN+CUDA10.2
Downloading artifact: CUTENSOR+CUDA10.2
v"10.2.89"

julia> CUDA.functional()
true

  • Automatic installation of CUDA
    • lazily, on first use
    • unless the environment contains JULIA_CUDA_USE_BINARYBUILDER=false
  • No need for Requires.jl
  • Container-friendly
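A small follow-up sketch for checking what CUDA.jl picked up after installation (output is machine-dependent):

using CUDA

CUDA.versioninfo()                  # toolkit, driver and package versions
if CUDA.functional()
    for dev in CUDA.devices()
        println(dev, ": ", name(dev))
    end
end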


17 of 23

Array programming

julia> a = CuArray([1 2 3])
1×3 CuArray{Int64,2}:
 1  2  3

julia> b = Array(a)
1×3 Array{Int64,2}:
 1  2  3

julia> CuArray{Float32,2}(undef, 2, 2)
2×2 CuArray{Float32,2}:
 0.0  0.0
 0.0  0.0

julia> similar(a)
1×3 CuArray{Int64,2}:
 0  0  0

Goal: API compatibility with Base.Array


18 of 23

Array programming

julia> CUDA.ones(2)
2-element CuArray{Float32,1}:
 1.0
 1.0

julia> CUDA.zeros(Float32, 2)
2-element CuArray{Float32,1}:
 0.0
 0.0

julia> CUDA.fill(42, (3, 4))
3×4 CuArray{Int64,2}:
 42  42  42  42
 42  42  42  42
 42  42  42  42

julia> CUDA.rand(2, 2)
2×2 CuArray{Float32,2}:
 0.73055   0.843176
 0.939997  0.61159

Goal: API compatibility with Base.Array


19 of 23

Array programming

julia> a = CuArray{Float32}(undef, (2,2));

CURAND:
julia> rand!(a)
2×2 CuArray{Float32,2}:
 0.73055   0.843176
 0.939997  0.61159

CUBLAS:
julia> a * a
2×2 CuArray{Float32,2}:
 1.32629  1.13166
 1.26161  1.16663

CUSOLVER:
julia> LinearAlgebra.qr!(a)
CuQR{Float32,CuArray{Float32,2}} with factors Q and R:
Float32[-0.613648 -0.78958; -0.78958 0.613648]
Float32[-1.1905 -1.00031; 0.0 -0.290454]

CUFFT:
julia> CUFFT.plan_fft(a) * a
2×2 CuArray{Complex{Float32},2}:
 -1.99196+0.0im   0.589576+0.0im
 -2.38968+0.0im  -0.969958+0.0im

CUDNN:
julia> softmax(real(ans))
2×2 CuArray{Float32,2}:
 0.15712  0.32963
 0.84288  0.67037

CUSPARSE:
julia> sparse(a)
2×2 CuSparseMatrixCSR{Float32,Int32} with 4 stored entries:
 [1, 1] = -1.1905
 [2, 1] = 0.489313
 [1, 2] = -1.00031
 [2, 2] = -0.290454


20 of 23

Array programming

julia> a = CuArray([1 2 3])
julia> b = CuArray([4 5 6])

julia> map(a) do x
           x + 1
       end
1×3 CuArray{Int64,2}:
 2  3  4

julia> a .+ 2b
1×3 CuArray{Int64,2}:
 9  12  15

julia> reduce(+, a)
6

julia> accumulate(+, b; dims=2)
1×3 CuArray{Int64,2}:
 4  9  15

julia> findfirst(isequal(2), a)
CartesianIndex(1, 2)

A powerful array language obviates the need for custom kernels and makes it possible to write generic code.
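A short sketch of the generic-code point: the same function, written once against the array abstraction, runs on Array and CuArray alike (the rmse helper is illustrative):

using CUDA

# Generic: only broadcasting and reductions, no GPU-specific code.
rmse(x, y) = sqrt(sum(abs2, x .- y) / length(x))

x, y   = rand(Float32, 10_000), rand(Float32, 10_000)
dx, dy = CuArray(x), CuArray(y)

rmse(x, y)     # runs on the CPU
rmse(dx, dy)   # the same code runs on the GPU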


21 of 23

Vector Addition Example


vector_size = 1024
a = rand(1:4, vector_size)
b = rand(1:4, vector_size)
c = zeros(Int, vector_size)

function vadd(c, a, b)
    Threads.@threads for i in eachindex(c)
        c[i] = a[i] + b[i]
    end
    return
end

CPU code

da = CuArray(a)
db = CuArray(b)
dc = CUDA.zeros(Int, size(a))

function vadd(c, a, b)
    i = threadIdx().x          # local index within the block
    c[i] = a[i] + b[i]
    return
end

@cuda threads=length(a) vadd(dc, da, db)

GPU (CUDA) kernel

Notice: the kernel uses only the local thread index within a block.

Any ideas what the limitation is here?

22 of 23

Enhanced GPU Vector Addition


da = CuArray(a)
db = CuArray(b)
dc = CUDA.zeros(Int, size(a))

function vadd(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x   # global index
    c[i] = a[i] + b[i]
    return
end

@cuda threads=1024 blocks=cld(length(da), 1024) vadd(dc, da, db)

  • With 256 threads per block, the third thread of the third block gets threadIdx().x + (blockIdx().x - 1) * blockDim().x = 3 + (3 - 1) * 256 = 515

This computes the global index of the thread within the grid.

[Figure: a one-dimensional grid with gridDim().x = 2048 blocks; threadIdx().x runs from 1 to 256 within each block, and blockIdx().x runs from 1 to 2048 across the grid]
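One caveat worth adding (not on the original slide): when the array length is not a multiple of the block size, the last block contains threads past the end of the arrays, so the kernel is usually guarded with a bounds check:

function vadd(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)    # guard the extra threads of the last, partial block
        c[i] = a[i] + b[i]
    end
    return
end

@cuda threads=1024 blocks=cld(length(da), 1024) vadd(dc, da, db)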
