1 of 23

GPU Acceleration

Rabab Alomairy & Evelyne Ringoot

2 of 23

GPU vs. CPU

[Figure: CPU design is latency-oriented; GPU design is bandwidth-oriented]


3 of 23

Heterogeneous architecture

[Figure: a heterogeneous node pairs a latency-oriented CPU with a bandwidth-oriented GPU]


4 of 23

Overview of GPU Architecture: NVIDIA A100 (Ampere)


  • Transistor Count: 54.2 billion

  • Die Area: 826 mm²

  • Architecture: Ampere

  • Memory: Up to 80 GB HBM2e

  • Peak FP64 Performance: ~19.5 TFLOP/s (A100 80GB SXM)

  • Interconnect: NVLink (up to 600 GB/s), PCIe Gen4

  • 128 SMs on the full GA100 die (108 enabled on the A100 product)

5 of 23

Overview of GPU Architecture: NVIDIA H100 (Hopper)


  • Transistor Count: 80 billion

  • Die Area: 814 mm²

  • Architecture: Hopper

  • Memory: Up to 96 GB HBM2e

  • Peak FP64 Performance: ~34 TFLOP/s (H100 96GB SXM)

  • Interconnect: NVLink (up to 900 GB/s), PCIe Gen5

  • 132 SMs (H100 SXM; the full GH100 die has 144)
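As a sanity check on the peak FP64 figure, a back-of-the-envelope estimate reproduces it; the per-SM FP64 unit count (64) and boost clock (~1.98 GHz) used below are assumptions not stated on this slide:

# Rough peak-FP64 estimate for an H100 SXM (assumed: 64 FP64 units per SM,
# ~1.98 GHz boost clock, 2 FLOPs per fused multiply-add).
sms           = 132
fp64_units    = 64
clock_hz      = 1.98e9
flops_per_fma = 2
peak = sms * fp64_units * clock_hz * flops_per_fma
println(peak / 1e12, " TFLOP/s")   # ≈ 33.4, i.e. the ~34 TFLOP/s quoted above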

6 of 23

Streaming multiprocessor (SM)


7 of 23

Streaming multiprocessor (SM)

  • Each SM contains four processing blocks that share an L1 cache for data caching.
  • Each processing block has:
    • a warp scheduler
    • 16 INT32 CUDA cores
    • 16 FP32 CUDA cores
    • 8 FP64 CUDA cores
    • 8 load/store units
    • a Tensor Core for matrix multiplication
    • a 16K 32-bit register file
  • The maximum number of thread blocks per SM is 32.


8 of 23

How to program a GPU

  • Single-Instruction, Multiple-Threads (SIMT).
  • SIMT allows GPUs to execute the same instruction across multiple threads.
  • The programming perspective is that of a single thread.
  • Threads are grouped into blocks.
  • Blocks are grouped into grids.
  • Threads within a block can access shared memory and synchronize.
  • GPUs are designed to maximize throughput.
  • Transferring data from the CPU to the GPU can be a bottleneck (see the sketch below).
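As a rough illustration of that transfer cost, a minimal sketch (array size chosen arbitrarily) that times a host-to-device copy with CUDA.jl:

using CUDA

x  = rand(Float32, 2^20)             # host array
dx = CuArray{Float32}(undef, 2^20)   # device array

t = CUDA.@elapsed copyto!(dx, x)     # time the host-to-device copy
println(sizeof(x) / t / 1e9, " GB/s effective transfer rate")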


9 of 23

Mapping Software to Hardware

  • Thread: Smallest unit of execution, executes a kernel function.
  • Scalar Processor (core): Executes one thread. Multiple cores live inside an SM.
  • Thread Block: Group of threads executed on a single SM. Threads in a block can share memory and synchronize.
  • SM (Streaming Multiprocessor): Executes thread blocks. Contains cores, Tensor Cores, registers, shared memory, etc.
  • Grid: Collection of thread blocks. Can span across multiple SMs.
  • GPU Device: The full hardware accelerator; includes many SMs and memory hierarchies (L2 cache, global memory, etc.).
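A small sketch of how these hardware limits can be queried from Julia; the attribute names below follow CUDA.jl's driver-API wrappers and are given as an assumption, not an authoritative listing:

using CUDA

dev = device()
println("device:            ", name(dev))
println("SM count:          ", attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT))
println("max threads/block: ", attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))
println("warp size:         ", warpsize(dev))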


10 of 23

Memory Hierarchy

  • Each thread has private local memory.

  • Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block.

  • Thread blocks in a thread block cluster can perform read, write, and atomic operations on each other's shared memory.

  • All threads have access to the same global memory.
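A minimal sketch of per-block shared memory in a CUDA.jl kernel (assuming CuStaticSharedArray and sync_threads; the block-local reversal is purely illustrative):

using CUDA

# Each block of 256 threads reverses its own chunk via shared memory.
function reverse_block!(out, inp)
    tid    = threadIdx().x
    offset = (blockIdx().x - 1) * blockDim().x
    tile   = CuStaticSharedArray(Float32, 256)   # visible to the whole block
    tile[tid] = inp[offset + tid]
    sync_threads()                               # wait until the tile is filled
    out[offset + tid] = tile[blockDim().x - tid + 1]
    return
end

inp = CUDA.rand(Float32, 1024)
out = similar(inp)
@cuda threads=256 blocks=4 reverse_block!(out, inp)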


11 of 23

SIMT Programming model


[Figure: a two-dimensional grid of blocks; each block is a two-dimensional arrangement of threads along x and y]

  • Global thread ID: computed from the thread's position within its block (threadIdx), the block's position within the grid (blockIdx), and the block dimensions (blockDim).
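In CUDA.jl (1-based indexing) the global x index is typically computed as in this one-line sketch, the same pattern used in the vector-addition example later:

i = threadIdx().x + (blockIdx().x - 1) * blockDim().x   # global thread index along x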

12 of 23

Existing GPUs and their Software


Hardware:  NVIDIA | AMD  | Intel  | Apple
Software:  CUDA   | ROCm | oneAPI | Metal

13 of 23

Typical Execution Dataflow


14 of 23

NVIDIA Ampere A100

  • The NVIDIA A100 GPU:
  • It has a 40 MB L2 cache, which helps reduce latency.
  • It supports up to 80 GB of HBM2e memory with a maximum memory bandwidth of 2,039 GB/s, essential for handling large data and reducing data-transfer overhead.
  • In total it contains 6912 FP32/INT32 CUDA cores and 3456 FP64 CUDA cores.
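A rough, unofficial sketch for estimating effective device-memory bandwidth with a timed device-to-device copy (array size arbitrary; results vary with GPU and clocks):

using CUDA

n   = 2^26                          # 64 Mi Float32 values (256 MB)
src = CUDA.rand(Float32, n)
dst = similar(src)

copyto!(dst, src)                   # warm-up
t = CUDA.@elapsed copyto!(dst, src) # timed copy: reads n and writes n values
println(2 * sizeof(src) / t / 1e9, " GB/s effective bandwidth")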


ASCI White supercomputer (Lawrence Livermore National Laboratory): #1 on the TOP500 from 2000 to 2001, at roughly 7.2 TFLOP/s (LINPACK)

NVIDIA H100: offers up to ~34 TFLOP/s of FP64 performance (SXM), over 3x the FP64 performance of the A100

GB200 Grace Blackwell Superchip: provides 90 TFLOP/s of FP64 performance

15 of 23

NVIDIA CUDA: CUDA.jl

AMD ROCm: AMDGPU.jl

Intel oneAPI: oneAPI.jl

Performance portability layer: KernelAbstractions.jl
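A minimal portability sketch with KernelAbstractions.jl (assuming its @kernel/@index API; the same kernel runs on another vendor's backend by swapping the array type):

using KernelAbstractions, CUDA

@kernel function vadd_kernel!(c, a, b)
    i = @index(Global)
    c[i] = a[i] + b[i]
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

backend = get_backend(c)                         # CUDABackend() for CuArrays
vadd_kernel!(backend, 256)(c, a, b; ndrange=length(c))
KernelAbstractions.synchronize(backend)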

16 of 23

Installation

pkg> add CUDA

julia> using CUDA

julia> CUDA.version()
Downloading artifact: CUDA10.2
Downloading artifact: CUDNN+CUDA10.2
Downloading artifact: CUTENSOR+CUDA10.2
v"10.2.89"

julia> CUDA.functional()
true

  • Automatic installation of CUDA
    • lazily, on first use
    • unless the environment contains JULIA_CUDA_USE_BINARYBUILDER=false
  • No need for Requires.jl
  • Container-friendly
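A small follow-up sketch for checking what CUDA.jl picked up after installation (output is machine-dependent):

using CUDA

CUDA.versioninfo()                  # toolkit, driver and package versions
if CUDA.functional()
    for dev in CUDA.devices()
        println(dev, ": ", name(dev))
    end
end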


17 of 23

Array programming

julia> a = CuArray([1 2 3])
1×3 CuArray{Int64,2}:
 1  2  3

julia> b = Array(a)
1×3 Array{Int64,2}:
 1  2  3

julia> CuArray{Float32,2}(undef, 2, 2)
2×2 CuArray{Float32,2}:
 0.0  0.0
 0.0  0.0

julia> similar(a)
1×3 CuArray{Int64,2}:
 0  0  0

Goal: API compatibility with Base.Array


18 of 23

Array programming

julia> CUDA.ones(2)
2-element CuArray{Float32,1}:
 1.0
 1.0

julia> CUDA.zeros(Float32, 2)
2-element CuArray{Float32,1}:
 0.0
 0.0

julia> CUDA.fill(42, (3, 4))
3×4 CuArray{Int64,2}:
 42  42  42  42
 42  42  42  42
 42  42  42  42

julia> CUDA.rand(2, 2)
2×2 CuArray{Float32,2}:
 0.73055   0.843176
 0.939997  0.61159

Goal: API compatibility with Base.Array


19 of 23

Array programming

julia> a = CuArray{Float32}(undef, (2,2));

CURAND:
julia> rand!(a)
2×2 CuArray{Float32,2}:
 0.73055   0.843176
 0.939997  0.61159

CUBLAS:
julia> a * a
2×2 CuArray{Float32,2}:
 1.32629  1.13166
 1.26161  1.16663

CUSOLVER:
julia> LinearAlgebra.qr!(a)
CuQR{Float32,CuArray{Float32,2}} with factors Q and R:
Float32[-0.613648 -0.78958; -0.78958 0.613648]
Float32[-1.1905 -1.00031; 0.0 -0.290454]

CUFFT:
julia> CUFFT.plan_fft(a) * a
2×2 CuArray{Complex{Float32},2}:
 -1.99196+0.0im   0.589576+0.0im
 -2.38968+0.0im  -0.969958+0.0im

CUDNN:
julia> softmax(real(ans))
2×2 CuArray{Float32,2}:
 0.15712  0.32963
 0.84288  0.67037

CUSPARSE:
julia> sparse(a)
2×2 CuSparseMatrixCSR{Float32,Int32} with 4 stored entries:
 [1, 1] = -1.1905
 [2, 1] = 0.489313
 [1, 2] = -1.00031
 [2, 2] = -0.290454


20 of 23

Array programming

julia> a = CuArray([1 2 3])
julia> b = CuArray([4 5 6])

julia> map(a) do x
           x + 1
       end
1×3 CuArray{Int64,2}:
 2  3  4

julia> a .+ 2b
1×3 CuArray{Int64,2}:
 9  12  15

julia> reduce(+, a)
6

julia> accumulate(+, b; dims=2)
1×3 CuArray{Int64,2}:
 4  9  15

julia> findfirst(isequal(2), a)
CartesianIndex(1, 2)

A powerful array language obviates the need for custom kernels and makes it possible to write generic code.
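A short sketch of the generic-code point: the same function, written once against the array abstraction, runs on Array and CuArray alike (the rmse helper is illustrative):

using CUDA

# Generic: only broadcasting and reductions, no GPU-specific code.
rmse(x, y) = sqrt(sum(abs2, x .- y) / length(x))

x, y   = rand(Float32, 10_000), rand(Float32, 10_000)
dx, dy = CuArray(x), CuArray(y)

rmse(x, y)     # runs on the CPU
rmse(dx, dy)   # the same code runs on the GPU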


21 of 23

Vector Addition Example


vector_size = 1024
a = rand(1:4, vector_size)
b = rand(1:4, vector_size)
c = zeros(Int, vector_size)

function vadd(c, a, b)
    Threads.@threads for i in eachindex(c)
        c[i] = a[i] + b[i]
    end
    return
end

CPU code

da = CuArray(a)
db = CuArray(b)
dc = CUDA.zeros(Int, size(a))

function vadd(c, a, b)
    i = threadIdx().x          # local index within the block
    c[i] = a[i] + b[i]
    return
end

@cuda threads=length(a) vadd(dc, da, db)

GPU (CUDA) kernel

Notice: the kernel uses only the local thread index within a block.

Any ideas what the limitation is here?

22 of 23

Enhanced GPU Vector Addition


da = CuArray(a)
db = CuArray(b)
dc = CUDA.zeros(Int, size(a))

function vadd(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x   # global index
    c[i] = a[i] + b[i]
    return
end

@cuda threads=1024 blocks=cld(length(da), 1024) vadd(dc, da, db)

  • With 256 threads per block, the third thread of the third block gets threadIdx().x + (blockIdx().x - 1) * blockDim().x = 3 + (3 - 1) * 256 = 515

This computes the global index of the thread within the grid.

[Figure: a one-dimensional grid with gridDim().x = 2048 blocks; threadIdx().x runs from 1 to 256 within each block, and blockIdx().x runs from 1 to 2048 across the grid]
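One caveat worth adding (not on the original slide): when the array length is not a multiple of the block size, the last block contains threads past the end of the arrays, so the kernel is usually guarded with a bounds check:

function vadd(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)    # guard the extra threads of the last, partial block
        c[i] = a[i] + b[i]
    end
    return
end

@cuda threads=1024 blocks=cld(length(da), 1024) vadd(dc, da, db)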
