1 of 33

PMPP book ch. 1-3

CUDA-MODE

Lecture 2

2 of 33

Agenda for Lecture 2

  • 1: Introduction
  • 2: Heterogeneous data parallel computing
  • 3: Multidimensional grids and data

3 of 33

Ch 1: Introduction

  • motivation: GPU go brrr, more FLOPS please
  • Why? Simulation & world-models (games, weather, proteins, robotics)
  • Bigger models are smarter -> AGI (prevent wars, fix climate, cure cancer)
  • GPUs are the backbone of modern deep learning
  • classic software: sequential programs
  • the trend of rising CPU clock rates slowed around 2003: energy consumption & heat dissipation limits
  • multi-core CPU came up
  • developers had to learn multi-threading (deadlocks, races etc.)

4 of 33

The Power Wall

Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010 ). K. Rupp (2010-2017).

[Chart: microprocessor trend data 1970-2017, showing transistor count (thousands) and clock frequency (MHz) on a log scale from 10^0 to 10^8; frequency flattens out around 2003 while transistor counts keep rising.]

(increasing frequency further would make the chip too hot to cool feasibly)

5 of 33

The rise of CUDA

  • CUDA is all about parallel programs (modern software)
  • GPUs have (much) higher peak FLOPS than multi-core CPUs
  • main principle: divide work among threads
  • GPUs focus on execution throughput of massive number of threads
  • programs with few threads perform poorly on GPUs
  • CPU+GPU: sequential parts on CPU, numerically intensive parts on GPU
  • CUDA: Compute Unified Device Architecture
  • GPGPU: before CUDA, tricks were used to compute with graphics APIs (OpenGL or Direct3D)
  • GPU programming is now attractive for developers (thanks to the massive availability of GPUs)

6 of 33

Amdahl's Law

  • the achievable speedup from parallelization is capped by the fraction of the program that remains sequential
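  • a quick worked illustration of the bound (the 95% figure is made up for the example): with speedup(s, p) = 1 / ((1 - p) + p / s), a program that is p = 0.95 parallelizable tops out at 1 / (1 - 0.95) = 20x overall speedup, no matter how large the parallel speedup s gets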

7 of 33

Challenges

  • "if you do not care about performance, parallel programming is very easy"
  • designing parallel algorithms is in practice harder than designing sequential ones, e.g. parallelizing recurrent computations requires non-intuitive thinking (like prefix sum)
  • speed is often limited by memory latency/throughput (memory bound)
  • perf of parallel programs can vary dramatically based on input data characteristics
  • not all apps are "embarrassingly parallel" - synchronization imposes overhead (waits)

8 of 33

Main Goals of the Book

1. Parallel programming & computational thinking

2. Correct & reliable: debugging function & performance

3. Scalability: regularize and localize memory access

  • PMPP aims to build up the foundation for parallel programming in general
  • GPUs as learning vehicle - techniques apply to other accelerators
  • concepts are introduced hands-on as concrete CUDA examples

9 of 33

Ch 2: Heterogeneous data parallel computing

  • heterogeneous: CPU + GPU
  • data parallelism: break work down into computations that can be executed independently
  • Two examples: vector addition & kernel to convert an RGB image to grayscale
  • Independence: each RGB pixel can be converted individually
  • L = r*0.21 + g*0.72 + b*0.07 (L=luminance)
  • simple weighted sum per pixel (a kernel sketch follows below)
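  • a minimal sketch of such a kernel, assuming an interleaved RGB unsigned-char input and a single-plane output (names and layout are illustrative, not the book's exact listing):

__global__ void rgbToGrayscaleKernel(unsigned char* out, const unsigned char* in, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {            // guard against out-of-range threads
        int grayOffset = row * width + col;       // one output value per pixel
        int rgbOffset = grayOffset * 3;           // 3 interleaved channels per pixel
        unsigned char r = in[rgbOffset];
        unsigned char g = in[rgbOffset + 1];
        unsigned char b = in[rgbOffset + 2];
        out[grayOffset] = (unsigned char)(0.21f * r + 0.72f * g + 0.07f * b);
    }
}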

10 of 33

RGB->Grayscale, data independence

11 of 33

CUDA C

  • extends ANSI C with minimal new syntax
  • Terminology: CPU = host, GPU = device
  • CUDA C source can be mixture of host & device code
  • device code functions: kernels
  • grid of threads: many threads are launched to execute a kernel
  • CPU & GPU code runs concurrently (overlapped)
  • on GPU: don't be afraid of launching many threads
  • e.g. one thread per (output) tensor element is fine

12 of 33

Example: Vector Addition

  • vector addition example:
    • main concept loop -> threads
    • Easily parallelizable: all additions can be computed independently

  • Naïve GPU vector addition:
    1. Allocate device memory for vectors
    2. Transfer inputs host -> device
    3. Launch kernel and perform additions
    4. Copy device -> host back
    5. Free device memory

  • normally we keep data on the GPU as long as possible so we can asynchronously schedule many kernel launches (a sketch of the naïve flow follows below)
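  • a hedged sketch of that naïve host-side flow (error checking omitted; names mirror the vecAddKernel shown on a later slide):

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    // 1. allocate device memory for the three vectors
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);
    // 2. transfer inputs host -> device
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
    // 3. launch the kernel: one thread per element, 256 threads per block
    vecAddKernel<<<(n + 255) / 256, 256>>>(A_d, B_d, C_d, n);
    // 4. copy the result device -> host
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    // 5. free device memory
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}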

13 of 33

  • One thread per vector element

[Diagram: input vectors x and y added element-wise into output vector z, with one thread handling each element.]

14 of 33

CUDA Essentials: Memory allocation

  • NVIDIA devices come with their own DRAM, called (device) global memory (in Ch 5 we learn about other memory types)

  • cudaMalloc & cudaFree:

float *A_d;
size_t size = n * sizeof(float);  // size in bytes
cudaMalloc((void**)&A_d, size);   // pointer to pointer!
...
cudaFree(A_d);

15 of 33

cudaMemcpy: Host <-> Device Transfer

  • Copy data from CPU memory to GPU memory and vice versa

// copy input vectors to device (host -> device)
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
...
// transfer result back to CPU memory (device -> host)
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);

16 of 33

CUDA Error handling

  • CUDA functions return a `cudaError_t` status code; anything other than `cudaSuccess` means we have a problem
  • always check returned error status ☺
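  • a minimal checking helper along these lines is common practice (this is our own sketch, not code from the book):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// abort with a readable message if a CUDA call failed
inline void cudaCheck(cudaError_t err, const char* file, int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error '%s' at %s:%d\n", cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}
#define CUDA_CHECK(call) cudaCheck((call), __FILE__, __LINE__)

// usage: CUDA_CHECK(cudaMalloc((void**)&A_d, size));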

17 of 33

Kernel functions fn<<<>>>

  • Launching kernel = grid of threads is launched
  • All threads execute the same code: Single program multiple-data (SPMD)
  • Threads are hierarchically organized: a grid consists of blocks, and each block consists of threads
  • up to 1024 threads can be in a thread block

18 of 33

Kernel Coordinates

  • built-in variables available inside the kernel: blockIdx, threadIdx
  • these "coordinates" allow threads (all executing the same code) to identify what to do (e.g. which portion of the data to process)
  • each thread can be uniquely identified by threadIdx & blockIdx
  • telephone system analogy: think of blockIdx as the area code and threadIdx as the local phone number
  • built-in blockDim tells us the number of threads in a block
  • for vector addition we can calculate the array index of the thread: `int i = blockIdx.x * blockDim.x + threadIdx.x;`

19 of 33

Threads execute the same kernel code

20 of 33

__global__ & __host__

  • declare a kernel function with __global__
  • calling a __global__ function -> launches a new grid of CUDA threads
  • functions declared with __device__ can be called from within a CUDA thread
  • if both __host__ & __device__ are used in a function declaration CPU & GPU versions will be compiled
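  • a small side-by-side sketch of the qualifiers (the function names are made up for illustration):

// __device__: callable only from device code
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled for both CPU and GPU
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__: a kernel; calling it from the host launches a grid of threads
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = twice(square(data[i]));
}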

21 of 33

Vector Addition Example

  • general strategy: replace loop by grid of threads!
  • data sizes might not be perfectly divisible by block sizes: always check bounds
  • prevent threads of the boundary block from reading/writing outside the allocated memory

// compute vector sum C = A + B
// each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) { // check bounds
        C[i] = A[i] + B[i];
    }
}

22 of 33

Calling Kernels

  • kernel configuration is specified between `<<<` and `>>>`
  • number of blocks, number of threads in each block

  • we will learn about additional launch parameters (shared-mem size, cudaStream) later

dim3 numThreads(256);
dim3 numBlocks((n + numThreads.x - 1) / numThreads.x); // ceil-divide so all n elements are covered
vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);

23 of 33

Compiler

  • nvcc (the NVIDIA CUDA compiler) is used to compile kernels into PTX
  • Parallel Thread Execution (PTX) is a low-level VM & instruction set
  • graphics driver translates PTX into executable binary code (SASS)

24 of 33

Ch 3: Multidimensional grids and data

  • CUDA grid: 2 level hierarchy: blocks, threads
  • Idea: map threads to multi-dimensional data
  • all threads in a grid execute the same kernel
  • threads in same block can access the same shared mem
  • max block size: 1024 threads
  • built-in 3D coordinates of a thread: blockIdx, threadIdx - identify which portion of the data to process
  • shape of grid & blocks:
    • gridDim: number of blocks in the grid
    • blockDim: number of threads in a block

25 of 33

26 of 33

Grid continued

  • the grid can be different for each kernel launch, e.g. depending on data shapes
  • typical grids contain thousands to millions of threads
  • simple strategy: one thread per output element (e.g. one thread per pixel, one thread per tensor element)
  • threads can be scheduled in any order
  • can use fewer than 3 dims (set others to 1)
  • e.g. 1D for sequences, 2D for images etc.

dim3 grid(32, 1, 1);
dim3 block(128, 1, 1);
kernelFunction<<<grid, block>>>(..);
// number of threads: 32 * 128 = 4096
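For 2D data such as an image, a launch might look like this (a sketch; the 16x16 block shape is just a common choice, and rgbToGrayscaleKernel is the illustrative kernel sketched earlier):

dim3 block(16, 16, 1);                        // 256 threads per block
dim3 grid((width + block.x - 1) / block.x,    // ceil-divide so every pixel gets a thread
          (height + block.y - 1) / block.y,
          1);
rgbToGrayscaleKernel<<<grid, block>>>(gray_d, rgb_d, width, height);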

27 of 33

Built-in Variables

  • Built-in variables inside kernels:

  • (blockDim & gridDim have the same values in all threads)

blockIdx  // dim3 block coordinate
threadIdx // dim3 thread coordinate
blockDim  // number of threads in a block
gridDim   // number of blocks in a grid

28 of 33

nd-Arrays in Memory

  • under the hood, the memory of multi-dim arrays is flat 1D
  • a 2D array can be linearized in different ways:

  • torch tensors & numpy ndarrays use strides to specify how elements are laid out in memory.

- row-major: rows are stored one after another; the 3x3 matrix

A B C
D E F
G H I

is laid out in memory as A B C D E F G H I

- column-major: columns are stored one after another; the same matrix is laid out as A D G B E H C F I

[Diagram: a 4x4 array of (row, col) indices from (0,0) to (3,3), shown once as the logical 2D view of the data and once as its actual flat row-major layout in memory.]
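In code, the row-major rule is just an offset computation; a generic sketch (our helper names, not a listing from the book):

// offset of element (row, col) in a row-major array with 'width' columns
__host__ __device__ int rowMajorOffset(int row, int col, int width) {
    return row * width + col;
}

// the same idea with explicit strides, as torch tensors & numpy ndarrays describe layouts;
// a contiguous row-major (rows x width) array has strides {width, 1}
__host__ __device__ int stridedOffset(int row, int col, int strideRow, int strideCol) {
    return row * strideRow + col * strideCol;
}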

29 of 33

Image blur example (3.3, p. 60)

  • mean filter example blurKernel
  • each thread writes one output element, reads multiple values
  • single plane in book, can be extended easily to multi-channel
  • shows row-major pixel memory access (in & out pointers)
  • keeps track of how many pixel values are summed (for averaging)
  • handles boundary conditions (lines 5 & 25 of the book's listing)
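  • a hedged sketch along the lines of the book's blurKernel (BLUR_SIZE is the patch radius; details may differ from the printed listing):

#define BLUR_SIZE 1  // 3x3 averaging window

__global__ void blurKernel(unsigned char* in, unsigned char* out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {                       // boundary check for the output pixel
        int pixVal = 0;
        int pixels = 0;                             // how many pixel values were summed
        for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                // skip neighbours that fall outside the image
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];   // row-major pixel access
                    ++pixels;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);  // write the average
    }
}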

30 of 33

Handling Boundary Conditions

31 of 33

32 of 33

Matrix Multiplication

  • staple of science & engineering (and deep learning)
  • each output element is the inner product of a row of the first matrix and a column of the second
  • Strategy: 1 thread per output matrix element
  • Example: Multiplying square matrices (rows == cols)
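  • a minimal sketch of that strategy for square width x width matrices (kernel and parameter names are illustrative):

__global__ void matMulKernel(const float* M, const float* N, float* P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {        // guard for boundary blocks
        float sum = 0.0f;
        for (int k = 0; k < width; ++k) {    // inner product of a row of M and a column of N
            sum += M[row * width + k] * N[k * width + col];
        }
        P[row * width + col] = sum;
    }
}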

33 of 33