GPU Programming Fundamentals:
A Hardware-First Perspective
William Brandon
August 2025
Who is this talk for?
etc…
Who am I?
Deep learning compilers
LLM efficiency research +
creating 6.S894 (“Accelerated Computing”)
Making Claude faster
Class co-created with…
Jonathan Ragan-Kelley
Nikita Lazarev
Materials are online!
What is a CUDA program?
__global__ void my_kernel(...) {
...
}
my_kernel<<<128, 1024>>>(...);
128 = number of “blocks”
1024 = number of “threads per block”
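For concreteness, a minimal sketch of what such a kernel might look like (the name add_kernel and its arguments are illustrative, not from the talk):

__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) c[i] = a[i] + b[i];                  // each thread handles one element
}

// Launched with 128 blocks of 1024 threads each = 131,072 threads total:
// add_kernel<<<128, 1024>>>(d_a, d_b, d_c, n);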
What is a CUDA program?
Your program, according to CUDA:
[Diagram: 128 blocks × 1024 threads each]
A Puzzle
my_kernel<<<128, 1024>>>(...);
[Diagram: 128 blocks × 1024 threads each]

my_kernel<<<128, 256>>>(...);
[Diagram: 128 blocks × 256 threads each]

How do these compare?
Often, ~same speed!
(why? 🤔)
Another Puzzle
my_kernel<<<128, 1024>>>(...);
[Diagram: 128 blocks × 1024 threads each]

my_kernel<<<133, 1024>>>(...);
[Diagram: 133 blocks × 1024 threads each]
(~4% more blocks)

How do these compare?
Sometimes, ~2x slower!
(why? 🤔)
What is a CUDA program?
Your program, according to CUDA:
[Diagram: 128 blocks × 1024 threads each]
Useful, but incomplete.
What is a GPU? A GPU is a bunch of tiny squares.
(the tiny squares are called “streaming multiprocessors”)
A GPU is a bunch of streaming multiprocessors.
How many? Order of magnitude: ~100
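If you want the exact count on your own GPU, the standard CUDA runtime API reports it; a small host-side sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // properties of device 0
    printf("SMs: %d\n", prop.multiProcessorCount);  // e.g. 132 on an H100
    return 0;
}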
What’s inside a streaming multiprocessor?
1 SM = 4 thingies
(“quadrants” or
“partitions” or
“warp schedulers”)
What’s inside a streaming multiprocessor?
Four warp schedulers
What is a warp scheduler? It’s like a CPU core*.
(*a CPU core specialized for running SIMD instructions)
In what sense?
What is a warp scheduler? It’s like a CPU core.
Sanity check: does the math work out?
1) You will often run kernels with hundreds of thousands of threads
Example: kernel<<<128, 1024>>>(...); → 128 * 1024 = 131,072 threads
2) But a GPU only has hundreds of warp schedulers
Example: NVIDIA H100 GPU: 132 SMs * 4 = 528 warp schedulers
What gives?
Three answers:
1. Warp scheduler instructions are (implicitly) SIMD
CPU – explicit SIMD:
fp32x16 x = {...};
fp32x16 y = {...};
fp32x16 z = vector_add_fp32x16(x, y);
Single instruction issued → 16 scalar additions in parallel

GPU – implicit SIMD:
float x = ...;
float y = ...;
float z = x + y;
Single instruction issued → 32 scalar additions in parallel
(serving 32 consecutive threads)
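A hedged sketch of the same addition written both ways (the CPU side uses standard x86 AVX-512 intrinsics and needs an AVX-512 capable compiler/CPU; the function names add16 and add32 are illustrative):

#include <immintrin.h>  // AVX-512 intrinsics

// CPU, explicit SIMD: the programmer spells out the 16-wide operation.
void add16(const float* a, const float* b, float* c) {
    __m512 x = _mm512_loadu_ps(a);             // 16 floats
    __m512 y = _mm512_loadu_ps(b);
    _mm512_storeu_ps(c, _mm512_add_ps(x, y));  // one instruction, 16 adds
}

// GPU, implicit SIMD: each thread writes scalar code, but 32 consecutive
// threads issue together as one instruction.
__global__ void add32(const float* a, const float* b, float* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];                        // one warp instruction, 32 adds
}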
2. A warp scheduler can time-multiplex between different threads
[Timeline: over successive cycles, the scheduler issues an instruction for threads 0-31, then for threads 32-63, then for threads 0-31 again, and so on; each instruction's latency is hidden behind the other group's issue slots]
(CPU cores can do this too!)
3. A warp scheduler can also overlap instructions without multiple threads (if the pattern is right)
[Timeline: the scheduler issues several independent instructions back to back for the same threads 0-31, so their latencies overlap]
(CPU cores can do this too!)
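A rough sketch of "the right pattern": independent accumulators inside one thread give the scheduler back-to-back instructions whose latencies can overlap. The kernel below is illustrative and assumes a single block with n a multiple of 4 * blockDim.x:

__global__ void sum_ilp(const float* a, float* out, int n) {
    // Four independent accumulators → up to four adds in flight per thread,
    // instead of one long serial dependency chain.
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 4 * threadIdx.x; i < n; i += 4 * blockDim.x) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    out[threadIdx.x] = s0 + s1 + s2 + s3;  // per-thread partial sum
}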
How many threads do you need to keep an SM busy?
(at minimum)
4 warp schedulers * 32 = 128 threads
How many threads do you need to keep the GPU busy?
(e.g. H100)
132 SMs * 128 = 16,896 threads
(but more threads often improves latency hiding!)
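The same arithmetic as a host-side sketch (warp size and SM count are queryable; the 4-schedulers-per-SM figure is an assumption baked into the code, since it is not in cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int schedulers_per_sm = 4;                                   // assumed, not queried
    int min_threads_per_sm = schedulers_per_sm * prop.warpSize;  // 4 * 32 = 128
    int min_threads_for_gpu = prop.multiProcessorCount * min_threads_per_sm;
    printf("%d threads per SM, %d for the whole GPU\n",
           min_threads_per_sm, min_threads_for_gpu);             // 128 and 16,896 on an H100
    return 0;
}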
The First Puzzle
my_kernel<<<128, 1024>>>(...);
[Diagram: 128 blocks × 1024 threads each]

my_kernel<<<128, 256>>>(...);
[Diagram: 128 blocks × 256 threads each]
Each one (might) have enough threads to saturate 128 SMs!
Too many blocks → run them serially
Example: 396 blocks on 132 SMs → each SM runs 3 blocks
The Second Puzzle
my_kernel<<<128, 1024>>>(...);
[Diagram: 128 blocks × 1024 threads each]

my_kernel<<<133, 1024>>>(...);
[Diagram: 133 blocks × 1024 threads each]
(~4% more blocks)
133 blocks, 132 SMs → 1 SM has to run 2 blocks back to back, and the kernel isn't done until that SM finishes!
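A back-of-the-envelope sketch of this "wave" effect, assuming every block takes about the same time and each SM runs one block at a time (both simplifications):

#include <cstdio>

// Number of "waves": rounds of blocks each SM must run before the kernel finishes.
int waves(int num_blocks, int num_sms) {
    return (num_blocks + num_sms - 1) / num_sms;  // ceiling division
}

int main() {
    printf("%d\n", waves(128, 132));  // 1 wave
    printf("%d\n", waves(133, 132));  // 2 waves → roughly 2x the runtime
    return 0;
}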
Bonus Chatter
Q&A!
Materials are online!