1 of 45

GPU Programming Fundamentals:

A Hardware-First Perspective

William Brandon

August 2025

2 of 45

Who is this talk for?

  • Anyone who might want to write a kernel at some point!
    • (In any language)

  • Try to build foundations from the bottom up
  • If you’re more experienced, I hope it still helps you see things a bit differently

etc…

3 of 45

Who am I?

Deep learning compilers

LLM efficiency research +

creating 6.S894 (“Accelerated Computing”)

Making Claude faster

4 of 45

Who am I?

LLM efficiency research +

creating 6.S894 (“Accelerated Computing”)

5 of 45

Who am I?

LLM efficiency research +

creating 6.S894 (“Accelerated Computing”)

Class co-created with…

Jonathan Ragan-Kelley

Nikita Lazarev

Materials are online!

Recommended reading:

Lab 1: http://bit.ly/4lMkoBz

Lab 2: http://bit.ly/4mPB2Bu

6 of 45

What is a CUDA program?

__global__ void my_kernel(...) {

...

}

my_kernel<<<128, 1024>>>(...);

Number of “blocks”

Number of “threads per block”
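For concreteness, a minimal sketch of what such a kernel might look like in full. The kernel name, arguments, and device pointers (d_a, d_b, d_out, n) are illustrative placeholders, not from the talk:

#include <cuda_runtime.h>

// Each thread handles one element of out = a + b.
__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard: grid may overshoot n
        out[i] = a[i] + b[i];
    }
}

// Launch: 128 blocks * 1024 threads/block = 131,072 threads total.
// d_a, d_b, d_out are assumed device pointers allocated with cudaMalloc.
add_kernel<<<128, 1024>>>(d_a, d_b, d_out, n);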

7 of 45

What is a CUDA program?

Your program, according to CUDA:

1024 threads

1024 threads

1024 threads

128 blocks

8 of 45

A Puzzle

9 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

10 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

How do these compare?

11 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

How do these compare?

12 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

Often, ~same speed!

(why? 🤔)

13 of 45

Another Puzzle

14 of 45

Another Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

15 of 45

Another Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

How do these compare?

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

16 of 45

Another Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

Sometimes, ~2x slower!

(why? 🤔)

17 of 45

What is a CUDA program?

Your program, according to CUDA:

1024 threads

1024 threads

1024 threads

128 blocks

Useful, but

incomplete

18 of 45

What is a CUDA program?

19 of 45

What is a CUDA program?

What is a GPU?

20 of 45

What is a CUDA program?

What is a GPU?

21 of 45

What is a CUDA program?

What is a GPU?

22 of 45

What is a CUDA program?

What is a GPU?

23 of 45

What is a CUDA program?

What is a GPU? A GPU is a bunch of tiny squares.

24 of 45

What is a CUDA program?

What is a GPU? A GPU is a bunch of tiny squares.

“streaming multiprocessors”

25 of 45

A GPU is a bunch of streaming multiprocessors.

How many? Order of magnitude: ~100

26 of 45

A GPU is a bunch of streaming multiprocessors.

How many? Order of magnitude: ~100

What is inside a streaming multiprocessor?

1 SM = 4 thingies

27 of 45

A GPU is a bunch of streaming multiprocessors.

How many? Order of magnitude: ~100

What’s inside a streaming multiprocessor?

1 SM = 4 thingies

(“quadrants” or “partitions” or “warp schedulers”)

28 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

29 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

What is a warp scheduler? It’s like a CPU core.

30 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

What is a warp scheduler? It’s like a CPU core*.

(*a CPU core specialized for running SIMD instructions)

31 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

What is a warp scheduler? It’s like a CPU core*.

(*a CPU core specialized for running SIMD instructions)

In what sense?

  • Issues a stream of instructions in sequence

  • Stores execution state for the running program (program counter, registers)

  • Has functional units which can do math, talk to main memory, etc.

32 of 45

What is a warp scheduler? It’s like a CPU core.

33 of 45

What is a warp scheduler? It’s like a CPU core.

Sanity check: does the math work out?

1) You may often run kernels with

hundreds of thousands of threads

Example

kernel<<<128, 1024>>>(...);

128 * 1024 = 131,072

2) But a GPU only has hundreds of warp schedulers

Example

NVIDIA H100 GPU:

132 SMs * 4 = 528 warp schedulers

What gives?

34 of 45

What is a warp scheduler? It’s like a CPU core.

Sanity check: does the math work out?

1) You may often run kernels with

hundreds of thousands of threads

2) But a GPU only has hundreds of warp schedulers

What gives?

Three answers:

  1. Warp scheduler instructions are SIMD (32 threads at a time)
  2. Warp scheduler can time multiplex between different threads
  3. Too many blocks → run them serially

35 of 45

Warp scheduler instructions are (implicitly) SIMD

CPU – explicit SIMD

GPU – implicit SIMD

fp32x16 x = {...};

fp32x16 y = {...};

fp32x16 z =

vector_add_fp32x16(x, y);

Single instruction issued

→ 16 scalar additions in parallel

float x = ...;

float y = ...;

float z = x + y;

Single instruction issued

→ 32 scalar additions in parallel

(serving 32 consecutive threads)
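To make the contrast concrete, a hedged sketch using real instructions on each side: AVX-512 intrinsics for the CPU column and an ordinary CUDA kernel for the GPU column. These examples are illustrative, not from the slides:

#include <immintrin.h>

// CPU - explicit SIMD: the programmer names the 16-wide vector type.
void cpu_explicit_simd(const float* a, const float* b, float* out) {
    __m512 x = _mm512_loadu_ps(a);
    __m512 y = _mm512_loadu_ps(b);
    __m512 z = _mm512_add_ps(x, y);      // one instruction -> 16 fp32 adds
    _mm512_storeu_ps(out, z);
}

// GPU - implicit SIMD: each thread writes scalar code; the hardware runs
// the add for all 32 threads of a warp with a single issued instruction.
__global__ void gpu_implicit_simd(const float* a, const float* b, float* out) {
    int i = threadIdx.x;
    out[i] = a[i] + b[i];                // one FADD issued per warp
}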

36 of 45

Warp scheduler can time multiplex between different threads

Time (cycles)

Instruction issue

(threads 0-31)

Instruction issue

(threads 32-63)

Instruction issue

(threads 0-31)

Instruction issue

(threads 32-63)

Instruction latency

Instruction latency

(CPU cores can do this too!)

37 of 45

Can also overlap instructions without multiple threads

(if the pattern is right)

Time (cycles)

Instruction issue

(threads 0-31)

Instruction issue

(threads 0-31)

Instruction issue

(threads 0-31)

Instruction issue

(threads 0-31)

Instruction latency

Instruction latency

(CPU cores can do this too!)
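As an illustrative sketch (mine, not from the slides) of what “the right pattern” means: independent operations within one thread whose latencies can overlap, versus one that must wait for its inputs:

// Illustrative only: assumes the grid exactly covers the arrays.
__global__ void ilp_example(const float* x, const float* y, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = x[i] * 2.0f;   // these two multiplies are independent,
    float b = y[i] * 3.0f;   // so their latencies can overlap
    float c = a + b;         // depends on both -> must wait for them
    out[i] = c;
}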

38 of 45

How many threads do you need to keep an SM busy?

(at minimum)

4 warp schedulers * 32 = 128 threads

How many threads do you need to keep the GPU busy?

(e.g. H100)

132 SMs * 128 = 16,896 threads

(but using more threads often improves latency hiding!)
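One way to turn this arithmetic into a launch configuration at runtime (a sketch: the 32-threads-per-warp and 4-schedulers-per-SM figures are assumed constants, and my_kernel follows the slides’ placeholder notation):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, /*device=*/0);

// Bare minimum to give every warp scheduler one warp:
// 4 schedulers/SM * 32 threads/warp, times the number of SMs.
int min_threads = prop.multiProcessorCount * 4 * 32;   // 16,896 on an H100

int threads_per_block = 1024;
int num_blocks = (min_threads + threads_per_block - 1) / threads_per_block;

// In practice you usually launch more than this to hide latency better.
my_kernel<<<num_blocks, threads_per_block>>>(...);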

39 of 45

The First Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

How do these compare?

40 of 45

The First Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

Each one might have enough threads to saturate 128 SMs! (Even 256 threads per block is double the 128-thread minimum needed to keep one SM’s four warp schedulers busy.)

41 of 45

Too many blocks → run them serially

396 blocks

Example:

132 SMs

Each SM runs

3 blocks
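The same idea as back-of-the-envelope arithmetic (a sketch; it assumes each SM runs one block at a time, which holds only when the kernel’s resource usage limits it to one resident block per SM):

// How many "waves" of blocks does the grid take?
int num_sms = 132;                                   // e.g. NVIDIA H100
int num_blocks = 396;
int waves = (num_blocks + num_sms - 1) / num_sms;    // = 3: each SM runs 3 blocks
// With 133 blocks instead, waves = 2 -- the grid takes a whole extra wave
// even though it is only ~4% more work.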

42 of 45

The Second Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

How do these compare?

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

43 of 45

The Second Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

133 blocks, 132 SMs → 1 SM has to run 2 blocks!

44 of 45

Bonus Chatter

  • What if different threads have different control flow?
    • To a first approximation: masking (see the sketch after this list)
  • Why do SMs exist at all? Why not only warp schedulers?
    • Shared scratchpad memory
  • What about clusters / tensor cores / TMA / weirder features?
    • Q&A!
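For the control-flow point above, a tiny illustrative kernel (mine, not from the talk) showing what divergence and masking look like in practice:

__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Within each 32-thread warp, half the threads take each branch.
    // The warp executes BOTH branches in sequence; on each one, the
    // threads that didn't take it are masked off, so the warp roughly
    // pays for the sum of the two paths.
    if (threadIdx.x % 32 < 16) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
}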

45 of 45

Q&A!

Materials are online!