1 of 45

GPU Programming Fundamentals:

A Hardware-First Perspective

William Brandon

August 2025

2 of 45

Who is this talk for?

  • Anyone who might want to write a kernel at some point!
    • (In any language)

  • Try to build foundations from the bottom up
  • If you’re more experienced, I hope it still helps you see things a bit differently

etc…

3 of 45

Who am I?

Deep learning compilers

LLM efficiency research +

creating 6.S894 (“Accelerated Computing”)

Making Claude faster

4 of 45

Who am I?

LLM efficiency research +

creating 6.S894 (“Accelerated Computing”)

5 of 45

Who am I?

LLM efficiency research +

creating 6.S894 (“Accelerated Computing”)

Class co-created with…

Jonathan Ragan-Kelley

Nikita Lazarev

Materials are online!

Recommended reading:

Lab 1: http://bit.ly/4lMkoBz

Lab 2: http://bit.ly/4mPB2Bu

6 of 45

What is a CUDA program?

__global__ void my_kernel(...) {

...

}

my_kernel<<<128, 1024>>>(...);

Number of “blocks”

Number of “threads per block”
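For concreteness, a minimal sketch of what such a kernel might look like in full. The kernel name, arguments, and device pointers (d_a, d_b, d_out, n) are illustrative placeholders, not from the talk:

#include <cuda_runtime.h>

// Each thread handles one element of out = a + b.
__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard: grid may overshoot n
        out[i] = a[i] + b[i];
    }
}

// Launch: 128 blocks * 1024 threads/block = 131,072 threads total.
// d_a, d_b, d_out are assumed device pointers allocated with cudaMalloc.
add_kernel<<<128, 1024>>>(d_a, d_b, d_out, n);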

7 of 45

What is a CUDA program?

Your program, according to CUDA:

1024 threads

1024 threads

1024 threads

128 blocks

8 of 45

A Puzzle

9 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

10 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

How do these compare?

11 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

How do these compare?

12 of 45

A Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

Often, ~same speed!

(why? 🤔)

13 of 45

Another Puzzle

14 of 45

Another Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

15 of 45

Another Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

How do these compare?

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

16 of 45

Another Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

Sometimes, ~2x slower!

(why? 🤔)

17 of 45

What is a CUDA program?

Your program, according to CUDA:

1024 threads

1024 threads

1024 threads

128 blocks

Useful, but

incomplete

18 of 45

What is a CUDA program?

19 of 45

What is a CUDA program?

What is a GPU?

20 of 45

What is a CUDA program?

What is a GPU?

21 of 45

What is a CUDA program?

What is a GPU?

22 of 45

What is a CUDA program?

What is a GPU?

23 of 45

What is a CUDA program?

What is a GPU? A GPU is a bunch of tiny squares.

24 of 45

What is a CUDA program?

What is a GPU? A GPU is a bunch of tiny squares.

“streaming multiprocessors”

25 of 45

A GPU is a bunch of streaming multiprocessors.

How many? Order of magnitude: ~100

26 of 45

A GPU is a bunch of streaming multiprocessors.

How many? Order of magnitude: ~100

What is inside a streaming multiprocessor?

1 SM = 4 thingies

27 of 45

A GPU is a bunch of streaming multiprocessors.

How many? Order of magnitude: ~100

What’s inside a streaming multiprocessor?

1 SM = 4 thingies

(“quadrants” or “partitions” or “warp schedulers”)

28 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

29 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

What is a warp scheduler? It’s like a CPU core.

30 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

What is a warp scheduler? It’s like a CPU core*.

(*a CPU core specialized for running SIMD instructions)

31 of 45

What’s inside a streaming multiprocessor?

Four warp schedulers

What is a warp scheduler? It’s like a CPU core*.

(*a CPU core specialized for running SIMD instructions)

In what sense?

  • Issues a stream of instructions in sequence

  • Stores execution state for the running program (program counter, registers)

  • Has functional units which can do math, talk to main memory, etc.

32 of 45

What is a warp scheduler? It’s like a CPU core.

33 of 45

What is a warp scheduler? It’s like a CPU core.

Sanity check: does the math work out?

1) You may often run kernels with

hundreds of thousands of threads

Example

kernel<<<128, 1024>>>(...);

128 * 1024 = 131,072

2) But a GPU only has hundreds of warp schedulers

Example

NVIDIA H100 GPU:

132 SMs * 4 = 528 warp schedulers

What gives?

34 of 45

What is a warp scheduler? It’s like a CPU core.

Sanity check: does the math work out?

1) You may often run kernels with

hundreds of thousands of threads

2) But a GPU only has hundreds of warp schedulers

What gives?

Three answers:

  1. Warp scheduler instructions are SIMD (32 threads at a time)
  2. Warp scheduler can time multiplex between different threads
  3. Too many blocks → run them serially

35 of 45

Warp scheduler instructions are (implicitly) SIMD

CPU – explicit SIMD

GPU – implicit SIMD

fp32x16 x = {...};

fp32x16 y = {...};

fp32x16 z =

vector_add_fp32x16(x, y);

Single instruction issued

→ 16 scalar additions in parallel

float x = ...;

float y = ...;

float z = x + y;

Single instruction issued

→ 32 scalar additions in parallel

(serving 32 consecutive threads)
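To make the contrast concrete, a hedged sketch using real instructions on each side: AVX-512 intrinsics for the CPU column and an ordinary CUDA kernel for the GPU column. These examples are illustrative, not from the slides:

#include <immintrin.h>

// CPU - explicit SIMD: the programmer names the 16-wide vector type.
void cpu_explicit_simd(const float* a, const float* b, float* out) {
    __m512 x = _mm512_loadu_ps(a);
    __m512 y = _mm512_loadu_ps(b);
    __m512 z = _mm512_add_ps(x, y);      // one instruction -> 16 fp32 adds
    _mm512_storeu_ps(out, z);
}

// GPU - implicit SIMD: each thread writes scalar code; the hardware runs
// the add for all 32 threads of a warp with a single issued instruction.
__global__ void gpu_implicit_simd(const float* a, const float* b, float* out) {
    int i = threadIdx.x;
    out[i] = a[i] + b[i];                // one FADD issued per warp
}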

36 of 45

Warp scheduler can time multiplex between different threads

Time (cycles)

Instruction issue

(threads 0-31)

Instruction issue

(threads 32-63)

Instruction issue

(threads 0-31)

Instruction issue

(threads 32-63)

Instruction latency

Instruction latency

(CPU cores can do this too!)

37 of 45

Can also overlap instructions without multiple threads

(if the pattern is right)

Time (cycles)

Instruction issue

(threads 0-31)

Instruction issue

(threads 0-31)

Instruction issue

(threads 0-31)

Instruction issue

(threads 0-31)

Instruction latency

Instruction latency

(CPU cores can do this too!)
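As an illustrative sketch (mine, not from the slides) of what “the right pattern” means: independent operations within one thread whose latencies can overlap, versus one that must wait for its inputs:

// Illustrative only: assumes the grid exactly covers the arrays.
__global__ void ilp_example(const float* x, const float* y, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = x[i] * 2.0f;   // these two multiplies are independent,
    float b = y[i] * 3.0f;   // so their latencies can overlap
    float c = a + b;         // depends on both -> must wait for them
    out[i] = c;
}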

38 of 45

How many threads do you need to keep an SM busy?

(at minimum)

4 warp schedulers * 32 = 128 threads

How many threads do you need to keep the GPU busy?

(e.g. H100)

132 SMs * 128 = 16,896 threads

(but using more threads often improves latency hiding!)
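One way to turn this arithmetic into a launch configuration at runtime (a sketch: the 32-threads-per-warp and 4-schedulers-per-SM figures are assumed constants, and my_kernel follows the slides’ placeholder notation):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, /*device=*/0);

// Bare minimum to give every warp scheduler one warp:
// 4 schedulers/SM * 32 threads/warp, times the number of SMs.
int min_threads = prop.multiProcessorCount * 4 * 32;   // 16,896 on an H100

int threads_per_block = 1024;
int num_blocks = (min_threads + threads_per_block - 1) / threads_per_block;

// In practice you usually launch more than this to hide latency better.
my_kernel<<<num_blocks, threads_per_block>>>(...);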

39 of 45

The First Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

How do these compare?

40 of 45

The First Puzzle

1024 threads

1024 threads

128 blocks

256 threads

128 blocks

256 threads

my_kernel<<<128, 1024>>>(...);

my_kernel<<<128, 256>>>(...);

Each one might have enough threads to saturate 128 SMs! (Even 256 threads per block is double the 128-thread minimum needed to keep one SM’s four warp schedulers busy.)

41 of 45

Too many blocks → run them serially

396 blocks

Example:

132 SMs

Each SM runs

3 blocks
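The same idea as back-of-the-envelope arithmetic (a sketch; it assumes each SM runs one block at a time, which holds only when the kernel’s resource usage limits it to one resident block per SM):

// How many "waves" of blocks does the grid take?
int num_sms = 132;                                   // e.g. NVIDIA H100
int num_blocks = 396;
int waves = (num_blocks + num_sms - 1) / num_sms;    // = 3: each SM runs 3 blocks
// With 133 blocks instead, waves = 2 -- the grid takes a whole extra wave
// even though it is only ~4% more work.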

42 of 45

The Second Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

How do these compare?

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

43 of 45

The Second Puzzle

1024 threads

1024 threads

128 blocks

my_kernel<<<128, 1024>>>(...);

1024 threads

1024 threads

133 blocks

my_kernel<<<133, 1024>>>(...);

~4% increase

133 blocks, 132 SMs → 1 SM has to run 2 blocks!

44 of 45

Bonus Chatter

  • What if different threads have different control flow?
    • To a first approximation: masking (see the sketch after this list)
  • Why do SMs exist at all? Why not only warp schedulers?
    • Shared scratchpad memory
  • What about clusters / tensor cores / TMA / weirder features?
    • Q&A!
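For the control-flow point above, a tiny illustrative kernel (mine, not from the talk) showing what divergence and masking look like in practice:

__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Within each 32-thread warp, half the threads take each branch.
    // The warp executes BOTH branches in sequence; on each one, the
    // threads that didn't take it are masked off, so the warp roughly
    // pays for the sum of the two paths.
    if (threadIdx.x % 32 < 16) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
}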

45 of 45

Q&A!

Materials are online!