1 of 23

Lecture 9: Reductions

Mark Saroufim

2 of 23

Follow along

Chapter 10 of PMPP book

Locally or remotely https://lightning.ai/

git clone https://github.com/cuda-mode/lectures

cd lecture9

nvcc -o sum *.cu

ncu ./sum

3 of 23

What’s a reduction

Operations that reduce the output size

Most commonly they take a vector and produce a scalar

min, max, argmax, argmin, norm, sum, prod, mean, unique

Demo: torch_reductions.py

4 of 23

Reductions are everywhere

  • Mean/Max pooling
  • Classification: Argmax
  • Loss calculations
  • Softmax normalization

5 of 23

Reductions in PyTorch

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/ReduceOps.cpp

>>> a = torch.randn(1, 3)

>>> a

tensor([[ 0.6763, 0.7445, -2.2369]])

>>> torch.max(a)

tensor(0.7445)

6 of 23

Serial reduction example

Max operation

Go through the elements one by one

Compare each new element to the current max; if it is greater, update the max
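As a sketch, the serial max reduction looks like this in plain C host code (illustrative only, not the lecture's demo file):

#include <float.h>

// Serial max reduction: one pass over the data, O(n) steps.
float serial_max(const float* data, int n) {
    float current_max = -FLT_MAX;      // identity element for max
    for (int i = 0; i < n; ++i) {
        if (data[i] > current_max) {   // compare the new element to the old max
            current_max = data[i];     // if greater, update
        }
    }
    return current_max;
}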

7 of 23

More general formulation
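A sketch of the usual formulation (assuming the standard PMPP-style definition): a reduction is defined by a binary operator $\oplus$ that is associative and has an identity element $I$:

$$
\mathrm{reduce}(\oplus, I, [x_1, x_2, \ldots, x_n]) = I \oplus x_1 \oplus x_2 \oplus \cdots \oplus x_n
$$

Serially this is $a_0 = I$, $a_i = a_{i-1} \oplus x_i$, with result $a_n$. For sum, $\oplus = +$ and $I = 0$; for max, $\oplus = \max$ and $I = -\infty$. Associativity is what lets the parallel versions regroup these operations into a tree.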

8 of 23

Transformation vs reduction

What should the thread strategy be?

Output size < input size, which is why we call them reductions

https://www.youtube.com/watch?v=D4l1YMsGNlU&t=1763s

9 of 23

Parallel Reduction visualization

At each step, take pairs of elements, compute the max of each pair, and store the results in a new, half-sized vector

Continue until only one element remains

O(log n) steps
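A host-side sketch of this reduction tree (illustrative only; it assumes n is a power of two, and on the GPU each level's pairs are handled by different threads in parallel):

#include <math.h>

// Pairwise tree reduction: halve the number of active elements each step.
// Each iteration of the outer loop is one level of the reduction tree.
float tree_max(float* data, int n) {   // n assumed to be a power of two
    for (int active = n / 2; active >= 1; active /= 2) {
        for (int i = 0; i < active; ++i) {
            data[i] = fmaxf(data[i], data[i + active]);
        }
    }
    return data[0];
}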

10 of 23

Reduction Trees:

11 of 23

Non-determinism and accuracy

torch.use_deterministic_algorithms(True)

Demo

  • nondeterminism.py
  • accuracy.py
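The root cause is that floating-point addition is not associative, so changing the order in which a reduction combines elements changes the result; parallel reductions (especially ones using atomics) change that order. A minimal host-code sketch, separate from the demo files above:

#include <stdio.h>

// Grouping the same three numbers differently gives different float results.
int main() {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("(a + b) + c = %f\n", (a + b) + c);   // prints 1.000000
    printf("a + (b + c) = %f\n", a + (b + c));   // prints 0.000000
    return 0;
}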

12 of 23

Reduction Kernel

  • A lot of threads will be inactive :(
  • A lot of warps (groups of 32 threads) will be inactive :(
  • Let’s check ncu --set full

simple_reduce.cu
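A sketch of what such a naive kernel looks like, in the spirit of PMPP Chapter 10 (simple_reduce.cu in the repo is presumably similar, but may differ in details). Each block reduces 2 * blockDim.x elements in place, and the modulo test leaves the surviving threads scattered across warps:

// Naive in-place block reduction: thread t owns element 2*t.
// After each step only threads whose index is a multiple of the stride
// stay active, so most threads (and eventually most warps) sit idle.
__global__ void SimpleSumReductionKernel(float* input, float* output) {
    unsigned int i = 2 * threadIdx.x;
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0) {
            input[i] += input[i + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        *output = input[0];
    }
}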

13 of 23

Remember the performance checklist

Lecture 8!

  • Control divergence
  • Memory divergence
  • Minimize global memory access
  • Thread coarsening

14 of 23

Minimize Control Divergence

Assign elements so that the active threads, and the positions they own, stay close together as the reduction progresses; that way entire warps go idle instead of every warp staying partially active

Quiz: Which other problem does this fix?

control_divergence_reduce
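A sketch of the convergent version (again PMPP-style; control_divergence_reduce in the repo is presumably similar). Thread t owns position t and the stride shrinks instead of growing, so the active threads are always the contiguous prefix 0..stride-1 and whole warps retire early:

// Convergent block reduction over 2 * blockDim.x elements.
// Active threads stay packed at the front of the block, so there is no
// divergence within a warp until the stride drops below the warp size.
// Neighboring threads also touch neighboring addresses, so the global
// loads are coalesced.
__global__ void ConvergentSumReductionKernel(float* input, float* output) {
    unsigned int i = threadIdx.x;
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {
        if (threadIdx.x < stride) {
            input[i] += input[i + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        *output = input[0];
    }
}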

15 of 23

Minimize Global Memory Access

shared_reduce.cu
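A sketch of the shared-memory version (PMPP-style; shared_reduce.cu is presumably similar). Each thread loads two elements from global memory once; all intermediate partial sums then live in on-chip shared memory, and global memory is only touched again for the final write:

#define BLOCK_DIM 1024   // assumption: launched with BLOCK_DIM threads per block

__global__ void SharedMemorySumReductionKernel(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;
    // One pair of global loads per thread, then the whole tree runs in shared memory.
    input_s[t] = input[t] + input[t + BLOCK_DIM];
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
    }
    if (t == 0) {
        *output = input_s[0];
    }
}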

16 of 23

Hierarchical reduction

Let’s try running with an input size of 4096

segment_reduce.cu
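A sketch of a hierarchical (segmented) reduction in the same PMPP style (segment_reduce.cu is presumably similar). Each block reduces its own segment of 2 * BLOCK_DIM elements in shared memory and then atomically adds its partial sum into the single global result; with BLOCK_DIM = 1024, an input of 4096 elements uses two blocks:

#define BLOCK_DIM 1024   // assumption: launched with BLOCK_DIM threads per block

// *output must be zero-initialized before the launch.
__global__ void SegmentedSumReductionKernel(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int segment = 2 * blockDim.x * blockIdx.x;   // this block's chunk
    unsigned int i = segment + threadIdx.x;
    unsigned int t = threadIdx.x;
    input_s[t] = input[i] + input[i + BLOCK_DIM];
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
    }
    if (t == 0) {
        atomicAdd(output, input_s[0]);   // combine per-block partial sums
    }
}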

17 of 23

Thread Coarsening (Andreas’ favorite optimization)

reduce_coarsening.cu
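A sketch of the coarsened version (PMPP-style; reduce_coarsening.cu is presumably along these lines). Each thread first sums COARSE_FACTOR * 2 elements serially into a register before the shared-memory tree starts, so fewer blocks are launched and less time is spent on synchronization and idle threads:

#define BLOCK_DIM 1024     // assumption: launched with BLOCK_DIM threads per block
#define COARSE_FACTOR 2    // assumption: each thread serially accumulates 2x more data

// *output must be zero-initialized before the launch.
__global__ void CoarsenedSumReductionKernel(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int segment = COARSE_FACTOR * 2 * blockDim.x * blockIdx.x;
    unsigned int i = segment + threadIdx.x;
    unsigned int t = threadIdx.x;
    // Serial accumulation phase: no synchronization needed.
    float sum = input[i];
    for (unsigned int tile = 1; tile < COARSE_FACTOR * 2; ++tile) {
        sum += input[i + tile * BLOCK_DIM];
    }
    input_s[t] = sum;
    // Shared-memory tree reduction, same as before.
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
    }
    if (t == 0) {
        atomicAdd(output, input_s[0]);
    }
}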

18 of 23

Next steps

Lectures 1-8 gave you everything you need to start writing, profiling, and shipping kernels in PyTorch, so start picking a project. Look for collaborators in #general to stay motivated.

The next lecturer is Oscar, who will talk about shipping production CUDA libraries

Looking for lecturers interested in covering prefix sum (scan) and NCCL

19 of 23

Bonus slides: Reductions in the real world

20 of 23

Examples of reductions

User-facing ops

How reductions are implemented in PyTorch

21 of 23

Key ideas

  • The implementation is agnostic to the accumulator type and the reduction op
  • TensorIterator is used to iterate over the tensor elements
  • ReduceConfig: holds kernel launch parameters (block size, number of threads, grid, etc.) and is set in setReduceConfig
  • reduce_kernel is where the kernel gets launched
  • Reduction strategies: thread-level, block-level (x or y), or global reduce
  • Vectorization: Over input and/or output

22 of 23

torch.compile!

To the notebook: reduce_compile.py

Look out for

  • ReductionHint
  • tl.sum
  • triton_heuristics

23 of 23

Triton

https://github.com/openai/triton/blob/main/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp

// First reduce all the values along axis within each thread.

reduceWithinThreads(helper, srcValues, accs, indices, rewriter);

// Then reduce across threads within a warp.

reduceWithinWarps(helper, accs, rewriter);