Lecture 9: Reductions
Mark Saroufim
Follow along
Chapter 10 of PMPP book
Locally or remotely https://lightning.ai/
git clone https://github.com/cuda-mode/lectures
cd lecture9
nvcc -o sum *.cu
ncu sum
What’s a reduction
Operations that produce an output smaller than the input
The most typical ones take a vector and produce a scalar
min, max, argmax, argmin, norm, sum, prod, mean, unique
Demo: torch_reductions.py
Reductions are everywhere
Reductions in PyTorch
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/ReduceOps.cpp
>>> a = torch.randn(1, 3)
>>> a
tensor([[ 0.6763, 0.7445, -2.2369]])
>>> torch.max(a)
tensor(0.7445)
Serial reduction example
Max operation
Go through the elements one by one
Compare each new number to the current max; if it is greater, update the max
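A minimal serial sketch of that loop (plain C++ host code; serial_max is an illustrative name, not a file from the lecture repo):

#include <cfloat>

// Walk the array once, keeping a running max.
float serial_max(const float* data, int n) {
    float max_val = -FLT_MAX;      // smallest possible starting value
    for (int i = 0; i < n; ++i) {  // go through the elements one by one
        if (data[i] > max_val) {   // compare to the current max
            max_val = data[i];     // if greater, update
        }
    }
    return max_val;
}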
More general formulation
Transformation vs reduction
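One way to write the general formulation down: a reduction is an identity value plus a binary operator folded over the input, while a transformation maps each element independently and keeps the size. A sketch under those assumptions (reduce_serial and the lambda are illustrative names, not repo code):

#include <cfloat>

// General serial reduction: fold a binary op over the input,
// starting from that op's identity value.
template <typename T, typename Op>
T reduce_serial(const T* data, int n, T identity, Op op) {
    T acc = identity;
    for (int i = 0; i < n; ++i) acc = op(acc, data[i]);
    return acc;
}

// max uses identity -FLT_MAX, sum uses 0.0f, prod uses 1.0f, e.g.:
// float m = reduce_serial(x, n, -FLT_MAX,
//                         [](float a, float b) { return a > b ? a : b; });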
What should the thread strategy be?
Output size < input size: that’s why we call them reductions
https://www.youtube.com/watch?v=D4l1YMsGNlU&t=1763s
Parallel Reduction visualization
At each step, take pairs of elements, compute their max, and store the results in a new vector
Continue until only 1 element remains in the vector
O(log n) steps
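A host-side sketch of the same tree (not yet a CUDA kernel; tree_max is an illustrative name): each pass takes pairwise maxes and halves the vector, so n elements need O(log n) passes.

#include <algorithm>
#include <utility>
#include <vector>

float tree_max(std::vector<float> v) {
    while (v.size() > 1) {
        std::vector<float> next((v.size() + 1) / 2);  // halved each step
        for (size_t i = 0; i < next.size(); ++i) {
            size_t j = 2 * i + 1;
            // pairwise max; the last element passes through if size is odd
            next[i] = (j < v.size()) ? std::max(v[2 * i], v[j]) : v[2 * i];
        }
        v = std::move(next);
    }
    return v[0];
}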
Reduction Trees
Non determinism and accuracy
torch.use_deterministic_algorithms(True)
Demo
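The root cause: floating-point addition is not associative, (a + b) + c can round differently from a + (b + c), and a parallel reduction does not pin down the combination order. A minimal sketch of a kernel that can give run-to-run differences (atomic_sum is an illustrative name, not the demo code):

// Every thread folds its element into *out with atomicAdd. The hardware
// gives no ordering guarantee across threads, and float addition is not
// associative, so the rounded result can change between runs.
__global__ void atomic_sum(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}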
Reduction Kernel
simple_reduce.cu
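A sketch in the spirit of the simple kernel from PMPP chapter 10 (the repo's simple_reduce.cu may differ in details): one block of n/2 threads, where thread t owns position 2*t and folds its partner in, doubling the stride each step.

__global__ void simple_sum_reduction(float* input, float* output) {
    unsigned int i = 2 * threadIdx.x;  // each thread owns an even slot
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0) {    // active threads spread apart
            input[i] += input[i + stride];  // fold in the partner element
        }
        __syncthreads();  // finish this level before the next one starts
    }
    if (threadIdx.x == 0) {
        *output = input[0];
    }
}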
Remember the performance checklist
Lecture 8!
Minimize Control Divergence
Ensure that threads and the positions they own stay close together as the reduction progresses
Quiz: Which other problem does this fix?
control_divergence_reduce
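A sketch in the spirit of control_divergence_reduce.cu: the stride now shrinks from blockDim.x down to 1 and the active threads stay packed at the front of the block, so whole warps retire instead of half-idling. (A likely answer to the quiz: the packed, contiguous accesses also improve memory coalescing.)

__global__ void convergent_sum_reduction(float* input, float* output) {
    unsigned int i = threadIdx.x;  // owned positions start packed at 0
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {
        if (threadIdx.x < stride) {        // active threads stay contiguous
            input[i] += input[i + stride]; // fold in the far half
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        *output = input[0];
    }
}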
Minimize Global Memory Access
shared_reduce.cu
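A sketch in the spirit of shared_reduce.cu: after one global read per thread, every intermediate partial sum lives in shared memory, so the repeated reads and writes of the tree no longer touch DRAM. BLOCK_DIM is an assumed compile-time block size.

#define BLOCK_DIM 1024  // assumed; must match the launch configuration

__global__ void shared_sum_reduction(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;
    // One global read per thread: fold the two halves while loading.
    input_s[t] = input[t] + input[t + BLOCK_DIM];
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];  // traffic stays on-chip
        }
    }
    if (t == 0) {
        *output = input_s[0];
    }
}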
Hierarchical reduction
Let’s try running with an input size of 4096
segment_reduce.cu
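A sketch in the spirit of segment_reduce.cu: each block reduces its own segment of 2 * BLOCK_DIM elements in shared memory, then one thread per block atomically folds the partial result into the global output. With 4096 inputs and an assumed BLOCK_DIM of 1024, that is two blocks of work.

#define BLOCK_DIM 1024  // assumed block size

__global__ void segmented_sum_reduction(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    // Each block owns a contiguous segment of 2 * BLOCK_DIM elements.
    unsigned int segment = 2 * blockDim.x * blockIdx.x;
    unsigned int i = segment + threadIdx.x;
    unsigned int t = threadIdx.x;
    input_s[t] = input[i] + input[i + BLOCK_DIM];
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
    }
    if (t == 0) {
        atomicAdd(output, input_s[0]);  // combine the per-block partials
    }
}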
Thread Coarsening (Andreas’ favorite optimization)
reduce_coarsening.cu
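A sketch in the spirit of reduce_coarsening.cu: each thread first sums COARSE_FACTOR * 2 elements serially (cheap, no synchronization) before entering the shared-memory tree, so the same input needs fewer blocks and fewer __syncthreads() rounds. BLOCK_DIM and COARSE_FACTOR are assumed values.

#define BLOCK_DIM 1024   // assumed block size
#define COARSE_FACTOR 2  // assumed coarsening factor

__global__ void coarsened_sum_reduction(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int segment = COARSE_FACTOR * 2 * blockDim.x * blockIdx.x;
    unsigned int i = segment + threadIdx.x;
    unsigned int t = threadIdx.x;
    // Serial phase: each thread privately sums COARSE_FACTOR * 2 elements.
    float sum = input[i];
    for (unsigned int tile = 1; tile < COARSE_FACTOR * 2; ++tile) {
        sum += input[i + tile * BLOCK_DIM];
    }
    input_s[t] = sum;
    // Parallel phase: the usual shared-memory tree over the block.
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
    }
    if (t == 0) {
        atomicAdd(output, input_s[0]);
    }
}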
Next steps
Lectures 1-8 gave you everything you need to start writing, profiling, and shipping kernels in PyTorch, so start picking a project. Look for collaborators in #general to stay motivated.
The next lecturer is Oscar, who will talk about shipping production CUDA libraries
We’re looking for lecturers interested in covering prefix sum (scan) and NCCL
Bonus slides: Reductions in the real world
Example of reductions
User-facing ops
How reductions are implemented in PyTorch
Key ideas
torch.compile!
To the notebook - reduce_compile.py
Look out for
Triton
https://github.com/openai/triton/blob/main/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
// First reduce all the values along axis within each thread.
reduceWithinThreads(helper, srcValues, accs, indices, rewriter);
// Then reduce across threads within a warp.
reduceWithinWarps(helper, accs, rewriter);