CHAPTER 11
Prefix Sum (Scan)
Wen-mei Hwu, David Kirk, Izzat El Hajj
Programming Massively Parallel Processors
A Hands-on Approach
Copyright © 2022 Elsevier
Scan
Scan Example

Inclusive Scan:
Input:  x0 | x1     | x2     | x3     | x4     | x5     | x6     | x7     |
Output: x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
Input:  3 | 6 | 7  | 4  | 8  | 2  | 1  | 9  |
Output: 3 | 9 | 16 | 20 | 28 | 30 | 31 | 40 |

Exclusive Scan:
Input:  x0 | x1 | x2     | x3     | x4     | x5     | x6     | x7     |
Output: ID | x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 |
Input:  3 | 6 | 7 | 4  | 8  | 2  | 1  | 9  |
Output: 0 | 3 | 9 | 16 | 20 | 28 | 30 | 31 |
(ID is the identity of the operator: 0 for addition)
Sequential Scan

Inclusive scan (addition):
output[0] = input[0];
for(i = 1; i < N; ++i) {
    output[i] = output[i-1] + input[i];
}

Exclusive scan (addition):
output[0] = 0.0f;
for(i = 1; i < N; ++i) {
    output[i] = output[i-1] + input[i-1];
}

Scan generalizes to any associative operator f with identity IDENTITY:

Inclusive scan (general operator):
output[0] = input[0];
for(i = 1; i < N; ++i) {
    output[i] = f(output[i-1], input[i]);
}

Exclusive scan (general operator):
output[0] = IDENTITY;
for(i = 1; i < N; ++i) {
    output[i] = f(output[i-1], input[i-1]);
}
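To make the general form concrete, here is a small self-contained host example (not from the slides; the function name maxScanInclusive and the test values are illustrative) that instantiates f with max:

#include <math.h>
#include <stdio.h>

// Sequential inclusive max-scan: output[i] = max(input[0], ..., input[i])
void maxScanInclusive(const float* input, float* output, int N) {
    output[0] = input[0];
    for(int i = 1; i < N; ++i) {
        output[i] = fmaxf(output[i-1], input[i]);
    }
}

int main(void) {
    float input[8] = {3, 6, 7, 4, 8, 2, 1, 9};
    float output[8];
    maxScanInclusive(input, output, 8);
    for(int i = 0; i < 8; ++i) printf("%g ", output[i]);  // prints: 3 6 7 7 8 8 8 9
    printf("\n");
    return 0;
}

For the exclusive variant, the identity of max over floats would be -INFINITY.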
Segmented Scan
Segmented Scan Example

The figure illustrates the hierarchical approach for scanning a large array:
- Block 0 (Scan), Block 1 (Scan), Block 2 (Scan), Block 3 (Scan): each thread block scans its own segment and records the segment's total as a partial sum
- Scan Partial Sums: the partial sums are themselves scanned
- Block 1 (Add), Block 2 (Add), Block 3 (Add): each block except the first adds the scanned partial sum of the preceding segments to all of its elements
For now, we will focus on implementing a parallel scan within each block; a sketch of how the three steps fit together is shown below.
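As a forward-looking illustration only (not code from the slides), a host-side driver for the three steps might look like the following sketch. The names scanKernel and addKernel are hypothetical: scanKernel is assumed to scan one segment per block and write each block's total to partialSums, as the kernels developed later in this chapter do. The sketch also assumes N is a multiple of BLOCK_DIM and, since the slide kernels do not guard out-of-bounds accesses, that the number of blocks equals BLOCK_DIM so the partial sums can be scanned by a single block:

void scan(float* input_d, float* output_d, unsigned int N) {
    unsigned int numBlocks = N/BLOCK_DIM;
    // One partial sum per block, holding that segment's total
    float* partialSums_d;
    cudaMalloc((void**)&partialSums_d, numBlocks*sizeof(float));
    // Step 1: each block scans its own segment
    scanKernel<<<numBlocks, BLOCK_DIM>>>(input_d, output_d, partialSums_d, N);
    // Step 2: scan the partial sums with a single block (in place is safe
    // because each thread reads its input element before any thread writes);
    // a one-entry dummy buffer absorbs the reused kernel's block-total output
    float* dummy_d;
    cudaMalloc((void**)&dummy_d, sizeof(float));
    scanKernel<<<1, BLOCK_DIM>>>(partialSums_d, partialSums_d, dummy_d, numBlocks);
    // Step 3: block b adds the total of segments 0..b-1, i.e., the scanned
    // partial sum at index b - 1
    addKernel<<<numBlocks, BLOCK_DIM>>>(output_d, partialSums_d, N);
    cudaFree(dummy_d);
    cudaFree(partialSums_d);
}

A matching hypothetical addKernel:

__global__ void addKernel(float* output, float* partialSums, unsigned int N) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    // Blocks after the first add the total of all preceding segments
    if(blockIdx.x > 0) {
        output[i] += partialSums[blockIdx.x - 1];
    }
}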
Parallel (Inclusive) Scan
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x2 | x2..x3 | x4 | x4..x5 | x6 | x6..x7 |
x0 | x0..x1 | x2 | x0..x3 | x4 | x4..x5 | x6 | x4..x7 |
x0 | x0..x1 | x2 | x0..x3 | x4 | x4..x5 | x6 | x0..x7 |
A parallel reduction tree that computes the last element produces some of the other elements as byproducts
Parallel (Inclusive) Scan
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x3..x4 | x4..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x3..x4 | x4..x5 | x0..x6 | x0..x7 |
Another reduction tree gives us more elements
Parallel (Inclusive) Scan
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x3..x4 | x2..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x3..x4 | x0..x5 | x0..x6 | x0..x7 |
Keep doing reduction trees until we get all answers
Parallel (Inclusive) Scan
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x1..x4 | x2..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
Keep doing reduction trees until we get all answers
Parallel (Inclusive) Scan
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x1..x4 | x2..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
Overlap the trees and do them simultaneously
Kogge-Stone Parallel (Inclusive) Scan
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x1..x4 | x2..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
One thread for each element
Using Shared Memory
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x1..x4 | x2..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
Optimization: load the input once into a shared memory buffer so that the successive reads and writes of each iteration are performed in shared memory
One thread for each element
Kogge-Stone Parallel (Inclusive) Scan Code
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
__shared__ float buffer_s[BLOCK_DIM];
// Load the block's segment into shared memory
buffer_s[threadIdx.x] = input[i];
__syncthreads();
for(unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
    if(threadIdx.x >= stride) {
        // Race: buffer_s[threadIdx.x - stride] may already have been
        // updated by its owner thread in this same iteration
        buffer_s[threadIdx.x] += buffer_s[threadIdx.x - stride];
    }
    __syncthreads();
}
// The last thread records the block's total for the partial sums array
if(threadIdx.x == BLOCK_DIM - 1) {
    partialSums[blockIdx.x] = buffer_s[threadIdx.x];
}
output[i] = buffer_s[threadIdx.x];
Incorrect!
Different threads are reading and writing the same data location without synchronizing
Kogge-Stone Parallel (Inclusive) Scan
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x1..x4 | x2..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
Thread 1 may update the value at index 1 before thread 2 reads it
Solution: wait for everyone to read before updating
Kogge-Stone Parallel (Inclusive) Scan Code
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
__shared__ float buffer_s[BLOCK_DIM];
buffer_s[threadIdx.x] = input[i];
__syncthreads();
for(unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
    // Read the partner element into a register first
    float v;
    if(threadIdx.x >= stride) {
        v = buffer_s[threadIdx.x - stride];
    }
    // Wait for everyone to finish reading before anyone writes
    __syncthreads();
    if(threadIdx.x >= stride) {
        buffer_s[threadIdx.x] += v;
    }
    // Wait for everyone to finish writing before the next iteration reads
    __syncthreads();
}
if(threadIdx.x == BLOCK_DIM - 1) {
    partialSums[blockIdx.x] = buffer_s[threadIdx.x];
}
output[i] = buffer_s[threadIdx.x];
Wait for everyone to read before writing
True and False Dependences
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
__shared__ float buffer_s[BLOCK_DIM];
buffer_s[threadIdx.x] = input[i];
__syncthreads();
for(unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
    float v;
    if(threadIdx.x >= stride) {
        v = buffer_s[threadIdx.x - stride];
    }
    __syncthreads();  // (A)
    if(threadIdx.x >= stride) {
        buffer_s[threadIdx.x] += v;
    }
    __syncthreads();  // (B)
}
if(threadIdx.x == BLOCK_DIM - 1) {
    partialSums[blockIdx.x] = buffer_s[threadIdx.x];
}
output[i] = buffer_s[threadIdx.x];
(A) This synchronization enforces a false dependence (we only need to finish reading before others write because we are using the same buffer)
(B) This synchronization enforces a true dependence (we must finish writing before others can read)
Double Buffering

buffer1: x0 | x1     | x2     | x3     | x4     | x5     | x6     | x7     |
buffer2: x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |   (in = buffer1, out = buffer2)
buffer1: x0 | x0..x1 | x0..x2 | x0..x3 | x1..x4 | x2..x5 | x3..x6 | x4..x7 |   (in = buffer2, out = buffer1)
buffer2: x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |   (in = buffer1, out = buffer2)

Optimization: eliminate the synchronization that enforces the false dependence by using separate buffers for reading and writing, alternating the buffers each iteration (called double buffering)
Threads that are not adding must copy their values to the output buffer
Double Buffering Code
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
__shared__ float buffer1_s[BLOCK_DIM];
__shared__ float buffer2_s[BLOCK_DIM];
float* inBuffer_s = buffer1_s;
float* outBuffer_s = buffer2_s;
inBuffer_s[threadIdx.x] = input[i];
__syncthreads();
for(unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
    if(threadIdx.x >= stride) {
        outBuffer_s[threadIdx.x] =
            inBuffer_s[threadIdx.x] + inBuffer_s[threadIdx.x - stride];
    } else {
        // Threads that are not adding copy their values forward
        outBuffer_s[threadIdx.x] = inBuffer_s[threadIdx.x];
    }
    // Only one barrier per iteration: reads and writes use different buffers
    __syncthreads();
    // Swap the buffers for the next iteration
    float* tmp = inBuffer_s;
    inBuffer_s = outBuffer_s;
    outBuffer_s = tmp;
}
// After the final swap, inBuffer_s holds the completed scan
if(threadIdx.x == BLOCK_DIM - 1) {
    partialSums[blockIdx.x] = inBuffer_s[threadIdx.x];
}
output[i] = inBuffer_s[threadIdx.x];
Work Efficiency

The Kogge-Stone iteration with stride s performs N - s additions, for strides 1, 2, 4, ..., N/2. Total:
(N - 1) + (N - 2) + (N - 4) + ... + (N - N/2) = N*log2(N) - (N - 1) = O(N*log(N)) operations
The sequential scan performs only N - 1 additions, so Kogge-Stone is not work-efficient.
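For reference, the count behind this expression (a standard derivation): the iteration with stride 2^d performs N - 2^d additions, so the total is

\[
\sum_{d=0}^{\log_2 N - 1} \left( N - 2^d \right) = N \log_2 N - (N - 1) = O(N \log N).
\]

For N = 1024 this is 1024*10 - 1023 = 9217 additions, versus N - 1 = 1023 for the sequential scan.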
Brent-Kung Parallel (Inclusive) Scan

Reduction Stage:
x0 | x1     | x2 | x3     | x4 | x5     | x6 | x7     |
x0 | x0..x1 | x2 | x2..x3 | x4 | x4..x5 | x6 | x6..x7 |
x0 | x0..x1 | x2 | x0..x3 | x4 | x4..x5 | x6 | x4..x7 |
x0 | x0..x1 | x2 | x0..x3 | x4 | x4..x5 | x6 | x0..x7 |

Post-Reduction Stage:
x0 | x0..x1 | x2     | x0..x3 | x4     | x0..x5 | x6     | x0..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
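The slides give no code for this version, so below is a minimal sketch of a Brent-Kung scan kernel body under the same conventions as the preceding Kogge-Stone code (one thread per element, BLOCK_DIM a power of two, input/output/partialSums as before). It is an illustration, not the book's implementation; the book's version processes two elements per thread to halve the number of threads.

unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
__shared__ float buffer_s[BLOCK_DIM];
buffer_s[threadIdx.x] = input[i];
__syncthreads();
// Reduction stage: build a reduction tree in place
for(unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
    // Map consecutive threads to the elements active at this stride
    unsigned int idx = (threadIdx.x + 1)*2*stride - 1;
    if(idx < BLOCK_DIM) {
        buffer_s[idx] += buffer_s[idx - stride];
    }
    __syncthreads();
}
// Post-reduction stage: distribute the partial sums back down the tree
for(unsigned int stride = BLOCK_DIM/4; stride >= 1; stride /= 2) {
    unsigned int idx = (threadIdx.x + 1)*2*stride - 1;
    if(idx + stride < BLOCK_DIM) {
        buffer_s[idx + stride] += buffer_s[idx];
    }
    __syncthreads();
}
if(threadIdx.x == BLOCK_DIM - 1) {
    partialSums[blockIdx.x] = buffer_s[threadIdx.x];
}
output[i] = buffer_s[threadIdx.x];

Note that within each iteration no element is both read and written by different threads (the positions written are distinct from the positions read), so a single __syncthreads() per iteration suffices here.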
Kogge-Stone vs. Brent-Kung

Kogge-Stone (log2(N) steps):
x0 | x1     | x2     | x3     | x4     | x5     | x6     | x7     |
x0 | x0..x1 | x1..x2 | x2..x3 | x3..x4 | x4..x5 | x5..x6 | x6..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x1..x4 | x2..x5 | x3..x6 | x4..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |

Brent-Kung (2*log2(N) - 1 steps):
x0 | x1     | x2     | x3     | x4     | x5     | x6     | x7     |
x0 | x0..x1 | x2     | x2..x3 | x4     | x4..x5 | x6     | x6..x7 |
x0 | x0..x1 | x2     | x0..x3 | x4     | x4..x5 | x6     | x4..x7 |
x0 | x0..x1 | x2     | x0..x3 | x4     | x4..x5 | x6     | x0..x7 |
x0 | x0..x1 | x2     | x0..x3 | x4     | x0..x5 | x6     | x0..x7 |
x0 | x0..x1 | x0..x2 | x0..x3 | x0..x4 | x0..x5 | x0..x6 | x0..x7 |
Work Efficiency

Brent-Kung is work-efficient: it performs O(N) operations, the same asymptotic work as the sequential scan, at the cost of roughly twice as many steps as Kogge-Stone.
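Written out (a standard count, not from the slides):

\[
\underbrace{(N - 1)}_{\text{reduction stage}} + \underbrace{(N - 1 - \log_2 N)}_{\text{post-reduction stage}} = 2N - 2 - \log_2 N = O(N).
\]

For N = 1024 this is 1023 + 1013 = 2036 additions, compared with 9217 for Kogge-Stone.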