
CHAPTER 11

Prefix Sum (Scan)

Wen-mei Hwu, David Kirk, Izzat El Hajj

Programming Massively Parallel Processors

A Hands-on Approach

Copyright © 2022 Elsevier


Scan

  • A scan operation:

    • Takes:
      • An input array [x0, x1, …, xn-1]
      • An associative operator ⊕
        • e.g., sum, product, min, max

    • Returns:
      • An output array [y0, y1, …, yn-1] where
        • Inclusive scan: yi = x0 ⊕ x1 ⊕ ... ⊕ xi
        • Exclusive scan: yi = x0 ⊕ x1 ⊕ ... ⊕ x(i-1), with y0 = identity of ⊕


Scan Example

  • In general, and with an addition example:

    Inclusive scan:
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Output:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7
      Input:   3   6       7       4       8       2       1       9
      Output:  3   9       16      20      28      30      31      40

    Exclusive scan:
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Output:  ID  x0      x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6
      Input:   3   6       7       4       8       2       1       9
      Output:  0   3       9       16      20      28      30      31

    (ID denotes the identity of the operator; 0 for addition)


Sequential Scan

  • Sequential scan for sum:

    Inclusive scan:
        output[0] = input[0];
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = output[i-1] + input[i];
        }

    Exclusive scan:
        output[0] = 0.0f;
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = output[i-1] + input[i-1];
        }

  • In general, for an associative operator f with identity IDENTITY:

    Inclusive scan:
        output[0] = input[0];
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = f(output[i-1], input[i]);
        }

    Exclusive scan:
        output[0] = IDENTITY;
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = f(output[i-1], input[i-1]);
        }
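For reference, here is the inclusive sum scan as a complete, runnable C program; it can later serve as a CPU baseline for checking GPU results. The function and variable names are ours, not the book's:

    #include <stdio.h>

    /* Sequential inclusive sum scan (CPU reference); the name
       sequentialInclusiveScan is our choice, not the book's. */
    void sequentialInclusiveScan(const float* input, float* output,
                                 unsigned int N) {
        output[0] = input[0];
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = output[i-1] + input[i];
        }
    }

    int main(void) {
        float in[8] = {3, 6, 7, 4, 8, 2, 1, 9};
        float out[8];
        sequentialInclusiveScan(in, out, 8);
        for (int i = 0; i < 8; ++i) {
            printf("%g ", out[i]);   /* prints: 3 9 16 20 28 30 31 40 */
        }
        printf("\n");
        return 0;
    }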


Segmented Scan

  • Parallel scan requires synchronization across parallel workers

  • Approach: segmented scan (a host-side sketch follows this list)
    • Every thread block scans a segment
    • Scan the segments’ partial sums
    • Add each segment’s scanned partial sum to the next segment
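A minimal host-side sketch of this three-step structure, assuming the Kogge-Stone scan_kernel developed later in this deck plus an add_kernel; all names, signatures, and launch details here are our assumptions, not the book's code. We also assume N is a multiple of BLOCK_DIM and numBlocks <= BLOCK_DIM, so the partial sums can be scanned by a single block:

    // Forward declaration; the kernel body appears later in this deck.
    __global__ void scan_kernel(const float* input, float* output,
                                float* partialSums);

    // Block i > 0 adds the scanned total of all preceding segments.
    __global__ void add_kernel(float* output, const float* scannedPartialSums) {
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (blockIdx.x > 0) {
            output[i] += scannedPartialSums[blockIdx.x - 1];
        }
    }

    void segmentedScan(const float* input_d, float* output_d, unsigned int N) {
        unsigned int numBlocks = N/BLOCK_DIM;   // assumes N % BLOCK_DIM == 0
        float *partialSums_d, *scannedPartialSums_d, *dummy_d;
        // Pad the partial-sums buffer to BLOCK_DIM and zero it so the
        // unguarded single-block scan below never reads garbage.
        cudaMalloc((void**)&partialSums_d, BLOCK_DIM*sizeof(float));
        cudaMalloc((void**)&scannedPartialSums_d, BLOCK_DIM*sizeof(float));
        cudaMalloc((void**)&dummy_d, sizeof(float));
        cudaMemset(partialSums_d, 0, BLOCK_DIM*sizeof(float));

        // (1) Each block scans its segment and records its total
        scan_kernel<<<numBlocks, BLOCK_DIM>>>(input_d, output_d, partialSums_d);
        // (2) Scan the per-block totals with one block (numBlocks <= BLOCK_DIM)
        scan_kernel<<<1, BLOCK_DIM>>>(partialSums_d, scannedPartialSums_d, dummy_d);
        // (3) Each block except the first adds its predecessor's scanned total
        add_kernel<<<numBlocks, BLOCK_DIM>>>(output_d, scannedPartialSums_d);

        cudaFree(partialSums_d);
        cudaFree(scannedPartialSums_d);
        cudaFree(dummy_d);
    }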


Segmented Scan Example

    Step 1:  Block 0 (Scan) | Block 1 (Scan) | Block 2 (Scan) | Block 3 (Scan)
    Step 2:  Scan Partial Sums
    Step 3:                   Block 1 (Add)  | Block 2 (Add)  | Block 3 (Add)

For now, we will focus on implementing a parallel scan in each block


Parallel (Inclusive) Scan

    Input:   x0  x1      x2  x3      x4  x5      x6  x7
    Step 1:  x0  x0..x1  x2  x2..x3  x4  x4..x5  x6  x6..x7
    Step 2:  x0  x0..x1  x2  x0..x3  x4  x4..x5  x6  x4..x7
    Step 3:  x0  x0..x1  x2  x0..x3  x4  x4..x5  x6  x0..x7

A parallel reduction tree for the last element gives some others as a byproduct


Parallel (Inclusive) Scan

    Input:   x0  x1      x2      x3      x4      x5      x6      x7
    Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x4..x5  x3..x6  x4..x7
    Step 3:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x4..x5  x0..x6  x0..x7

Another reduction tree gives us more elements


Parallel (Inclusive) Scan

    Input:   x0  x1      x2      x3      x4      x5      x6      x7
    Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x2..x5  x3..x6  x4..x7
    Step 3:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x0..x5  x0..x6  x0..x7

Keep doing reduction trees until we get all answers


Parallel (Inclusive) Scan

    Input:   x0  x1      x2      x3      x4      x5      x6      x7
    Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2:  x0  x0..x1  x0..x2  x0..x3  x1..x4  x2..x5  x3..x6  x4..x7
    Step 3:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7

Keep doing reduction trees until we get all answers


Parallel (Inclusive) Scan

    (Per-step array states are the same as on the previous slide.)

Overlap the trees and do them simultaneously


Kogge-Stone Parallel (Inclusive) Scan

    (Per-step array states are the same as on the previous slide.)

One thread for each element


Using Shared Memory

    (Diagram: same per-step array states as in the Kogge-Stone scan above.)

Optimization: load the input once into a shared memory buffer, so that the successive reads and writes of each step access shared memory instead of global memory

One thread for each element


Kogge-Stone Parallel (Inclusive) Scan Code

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer_s[BLOCK_DIM];
    buffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        if (threadIdx.x >= stride) {
            // BUG: may read buffer_s[threadIdx.x - stride] after another
            // thread has already overwritten it in this same iteration
            buffer_s[threadIdx.x] += buffer_s[threadIdx.x - stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = buffer_s[threadIdx.x];
    }
    output[i] = buffer_s[threadIdx.x];

Incorrect! Different threads read and write the same data location without synchronizing in between.


Kogge-Stone Parallel (Inclusive) Scan

    (Diagram: same per-step array states as in the Kogge-Stone scan above.)

Thread 1 may update value at index 1 before thread 2 reads it

Solution: wait for everyone to read before updating


Kogge-Stone Parallel (Inclusive) Scan Code

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer_s[BLOCK_DIM];
    buffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        float v;
        if (threadIdx.x >= stride) {
            v = buffer_s[threadIdx.x - stride];   // read into a register first
        }
        __syncthreads();                          // wait for everyone to read
        if (threadIdx.x >= stride) {
            buffer_s[threadIdx.x] += v;           // then write
        }
        __syncthreads();                          // wait for everyone to write
    }
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = buffer_s[threadIdx.x];
    }
    output[i] = buffer_s[threadIdx.x];

Wait for everyone to read before writing.
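For context, a usage sketch: the deck shows only the kernel body, so the kernel name, signature, and launch configuration below are our assumptions (and N is assumed to be a multiple of BLOCK_DIM so every load of input[i] is in bounds):

    // Our assumed wrapper around the body above:
    //   __global__ void scan_kernel(const float* input, float* output,
    //                               float* partialSums) { ...body above... }
    unsigned int numBlocks = N/BLOCK_DIM;   // assumes N % BLOCK_DIM == 0
    scan_kernel<<<numBlocks, BLOCK_DIM>>>(input_d, output_d, partialSums_d);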


True and False Dependences

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer_s[BLOCK_DIM];
    buffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        float v;
        if (threadIdx.x >= stride) {
            v = buffer_s[threadIdx.x - stride];
        }
        // This synchronization enforces a false dependence: we only need to
        // finish reading before others write because we share one buffer
        __syncthreads();
        if (threadIdx.x >= stride) {
            buffer_s[threadIdx.x] += v;
        }
        // This synchronization enforces a true dependence: we must finish
        // writing before others can read
        __syncthreads();
    }
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = buffer_s[threadIdx.x];
    }
    output[i] = buffer_s[threadIdx.x];


Double Buffering

Optimization: eliminate the synchronization that enforces a false dependence by using separate buffers for reading and writing, and alternate the buffers each iteration (called double buffering)

    Input   (buffer1):                     x0  x1      x2      x3      x4      x5      x6      x7
    Step 1  (in = buffer1, out = buffer2): x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2  (in = buffer2, out = buffer1): x0  x0..x1  x0..x2  x0..x3  x1..x4  x2..x5  x3..x6  x4..x7
    Step 3  (in = buffer1, out = buffer2): x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7

Threads not adding must copy their values


Double Buffering Code

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer1_s[BLOCK_DIM];
    __shared__ float buffer2_s[BLOCK_DIM];
    float* inBuffer_s = buffer1_s;
    float* outBuffer_s = buffer2_s;
    inBuffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        if (threadIdx.x >= stride) {
            outBuffer_s[threadIdx.x] =
                inBuffer_s[threadIdx.x] + inBuffer_s[threadIdx.x - stride];
        } else {
            // threads not adding must copy their values to the output buffer
            outBuffer_s[threadIdx.x] = inBuffer_s[threadIdx.x];
        }
        __syncthreads();   // only the true dependence remains
        // swap the buffers for the next iteration
        float* tmp = inBuffer_s;
        inBuffer_s = outBuffer_s;
        outBuffer_s = tmp;
    }
    // after the final swap, inBuffer_s points to the last-written buffer
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = inBuffer_s[threadIdx.x];
    }
    output[i] = inBuffer_s[threadIdx.x];


Work Efficiency

  • A parallel algorithm is work-efficient if it performs the same amount of work as the corresponding sequential algorithm
  • Scan work efficiency
    • Sequential scan performs N additions
    • Kogge-Stone parallel scan performs:
      • log(N) steps; step s (stride = 2^s) performs N - 2^s operations
      • Total: (N-1) + (N-2) + (N-4) + … + (N-N/2) = N*log(N) - (N-1) = O(N*log(N)) operations (derivation below)
    • The algorithm is not work-efficient
  • If resources are limited, parallel algorithm will be slow because of low work efficiency
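Written out, with step s using stride 2^s for s = 0, …, log2(N) − 1:

\[
\sum_{s=0}^{\log_2 N - 1} \bigl(N - 2^{s}\bigr)
  \;=\; N \log_2 N \;-\; \bigl(2^{\log_2 N} - 1\bigr)
  \;=\; N \log_2 N - (N - 1)
  \;=\; O(N \log N).
\]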


Brent-Kung Parallel (Inclusive) Scan

    Reduction stage:
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Step 1:  x0  x0..x1  x2      x2..x3  x4      x4..x5  x6      x6..x7
      Step 2:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x4..x7
      Step 3:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x0..x7

    Post-reduction stage:
      Step 4:  x0  x0..x1  x2      x0..x3  x4      x0..x5  x6      x0..x7
      Step 5:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7
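A minimal sketch of a Brent-Kung inclusive scan kernel for a single block, in which BLOCK_DIM threads process 2*BLOCK_DIM elements; the index arithmetic follows the diagram above, but the kernel name and exact code organization are our assumptions rather than the book's listing:

    #define BLOCK_DIM 1024

    // Hedged sketch: names and organization are our assumptions.
    __global__ void brent_kung_scan_kernel(const float* input, float* output,
                                           float* partialSums) {
        __shared__ float buffer_s[2*BLOCK_DIM];
        unsigned int segment = 2*blockIdx.x*blockDim.x;
        // each thread loads two elements
        buffer_s[threadIdx.x] = input[segment + threadIdx.x];
        buffer_s[threadIdx.x + BLOCK_DIM] = input[segment + threadIdx.x + BLOCK_DIM];
        __syncthreads();

        // reduction stage: log(2N) steps, N-1 operations in total
        for (unsigned int stride = 1; stride <= BLOCK_DIM; stride *= 2) {
            unsigned int i = (threadIdx.x + 1)*2*stride - 1;
            if (i < 2*BLOCK_DIM) {
                buffer_s[i] += buffer_s[i - stride];
            }
            __syncthreads();
        }

        // post-reduction stage: distribute the partial results
        for (unsigned int stride = BLOCK_DIM/2; stride >= 1; stride /= 2) {
            unsigned int i = (threadIdx.x + 1)*2*stride - 1;
            if (i + stride < 2*BLOCK_DIM) {
                buffer_s[i + stride] += buffer_s[i];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            partialSums[blockIdx.x] = buffer_s[2*BLOCK_DIM - 1];
        }
        output[segment + threadIdx.x] = buffer_s[threadIdx.x];
        output[segment + threadIdx.x + BLOCK_DIM] = buffer_s[threadIdx.x + BLOCK_DIM];
    }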


Kogge-Stone vs. Brent-Kung

    Brent-Kung (2*log(N)-1 = 5 steps for N = 8):
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Step 1:  x0  x0..x1  x2      x2..x3  x4      x4..x5  x6      x6..x7
      Step 2:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x4..x7
      Step 3:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x0..x7
      Step 4:  x0  x0..x1  x2      x0..x3  x4      x0..x5  x6      x0..x7
      Step 5:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7

    Kogge-Stone (log(N) = 3 steps for N = 8):
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
      Step 2:  x0  x0..x1  x0..x2  x0..x3  x1..x4  x2..x5  x3..x6  x4..x7
      Step 3:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7


Work Efficiency

  • Recall: Kogge-Stone
    • log(N) steps
    • O(N*log(N)) operations
  • Brent-Kung
    • Reduction stage:
      • log(N) steps
      • N/2 + N/4 + … + 4 + 2 + 1 = N-1 operations
    • Post-Reduction stage:
      • log(N)-1 steps
      • (2-1) + (4-1) + … + (N/2-1) = (N-2) - (log(N)-1) operations
    • Total:
      • 2*log(N)-1 steps
      • (N-1) + (N-2) - (log(N)-1) = 2*N – log(N) – 2 = O(N) operations
  • Brent-Kung takes more steps but is more work-efficient
  • So which one is faster?


References

  • Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
