
CHAPTER 11

Prefix Sum (Scan)

Wen-mei Hwu, David Kirk, Izzat El Hajj

Programming Massively Parallel Processors

A Hands-on Approach

Copyright © 2022 Elsevier


Scan

  • A scan operation:

    • Takes:
      • An input array [x0, x1, …, xn-1]
      • An associative operator ⊕
        • e.g., sum, product, min, max

    • Returns:
      • An output array [y0, y1, …, yn-1] where
        • Inclusive scan: yi = x0 ⊕ x1 ⊕ ... ⊕ xi
        • Exclusive scan: yi = x0 ⊕ x1 ⊕ ... ⊕ x(i-1), with y0 = identity of ⊕


Scan Example

  • In general, and with an addition example:

    Inclusive scan:
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Output:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7
      Input:   3   6       7       4       8       2       1       9
      Output:  3   9       16      20      28      30      31      40

    Exclusive scan:
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Output:  ID  x0      x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6
      Input:   3   6       7       4       8       2       1       9
      Output:  0   3       9       16      20      28      30      31

    (ID denotes the identity of the operator; 0 for addition)


Sequential Scan

  • Sequential scan for sum:

    Inclusive scan:
        output[0] = input[0];
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = output[i-1] + input[i];
        }

    Exclusive scan:
        output[0] = 0.0f;
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = output[i-1] + input[i-1];
        }

  • In general, for an associative operator f with identity IDENTITY:

    Inclusive scan:
        output[0] = input[0];
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = f(output[i-1], input[i]);
        }

    Exclusive scan:
        output[0] = IDENTITY;
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = f(output[i-1], input[i-1]);
        }
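For reference, here is the inclusive sum scan as a complete, runnable C program; it can later serve as a CPU baseline for checking GPU results. The function and variable names are ours, not the book's:

    #include <stdio.h>

    /* Sequential inclusive sum scan (CPU reference); the name
       sequentialInclusiveScan is our choice, not the book's. */
    void sequentialInclusiveScan(const float* input, float* output,
                                 unsigned int N) {
        output[0] = input[0];
        for (unsigned int i = 1; i < N; ++i) {
            output[i] = output[i-1] + input[i];
        }
    }

    int main(void) {
        float in[8] = {3, 6, 7, 4, 8, 2, 1, 9};
        float out[8];
        sequentialInclusiveScan(in, out, 8);
        for (int i = 0; i < 8; ++i) {
            printf("%g ", out[i]);   /* prints: 3 9 16 20 28 30 31 40 */
        }
        printf("\n");
        return 0;
    }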


Segmented Scan

  • Parallel scan requires synchronization across parallel workers

  • Approach: segmented scan (a host-side sketch follows this list)
    • Every thread block scans a segment
    • Scan the segments’ partial sums
    • Add each segment’s scanned partial sum to the next segment
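A minimal host-side sketch of this three-step structure, assuming the Kogge-Stone scan_kernel developed later in this deck plus an add_kernel; all names, signatures, and launch details here are our assumptions, not the book's code. We also assume N is a multiple of BLOCK_DIM and numBlocks <= BLOCK_DIM, so the partial sums can be scanned by a single block:

    // Forward declaration; the kernel body appears later in this deck.
    __global__ void scan_kernel(const float* input, float* output,
                                float* partialSums);

    // Block i > 0 adds the scanned total of all preceding segments.
    __global__ void add_kernel(float* output, const float* scannedPartialSums) {
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (blockIdx.x > 0) {
            output[i] += scannedPartialSums[blockIdx.x - 1];
        }
    }

    void segmentedScan(const float* input_d, float* output_d, unsigned int N) {
        unsigned int numBlocks = N/BLOCK_DIM;   // assumes N % BLOCK_DIM == 0
        float *partialSums_d, *scannedPartialSums_d, *dummy_d;
        // Pad the partial-sums buffer to BLOCK_DIM and zero it so the
        // unguarded single-block scan below never reads garbage.
        cudaMalloc((void**)&partialSums_d, BLOCK_DIM*sizeof(float));
        cudaMalloc((void**)&scannedPartialSums_d, BLOCK_DIM*sizeof(float));
        cudaMalloc((void**)&dummy_d, sizeof(float));
        cudaMemset(partialSums_d, 0, BLOCK_DIM*sizeof(float));

        // (1) Each block scans its segment and records its total
        scan_kernel<<<numBlocks, BLOCK_DIM>>>(input_d, output_d, partialSums_d);
        // (2) Scan the per-block totals with one block (numBlocks <= BLOCK_DIM)
        scan_kernel<<<1, BLOCK_DIM>>>(partialSums_d, scannedPartialSums_d, dummy_d);
        // (3) Each block except the first adds its predecessor's scanned total
        add_kernel<<<numBlocks, BLOCK_DIM>>>(output_d, scannedPartialSums_d);

        cudaFree(partialSums_d);
        cudaFree(scannedPartialSums_d);
        cudaFree(dummy_d);
    }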


Segmented Scan Example

    Step 1:  Block 0 (Scan) | Block 1 (Scan) | Block 2 (Scan) | Block 3 (Scan)
    Step 2:  Scan Partial Sums
    Step 3:                   Block 1 (Add)  | Block 2 (Add)  | Block 3 (Add)

For now, we will focus on implementing a parallel scan in each block


Parallel (Inclusive) Scan

    Input:   x0  x1      x2  x3      x4  x5      x6  x7
    Step 1:  x0  x0..x1  x2  x2..x3  x4  x4..x5  x6  x6..x7
    Step 2:  x0  x0..x1  x2  x0..x3  x4  x4..x5  x6  x4..x7
    Step 3:  x0  x0..x1  x2  x0..x3  x4  x4..x5  x6  x0..x7

A parallel reduction tree for the last element gives some others as a byproduct


Parallel (Inclusive) Scan

    Input:   x0  x1      x2      x3      x4      x5      x6      x7
    Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x4..x5  x3..x6  x4..x7
    Step 3:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x4..x5  x0..x6  x0..x7

Another reduction tree gives us more elements


Parallel (Inclusive) Scan

    Input:   x0  x1      x2      x3      x4      x5      x6      x7
    Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x2..x5  x3..x6  x4..x7
    Step 3:  x0  x0..x1  x0..x2  x0..x3  x3..x4  x0..x5  x0..x6  x0..x7

Keep doing reduction trees until we get all answers


Parallel (Inclusive) Scan

    Input:   x0  x1      x2      x3      x4      x5      x6      x7
    Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2:  x0  x0..x1  x0..x2  x0..x3  x1..x4  x2..x5  x3..x6  x4..x7
    Step 3:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7

Keep doing reduction trees until we get all answers


Parallel (Inclusive) Scan

    (Per-step array states are the same as on the previous slide.)

Overlap the trees and do them simultaneously


Kogge-Stone Parallel (Inclusive) Scan

    (Per-step array states are the same as on the previous slide.)

One thread for each element


Using Shared Memory

    (Diagram: same per-step array states as in the Kogge-Stone scan above.)

Optimization: load the input once into a shared memory buffer, so that the successive reads and writes of each step access shared memory instead of global memory

One thread for each element


Kogge-Stone Parallel (Inclusive) Scan Code

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer_s[BLOCK_DIM];
    buffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        if (threadIdx.x >= stride) {
            // BUG: may read buffer_s[threadIdx.x - stride] after another
            // thread has already overwritten it in this same iteration
            buffer_s[threadIdx.x] += buffer_s[threadIdx.x - stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = buffer_s[threadIdx.x];
    }
    output[i] = buffer_s[threadIdx.x];

Incorrect! Different threads read and write the same data location without synchronizing in between.


Kogge-Stone Parallel (Inclusive) Scan

    (Diagram: same per-step array states as in the Kogge-Stone scan above.)

Thread 1 may update value at index 1 before thread 2 reads it

Solution: wait for everyone to read before updating


Kogge-Stone Parallel (Inclusive) Scan Code

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer_s[BLOCK_DIM];
    buffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        float v;
        if (threadIdx.x >= stride) {
            v = buffer_s[threadIdx.x - stride];   // read into a register first
        }
        __syncthreads();                          // wait for everyone to read
        if (threadIdx.x >= stride) {
            buffer_s[threadIdx.x] += v;           // then write
        }
        __syncthreads();                          // wait for everyone to write
    }
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = buffer_s[threadIdx.x];
    }
    output[i] = buffer_s[threadIdx.x];

Wait for everyone to read before writing.
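For context, a usage sketch: the deck shows only the kernel body, so the kernel name, signature, and launch configuration below are our assumptions (and N is assumed to be a multiple of BLOCK_DIM so every load of input[i] is in bounds):

    // Our assumed wrapper around the body above:
    //   __global__ void scan_kernel(const float* input, float* output,
    //                               float* partialSums) { ...body above... }
    unsigned int numBlocks = N/BLOCK_DIM;   // assumes N % BLOCK_DIM == 0
    scan_kernel<<<numBlocks, BLOCK_DIM>>>(input_d, output_d, partialSums_d);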


True and False Dependences

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer_s[BLOCK_DIM];
    buffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        float v;
        if (threadIdx.x >= stride) {
            v = buffer_s[threadIdx.x - stride];
        }
        // This synchronization enforces a false dependence: we only need to
        // finish reading before others write because we share one buffer
        __syncthreads();
        if (threadIdx.x >= stride) {
            buffer_s[threadIdx.x] += v;
        }
        // This synchronization enforces a true dependence: we must finish
        // writing before others can read
        __syncthreads();
    }
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = buffer_s[threadIdx.x];
    }
    output[i] = buffer_s[threadIdx.x];


Double Buffering

Optimization: eliminate the synchronization that enforces a false dependence by using separate buffers for reading and writing, and alternate the buffers each iteration (called double buffering)

    Input   (buffer1):                     x0  x1      x2      x3      x4      x5      x6      x7
    Step 1  (in = buffer1, out = buffer2): x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
    Step 2  (in = buffer2, out = buffer1): x0  x0..x1  x0..x2  x0..x3  x1..x4  x2..x5  x3..x6  x4..x7
    Step 3  (in = buffer1, out = buffer2): x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7

Threads not adding must copy their values


Double Buffering Code

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float buffer1_s[BLOCK_DIM];
    __shared__ float buffer2_s[BLOCK_DIM];
    float* inBuffer_s = buffer1_s;
    float* outBuffer_s = buffer2_s;
    inBuffer_s[threadIdx.x] = input[i];
    __syncthreads();
    for (unsigned int stride = 1; stride <= BLOCK_DIM/2; stride *= 2) {
        if (threadIdx.x >= stride) {
            outBuffer_s[threadIdx.x] =
                inBuffer_s[threadIdx.x] + inBuffer_s[threadIdx.x - stride];
        } else {
            // threads not adding must copy their values to the output buffer
            outBuffer_s[threadIdx.x] = inBuffer_s[threadIdx.x];
        }
        __syncthreads();   // only the true dependence remains
        // swap the buffers for the next iteration
        float* tmp = inBuffer_s;
        inBuffer_s = outBuffer_s;
        outBuffer_s = tmp;
    }
    // after the final swap, inBuffer_s points to the last-written buffer
    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = inBuffer_s[threadIdx.x];
    }
    output[i] = inBuffer_s[threadIdx.x];


Work Efficiency

  • A parallel algorithm is work-efficient if it performs the same amount of work as the corresponding sequential algorithm
  • Scan work efficiency
    • Sequential scan performs N additions
    • Kogge-Stone parallel scan performs:
      • log(N) steps; step s (stride = 2^s) performs N - 2^s operations
      • Total: (N-1) + (N-2) + (N-4) + … + (N-N/2) = N*log(N) - (N-1) = O(N*log(N)) operations (derivation below)
    • The algorithm is not work-efficient
  • If resources are limited, parallel algorithm will be slow because of low work efficiency
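Written out, with step s using stride 2^s for s = 0, …, log2(N) − 1:

\[
\sum_{s=0}^{\log_2 N - 1} \bigl(N - 2^{s}\bigr)
  \;=\; N \log_2 N \;-\; \bigl(2^{\log_2 N} - 1\bigr)
  \;=\; N \log_2 N - (N - 1)
  \;=\; O(N \log N).
\]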


Brent-Kung Parallel (Inclusive) Scan

    Reduction stage:
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Step 1:  x0  x0..x1  x2      x2..x3  x4      x4..x5  x6      x6..x7
      Step 2:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x4..x7
      Step 3:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x0..x7

    Post-reduction stage:
      Step 4:  x0  x0..x1  x2      x0..x3  x4      x0..x5  x6      x0..x7
      Step 5:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7
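A minimal sketch of a Brent-Kung inclusive scan kernel for a single block, in which BLOCK_DIM threads process 2*BLOCK_DIM elements; the index arithmetic follows the diagram above, but the kernel name and exact code organization are our assumptions rather than the book's listing:

    #define BLOCK_DIM 1024

    // Hedged sketch: names and organization are our assumptions.
    __global__ void brent_kung_scan_kernel(const float* input, float* output,
                                           float* partialSums) {
        __shared__ float buffer_s[2*BLOCK_DIM];
        unsigned int segment = 2*blockIdx.x*blockDim.x;
        // each thread loads two elements
        buffer_s[threadIdx.x] = input[segment + threadIdx.x];
        buffer_s[threadIdx.x + BLOCK_DIM] = input[segment + threadIdx.x + BLOCK_DIM];
        __syncthreads();

        // reduction stage: log(2N) steps, N-1 operations in total
        for (unsigned int stride = 1; stride <= BLOCK_DIM; stride *= 2) {
            unsigned int i = (threadIdx.x + 1)*2*stride - 1;
            if (i < 2*BLOCK_DIM) {
                buffer_s[i] += buffer_s[i - stride];
            }
            __syncthreads();
        }

        // post-reduction stage: distribute the partial results
        for (unsigned int stride = BLOCK_DIM/2; stride >= 1; stride /= 2) {
            unsigned int i = (threadIdx.x + 1)*2*stride - 1;
            if (i + stride < 2*BLOCK_DIM) {
                buffer_s[i + stride] += buffer_s[i];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            partialSums[blockIdx.x] = buffer_s[2*BLOCK_DIM - 1];
        }
        output[segment + threadIdx.x] = buffer_s[threadIdx.x];
        output[segment + threadIdx.x + BLOCK_DIM] = buffer_s[threadIdx.x + BLOCK_DIM];
    }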


Kogge-Stone vs. Brent-Kung

    Brent-Kung (2*log(N)-1 = 5 steps for N = 8):
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Step 1:  x0  x0..x1  x2      x2..x3  x4      x4..x5  x6      x6..x7
      Step 2:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x4..x7
      Step 3:  x0  x0..x1  x2      x0..x3  x4      x4..x5  x6      x0..x7
      Step 4:  x0  x0..x1  x2      x0..x3  x4      x0..x5  x6      x0..x7
      Step 5:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7

    Kogge-Stone (log(N) = 3 steps for N = 8):
      Input:   x0  x1      x2      x3      x4      x5      x6      x7
      Step 1:  x0  x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7
      Step 2:  x0  x0..x1  x0..x2  x0..x3  x1..x4  x2..x5  x3..x6  x4..x7
      Step 3:  x0  x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7


Work Efficiency

  • Recall: Kogge-Stone
    • log(N) steps
    • O(N*log(N)) operations
  • Brent-Kung
    • Reduction stage:
      • log(N) steps
      • N/2 + N/4 + … + 4 + 2 + 1 = N-1 operations
    • Post-Reduction stage:
      • log(N)-1 steps
      • (2-1) + (4-1) + … + (N/2-1) = (N-2) - (log(N)-1) operations
    • Total:
      • 2*log(N)-1 steps
      • (N-1) + (N-2) - (log(N)-1) = 2*N – log(N) – 2 = O(N) operations
  • Brent-Kung takes more steps but is more work-efficient
  • So which one is faster?


References

  • Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
