1 of 135

Parallel Algorithms

Shehzan Mohammed

CIS 5650 - Fall 2025

2 of 135

Parallel Reduction

3 of 135

Agenda - Parallel Algorithms

  • Parallel Reduction
  • Scan (Naive and Work Efficient)
  • Applications of Scan
    • Stream Compaction
    • Summed Area Tables (SAT)
    • Radix Sort

4 of 135

Parallel Reduction

  • Given an array of numbers, design a parallel algorithm to find the sum.
  • Consider:
    • Arithmetic intensity: the ratio of compute operations to memory accesses

5 of 135

Parallel Reduction

  • Given an array of numbers, design a parallel algorithm to find:
    • The sum
    • The maximum value
    • The product of values
    • The average value
  • How different are these algorithms?

6 of 135

Parallel Reduction

  • Reduction: An operation that computes a single result from a set of data
  • Parallel Reduction: Do it in parallel.

7 of 135

Parallel Reduction

  • Example: Find the sum:

Input: [0 1 2 3 4 5 6 7]

8 of 135

Parallel Reduction

Input:        [0 1 2 3 4 5 6 7]
After pass 1: [0 1 2 5 4 9 6 13]

9 of 135

Parallel Reduction

Input:        [0 1 2 3 4 5 6 7]
After pass 1: [0 1 2 5 4 9 6 13]
After pass 2: [0 1 2 6 4 9 6 22]

10 of 135

Parallel Reduction

Input:        [0 1 2 3 4 5 6 7]
After pass 1: [0 1 2 5 4 9 6 13]
After pass 2: [0 1 2 6 4 9 6 22]
After pass 3: [0 1 2 6 4 9 6 28]   (the sum, 28, is in the last element)

11 of 135

Parallel Reduction

  • Similar to brackets for a basketball tournament
  • log2(n) passes for n elements

12 of 135

Parallel Reduction

  • d = 0, 2^(d+1) = 2
  • 2^(d+1) – 1 = 1
  • 2^d – 1 = 0

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

// In this pass, for k = (0, 2, 4, 6)
// x[k + 1] += x[k];

Array: [0 1 2 3 4 5 6 7] → [0 1 2 5 4 9 6 13]

13 of 135

Parallel Reduction

  • d = 1, 2^(d+1) = 4
  • 2^(d+1) – 1 = 3
  • 2^d – 1 = 1

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

// In this pass, for k = (0, 4)
// x[k + 3] += x[k + 1];

Array: [0 1 2 5 4 9 6 13] → [0 1 2 6 4 9 6 22]

14 of 135

Parallel Reduction

  • d = 2, 2^(d+1) = 8
  • 2^(d+1) – 1 = 7
  • 2^d – 1 = 3

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

// In this pass, for k = (0)
// x[k + 7] += x[k + 3];

Array: [0 1 2 6 4 9 6 22] → [0 1 2 6 4 9 6 28]

15 of 135

Parallel Reduction

  • Note the +=
  • The array is modified in place

Input:       [0 1 2 3 4 5 6 7]
After d = 0: [0 1 2 5 4 9 6 13]
After d = 1: [0 1 2 6 4 9 6 22]
After d = 2: [0 1 2 6 4 9 6 28]

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1]
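As a sanity check, the passes above can be simulated sequentially; this Python sketch (illustrative, not the CUDA kernel) runs one outer-loop iteration per parallel pass:

```python
def parallel_reduce_sum(x):
    """Simulate the in-place parallel reduction; each d-iteration is one parallel pass."""
    n = len(x)  # assumed to be a power of two
    d = 0
    while (1 << (d + 1)) <= n:
        stride = 1 << (d + 1)  # 2^(d+1)
        # All k-iterations in a pass are independent, so they run in parallel on a GPU
        for k in range(0, n, stride):
            x[k + stride - 1] += x[k + (1 << d) - 1]
        d += 1
    return x[n - 1]  # the total sum ends up in the last element

x = [0, 1, 2, 3, 4, 5, 6, 7]
total = parallel_reduce_sum(x)  # 28; x is now [0, 1, 2, 6, 4, 9, 6, 28]
```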

16 of 135

Scan

17 of 135

All-Prefix-Sums

  • All-Prefix-Sums
  • Input
    • Array of n elements: [a0, a1, a2, …, an-1]
    • Binary associative operator: ⊕
    • Identity: I
  • Outputs the array: [I, a0, (a0 ⊕ a1), (a0 ⊕ a1 ⊕ a2), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

18 of 135

All-Prefix-Sums

  • Example
    • If ⊕ is addition, the array
      • [3 1 7 0 4 1 6 3]
    • is transformed to
      • [0 3 4 11 11 15 16 22]
  • Seems sequential, but there is an efficient parallel solution

19 of 135

Scan

  • Exclusive Scan (prescan): Element j of the result does not include element j of the input:
    • In: [ 3 1 7 0 4 1 6 3]
    • Out: [ 0 3 4 11 11 15 16 22]

  • Inclusive Scan: All elements up to and including j are summed
    • In: [ 3 1 7 0 4 1 6 3]
    • Out: [ 3 4 11 11 15 16 22 25]

20 of 135

Scan

  • How do you generate an exclusive scan from an inclusive scan?
    • Input: [ 3 1 7 0 4 1 6 3]
    • Inclusive: [ 3 4 11 11 15 16 22 25]
    • Exclusive: [ 0 3 4 11 11 15 16 22]
    • // Shift right, insert identity

  • How do you go in the opposite direction?
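Both conversions can be sketched in a few lines of Python (illustrative helpers, not part of the slides' pseudocode): going from inclusive to exclusive shifts right and inserts the identity; going back, each exclusive element plus the corresponding input element gives the inclusive result.

```python
def inclusive_to_exclusive(inc, identity=0):
    # Shift right by one, insert the identity at the front, drop the last element
    return [identity] + inc[:-1]

def exclusive_to_inclusive(exc, inp):
    # exc[k] excludes inp[k], so adding inp[k] back yields the inclusive scan
    return [e + a for e, a in zip(exc, inp)]

inp = [3, 1, 7, 0, 4, 1, 6, 3]
inc = [3, 4, 11, 11, 15, 16, 22, 25]
exc = inclusive_to_exclusive(inc)       # [0, 3, 4, 11, 11, 15, 16, 22]
back = exclusive_to_inclusive(exc, inp) # recovers inc
```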

21 of 135

Scan

  • Design a parallel algorithm for Inclusive Scan
    • In: [ 3 1 7 0 4 1 6 3]
    • Out: [ 3 4 11 11 15 16 22 25]

  • Consider:
    • Total number of additions

22 of 135

Scan

  • Single thread is straightforward

  • n - 1 adds for an array of length n
    • (ignoring array indices)

  • How many adds will our parallel version have?

out[0] = in[0]; // assuming n > 0
for (int k = 1; k < n; ++k)
  out[k] = out[k – 1] + in[k];

23 of 135

Scan

  • Naive Parallel Scan

Image from http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

  • Is this exclusive or inclusive?
  • Each thread
    • Writes one sum
    • Reads two values
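A Python sketch of the naive scan (illustrative, not the CUDA kernel); note the double buffer, since on a GPU an in-place read/write of x across threads would race:

```python
import math

def naive_inclusive_scan(x):
    """Simulate the naive parallel scan; each d-iteration is one parallel pass."""
    n = len(x)  # assumed to be a power of two
    x = list(x)
    for d in range(1, int(math.log2(n)) + 1):
        offset = 1 << (d - 1)  # 2^(d-1)
        nxt = list(x)          # double buffer: read x, write nxt
        for k in range(n):     # all k in parallel on a GPU
            if k >= offset:
                nxt[k] = x[k - offset] + x[k]
        x = nxt
    return x

naive_inclusive_scan([0, 1, 2, 3, 4, 5, 6, 7])  # [0, 1, 3, 6, 10, 15, 21, 28]
```

The result includes each element's own value, which answers the slide's question: this is an inclusive scan.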

24 of 135

Scan

  • Naive Parallel Scan: Input

Input: [0 1 2 3 4 5 6 7]

25 of 135

Scan

  • Naive Parallel Scan: d = 1, 2^(d-1) = 1

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

Input:       [0 1 2 3 4 5 6 7]
After d = 1: [0 1 3 5 7 9 11 13]

  • Each element k >= 1 becomes x[k – 1] + x[k]
  • But remember, it runs in parallel!

35 of 135

Scan

  • Naive Parallel Scan: d = 2, 2^(d-1) = 2

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

  • Consider k = 7:

if (7 >= 2^(2-1))
    x[7] = x[7 - 2^(2-1)] + x[7]   // 22 = 9 + 13

After d = 1: [0 1 3 5 7 9 11 13]
After d = 2: [0 1 3 6 10 14 18 22]

39 of 135

Scan

  • Naive Parallel Scan: d = 3, 2^(d-1) = 4

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

  • Consider k = 7:

if (7 >= 2^(3-1))
    x[7] = x[7 - 2^(3-1)] + x[7]   // 28 = 6 + 22

After d = 2: [0 1 3 6 10 14 18 22]
After d = 3: [0 1 3 6 10 15 21 28]

41 of 135

Scan

  • Naive Parallel Scan: Final

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

Input:       [0 1 2 3 4 5 6 7]
After d = 1: [0 1 3 5 7 9 11 13]
After d = 2: [0 1 3 6 10 14 18 22]
After d = 3: [0 1 3 6 10 15 21 28]

42 of 135

Naive Parallel Scan

  • Number of adds
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(n log2(n))
  • Algorithmic Complexity
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(log2(n))

43 of 135

Naive Parallel Scan

  • Number of adds
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(n log2(n))
  • Algorithmic Complexity
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(log2(n))
  • Can we make it faster?

44 of 135

Work-Efficient Parallel Scan

  • Balanced binary tree
    • n leaves → log2(n) levels
    • Each level, d, has 2^d nodes

45 of 135

Work-Efficient Parallel Scan

  • Balanced binary tree
    • n leaves → log2(n) levels
    • Each level, d, has 2^d nodes

Level 0: 1 node
Level 1: 2 nodes
Level 2: 4 nodes
Level 3: 8 nodes

46 of 135

Work-Efficient Parallel Scan

  • Use a balanced binary tree (in concept) to perform Scan in two phases:
    • Up-Sweep (Parallel Reduction)
    • Down-Sweep

47 of 135

Work-Efficient Parallel Scan

  • Up-Sweep

// Same code as our Parallel Reduction
for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

Level 3 (leaves): 0   1   2   3   4   5   6   7
Level 2 sums:       1       5       9       13
Level 1 sums:           6               22
Level 0 (root):                 28

48 of 135

Work-Efficient Parallel Scan

  • Imagine array as a tree
    • Array stores only left child
    • Right child is the element itself

  • For node at index n
    • Left child index = n/2 (rounds down)
    • Right child index = n

After the Up-Sweep the array is [0 1 2 6 4 9 6 28]: even indices keep their original values, the other indices hold partial sums, and the root (28) sits in the last element.

49 of 135

Work-Efficient Parallel Scan

  • Up-Sweep

// Same code as our Parallel Reduction
for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

In-place reduction on the input array:
Input:       [0 1 2 3 4 5 6 7]
After d = 0: [0 1 2 5 4 9 6 13]
After d = 1: [0 1 2 6 4 9 6 22]
After d = 2: [0 1 2 6 4 9 6 28]

50 of 135

Work-Efficient Parallel Scan

  • Down-Sweep
    • “Traverse” back down tree using partial sums to build the scan in place.
      • Set root to zero
      • At each pass, a node passes its value to its left child, and sets the right child to the sum of the previous left child’s value and its value

Tree after the Up-Sweep, stored in place: [0 1 2 6 4 9 6 28]

51 of 135

Work-Efficient Parallel Scan

  • Down-Sweep
    • “Traverse” back down tree using partial sums to build the scan in place.
      • Set root to zero
      • At each pass, a node passes its value to its left child, and sets the right child to the sum of the previous left child’s value and its value

x[n - 1] = 0
for d = log2(n) – 1 to 0
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    t = x[k + 2^d – 1];                   // Save left child
    x[k + 2^d – 1] = x[k + 2^(d+1) – 1];  // Set left child to this node’s value
    x[k + 2^(d+1) – 1] += t;              // Set right child to old left value +
                                          // this node’s value
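The two phases can be simulated in Python (an illustrative sketch of the pseudocode above, with each inner loop standing in for one parallel pass):

```python
import math

def work_efficient_exclusive_scan(x):
    """Up-Sweep (parallel reduction) then Down-Sweep, in place."""
    n = len(x)  # assumed to be a power of two
    levels = int(math.log2(n))
    # Up-Sweep: same as parallel reduction
    for d in range(levels):
        stride = 1 << (d + 1)
        for k in range(0, n, stride):
            x[k + stride - 1] += x[k + (1 << d) - 1]
    # Down-Sweep: set root to zero, then push values back down the tree
    x[n - 1] = 0
    for d in range(levels - 1, -1, -1):
        stride = 1 << (d + 1)
        for k in range(0, n, stride):
            t = x[k + (1 << d) - 1]                   # save left child
            x[k + (1 << d) - 1] = x[k + stride - 1]   # left child gets parent value
            x[k + stride - 1] += t                    # right child gets parent + old left
    return x

work_efficient_exclusive_scan([0, 1, 2, 3, 4, 5, 6, 7])  # [0, 0, 1, 3, 6, 10, 15, 21]
```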

52 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

  • Remember: This is a tree, but stored as linear array

Replace the root (28) with 0: [0 1 2 6 4 9 6 0]

53 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

After the first pass (d = 2): [0 1 2 0 4 9 6 6]

At each level
  • Left child: copy the parent’s value
  • Right child: add the parent’s value to the old left child’s value

Remember to think of this as a tree, not as an array

54 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

After the first pass (d = 2):  [0 1 2 0 4 9 6 6]
After the second pass (d = 1): [0 0 2 1 4 6 6 15]

At each level
  • Left child: copy the parent’s value
  • Right child: add the parent’s value to the old left child’s value

Remember to think of this as a tree, not as an array

55 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

After the first pass (d = 2):  [0 1 2 0 4 9 6 6]
After the second pass (d = 1): [0 0 2 1 4 6 6 15]
After the third pass (d = 0):  [0 0 1 3 6 10 15 21]

The result is the exclusive scan of the input [0 1 2 3 4 5 6 7]

56 of 135

Work-Efficient Parallel Scan

  • Up-Sweep
    • O(n) adds
  • Down-Sweep
    • O(n) adds
    • O(n) swaps
  • This Exclusive Scan can then be converted to an Inclusive Scan

57 of 135

Stream Compaction

58 of 135

Stream Compaction

  • Given an array of elements
    • Create a new array with elements that meet a certain criterion, e.g., non-null
    • Preserve order

Input: [a b c d e f g h]

59 of 135

Stream Compaction

  • Given an array of elements
    • Create a new array with elements that meet a certain criterion, e.g., non-null
    • Preserve order

Input:  [a b c d e f g h]
Output: [a c d g]
60 of 135

Stream Compaction

  • Used in path tracing, collision detection, sparse matrix compression, etc.
  • Can reduce data transferred from GPU to CPU

Input:  [a b c d e f g h]
Output: [a c d g]

61 of 135

Stream Compaction

  • Step 1: Compute temporary array containing
    • 1 if corresponding element meets criteria
    • 0 if element does not meet criteria

Input:     [a b c d e f g h]
Temporary: [1 0 1 1 0 0 1 0]

65 of 135

Stream Compaction

  • Step 1: Compute temporary array containing
    • 1 if corresponding element meets criteria
    • 0 if element does not meet criteria
  • Runs in parallel !!

Input:     [a b c d e f g h]
Temporary: [1 0 1 1 0 0 1 0]

67 of 135

Stream Compaction

  • Step 2: Run exclusive scan on temporary array

Input:     [a b c d e f g h]
Temporary: [1 0 1 1 0 0 1 0]

68 of 135

Stream Compaction

  • Step 2: Run exclusive scan on temporary array
  • Scan also runs in parallel
  • What can we do with the result?

Input:       [a b c d e f g h]
Temporary:   [1 0 1 1 0 0 1 0]
Scan Result: [0 1 1 2 3 3 3 4]

69 of 135

Stream Compaction

  • Step 3: Scatter
    • Result of scan is index into final array
    • Only write an element if temporary array has a 1

Input:       [a b c d e f g h]
Temporary:   [1 0 1 1 0 0 1 0]
Scan Result: [0 1 1 2 3 3 3 4]

Final Array: [a c d g]   (a → index 0, c → 1, d → 2, g → 3)

75 of 135

Stream Compaction

  • Step 3: Scatter – Runs in parallel !!

Input:       [a b c d e f g h]
Temporary:   [1 0 1 1 0 0 1 0]
Scan Result: [0 1 1 2 3 3 3 4]
Final Array: [a c d g]
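The three steps can be sketched in Python (illustrative; the scan here is a sequential stand-in for the parallel scan kernel):

```python
def compact(values, predicate):
    """Flag, exclusive scan, scatter; each step maps to one parallel kernel."""
    flags = [1 if predicate(v) else 0 for v in values]   # Step 1 (parallel map)
    # Step 2: exclusive scan of the flags (sequential stand-in here)
    indices, running = [], 0
    for f in flags:
        indices.append(running)
        running += f
    # Step 3: scatter (parallel; each flagged element writes to its scanned index)
    out = [None] * running
    for i, v in enumerate(values):
        if flags[i] == 1:
            out[indices[i]] = v
    return out

compact(["a", "b", "c", "d", "e", "f", "g", "h"],
        lambda v: v in {"a", "c", "d", "g"})  # ["a", "c", "d", "g"]
```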

77 of 135

Summed Area Table (SAT)

78 of 135

Summed Area Table

  • Summed Area Table (SAT): 2D table where each element stores the sum of all elements in an input image between the lower left corner and the entry location.

79 of 135

Summed Area Table

  • Example

Input image:      SAT:
1 1 0 2           1  2  2  4
1 2 1 0           2  5  6  8
0 1 2 0           2  6  9 11
2 1 0 0           4  9 12 14

(1 + 1 + 0) + (1 + 2 + 1) + (0 + 1 + 2) = 9

80 of 135

Summed Area Table

  • Benefit
    • Used to apply filters of a different width at every pixel in the image in constant time per pixel
    • Just sample four pixels in SAT:

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
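Assuming the SAT is indexed from the top-left corner, as in the worked example in these slides, any axis-aligned region sum takes at most four SAT samples (an illustrative Python sketch):

```python
def region_sum(sat, r0, c0, r1, c1):
    """Sum of the inclusive rectangle [r0..r1] x [c0..c1] from four SAT samples."""
    total = sat[r1][c1]
    if r0 > 0:
        total -= sat[r0 - 1][c1]   # subtract the strip above
    if c0 > 0:
        total -= sat[r1][c0 - 1]   # subtract the strip to the left
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1][c0 - 1]  # add back the doubly subtracted corner
    return total

sat = [[1, 2, 2, 4],
       [2, 5, 6, 8],
       [2, 6, 9, 11],
       [4, 9, 12, 14]]
region_sum(sat, 0, 0, 2, 2)  # 9, the 3x3 sum from the example
region_sum(sat, 1, 1, 2, 2)  # 6
```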

81 of 135

Summed Area Table

  • Uses
    • Approximate depth of field
    • Glossy environment reflections and refractions

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

82 of 135

Summed Area Table

The SAT is filled in entry by entry (row-major) from the input image:

Input image:      SAT:
1 1 0 2           1  2  2  4
1 2 1 0           2  5  6  8
0 1 2 0           2  6  9 11
2 1 0 0           4  9 12 14

93 of 135

Summed Area Table

How would you implement this on the GPU?

94 of 135

Summed Area Table

How would you implement this on the GPU?

Hint: Inclusive Scan

95 of 135

Summed Area Table

  • Step 1 of 2: Row wise inclusive scan

One inclusive scan for each row

Input image:      Partial SAT:
1 1 0 2           1 2 2 4
1 2 1 0           1 3 4 4
0 1 2 0           0 1 3 3
2 1 0 0           2 3 3 3

96 of 135

Summed Area Table

  • Step 2 of 2: Column wise inclusive scan

One inclusive scan for each column (over the row-scanned partial SAT)

Partial SAT:      SAT:
1 2 2 4           1  2  2  4
1 3 4 4           2  5  6  8
0 1 3 3           2  6  9 11
2 3 3 3           4  9 12 14
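The two steps can be sketched in Python (illustrative; `zip(*...)` transposes the table so the same scan helper covers both rows and columns, and on a GPU every row, then every column, scans in parallel):

```python
def inclusive_scan(row):
    out, running = [], 0
    for v in row:
        running += v
        out.append(running)
    return out

def summed_area_table(img):
    """Row-wise inclusive scans, then column-wise inclusive scans."""
    partial = [inclusive_scan(row) for row in img]          # Step 1: rows
    cols = [inclusive_scan(col) for col in zip(*partial)]   # Step 2: columns
    return [list(row) for row in zip(*cols)]                # transpose back

img = [[1, 1, 0, 2],
       [1, 2, 1, 0],
       [0, 1, 2, 0],
       [2, 1, 0, 0]]
summed_area_table(img)
# [[1, 2, 2, 4], [2, 5, 6, 8], [2, 6, 9, 11], [4, 9, 12, 14]]
```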

97 of 135

Radix Sort

98 of 135

Radix Sort

  • Efficient for small sort keys
    • k-bit keys require k passes

99 of 135

Radix Sort

  • Each radix sort pass partitions its input based on one bit
  • First pass starts with the least significant bit (LSB). Subsequent passes move towards the most significant bit (MSB)

Example key: 0110 (MSB … LSB)

100 of 135

Radix Sort

  • Example input:

Example from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

Input: 100 111 010 110 011 101 001 000

101 of 135

Radix Sort

  • First pass: partition based on LSB

Input:        100 111 010 110 011 101 001 000
After pass 1: 100 010 110 000 | 111 011 101 001
               (LSB == 0)       (LSB == 1)

102 of 135

Radix Sort

  • Second pass: partition based on middle bit

After pass 1: 100 010 110 000 111 011 101 001
After pass 2: 100 000 101 001 | 010 110 111 011
               (Bit == 0)       (Bit == 1)

103 of 135

Radix Sort

  • Final pass: partition based on MSB

After pass 2: 100 000 101 001 010 110 111 011
Final pass:   000 001 010 011 | 100 101 110 111
               (MSB == 0)       (MSB == 1)

104 of 135

Radix Sort

  • Completed

Input:        100 111 010 110 011 101 001 000
After pass 1: 100 010 110 000 111 011 101 001
After pass 2: 100 000 101 001 010 110 111 011
After pass 3: 000 001 010 011 100 101 110 111

105 of 135

Radix Sort

  • Completed

The same sort, with keys shown in decimal:

Input:        4 7 2 6 3 5 1 0
After pass 1: 4 2 6 0 7 3 5 1
After pass 2: 4 0 5 1 2 6 7 3
After pass 3: 0 1 2 3 4 5 6 7

106 of 135

Parallel Radix Sort

  • Where is the parallelism?

107 of 135

Parallel Radix Sort

  1. Break the input array into tiles
    • Each tile fits into shared memory for an SM
  2. Sort tiles in parallel with radix sort
  3. Merge pairs of tiles using a parallel bitonic merge until all tiles are merged.

Our focus is on Step 2

108 of 135

Parallel Radix Sort

  • Where is the parallelism?
    • Each tile is sorted in parallel
    • Where is the parallelism within a tile?

109 of 135

Parallel Radix Sort

  • Where is the parallelism?
    • Each tile is sorted in parallel
    • Where is the parallelism within a tile?
      • Each pass is done in sequence after the previous pass. No parallelism
      • Can we parallelize an individual pass? How?
    • Merge also has parallelism

110 of 135

Parallel Radix Sort

  • Implement split. Given:
    • Array, i, at pass n:

    • Array, b, which is true/false for bit n:

  • Output array with false keys before true keys:

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
Output:  100 010 110 000 111 011 101 001

111 of 135

Parallel Radix Sort

  • Step 1: Compute e array

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)

112 of 135

Parallel Radix Sort

  • Step 2: Exclusive scan e array and store it in f

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)
f array:   0   1   1   2   3   3   3   3   (exclusive scan of e)

113 of 135

Parallel Radix Sort

  • Step 3: Compute totalFalses

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)
f array:   0   1   1   2   3   3   3   3

totalFalses = e[n – 1] + f[n – 1] = 1 + 3 = 4

114 of 135

Parallel Radix Sort

  • Step 4: Compute t array
    • t array is address for writing true keys

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)
f array:   0   1   1   2   3   3   3   3

totalFalses = 4
t[i] = i – f[i] + totalFalses
  e.g. t[0] = 0 – 0 + 4 = 4,  t[1] = 1 – 1 + 4 = 4,  t[2] = 2 – 1 + 4 = 5

t array:   4   4   5   5   5   6   7   8

119 of 135

Parallel Radix Sort

  • Step 5: Scatter based on address d

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
f array:   0   1   1   2   3   3   3   3
t array:   4   4   5   5   5   6   7   8

d[i] = b[i] ? t[i] : f[i]

d array:   0   4   1   2   5   6   7   3

124 of 135

Parallel Radix Sort

  • Step 5: Scatter based on address d

i array: 100 111 010 110 011 101 001 000
d array:   0   4   1   2   5   6   7   3

output:  100 010 110 000 111 011 101 001
 (index)   0   1   2   3   4   5   6   7

126 of 135

Parallel Radix Sort

  • Given k-bit keys, how do we sort using our new split function?

  • Once each tile is sorted, how do we merge tiles to provide the final sorted array?
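A Python sketch of split following Steps 1 through 5, plus the k-pass loop that answers the first question (illustrative and single-tile only; merging sorted tiles is a separate step, and the scan and scatter here are sequential stand-ins for the parallel kernels):

```python
def split(keys, bit):
    """One radix-sort pass: stable-partition keys so those with `bit` clear come first."""
    n = len(keys)
    b = [(k >> bit) & 1 for k in keys]                 # bit of interest
    e = [1 - bi for bi in b]                           # Step 1: e = !b
    f, running = [], 0                                 # Step 2: exclusive scan of e
    for ei in e:
        f.append(running)
        running += ei
    total_falses = e[n - 1] + f[n - 1]                 # Step 3
    t = [i - f[i] + total_falses for i in range(n)]    # Step 4: true-key addresses
    d = [t[i] if b[i] else f[i] for i in range(n)]     # Step 5: scatter addresses
    out = [0] * n
    for i in range(n):                                 # scatter (parallel on a GPU)
        out[d[i]] = keys[i]
    return out

def radix_sort(keys, bits):
    # k-bit keys need k split passes, LSB to MSB
    for bit in range(bits):
        keys = split(keys, bit)
    return keys

radix_sort([0b100, 0b111, 0b010, 0b110, 0b011, 0b101, 0b001, 0b000], 3)
# [0, 1, 2, 3, 4, 5, 6, 7]
```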

127 of 135

Scan Revisited

128 of 135

Scan Limitations

  • Requires a power-of-two length
  • Executes in one block (unless only using global memory)
    • Length up to twice the number of threads in a block

129 of 135

Scan on Arrays of Arbitrary Length

  1. Divide the array into blocks

Input: [0 1 2 | 3 4 5 | 6 7 8 | 9 10 11]
        Block 0  Block 1  Block 2  Block 3

130 of 135

Scan on Arrays of Arbitrary Length

  2. Run scan on each block
    • Runs in parallel over many SMs

Input:          [0 1 2 | 3 4 5  | 6 7 8   | 9 10 11]
Per-block scan: [0 1 3 | 3 7 12 | 6 13 21 | 9 19 30]

131 of 135

Scan on Arrays of Arbitrary Length

  3. Write total sum of each block into a new array

Per-block scan: [0 1 3 | 3 7 12 | 6 13 21 | 9 19 30]
Block sums:     [3 12 21 30]

132 of 135

Scan on Arrays of Arbitrary Length

  4. Exclusive scan the block sums to compute block increments

Block sums:       [3 12 21 30]
Block increments: [0 3 15 36]

133 of 135

Scan on Arrays of Arbitrary Length

  5. Add each block’s increment to every element in the corresponding block

Per-block scan:   [0 1 3 | 3  7 12 |  6 13 21 |  9 19 30]
Block increments:   +0      +3        +15        +36
Result:           [0 1 3 | 6 10 15 | 21 28 36 | 45 55 66]

134 of 135

Scan on Arrays of Arbitrary Length

  • Non-power-of-two length:
    • Pad the last block with zeros up to the block size
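The five steps can be sketched in Python (illustrative; the per-block and block-sum scans are sequential stand-ins for the parallel kernels, and a short last block behaves as if zero-padded):

```python
def blocked_inclusive_scan(x, block_size):
    """Scan an arbitrary-length array: scan each block, exclusive-scan the
    block sums, then add each block's increment back in."""
    # Steps 1-2: divide into blocks and scan each block (parallel across SMs)
    blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]
    scanned = []
    for blk in blocks:
        out, running = [], 0
        for v in blk:
            running += v
            out.append(running)
        scanned.append(out)
    # Steps 3-4: block sums, then their exclusive scan = block increments
    increments, running = [], 0
    for blk in scanned:
        increments.append(running)
        running += blk[-1]
    # Step 5: add the increment to every element of the corresponding block
    return [v + inc for blk, inc in zip(scanned, increments) for v in blk]

blocked_inclusive_scan(list(range(12)), 3)
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66]
```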

135 of 135

Summary

  • Parallel reduction, scan, and sort are building blocks for many algorithms
  • An understanding of parallel programming and GPU architecture yields efficient GPU implementations