1 of 135

Parallel Algorithms

Shehzan Mohammed

CIS 5650 - Fall 2025

2 of 135

Parallel Reduction

3 of 135

Agenda - Parallel Algorithms

  • Parallel Reduction
  • Scan (Naive and Work Efficient)
  • Applications of Scan
    • Stream Compaction
    • Summed Area Tables (SAT)
    • Radix Sort

4 of 135

Parallel Reduction

  • Given an array of numbers, design a parallel algorithm to find the sum.
  • Consider:
    • Arithmetic intensity: the ratio of compute operations to memory accesses

5 of 135

Parallel Reduction

  • Given an array of numbers, design a parallel algorithm to find:
    • The sum
    • The maximum value
    • The product of values
    • The average value
  • How different are these algorithms?

6 of 135

Parallel Reduction

  • Reduction: An operation that computes a single result from a set of data
  • Parallel Reduction: Do it in parallel.

7 of 135

Parallel Reduction

  • Example: Find the sum:

Input: [0 1 2 3 4 5 6 7]

8 of 135

Parallel Reduction

Input:        [0 1 2 3 4 5 6 7]
After pass 1: [0 1 2 5 4 9 6 13]

9 of 135

Parallel Reduction

Input:        [0 1 2 3 4 5 6 7]
After pass 1: [0 1 2 5 4 9 6 13]
After pass 2: [0 1 2 6 4 9 6 22]

10 of 135

Parallel Reduction

Input:        [0 1 2 3 4 5 6 7]
After pass 1: [0 1 2 5 4 9 6 13]
After pass 2: [0 1 2 6 4 9 6 22]
After pass 3: [0 1 2 6 4 9 6 28]   (the sum, 28, is in the last element)

11 of 135

Parallel Reduction

  • Similar to brackets for a basketball tournament
  • log2(n) passes for n elements

12 of 135

Parallel Reduction

  • d = 0, 2^(d+1) = 2
  • 2^(d+1) – 1 = 1
  • 2^d – 1 = 0

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

// In this pass, for k = (0, 2, 4, 6)
// x[k + 1] += x[k];

Array: [0 1 2 3 4 5 6 7] → [0 1 2 5 4 9 6 13]

13 of 135

Parallel Reduction

  • d = 1, 2^(d+1) = 4
  • 2^(d+1) – 1 = 3
  • 2^d – 1 = 1

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

// In this pass, for k = (0, 4)
// x[k + 3] += x[k + 1];

Array: [0 1 2 5 4 9 6 13] → [0 1 2 6 4 9 6 22]

14 of 135

Parallel Reduction

  • d = 2, 2^(d+1) = 8
  • 2^(d+1) – 1 = 7
  • 2^d – 1 = 3

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

// In this pass, for k = (0)
// x[k + 7] += x[k + 3];

Array: [0 1 2 6 4 9 6 22] → [0 1 2 6 4 9 6 28]

15 of 135

Parallel Reduction

  • Note the +=
  • The array is modified in place

Input:       [0 1 2 3 4 5 6 7]
After d = 0: [0 1 2 5 4 9 6 13]
After d = 1: [0 1 2 6 4 9 6 22]
After d = 2: [0 1 2 6 4 9 6 28]

for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1]
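As a sanity check, the passes above can be simulated sequentially; this Python sketch (illustrative, not the CUDA kernel) runs one outer-loop iteration per parallel pass:

```python
def parallel_reduce_sum(x):
    """Simulate the in-place parallel reduction; each d-iteration is one parallel pass."""
    n = len(x)  # assumed to be a power of two
    d = 0
    while (1 << (d + 1)) <= n:
        stride = 1 << (d + 1)  # 2^(d+1)
        # All k-iterations in a pass are independent, so they run in parallel on a GPU
        for k in range(0, n, stride):
            x[k + stride - 1] += x[k + (1 << d) - 1]
        d += 1
    return x[n - 1]  # the total sum ends up in the last element

x = [0, 1, 2, 3, 4, 5, 6, 7]
total = parallel_reduce_sum(x)  # 28; x is now [0, 1, 2, 6, 4, 9, 6, 28]
```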

16 of 135

Scan

17 of 135

All-Prefix-Sums

  • All-Prefix-Sums
  • Input
    • Array of n elements: [a0, a1, a2, …, an-1]
    • Binary associative operator: ⊕
    • Identity: I
  • Outputs the array: [I, a0, (a0 ⊕ a1), (a0 ⊕ a1 ⊕ a2), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

18 of 135

All-Prefix-Sums

  • Example
    • If ⊕ is addition, the array
      • [3 1 7 0 4 1 6 3]
    • is transformed to
      • [0 3 4 11 11 15 16 22]
  • Seems sequential, but there is an efficient parallel solution

19 of 135

Scan

  • Exclusive Scan (prescan): Element j of the result does not include element j of the input:
    • In: [ 3 1 7 0 4 1 6 3]
    • Out: [ 0 3 4 11 11 15 16 22]

  • Inclusive Scan: All elements up to and including j are summed
    • In: [ 3 1 7 0 4 1 6 3]
    • Out: [ 3 4 11 11 15 16 22 25]

20 of 135

Scan

  • How do you generate an exclusive scan from an inclusive scan?
    • Input: [ 3 1 7 0 4 1 6 3]
    • Inclusive: [ 3 4 11 11 15 16 22 25]
    • Exclusive: [ 0 3 4 11 11 15 16 22]
    • // Shift right, insert identity

  • How do you go in the opposite direction?
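Both conversions can be sketched in a few lines of Python (illustrative helpers, not part of the slides' pseudocode): going from inclusive to exclusive shifts right and inserts the identity; going back, each exclusive element plus the corresponding input element gives the inclusive result.

```python
def inclusive_to_exclusive(inc, identity=0):
    # Shift right by one, insert the identity at the front, drop the last element
    return [identity] + inc[:-1]

def exclusive_to_inclusive(exc, inp):
    # exc[k] excludes inp[k], so adding inp[k] back yields the inclusive scan
    return [e + a for e, a in zip(exc, inp)]

inp = [3, 1, 7, 0, 4, 1, 6, 3]
inc = [3, 4, 11, 11, 15, 16, 22, 25]
exc = inclusive_to_exclusive(inc)       # [0, 3, 4, 11, 11, 15, 16, 22]
back = exclusive_to_inclusive(exc, inp) # recovers inc
```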

21 of 135

Scan

  • Design a parallel algorithm for Inclusive Scan
    • In: [ 3 1 7 0 4 1 6 3]
    • Out: [ 3 4 11 11 15 16 22 25]

  • Consider:
    • Total number of additions

22 of 135

Scan

  • Single thread is straightforward

  • n - 1 adds for an array of length n
    • (ignoring array indices)

  • How many adds will our parallel version have?

out[0] = in[0]; // assuming n > 0
for (int k = 1; k < n; ++k)
  out[k] = out[k – 1] + in[k];

23 of 135

Scan

  • Naive Parallel Scan

Image from http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

  • Is this exclusive or inclusive?
  • Each thread
    • Writes one sum
    • Reads two values
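A Python sketch of the naive scan (illustrative, not the CUDA kernel); note the double buffer, since on a GPU an in-place read/write of x across threads would race:

```python
import math

def naive_inclusive_scan(x):
    """Simulate the naive parallel scan; each d-iteration is one parallel pass."""
    n = len(x)  # assumed to be a power of two
    x = list(x)
    for d in range(1, int(math.log2(n)) + 1):
        offset = 1 << (d - 1)  # 2^(d-1)
        nxt = list(x)          # double buffer: read x, write nxt
        for k in range(n):     # all k in parallel on a GPU
            if k >= offset:
                nxt[k] = x[k - offset] + x[k]
        x = nxt
    return x

naive_inclusive_scan([0, 1, 2, 3, 4, 5, 6, 7])  # [0, 1, 3, 6, 10, 15, 21, 28]
```

The result includes each element's own value, which answers the slide's question: this is an inclusive scan.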

24 of 135

Scan

  • Naive Parallel Scan: Input

Input: [0 1 2 3 4 5 6 7]

25 of 135

Scan

  • Naive Parallel Scan: d = 1, 2^(d-1) = 1

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

Input:       [0 1 2 3 4 5 6 7]
After d = 1: [0 1 3 5 7 9 11 13]

  • Each element k >= 1 becomes x[k – 1] + x[k]
  • But remember, it runs in parallel!

35 of 135

Scan

  • Naive Parallel Scan: d = 2, 2^(d-1) = 2

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

  • Consider k = 7:

if (7 >= 2^(2-1))
    x[7] = x[7 - 2^(2-1)] + x[7]   // 22 = 9 + 13

After d = 1: [0 1 3 5 7 9 11 13]
After d = 2: [0 1 3 6 10 14 18 22]

39 of 135

Scan

  • Naive Parallel Scan: d = 3, 2^(d-1) = 4

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

  • Consider k = 7:

if (7 >= 2^(3-1))
    x[7] = x[7 - 2^(3-1)] + x[7]   // 28 = 6 + 22

After d = 2: [0 1 3 6 10 14 18 22]
After d = 3: [0 1 3 6 10 15 21 28]

41 of 135

Scan

  • Naive Parallel Scan: Final

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k – 2^(d-1)] + x[k];

Input:       [0 1 2 3 4 5 6 7]
After d = 1: [0 1 3 5 7 9 11 13]
After d = 2: [0 1 3 6 10 14 18 22]
After d = 3: [0 1 3 6 10 15 21 28]

42 of 135

Naive Parallel Scan

  • Number of adds
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(n log2(n))
  • Algorithmic Complexity
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(log2(n))

43 of 135

Naive Parallel Scan

  • Number of adds
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(n log2(n))
  • Algorithmic Complexity
    • Sequential Scan: O(n)
    • Naive Parallel Scan: O(log2(n))
  • Can we make it faster?

44 of 135

Work-Efficient Parallel Scan

  • Balanced binary tree
    • n leaves → log2(n) levels
    • Each level, d, has 2^d nodes

45 of 135

Work-Efficient Parallel Scan

  • Balanced binary tree
    • n leaves → log2(n) levels
    • Each level, d, has 2^d nodes

Level 0: 1 node
Level 1: 2 nodes
Level 2: 4 nodes
Level 3: 8 nodes

46 of 135

Work-Efficient Parallel Scan

  • Use a balanced binary tree (in concept) to perform Scan in two phases:
    • Up-Sweep (Parallel Reduction)
    • Down-Sweep

47 of 135

Work-Efficient Parallel Scan

  • Up-Sweep

// Same code as our Parallel Reduction
for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

Level 3 (leaves): 0   1   2   3   4   5   6   7
Level 2 sums:       1       5       9       13
Level 1 sums:           6               22
Level 0 (root):                 28

48 of 135

Work-Efficient Parallel Scan

  • Imagine array as a tree
    • Array stores only left child
    • Right child is the element itself

  • For node at index n
    • Left child index = n/2 (rounds down)
    • Right child index = n

After the Up-Sweep the array is [0 1 2 6 4 9 6 28]: even indices keep their original values, the other indices hold partial sums, and the root (28) sits in the last element.

49 of 135

Work-Efficient Parallel Scan

  • Up-Sweep

// Same code as our Parallel Reduction
for d = 0 to log2(n) - 1
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    x[k + 2^(d+1) – 1] += x[k + 2^d – 1];

In-place reduction on the input array:
Input:       [0 1 2 3 4 5 6 7]
After d = 0: [0 1 2 5 4 9 6 13]
After d = 1: [0 1 2 6 4 9 6 22]
After d = 2: [0 1 2 6 4 9 6 28]

50 of 135

Work-Efficient Parallel Scan

  • Down-Sweep
    • “Traverse” back down tree using partial sums to build the scan in place.
      • Set root to zero
      • At each pass, a node passes its value to its left child, and sets the right child to the sum of the previous left child’s value and its value

Tree after the Up-Sweep, stored in place: [0 1 2 6 4 9 6 28]

51 of 135

Work-Efficient Parallel Scan

  • Down-Sweep
    • “Traverse” back down tree using partial sums to build the scan in place.
      • Set root to zero
      • At each pass, a node passes its value to its left child, and sets the right child to the sum of the previous left child’s value and its value

x[n - 1] = 0
for d = log2(n) – 1 to 0
  for all k = 0 to n – 1 by 2^(d+1) in parallel
    t = x[k + 2^d – 1];                   // Save left child
    x[k + 2^d – 1] = x[k + 2^(d+1) – 1];  // Set left child to this node’s value
    x[k + 2^(d+1) – 1] += t;              // Set right child to old left value +
                                          // this node’s value
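The two phases can be simulated in Python (an illustrative sketch of the pseudocode above, with each inner loop standing in for one parallel pass):

```python
import math

def work_efficient_exclusive_scan(x):
    """Up-Sweep (parallel reduction) then Down-Sweep, in place."""
    n = len(x)  # assumed to be a power of two
    levels = int(math.log2(n))
    # Up-Sweep: same as parallel reduction
    for d in range(levels):
        stride = 1 << (d + 1)
        for k in range(0, n, stride):
            x[k + stride - 1] += x[k + (1 << d) - 1]
    # Down-Sweep: set root to zero, then push values back down the tree
    x[n - 1] = 0
    for d in range(levels - 1, -1, -1):
        stride = 1 << (d + 1)
        for k in range(0, n, stride):
            t = x[k + (1 << d) - 1]                   # save left child
            x[k + (1 << d) - 1] = x[k + stride - 1]   # left child gets parent value
            x[k + stride - 1] += t                    # right child gets parent + old left
    return x

work_efficient_exclusive_scan([0, 1, 2, 3, 4, 5, 6, 7])  # [0, 0, 1, 3, 6, 10, 15, 21]
```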

52 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

  • Remember: This is a tree, but stored as linear array

Replace the root (28) with 0: [0 1 2 6 4 9 6 0]

53 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

After the first pass (d = 2): [0 1 2 0 4 9 6 6]

At each level
  • Left child: copy the parent’s value
  • Right child: add the parent’s value to the old left child’s value

Remember to think of this as a tree, not as an array

54 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

After the first pass (d = 2):  [0 1 2 0 4 9 6 6]
After the second pass (d = 1): [0 0 2 1 4 6 6 15]

At each level
  • Left child: copy the parent’s value
  • Right child: add the parent’s value to the old left child’s value

Remember to think of this as a tree, not as an array

55 of 135

Work-Efficient Parallel Scan

  • Down-Sweep

After the first pass (d = 2):  [0 1 2 0 4 9 6 6]
After the second pass (d = 1): [0 0 2 1 4 6 6 15]
After the third pass (d = 0):  [0 0 1 3 6 10 15 21]

The result is the exclusive scan of the input [0 1 2 3 4 5 6 7]

56 of 135

Work-Efficient Parallel Scan

  • Up-Sweep
    • O(n) adds
  • Down-Sweep
    • O(n) adds
    • O(n) swaps
  • This Exclusive Scan can then be converted to an Inclusive Scan

57 of 135

Stream Compaction

58 of 135

Stream Compaction

  • Given an array of elements
    • Create a new array with elements that meet a certain criterion, e.g., non-null
    • Preserve order

Input: [a b c d e f g h]

59 of 135

Stream Compaction

  • Given an array of elements
    • Create a new array with elements that meet a certain criterion, e.g., non-null
    • Preserve order

Input:  [a b c d e f g h]
Output: [a c d g]
60 of 135

Stream Compaction

  • Used in path tracing, collision detection, sparse matrix compression, etc.
  • Can reduce data transferred from GPU to CPU

Input:  [a b c d e f g h]
Output: [a c d g]

61 of 135

Stream Compaction

  • Step 1: Compute temporary array containing
    • 1 if corresponding element meets criteria
    • 0 if element does not meet criteria

Input:     [a b c d e f g h]
Temporary: [1 0 1 1 0 0 1 0]

65 of 135

Stream Compaction

  • Step 1: Compute temporary array containing
    • 1 if corresponding element meets criteria
    • 0 if element does not meet criteria
  • Runs in parallel !!

Input:     [a b c d e f g h]
Temporary: [1 0 1 1 0 0 1 0]

67 of 135

Stream Compaction

  • Step 2: Run exclusive scan on temporary array

Input:     [a b c d e f g h]
Temporary: [1 0 1 1 0 0 1 0]

68 of 135

Stream Compaction

  • Step 2: Run exclusive scan on temporary array
  • Scan also runs in parallel
  • What can we do with the result?

Input:       [a b c d e f g h]
Temporary:   [1 0 1 1 0 0 1 0]
Scan Result: [0 1 1 2 3 3 3 4]

69 of 135

Stream Compaction

  • Step 3: Scatter
    • Result of scan is index into final array
    • Only write an element if temporary array has a 1

Input:       [a b c d e f g h]
Temporary:   [1 0 1 1 0 0 1 0]
Scan Result: [0 1 1 2 3 3 3 4]

Final Array: [a c d g]   (a → index 0, c → 1, d → 2, g → 3)

75 of 135

Stream Compaction

  • Step 3: Scatter – Runs in parallel !!

Input:       [a b c d e f g h]
Temporary:   [1 0 1 1 0 0 1 0]
Scan Result: [0 1 1 2 3 3 3 4]
Final Array: [a c d g]
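The three steps can be sketched in Python (illustrative; the scan here is a sequential stand-in for the parallel scan kernel):

```python
def compact(values, predicate):
    """Flag, exclusive scan, scatter; each step maps to one parallel kernel."""
    flags = [1 if predicate(v) else 0 for v in values]   # Step 1 (parallel map)
    # Step 2: exclusive scan of the flags (sequential stand-in here)
    indices, running = [], 0
    for f in flags:
        indices.append(running)
        running += f
    # Step 3: scatter (parallel; each flagged element writes to its scanned index)
    out = [None] * running
    for i, v in enumerate(values):
        if flags[i] == 1:
            out[indices[i]] = v
    return out

compact(["a", "b", "c", "d", "e", "f", "g", "h"],
        lambda v: v in {"a", "c", "d", "g"})  # ["a", "c", "d", "g"]
```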

77 of 135

Summed Area Table (SAT)

78 of 135

Summed Area Table

  • Summed Area Table (SAT): 2D table where each element stores the sum of all elements in an input image between the lower left corner and the entry location.

79 of 135

Summed Area Table

  • Example

Input image:      SAT:
1 1 0 2           1  2  2  4
1 2 1 0           2  5  6  8
0 1 2 0           2  6  9 11
2 1 0 0           4  9 12 14

(1 + 1 + 0) + (1 + 2 + 1) + (0 + 1 + 2) = 9

80 of 135

Summed Area Table

  • Benefit
    • Used to apply filters of a different width at every pixel in the image in constant time per pixel
    • Just sample four pixels in SAT:

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
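Assuming the SAT is indexed from the top-left corner, as in the worked example in these slides, any axis-aligned region sum takes at most four SAT samples (an illustrative Python sketch):

```python
def region_sum(sat, r0, c0, r1, c1):
    """Sum of the inclusive rectangle [r0..r1] x [c0..c1] from four SAT samples."""
    total = sat[r1][c1]
    if r0 > 0:
        total -= sat[r0 - 1][c1]   # subtract the strip above
    if c0 > 0:
        total -= sat[r1][c0 - 1]   # subtract the strip to the left
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1][c0 - 1]  # add back the doubly subtracted corner
    return total

sat = [[1, 2, 2, 4],
       [2, 5, 6, 8],
       [2, 6, 9, 11],
       [4, 9, 12, 14]]
region_sum(sat, 0, 0, 2, 2)  # 9, the 3x3 sum from the example
region_sum(sat, 1, 1, 2, 2)  # 6
```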

81 of 135

Summed Area Table

  • Uses
    • Approximate depth of field
    • Glossy environment reflections and refractions

Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

82 of 135

Summed Area Table

The SAT is filled in entry by entry (row-major) from the input image:

Input image:      SAT:
1 1 0 2           1  2  2  4
1 2 1 0           2  5  6  8
0 1 2 0           2  6  9 11
2 1 0 0           4  9 12 14

93 of 135

Summed Area Table

How would you implement this on the GPU?

94 of 135

Summed Area Table

How would you implement this on the GPU?

Hint: Inclusive Scan

95 of 135

Summed Area Table

  • Step 1 of 2: Row wise inclusive scan

One inclusive scan for each row

Input image:      Partial SAT:
1 1 0 2           1 2 2 4
1 2 1 0           1 3 4 4
0 1 2 0           0 1 3 3
2 1 0 0           2 3 3 3

96 of 135

Summed Area Table

  • Step 2 of 2: Column wise inclusive scan

One inclusive scan for each column (over the row-scanned partial SAT)

Partial SAT:      SAT:
1 2 2 4           1  2  2  4
1 3 4 4           2  5  6  8
0 1 3 3           2  6  9 11
2 3 3 3           4  9 12 14
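The two steps can be sketched in Python (illustrative; `zip(*...)` transposes the table so the same scan helper covers both rows and columns, and on a GPU every row, then every column, scans in parallel):

```python
def inclusive_scan(row):
    out, running = [], 0
    for v in row:
        running += v
        out.append(running)
    return out

def summed_area_table(img):
    """Row-wise inclusive scans, then column-wise inclusive scans."""
    partial = [inclusive_scan(row) for row in img]          # Step 1: rows
    cols = [inclusive_scan(col) for col in zip(*partial)]   # Step 2: columns
    return [list(row) for row in zip(*cols)]                # transpose back

img = [[1, 1, 0, 2],
       [1, 2, 1, 0],
       [0, 1, 2, 0],
       [2, 1, 0, 0]]
summed_area_table(img)
# [[1, 2, 2, 4], [2, 5, 6, 8], [2, 6, 9, 11], [4, 9, 12, 14]]
```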

97 of 135

Radix Sort

98 of 135

Radix Sort

  • Efficient for small sort keys
    • k-bit keys require k passes

99 of 135

Radix Sort

  • Each radix sort pass partitions its input based on one bit
  • First pass starts with the least significant bit (LSB). Subsequent passes move towards the most significant bit (MSB)

Example key: 0110 (MSB … LSB)

100 of 135

Radix Sort

  • Example input:

Example from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

Input: 100 111 010 110 011 101 001 000

101 of 135

Radix Sort

  • First pass: partition based on LSB

Input:        100 111 010 110 011 101 001 000
After pass 1: 100 010 110 000 | 111 011 101 001
               (LSB == 0)       (LSB == 1)

102 of 135

Radix Sort

  • Second pass: partition based on middle bit

After pass 1: 100 010 110 000 111 011 101 001
After pass 2: 100 000 101 001 | 010 110 111 011
               (Bit == 0)       (Bit == 1)

103 of 135

Radix Sort

  • Final pass: partition based on MSB

After pass 2: 100 000 101 001 010 110 111 011
Final pass:   000 001 010 011 | 100 101 110 111
               (MSB == 0)       (MSB == 1)

104 of 135

Radix Sort

  • Completed

Input:        100 111 010 110 011 101 001 000
After pass 1: 100 010 110 000 111 011 101 001
After pass 2: 100 000 101 001 010 110 111 011
After pass 3: 000 001 010 011 100 101 110 111

105 of 135

Radix Sort

  • Completed

The same sort, with keys shown in decimal:

Input:        4 7 2 6 3 5 1 0
After pass 1: 4 2 6 0 7 3 5 1
After pass 2: 4 0 5 1 2 6 7 3
After pass 3: 0 1 2 3 4 5 6 7

106 of 135

Parallel Radix Sort

  • Where is the parallelism?

107 of 135

Parallel Radix Sort

  1. Break the input array into tiles
    • Each tile fits into shared memory for an SM
  2. Sort tiles in parallel with radix sort
  3. Merge pairs of tiles using a parallel bitonic merge until all tiles are merged.

Our focus is on Step 2

108 of 135

Parallel Radix Sort

  • Where is the parallelism?
    • Each tile is sorted in parallel
    • Where is the parallelism within a tile?

109 of 135

Parallel Radix Sort

  • Where is the parallelism?
    • Each tile is sorted in parallel
    • Where is the parallelism within a tile?
      • Each pass is done in sequence after the previous pass. No parallelism
      • Can we parallelize an individual pass? How?
    • Merge also has parallelism

110 of 135

Parallel Radix Sort

  • Implement split. Given:
    • Array, i, at pass n:

    • Array, b, which is true/false for bit n:

  • Output array with false keys before true keys:

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
Output:  100 010 110 000 111 011 101 001

111 of 135

Parallel Radix Sort

  • Step 1: Compute e array

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)

112 of 135

Parallel Radix Sort

  • Step 2: Exclusive scan e array and store it in f

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)
f array:   0   1   1   2   3   3   3   3   (exclusive scan of e)

113 of 135

Parallel Radix Sort

  • Step 3: Compute totalFalses

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)
f array:   0   1   1   2   3   3   3   3

totalFalses = e[n – 1] + f[n – 1] = 1 + 3 = 4

114 of 135

Parallel Radix Sort

  • Step 4: Compute t array
    • t array is address for writing true keys

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
e array:   1   0   1   1   0   0   0   1   ( = !b)
f array:   0   1   1   2   3   3   3   3

totalFalses = 4
t[i] = i – f[i] + totalFalses
  e.g. t[0] = 0 – 0 + 4 = 4,  t[1] = 1 – 1 + 4 = 4,  t[2] = 2 – 1 + 4 = 5

t array:   4   4   5   5   5   6   7   8

119 of 135

Parallel Radix Sort

  • Step 5: Scatter based on address d

i array: 100 111 010 110 011 101 001 000
b array:   0   1   0   0   1   1   1   0
f array:   0   1   1   2   3   3   3   3
t array:   4   4   5   5   5   6   7   8

d[i] = b[i] ? t[i] : f[i]

d array:   0   4   1   2   5   6   7   3

124 of 135

Parallel Radix Sort

  • Step 5: Scatter based on address d

i array: 100 111 010 110 011 101 001 000
d array:   0   4   1   2   5   6   7   3

output:  100 010 110 000 111 011 101 001
 (index)   0   1   2   3   4   5   6   7

126 of 135

Parallel Radix Sort

  • Given k-bit keys, how do we sort using our new split function?

  • Once each tile is sorted, how do we merge tiles to provide the final sorted array?
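A Python sketch of split following Steps 1 through 5, plus the k-pass loop that answers the first question (illustrative and single-tile only; merging sorted tiles is a separate step, and the scan and scatter here are sequential stand-ins for the parallel kernels):

```python
def split(keys, bit):
    """One radix-sort pass: stable-partition keys so those with `bit` clear come first."""
    n = len(keys)
    b = [(k >> bit) & 1 for k in keys]                 # bit of interest
    e = [1 - bi for bi in b]                           # Step 1: e = !b
    f, running = [], 0                                 # Step 2: exclusive scan of e
    for ei in e:
        f.append(running)
        running += ei
    total_falses = e[n - 1] + f[n - 1]                 # Step 3
    t = [i - f[i] + total_falses for i in range(n)]    # Step 4: true-key addresses
    d = [t[i] if b[i] else f[i] for i in range(n)]     # Step 5: scatter addresses
    out = [0] * n
    for i in range(n):                                 # scatter (parallel on a GPU)
        out[d[i]] = keys[i]
    return out

def radix_sort(keys, bits):
    # k-bit keys need k split passes, LSB to MSB
    for bit in range(bits):
        keys = split(keys, bit)
    return keys

radix_sort([0b100, 0b111, 0b010, 0b110, 0b011, 0b101, 0b001, 0b000], 3)
# [0, 1, 2, 3, 4, 5, 6, 7]
```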

127 of 135

Scan Revisited

128 of 135

Scan Limitations

  • Requires a power-of-two length
  • Executes in one block (unless only using global memory)
    • Length up to twice the number of threads in a block

129 of 135

Scan on Arrays of Arbitrary Length

  1. Divide the array into blocks

Input: [0 1 2 | 3 4 5 | 6 7 8 | 9 10 11]
        Block 0  Block 1  Block 2  Block 3

130 of 135

Scan on Arrays of Arbitrary Length

  2. Run scan on each block
    • Runs in parallel over many SMs

Input:          [0 1 2 | 3 4 5  | 6 7 8   | 9 10 11]
Per-block scan: [0 1 3 | 3 7 12 | 6 13 21 | 9 19 30]

131 of 135

Scan on Arrays of Arbitrary Length

  3. Write total sum of each block into a new array

Per-block scan: [0 1 3 | 3 7 12 | 6 13 21 | 9 19 30]
Block sums:     [3 12 21 30]

132 of 135

Scan on Arrays of Arbitrary Length

  4. Exclusive scan the block sums to compute block increments

Block sums:       [3 12 21 30]
Block increments: [0 3 15 36]

133 of 135

Scan on Arrays of Arbitrary Length

  5. Add each block’s increment to every element in the corresponding block

Per-block scan:   [0 1 3 | 3  7 12 |  6 13 21 |  9 19 30]
Block increments:   +0      +3        +15        +36
Result:           [0 1 3 | 6 10 15 | 21 28 36 | 45 55 66]

134 of 135

Scan on Arrays of Arbitrary Length

  • Non-power-of-two length:
    • Pad the last block with zeros up to the block size
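The five steps can be sketched in Python (illustrative; the per-block and block-sum scans are sequential stand-ins for the parallel kernels, and a short last block behaves as if zero-padded):

```python
def blocked_inclusive_scan(x, block_size):
    """Scan an arbitrary-length array: scan each block, exclusive-scan the
    block sums, then add each block's increment back in."""
    # Steps 1-2: divide into blocks and scan each block (parallel across SMs)
    blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]
    scanned = []
    for blk in blocks:
        out, running = [], 0
        for v in blk:
            running += v
            out.append(running)
        scanned.append(out)
    # Steps 3-4: block sums, then their exclusive scan = block increments
    increments, running = [], 0
    for blk in scanned:
        increments.append(running)
        running += blk[-1]
    # Step 5: add the increment to every element of the corresponding block
    return [v + inc for blk, inc in zip(scanned, increments) for v in blk]

blocked_inclusive_scan(list(range(12)), 3)
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66]
```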

135 of 135

Summary

  • Parallel reduction, scan, and sort are building blocks for many algorithms
  • An understanding of parallel programming and GPU architecture yields efficient GPU implementations