Parallel Algorithms
Shehzan Mohammed
CIS 5650 - Fall 2025
Parallel Reduction
Agenda - Parallel Algorithms
Parallel Reduction
Parallel Reduction
Parallel Reduction
Parallel Reduction
0
1
5
2
3
4
6
7
Parallel Reduction
1
9
5
13
0
1
5
2
3
4
6
7
Parallel Reduction
6
22
1
9
5
13
0
1
5
2
3
4
6
7
Parallel Reduction
28
6
22
1
9
5
13
0
1
5
2
3
4
6
7
Parallel Reduction
Parallel Reduction
for d = 0 to log2n - 1
for all k = 0 to n – 1 by 2d+1 in parallel
x[k + 2d+1 – 1] += x[k + 2d – 1];
// In this pass, for k = (0, 2, 4, 6)
// x[k + 1] += x[k];
1
9
5
13
0
1
5
2
3
4
6
7
Parallel Reduction
6
22
1
9
5
13
0
1
5
2
3
4
6
7
for d = 0 to log2n - 1
for all k = 0 to n – 1 by 2d+1 in parallel
x[k + 2d+1 – 1] += x[k + 2d – 1];
// In this pass, for k = (0, 4)
// x[k + 3] += x[k + 1];
Parallel Reduction
28
6
22
1
9
5
13
0
1
5
2
3
4
6
7
for d = 0 to log2n - 1
for all k = 0 to n – 1 by 2d+1 in parallel
x[k + 2d+1 – 1] += x[k + 2d – 1];
// In this pass, for k = (0)
// x[k + 7] += x[k + 3];
Parallel Reduction
28
1
9
2
6
4
6
0
6
22
1
9
5
13
0
1
5
2
3
4
6
7
for d = 0 to log2n - 1
for all k = 0 to n – 1 by 2d+1 in parallel
x[k + 2d+1 – 1] += x[k + 2d – 1]
Scan
All-Prefix-Sums
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
All-Prefix-Sums
Scan
Scan
Scan
Scan
out[0] = in[0]; // assuming n > 0
for (int k = 1; k < n; ++k)
out[k] = out[k – 1] + in[k];
Scan
Image from http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
Scan
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
3
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
3
5
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
3
5
7
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
11
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
After d=1
Scan
if (7 >= 2^2-1)
x[7] = x[7 - 2^2-1] + x[7]
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
22
After d=1
Scan
if (7 >= 2^(2-1))
x[7] = x[7 - 2^(2-1)] + x[7]
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
After d=1
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
0
1
14
3
6
10
18
22
After d=2
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
After d=1
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
0
1
14
3
6
10
18
22
After d=2
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
After d=1
After d=2
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
0
1
14
3
6
10
18
22
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
0
1
14
3
6
10
18
22
28
if (7 >= 2^(3-1))
x[7] = x[7 - 2^(3-1)] + x[7]
Scan
for d = 1 to log2n
for all k in parallel
if (k >= 2d-1)
x[k] = x[k – 2d-1] + x[k];
After d=1
After d=2
0
1
9
3
5
7
11
13
0
1
5
2
3
4
6
7
0
1
14
3
6
10
18
22
0
1
15
3
6
10
21
28
After d=3
Naive Parallel Scan
Naive Parallel Scan
Work-Efficient Parallel Scan
Work-Efficient Parallel Scan
Level 0
Level 1
Level 2
Level 3
1 node
2 nodes
4 nodes
8 nodes
Work-Efficient Parallel Scan
Work-Efficient Parallel Scan
// Same code as our Parallel Reduction
for d = 0 to log2n - 1
for all k = 0 to n – 1 by 2d+1 in parallel
x[k + 2d+1 – 1] += x[k + 2d – 1];
0
1
5
2
3
4
6
7
1
9
5
13
6
22
28
Level 0
Level 1
Level 2
Level 3
Work-Efficient Parallel Scan
0
2
4
6
1
9
6
28
28
0
2
4
6
1
9
6
Work-Efficient Parallel Scan
// Same code as our Parallel Reduction
for d = 0 to log2n - 1
for all k = 0 to n – 1 by 2d+1 in parallel
x[k + 2d+1 – 1] += x[k + 2d – 1];
0
1
5
2
3
4
6
7
1
9
5
13
6
22
28
After d=2
After d=1
After d=0
0
2
4
6
1
9
6
In-place reduction
On input array
Work-Efficient Parallel Scan
0
1
5
2
3
4
6
7
1
9
5
13
6
22
28
0
2
4
6
1
9
6
Work-Efficient Parallel Scan
x[n - 1] = 0
for d = log2n – 1 to 0
for all k = 0 to n – 1 by 2d+1 in parallel
t = x[k + 2d – 1]; // Save left child
x[k + 2d – 1] = x[k + 2d+1 – 1]; // Set left child to this node’s value
x[k + 2d+1 – 1] += t; // Set right child to old left value +
// this node’s value
Work-Efficient Parallel Scan
0
28
0
2
4
6
1
9
6
Replace with 0
Work-Efficient Parallel Scan
1
9
0
6
6
0
28
0
2
4
6
1
9
6
Replace with 0
After d=0
At each level
Remember to think of this as a tree, not as array
Orange Arrow = Copy
Green Arrow + Black Arrow = Add
Work-Efficient Parallel Scan
0
0
6
2
1
4
6
15
1
9
0
6
6
0
28
0
2
4
6
1
9
6
Replace with 0
After d=0
After d=1
At each level
Remember to think of this as a tree, not as array
Orange Arrow = Copy
Green Arrow + Black Arrow = Add
Work-Efficient Parallel Scan
0
0
6
2
1
4
6
15
1
9
0
6
6
0
28
0
2
4
6
1
9
6
0
0
10
1
3
6
15
21
Replace with 0
After d=0
After d=1
After d=2
At each level
Remember to think of this as a tree, not as array
Orange Arrow = Copy
Green Arrow + Black Arrow = Add
Work-Efficient Parallel Scan
Stream Compaction
Stream Compaction
a
b
f
c
d
e
g
h
Stream Compaction
a
b
f
c
d
e
g
h
a
c
d
g
Stream Compaction
a
b
f
c
d
e
g
h
a
c
d
g
Stream Compaction
a
b
f
c
d
e
g
h
1
Stream Compaction
a
b
f
c
d
e
g
h
1
0
Stream Compaction
a
b
f
c
d
e
g
h
1
0
1
Stream Compaction
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Stream Compaction
a
b
f
c
d
e
g
h
Stream Compaction
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Stream Compaction
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
0
1
2
3
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
a
0
1
2
3
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
a
c
0
1
2
3
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
a
c
d
0
1
2
3
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
a
c
d
g
0
1
2
3
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
a
c
d
g
0
1
2
3
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
0
1
2
3
Stream Compaction
0
1
3
1
2
3
3
4
a
b
f
c
d
e
g
h
1
0
0
1
1
0
1
0
Scan Result:
Final Array:
a
c
d
g
0
1
2
3
Summed Area Table (SAT)
Summed Area Table
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
5
6
8
2
6
9
11
4
9
12
14
SAT
(1 + 1 + 0) + (1 + 2 + 1) + (0 + 1 + 2) = 9
Summed Area Table
Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
Summed Area Table
Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
5
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
5
6
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
5
6
8
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
5
6
8
2
6
9
11
4
9
12
SAT
Summed Area Table
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
5
6
8
2
6
9
11
4
9
12
14
SAT
Summed Area Table
How would implement
this on the GPU?
Summed Area Table
How would implement
this on the GPU?
Hint: Inclusive Scan
Summed Area Table
One inclusive scan for each row
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
1
3
4
4
0
1
3
3
2
3
3
3
Partial SAT
Summed Area Table
One inclusive scan for each column
1
1
0
2
1
2
1
0
0
1
2
0
2
1
0
0
Input image
1
2
2
4
2
5
6
8
2
6
9
11
4
9
12
14
SAT
Radix Sort
Radix Sort
Radix Sort
0110
LSB
MSB
Radix Sort
Example from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
100
111
010
110
011
101
001
000
Radix Sort
LSB == 0
LSB == 1
100
111
010
110
011
101
001
000
100
010
110
000
111
011
101
001
Radix Sort
100
000
101
001
010
110
111
011
Bit == 0
Bit == 1
100
111
010
110
011
101
001
000
100
010
110
000
111
011
101
001
Radix Sort
100
000
101
001
010
110
111
011
MSB == 0
MSB == 1
100
111
010
110
011
101
001
000
100
010
110
000
111
011
101
001
000
001
010
011
100
101
110
111
Radix Sort
100
000
101
001
010
110
111
011
100
111
010
110
011
101
001
000
100
010
110
000
111
011
101
001
000
001
010
011
100
101
110
111
Radix Sort
4
0
5
1
2
6
7
3
4
7
2
6
3
5
1
0
4
2
6
0
7
3
5
1
0
1
2
3
4
5
6
7
Parallel Radix Sort
Parallel Radix Sort
Our focus is on Step 2
Parallel Radix Sort
Parallel Radix Sort
Parallel Radix Sort
100
111
010
110
011
101
001
000
0
1
0
0
1
1
1
0
100
010
110
000
111
011
101
001
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
totalFalses = e[n – 1] + f[n – 1]
totalFalses = 1 + 3
totalFalses = 4
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
totalFalses = 4
t[i] = i – f[i] + totalFalses
t array
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
totalFalses = 4
t[i] = i – f[i] + totalFalses ==> t[0] = 0 – 0 + 4
t array
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
totalFalses = 4
t[i] = i – f[i] + totalFalses ==> t[1] = 1 – 1 + 4
t array
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
5
totalFalses = 4
t[i] = i – f[i] + totalFalses ==> t[2] = 2 – 1 + 4
t array
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
5
5
5
6
7
8
totalFalses = 4
t[i] = i – f[i] + totalFalses
t array
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
5
5
5
6
7
8
t array
d[i] = b[i] ? t[i] : f[i]
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
5
5
5
6
7
8
t array
0
d[i] = b[i] ? t[i] : f[i]
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
5
5
5
6
7
8
t array
0
4
d[i] = b[i] ? t[i] : f[i]
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
5
5
5
6
7
8
t array
0
4
1
d[i] = b[i] ? t[i] : f[i]
Parallel Radix Sort
0
1
0
0
1
1
1
0
i array
b array
1
0
1
1
0
0
0
1
e array ( = !b)
100
111
010
110
011
101
001
000
0
1
1
2
3
3
3
3
f array
4
4
5
5
5
6
7
8
t array
0
4
1
2
5
6
7
3
d[i] = b[i] ? t[i] : f[i]
Parallel Radix Sort
i array
100
111
010
110
011
101
001
000
0
4
1
2
5
6
7
3
d array
output
Parallel Radix Sort
i array
100
111
010
110
011
101
001
000
0
4
1
2
5
6
7
3
d array
output
0
1
2
3
4
5
6
7
Parallel Radix Sort
Scan Revisited
Scan Limitations
Scan on Arrays of Arbitrary Length
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| | | | | | | | | | | |
| | | | | | | | | | | |
Block 0 | Block 1 | Block 2 | Block 3 | ||||||||
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Scan on Arrays of Arbitrary Length
Run in parallel over many SMs
Block 0 | Block 1 | Block 2 | Block 3 | ||||||||
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| | | | | | | | | | | |
Block 0 | Block 1 | Block 2 | Block 3 | ||||||||
0 | 1 | 3 | 3 | 7 | 12 | 6 | 13 | 21 | 9 | 19 | 30 |
Scan on Arrays of Arbitrary Length
Block 0 | Block 1 | Block 2 | Block 3 | ||||||||
0 | 1 | 3 | 3 | 7 | 12 | 6 | 13 | 21 | 9 | 19 | 30 |
3 | 12 | 21 | 30 |
Scan on Arrays of Arbitrary Length
3 | 12 | 21 | 30 |
0 | 3 | 15 | 36 |
Block sums:
Block increments:
Scan on Arrays of Arbitrary Length
+
Block 0 | Block 1 | Block 2 | Block 3 | ||||||||
0 | 1 | 3 | 3 | 7 | 12 | 6 | 13 | 21 | 9 | 19 | 30 |
Block 0 | Block 1 | Block 2 | Block 3 | ||||||||
0 | 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 | 45 | 55 | 66 |
3 | 12 | 21 | 30 |
0 | 3 | 15 | 36 |
Block sums:
Block increments:
Scan on Arrays of Arbitrary Length
Summary