1 of 12

https://medium.com/nerd-for-tech/an-overview-of-pipeline-parallelism-and-its-research-progress-7934e5e6d5b8

Pipeline Parallelism

Reza Jahadi, Aye Sandar Thwe, Xueyun Ye

Professor: Dr. Ehsan Atoofian

Narayanan, Deepak, et al. “PipeDream: generalized pipeline parallelism for DNN training.” Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.

2 of 12

MNIST Dataset

Handwritten digits 0–9

[Figure: sample MNIST images of each digit, 0 through 9]

60,000 training images

10,000 test images

3 of 12

Convolutional Neural Network

[Figure: Input 1@28×28 → Convolution 16@14×14 → Convolution 32@7×7 → Dense Layer 1×200 → Dense Layer 1×200 → Output 1×10]
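To make the figure concrete, here is a quick sketch (stage names such as "dense1" are ours, not from the slide) that prints the number of values each stage produces, i.e. the amount a pipelined run would hand to the next stage:

/* Activation sizes implied by the architecture above: the number of
 * floats each stage emits (and a pipelined run would transmit). */
#include <stdio.h>

int main(void) {
    const char *stage[] = {"input", "conv1", "conv2", "dense1", "dense2", "output"};
    const int   elems[] = {1*28*28, 16*14*14, 32*7*7, 200, 200, 10};
    for (int s = 0; s < 6; s++)
        printf("%-7s %4d values\n", stage[s], elems[s]);
    return 0;
}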

4 of 12

Model Parallelism

  • Batch size = 1: only one machine is active at a time, causing under-utilization of computing resources

Narayanan, Deepak, et al. “PipeDream: generalized pipeline parallelism for DNN training.” Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.

  • It is not clear how to split the model optimally among different machines

5 of 12

Task Channel Model

[Figure: the same CNN drawn as a task/channel graph — Input 1@28×28 → Convolution 16@14×14 → Convolution 32@7×7 → Dense Layer 1×200 → Dense Layer 1×200 → Output 1×10, with each layer a task and each activation transfer a channel]

6 of 12

Pipeline Parallelism

  • Accuracy
  • Number of processors = number of layers

7 of 12

Pipeline Parallelism

[Figure: pipeline schedule over the 5 stages Conv 1 (P0), Conv 2 (P1), Full 1 (P2), Full 2 (P3), Output (P4)]

step   Conv 1 (P0)   Conv 2 (P1)   Full 1 (P2)   Full 2 (P3)   Output (P4)
1      1st image
2      2nd image     1st image
3      3rd image     2nd image     1st image
4      4th image     3rd image     2nd image     1st image
5      5th image     4th image     3rd image     2nd image     1st image
...
i      ith image     (i-1)th       (i-2)th       (i-3)th       (i-4)th image

Batch size (i) = 5 × test dataset size / P

  • If P is divisible by 5, each of the 5 layers is assigned to one CPU in a cyclic manner, so the P CPUs form P/5 independent pipelines
  • MPI_Send sends a stage's outputs to the next CPU
  • MPI_Recv receives the outputs from the previous CPU
  • MPI_Reduce computes the total number of correct predictions (see the sketch below)
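A minimal C/MPI sketch of the scheme above, assuming P is divisible by 5 so that ranks 0–4 form one pipeline, ranks 5–9 the next, and so on. The layer math, MNIST loading, and label lookup (forward_stage, load_image, correct_label) are placeholder stubs of our own, not the course implementation; only the MPI skeleton (cyclic stage assignment, MPI_Send/MPI_Recv at stage boundaries, final MPI_Reduce) follows the bullets above.

/* Sketch of 5-stage pipelined MNIST inference with MPI.
 * Assumes P divisible by 5: ranks 0..4 form one pipeline, 5..9 the
 * next, and so on.  forward_stage/load_image/correct_label are
 * placeholder stubs, NOT the real layer code. */
#include <mpi.h>
#include <stdio.h>

#define STAGES   5
#define TEST_SET 10000
#define MAX_ACT  (16 * 14 * 14)   /* largest inter-stage activation */

/* Placeholder stubs so the sketch compiles and runs end to end. */
static void load_image(int img, float *out)   { (void)img; out[0] = 0.0f; }
static int  correct_label(int img)            { (void)img; return 0; }
static int  forward_stage(int stage, const float *in, float *out) {
    out[0] = in[0];                        /* pretend layer math      */
    return stage == STAGES - 1 ? 0 : -1;   /* last stage "predicts" 0 */
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int stage    = rank % STAGES;              /* cyclic layer assignment */
    int pipeline = rank / STAGES;
    int batch    = STAGES * TEST_SET / size;   /* i = 5*N/P images each   */

    float in[MAX_ACT], out[MAX_ACT];
    int local_correct = 0, total_correct = 0;

    for (int i = 0; i < batch; i++) {
        int img = pipeline * batch + i;
        if (stage == 0)
            load_image(img, in);
        else   /* receive activations from the previous stage */
            MPI_Recv(in, MAX_ACT, MPI_FLOAT, rank - 1, i,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        int pred = forward_stage(stage, in, out);

        if (stage < STAGES - 1)   /* forward activations to next stage */
            MPI_Send(out, MAX_ACT, MPI_FLOAT, rank + 1, i, MPI_COMM_WORLD);
        else if (pred == correct_label(img))
            local_correct++;
    }

    /* Sum per-pipeline counts of correct predictions onto rank 0. */
    MPI_Reduce(&local_correct, &total_correct, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("correct predictions: %d / %d\n", total_correct, TEST_SET);

    MPI_Finalize();
    return 0;
}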

8 of 12

Pipeline Parallelism

  • If P is not divisible by 5

 

9 of 12

Data Parallelism
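The slide's diagram did not survive extraction; below is a minimal C/MPI sketch of the data-parallel alternative under the same assumptions as the pipeline sketch: every rank runs the entire network on its own shard of the test set (predict and correct_label are hypothetical stubs), and the only communication is the final MPI_Reduce.

/* Sketch of data-parallel MNIST inference: each rank evaluates the
 * full network on TEST_SET/P images.  predict() and correct_label()
 * are placeholder stubs, not the real implementation. */
#include <mpi.h>
#include <stdio.h>

#define TEST_SET 10000

static int predict(int img)       { (void)img; return 0; }  /* stub */
static int correct_label(int img) { (void)img; return 0; }  /* stub */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int shard = TEST_SET / size;            /* images per rank        */
    int local_correct = 0, total_correct = 0;

    for (int i = 0; i < shard; i++) {       /* no inter-rank traffic  */
        int img = rank * shard + i;         /* until the final reduce */
        if (predict(img) == correct_label(img))
            local_correct++;
    }

    MPI_Reduce(&local_correct, &total_correct, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("correct predictions: %d / %d\n", total_correct, TEST_SET);

    MPI_Finalize();
    return 0;
}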

 

10 of 12

Results

Method             Sequential code exe time (s)   Parallel code exe time, 20 CPUs (s)   Speedup, 20 CPUs
Pipelining         43.715269                      5.89032                               7.421543991
Data Parallelism   45.51055                       2.647105                              17.19257453
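As a check on the table (the efficiency figures are our addition: efficiency = speedup / number of CPUs):

  Pipelining:        43.715269 / 5.89032  ≈ 7.42  → efficiency 7.42 / 20 ≈ 37%
  Data parallelism:  45.51055 / 2.647105  ≈ 17.19 → efficiency 17.19 / 20 ≈ 86%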

11 of 12

Communication Overhead

(Pipeline MPI_Send / MPI_Recv)
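The slide's chart did not survive extraction, but the traffic can be estimated (our derivation, not a number from the slide): every image crosses the 4 boundaries between the 5 stages, costing one MPI_Send/MPI_Recv pair per boundary, so each pipeline issues 4 × i point-to-point messages. Summed over the P/5 pipelines:

  Total send/receive pairs = (P/5) × 4 × (5 × test dataset size / P) = 4 × test dataset size

For the 10,000-image test set this is 40,000 pairs, independent of P.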

 

12 of 12

Communication Overhead

(MPI_Reduce)

  • Total communication delay for MPI_Reduce (both pipelining and data parallelism) grows as log₂ p, where p = number of processors
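This log factor reflects a binary-tree reduction, the common MPI_Reduce strategy for small messages: each round halves the number of ranks still holding partial sums. For this experiment:

  p = 20 CPUs → ceil(log2 20) = 5 communication rounds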