1 of 12

https://medium.com/nerd-for-tech/an-overview-of-pipeline-parallelism-and-its-research-progress-7934e5e6d5b8

Pipeline Parallelism

Reza Jahadi, Aye Sandar Thwe, Xueyun Ye

Professor: Dr. Ehsan Atoofian

Narayanan, Deepak, et al. “PipeDream: generalized pipeline parallelism for DNN training.” Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.

2 of 12

MNIST Dataset

Handwritten digits 0–9

[Figure: sample MNIST images of each digit, 0 through 9]

60,000 training images

10,000 test images

3 of 12

Convolutional Neural Network

[Figure: Input 1@28×28 → Convolution 16@14×14 → Convolution 32@7×7 → Dense Layer 1×200 → Dense Layer 1×200 → Output 1×10]
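To make the figure concrete, here is a quick sketch (stage names such as "dense1" are ours, not from the slide) that prints the number of values each stage produces, i.e. the amount a pipelined run would hand to the next stage:

/* Activation sizes implied by the architecture above: the number of
 * floats each stage emits (and a pipelined run would transmit). */
#include <stdio.h>

int main(void) {
    const char *stage[] = {"input", "conv1", "conv2", "dense1", "dense2", "output"};
    const int   elems[] = {1*28*28, 16*14*14, 32*7*7, 200, 200, 10};
    for (int s = 0; s < 6; s++)
        printf("%-7s %4d values\n", stage[s], elems[s]);
    return 0;
}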

4 of 12

Model Parallelism

  • Batch size = 1: only one machine is active at a time, causing under-utilization of computing resources

Narayanan, Deepak, et al. “PipeDream: generalized pipeline parallelism for DNN training.” Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.

  • It is not clear how to split the model optimally among different machines

5 of 12

Task Channel Model

[Figure: the same CNN drawn as a task/channel graph — Input 1@28×28 → Convolution 16@14×14 → Convolution 32@7×7 → Dense Layer 1×200 → Dense Layer 1×200 → Output 1×10, with each layer a task and each activation transfer a channel]

6 of 12

Pipeline Parallelism

  • Accuracy
  • Number of processors = number of layers

7 of 12

Pipeline Parallelism

[Figure: pipeline schedule over the 5 stages Conv 1 (P0), Conv 2 (P1), Full 1 (P2), Full 2 (P3), Output (P4)]

step   Conv 1 (P0)   Conv 2 (P1)   Full 1 (P2)   Full 2 (P3)   Output (P4)
1      1st image
2      2nd image     1st image
3      3rd image     2nd image     1st image
4      4th image     3rd image     2nd image     1st image
5      5th image     4th image     3rd image     2nd image     1st image
...
i      ith image     (i-1)th       (i-2)th       (i-3)th       (i-4)th image

Batch size (i) = 5 × test dataset size / P

  • If P is divisible by 5, each of the 5 layers is assigned to one CPU in a cyclic manner, so the P CPUs form P/5 independent pipelines
  • MPI_Send sends a stage's outputs to the next CPU
  • MPI_Recv receives the outputs from the previous CPU
  • MPI_Reduce computes the total number of correct predictions (see the sketch below)
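A minimal C/MPI sketch of the scheme above, assuming P is divisible by 5 so that ranks 0–4 form one pipeline, ranks 5–9 the next, and so on. The layer math, MNIST loading, and label lookup (forward_stage, load_image, correct_label) are placeholder stubs of our own, not the course implementation; only the MPI skeleton (cyclic stage assignment, MPI_Send/MPI_Recv at stage boundaries, final MPI_Reduce) follows the bullets above.

/* Sketch of 5-stage pipelined MNIST inference with MPI.
 * Assumes P divisible by 5: ranks 0..4 form one pipeline, 5..9 the
 * next, and so on.  forward_stage/load_image/correct_label are
 * placeholder stubs, NOT the real layer code. */
#include <mpi.h>
#include <stdio.h>

#define STAGES   5
#define TEST_SET 10000
#define MAX_ACT  (16 * 14 * 14)   /* largest inter-stage activation */

/* Placeholder stubs so the sketch compiles and runs end to end. */
static void load_image(int img, float *out)   { (void)img; out[0] = 0.0f; }
static int  correct_label(int img)            { (void)img; return 0; }
static int  forward_stage(int stage, const float *in, float *out) {
    out[0] = in[0];                        /* pretend layer math      */
    return stage == STAGES - 1 ? 0 : -1;   /* last stage "predicts" 0 */
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int stage    = rank % STAGES;              /* cyclic layer assignment */
    int pipeline = rank / STAGES;
    int batch    = STAGES * TEST_SET / size;   /* i = 5*N/P images each   */

    float in[MAX_ACT], out[MAX_ACT];
    int local_correct = 0, total_correct = 0;

    for (int i = 0; i < batch; i++) {
        int img = pipeline * batch + i;
        if (stage == 0)
            load_image(img, in);
        else   /* receive activations from the previous stage */
            MPI_Recv(in, MAX_ACT, MPI_FLOAT, rank - 1, i,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        int pred = forward_stage(stage, in, out);

        if (stage < STAGES - 1)   /* forward activations to next stage */
            MPI_Send(out, MAX_ACT, MPI_FLOAT, rank + 1, i, MPI_COMM_WORLD);
        else if (pred == correct_label(img))
            local_correct++;
    }

    /* Sum per-pipeline counts of correct predictions onto rank 0. */
    MPI_Reduce(&local_correct, &total_correct, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("correct predictions: %d / %d\n", total_correct, TEST_SET);

    MPI_Finalize();
    return 0;
}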

8 of 12

Pipeline Parallelism

  • If P is not divisible by 5

 

9 of 12

Data Parallelism
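The slide's diagram did not survive extraction; below is a minimal C/MPI sketch of the data-parallel alternative under the same assumptions as the pipeline sketch: every rank runs the entire network on its own shard of the test set (predict and correct_label are hypothetical stubs), and the only communication is the final MPI_Reduce.

/* Sketch of data-parallel MNIST inference: each rank evaluates the
 * full network on TEST_SET/P images.  predict() and correct_label()
 * are placeholder stubs, not the real implementation. */
#include <mpi.h>
#include <stdio.h>

#define TEST_SET 10000

static int predict(int img)       { (void)img; return 0; }  /* stub */
static int correct_label(int img) { (void)img; return 0; }  /* stub */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int shard = TEST_SET / size;            /* images per rank        */
    int local_correct = 0, total_correct = 0;

    for (int i = 0; i < shard; i++) {       /* no inter-rank traffic  */
        int img = rank * shard + i;         /* until the final reduce */
        if (predict(img) == correct_label(img))
            local_correct++;
    }

    MPI_Reduce(&local_correct, &total_correct, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("correct predictions: %d / %d\n", total_correct, TEST_SET);

    MPI_Finalize();
    return 0;
}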

 

10 of 12

Results

Method             Sequential code exe time (s)   Parallel code exe time, 20 CPUs (s)   Speedup, 20 CPUs
Pipelining         43.715269                      5.89032                               7.421543991
Data Parallelism   45.51055                       2.647105                              17.19257453
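As a check on the table (the efficiency figures are our addition: efficiency = speedup / number of CPUs):

  Pipelining:        43.715269 / 5.89032  ≈ 7.42  → efficiency 7.42 / 20 ≈ 37%
  Data parallelism:  45.51055 / 2.647105  ≈ 17.19 → efficiency 17.19 / 20 ≈ 86%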

11 of 12

Communication Overhead

(Pipeline MPI_Send / MPI_Recv)
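The slide's chart did not survive extraction, but the traffic can be estimated (our derivation, not a number from the slide): every image crosses the 4 boundaries between the 5 stages, costing one MPI_Send/MPI_Recv pair per boundary, so each pipeline issues 4 × i point-to-point messages. Summed over the P/5 pipelines:

  Total send/receive pairs = (P/5) × 4 × (5 × test dataset size / P) = 4 × test dataset size

For the 10,000-image test set this is 40,000 pairs, independent of P.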

 

12 of 12

Communication Overhead

(MPI_Reduce)

  • Total communication delay for MPI_Reduce (both pipelining and data parallelism) grows as log₂ p, where p = number of processors
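This log factor reflects a binary-tree reduction, the common MPI_Reduce strategy for small messages: each round halves the number of ranks still holding partial sums. For this experiment:

  p = 20 CPUs → ceil(log2 20) = 5 communication rounds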