1 of 36

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

Chaoyang He

PhD Student (2018-present), CS, USC; Former R&D Manager at Tencent; Researcher, Tencent AI Lab

Salman Avestimehr
Professor, ECE & CS, USC; Director, USC-Amazon ML Center

Shen Li
Research Scientist, Facebook AI; Team Lead, PyTorch Distributed; CS PhD, UIUC

Mahdi Soltanolkotabi
Associate Professor, CS & ECE, USC

2 of 36

Outline

  • Background and Related Work
  • Motivation and Ideas
  • Overall Design (Animation)
  • AutoFreeze: Freeze Algorithm
  • AutoPipe: Elastic Pipelining
  • AutoDP: Spawning More Pipeline Replicas
  • AutoCache: Cross-pipeline Caching
  • Experimental Results
  • Future Work

3 of 36

Background

4 of 36

Background

After 2021-06: 100 Trillion?

The number of parameters in deep neural networks (Transformers) is increasing dramatically!

5 of 36

Background

*Note: This slide is from the AAAI 2021 Tutorial (https://sites.google.com/view/aaai-2021-tutorial-ah9/home)

6 of 36

Background

*Note: This slide is from the AAAI 2021 Tutorial (https://sites.google.com/view/aaai-2021-tutorial-ah9/home)

7 of 36

Background

*Note: This slide is from the AAAI 2021 Tutorial (https://sites.google.com/view/aaai-2021-tutorial-ah9/home)

8 of 36

Related Work

1. System-wise: Distributed Training System Design and Optimization

Data and Model Parallelism

  • Data, Inter-Batch: ByteScheduler (SOSP'19), Crossbow (VLDB'19)
  • Data, Intra-Batch: PT Pipe, PT DDP (VLDB'20), GPipe (NeurIPS'19)
  • Model, Inter-Operator: PT RPC, TF + gRPC (EuroSys'19), PipeDream (SOSP'19), HetPipe (ATC'20), PT RPC + DDP, PT RPC + Pipe, Parallax (EuroSys'19), BytePS (OSDI'20)
  • Model, Intra-Operator: Mesh-TF (NeurIPS'18), Tofu (EuroSys'19), FlexFlow (MLSys'19), GPT-3 (NeurIPS'20), ZeRO/DeepSpeed (SC'20)

9 of 36

Related Work

1. System-wise: Distributed Training System Design and Optimization

Pipeline Parallelism

Hybrid of Data Parallelism and Pipeline Parallelism

10 of 36

Related Work

2. ML-wise: Model Architecture and Training Algorithm

  1. Manual Architecture Optimization:
     LinFormer (AAAI 2021, Best Paper Award)
  2. Automated Architecture Design (Neural Architecture Search):
     FBNet (CVPR 2019, 400+ citations)
  3. Sparse Training (pruning, quantization, etc.):
     Lottery Ticket Hypothesis (ICLR 2019, Best Paper Award)
  4. Progressive Training

  • SGD-based Distributed Optimization:
    LAMB (ICLR 2020): Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes

...

11 of 36

Our Motivation and Idea

Elastic Distributed Training System!

1. (Sys) Distributed Training System

2. (ML) Model Architecture and Training Algorithm

Pro:
Efficiency in computation/communication/memory

Con:
Views the model/SGD optimization as a black box

Pro:
Improves efficiency mathematically, fundamentally

Con:
1. Lacks system design to amplify the algorithmic advantages
2. Model-wise optimization is not friendly to distributed training

What if we co-design?

  1. Progressive Training
  2. Dynamic Neural Networks (https://arxiv.org/pdf/2102.04906.pdf)

Hybrid of Pipeline and Data Parallelism

12 of 36

Progressive Training

[1] Freeze Training: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. NeurIPS 2017

[2] Efficient Training of BERT by Progressively Stacking. ICML 2019

[3] Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. NeurIPS 2020.
[4] On the Transformer Growth for Progressive BERT Training. NAACL 2021.

Freeze Training [1]

Progressive Stacking [2]

13 of 36

Distributed Training System

Pipeline Parallelism

Hybrid of Data Parallelism and Pipeline Parallelism

Key observations when applying progressive training (e.g., freeze training) to the above training systems:

1. The memory cost is gradually reduced.

2. The communication cost among DP workers is gradually reduced.

3. The computation cost becomes unbalanced across pipeline-parallel partitions.

14 of 36

Overall Design

The process of PipeTransformer’s automated and elastic pipelining

15 of 36

PipeTransformer Animation

(Slides 15-21 step through an animation of PipeTransformer's automated and elastic pipelining.)

22 of 36

Overall Design

23 of 36

AutoFreeze: Freeze Algorithm
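The slide shows the freeze algorithm as a figure; below is a minimal sketch of how a progressive freeze schedule could be applied in PyTorch. The geometric growth rule and the `alpha` parameter are illustrative assumptions, not the exact AutoFreeze rule.

```python
import torch.nn as nn

def next_frozen_boundary(prev_frozen: int, num_layers: int, alpha: float = 0.5) -> int:
    """Hypothetical schedule: each epoch, freeze an additional `alpha` fraction
    of the layers that are still trainable (a geometric progression)."""
    return min(prev_frozen + int(alpha * (num_layers - prev_frozen)), num_layers)

def apply_freeze(layers: nn.ModuleList, frozen_boundary: int) -> None:
    """Disable gradients for the first `frozen_boundary` layers so they no
    longer compute or store gradients during the backward pass."""
    for layer in layers[:frozen_boundary]:
        for p in layer.parameters():
            p.requires_grad = False
```

In PipeTransformer, the updated frozen boundary is what triggers AutoPipe to re-partition the remaining active layers.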

24 of 36

AutoPipe: Elastic Pipelining

Partitioning must trade off computational load, communication cost, and memory consumption among partitions (each partition is loaded onto one GPU).

(1) Pipeline Partitioning Strategy
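As a rough illustration of the balancing problem (not the exact partition algorithm from the paper), the sketch below splits layers into K contiguous partitions using parameter size as a proxy for per-GPU load; the function name and the greedy rule are assumptions.

```python
def partition_layers(layer_sizes, k):
    """Greedily assign contiguous layers to k partitions so each partition's
    total size stays close to the average (size = parameter count, used as a
    proxy for computational and memory load)."""
    target = sum(layer_sizes) / k
    partitions, current, load = [], [], 0.0
    for i, size in enumerate(layer_sizes):
        layers_left = len(layer_sizes) - i
        parts_left = k - len(partitions) - 1
        # Close the current partition once it reaches the target, as long as
        # enough layers remain to give every later partition at least one layer.
        if current and load + size > target and len(partitions) < k - 1 and layers_left > parts_left:
            partitions.append(current)
            current, load = [], 0.0
        current.append(i)
        load += size
    partitions.append(current)
    return partitions

# Example: 8 transformer layers of uniform size split over 4 GPUs.
# partition_layers([1] * 8, 4) -> [[0, 1], [2, 3], [4, 5], [6, 7]]
```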

25 of 36

AutoPipe: Elastic Pipelining

To avoid extensive memory profiling, the compression algorithm uses the parameter size as a proxy for the training memory footprint.

(2) Pipeline Compression
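A minimal sketch of the compression decision under the stated proxy; the halving criterion and the `per_gpu_budget` argument are assumptions for illustration.

```python
def compressed_stage_count(k_current, active_layer_sizes, per_gpu_budget):
    """Try halving the number of pipeline stages K whenever the remaining
    active (un-frozen) parameters, split evenly, would still fit each GPU's
    budget. Parameter size stands in for the real training memory footprint,
    which avoids repeated memory profiling as more layers get frozen."""
    k = k_current
    while k > 1:
        estimated_load = sum(active_layer_sizes) / (k // 2)  # balanced-split estimate
        if estimated_load <= per_gpu_budget:
            k //= 2  # freed GPUs can be reassigned to new pipeline replicas (AutoDP)
        else:
            break
    return k
```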

26 of 36

AutoPipe: Elastic Pipelining

When the pipeline is compressed (K becomes smaller), the bubble shrinks (left figure).

However, we find that the number of micro-batches (M) also needs to be adjusted accordingly (the right figure shows the optimal M for different K).

(3) Dynamic Number of Micro Batches
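The bubble argument can be made concrete with the idealized GPipe-style bubble fraction (K-1)/(M+K-1). The profiling-based selection of M sketched below is an assumption about how the tuning could be automated; the `time_one_iteration` callback is hypothetical.

```python
def bubble_fraction(k, m):
    """Idealized pipeline bubble overhead for k stages and m micro-batches:
    fewer stages (smaller k) or more micro-batches (larger m) shrink the bubble."""
    return (k - 1) / (m + k - 1)

def pick_micro_batches(k, candidates, time_one_iteration):
    """Pick M by a short profiling sweep: a very large M shrinks the bubble but
    makes each micro-batch too small to utilize the GPU, so the formula alone
    is not enough and measured timings decide."""
    return min(candidates, key=lambda m: time_one_iteration(k, m))

# Example: pick_micro_batches(4, [4, 8, 16, 32], time_one_iteration=my_profiler)
```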

27 of 36

AutoPipe: Elastic Pipelining

  1. A trade-off in pipeline partitioning
  2. Pipeline compression
  3. Optimal number of micro-batch chunks (M)

(4) AutoPipe algorithm: put all together

28 of 36

AutoDP: Spawning More Pipeline Replicas

29 of 36

AutoDP: Spawning More Pipeline Replicas

Key challenges when adding more pipelines on the fly during training:

  1. DDP Communication: collective communication in PyTorch DDP requires static membership, which prevents new pipelines from connecting with existing ones;

  2. State Synchronization: newly activated processes must be consistent with existing pipelines in training progress (e.g., epoch number and learning rate), weights and optimizer states, the boundary of frozen layers, and the pipeline GPU range;

  3. Dataset Redistribution: the dataset should be re-balanced to match the dynamic number of pipelines. This not only avoids stragglers but also ensures that gradients from all DDP processes are equally weighted (see the sketch after this list).
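For challenge 3, PyTorch's DistributedSampler already supports re-partitioning a dataset across a given number of workers; the sketch below shows how a rebuilt sampler could re-balance data when the pipeline count changes. The function name and arguments are placeholders, not the exact PipeTransformer code.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def rebuild_dataloader(dataset, num_pipelines, pipeline_rank, batch_size, epoch):
    """Re-shard the dataset whenever the number of pipeline replicas changes,
    so every DDP process sees an equally sized shard (no stragglers, equal
    gradient weighting)."""
    sampler = DistributedSampler(
        dataset, num_replicas=num_pipelines, rank=pipeline_rank, shuffle=True
    )
    sampler.set_epoch(epoch)  # keep shuffling consistent across the new replicas
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```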

30 of 36

AutoDP: Spawning More Pipeline Replicas

Our idea:

1. Create two process groups; each process handles one pipeline.

2. The active training process group (yellow) handles training.

3. The message process group (purple) handles state synchronization and dataset redistribution by passing messages between the two groups via MPI communication.
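A minimal sketch, assuming torch.distributed has already been initialized on every process, of how the two groups could be created and used. The helper names and the broadcast of a state object are illustrative; the slide's design uses an MPI-based message group, while this sketch is backend-agnostic.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_groups(active_ranks, all_ranks):
    """One group for the currently active training pipelines (collective comms
    for DDP) and one global 'message' group used only for lightweight
    coordination when new pipelines join. Note: every process must call
    new_group for both groups."""
    active_group = dist.new_group(ranks=active_ranks)
    message_group = dist.new_group(ranks=all_ranks)
    return active_group, message_group

def sync_new_pipeline(model, state, active_group, message_group):
    """Newly activated processes first receive the training state (epoch,
    learning rate, frozen boundary, etc.) over the message group, then join
    DDP on the active group for gradient synchronization."""
    payload = [state]
    dist.broadcast_object_list(payload, src=0, group=message_group)
    return DDP(model, process_group=active_group), payload[0]
```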

31 of 36

AutoCache: Cross-pipeline Caching

In this example, the first 3 layers (purple) perform the same computation at two time steps T1 and T2 (epochs), so T2 can reuse the cached activations from T1.
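A minimal sketch of the caching idea, assuming a per-sample cache keyed by sample id; the class name, CPU-side storage, and the invalidation-on-boundary-move policy are assumptions for illustration.

```python
import torch

class FrozenActivationCache:
    """Cache the output of the frozen front layers: once those layers stop
    updating, their activations for a given sample are identical across epochs,
    so later epochs can skip recomputing them."""

    def __init__(self):
        self.num_frozen = 0
        self.store = {}  # sample_id -> activation kept on CPU to spare GPU memory

    def forward_frozen(self, frozen_layers, sample_id, x):
        if len(frozen_layers) != self.num_frozen:  # frozen boundary moved:
            self.store.clear()                     # cached values are stale
            self.num_frozen = len(frozen_layers)
        if sample_id in self.store:
            return self.store[sample_id].to(x.device)
        with torch.no_grad():                      # frozen layers need no gradients
            for layer in frozen_layers:
                x = layer(x)
        self.store[sample_id] = x.detach().cpu()
        return x
```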

32 of 36

Experimental Results

1. Overall Speedup

Evaluation on various datasets and models, including tasks in both CV and NLP.

33 of 36

Experimental Results

Key takeaways:

1. The main speedup is the result of elastic pipelining, achieved through the joint use of AutoPipe and AutoDP (purple).

2. AutoCache's contribution is amplified by AutoDP (green vs. blue: more parallel DP workers can use caching).

3. Freeze training alone, without system-wise adjustment, even degrades the training speed (yellow): the underlying mechanism of PyTorch is not tailored for freeze training, forcing CUDACachingAllocator to split blocks or launch new memory allocations.

2. Breakdown for speedup

34 of 36

Experimental Results

Communication infrastructure: InfiniBand CX353A, with 5 GB/s cross-machine bandwidth; GPU-to-GPU bandwidth within a machine (PCIe 3.0, 16 lanes) is 15.754 GB/s.

Key takeaways:

1. Communication cost is not the main bottleneck when we use InfiniBand for medium-scale models (< 500M parameters, such as ViT-Base and BERT-large), but it is still non-trivial even under freeze training.

2. Recent progress in NLP and CV has scaled model sizes to the billion/trillion level (GPT-3 - 175B [1], Switch Transformer - 1.7T [2]), which will make the communication ratio much higher than in our experimental results.

BERT-large (340M)

ViT-Base (87M)

[1] Language Models are Few-Shot Learners. 2020

[2] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021

3. Breakdown of communication vs. computation

35 of 36

Experimental Results

  1. The optimal chunk number (M) differs as the elastic pipeline changes.
  2. The timing of caching is important.

We automate both of these optimization strategies.

The trade-off between accuracy and efficiency.

4. Performance Analysis

36 of 36

Future Work

Distributed training tasks that can be made elastic:

  1. Elastic cloud-based distributed training systems
  2. Accelerating NAS in extremely large search spaces
  3. Federated AutoML (NAS, HPO)
  4. Cross-silo Federated Learning
  5. Pruning-based distributed training
  6. IoT device-based elastic edge training