PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
Chaoyang He
PhD student (2018-present), CS, USC; Former R&D Manager at Tencent; Researcher, Tencent AI Lab
Salman Avestimehr: Professor, ECE & CS, USC; Director, USC-Amazon ML Center
Shen Li: Research Scientist, Facebook AI; Team Lead, PyTorch Distributed; CS PhD, UIUC
Mahdi Soltanolkotabi: Associate Professor, CS & ECE, USC
Outline
Background
Background
After 2021-06: 100 Trillion?
The number of parameters in deep neural networks (Transformers) is increasing dramatically!
Background
*Note: This page is from AAAI 2021 Tutorial (https://sites.google.com/view/aaai-2021-tutorial-ah9/home)
Related Works
Data Parallelism: Inter-Batch: ByteScheduler (SOSP’19), Crossbow (VLDB’19); Intra-Batch: PT DDP (VLDB’20), Parallax (EuroSys’19), BytePS (OSDI’20)
Model Parallelism (Inter-Operator / Pipeline): PT Pipe, GPipe (NeurIPS’19), PT RPC, TF + gRPC (EuroSys’19), PipeDream (SOSP’19), HetPipe (ATC’20)
Model Parallelism (Intra-Operator): Mesh-TF (NeurIPS’18), Tofu (EuroSys’19), FlexFlow (MLSys’19), GPT-3 (NeurIPS’20), ZeRO/DeepSpeed (SC’20)
Hybrid of Data and Model Parallelism: PT RPC + DDP, PT RPC + Pipe
1. System-wise: Distributed Training System Design and Optimization
Related Works
1. System-wise: Distributed Training System Design and Optimization
Pipeline Parallelism
Hybrid of Data Parallelism and Pipeline Parallelism
Related Works
...
2. ML-wise: Model Architecture and Training Algorithm
Our Motivation and Idea
Elastic Distributed Training System!
1. (Sys) Distributed Training System
2. (ML) Model Architecture and Training Algorithm
Pro (Sys): Efficiency in computation/communication/memory
Con (Sys): Views the model/SGD optimization as a black box
Pro (ML): Improves efficiency mathematically, at a fundamental level
Con (ML): 1. Lacks the system design needed to amplify the algorithmic advantages; 2. Model-wise optimization is not friendly to distributed training
What if we co-design?
Hybrid of Pipeline and Data Parallelism
Progressive Training
[1] Freeze Training: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. NeurIPS 2017
[2] Efficient Training of BERT by Progressively Stacking. ICML 2019
[3] Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. NeurIPS 2020
[4] On the Transformer Growth for Progressive BERT Training. NAACL 2021
Freeze Training [1]
Progressive Stacking [2]
Distributed Training System
Pipeline Parallelism
Hybrid of Data Parallelism and Pipeline Parallelism
Key observations when applying progressive training (e.g., freeze training) to the above training systems (see the sketch below):
1. The memory cost is reduced gradually.
2. The communication cost among DP workers is reduced gradually.
3. The computation cost becomes unbalanced under pipeline parallelism.
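A minimal PyTorch sketch (not PipeTransformer's actual code) of why observations 1 and 2 hold: once the leading layers are frozen, their parameters require no gradients, so they drop out of the backward pass, the optimizer state, and DDP's gradient all-reduce. The 12-layer Transformer stack, the layer sizes, and the optimizer settings below are illustrative assumptions.

```python
# Sketch: freezing the first `num_frozen` layers removes them from the
# backward pass and from gradient synchronization among DP workers.
import torch
import torch.nn as nn

layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=768, nhead=12) for _ in range(12)]
)

def freeze_prefix(layers: nn.ModuleList, num_frozen: int) -> None:
    """Disable gradients for the first `num_frozen` layers."""
    for layer in layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad_(False)

freeze_prefix(layers, num_frozen=4)

# Only the active (unfrozen) parameters are handed to the optimizer, so
# optimizer state and gradient communication volume shrink accordingly.
active_params = [p for p in layers.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(active_params, lr=3e-5)
```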
Overall Design
The process of PipeTransformer’s automated and elastic pipelining
PipeTransformer Animation
Overall Design
Freeze Algorithm
AutoPipe: Elastic Pipelining
Balance the trade-off among computational load, communication cost, and memory consumption across partitions (each partition is loaded onto one GPU); a greedy partitioning sketch is shown below.
(1) Pipeline Partitioning Strategy
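As a hedged illustration of the partitioning strategy (not the exact AutoPipe algorithm), the sketch below greedily splits an ordered layer list into at most K contiguous partitions, using parameter count as a proxy for per-GPU load; the function name and the equal-load target are assumptions.

```python
# Sketch: contiguous, roughly load-balanced partitioning of layers over K GPUs,
# using parameter count as the load proxy.
from typing import List
import torch.nn as nn

def partition_by_params(layers: List[nn.Module], k: int) -> List[List[nn.Module]]:
    """Split `layers` (order preserved) into at most k contiguous partitions
    with roughly equal total parameter size."""
    sizes = [sum(p.numel() for p in layer.parameters()) for layer in layers]
    target = sum(sizes) / k
    partitions, current, load = [], [], 0
    for layer, size in zip(layers, sizes):
        # Close the current partition once it reaches the per-GPU target,
        # unless the last of the k partitions has already been opened.
        if current and load + size > target and len(partitions) < k - 1:
            partitions.append(current)
            current, load = [], 0
        current.append(layer)
        load += size
    partitions.append(current)
    return partitions
```

A greedy pass like this is reasonable for Transformers because the encoder layers are nearly uniform in size; the heaviest partition then dictates the pipeline's throughput.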
AutoPipe: Elastic Pipelining
To avoid extensive memory profiling, the compression algorithm uses the parameter size as a proxy for the training memory footprint.
(2) Pipeline Compression
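A minimal sketch of the compression idea under the parameter-size proxy described above: after a freeze event, if the heaviest still-trainable partition would fit in a GPU's original parameter budget even when partitions are merged pairwise, the pipeline length K can be halved and the freed GPUs handed to AutoDP. The halving heuristic, the budget argument, and counting only trainable parameters are assumptions of this sketch.

```python
# Sketch: decide how short the pipeline can become after freezing,
# using parameter size as a stand-in for the training memory footprint.
from typing import List
import torch.nn as nn

def compressed_pipeline_length(
    partitions: List[List[nn.Module]], k: int, per_gpu_param_budget: int
) -> int:
    """Return a (possibly smaller) pipeline length after a freeze event."""

    def trainable_params(partition: List[nn.Module]) -> int:
        return sum(
            p.numel()
            for layer in partition
            for p in layer.parameters()
            if p.requires_grad  # simplification: count only trainable params
        )

    max_active = max(trainable_params(part) for part in partitions)
    # Halve K while merging neighbouring partitions (doubling the heaviest
    # active load) still fits under the per-GPU parameter budget.
    while k > 1 and 2 * max_active <= per_gpu_param_budget:
        k //= 2
        max_active *= 2
    return k
```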
AutoPipe: Elastic Pipelining
When the pipeline is compressed (K becomes smaller), the bubble shrinks (left figure). However, we find that the number of micro-batches (M) also needs to be adjusted accordingly (the right figure shows the optimal M for different K); a simple profiling-based tuner is sketched below.
(3) Dynamic Number of Micro Batches
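A hedged sketch of how M could be adjusted: rather than fixing the number of micro-batches, profile a few candidates for the current pipeline length K and keep the fastest. `run_pipeline_step` is a hypothetical callback that executes one forward/backward pass with the given M; the candidate set, warm-up, and trial counts are assumptions.

```python
# Sketch: pick the number of micro-batches M by short on-the-fly profiling.
import time
from typing import Callable, Iterable

def tune_num_micro_batches(
    run_pipeline_step: Callable[[int], None],
    candidates: Iterable[int],
    warmup: int = 2,
    trials: int = 5,
) -> int:
    best_m, best_time = None, float("inf")
    for m in candidates:
        for _ in range(warmup):        # discard warm-up iterations
            run_pipeline_step(m)
        start = time.time()
        for _ in range(trials):
            run_pipeline_step(m)
        elapsed = (time.time() - start) / trials
        if elapsed < best_time:
            best_m, best_time = m, elapsed
    return best_m

# Usage idea: after the pipeline is compressed to K GPUs, search around M = K,
# e.g. tune_num_micro_batches(step_fn, candidates=[K, 2 * K, 4 * K]).
```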
AutoPipe: Elastic Pipelining
(4) AutoPipe algorithm: putting it all together
AutoDP: Spawning More Pipeline Replicas
Key challenges when adding more pipelines on the fly during training:
AutoDP: Spawning More Pipeline Replicas
Our idea (sketched below):
1. Create two process groups; each process handles one pipeline.
2. The active training process group (yellow) handles the training.
3. The message process group (purple) handles State Synchronization and Dataset Redistribution via message passing between the two groups over MPI.
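A minimal torch.distributed sketch of the two-group idea (the function names and the broadcast payload are assumptions, and it presumes `dist.init_process_group` has already been called on every rank with a CPU-tensor-capable backend such as gloo or MPI): ranks that currently own a pipeline join an active group used for gradient synchronization, while all ranks join a message group used only for control messages such as state synchronization.

```python
# Sketch: one group for training collectives, one group for control messages.
import torch
import torch.distributed as dist

def build_groups(active_ranks, world_size):
    # Group that performs DDP-style gradient synchronization.
    active_group = dist.new_group(ranks=active_ranks)
    # Group spanning every rank, used only for message passing
    # (state synchronization, dataset redistribution).
    message_group = dist.new_group(ranks=list(range(world_size)))
    return active_group, message_group

def broadcast_frozen_state(message_group, num_frozen: int, src: int = 0) -> int:
    """Tell newly activated ranks how many layers are currently frozen."""
    msg = torch.tensor([num_frozen], dtype=torch.long)
    dist.broadcast(msg, src=src, group=message_group)
    return int(msg.item())
```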
AutoCache: Cross-pipeline Caching
In this example, the first 3 layers (purple) perform the same computation at two time steps T1 and T2 (epochs), so T2 can reuse the cached results from T1; a caching sketch follows below.
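A minimal sketch of the caching idea (the class, the in-memory dict, and keying by sample index are assumptions of this sketch, not AutoCache's actual implementation): since the frozen prefix produces identical outputs for the same sample at T1 and T2, its result can be computed once and then replayed from a cache, which in a real system could spill from GPU to CPU memory or disk.

```python
# Sketch: cache the outputs of the frozen layer prefix across epochs.
import torch
import torch.nn as nn

class FrozenPrefixCache:
    def __init__(self, frozen_layers: nn.ModuleList):
        self.frozen_layers = frozen_layers
        self.cache = {}  # sample-id tuple -> cached activations

    @torch.no_grad()  # frozen layers need no gradients
    def forward(self, sample_ids: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        key = tuple(sample_ids.tolist())
        if key not in self.cache:
            out = x
            for layer in self.frozen_layers:
                out = layer(out)
            self.cache[key] = out.detach()
        return self.cache[key]
```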
Experimental Results
1. Overall Speedup
Evaluation on various datasets and models, covering tasks in both CV and NLP.
Experimental Results
Key takeaways:
1. The main speedup is the result of elastic pipelining, achieved through the joint use of AutoPipe and AutoDP (purple).
2. AutoCache's contribution is amplified by AutoDP (green vs. blue: more parallel DP workers can use the cache).
3. Freeze training alone, without system-wise adjustment, even slows down training (yellow): the underlying PyTorch machinery is not tailored for freeze training, forcing the CUDACachingAllocator to split blocks or launch new memory allocations.
2. Breakdown for speedup
Experimental Results
Communication infrastructure: InfiniBand CX353A, with 5 GB/s cross-machine bandwidth and 15.754 GB/s GPU-to-GPU bandwidth within a machine (PCIe 3.0, 16 lanes).
Key takeaways:
1. Communication cost is not the main bottleneck when using InfiniBand for medium-scale models (< 500M parameters, such as ViT-Base and BERT-large), but it is still non-trivial even under freeze training.
2. Recent progress in NLP and CV has scaled model size to the billion/trillion level (GPT-3: 175B [1], Switch Transformer: 1.7T [2]), which will make the communication ratio much higher than in our experimental results.
BERT-large (340M)
ViT-Base (87M)
[1] Language Models are Few-Shot Learners. 2020
[2] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021
3. Breakdown for communication vs. computation
Experimental Results
We automate these two optimization strategies.
The trade-off between accuracy and efficiency.
4. Performance Analysis
Future works
Distributed training tasks that can be elastic: