Distributed Large Batch Training
Swetha Mandava
Deep Learning Models Increasing in Complexity
Next-Level Use-Cases Require Gigantic Models
AlexNet (UToronto, 2012): 62M parameters
ResNet-152 (Microsoft Research, 2015): 60.3M parameters
GPT (OpenAI, 2018): 110M parameters
BERT (Google, 2018): 340M parameters
Project Megatron (NVIDIA, 2019): 8.3B parameters
Turing-NLG (Microsoft, 2020): 17B parameters
GPT-3 (OpenAI, 2020): 175B parameters
PAIN POINTS
DL model training: scaling is necessary but hard
Agenda
Deep Learning at Scale
Desired outcomes
Optimizations for Scaling
Recommender System:
Neural Collaborative Filtering
Learnings
Productivity matters: teams with better tools/scaling can try out more ideas
SGD: BS = 4096, ~1230 sec
+ LR Scaling: BS *= 16, ~140 sec
+ Warmup: BS *= 192, ~100 sec
+ LARS: BS *= 192, ~110 sec
+ AMP: BS *= 192, ~55 sec
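As a rough illustration of the first two steps in that progression, here is a minimal sketch (not the actual training script) of the linear LR scaling rule plus a linear warmup; the base LR, batch sizes, and warmup length below are placeholder values.

import torch

def scaled_lr(base_lr, batch_size, ref_batch_size=4096):
    # Linear scaling rule: grow the LR proportionally with the batch size.
    return base_lr * batch_size / ref_batch_size

def warmup_factor(step, warmup_steps=1000):
    # Ramp the LR linearly from ~0 to its target over the first warmup_steps.
    return min(1.0, (step + 1) / warmup_steps)

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(0.1, 65536))

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
for step in range(5):
    for group in optimizer.param_groups:
        group["lr"] = scaled_lr(0.1, 65536) * warmup_factor(step)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()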
Q & A
Before we move on to multi-GPU/multi-node...
Multi-Node Experiment
Multi-node Requirements
System Design: everything that goes into designing a single node with multiple GPUs - PCIe vs NVLink, GPU-to-CPU ratio, GPU-to-NIC ratio
Data Center Management: everything that goes into designing a cluster - inter-node connectivity, the software stack run on the cluster (SLURM, K8s), network storage
Algorithm Optimizations: the optimizations we can do within a single GPU - e.g. kernel fusions, AMP, data sharding, optimizer choice
Together, these determine Multi-Node Success.
LAMB - extension of LARS
NVLAMB
Gradient Pre-normalization + Bias Correction
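A minimal sketch of those two additions on top of a LAMB-style update (this is not NVIDIA's NVLAMB implementation; nvlamb_step, its defaults, and the toy usage are illustrative):

import torch

@torch.no_grad()
def nvlamb_step(params, exp_avgs, exp_avg_sqs, step, lr=1e-3,
                betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01):
    # step starts at 1; all params are assumed to have gradients.
    beta1, beta2 = betas
    # 1) Gradient pre-normalization: divide every gradient by the global L2 norm.
    global_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
    for p, m, v in zip(params, exp_avgs, exp_avg_sqs):
        g = p.grad / (global_norm + 1e-16)
        m.mul_(beta1).add_(g, alpha=1 - beta1)           # first moment
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)    # second moment
        # 2) Bias correction of the moment estimates (as in Adam).
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        update = m_hat / (v_hat.sqrt() + eps) + weight_decay * p
        # LAMB trust ratio: scale the step by ||w|| / ||update||.
        w_norm, u_norm = p.norm(), update.norm()
        trust = (w_norm / u_norm) if w_norm > 0 and u_norm > 0 else 1.0
        p.add_(update, alpha=-lr * float(trust))

model = torch.nn.Linear(16, 4)
model(torch.randn(8, 16)).sum().backward()
params = list(model.parameters())
m_state = [torch.zeros_like(p) for p in params]
v_state = [torch.zeros_like(p) for p in params]
nvlamb_step(params, m_state, v_state, step=1)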
Overlap I/O With Computations
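One common way to do this in PyTorch, sketched under the assumption of a CUDA device and a placeholder dataset: background DataLoader workers prepare the next batch while the GPU computes, and pinned memory plus non_blocking copies let host-to-device transfers overlap with kernels already in flight.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 128), torch.randint(0, 10, (10000,)))
# num_workers > 0: batches are prepared in background processes while the GPU trains.
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

model = torch.nn.Linear(128, 10).cuda()
for x, y in loader:
    # non_blocking copies from pinned host memory overlap with GPU compute.
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()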
Fuse Kernels
gelu(x) = a * x * (1 + tanh(b * (x + c * x^3)))
Kernel 1: result = x^3
Kernel 2: result = c * result
Kernel 3: result = x + result
Kernel 4: result = b * result
Kernel 5: result = tanh(result) + 1
Kernel 6: result = x * result
Kernel 7: result = a * result
For reduced kernel launch overhead and increased memory locality
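A hedged sketch of one way to get this elementwise chain fused into a single kernel, via TorchScript's fuser; the constants assume a, b, c are the usual tanh-GELU values (0.5, sqrt(2/pi), 0.044715), which the slide does not spell out.

import torch

@torch.jit.script
def fused_gelu(x: torch.Tensor) -> torch.Tensor:
    # Assuming a = 0.5, b = sqrt(2/pi) ~ 0.7978845608, c = 0.044715.
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * x * x * x)))

x = torch.randn(1024, device="cuda" if torch.cuda.is_available() else "cpu")
y = fused_gelu(x)  # the elementwise ops are compiled into far fewer kernels on GPU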
Single GPU (T4) Optimization
[Chart: training throughput for Non Optimized vs. FP16 + Tensor Core (3x speedup) vs. FP16 + Tensor Core + Kernel Fusion (3.75x speedup).]
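The FP16 + Tensor Core path is typically enabled in PyTorch with automatic mixed precision; a minimal sketch, assuming a CUDA device and a placeholder model:

import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(5):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward in mixed precision (FP16 on Tensor Cores)
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)               # unscales gradients, then steps the optimizer
    scaler.update()                      # adapts the loss scale over time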
Scale to multiple GPUs
Under the hood
Reductions After Back-Propagation
[Diagram: forward and backward run on the main stream; the NCCL all-reduce runs on a separate all-reduce stream only after back-propagation finishes, followed by a synchronization barrier and the weight update.]
Reductions Overlap with Back-Propagation
[Diagram: the all-reduce runs on its own stream concurrently with back-propagation on the main stream, so communication is hidden and only the weight update remains once backward finishes.]
Reductions With DDP
The easiest option is to use DistributedDataParallel:
from torch.nn.parallel import DistributedDataParallel as DDP
model = DDP(model, device_ids=[args.local_rank],
            output_device=args.local_rank)
It automatically performs gradient reductions across ranks during back-propagation.
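For context, a slightly fuller (illustrative) setup around that snippet, assuming one process per GPU launched with torchrun and a placeholder model:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
dist.init_process_group(backend="nccl")      # NCCL backend for GPU all-reduce
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model, device_ids=[local_rank], output_device=local_rank)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

x = torch.randn(32, 1024, device="cuda")
optimizer.zero_grad()
ddp_model(x).sum().backward()                # DDP all-reduces gradients during backward
optimizer.step()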
Slow Interconnect bottlenecks
[Diagram: with a slow interconnect, the NCCL all-reduce on the all-reduce stream outlasts back-propagation on the main stream, so the synchronization barrier delays the weight update.]
Gradient Accumulation
[Diagram: before gradient accumulation, every forward/backward pass on the main stream is followed by an all-reduce and weight update; with gradient accumulation by a step of 2, two forward/backward passes accumulate gradients locally and the all-reduce and weight update run only once per two steps.]
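A minimal sketch of gradient accumulation by a step of 2 with DDP, using no_sync() to skip the all-reduce on intermediate micro-batches; ddp_model and optimizer are assumed from the DDP sketch above, and the data is a placeholder.

import contextlib
import torch

accumulation_steps = 2
batches = [(torch.randn(32, 1024, device="cuda"),
            torch.randn(32, 1024, device="cuda")) for _ in range(4)]

for step, (x, y) in enumerate(batches):
    last_micro_batch = (step + 1) % accumulation_steps == 0
    # no_sync() suppresses the gradient all-reduce for this backward pass.
    ctx = contextlib.nullcontext() if last_micro_batch else ddp_model.no_sync()
    with ctx:
        loss = torch.nn.functional.mse_loss(ddp_model(x), y) / accumulation_steps
        loss.backward()                  # gradients accumulate locally
    if last_micro_batch:
        optimizer.step()                 # the all-reduce ran on this last backward
        optimizer.zero_grad()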
Inefficient Input Pipeline: Single Input File
Efficient Input Pipeline: One Shard Per GPU (Shard 1, Shard 2, Shard 3, Shard 4)
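One way to get one shard per GPU in PyTorch is DistributedSampler, which hands each rank a disjoint slice of the dataset; a sketch, assuming the process group from the DDP sketch above is already initialized and using a placeholder dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(10000, 128), torch.randint(0, 10, (10000,)))
sampler = DistributedSampler(dataset, shuffle=True)   # rank and world size come from the process group
loader = DataLoader(dataset, batch_size=256, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(2):
    sampler.set_epoch(epoch)   # reshuffle so each rank sees a different shard order per epoch
    for x, y in loader:
        pass                   # training step on this rank's shard goes here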
Scale to multiple nodes
Multi-node Requirements
System Design: everything that goes into designing a single node with multiple GPUs - PCIe vs NVLink, GPU-to-CPU ratio, GPU-to-NIC ratio
Data Center Management: everything that goes into designing a cluster - inter-node connectivity, the software stack run on the cluster (SLURM, K8s), network storage
Algorithm Optimizations: the optimizations we can do within a single GPU - e.g. kernel fusions, AMP, data sharding, optimizer choice
Together, these determine Multi-Node Success.
Summary
NVIDIA’s Optimized Deep Learning Examples
Open sourced at NVIDIA Deep Learning Examples in multiple frameworks.
Qualified to run with NGC containers
Configurations for convergence at a variety of scales, from 8 to 1500 GPUs
BERT in TensorFlow and PyTorch:
Software Stack for Multiple Nodes
Qualified end-to-end implementations
Contributors
Chris Forster, Sharath TS, Thor Johnsen, Purnendu Mukherjee
You can replicate the Multi-Node results in a less resource-intensive environment as well.
Thank you!
Q & A