1 of 33

Distributed Large Batch Training

Swetha Mandava

2 of 33

Deep Learning Models Increasing in Complexity

Next-Level Use-Cases Require Gigantic Models

  • AlexNet (UToronto, 2012): 62M parameters
  • ResNet-152 (Microsoft Research, 2015): 60.3M parameters
  • GPT (OpenAI, 2018): 110M parameters
  • BERT (Google, 2018): 340M parameters
  • Project Megatron (NVIDIA, 2019): 8.3B parameters
  • Turing-NLG (Microsoft, 2020): 17B parameters
  • GPT-3 (OpenAI, 2020): 175B parameters

3 of 33

Pain Points

DL model training:

  1. Time consuming: can take days, weeks…
  2. Capability is limited not only by hardware capacity but also by algorithmic capacity

Scaling is necessary but hard

4 of 33

Agenda

Deep Learning at Scale

  • Scaling NCF on a single GPU
  • Scaling BERT with multiple GPUs and multiple nodes

Desired outcomes

  1. Understand when and how to scale
  2. Know the typical concepts to apply
  3. Incorporate low-effort techniques into your stack

5 of 33

Optimizations for Scaling

  • Convergence and Stability
    • Warmup
    • Linear Scaling Rule (warmup and linear scaling are sketched after this list)
    • LARS

  • Computation and Scaling efficiency
    • Automatic Mixed Precision
    • Efficient Data Pipeline
    • Fusing Kernels
  • Hardware limits on Dataset and Model Size
    • Data Parallelism
    • Model Parallelism
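As a rough illustration of the warmup and linear-scaling items above, here is a minimal PyTorch sketch; the model, batch sizes, learning rate, and step counts are placeholders, not values from the NCF or BERT experiments.

import torch

base_lr = 0.1          # LR known to work at the reference batch size
ref_batch_size = 256
batch_size = 4096      # large-batch setting
warmup_steps = 1000

# Linear scaling rule: scale the LR proportionally to the batch size increase.
scaled_lr = base_lr * batch_size / ref_batch_size

model = torch.nn.Linear(1024, 1024)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

def warmup_factor(step):
    # Ramp the LR linearly from 0 to scaled_lr over warmup_steps, then hold it
    # (a decay schedule would normally follow in a real run).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)
# scheduler.step() is called once per optimizer step during training.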

6 of 33

Recommender System:

Neural Collaborative Filtering

7 of 33

Learnings

Productivity matters: teams with better tools and scaling can try out more ideas

  • SGD: BS = 4096, ~1230 sec
  • + LR Scaling: BS *= 16, ~140 sec
  • + Warmup: BS *= 192, ~100 sec
  • + LARS: BS *= 192, ~110 sec
  • + AMP: BS *= 192, ~55 sec


8 of 33

Q & A

Before we move on to multi-GPU/multi-node...

9 of 33

Multi-Node Experiment

  • Used 1,472 V100 GPUs in a scale-out experiment.

  • BERT-Large trained in 47 minutes.

10 of 33

Multi-node Requirements


Multi-node success rests on three pillars:

  • System Design: everything that goes into designing a single node with multiple GPUs - PCIe vs NVLink, GPU-to-CPU ratio, GPU-to-NIC ratio
  • Data Center Management: everything that goes into designing a cluster - internode connectivity, the software stack that runs on the cluster (SLURM, K8s), network storage
  • Algorithm Optimizations: optimizations that we could do within a single GPU - e.g., kernel fusion, AMP, data sharding, optimizer choice


11 of 33

LAMB - extension of LARS

  1. LARS performs poorly for attention models like BERT
  2. The Layer-wise Adaptive Moments Based (LAMB) optimizer can be seen as the application of LARS to the AdamW optimizer, which adds a per-weight normalization with respect to the square root of the second moment when computing the update; a sketch of the resulting update appears below
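For reference, a hedged sketch of the per-layer LAMB update in plain notation (bias-correction terms omitted; this follows the LAMB paper rather than any particular implementation):

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
r_t = m_t / (sqrt(v_t) + eps)
w_{t+1} = w_t - lr * ( ||w_t|| / ||r_t + wd * w_t|| ) * (r_t + wd * w_t)

Here g_t is the layer's gradient, wd is the decoupled weight decay, and the ratio ||w_t|| / ||r_t + wd * w_t|| is the layer-wise trust ratio that LARS-style methods introduce.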

11

12 of 33

NVLAMB

  • Gradient pre-normalization step such that gradients on the entire model combined (all individual layers / weight matrices) are unit L2 norm.
  • Beneficial in large batch settings where the direction of the gradient is largely preserved (pre-normalization is sketched below)

Gradient Pre-normalization + Bias Correction

  • Initializing the moving averages m and v to zero has an implicit bias of (1 – β1) and (1 – β2) on the subsequent gradients
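A minimal sketch of the pre-normalization step, assuming a standard PyTorch training loop in which loss.backward() has already populated the gradients (the function name is illustrative):

import torch

def prenormalize_gradients(model, eps=1e-16):
    # Global L2 norm over all parameter gradients combined.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    global_norm = torch.sqrt(sum(torch.sum(g * g) for g in grads))
    # Rescale so the combined gradient has unit L2 norm.
    for g in grads:
        g.div_(global_norm + eps)

# Usage: loss.backward(); prenormalize_gradients(model); optimizer.step()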


13 of 33

Overlap I/O With Computations
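One common way to overlap input I/O with GPU compute in PyTorch (a generic sketch, not the exact BERT pipeline; assumes a CUDA device is available): worker processes prepare batches in pinned host memory while asynchronous copies feed the GPU.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 128), torch.randint(0, 2, (10000,)))
# num_workers prepares future batches on the CPU while the GPU computes;
# pin_memory enables asynchronous host-to-device copies.
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

device = torch.device("cuda")
model = torch.nn.Linear(128, 2).to(device)

for x, y in loader:
    # non_blocking=True lets the copy overlap with kernels already queued on the GPU.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()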

14 of 33

Fuse Kernels

gelu(x) = a * x * (1 + tanh(b * (x + c * x ^ 3) ) )

Kernel 1: result = x^3

Kernel 2: result = c * result

Kernel 3: result = x + result

Kernel 4: result = b * result

Kernel 5: result = tanh(result) + 1

Kernel 6: result = x * result

Kernel 7: result = a * result

For reduced kernel launch overhead and increased memory locality
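As an illustration (not NVIDIA's actual fused implementation), TorchScript can compile the elementwise chain above into far fewer kernel launches; the constants correspond to a = 0.5, b = sqrt(2/pi), c = 0.044715 in the tanh GELU approximation.

import torch

@torch.jit.script
def fused_gelu(x):
    # A single scripted function: the fuser can combine this elementwise chain
    # instead of launching seven separate kernels.
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))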


15 of 33

Single GPU (T4) Optimization

[Chart: single-GPU (T4) performance for three configurations - Non-Optimized (baseline), FP16 + Tensor Cores (3x speedup), and FP16 + Tensor Cores + Kernel Fusion (3.75x speedup).]

16 of 33

Scale to multiple GPUs

  • Multiple GPUs
  • Data parallel training

Under the hood

  • Allreduce algorithm
  • NCCL: NVIDIA Collective Communication Library (initialization sketched below)
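A hedged sketch of the usual PyTorch setup for NCCL-backed data-parallel training, assuming one process per GPU and a launcher (e.g. torchrun) that sets the standard environment variables:

import os
import torch
import torch.distributed as dist

# One process per GPU; the launcher provides LOCAL_RANK, RANK, WORLD_SIZE, etc.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL handles the all-reduce of gradients across ranks.
dist.init_process_group(backend="nccl")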

17 of 33

Reductions After Back-Propagation

[Timeline: forward and backward run on the main stream; the NCCL all-reduce runs on a separate stream only after backward finishes, followed by a synchronization barrier and the weight update.]

18 of 33

Reductions Overlap with Back-Propagation

[Timeline: the all-reduce stream overlaps with back-propagation, reducing gradients as they become ready, so only the weight update waits for communication.]

19 of 33

Reductions With DDP

The easiest option is to use DistributedDataParallel, i.e.

from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(model, device_ids=[args.local_rank],
            output_device=args.local_rank)

Automatically performs reductions across ranks during back-propagation.

  • Reductions overlap with back propagation computations
  • Remember that reductions are done with FP16 math when AMP opt_level >= O2. This can exacerbate problems with numerical instability
  • Call torch.cuda.set_device(args.local_rank) early in each process to avoid accidentally creating extra contexts on other GPUs.
  • When grads are communicated, all processes must wait for the slowest. Load-balance work across processes.


20 of 33

Slow Interconnect Bottlenecks

[Timeline: with a slow interconnect, the NCCL all-reduce dominates - the backward pass finishes but the weight update stalls at the synchronization barrier waiting for communication.]

21 of 33

Gradient Accumulation

  • Gradient accumulation reduces the number of times the model needs to communicate: gradients are accumulated locally for several iterations and communicated once (see the sketch below).
  • Gradient accumulation effectively enlarges the batch size.
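A minimal sketch of gradient accumulation with DDP, assuming model is already wrapped in DistributedDataParallel and loader, criterion, and optimizer come from the earlier setup (the accumulation factor is illustrative):

import contextlib

accum_steps = 2
for step, (x, y) in enumerate(loader):
    # On non-boundary iterations, DDP's no_sync() skips the all-reduce so
    # gradients simply accumulate locally.
    sync_ctx = contextlib.nullcontext() if (step + 1) % accum_steps == 0 else model.no_sync()
    with sync_ctx:
        loss = criterion(model(x), y) / accum_steps  # average over the accumulation window
        loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()        # the all-reduce happened on this iteration's backward
        optimizer.zero_grad()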

[Timeline: several forward/backward iterations run back-to-back on the main stream; the all-reduce and weight update happen only once per accumulation window.]

22 of 33

Before Gradient Accumulation

23 of 33

Gradient Accumulation - with an accumulation step of 2

  • Use gradient accumulation to reduce synchronization frequency
  • Reduced time for both intra-node communication and inter-node communication.
  • Gained additional speedup from skipping weight updates

24 of 33

Inefficient Input Pipeline: Single Input File

[Diagram: every GPU reads from a single input file.]

25 of 33

Efficient Input Pipeline: One Shard Per GPU

[Diagram: the dataset is split into shards, with one shard per GPU.]
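A hedged sketch of per-rank sharding in PyTorch using DistributedSampler (one convenient way to get the one-shard-per-GPU behavior; it assumes the process group from the earlier setup is initialized, and the dataset here is a placeholder):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(100000, 128))
# Each rank iterates over a disjoint 1/world_size slice of the dataset.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=256, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(3):
    sampler.set_epoch(epoch)   # reshuffle the shards differently each epoch
    for (x,) in loader:
        pass                   # training step goes here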

26 of 33

Scale to multiple nodes

  • Ensure effective inter-node communication
  • Move data close to compute
  • Develop the model in a container to facilitate scaling out and to use the latest GPU libraries
  • Consider full application & system software stack

27 of 33

Multi-node Requirements


Multi-node success rests on three pillars:

  • System Design: everything that goes into designing a single node with multiple GPUs - PCIe vs NVLink, GPU-to-CPU ratio, GPU-to-NIC ratio
  • Data Center Management: everything that goes into designing a cluster - internode connectivity, the software stack that runs on the cluster (SLURM, K8s), network storage
  • Algorithm Optimizations: every optimization that we could do within a single GPU - e.g., kernel fusion, AMP, data sharding, optimizer choice


28 of 33

Summary

  • Deep Learning models will continue to grow in size, and training them will require massive scale-out.
  • Productivity matters for encouraging a culture of experimentation.

  • Scaling requires careful consideration of multiple aspects. Most techniques require low effort and can easily be incorporated into your stack.

  • NVIDIA provides support for multi-node model training across the entire stack.


29 of 33

NVIDIA’s Optimized Deep Learning Examples

Open sourced at NVIDIA Deep Learning Examples in multiple frameworks.

Qualified to run with NGC containers

Configurations for convergence at a variety of scales, from 8 to 1500 GPUs

BERT in TensorFlow and PyTorch:

  • Allows you to train BERT for yourself, from data prep through to fine-tuning
    • Pre-training dataset drawn from BooksCorpus and Wikipedia
  • Includes pre-trained weights for BERT

Software Stack for Multiple Nodes

  • Enroot and Pyxis set up on a SLURM cluster
  • Developer Blog on how to set up and run Multi-node.

Qualified end-to-end implementations


30 of 33

Contributors

You can replicate the Multi-Node results in a less resource-intensive environment as well.

31 of 33

Thank you!

Chris Forster, Sharath TS, Thor Johnsen, Purnendu Mukherjee

32 of 33

Q & A

33 of 33
