1 of 33

Introduction to Large Language Models System

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Parallelism Overview + Distributed Data Parallelism

2 of 33

Guest Lecture 1: Byson Hsu from xAI

Next Monday (Tentative)

3 of 33


Outline

  • Recap of GPU Programming
  • Parallelism Overview
  • Distributed Data Parallelism

4 of 33


Recap

5 of 33


Recap

Elementwise operation: A = 2 * A - 1

A00 = 2 * A00 - 1
A01 = 2 * A01 - 1
……
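As a reminder of how this maps onto GPU threads, here is a minimal sketch (my own illustration, not from the slides) of the same elementwise update written as a per-thread kernel using Numba's CUDA support; each thread updates one element of A:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale_shift(a):
        # Each thread computes one element: a[i] = 2 * a[i] - 1
        i = cuda.grid(1)              # global thread index
        if i < a.shape[0]:            # guard threads past the end of the array
            a[i] = 2.0 * a[i] - 1.0

    a = np.random.rand(1 << 20).astype(np.float32)
    d_a = cuda.to_device(a)                        # copy A from host memory to GPU memory (HBM)
    threads_per_block = 256
    blocks = (a.shape[0] + threads_per_block - 1) // threads_per_block
    scale_shift[blocks, threads_per_block](d_a)    # launch roughly one thread per element
    a = d_a.copy_to_host()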

6 of 33


Recap

  • I/O from HBM: 400~600 Cycles
  • I/O from L2 Cache: ~50 Cycles
  • Multiply-and-Add: 4 Cycles

7 of 33


Recap

  • GPU Architecture + CUDA API
    • Key Notes for GPUs:
      • For MLSys, the key is to keep GPU cores busy
      • When implementing something, think about its memory access pattern

8 of 33


Why Distributed Training?

  • Faster training
  • Larger Unified Memory

9 of 33


Overview of “Parallelism”

  • Parallel over threads
  • Parallel over GPUs in the same server
  • Parallel over servers

10 of 33


11 of 33


How to partition the workload?

12 of 33


How to partition the workload?

  • Model/Pipeline Parallelism
  • Tensor Parallelism
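To make the two partitioning schemes concrete, here is a minimal PyTorch sketch (my own illustration, assuming a machine with at least two GPUs): pipeline/model parallelism places whole layers on different GPUs, while tensor parallelism splits a single weight matrix across GPUs.

    import torch
    import torch.nn as nn

    # Pipeline/model parallelism: whole layers live on different GPUs.
    stage0 = nn.Linear(1024, 4096).to("cuda:0")    # first stage on GPU 0
    stage1 = nn.Linear(4096, 1024).to("cuda:1")    # second stage on GPU 1

    x = torch.randn(8, 1024, device="cuda:0")
    h = stage0(x)                                  # runs on GPU 0
    y = stage1(h.to("cuda:1"))                     # activations move to GPU 1 for the next stage

    # Tensor parallelism: one weight matrix is split column-wise across GPUs.
    W = torch.randn(1024, 4096)
    W0, W1 = W.chunk(2, dim=1)                     # each GPU holds half of the columns
    W0, W1 = W0.to("cuda:0"), W1.to("cuda:1")

    x = torch.randn(8, 1024)
    y0 = x.to("cuda:0") @ W0                       # partial result on GPU 0: (8, 2048)
    y1 = x.to("cuda:1") @ W1                       # partial result on GPU 1: (8, 2048)
    y = torch.cat([y0.cpu(), y1.cpu()], dim=1)     # concatenate the column shards: (8, 4096)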

13 of 33


Why partition in this way?

14 of 33


Takeaway Notes for Parallelism

  • For Training/Fine-tuning
    • Data Parallelism + Sharding is all you need for training < 10B models
  • For Inference
    • Tensor parallel between GPUs within the server
    • Pipeline parallel over servers

15 of 33


Infrastructure of Llama3

16 of 33


Takeaway Notes for Parallelism

  • For Training
    • Data Parallelism + Sharding is all you need for training < 10B models
  • For Inference
    • Tensor parallel between GPUs within the server
    • Pipeline parallel over servers

17 of 33

Today’s Topic


  • Multi-GPU communication
  • Distributed Data Parallel Training

18 of 33

Distributed Data Parallel


  • Basic Idea:
    • Create replicas of a model on multiple GPUs
    • Each model performs the forward pass and the backward pass independently
    • Synchronize gradients before the optimizer step

19 of 33

Distributed Data Parallel


  • Basic Idea:
    • Create replicas of a model on multiple GPUs
    • Each model performs the forward pass and the backward pass independently
    • Synchronize gradients before the optimizer step

20 of 33

Distributed Data Parallel


  • Basic Idea:
    • Create replicas of a model on multiple GPUs
    • Each model performs the forward pass and the backward pass independently
    • Synchronize gradients before the optimizer step with AllReduce
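A minimal sketch of this idea (assuming a torch.distributed process group has already been initialized, one process per GPU; train_step, loss_fn, and the batch layout are placeholders of my own): every replica runs forward and backward on its local batch, then averages gradients with an AllReduce before stepping the optimizer.

    import torch.distributed as dist

    def train_step(model, optimizer, batch, loss_fn):
        # Forward and backward run independently on each replica's local batch.
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()

        # Synchronize gradients: AllReduce sums them across processes,
        # then dividing by the world size gives the average.
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(world_size)

        optimizer.step()
        optimizer.zero_grad()
        return loss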

21 of 33

Design Goal of DDP

  • Developers should be able to reuse the local training script with minimal modifications.
  • The API needs to allow the implementation to receive various signals and trigger appropriate algorithms. It must also expose as many optimization opportunities as possible to the internal implementation.


22 of 33

Distributed Data Parallel


  • Minimal code change
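As a hedged illustration of how small the change is in PyTorch (build_model is a placeholder for an existing model constructor, and the script is assumed to be launched with torchrun so that LOCAL_RANK is set): wrapping the local model in DistributedDataParallel is essentially the only modification to the training script.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)           # build_model(): placeholder for your model
    model = DDP(model, device_ids=[local_rank])    # the only substantive change

    # The rest of the local training loop stays the same; DDP synchronizes
    # gradients automatically during loss.backward().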

23 of 33

How to Implement Distributed Data Parallel


  • Naïve solution: synchronize gradients after the entire backward pass finishes
    • What can be improved?

24 of 33

Implementing Distributed Data Parallel


  • Naïve solution: synchronize gradients after the entire backward pass finishes
    • We can overlap gradient computation and synchronization!
  • But how often should we synchronize? Per parameter?
    • Too much synchronization slows down execution

25 of 33

Gradient Bucketing


26 of 33

Gradient Bucketing


  • Bucket size can be configured by setting the bucket_cap_mb argument in the DDP constructor.
  • The mapping from parameter gradients to buckets is determined at construction time.
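For example (a sketch; 50 MB is an arbitrary value for illustration, not a recommendation from the slides; the PyTorch default is 25 MB), the bucket size is passed when constructing the DDP wrapper:

    from torch.nn.parallel import DistributedDataParallel as DDP

    # Larger buckets mean fewer AllReduce launches but less overlap with backward;
    # smaller buckets mean more overlap but more communication calls.
    model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)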

27 of 33

Gradient Bucketing


  • Model parameters are allocated into buckets in (roughly) the reverse order of Model.parameters() from the given model.
  • DDP expects gradients to become ready during the backward pass in approximately that order.

28 of 33

Gradient Bucketing


  • When gradients in one bucket are all ready, the Reducer kicks off an asynchronous AllReduce on that bucket to calculate the average of gradients across all processes.
  • Overlapping computation (backward) with communication (AllReduce)
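A simplified sketch of the mechanism (my own per-parameter illustration of the overlap idea, not PyTorch's actual bucketed Reducer; model, loss, and optimizer are assumed from the surrounding training loop, and register_post_accumulate_grad_hook requires a recent PyTorch version): a gradient hook launches an asynchronous AllReduce as soon as a parameter's gradient is ready, and the work handles are waited on before the optimizer step.

    import torch.distributed as dist

    pending = []                                   # async work handles for in-flight AllReduces

    def make_hook(param):
        def hook(*_):
            # Fired during backward, once this parameter's gradient is accumulated.
            work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
            pending.append((work, param))
        return hook

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(make_hook(p))

    loss.backward()                                # hooks overlap communication with backward
    for work, p in pending:
        work.wait()                                # ensure the AllReduce has finished
        p.grad.div_(dist.get_world_size())         # turn the sum into an average
    pending.clear()
    optimizer.step()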

29 of 33

Gradient Reduction


30 of 33

DDP Scalability


31 of 33

DDP Reduces Latency by Overlapping Communication and Computation


32 of 33

Fully Sharded Data Parallel


  • Motivation: Large models cannot fit into one GPU

33 of 33


Takeaway Notes for Parallelism

  • For Training
    • Data Parallelism + Sharding is all you need for training < 10B models
  • For Inference
    • Tensor parallel between GPUs within the server
    • Pipeline parallel over servers