Introduction to Large Language Model Systems
Zirui “Ray” Liu
University of Minnesota, Twin Cities
Overview of Parallelism + Distributed Data Parallelism
Guest Lecture 1: Byron Hsu from xAI
Next Monday (Tentative)
Outline
Recap
A = 2 * A - 1
A_00 = 2 * A_00 - 1
A_01 = 2 * A_01 - 1
...
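In PyTorch the same update is a single elementwise kernel; a minimal sketch (assuming a CUDA GPU is available):

import torch

# Every element's update 2 * A[i][j] - 1 is independent of the others,
# so one kernel launch computes all of them in parallel over GPU threads.
A = torch.rand(1024, 1024, device="cuda")
A = 2 * A - 1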
Why Distributed Training?
Overview of “Parallelism”
Parallel over threads
Parallel over GPUs in the same server
Parallel over servers
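These three levels compose in practice; a small sketch of how each shows up from PyTorch, assuming the script is launched with torchrun so that RANK / WORLD_SIZE / LOCAL_RANK are set (those variable names follow the usual torchrun conventions):

import os
import torch
import torch.distributed as dist

# Parallel over threads: already covered by the recap above -- a single
# elementwise kernel runs on thousands of GPU threads at once.

# Parallel over GPUs in the same server: how many devices this node has.
gpus_per_server = torch.cuda.device_count()

# Parallel over servers: one process per GPU on every machine, joined
# into one global process group.
dist.init_process_group(backend="nccl")
print(f"global rank {dist.get_rank()} of {dist.get_world_size()}, "
      f"local rank {os.environ.get('LOCAL_RANK')}, "
      f"{gpus_per_server} GPUs on this server")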
How to partition the workload?
Model/Pipeline Parallelism
Tensor Parallelism
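A minimal sketch contrasting the two partitioning styles on a toy fully connected layer, assuming two visible GPUs (cuda:0 and cuda:1); real systems such as Megatron-LM add the communication collectives omitted here:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.rand(8, 1024, device="cuda:0")

# Model / pipeline parallelism: partition by layer.
# Layer 1 lives on GPU 0, layer 2 on GPU 1; activations flow between GPUs.
layer1 = nn.Linear(1024, 4096).to("cuda:0")
layer2 = nn.Linear(4096, 1024).to("cuda:1")
h = layer1(x).to("cuda:1")            # hand the activations to the next stage
out_pipeline = layer2(h)

# Tensor parallelism: partition inside a layer.
# Split layer1's weight rows across the two GPUs, compute both halves in
# parallel, then concatenate the partial outputs.
w0, w1 = layer1.weight.chunk(2, dim=0)    # each half is [2048, 1024]
b0, b1 = layer1.bias.chunk(2, dim=0)
part0 = F.linear(x, w0, b0)                                         # on cuda:0
part1 = F.linear(x.to("cuda:1"), w1.to("cuda:1"), b1.to("cuda:1"))  # on cuda:1
out_tensor = torch.cat([part0, part1.to("cuda:0")], dim=-1)         # [8, 4096]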
Why partition in this way?
Takeaway Notes for Parallelism
Infrastructure of Llama3
Today’s Topic
Distributed Data Parallel
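A minimal sketch of data parallelism with PyTorch's DistributedDataParallel, assuming one process per GPU launched via torchrun (e.g. torchrun --nproc_per_node=8 train.py); the tiny Linear model and random data are placeholders:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank keeps a full replica of the model; DDP keeps the replicas
    # in sync by all-reducing gradients during backward.
    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        # Each rank would normally load its own shard of the data
        # (e.g., via DistributedSampler); random data keeps the sketch short.
        x = torch.rand(32, 1024, device=local_rank)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()   # gradient all-reduce is triggered here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()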
Design Goal of DDP
How to Implement Distributed Data Parallel
Naive approach: synchronize (all-reduce) gradients only after the whole backward pass finishes.
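A hedged sketch of this naive scheme (the helper name naive_ddp_step is made up here): every rank completes its full backward pass first, and only then averages each parameter's gradient across ranks, so no communication overlaps with computation.

import torch
import torch.distributed as dist

def naive_ddp_step(model, loss, optimizer):
    optimizer.zero_grad()
    loss.backward()                     # compute all local gradients first

    # Only after the backward pass finishes do we synchronize: average
    # each parameter's gradient across all ranks.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()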
Implementing Distributed Data Parallel
Gradient Bucketing
The bucket size is controlled by the bucket_cap_mb argument in the DDP constructor (default: 25 MB).
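For instance, a small sketch of tuning the bucket size when wrapping the model (process group assumed to be initialized as in the setup sketch above):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Gradients are packed into ~50 MB buckets before each all-reduce instead
# of the 25 MB default; fewer, larger collectives amortize the per-call
# communication overhead.
local_model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(local_model, bucket_cap_mb=50)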
Gradient Reduction
DDP Scalability
DDP Reduces Latency by Overlapping Communication and Computation
Timeline figure: backward computation (bwd) overlapped with gradient communication (comm).
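A hedged sketch of the underlying idea using per-parameter autograd hooks (assumes PyTorch >= 2.1 for register_post_accumulate_grad_hook; DDP's real reducer works per bucket, not per parameter): each gradient's all-reduce is launched asynchronously as soon as that gradient is ready, so communication runs while backward keeps computing earlier layers' gradients.

import torch
import torch.distributed as dist

def attach_overlap_hooks(model):
    # Launch an async all-reduce for each gradient as soon as it has been
    # accumulated into param.grad; backward keeps running in the meantime.
    handles = []

    def hook(param):
        work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        handles.append((work, param))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
    return handles

def finish_grad_sync(handles):
    # Call before optimizer.step(): wait for the outstanding collectives
    # and average the summed gradients.
    world_size = dist.get_world_size()
    for work, param in handles:
        work.wait()
        param.grad /= world_size
    handles.clear()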
Fully Sharded Data Parallel (FSDP)
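A minimal sketch with PyTorch's built-in FSDP wrapper (process group assumed initialized as above): instead of replicating the full model on every rank as DDP does, each rank keeps only a shard of the parameters, gradients, and optimizer state, gathering full parameters on the fly for each layer's forward and backward.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()
fsdp_model = FSDP(model)

x = torch.rand(8, 1024, device="cuda")
loss = fsdp_model(x).sum()
loss.backward()     # gradients are reduce-scattered back into shards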
Takeaway Notes for Parallelism