1 of 33

Introduction to Large Language Models System

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Parallelism Overview + Distributed Data Parallelism

2 of 33

Guest Lecture 1: Byson Hsu from xAI

Next Monday (Tentative)

3 of 33


Outline

  • Recap of GPU Programming
  • Parallelism Overview
  • Distributed Data Parallelism

4 of 33


Recap

5 of 33


Recap

Elementwise operation: A = 2 * A - 1

A00 = 2 * A00 - 1
A01 = 2 * A01 - 1
……
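As a reminder of how this maps onto GPU threads, here is a minimal sketch (my own illustration, not from the slides) of the same elementwise update written as a per-thread kernel using Numba's CUDA support; each thread updates one element of A:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale_shift(a):
        # Each thread computes one element: a[i] = 2 * a[i] - 1
        i = cuda.grid(1)              # global thread index
        if i < a.shape[0]:            # guard threads past the end of the array
            a[i] = 2.0 * a[i] - 1.0

    a = np.random.rand(1 << 20).astype(np.float32)
    d_a = cuda.to_device(a)                        # copy A from host memory to GPU memory (HBM)
    threads_per_block = 256
    blocks = (a.shape[0] + threads_per_block - 1) // threads_per_block
    scale_shift[blocks, threads_per_block](d_a)    # launch roughly one thread per element
    a = d_a.copy_to_host()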

6 of 33


Recap

  • I/O from HBM: 400~600 Cycles
  • I/O from L2 Cache: ~50 Cycles
  • Multiply-and-Add: 4 Cycles

7 of 33


Recap

  • GPU Architecture + CUDA API
    • Key Notes for GPUs:
      • For MLSys, the key is to keep GPU cores busy
      • When implementing something, think about its memory access pattern

8 of 33


Why Distributed Training?

  • Faster training
  • Larger Unified Memory

9 of 33


Overview of “Parallelism”

  • Parallel over threads
  • Parallel over GPUs in the same server
  • Parallel over servers

10 of 33


11 of 33


How to partition the workload?

12 of 33


How to partition the workload?

  • Model/Pipeline Parallelism
  • Tensor Parallelism
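To make the two partitioning schemes concrete, here is a minimal PyTorch sketch (my own illustration, assuming a machine with at least two GPUs): pipeline/model parallelism places whole layers on different GPUs, while tensor parallelism splits a single weight matrix across GPUs.

    import torch
    import torch.nn as nn

    # Pipeline/model parallelism: whole layers live on different GPUs.
    stage0 = nn.Linear(1024, 4096).to("cuda:0")    # first stage on GPU 0
    stage1 = nn.Linear(4096, 1024).to("cuda:1")    # second stage on GPU 1

    x = torch.randn(8, 1024, device="cuda:0")
    h = stage0(x)                                  # runs on GPU 0
    y = stage1(h.to("cuda:1"))                     # activations move to GPU 1 for the next stage

    # Tensor parallelism: one weight matrix is split column-wise across GPUs.
    W = torch.randn(1024, 4096)
    W0, W1 = W.chunk(2, dim=1)                     # each GPU holds half of the columns
    W0, W1 = W0.to("cuda:0"), W1.to("cuda:1")

    x = torch.randn(8, 1024)
    y0 = x.to("cuda:0") @ W0                       # partial result on GPU 0: (8, 2048)
    y1 = x.to("cuda:1") @ W1                       # partial result on GPU 1: (8, 2048)
    y = torch.cat([y0.cpu(), y1.cpu()], dim=1)     # concatenate the column shards: (8, 4096)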

13 of 33


Why partition in this way?

14 of 33


Takeaway Notes for Parallelism

  • For Training/Fine-tuning
    • Data Parallelism + Sharding is all you need for training < 10B models
  • For Inference
    • Tensor parallel between GPUs within the server
    • Pipeline parallel over servers

15 of 33


Infrastructure of Llama3

16 of 33


Takeaway Notes for Parallelism

  • For Training
    • Data Parallelism + Sharding is all you need for training < 10B models
  • For Inference
    • Tensor parallel between GPUs within the server
    • Pipeline parallel over servers

17 of 33

Today’s Topic


  • Multi-GPU communication
  • Distributed Data Parallel Training

18 of 33

Distributed Data Parallel


  • Basic Idea:
    • Create replicas of a model on multiple GPUs
    • Each model performs the forward pass and the backward pass independently
    • Synchronize gradients before the optimizer step

19 of 33

Distributed Data Parallel


  • Basic Idea:
    • Create replicas of a model on multiple GPUs
    • Each model performs the forward pass and the backward pass independently
    • Synchronize gradients before the optimizer step

20 of 33

Distributed Data Parallel


  • Basic Idea:
    • Create replicas of a model on multiple GPUs
    • Each model performs the forward pass and the backward pass independently
    • Synchronize gradients before the optimizer step with AllReduce
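A minimal sketch of this idea (assuming a torch.distributed process group has already been initialized, one process per GPU; train_step, loss_fn, and the batch layout are placeholders of my own): every replica runs forward and backward on its local batch, then averages gradients with an AllReduce before stepping the optimizer.

    import torch.distributed as dist

    def train_step(model, optimizer, batch, loss_fn):
        # Forward and backward run independently on each replica's local batch.
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()

        # Synchronize gradients: AllReduce sums them across processes,
        # then dividing by the world size gives the average.
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(world_size)

        optimizer.step()
        optimizer.zero_grad()
        return loss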

21 of 33

Design Goal of DDP

  • Developers should be able to reuse the local training script with minimal modifications.
  • The API needs to allow the implementation to receive various signals and trigger appropriate algorithms. It must also expose as many optimization opportunities as possible to the internal implementation.


22 of 33

Distributed Data Parallel


  • Minimal code change
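As a hedged illustration of how small the change is in PyTorch (build_model is a placeholder for an existing model constructor, and the script is assumed to be launched with torchrun so that LOCAL_RANK is set): wrapping the local model in DistributedDataParallel is essentially the only modification to the training script.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)           # build_model(): placeholder for your model
    model = DDP(model, device_ids=[local_rank])    # the only substantive change

    # The rest of the local training loop stays the same; DDP synchronizes
    # gradients automatically during loss.backward().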

23 of 33

How to Implement Distributed Data Parallel


  • Naïve solution: synchronize gradients after the entire backward pass finishes
    • What can be improved?

24 of 33

Implementing Distributed Data Parallel


  • Naïve solution: synchronize gradients after the entire backward pass finishes
    • We can overlap gradient computation and synchronization!
  • But how often should we synchronize? Per parameter?
    • Too much synchronization slows down execution

25 of 33

Gradient Bucketing


26 of 33

Gradient Bucketing


  • Bucket size can be configured by setting the bucket_cap_mb argument in the DDP constructor.
  • The mapping from parameter gradients to buckets is determined at construction time.
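For example (a sketch; 50 MB is an arbitrary value for illustration, not a recommendation from the slides; the PyTorch default is 25 MB), the bucket size is passed when constructing the DDP wrapper:

    from torch.nn.parallel import DistributedDataParallel as DDP

    # Larger buckets mean fewer AllReduce launches but less overlap with backward;
    # smaller buckets mean more overlap but more communication calls.
    model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)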

27 of 33

Gradient Bucketing


  • Model parameters are allocated into buckets in (roughly) the reverse order of Model.parameters() from the given model.
  • DDP expects gradients to become ready during the backward pass in approximately that order.

28 of 33

Gradient Bucketing


  • When gradients in one bucket are all ready, the Reducer kicks off an asynchronous AllReduce on that bucket to calculate the average of gradients across all processes.
  • Overlapping computation (backward) with communication (AllReduce)
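A simplified sketch of the mechanism (my own per-parameter illustration of the overlap idea, not PyTorch's actual bucketed Reducer; model, loss, and optimizer are assumed from the surrounding training loop, and register_post_accumulate_grad_hook requires a recent PyTorch version): a gradient hook launches an asynchronous AllReduce as soon as a parameter's gradient is ready, and the work handles are waited on before the optimizer step.

    import torch.distributed as dist

    pending = []                                   # async work handles for in-flight AllReduces

    def make_hook(param):
        def hook(*_):
            # Fired during backward, once this parameter's gradient is accumulated.
            work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
            pending.append((work, param))
        return hook

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(make_hook(p))

    loss.backward()                                # hooks overlap communication with backward
    for work, p in pending:
        work.wait()                                # ensure the AllReduce has finished
        p.grad.div_(dist.get_world_size())         # turn the sum into an average
    pending.clear()
    optimizer.step()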

29 of 33

Gradient Reduction


30 of 33

DDP Scalability


31 of 33

DDP Reduces Latency by Overlapping Communication and Computation


32 of 33

Fully Sharded Data Parallel


  • Motivation: Large models cannot fit into one GPU

33 of 33


Takeaway Notes for Parallelism

  • For Training
    • Data Parallelism + Sharding is all you need for training < 10B models
  • For Inference
    • Tensor parallel between GPUs within the server
    • Pipeline parallel over servers