1 of 87

Pretraining Large Language Models

Leandro von Werra

2 of 87

Plan for today

  1. State of LLMs
  2. Scaling Laws
  3. Datasets
  4. Distributed Training

3 of 87

State of LLMs

scaling large and smol

4 of 87

LLM families

closed model APIs: model weights not available

  • can’t run the model locally
  • no access to the model’s internals
  • limits fine-tuning abilities

open model weights: no access to training data or code

  • whose data is in the dataset?
  • can’t remove data on request
  • benchmark contamination
  • limits scientific reproducibility

fully open models: full access to model/code/data

  • competitive edge
  • liability issues
  • maintenance

5 of 87

Trends: train longer

6 of 87

Trends: train larger

7 of 87

Trends: more context

8 of 87

Trends: more compute

compute ≈ data × model size

         Dataset (billion tokens)   Model size (billion parameters)   Compute vs. previous gen
GPT-1    1-2                        0.11                              —
GPT-2    10-20                      1.4                               ~100x
GPT-3    300                        175                               ~2,000x
GPT-4    ~10,000                    ~1,800                            ~300x

GPT-4 cost: ~$100M

9 of 87

Trends: more compute

10 of 87

Trends: smol models

11 of 87

Trends: why? Scaling Laws!

Can we extrapolate to the performance of …

  • a larger model
  • more data
  • more compute?

12 of 87

Scaling Laws

predictable scaling returns

13 of 87

Scaling laws: Predictable returns

Model size

Compute

Data

Loss

https://arxiv.org/abs/2001.08361
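A rough sketch of what “predictable returns” means in practice: fit a power law to a handful of small runs and extrapolate to a bigger compute budget. The (compute, loss) points below are made up for illustration.

```python
# Illustrative sketch: fit a power law L(C) = a * C**(-alpha) to small runs
# and extrapolate the loss of a larger run. The data points are fake.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs of small runs
loss    = np.array([3.80, 3.30, 2.87, 2.50])   # final loss of each run (made up)

# A power law is a straight line in log-log space: log L = log a - alpha * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)

c_big = 1e23   # extrapolate to a 100x larger compute budget
print(f"alpha ≈ {alpha:.3f}, predicted loss at {c_big:.0e} FLOPs ≈ {a * c_big**-alpha:.2f}")
```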

14 of 87

Scaling laws: Compute optimal

Compute

Compute Budget

Too small: loss already flattened out

Optimal: lowest loss at current compute budget

Too large: not yet through steep loss zone
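A minimal sizing sketch under the usual rules of thumb (training FLOPs ≈ 6·N·D, and the Chinchilla ratio of roughly 20 tokens per parameter); the budgets are illustrative.

```python
# Rough compute-optimal sizing sketch using C ≈ 6*N*D and D ≈ 20*N.
def compute_optimal(C, tokens_per_param=20):
    """Return (params N, tokens D) that roughly minimize loss for budget C."""
    N = (C / (6 * tokens_per_param)) ** 0.5   # C = 6*N*D and D = 20*N  =>  N = sqrt(C / 120)
    D = tokens_per_param * N
    return N, D

for C in [1e21, 1e23, 1e25]:
    N, D = compute_optimal(C)
    print(f"C={C:.0e} FLOPs -> ~{N/1e9:.1f}B params trained on ~{D/1e9:.0f}B tokens")
```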

15 of 87

Scaling laws: Downstream performance

https://arxiv.org/abs/2303.08774

16 of 87

Scaling laws: Chinchilla fix

17 of 87

Scaling laws: Chinchilla fix

WAT?!

Llama-3 8B trained on 15T tokens

18 of 87

Scaling laws: Inference

Chinchilla-optimal models are only training-compute optimal: they ignore inference compute
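A back-of-the-envelope sketch of that argument, using the common approximations of ~6·N·D training FLOPs and ~2·N FLOPs per generated token; the serving volume is a made-up assumption.

```python
# Total compute = training + inference (illustrative numbers).
def total_flops(n_params, n_train_tokens, n_served_tokens):
    return 6 * n_params * n_train_tokens + 2 * n_params * n_served_tokens

served = 1e13  # 10T tokens served over the model's lifetime (assumed figure)

# Chinchilla-optimal-ish: 70B params on 1.4T tokens
big = total_flops(70e9, 1.4e12, served)
# Over-trained small model: 8B params on 15T tokens (Llama-3-8B-style data budget)
small = total_flops(8e9, 15e12, served)

print(f"70B / 1.4T total: {big:.2e} FLOPs")
print(f" 8B / 15T  total: {small:.2e} FLOPs")   # cheaper overall at this serving volume
```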

19 of 87

Scaling laws: Harm’s law

20 of 87

Dataset

aka the secret sauce

aka 90% of all the work

21 of 87

Dataset: the secret workhorse of LLMs

https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/

22 of 87

Dataset: goal of pretraining

Train a general-purpose model → maximal coverage

Requires:

  • train on massive quantities of text: at least 1 trillion tokens, nowadays moving towards 10-20T

Challenges:

  • maximizing diversity and coverage
  • maximizing quality and robustness
  • data quality evaluation: how to measure data quality at the billion-token scale

23 of 87

Dataset: where to find data

  • Very large (> 100B tokens):
    • Common Crawl: everyone starts from here
    • Code: GitHub and Software Heritage

  • Curated:
    • Websites: Wikipedia, StackExchange, Arxiv
    • Books: public-domain vs. copyrighted

  • More recent trends
    • Synthetic data

24 of 87

Dataset: FineWeb

  • Based on CommonCrawl
  • 44TB disk space
  • 15T tokens
  • Transparent pipeline

https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1

25 of 87

Dataset: the average web

… is mostly garbage:

  • Ads/SEO
  • Obituaries
  • Porn
  • Sport News

If we want a high quality model we need to clean it up!

26 of 87

Dataset: filtering pipeline

27 of 87

Dataset: filtering pipeline

28 of 87

Dataset: general advice

  • Gather as much data as possible

  • Filter as much as necessary

  • Look at the data you keep (and throw away)

(manually, clustering, tokenizing etc)

  • Don’t trust your intuition → evaluate!

29 of 87

Dataset: language filtering
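A minimal sketch of language filtering with fastText’s language-ID model (assumes lid.176.bin has been downloaded from fasttext.cc; the 0.65 confidence threshold is illustrative).

```python
# Language filtering with fastText's language identification model.
import fasttext

lid = fasttext.load_model("lid.176.bin")   # download from fasttext.cc

def keep_english(text: str, threshold: float = 0.65) -> bool:
    # fastText expects a single line of text
    labels, probs = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

docs = ["The quick brown fox jumps over the lazy dog.",
        "Der schnelle braune Fuchs springt über den faulen Hund."]
print([keep_english(d) for d in docs])   # -> [True, False]
```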

30 of 87

Dataset: quality heuristics

31 of 87

Dataset: quality heuristics

Advantages:

  • controlled
  • robust
  • deterministic
  • rather clear priors

Drawbacks:

  • rely entirely on surface-level signals
  • danger of removing too much
  • hyper-parameter tuning
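A sketch of what such surface-level heuristics look like in code; the rules and thresholds below are illustrative, not the exact FineWeb ones.

```python
# Illustrative Gopher/FineWeb-style surface heuristics (thresholds are made up).
import re

def passes_heuristics(text: str) -> bool:
    words = text.split()
    if not (50 <= len(words) <= 100_000):           # too short / absurdly long
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):              # gibberish or symbol dumps
        return False
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and sum(l.strip().endswith("...") for l in lines) / len(lines) > 0.3:
        return False                                # many truncated lines (SEO / listing pages)
    symbols = len(re.findall(r"[#{}<>|\\]", text))
    if symbols / max(len(text), 1) > 0.1:           # too much markup-like content
        return False
    return True

print(passes_heuristics("word " * 200))   # -> True (toy example)
```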

32 of 87

Dataset: quality filtering - ML

Given a set of examples of good/bad documents:

  • classifier-based quality filtering: fastText classification with an n-gram size of 2
  • perplexity-based filtering: 5-gram Kneser-Ney model on Wikipedia

(see https://github.com/kpu/kenlm)

→ Filter based on a threshold

  • more “quality/content-based” filtering
  • harder to estimate the impact → may introduce “bias”
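A sketch of the two filters above; the model paths, label names and thresholds are assumptions.

```python
# ML-based quality filtering: KenLM perplexity + fastText classifier (models assumed to exist).
import kenlm      # https://github.com/kpu/kenlm
import fasttext

lm = kenlm.Model("wikipedia.5gram.binary")           # assumed pre-built Kneser-Ney model
clf = fasttext.load_model("quality_classifier.bin")  # assumed fastText model trained on good/bad docs

def keep(text: str, max_ppl: float = 1000.0, min_score: float = 0.5) -> bool:
    one_line = text.replace("\n", " ")
    if lm.perplexity(one_line) > max_ppl:            # far from Wikipedia-like text
        return False
    labels, probs = clf.predict(one_line)
    return labels[0] == "__label__good" and probs[0] >= min_score

print(keep("The quick brown fox jumps over the lazy dog."))
```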

33 of 87

Dataset: FineWeb-Edu

Below is an extract from a web page. Evaluate whether the page has a high educational value and could be useful in an educational setting for teaching from primary school to grade school levels using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

  • Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and promotional material.

  • Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style.

  • Award a third point if the extract is appropriate for educational use and introduces key concepts relevant to school curricula. It is coherent though it may not be comprehensive or could include some extraneous information. It may resemble an introductory section of a textbook or a basic tutorial that is suitable for learning but has notable limitations like treating concepts that are too complex for grade school students.

  • Grant a fourth point if the extract is highly relevant and beneficial for educational purposes for a level not higher than grade school, exhibiting a clear and consistent writing style. It could be similar to a chapter from a textbook or a tutorial, offering substantial educational content, including exercises and solutions, with minimal irrelevant information, and the concepts aren't too advanced for grade school students. The content is coherent, focused, and valuable for structured learning.

  • Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for teaching either at primary school or grade school. It follows detailed reasoning, the writing style is easy to follow and offers profound and thorough insights into the subject matter, devoid of any non-educational or complex content.

The extract:

<EXAMPLE>.

After examining the extract:

  • Briefly justify your total score, up to 100 words.
  • Conclude with the score using the format: "Educational score: <total points>"

Pipeline: Annotate 500K samples with Llama 3 70B → Train a small Transformer classifier on the annotations → Infer over the full dataset to produce FineWeb-Edu.
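A sketch of the “Infer” step with the distilled classifier; the model id HuggingFaceFW/fineweb-edu-classifier and the score threshold are assumptions based on the public release.

```python
# Score documents with the small classifier distilled from the Llama-3-70B annotations.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "HuggingFaceFW/fineweb-edu-classifier"   # assumed model id on the Hub
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def edu_score(text: str) -> float:
    inputs = tok(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()   # regression head: score roughly in [0, 5]

doc = "Photosynthesis is the process by which plants convert light into chemical energy..."
print(edu_score(doc))   # keep documents above a threshold, e.g. score >= 3
```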

34 of 87

Dataset: FineWeb-Edu

35 of 87

Dataset: notes on filtering

Taking care of domain specificities

  • important to inspect the effect on domain specific data
  • extract 10 documents per domain (e.g. top URLs)
  • manually inspect the results
  • craft domain specific filters/hyper-parameters
  • same for multiple languages

Deterministic vs. stochastic selection

  • hard thresholds are strong decision points
  • stochastic smoothing of rules

36 of 87

Dataset: deduplication

Fuzzy

  • Bloom filters (hashing and fixed-size vector)
  • MinHash (hashing and sorting)

Exact

  • Exact substring with suffix array
  • Sentence dedup

time/memory consumption

  • MinHash offers a good trade-off of speed/memory

counter-intuitive results

  • more deduplication may lead to keeping only bad data
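A minimal MinHash + LSH deduplication sketch using the datasketch library; shingle size, number of permutations and similarity threshold are illustrative.

```python
# Fuzzy deduplication with MinHash signatures and locality-sensitive hashing.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for shingle in zip(words, words[1:], words[2:]):   # 3-word shingles
        m.update(" ".join(shingle).encode("utf-8"))
    return m

docs = {
    "a": "the cat sat on the mat and looked at the dog",
    "b": "the cat sat on the mat and looked at the cow",   # near-duplicate of "a"
    "c": "completely different text about pretraining large language models",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for key, text in docs.items():
    m = minhash(text)
    if not lsh.query(m):          # no sufficiently similar document kept so far
        lsh.insert(key, m)
        kept.append(key)
print(kept)   # likely ['a', 'c']
```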

37 of 87

Dataset: evaluate data quality

Small model trainings: train 1-2B parameter models on 30GT (Chinchilla optimal)

  • Use a set of “high-signal” benchmarks (in NLP):
    • CommonsenseQA
    • HellaSwag
    • OpenBookQA
    • PIQA
  • High-signal?
    • monotonicity: monotonically increasing during training
    • low variance:
      • when comparing two known reference datasets (e.g. The Pile versus C4)
      • when comparing with various sub-parts of data and seeds
      • above random baseline
  • Tricky details to maximize signal:
    • Small models like “normalized loglikelihood” better
    • Larger models like “letter answers” better
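A sketch of “normalized loglikelihood” scoring for a multiple-choice item: pick the answer whose continuation has the highest per-token log-probability (gpt2 is used only as a small example checkpoint).

```python
# Length-normalized log-likelihood scoring of multiple-choice answers.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def norm_loglikelihood(context: str, continuation: str) -> float:
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    cont_len = full_ids.shape[1] - ctx_ids.shape[1]
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
    targets = full_ids[0, 1:]
    cont_lp = logprobs[-cont_len:].gather(1, targets[-cont_len:].unsqueeze(1)).sum()
    return (cont_lp / cont_len).item()   # normalize by continuation length

question = "The capital of France is"
choices = [" Paris", " Berlin", " a type of cheese"]
scores = [norm_loglikelihood(question, c) for c in choices]
print(choices[max(range(len(choices)), key=scores.__getitem__)])
```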

38 of 87

Dataset: The Stack

39 of 87

Dataset: The Stack v2

  • 3B+ files in 658 programming languages
  • created as part of the BigCode Project: pre-training dataset for Code LLMs
  • Derived from the Software Heritage archive: largest public archive of software source code

40 of 87

Dataset: Cosmopedia

  • A synthetic dataset of 30M samples
  • generated by Mixtral-8x7B-Instruct-v0.1
  • 8 splits:
    • various sources for seed samples: Stanford, OpenStax and KhanAcademy, FineWeb, instruction-tuning datasets
    • model is asked to generate content

41 of 87

Distributed Training

simple things get complicated

42 of 87

Training: strategy

Compute budget is external constraint

  1. use Chinchilla + Harm’s law to determine model size
  2. Global batch size is a function of model size
    1. small models ~1-4M tokens (= global batch size × seqlen, seqlen typically 2-8k)
    2. Large models ~4-40M tokens

Compute cluster and models size determine training topology

  • Large models require distribution over multiple GPUs
    • 4D parallelism
    • ZeRO
  • We are limited in how far we can scale the batch size

43 of 87

Training: basic training step

44 of 87

Training: anatomy of memory

45 of 87

Training: activation recomputation

46 of 87

Training: activation recomputation

Sequence length

Selective: store activations of specific operations → 2-3% slowdown

Full: only store activations at layer level → 30% slowdown
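A sketch of full recomputation with PyTorch’s checkpoint utility: only the inputs of each checkpointed block are kept, and everything inside is recomputed during the backward pass.

```python
# Activation recomputation (gradient checkpointing) with torch.utils.checkpoint.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    def forward(self, x):
        return x + self.ff(x)

blocks = nn.ModuleList([Block() for _ in range(4)])
x = torch.randn(8, 128, 1024, requires_grad=True)

h = x
for block in blocks:
    # only the block inputs are stored; intermediate activations are recomputed in backward
    h = checkpoint(block, h, use_reentrant=False)
h.sum().backward()
```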

47 of 87

Training: gradient accumulation

Split global batch into micro batches to save memory
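A minimal gradient-accumulation sketch (toy model and data): one optimizer step per global batch built from several micro-batches.

```python
# Gradient accumulation: scale each micro-batch loss so gradients average over the global batch.
import torch
from torch import nn

model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
grad_accum_steps = 8                       # global batch = 8 micro-batches

optimizer.zero_grad()
for step in range(grad_accum_steps):
    x = torch.randn(4, 512)                # one micro-batch
    loss = model(x).pow(2).mean()
    (loss / grad_accum_steps).backward()   # gradients accumulate in .grad
optimizer.step()
optimizer.zero_grad()
```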

Now let’s add more GPUs!

48 of 87

Training: Data Parallelism - 1D

Distribute micro batches across GPUs

all_reduce
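A bare-bones data-parallel sketch that shows the gradient all_reduce explicitly; it assumes a multi-process torchrun launch, and in practice you would use torch.nn.parallel.DistributedDataParallel, which also overlaps communication with the backward pass.

```python
# Data parallelism: each rank computes gradients on its own micro-batch,
# then gradients are averaged with all_reduce.
# Launch with e.g.: torchrun --nproc_per_node=2 this_script.py
import torch
import torch.distributed as dist
from torch import nn

dist.init_process_group(backend="gloo")      # "nccl" on GPUs
rank, world = dist.get_rank(), dist.get_world_size()

torch.manual_seed(0)                         # identical init on every rank
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(4, 512) + rank               # each rank sees a different micro-batch
loss = model(x).pow(2).mean()
loss.backward()

for p in model.parameters():                 # average gradients across ranks
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world
optimizer.step()

dist.destroy_process_group()
```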

49 of 87

Training: Overlap Communication + Computation

https://siboehm.com/articles/22/data-parallel-training

50 of 87

Training: Tensor Parallel - 2D

What if the model still doesn’t fit? Split matrix multiplications:

51 of 87

Training: Tensor Parallel - 2D

What if the model still doesn’t fit? Split matrix multiplications:

52 of 87

Training: Tensor Parallel - 2D

What if the model still doesn’t fit? Split matrix multiplications:

53 of 87

Training: Tensor Parallel - 2D

Which one to use? Let’s look at the feedforward layers:

54 of 87

Training: Tensor Parallel - 2D

Which one to use? Let’s look at the feedforward layers:

We can save two communication steps!
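A single-process simulation of the column-then-row split described above: each “rank” works independently, and the only communication needed in the forward pass is one all-reduce (here just a sum) of the partial outputs, as in the Megatron-LM MLP layout.

```python
# Tensor-parallel MLP simulation: W1 split by columns, W2 split by rows.
import torch

torch.manual_seed(0)
d, hidden, tp = 16, 64, 4                  # model dim, FFN dim, tensor-parallel degree
x  = torch.randn(2, d)                     # (batch, d); replicated on every TP rank
W1 = torch.randn(d, hidden)                # up-projection
W2 = torch.randn(hidden, d)                # down-projection

ref = torch.relu(x @ W1) @ W2              # reference (unsharded) forward pass

W1_shards = W1.chunk(tp, dim=1)            # each rank holds d x (hidden/tp)
W2_shards = W2.chunk(tp, dim=0)            # each rank holds (hidden/tp) x d

partials = []
for w1, w2 in zip(W1_shards, W2_shards):   # independent work on each "rank"
    h = torch.relu(x @ w1)                 # no comms: the activation is elementwise
    partials.append(h @ w2)                # partial output of shape (batch, d)

out = sum(partials)                        # the all-reduce: sum partial outputs across ranks
print(torch.allclose(out, ref, atol=1e-5)) # True
```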

55 of 87

Training: Tensor Parallel - 2D

What about multi-head attention?

Column-parallel ←→ Each worker processes a subset of heads

56 of 87

Training: Sequence Parallel - 3D

57 of 87

Training: Going beyond 1 node

Intra-node (NVSwitch): 900 GB/s

Inter-node (InfiniBand): 50 GB/s

Node: 4-8 GPUs

Cluster: thousands of nodes

→ TP generally doesn’t scale beyond one node

58 of 87

Training: Pipeline Parallelism - 3D

Share layers across GPUs (e.g. layers 1-4 on the first GPU). Naive PP is very inefficient: GPUs idle most of the time (the idle-time bubble).

59 of 87

Training: Pipeline Parallelism - 3D

Microbatches

AFAB: All Forward - All Backward

60 of 87

Training: Pipeline Parallelism - 3D

Microbatches

1F1B: 1 Forward - 1 Backward

61 of 87

Training: Pipeline Parallelism - 3D

Microbatches

Interleaved 1F1B

62 of 87

Training: Context Parallelism - 4D

How do you train with 1M context?

Ring Attention!

63-75 of 87

Ring Attention, animated across slides: each GPU keeps its query block Q_i resident and starts with its local K_i, V_i. At every step, each GPU computes attention of Q_i against the K/V block it currently holds and then passes that K/V block to the next GPU in the ring (GPU 1 → GPU 2 → GPU 3 → GPU 4 → GPU 1). After four steps every Q_i has attended to every K_j/V_j: GPU 1, for example, processes K1/V1, then K4/V4, K3/V3 and finally K2/V2.
76 of 87

Training: Context Parallelism - 4D

ZigZag Ring attention: making sure all GPUs do equal work!

77 of 87

Training: 4D parallelism

All four parallelism approaches are combinable and complementary:

  1. Data Parallelism – along the batch dimension
  2. Tensor Parallelism - along the hidden-state dimension
  3. Sequence/Context Parallelism - along the sequence dimension
  4. Pipeline Parallelism - along the model layers dimension

78 of 87

Training: ZeRO

ZeRO (Zero Redundancy Optimizer): shard optimizer states, gradients and parameters across data-parallel ranks.

  • memory per parameter: 2 (weights) + 2 (gradients) + K bytes of optimizer state, with K = 12 for mixed-precision Adam
  • the most aggressive stage (parameter sharding) has 50% more comms
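A back-of-the-envelope memory sketch for the three ZeRO stages, assuming mixed-precision Adam (2 + 2 + 12 bytes per parameter) and ignoring activations.

```python
# Per-GPU memory for weights/grads/optimizer states under the different ZeRO stages.
def gb_per_gpu(n_params, n_gpus, stage):
    params, grads, opt = 2 * n_params, 2 * n_params, 12 * n_params   # bytes
    if stage >= 1: opt    /= n_gpus     # ZeRO-1: shard optimizer states
    if stage >= 2: grads  /= n_gpus     # ZeRO-2: also shard gradients
    if stage >= 3: params /= n_gpus     # ZeRO-3: also shard parameters
    return (params + grads + opt) / 1e9

N, gpus = 7e9, 64                       # e.g. a 7B model on 64 GPUs
for stage in range(4):
    print(f"ZeRO-{stage}: ~{gb_per_gpu(N, gpus, stage):.1f} GB per GPU (weights/grads/optim only)")
```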

79 of 87

Training: Putting it all together

80 of 87

Training: Flash Attention + Fused Kernels

Standard Attention

81 of 87

Training: Flash Attention + Fused Kernels

https://arxiv.org/pdf/2205.14135
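In PyTorch, torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention kernel on supported GPUs, so the (seq × seq) score matrix is never materialized; a minimal call:

```python
# Fused attention via PyTorch's SDPA (may use a FlashAttention kernel on supported hardware).
import torch
import torch.nn.functional as F

q = torch.randn(1, 16, 4096, 64)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 16, 4096, 64)
v = torch.randn(1, 16, 4096, 64)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 16, 4096, 64])
```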

82 of 87

Training: Mixed Precision Training

83 of 87

Training: Mixed Precision Training

Recipe for BF16/FP16 mixed precision training:

  • FP32 copy of weights
  • Loss scaling
  • Accumulation

Speed: Operations in lower precision are faster!

FP8: still experimental but we have some promising approaches
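A minimal mixed-precision loop with torch.cuda.amp (assumes a CUDA GPU): the weights stay in fp32, autocast runs the matmuls in fp16, and GradScaler implements the loss scaling (bf16 usually needs no scaler).

```python
# FP16 mixed-precision training loop with automatic loss scaling.
import torch
from torch import nn

device = "cuda"
model = nn.Linear(1024, 1024).to(device)   # weights kept in fp32
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 1024, device=device)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).pow(2).mean()      # forward ops run in fp16
    optimizer.zero_grad()
    scaler.scale(loss).backward()          # scale loss so fp16 gradients don't underflow
    scaler.step(optimizer)                 # unscale, skip the step on inf/nan, update fp32 weights
    scaler.update()
```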

84 of 87

Training: Learning rate schedules

Moving from Cosine to Warmup-Stable-Decay (WSD):

More flexibility, e.g. data stages!
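A sketch of a WSD schedule expressed as a LambdaLR multiplier; the warmup and decay fractions are illustrative.

```python
# Warmup-Stable-Decay: linear warmup, long constant plateau, short linear decay to zero.
import torch
from torch import nn

def wsd(step, total_steps, warmup=0.01, decay=0.1):
    warmup_steps = int(warmup * total_steps)
    decay_start  = int((1 - decay) * total_steps)
    if step < warmup_steps:
        return step / max(warmup_steps, 1)                                # warmup: 0 -> 1
    if step < decay_start:
        return 1.0                                                        # stable plateau
    return max(0.0, (total_steps - step) / (total_steps - decay_start))   # decay: 1 -> 0

model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
total = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: wsd(s, total))
# call scheduler.step() after every optimizer step; the long plateau makes it easy to
# branch off decayed checkpoints or switch data stages without restarting the schedule
```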

85 of 87

Training: Data Stages

86 of 87

Hugging Face: Tools

87 of 87

Questions?

GitHub/HF Hub/X: lvwerra