Pretraining Large Language Models
Leandro von Werra
Plan for today
State of LLMs
scaling large and smol
LLM families
closed model APIs
open model weights
fully open model
model weights not available
no access to training data or code
full access to model/code/data
Trends: train longer
Trends: train larger
Trends: more context
Trends: more compute
compute ≈ data x model size
| | Dataset (Billion Tokens) | Model size (Billion Parameters) |
| GPT-1 | 1-2 | 0.11 |
| GPT-2 | 10-20 | 1.4 |
| GPT-3 | 300 | 175 |
| GPT-4 | 10’000 | 1’800 |
Compute increase (data × model size): ~100x (GPT-1→2), ~2000x (GPT-2→3), ~300x (GPT-3→4)
GPT-4 training cost: ~$100M
Compute:
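As a rough sanity check of compute ≈ data × model size, a common rule of thumb is C ≈ 6·N·D training FLOPs for N parameters and D tokens. The sketch below applies it to the GPT-3 row of the table; the result is an estimate, not an official figure.

```python
# Back-of-the-envelope training compute via the common approximation C ≈ 6 * N * D.
N = 175e9          # GPT-3 parameters
D = 300e9          # GPT-3 training tokens
C = 6 * N * D      # ≈ 3.15e23 FLOPs
print(f"GPT-3 training compute ≈ {C:.2e} FLOPs")
```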
Trends: more compute
Trends: smol models
Trends: why? Scaling Laws!
Can we extrapolate to the performance of …
Scaling Laws
predictable scaling returns
Scaling laws: Predictable returns
[Figure (Kaplan et al.): test loss falls as a power law in compute, dataset size, and model size]
https://arxiv.org/abs/2001.08361
Scaling laws: Compute optimal
[Figure: loss vs. compute for models of different sizes under a fixed compute budget]
Too small: loss already flattened out
Optimal: lowest loss at current compute budget
Too large: not yet through steep loss zone
Scaling laws: Downstream performance
https://arxiv.org/abs/2303.08774
Scaling laws: Chinchilla fix
https://arxiv.org/abs/2203.15556
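A minimal sketch of how the Chinchilla result is often used in practice: split a compute budget with the ~20 tokens-per-parameter rule of thumb (via C ≈ 6·N·D) and estimate the loss with the paper's parametric fit. The helper names are made up and the coefficients are only approximately the reported values.

```python
# Hedged sketch of Chinchilla-style compute-optimal sizing; coefficients are illustrative.
def chinchilla_optimal(C):
    """Split a compute budget C (FLOPs) into parameters N and tokens D with D ≈ 20 * N."""
    N = (C / (6 * 20)) ** 0.5   # from C = 6 * N * (20 * N)
    D = 20 * N
    return N, D

def loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss estimate L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / N**alpha + B / D**beta

N, D = chinchilla_optimal(1e24)   # e.g. a 1e24 FLOP budget
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens, loss ≈ {loss(N, D):.2f}")
```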
Scaling laws: Chinchilla fix
WAT?!
Llama-3 8B trained on 15T tokens
https://arxiv.org/abs/2203.15556
Scaling laws: Inference
Chinchilla-optimal models are only compute-optimal for training and ignore inference compute
Scaling laws: Harm’s law
Dataset
aka the secret sauce
aka 90% of all the work
Dataset: the secret workhorse of LLMs
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
Dataset: goal of pretraining
Train a general-purpose model → maximal coverage
Requires:
Challenges:
Dataset: where to find data
Dataset: FineWeb
https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
Dataset: the average web
… is mostly garbage:
If we want a high-quality model, we need to clean it up!
Dataset: filtering pipeline
Dataset: filtering pipeline
Dataset: general advice
(manually, clustering, tokenizing etc)
Dataset: language filtering
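A sketch of what a language-filtering step can look like with a fastText language-ID model; the model file (lid.176.bin) has to be downloaded separately, and the 0.65 threshold is illustrative.

```python
# Language filtering with fastText language identification.
import fasttext

lid = fasttext.load_model("lid.176.bin")

def keep_english(text, threshold=0.65):
    # fastText expects single-line input, so strip newlines before predicting.
    labels, probs = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

print(keep_english("The quick brown fox jumps over the lazy dog."))
```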
Dataset: quality heuristics
Dataset: quality heuristics
Advantages:
Drawbacks:
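A minimal sketch of what such heuristic filters look like in code; the specific rules and thresholds below are illustrative, not the values used in any production pipeline.

```python
# Gopher/FineWeb-style quality heuristics (illustrative rules and thresholds).
def passes_heuristics(text):
    words = text.split()
    if not (50 <= len(words) <= 100_000):           # document length bounds
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):              # filters gibberish / symbol dumps
        return False
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and sum(l.strip().endswith("...") for l in lines) / len(lines) > 0.3:
        return False                                # too many truncated lines
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:
        return False                                # low alphabetic-character ratio
    return True
```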
Dataset: quality filtering - ML
Given a set of examples of good/bad documents:
(see https://github.com/kpu/kenlm)
→ Filter based on a threshold
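A sketch of perplexity-based filtering with a KenLM n-gram model trained on a "good" reference corpus (e.g. Wikipedia); the model path and threshold are placeholders.

```python
# Perplexity filtering with a KenLM n-gram model.
import kenlm

lm = kenlm.Model("wiki_5gram.arpa")   # path to a pretrained n-gram model (placeholder)

def keep_by_perplexity(text, max_ppl=1000.0):
    # Lower perplexity ⇒ the document looks more like the reference corpus.
    return lm.perplexity(text.replace("\n", " ")) <= max_ppl
```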
Dataset: FineWeb-Edu
Below is an extract from a web page. Evaluate whether the page has a high educational value and could be useful in an educational setting for teaching from primary school to grade school levels using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
The extract:
<EXAMPLE>.
After examining the extract:
Annotate: Llama 3 70B scores 500K samples
Train: a small Transformer classifier on the annotations
Infer: run the classifier at scale over FineWeb → FineWeb-Edu
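A sketch of running such a classifier over new documents with transformers, assuming the released fineweb-edu-classifier checkpoint (a single regression output giving a ~0-5 educational score; FineWeb-Edu keeps documents scoring 3 or higher).

```python
# Scoring documents with an educational-value classifier (checkpoint name is an assumption).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "HuggingFaceFW/fineweb-edu-classifier"
tok = AutoTokenizer.from_pretrained(name)
clf = AutoModelForSequenceClassification.from_pretrained(name)

def edu_score(text):
    inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return clf(**inputs).logits.squeeze().item()   # single regression output ≈ 0-5

# Keep documents with a score of 3 or higher.
```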
Dataset: FineWeb-Edu
Dataset: notes on filtering
Taking care of domain specificities
Deterministic vs. stochastic selection
Dataset: deduplication
Fuzzy
Exact
time/memory consumption
counter-intuitive results
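A minimal sketch of fuzzy deduplication with MinHash + LSH using the datasketch library; shingle size and similarity threshold are illustrative.

```python
# Fuzzy deduplication with MinHash signatures and locality-sensitive hashing.
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128, shingle=5):
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - shingle + 1, 1)):
        m.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = {}
for doc_id, text in enumerate(["a web page ...", "another web page ..."]):
    m = minhash(text)
    if not lsh.query(m):            # no near-duplicate seen so far → keep the document
        lsh.insert(str(doc_id), m)
        kept[doc_id] = text
```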
Dataset: evaluate data quality
Small model trainings: train 1-2B parameter models on 30GT (30B tokens, roughly Chinchilla-optimal)
Dataset: The Stack
Dataset: The Stack v2
Dataset: Cosmopedia
Distributed Training
simple things get complicated
Training: strategy
Compute budget is an external constraint
Compute cluster and model size determine the training topology
Training: basic training step
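A minimal sketch of a single training step, assuming a Hugging Face-style causal LM that returns a loss when labels are passed.

```python
# One basic pretraining step: forward, backward, optimizer update.
def train_step(model, batch, optimizer):
    # The model shifts the labels internally for next-token prediction.
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    loss = outputs.loss
    loss.backward()         # compute gradients
    optimizer.step()        # update weights
    optimizer.zero_grad()   # reset gradients for the next step
    return loss.item()
```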
Training: anatomy of memory
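A rough rule of thumb for static memory in BF16 mixed-precision Adam training (weights + gradients + optimizer states); activations come on top and depend on batch size, sequence length, and recomputation.

```python
# Per-parameter memory: bf16 weights + bf16 grads + fp32 master copy + Adam momentum + variance.
def model_memory_gb(n_params):
    bytes_per_param = 2 + 2 + 4 + 4 + 4    # = 16 bytes per parameter
    return n_params * bytes_per_param / 1e9

print(f"8B model: ~{model_memory_gb(8e9):.0f} GB before activations")   # ~128 GB
```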
Training: activation recomputation
Training: activation recomputation
[Figure: activation memory vs. sequence length]
Selective: store activations of specific operations → 2-3% slowdown
Full: only store activations at layer level → 30% slowdown
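A sketch of full recomputation with torch.utils.checkpoint: each layer's internal activations are recomputed during the backward pass instead of being stored.

```python
# Full activation recomputation: only layer inputs/outputs are kept in memory.
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_recomputation(layers, hidden_states):
    for layer in layers:
        # Intermediate activations inside `layer` are recomputed in the backward pass.
        hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states
```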
Training: gradient accumulation
Split the global batch into micro-batches to save memory
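A sketch of the accumulation loop; the loss is scaled so the summed gradient matches the gradient of the full global batch.

```python
# Gradient accumulation: several micro-batch backward passes per optimizer step.
def train_step_with_accumulation(model, micro_batches, optimizer):
    optimizer.zero_grad()
    for micro_batch in micro_batches:
        out = model(input_ids=micro_batch["input_ids"], labels=micro_batch["input_ids"])
        # Scale so the accumulated gradient equals the gradient of the full global batch.
        (out.loss / len(micro_batches)).backward()
    optimizer.step()
```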
Now let’s add more GPUs!
Training: Data Parallelism - 1D
Distribute micro-batches across GPUs
all_reduce
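A sketch of what happens under the hood of data parallelism: after the local backward pass, each rank's gradients are averaged with an all_reduce (frameworks like DDP do this in buckets, overlapped with compute).

```python
# Manual gradient averaging across data-parallel ranks.
import torch.distributed as dist

def all_reduce_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size   # average so the update matches single-GPU training
```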
Training: Overlap Communication + Computation
https://siboehm.com/articles/22/data-parallel-training
Training: Tensor Parallel - 2D
What if the model still doesn’t fit? Split matrix multiplications:
Training: Tensor Parallel - 2D
What if the model still doesn’t fit? Split matrix multiplications:
Training: Tensor Parallel - 2D
What if the model still doesn’t fit? Split matrix multiplications:
Training: Tensor Parallel - 2D
Which one to use? Let’s look at the feedforward layers:
Training: Tensor Parallel - 2D
Which one to use? Let’s look at the feedforward layers:
We can save two communication steps!
Training: Tensor Parallel - 2D
What about multi-head attention?
Column-parallel ←→ Each worker processes a subset of heads
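A single-process sketch of the feedforward case above: a column-parallel first projection followed by a row-parallel second projection, so the only communication per MLP block is one all-reduce (the final sum). Shapes and the two-way split are illustrative.

```python
# Simulating tensor parallelism for an MLP block on one process.
import torch

d_model, d_ff, tp = 512, 2048, 2
x = torch.randn(4, d_model)

W1 = torch.randn(d_model, d_ff)        # up projection
W2 = torch.randn(d_ff, d_model)        # down projection

W1_shards = W1.chunk(tp, dim=1)        # column-parallel: split output features
W2_shards = W2.chunk(tp, dim=0)        # row-parallel: split matching input features

# Each "rank" computes a partial output; summing them plays the role of the all_reduce.
partials = [torch.relu(x @ W1_shards[r]) @ W2_shards[r] for r in range(tp)]
y_tp = sum(partials)

y_ref = torch.relu(x @ W1) @ W2
print(torch.allclose(y_tp, y_ref, atol=1e-4, rtol=1e-4))
```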
Training: Sequence Parallel - 3D
Training: Going beyond 1 node
Intra-node connect (NVSwitch): 900 GB/s
Inter-node connect (InfiniBand): 50 GB/s
Node: 4-8 GPUs
Cluster: thousands of nodes
→ TP generally doesn’t scale beyond one node
Training: Pipeline Parallelism - 3D
Share layers across GPUs (e.g. layers 1-4 on one GPU). Naive PP:
[Figure: naive pipeline schedule with a large idle-time bubble]
Naive PP is very inefficient, with GPUs idling most of the time
Training: Pipeline Parallelism - 3D
Microbatches
AFAB: All Forward - All Backward
Training: Pipeline Parallelism - 3D
Microbatches
1F1B: 1 Forward - 1 Backward
Training: Pipeline Parallelism - 3D
Microbatches
Interleaved 1F1B
Training: Context Parallelism - 4D
How do you train with 1M context?
Ring Attention!
[Ring Attention animation: each GPU g keeps its query block Qg and starts with Kg, Vg. At every step the K/V blocks are passed one position around the ring (GPU 1 → 2 → 3 → 4 → 1) while the Q blocks stay put, so after 4 steps every query block has attended to all key/value blocks without any GPU ever holding the full sequence.]
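A single-process NumPy simulation of the rotation above: each "GPU" keeps its Q block, the K/V blocks travel around the ring, and partial softmax statistics are accumulated until they match full attention. Real implementations use a numerically stable online softmax and overlap the K/V communication with compute.

```python
# Simulating Ring Attention with 4 "GPUs" (shapes are illustrative).
import numpy as np

n_gpus, block, d = 4, 8, 16
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_gpus, block, d))
K = rng.normal(size=(n_gpus, block, d))
V = rng.normal(size=(n_gpus, block, d))

num = np.zeros((n_gpus, block, d))    # running numerator:   sum_j exp(scores_j) @ V_j
den = np.zeros((n_gpus, block, 1))    # running denominator: sum_j exp(scores_j)
kv = np.arange(n_gpus)                # index of the K/V block each GPU currently holds

for step in range(n_gpus):
    for g in range(n_gpus):
        scores = Q[g] @ K[kv[g]].T / np.sqrt(d)
        e = np.exp(scores)            # numerically naive; real kernels use an online softmax
        num[g] += e @ V[kv[g]]
        den[g] += e.sum(axis=-1, keepdims=True)
    kv = (kv - 1) % n_gpus            # pass K/V blocks to the next GPU in the ring

out_ring = num / den

# Reference: full attention over the whole concatenated sequence.
K_all, V_all = K.reshape(-1, d), V.reshape(-1, d)
for g in range(n_gpus):
    s = Q[g] @ K_all.T / np.sqrt(d)
    out_ref = (np.exp(s) / np.exp(s).sum(axis=-1, keepdims=True)) @ V_all
    assert np.allclose(out_ring[g], out_ref)
```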
Training: Context Parallelism - 4D
ZigZag Ring attention: making sure all GPUs do equal work!
Training: 4D parallelism
All four parallelism dimensions are combinable and complementary:
Training: ZeRO
ZeRO (Zero Redundancy Optimizer):
K = 12 bytes of optimizer state per parameter for Adam (FP32 weight copy + momentum + variance)
ZeRO stage 3 has ~50% more communication
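A minimal sketch of ZeRO-3-style sharding using PyTorch FSDP (one of several implementations); it assumes torch.distributed has already been initialized, e.g. via torchrun.

```python
# ZeRO-3-style sharding: params, grads and optimizer states are sharded across ranks.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_model(model: torch.nn.Module) -> FSDP:
    # Each rank stores only its shard and gathers full weights on demand
    # during forward/backward, freeing them again afterwards.
    return FSDP(model)
```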
Training: Putting all together
Training: Flash Attention + Fused Kernels
Standard Attention
Training: Flash Attention + Fused Kernels
https://arxiv.org/pdf/2205.14135
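A sketch contrasting standard attention (which materializes the seq × seq score matrix in HBM) with PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel; sizes are illustrative and a GPU is assumed.

```python
# Standard vs. fused attention: same math, very different memory behavior.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)  # (batch, heads, seq, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# Standard attention: O(seq^2) memory for the score matrix.
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
out_standard = torch.softmax(scores, dim=-1) @ v

# Fused/flash path: computed tile by tile in SRAM, no seq x seq matrix in HBM.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_standard, out_fused, atol=1e-2, rtol=1e-2))
```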
Training: Mixed Precision Training
Training: Mixed Precision Training
Recipe for BF16/FP16 mixed precision training:
Speed: Operations in lower precision are faster!
FP8: still experimental but we have some promising approaches
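A sketch of the BF16 part of the recipe with torch.autocast: master weights stay in FP32 while forward/backward matmuls run in BF16 (with FP16 one would additionally use a loss scaler such as torch.cuda.amp.GradScaler).

```python
# BF16 mixed-precision training step with autocast.
import torch

def mixed_precision_step(model, batch, optimizer):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    loss.backward()          # backward is called outside the autocast context
    optimizer.step()         # FP32 master weights are updated
    optimizer.zero_grad()
    return loss.item()
```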
Training: Learning rate schedules
Moving from Cosine to Warmup-Stable-Decay (WSD):
More flexibility, e.g. data stages!
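A sketch of a WSD schedule as a function of the step; phase lengths and the linear decay shape are illustrative.

```python
# Warmup-Stable-Decay: linear warmup, long constant plateau, short final decay.
def wsd_lr(step, max_lr, total_steps, warmup_steps, decay_steps, min_ratio=0.1):
    if step < warmup_steps:                              # warmup: 0 -> max_lr
        return max_lr * step / warmup_steps
    if step < total_steps - decay_steps:                 # stable: constant max_lr
        return max_lr
    progress = (step - (total_steps - decay_steps)) / decay_steps
    return max_lr * (1 - progress * (1 - min_ratio))     # decay: max_lr -> min_ratio * max_lr
```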
Training: Data Stages
Hugging Face: Tools
Questions?
GitHub/HF Hub/X: lvwerra