CS458 Natural Language Processing
Lecture B
Large Language Models
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
Introduction to Large Language Models
Language models
Large language models
Three architectures for large language models
Decoders: GPT, Claude, Llama, Mixtral
Encoders: BERT family, HuBERT (many varieties!)
Encoder-decoders: Flan-T5, Whisper
Large Language Models: What tasks can they do?
Big idea
Many tasks can be turned into tasks of predicting words!
This lecture: decoder-only models
Also called: causal LLMs, autoregressive LMs, left-to-right LMs
Conditional Generation: Generating text conditioned on previous text!
Many practical NLP tasks can be cast as word prediction!
Sentiment analysis: “I like Jackie Chan”
Framing lots of tasks as conditional generation
QA: “Who wrote The Origin of Species?”
Summarization
Original
Summary
LLMs for summarization (using tl;dr)
Sampling for LLM Generation
Decoding and Sampling
This task of choosing a word to generate based on the model’s probabilities is called decoding.
The most common method for decoding in LLMs: sampling.
Sampling from a model’s distribution over words:
after each token, we sample the next word to generate according to its probability conditioned on our previous choices.
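As a sketch, sampling one token at a time from the model's conditional distribution might look like this in NumPy (the toy vocabulary and probabilities here are made-up stand-ins for a real model's softmax output):

```python
import numpy as np

# Toy next-token distribution standing in for a model's softmax output
# (illustrative numbers, not from a real model).
vocab = ["the", "cat", "sat", "zygote", "quark"]
probs = np.array([0.5, 0.3, 0.15, 0.03, 0.02])

rng = np.random.default_rng(0)

def sample_next_token(probs, rng):
    """Draw one token index according to the model's distribution."""
    return rng.choice(len(probs), p=probs)

# Sampling repeatedly mostly yields the high-probability words, but the
# low-probability tail words do get picked occasionally.
draws = [vocab[sample_next_token(probs, rng)] for _ in range(10)]
```

In a real LLM, `probs` would be recomputed at every step, conditioned on all the tokens sampled so far.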
Random sampling
Random sampling doesn't work very well
Even though random sampling mostly generates sensible, high-probability words,
there are many odd, low-probability words in the tail of the distribution.
Each one is individually low-probability, but added together they make up a large portion of the distribution,
so they get picked often enough to generate weird sentences.
Factors in word sampling: quality and diversity
Emphasize high-probability words
+ quality: more accurate, coherent, and factual,
- diversity: boring, repetitive.
Emphasize middle-probability words
+ diversity: more creative, diverse,
- quality: less factual, incoherent
Top-k sampling
Top-p sampling (= nucleus sampling)
Problem with top-k: k is fixed so may cover very different amounts of probability mass in different situations
Idea: Instead, keep the top-p portion of the probability mass
Given a distribution P(wt | w<t), the top-p vocabulary V(p) is the smallest set of words such that Σ_{w ∈ V(p)} P(w | w<t) ≥ p
Holtzman et al., 2020
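Both truncation methods can be sketched as filters over the model's distribution (NumPy, with a small made-up distribution for illustration):

```python
import numpy as np

def top_k_filter(probs, k):
    """Top-k sampling: keep only the k most probable tokens, renormalize."""
    idx = np.argsort(probs)[::-1][:k]      # indices of the k largest probs
    filtered = np.zeros_like(probs)
    filtered[idx] = probs[idx]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Nucleus (top-p) sampling: keep the smallest set of top tokens whose
    cumulative probability mass is >= p, then renormalize."""
    order = np.argsort(probs)[::-1]        # tokens sorted by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest prefix with mass >= p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
```

Note how top-p adapts: with a peaked distribution the nucleus is small, while with a flat distribution it keeps many tokens, unlike a fixed k.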
Temperature sampling
Reshape the distribution instead of truncating it
Intuition from thermodynamics: a system at high temperature is flexible and can explore many states, while a low-temperature system is rigid.
In low-temperature sampling (τ ≤ 1) we smoothly increase the probability of the most probable words and decrease the probability of the rare words.
Temperature sampling
Divide the logits by a temperature parameter τ before passing them through the softmax.
Instead of y = softmax(u)
we do y = softmax(u/τ)
Temperature sampling
Why does this work?
With 0 < τ ≤ 1, dividing the logits by τ makes them larger, and since softmax is more peaked for larger inputs, more probability mass concentrates on the most likely words. At τ = 1 the distribution is unchanged; as τ → 0 it approaches greedy decoding.
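A minimal sketch of temperature sampling, with illustrative logits:

```python
import numpy as np

def softmax(u):
    u = u - u.max()                 # subtract max for numerical stability
    e = np.exp(u)
    return e / e.sum()

def temperature_softmax(logits, tau):
    """y = softmax(u / tau): divide the logits by temperature tau before
    the softmax. tau = 1 leaves the distribution unchanged; tau < 1
    concentrates probability mass on the highest-logit words."""
    return softmax(logits / tau)

logits = np.array([2.0, 1.0, 0.5])  # made-up logits for illustration
```

Comparing `temperature_softmax(logits, 0.5)` with `temperature_softmax(logits, 1.0)` shows the low-temperature distribution putting more mass on the top word.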
Pretraining Large Language Models: Algorithm
Pretraining
The big idea that underlies all the amazing performance of language models
First pretrain a transformer model on enormous amounts of text
Then apply it to new tasks.
Self-supervised training algorithm
We just train them to predict the next word!
"Self-supervised" because it just uses the next word as the label!
Intuition of language model training: loss
Cross-entropy loss for language modeling
CE loss: the difference between the correct probability distribution and the predicted distribution:
L_CE = −Σ_w y_t[w] log ŷ_t[w]
The correct distribution y_t knows the next word, so it is 1 for the actual next word and 0 for all the others.
So in this sum, all terms get multiplied by zero except one, the log probability the model assigns to the correct next word:
L_CE = −log ŷ_t[w_{t+1}]
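The collapse of the sum to a single term is easy to check numerically (illustrative distribution):

```python
import numpy as np

# Suppose the model's predicted distribution over a 5-word vocabulary at one
# time step (illustrative numbers), and the correct next word is index 2.
predicted = np.array([0.1, 0.2, 0.6, 0.05, 0.05])
correct_index = 2

# One-hot "correct" distribution y_t: 1 for the actual next word, 0 elsewhere
y = np.zeros_like(predicted)
y[correct_index] = 1.0

# Full cross-entropy sum: -sum_w y[w] * log(predicted[w])
ce_full = -np.sum(y * np.log(predicted))

# Because y is one-hot, this collapses to -log of the correct word's probability
ce_simple = -np.log(predicted[correct_index])
```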
Teacher forcing
Training a transformer language model
Pretraining data for LLMs
LLMs are mainly trained on the web
Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages
Colossal Clean Crawled Corpus (C4; Raffel et al. 2020), 156 billion tokens of English, filtered
What's in it? Mostly patent text documents, Wikipedia, and news sites
The Pile: a pretraining corpus
web
academics
books
dialog
Filtering for quality and safety
Quality is subjective
Safety also subjective
What does a model learn from pretraining?
Big idea
Text contains enormous amounts of knowledge
Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much
But there are problems with scraping from the web
Copyright: much of the text in these datasets is copyrighted
Data consent
Privacy:
Finetuning
Finetuning for adaptation to new domains
What happens if we need our LLM to work well on a domain it didn't see in pretraining?
Perhaps some specific medical or legal domain?
Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?
Finetuning
"Finetuning" means 4 different things
We'll discuss 1 here, and 3 in later lectures
In all four cases, finetuning means:
taking a pretrained model and further adapting some or all of its parameters to some new data
1. Finetuning as "continued pretraining" on new data
Evaluating Large Language Models
Perplexity
Just as for n-gram language models, we use perplexity to measure how well the LM predicts unseen text
The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.
For a test set of n tokens w1:n the perplexity is:
perplexity_θ(w1:n) = P_θ(w1:n)^(−1/n) = (Π_{i=1}^{n} 1 / P_θ(wi | w<i))^(1/n)
Why perplexity instead of raw probability of the test set?
Raw probability depends on length: longer test sets get lower probability, so probabilities aren't comparable across texts of different lengths. Perplexity normalizes per token.
(The inverse comes from the original definition of perplexity as the exponentiated cross-entropy rate in information theory)
Probability range is [0,1]; perplexity range is [1,∞)
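As a sketch, perplexity can be computed from the per-token probabilities the model assigned (in log space for numerical stability):

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity of a test set, given the probability the model assigned to
    each of its n tokens: the inverse probability of the sequence normalized
    by length, (prod_i p_i) ** (-1/n), computed in log space."""
    log_probs = np.log(np.asarray(token_probs))
    return np.exp(-log_probs.mean())

# A model that assigns probability 1/4 to every token has perplexity 4,
# as if it were choosing uniformly among 4 words at each step.
```

Lower perplexity means the model assigned higher probability to the test text.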
Perplexity
Also: perplexity is sensitive to length/tokenization so best used when comparing LMs that use the same tokenizer.
Many other factors that we evaluate, like:
Size
Big models take lots of GPUs and time to train, memory to store
Energy usage
Can measure kWh or kilograms of CO2 emitted
Fairness
Benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups.
Dealing with Scale
Scaling Laws
LLM performance depends on model size (number of parameters), dataset size, and the amount of compute used for training
Can improve a model by adding parameters (more layers, wider contexts), more data, or training for more iterations
The performance of a large language model (the loss) scales as a power-law with each of these three
Scaling Laws
Loss L as a function of number of parameters N, dataset size D, or compute budget C (if the other two are held constant):
L(N) = (Nc/N)^αN,  L(D) = (Dc/D)^αD,  L(C) = (Cc/C)^αC
Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.
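A minimal sketch of such a power law for model size, with constants following Kaplan et al. (2020); the exact values are illustrative, since they depend on the dataset and training setup:

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law scaling of loss with non-embedding parameter count N:
    L(N) = (N_c / N) ** alpha_N, holding data and compute fixed.
    Constants follow Kaplan et al. (2020) and are illustrative only."""
    return (n_c / n_params) ** alpha_n
```

Fitting such a curve on small runs lets us extrapolate the loss of a larger model before training it.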
Number of non-embedding parameters: N ≈ 12 n_layer d², for n_layer transformer layers of model dimensionality d
Thus GPT-3, with n_layer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.
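The GPT-3 arithmetic above, as a one-line helper:

```python
def non_embedding_params(n_layers, d_model):
    """Approximate non-embedding parameter count of a transformer:
    N = 12 * n_layers * d_model**2 (attention and feedforward weights)."""
    return 12 * n_layers * d_model ** 2

# GPT-3: 96 layers, model dimensionality 12288 -> roughly 175 billion
gpt3_params = non_embedding_params(96, 12288)
```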
KV Cache
In training, we can compute attention very efficiently in parallel:
But not at inference! We generate the next tokens one at a time!
For each new token x, we multiply by WQ, WK, and WV to get its query, key, and value vectors
But we don't want to recompute the key and value vectors for all the prior tokens x<i
Instead, store the key and value vectors in memory in the KV cache, and just read them from the cache instead of recomputing them
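A toy sketch of the caching idea (tiny random projections standing in for a real attention layer; only the caching logic is the point):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # toy model dimensionality
W_K = rng.standard_normal((d, d))      # key projection
W_V = rng.standard_normal((d, d))      # value projection

k_cache, v_cache = [], []              # the KV cache

def step(x):
    """Process one new token embedding x at inference time: compute its key
    and value once, append them to the cache, and return all keys/values
    seen so far. Prior tokens' vectors are read from the cache, never
    recomputed."""
    k_cache.append(x @ W_K)
    v_cache.append(x @ W_V)
    return np.stack(k_cache), np.stack(v_cache)

# Generate three tokens one at a time; each step does one projection, not t.
for _ in range(3):
    K, V = step(rng.standard_normal(d))
```

Each decoding step thus costs one key/value projection instead of recomputing all t of them, at the price of storing the cache in memory.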
Parameter-Efficient Finetuning
Adapting to a new domain by continued pretraining (finetuning) is expensive for huge LLMs, since it updates all of the model's parameters.
Instead, parameter-efficient finetuning (PEFT): freeze most parameters and train only a small subset.
LoRA (Low-Rank Adaptation)
LoRA
Forward pass: instead of
h = xW
We do
h = xW + xAB
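A minimal NumPy sketch of the LoRA forward pass (toy dimensions; a common convention, used here, initializes B to zero so the adapted model starts out identical to the pretrained one):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # model dimension and small LoRA rank, r << d

W = rng.standard_normal((d, d))          # pretrained weight matrix, frozen
A = rng.standard_normal((d, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, d))                     # starts at zero, so xAB = 0 at init

def lora_forward(x):
    """LoRA forward pass h = xW + xAB: only A and B (2*d*r parameters,
    versus d*d for W) are updated during finetuning; W stays frozen."""
    return x @ W + (x @ A) @ B

x = rng.standard_normal(d)
```

Only 2dr = 32 parameters are trained here versus d² = 64 frozen ones; at realistic dimensions (d in the thousands, r in the tens) the savings are far larger.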
Harms of Large Language Models
Hallucination
Copyright
Privacy
Toxicity and Abuse
Misinformation
Thank You