1 of 66

CS458 Natural Language Processing

Lecture B

Large Language Models

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 66

Introduction to Large Language Models

3 of 66

Language models

  • Remember the simple n-gram language model
    • Assigns probabilities to sequences of words
    • Generates text by sampling possible next words
    • Is trained on counts computed from lots of text
  • Large language models are similar and different:
    • Assign probabilities to sequences of words
    • Generate text by sampling possible next words
    • Are trained by learning to guess the next word

4 of 66

Large language models

  • Even though pretrained only to predict words,
  • LLMs learn a lot of useful language knowledge,
  • since they train on a lot of text

5 of 66

Three architectures for large language models

  • Decoders: GPT, Claude, Llama, Mixtral
  • Encoders: BERT family, HuBERT
  • Encoder-decoders: Flan-T5, Whisper

6 of 66

Encoders

Many varieties!

  • Popular: Masked Language Models (MLMs)
  • BERT family

  • Trained by predicting words from surrounding words on both sides
  • Are usually finetuned (trained on supervised data) for classification tasks.

7 of 66

Encoder-Decoders

  • Trained to map from one sequence to another
  • Very popular for:
    • machine translation (map from one language to another)
    • speech recognition (map from acoustics to words)

8 of 66

Large Language Models: What tasks can they do?

9 of 66

Big idea

Many tasks can be turned into tasks of predicting words!

10 of 66

This lecture: decoder-only models

Also called:

  • Causal LLMs
  • Autoregressive LLMs
  • Left-to-right LLMs

  • Predict words left to right

11 of 66

Conditional Generation: Generating text conditioned on previous text!

12 of 66

Many practical NLP tasks can be cast as word prediction!

Sentiment analysis: “I like Jackie Chan”

  1. We give the language model this string: The sentiment of the sentence "I like Jackie Chan" is:
  2. And see what word it thinks comes next:

13 of 66

Framing lots of tasks as conditional generation

QA: “Who wrote The Origin of Species?”

  1. We give the language model this string:

  2. And see what word it thinks comes next:

  3. And iterate:

14 of 66

Summarization

Original

Summary

15 of 66

LLMs for summarization (using tl;dr)

16 of 66

Sampling for LLM Generation

17 of 66

Decoding and Sampling

This task of choosing a word to generate based on the model’s probabilities is called decoding.

The most common method for decoding in LLMs: sampling.

Sampling from a model’s distribution over words:

  • choose random words according to their probability assigned by the model.

After each token, we sample the next word to generate according to its probability conditioned on our previous choices.

  • A transformer language model gives us this conditional probability distribution over the vocabulary.
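The random-sampling idea above can be sketched in a few lines. This is a minimal illustration, not the slides' code; the function names and toy logits are mine, and a real model would produce the logits from a transformer forward pass:

```python
import math
import random

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, rng=random):
    """Sample one token id in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy logits over a 4-word vocabulary; word 0 is most likely but any can appear.
token = sample_next_token([2.0, 1.0, 0.1, -1.0])
```

Repeating `sample_next_token` on each new context, appending the chosen token, is exactly the autoregressive generation loop.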

18 of 66

Random sampling

19 of 66

Random sampling doesn't work very well

Even though random sampling mostly generates sensible, high-probability words,

there are many odd, low-probability words in the tail of the distribution.

Each one is low-probability, but added up they constitute a large portion of the distribution.

So they get picked often enough to generate weird sentences.

20 of 66

Factors in word sampling: quality and diversity

Emphasize high-probability words

+ quality: more accurate, coherent, and factual,

- diversity: boring, repetitive.

Emphasize middle-probability words

+ diversity: more creative, diverse,

- quality: less factual, incoherent

21 of 66

Top-k sampling

  1. Choose # of words k
  2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context: p(wt | w<t)
  3. Sort the words by likelihood, keep only the top k most probable words.
  4. Renormalize the scores of the k words to be a legitimate probability distribution.
  5. Randomly sample a word from within these remaining k most-probable words according to its probability.
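The five steps above can be sketched as follows; this is an illustrative implementation, and the function name and toy logits are my own:

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Keep the k most probable words, renormalize, and sample from them."""
    # Steps 2-3: sort token ids by score, descending, and keep the top k.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Steps 4-5: softmax over just the surviving logits renormalizes them,
    # and rng.choices samples in proportion to those renormalized weights.
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# With k=2, only the two most probable words (ids 0 and 1) can ever be chosen.
token = top_k_sample([5.0, 1.0, 0.0, -2.0], k=2)
```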

22 of 66

Top-p sampling (= nucleus sampling)

Problem with top-k: k is fixed, so it may cover very different amounts of probability mass in different situations.

Idea: instead, keep the top p fraction of the probability mass.

Given a distribution P(wt | w<t), the top-p vocabulary V(p) is the smallest set of words such that

    Σ_{w ∈ V(p)} P(w | w<t) ≥ p

Holtzman et al., 2020
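A minimal sketch of nucleus sampling under this definition (names and toy values are illustrative, not from Holtzman et al.):

```python
import math
import random

def top_p_sample(logits, p, rng=random):
    """Keep the smallest prefix of the sorted vocabulary whose mass >= p."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk down the vocabulary in probability order, accumulating mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # the nucleus V(p) now covers at least p of the mass
            break
    # Renormalize over the nucleus and sample from it.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Note how the nucleus adapts: for a peaked distribution it may contain one word, while for a flat one it can cover most of the vocabulary.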

23 of 66

Temperature sampling

Reshape the distribution instead of truncating it

Intuition from thermodynamics,

  • a system at high temperature is flexible and can explore many possible states,
  • a system at lower temperature is likely to explore a subset of lower energy (better) states.

In low-temperature sampling (0 < τ ≤ 1) we smoothly

  • increase the probability of the most probable words
  • decrease the probability of the rare words.

24 of 66

Temperature sampling

Divide the logit by a temperature parameter τ before passing it through the softmax.

Instead of

    y = softmax(u)

we do

    y = softmax(u / τ)

25 of 66

Temperature sampling

Why does this work?

  • When τ is close to 1, the distribution doesn’t change much.
  • The lower τ is, the larger the scores being passed to the softmax.
  • Softmax pushes high values toward 1 and low values toward 0.
  • Large inputs push high-probability words higher and low-probability words lower, making the distribution more greedy.
  • As τ approaches 0, the probability of the most likely word approaches 1.

0 < τ ≤ 1
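The softmax(u / τ) trick above can be sketched directly; the function name and toy logits here are illustrative:

```python
import math
import random

def temperature_sample(logits, tau, rng=random):
    """Divide the logits by tau, then softmax-sample.

    tau = 1 recovers plain random sampling; as tau -> 0 the scaled
    logits grow, the softmax sharpens, and sampling approaches greedy
    decoding (always the most probable word)."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```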

26 of 66

Pretraining Large Language Models: Algorithm

27 of 66

Pretraining

The big idea that underlies all the amazing performance of language models

First pretrain a transformer model on enormous amounts of text

Then apply it to new tasks.

28 of 66

Self-supervised training algorithm

We just train them to predict the next word!

  1. Take a corpus of text
  2. At each time step t
    1. ask the model to predict the next word
    2. train the model using gradient descent to minimize the error in this prediction

"Self-supervised" because it just uses the next word as the label!

29 of 66

Intuition of language model training: loss

  • Same loss function: cross-entropy loss
    • We want the model to assign a high probability to true word w
    • = want loss to be high if the model assigns too low a probability to w
  • CE Loss: The negative log probability that the model assigns to the true next word w
    • If the model assigns too low a probability to w
    • We move the model weights in the direction that assigns a higher probability to w

30 of 66

Cross-entropy loss for language modeling

CE loss: the difference between the correct probability distribution and the predicted distribution.

The correct distribution yt knows the next word, so it is 1 for the actual next word and 0 for all the others.

So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:

    LCE = − log p(wt+1 | w1:t)

31 of 66

Teacher forcing

  • At each token position t, model sees correct tokens w1:t,
    • Computes loss (–log probability) for the next token wt+1
  • At next token position t+1 we ignore what model predicted for wt+1
    • Instead we take the correct word wt+1, add it to context, move on
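The two slides above combine into a simple loop; this sketch is illustrative, with `model_probs` as a hypothetical stand-in for a real transformer's output distribution:

```python
import math

def teacher_forced_loss(model_probs, tokens):
    """Average cross-entropy loss over a sequence, with teacher forcing:
    at each position t the model conditions on the *gold* prefix
    tokens[:t] (not its own predictions), and the loss is -log of the
    probability it assigns to the gold next token tokens[t]."""
    total = 0.0
    for t in range(1, len(tokens)):
        probs = model_probs(tokens[:t])       # distribution over the vocabulary
        total += -math.log(probs[tokens[t]])  # CE loss for the true next word
    return total / (len(tokens) - 1)

# A toy "model" over a 3-word vocabulary that always predicts uniformly.
uniform = lambda prefix: [1/3, 1/3, 1/3]
loss = teacher_forced_loss(uniform, [0, 2, 1, 1])  # = log 3 ≈ 1.0986
```

In real training this loss is then backpropagated through the transformer; here the point is only that the gold prefix, not the model's own guess, feeds each step.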

32 of 66

Training a transformer language model

33 of 66

Pretraining data for LLMs

34 of 66

LLMs are mainly trained on the web

Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages

Colossal Clean Crawled Corpus (C4; Raffel et al. 2020), 156 billion tokens of English, filtered

What's in it? Mostly patent text documents, Wikipedia, and news sites

35 of 66

The Pile: a pretraining corpus

Its composition spans four broad categories: web, academics, books, dialog.

36 of 66

Filtering for quality and safety

Quality is subjective

  • Many LLMs attempt to match Wikipedia, books, particular websites
  • Need to remove boilerplate, adult content
  • Deduplication at many levels (URLs, documents, even lines)

Safety is also subjective

  • Toxicity detection is important, although it has mixed results
  • It can mistakenly flag data written in dialects like African American English

37 of 66

What does a model learn from pretraining?

  • There are canines everywhere! One dog in the front room, and two dogs
  • It wasn't just big it was enormous
  • The author of "A Room of One's Own" is Virginia Woolf
  • The doctor told me that he
  • The square root of 4 is 2

38 of 66

Big idea

Text contains enormous amounts of knowledge

Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much

39 of 66

But there are problems with scraping from the web

Copyright: much of the text in these datasets is copyrighted

  • Not clear if the fair use doctrine in the US allows this use
  • This remains an open legal question

Data consent

  • Website owners can indicate they don't want their site crawled

Privacy:

  • Websites can contain private IP addresses and phone numbers

40 of 66

Finetuning

41 of 66

Finetuning for adaptation to new domains

What happens if we need our LLM to work well on a domain it didn't see in pretraining?

Perhaps some specific medical or legal domain?

Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?

42 of 66

Finetuning

43 of 66

"Finetuning" means 4 different things

We'll discuss 1 here, and 3 in later lectures

In all four cases, finetuning means:

taking a pretrained model and further adapting some or all of its parameters to some new data

44 of 66

1. Finetuning as "continued pretraining" on new data

  • Further train all the parameters of model on new data
    • using the same method (word prediction) and loss function (cross-entropy loss) as for pretraining.
    • as if the new data were at the tail end of the pretraining data
  • Hence sometimes called continued pretraining


45 of 66

Evaluating Large Language Models

46 of 66

Perplexity

Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text

The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.

For a test set of n tokens w1:n the perplexity is:

    Perplexity(w1:n) = P(w1:n)^(−1/n) = ( Π_{i=1..n} 1 / P(wi | w<i) )^(1/n)

47 of 66

Why perplexity instead of raw probability of the test set?

  • Probability depends on size of test set
    • Probability gets smaller the longer the text
    • Better: a metric that is per-word, normalized by length
  • Perplexity is the inverse probability of the test set, normalized by the number of words

(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)

Probability range is [0,1]; perplexity range is [1, ∞)

48 of 66

Perplexity

  • The higher the probability of the word sequence, the lower the perplexity.
  • Thus the lower the perplexity of a model on the data, the better the model.
  • Minimizing perplexity is the same as maximizing probability

Also: perplexity is sensitive to length/tokenization so best used when comparing LMs that use the same tokenizer.
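Perplexity is usually computed from per-token log probabilities, as exp of the average negative log probability (equivalent to the normalized inverse probability above). A minimal sketch, with illustrative names:

```python
import math

def perplexity(token_probs):
    """Perplexity of a test set, given the probability the model assigned
    to each of its n tokens: exp of the average negative log probability."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns probability 1/4 to every token has perplexity ≈ 4:
# it is as confused as if choosing uniformly among 4 words at each step.
ppl = perplexity([0.25] * 10)
```

Higher probabilities on the test tokens drive this number down, matching the bullets above: lower perplexity means a better model.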

49 of 66

Many other factors that we evaluate, like:

Size

Big models take lots of GPUs and time to train, memory to store

Energy usage

Can measure kWh or kilograms of CO2 emitted

Fairness

Benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups.

50 of 66

Dealing with Scale

51 of 66

Scaling Laws

LLM performance depends on

  • Model size: the number of parameters, not counting embeddings
  • Dataset size: the amount of training data (in tokens)
  • Compute: the amount of compute used for training (in FLOPs, etc.)

Can improve a model by adding parameters (more layers, wider contexts), more data, or training for more iterations

The performance of a large language model (the loss) scales as a power-law with each of these three

52 of 66

Scaling Laws

Loss L as a function of the number of parameters N, dataset size D, or compute budget C (when the other two are held constant)

Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.
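Holding the other two factors constant, each dependence is an empirical power law; the constants Nc, Dc, Cc and exponents αN, αD, αC are fit to training curves (Kaplan et al., 2020):

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```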

53 of 66

Number of non-embedding parameters N

Non-embedding parameters scale roughly as N ≈ 12 · nlayer · d². Thus GPT-3, with nlayer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.

54 of 66

KV Cache

In training, we can compute attention very efficiently in parallel:

But not at inference! We generate the next tokens one at a time!

For each new token x, we need to multiply it by WQ, WK, and WV to get its query, key, and value vectors.

But we don't want to recompute the key and value vectors for all the prior tokens x<i.

Instead, we store the key and value vectors in memory in the KV cache, and then we can just grab them from the cache.
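A single-head sketch of the idea, assuming the query/key/value projections have already been applied; the class and function names are illustrative:

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one new token over cached keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax over the scores (max-subtracted for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Store each generated token's key and value vectors, so at inference
    we never re-project prior tokens: only the new token's q, k, v are computed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        self.keys.append(key)      # cache this token's key ...
        self.values.append(value)  # ... and value for all later steps
        return attend(query, self.keys, self.values)
```

Each generation step thus does one projection and one attention over the cache, instead of re-running attention for the whole prefix.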

55 of 66

KV Cache

56 of 66

Parameter-Efficient Finetuning

Adapting to a new domain by continued pretraining (finetuning) is a problem with huge LLMs.

  • Enormous numbers of parameters to train
  • Each pass of batch gradient descent has to backpropagate through many many huge layers.
  • Expensive in processing power, in memory, and in time.

Instead, parameter-efficient fine tuning (PEFT)

  • Efficiently select a subset of parameters to update when finetuning.
  • E.g., freeze some of the parameters (don’t change them),
  • And only update some a few parameters.

57 of 66

LoRA (Low-Rank Adaptation)

  • Transformers have many dense matrix-multiply layers
    • Like WQ, WK, WV, WO layers in attention
  • Instead of updating these layers during finetuning,
    • Freeze these layers
    • Update a low-rank approximation with fewer parameters.

58 of 66

LoRA

  • Consider a matrix W (shape [N × d]) that needs to be updated during finetuning via gradient descent.
    • Normally the updates are ∆W (shape [N × d])
  • In LoRA, we freeze W and instead update a low-rank decomposition of ∆W:
    • A of shape [N × r],
    • B of shape [r × d], where r is very small (like 1 or 2)
    • That is, during finetuning we update A and B instead of W.
    • Replace W + ∆W with W + AB.

Forward pass: instead of

h = xW

We do

h = xW + xAB
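The LoRA forward pass with the slide's shapes can be sketched directly (helper names are mine; a real implementation would use a tensor library):

```python
def matmul(x, W):
    """Row vector x times matrix W (given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

def lora_forward(x, W, A, B):
    """h = xW + xAB: the frozen weight W plus the low-rank update,
    with A of shape [N x r] and B of shape [r x d] as on the slide.
    Computing x -> xA -> (xA)B costs O(r(N + d)) extra work, tiny when
    r is 1 or 2, and only A and B receive gradients during finetuning."""
    return [hw + hab
            for hw, hab in zip(matmul(x, W), matmul(matmul(x, A), B))]

# Toy check with N = d = 2, r = 1.
h = lora_forward([1.0, 2.0],
                 W=[[1.0, 0.0], [0.0, 1.0]],   # frozen pretrained weight
                 A=[[1.0], [1.0]],             # trainable, shape [2 x 1]
                 B=[[2.0, 3.0]])               # trainable, shape [1 x 2]
```

A common design choice (worth noting, though not on the slide) is to initialize one factor to zero so that AB = 0 and finetuning starts exactly at the pretrained model.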

59 of 66

LoRA

60 of 66

Harms of Large Language Models

61 of 66

Hallucination

62 of 66

Copyright

63 of 66

Privacy

64 of 66

Toxicity and Abuse

65 of 66

Misinformation

66 of 66

Thank You