1 of 66

CS458 Natural Language Processing

Lecture B

Large Language Models

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 66

Introduction to Large Language Models

3 of 66

Language models

  • Remember the simple n-gram language model
    • Assigns probabilities to sequences of words
    • Generates text by sampling possible next words
    • Is trained on counts computed from lots of text
  • Large language models are similar and different:
    • Assign probabilities to sequences of words
    • Generate text by sampling possible next words
    • Are trained by learning to guess the next word

4 of 66

Large language models

  • Even though pretrained only to predict words,
  • LLMs learn a lot of useful language knowledge,
  • since they train on a lot of text

5 of 66

Three architectures for large language models

  • Decoders: GPT, Claude, Llama, Mixtral
  • Encoders: BERT family, HuBERT
  • Encoder-decoders: Flan-T5, Whisper

6 of 66

Encoders

Many varieties!

  • Popular: Masked Language Models (MLMs)
  • BERT family

  • Trained by predicting words from surrounding words on both sides
  • Are usually finetuned (trained on supervised data) for classification tasks.

7 of 66

Encoder-Decoders

  • Trained to map from one sequence to another
  • Very popular for:
    • machine translation (map from one language to another)
    • speech recognition (map from acoustics to words)

8 of 66

Large Language Models: What tasks can they do?

9 of 66

Big idea

Many tasks can be turned into tasks of predicting words!

10 of 66

This lecture: decoder-only models

Also called:

  • Causal LLMs
  • Autoregressive LLMs
  • Left-to-right LLMs

  • Predict words left to right

11 of 66

Conditional Generation: Generating text conditioned on previous text!

12 of 66

Many practical NLP tasks can be cast as word prediction!

Sentiment analysis: “I like Jackie Chan”

  1. We give the language model this string: The sentiment of the sentence "I like Jackie Chan" is:
  2. And see what word it thinks comes next:

13 of 66

Framing lots of tasks as conditional generation

QA: “Who wrote The Origin of Species?”

  1. We give the language model this string:

  2. And see what word it thinks comes next:

  3. And iterate:

14 of 66

Summarization

Original

Summary

15 of 66

LLMs for summarization (using tl;dr)

16 of 66

Sampling for LLM Generation

17 of 66

Decoding and Sampling

This task of choosing a word to generate based on the model’s probabilities is called decoding.

The most common method for decoding in LLMs: sampling.

Sampling from a model’s distribution over words:

  • choose random words according to their probability assigned by the model.

After each token, we sample the next word to generate according to its probability conditioned on our previous choices.

  • A transformer language model gives us this conditional probability distribution over the vocabulary.
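The random-sampling idea above can be sketched in a few lines. This is a minimal illustration, not the slides' code; the function names and toy logits are mine, and a real model would produce the logits from a transformer forward pass:

```python
import math
import random

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, rng=random):
    """Sample one token id in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy logits over a 4-word vocabulary; word 0 is most likely but any can appear.
token = sample_next_token([2.0, 1.0, 0.1, -1.0])
```

Repeating `sample_next_token` on each new context, appending the chosen token, is exactly the autoregressive generation loop.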

18 of 66

Random sampling

19 of 66

Random sampling doesn't work very well

Even though random sampling mostly generates sensible, high-probability words,

there are many odd, low-probability words in the tail of the distribution.

Each one is low-probability, but added up they constitute a large portion of the distribution.

So they get picked often enough to generate weird sentences.

20 of 66

Factors in word sampling: quality and diversity

Emphasize high-probability words

+ quality: more accurate, coherent, and factual,

- diversity: boring, repetitive.

Emphasize middle-probability words

+ diversity: more creative, diverse,

- quality: less factual, incoherent

21 of 66

Top-k sampling

  1. Choose # of words k
  2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context: p(wt | w<t)
  3. Sort the words by likelihood, keep only the top k most probable words.
  4. Renormalize the scores of the k words to be a legitimate probability distribution.
  5. Randomly sample a word from within these remaining k most-probable words according to its probability.
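The five steps above can be sketched as follows; this is an illustrative implementation, and the function name and toy logits are my own:

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Keep the k most probable words, renormalize, and sample from them."""
    # Steps 2-3: sort token ids by score, descending, and keep the top k.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Steps 4-5: softmax over just the surviving logits renormalizes them,
    # and rng.choices samples in proportion to those renormalized weights.
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# With k=2, only the two most probable words (ids 0 and 1) can ever be chosen.
token = top_k_sample([5.0, 1.0, 0.0, -2.0], k=2)
```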

22 of 66

Top-p sampling (= nucleus sampling)

Problem with top-k: k is fixed, so it may cover very different amounts of probability mass in different situations.

Idea: instead, keep the top p fraction of the probability mass.

Given a distribution P(wt | w<t), the top-p vocabulary V(p) is the smallest set of words such that

    Σ_{w ∈ V(p)} P(w | w<t) ≥ p

Holtzman et al., 2020
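A minimal sketch of nucleus sampling under this definition (names and toy values are illustrative, not from Holtzman et al.):

```python
import math
import random

def top_p_sample(logits, p, rng=random):
    """Keep the smallest prefix of the sorted vocabulary whose mass >= p."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk down the vocabulary in probability order, accumulating mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # the nucleus V(p) now covers at least p of the mass
            break
    # Renormalize over the nucleus and sample from it.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Note how the nucleus adapts: for a peaked distribution it may contain one word, while for a flat one it can cover most of the vocabulary.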

23 of 66

Temperature sampling

Reshape the distribution instead of truncating it

Intuition from thermodynamics,

  • a system at high temperature is flexible and can explore many possible states,
  • a system at lower temperature is likely to explore a subset of lower energy (better) states.

In low-temperature sampling (0 < τ ≤ 1) we smoothly

  • increase the probability of the most probable words
  • decrease the probability of the rare words.

24 of 66

Temperature sampling

Divide the logit by a temperature parameter τ before passing it through the softmax.

Instead of

    y = softmax(u)

we do

    y = softmax(u / τ)

25 of 66

Temperature sampling

Why does this work?

  • When τ is close to 1, the distribution doesn’t change much.
  • The lower τ is, the larger the scores being passed to the softmax.
  • Softmax pushes high values toward 1 and low values toward 0.
  • Large inputs push high-probability words higher and low-probability words lower, making the distribution more greedy.
  • As τ approaches 0, the probability of the most likely word approaches 1.

0 < τ ≤ 1
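The softmax(u / τ) trick above can be sketched directly; the function name and toy logits here are illustrative:

```python
import math
import random

def temperature_sample(logits, tau, rng=random):
    """Divide the logits by tau, then softmax-sample.

    tau = 1 recovers plain random sampling; as tau -> 0 the scaled
    logits grow, the softmax sharpens, and sampling approaches greedy
    decoding (always the most probable word)."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```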

26 of 66

Pretraining Large Language Models: Algorithm

27 of 66

Pretraining

The big idea that underlies all the amazing performance of language models

First pretrain a transformer model on enormous amounts of text

Then apply it to new tasks.

28 of 66

Self-supervised training algorithm

We just train them to predict the next word!

  1. Take a corpus of text
  2. At each time step t
    1. ask the model to predict the next word
    2. train the model using gradient descent to minimize the error in this prediction

"Self-supervised" because it just uses the next word as the label!

29 of 66

Intuition of language model training: loss

  • Same loss function: cross-entropy loss
    • We want the model to assign a high probability to true word w
    • = want loss to be high if the model assigns too low a probability to w
  • CE Loss: The negative log probability that the model assigns to the true next word w
    • If the model assigns too low a probability to w
    • We move the model weights in the direction that assigns a higher probability to w

30 of 66

Cross-entropy loss for language modeling

CE loss: the difference between the correct probability distribution and the predicted distribution.

The correct distribution yt knows the next word, so it is 1 for the actual next word and 0 for all the others.

So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:

    LCE = − log p(wt+1 | w1:t)

31 of 66

Teacher forcing

  • At each token position t, model sees correct tokens w1:t,
    • Computes loss (–log probability) for the next token wt+1
  • At next token position t+1 we ignore what model predicted for wt+1
    • Instead we take the correct word wt+1, add it to context, move on
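The two slides above combine into a simple loop; this sketch is illustrative, with `model_probs` as a hypothetical stand-in for a real transformer's output distribution:

```python
import math

def teacher_forced_loss(model_probs, tokens):
    """Average cross-entropy loss over a sequence, with teacher forcing:
    at each position t the model conditions on the *gold* prefix
    tokens[:t] (not its own predictions), and the loss is -log of the
    probability it assigns to the gold next token tokens[t]."""
    total = 0.0
    for t in range(1, len(tokens)):
        probs = model_probs(tokens[:t])       # distribution over the vocabulary
        total += -math.log(probs[tokens[t]])  # CE loss for the true next word
    return total / (len(tokens) - 1)

# A toy "model" over a 3-word vocabulary that always predicts uniformly.
uniform = lambda prefix: [1/3, 1/3, 1/3]
loss = teacher_forced_loss(uniform, [0, 2, 1, 1])  # = log 3 ≈ 1.0986
```

In real training this loss is then backpropagated through the transformer; here the point is only that the gold prefix, not the model's own guess, feeds each step.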

32 of 66

Training a transformer language model

33 of 66

Pretraining data for LLMs

34 of 66

LLMs are mainly trained on the web

Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages

Colossal Clean Crawled Corpus (C4; Raffel et al. 2020), 156 billion tokens of English, filtered

What's in it? Mostly patent text documents, Wikipedia, and news sites

35 of 66

The Pile: a pretraining corpus

Its composition spans four broad categories: web, academics, books, dialog.

36 of 66

Filtering for quality and safety

Quality is subjective

  • Many LLMs attempt to match Wikipedia, books, particular websites
  • Need to remove boilerplate, adult content
  • Deduplication at many levels (URLs, documents, even lines)

Safety is also subjective

  • Toxicity detection is important, although it has mixed results
  • It can mistakenly flag data written in dialects like African American English

37 of 66

What does a model learn from pretraining?

  • There are canines everywhere! One dog in the front room, and two dogs
  • It wasn't just big it was enormous
  • The author of "A Room of One's Own" is Virginia Woolf
  • The doctor told me that he
  • The square root of 4 is 2

38 of 66

Big idea

Text contains enormous amounts of knowledge

Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much

39 of 66

But there are problems with scraping from the web

Copyright: much of the text in these datasets is copyrighted

  • Not clear if the fair use doctrine in the US allows this use
  • This remains an open legal question

Data consent

  • Website owners can indicate they don't want their site crawled

Privacy:

  • Websites can contain private IP addresses and phone numbers

40 of 66

Finetuning

41 of 66

Finetuning for adaptation to new domains

What happens if we need our LLM to work well on a domain it didn't see in pretraining?

Perhaps some specific medical or legal domain?

Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?

42 of 66

Finetuning

43 of 66

"Finetuning" means 4 different things

We'll discuss 1 here, and 3 in later lectures

In all four cases, finetuning means:

taking a pretrained model and further adapting some or all of its parameters to some new data

44 of 66

1. Finetuning as "continued pretraining" on new data

  • Further train all the parameters of model on new data
    • using the same method (word prediction) and loss function (cross-entropy loss) as for pretraining.
    • as if the new data were at the tail end of the pretraining data
  • Hence sometimes called continued pretraining


45 of 66

Evaluating Large Language Models

46 of 66

Perplexity

Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text

The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.

For a test set of n tokens w1:n the perplexity is:

    Perplexity(w1:n) = P(w1:n)^(−1/n) = ( Π_{i=1..n} 1 / P(wi | w<i) )^(1/n)

47 of 66

Why perplexity instead of raw probability of the test set?

  • Probability depends on size of test set
    • Probability gets smaller the longer the text
    • Better: a metric that is per-word, normalized by length
  • Perplexity is the inverse probability of the test set, normalized by the number of words

(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)

Probability range is [0,1]; perplexity range is [1, ∞)

48 of 66

Perplexity

  • The higher the probability of the word sequence, the lower the perplexity.
  • Thus the lower the perplexity of a model on the data, the better the model.
  • Minimizing perplexity is the same as maximizing probability

Also: perplexity is sensitive to length/tokenization so best used when comparing LMs that use the same tokenizer.
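Perplexity is usually computed from per-token log probabilities, as exp of the average negative log probability (equivalent to the normalized inverse probability above). A minimal sketch, with illustrative names:

```python
import math

def perplexity(token_probs):
    """Perplexity of a test set, given the probability the model assigned
    to each of its n tokens: exp of the average negative log probability."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns probability 1/4 to every token has perplexity ≈ 4:
# it is as confused as if choosing uniformly among 4 words at each step.
ppl = perplexity([0.25] * 10)
```

Higher probabilities on the test tokens drive this number down, matching the bullets above: lower perplexity means a better model.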

49 of 66

Many other factors that we evaluate, like:

Size

Big models take lots of GPUs and time to train, memory to store

Energy usage

Can measure kWh or kilograms of CO2 emitted

Fairness

Benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups.

50 of 66

Dealing with Scale

51 of 66

Scaling Laws

LLM performance depends on

  • Model size: the number of parameters, not counting embeddings
  • Dataset size: the amount of training data (in tokens)
  • Compute: the amount of compute used for training (in FLOPs, etc.)

Can improve a model by adding parameters (more layers, wider contexts), more data, or training for more iterations

The performance of a large language model (the loss) scales as a power-law with each of these three

52 of 66

Scaling Laws

Loss L as a function of the number of parameters N, dataset size D, or compute budget C (when the other two are held constant)

Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.
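Holding the other two factors constant, each dependence is an empirical power law; the constants Nc, Dc, Cc and exponents αN, αD, αC are fit to training curves (Kaplan et al., 2020):

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```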

53 of 66

Number of non-embedding parameters N

Non-embedding parameters scale roughly as N ≈ 12 · nlayer · d². Thus GPT-3, with nlayer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.

54 of 66

KV Cache

In training, we can compute attention very efficiently in parallel:

But not at inference! We generate the next tokens one at a time!

For each new token x, we need to multiply it by WQ, WK, and WV to get its query, key, and value vectors.

But we don't want to recompute the key and value vectors for all the prior tokens x<i.

Instead, we store the key and value vectors in memory in the KV cache, and then we can just grab them from the cache.
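A single-head sketch of the idea, assuming the query/key/value projections have already been applied; the class and function names are illustrative:

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one new token over cached keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax over the scores (max-subtracted for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Store each generated token's key and value vectors, so at inference
    we never re-project prior tokens: only the new token's q, k, v are computed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        self.keys.append(key)      # cache this token's key ...
        self.values.append(value)  # ... and value for all later steps
        return attend(query, self.keys, self.values)
```

Each generation step thus does one projection and one attention over the cache, instead of re-running attention for the whole prefix.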

55 of 66

KV Cache

56 of 66

Parameter-Efficient Finetuning

Adapting to a new domain by continued pretraining (finetuning) is a problem with huge LLMs.

  • Enormous numbers of parameters to train
  • Each pass of batch gradient descent has to backpropagate through many many huge layers.
  • Expensive in processing power, in memory, and in time.

Instead, parameter-efficient fine tuning (PEFT)

  • Efficiently select a subset of parameters to update when finetuning.
  • E.g., freeze some of the parameters (don’t change them),
  • And only update some a few parameters.

57 of 66

LoRA (Low-Rank Adaptation)

  • Transformers have many dense matrix-multiply layers
    • Like WQ, WK, WV, WO layers in attention
  • Instead of updating these layers during finetuning,
    • Freeze these layers
    • Update a low-rank approximation with fewer parameters.

58 of 66

LoRA

  • Consider a matrix W (shape [N × d]) that needs to be updated during finetuning via gradient descent.
    • Normally the updates are ∆W (shape [N × d])
  • In LoRA, we freeze W and instead update a low-rank decomposition of ∆W:
    • A of shape [N × r],
    • B of shape [r × d], where r is very small (like 1 or 2)
    • That is, during finetuning we update A and B instead of W.
    • Replace W + ∆W with W + AB.

Forward pass: instead of

h = xW

We do

h = xW + xAB
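The LoRA forward pass with the slide's shapes can be sketched directly (helper names are mine; a real implementation would use a tensor library):

```python
def matmul(x, W):
    """Row vector x times matrix W (given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

def lora_forward(x, W, A, B):
    """h = xW + xAB: the frozen weight W plus the low-rank update,
    with A of shape [N x r] and B of shape [r x d] as on the slide.
    Computing x -> xA -> (xA)B costs O(r(N + d)) extra work, tiny when
    r is 1 or 2, and only A and B receive gradients during finetuning."""
    return [hw + hab
            for hw, hab in zip(matmul(x, W), matmul(matmul(x, A), B))]

# Toy check with N = d = 2, r = 1.
h = lora_forward([1.0, 2.0],
                 W=[[1.0, 0.0], [0.0, 1.0]],   # frozen pretrained weight
                 A=[[1.0], [1.0]],             # trainable, shape [2 x 1]
                 B=[[2.0, 3.0]])               # trainable, shape [1 x 2]
```

A common design choice (worth noting, though not on the slide) is to initialize one factor to zero so that AB = 0 and finetuning starts exactly at the pretrained model.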

59 of 66

LoRA

60 of 66

Harms of Large Language Models

61 of 66

Hallucination

62 of 66

Copyright

63 of 66

Privacy

64 of 66

Toxicity and Abuse

65 of 66

Misinformation

66 of 66

Thank You