1 of 56

Lecture 29:

Advanced: LLMs

CS 136: Spring 2024

Katie Keith

2 of 56

Record on Zoom

3 of 56

  • Lab 8 due Wednesday or Thursday
  • Final project presentation schedule here (“Schedule” tab)
    • You’ll demo the code you have so far (you’re not expected to be completely done)
  • In Gradescope:
    • All labs through Lab 7 returned
    • Comments on project proposals
  • Lab this week:
    • Feel free to coordinate with your project partner to attend the same lab
    • I will check in with groups about their project proposals during lab

📣 Announcements

4 of 56

This week: Advanced Topics

The last day of content included on the final exam was last Friday (May 2)

  • Today (Monday) Advanced: A* Search Algorithm
  • Wednesday Advanced: LLMs
  • Friday – Review for final exam and wrap-up
    • Friday will NOT be recorded, so if you’d like final exam hints, I’d encourage you to show up

5 of 56

Two AI Paradigms

Symbolic AI

  • Top-down
  • Logic, symbols, rules, databases
  • Search, planning, expert systems

Statistical AI

  • Bottom-up
  • Learns from data, pattern recognition, uncertainty modeling
  • Machine learning, deep learning/neural networks

We’ll explore: A* Search (symbolic AI) and LLMs (statistical AI)

6 of 56

  • LLM Concept 1: Non-linear classification
  • LLM Concept 2: Self-supervision
  • LLMs for data science

🎯 Today’s Learning Objectives

7 of 56

Mainstream optimism for AI and LLMs

in policy and healthcare

8 of 56

Recent LLMs

  • OpenAI’s GPT-4
    • 1.76 trillion weights (parameters)
    • Over $100 million to train
  • Google’s Gemini
    • 1.56 trillion weights (parameters)
    • Cost $30-191 million to train (before engineer salaries)


10 of 56

Mainstream optimism for AI and LLMs

in policy and healthcare

[Diagram: decisions (policy, treatment) → outcomes. Evaluating these links is causal inference!]

11 of 56

Katie’s Research

In proximal causal inference, LLMs (and other ML models) are helpful but not sufficient subroutines of data science and causal inference.

[Diagram: two LLMs, LLM-1 and LLM-2, used as subroutines]

12 of 56

ChatGPT in 3 slides


13 of 56

[Diagram: next-word prediction]

User’s prompt: “Recite Asimov's first law.”

Generated text so far: “A robot must”

Next-word prediction, P(next word | previous words):

  not      0.81
  fulfill  0.13
  adhere   0.02
  ...

The model samples the next word (here, “not”) from this distribution.

Where does P(next word | previous words) come from?
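The sampling step above can be sketched in a few lines of Python (the probabilities are the illustrative numbers from this slide, not from a real model):

```python
import random

# Toy distribution P(next word | "Recite Asimov's first law. A robot must").
# Probabilities are illustrative numbers from the slide, not from a real model.
next_word_probs = {"not": 0.81, "fulfill": 0.13, "adhere": 0.02}

def sample_next_word(probs):
    """Sample one word from P(next word | previous words)."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# "not" is drawn most often, since it carries 0.81 of the probability mass.
counts = {w: 0 for w in next_word_probs}
for _ in range(1000):
    counts[sample_next_word(next_word_probs)] += 1
print(counts)
```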

14 of 56

ChatGPT is trained in three steps:

1. Data from the internet.

2. Optimize P(next word | previous words): gradient descent on a non-linear function with billions of parameters!

   Loss function: increases if the model makes an incorrect prediction for a masked word. For example, in “The [MASK] slithered silently through the jungle underbrush,” predicting “camels” instead of the true word “python” increases the loss, even though “python” refers to a programming language in “We use python to analyze our data and make visualizations.” Sum the loss over billions of masked words.

3. Data: humans rank multiple outputs (e.g., output #2 > output #1).

We help train ChatGPT too...
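The loss at a single masked position can be illustrated concretely. A minimal sketch with made-up probabilities, using the cross-entropy loss (the negative log probability of the true word):

```python
import math

# Loss at one masked position: -log P(true word). The lower the model's
# probability on the true word, the larger the loss.
def masked_word_loss(predicted_probs, true_word):
    return -math.log(predicted_probs[true_word])

# "The [MASK] slithered silently through the jungle underbrush."
# A model that knows "python" can mean a snake does well; one that
# predicts "camels" incurs a large loss. Probabilities are made up.
good_model = {"python": 0.70, "camels": 0.10, "snake": 0.20}
bad_model  = {"python": 0.05, "camels": 0.90, "snake": 0.05}

print(masked_word_loss(good_model, "python"))  # ≈ 0.36
print(masked_word_loss(bad_model, "python"))   # ≈ 3.00
# Training sums this loss over billions of masked positions.
```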

15 of 56

  • LLM Concept 1: Non-linear classification
  • LLM Concept 2: Self-supervision
  • LLMs for data science

🎯 Today’s Learning Objectives

16 of 56

A line in 2D space

y = mx + b

m: slope

b: intercept


17 of 56

Linear regression, i.e. fitting lines

y = mx + b

m: slope

b: intercept

Linear regression:

Find m and b that minimize the sum of squared residuals

18 of 56

Linear regression beyond 2D

19 of 56

Classification: Logistic regression

We want to learn a decision boundary between discrete classes.

Example: Classifying movie genres from their reviews

20 of 56

Warm-up. Draw me a decision boundary (straight line) that perfectly separates the two classes (colors).

💡Think-pair-share

21 of 56

History of deep learning: XOR problem

Minsky and Papert (1969) proved that a perceptron (a simple linear classifier with no non-linear activation) cannot solve the logical operation XOR.

Exclusive-or (XOR) outputs true (1) only when the inputs are different from each other

  x1  x2 | OR | XOR
   0   0 |  0 |  0
   0   1 |  1 |  1
   1   0 |  1 |  1
   1   1 |  1 |  0

OR is linearly separable; XOR is not linearly separable.
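Although no single line separates XOR, a network with one hidden layer computes it. A sketch with hand-picked weights (chosen by hand for illustration, not learned) and a step activation:

```python
def step(z):
    """Threshold (step) activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: fires for OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2: fires for AND
    return step(h1 - h2 - 0.5)  # output: OR and not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_network(x1, x2))  # outputs 0, 1, 1, 0
```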

22 of 56

Models

Input → Deep Learning Network → Output

Many deep learning “architecture” options.

Metaphor: stacking Lego pieces.

23 of 56

Logistic regression as a “shallow” network

Feature vector (input) → linear layer* → non-linear activation → output

*also called a “fully connected” or “affine” layer

24 of 56

Feedforward deep learning network

Input → linear layer → non-linear activation → linear layer → non-linear activation → output

“Architecture” design decision: we can stack as many layers as we want!

In a feedforward network, the computation proceeds iteratively from one layer of units to the next (and there are no cycles).
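Both pictures can be sketched in a few lines of NumPy. The layer sizes (3 → 4 → 1) are arbitrary illustrative choices, and the weights are random rather than trained:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# "Shallow" network (logistic regression): one linear layer + sigmoid.
W, b = rng.normal(size=(1, 3)), np.zeros(1)

def shallow(x):
    return sigmoid(W @ x + b)

# Feedforward deep network: stack linear layers with non-linear
# activations in between. Weights are random here, not trained.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def deep(x):
    h = np.maximum(0, W1 @ x + b1)  # linear layer, then ReLU activation
    return sigmoid(W2 @ h + b2)     # linear layer, then sigmoid output

x = np.array([1.0, 0.5, -0.2])
print(shallow(x), deep(x))  # each output is a probability in (0, 1)
```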

25 of 56

Deep networks can learn non-linear relationships

26 of 56

  • LLM Concept 1: Non-linear classification
  • LLM Concept 2: Self-supervision
  • LLMs for data science

🎯 Today’s Learning Objectives

27 of 56

Linear regression, aka fitting lines

y = mx + b

m: slope

b: intercept

Linear regression:

Find m and b that minimize the sum of squared residuals (d_i).

Loss function: mean squared error (MSE),

  MSE(m, b) = (1/n) Σ_i (y_i − (m·x_i + b))²

28 of 56

Training linear regression

For linear regression, we can derive an explicit formula for the optimal weights using the least-squares approach.

However, for logistic regression, there is no closed-form (analytical) solution for the optimal weights. We need to find approximately optimal weights via computation.
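For the one-feature case, the explicit least-squares formula can be written out directly. A minimal sketch:

```python
# Closed-form least squares for y = m*x + b (one feature):
#   m = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),   b = ȳ - m*x̄
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    return m, b

# Data that lies exactly on y = 2x + 1, so the fit recovers m = 2, b = 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
print(fit_line(xs, ys))  # -> (2.0, 1.0)
```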

29 of 56

Gradient Descent (intuition)

Thought experiment: How would you get to the bottom of a crater if you were blindfolded?

[Image: Meteor Crater impact site, Coconino County, AZ]

30 of 56

Gradient descent in two dimensions: surface plot

Take small steps in the “steepest downhill direction.”

[Figure: surface plot of the loss function over two weights/slopes (Dim 1, Dim 2)]
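The same line-fitting problem can be solved the “blindfolded in a crater” way. A minimal gradient-descent sketch on the MSE loss (the learning rate and iteration count are illustrative choices):

```python
# Gradient descent on MSE for y = m*x + b: take small steps downhill.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # lies exactly on y = 2x + 1
n = len(xs)

m, b = 0.0, 0.0   # start somewhere on the loss surface
lr = 0.05         # step size (learning rate)

for _ in range(5000):
    # Partial derivatives of MSE = (1/n) * sum((m*x + b - y)^2)
    grad_m = (2 / n) * sum((m * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))
    m -= lr * grad_m   # step in the steepest downhill direction
    b -= lr * grad_b

print(round(m, 3), round(b, 3))  # -> 2.0 1.0
```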

31 of 56

Classification: Logistic regression

We want to learn a decision boundary between discrete classes.

Problem: it is expensive to get classification labels

32 of 56

Definition by example

Guess. What is the meaning of the word tezgüino?

Word in context:

  1. A bottle of tezgüino is on the table.
  2. Everybody likes tezgüino.
  3. Don’t have tezgüino before you drive.
  4. We make tezgüino out of corn.

Example credit: Lin 1998, Eisenstein ANLP, 14.1

33 of 56

Distributional hypothesis (linguistics)

A word’s meaning can be derived from its context.

Example word in context:

  • A bottle of tezgüino is on the table.
  • Everybody likes tezgüino.
  • Don’t have tezgüino before you drive.
  • We make tezgüino out of corn.

J.R. Firth

Linguist, 1890-1960

“A word is characterized by the company it keeps.”

34 of 56

Self-supervision: Predict next words

Self-supervision is the process by which a model learns to predict part of the data using other parts of the same data as implicit labels.

Advantage: Cheap and abundant training data! No need for manually labeled training examples.


35 of 56

Next word prediction ➡️ many tasks!

Grammaticality; subject-verb agreement:
P(“The cat sat on the mat.”) > P(“The cat sats on the mat.”)

World knowledge:
P(“The cat sat on the mat.”) > P(“The whale sat on the mat.”)

Addition:
P(“4” | “2+2=”) > P(“5” | “2+2=”)

Sentiment analysis:
P(“1 star” | “That movie was terrible. I’d give it ”) > P(“5 stars” | “That movie was terrible. I’d give it ”)

Examples adapted from Alec Radford

36 of 56

Next-word Prediction for LLMs

LLMs are pre-trained using a next word prediction self-supervised task.

Pseudocode:

  Input: corpus of text
  For each masked token t:
      the model predicts P(t | context)
      model weights are updated via a (variant of) gradient descent step
      using the true (unmasked) word

Lots of variation in how masked tokens are selected, e.g., BERT randomly selects 15% of tokens.
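The mask-selection step can be sketched as follows (BERT-style random 15% selection; whitespace tokenization is a simplification of real subword tokenizers):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly mask ~15% of tokens; the originals become the labels."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok          # true (unmasked) word = training label
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, labels

tokens = "we use python to analyze our data and make visualizations".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)   # {position: true word} the model is trained to predict
```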

37 of 56

LLMs require large amounts of compute

  • Apple M1 Pro 16-core GPU: 5.3 × 10^12 FLOPS (floating-point operations per second)
  • GPT-3 total training compute: 3.14 × 10^23 FLOPs (total floating-point operations)
  • Factor of ~60 billion: running GPT-3’s training on an M1 Pro would take about 60 billion seconds (roughly 1,900 years)

38 of 56

Autoregressive generation

An autoregressive model generates a token, adds that generated token to its input sequence, and repeats.

[Diagram of the generation loop]

Figure credit: Jay Alammar
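The loop can be sketched directly, with a toy lookup table standing in for a real LLM's next-token predictor (the table is contrived for this example):

```python
def toy_next_token(tokens):
    """Stand-in for an LLM: predict the next token from the last one."""
    table = {".": "A", "A": "robot", "robot": "must", "must": "not"}
    return table.get(tokens[-1], "<eos>")

def generate(prompt_tokens, max_new=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        nxt = toy_next_token(tokens)
        if nxt == "<eos>":
            break
        tokens.append(nxt)   # feed the generated token back into the input
    return tokens

out = generate(["Recite", "Asimov's", "first", "law", "."])
print(" ".join(out))  # -> Recite Asimov's first law . A robot must not
```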

39 of 56

AI2 training OLMo “scare” 😱

When training OLMo, the AI2 team was monitoring the loss function and saw “fast spikes” in the figure below.

[Figure: loss vs. gradient descent iteration, with sudden spikes]

Slide credit: Hannaneh Hajishirzi, COLM, 2024.

Q: Any guesses why? 🍬

40 of 56

“Bug” found in the training data

Slide credit: Hannaneh Hajishirzi, COLM, 2024.

Takeaway: “Garbage in, garbage out”

41 of 56

  • LLM Concept 1: Non-linear classification
  • LLM Concept 2: Self-supervision
  • LLMs for data science

🎯 Today’s Learning Objectives

42 of 56

Unstructured text can be more abundant, expressive, and flexible than structured data

… but difficult to directly incorporate text data into existing causal methods

And only increasing…

43 of 56

For many applications, text data may be a summary (or record) of structured confounding variables

[Causal diagram: treatment A (blood thinner vs. clot buster), outcome Y, confounders U (e.g., atrial fibrillation) and C (age, sex, severity, family history, …); the text, e.g., clinical notes in an electronic health record, serves as a record of the confounding variables.]

44 of 56

Zero-shot classifiers perform an unseen task with no supervised examples


Figures credit: Wei et al. “Finetuned language models are zero-shot learners.” ICLR, 2022.

45 of 56

Zero-shot predictions from LLMs


Prompt template:

  Context: {Text} \n
  Is it likely the patient has {U}?\n
  Constraint: Even if you are uncertain, you must pick either “Yes” or “No” without using any other words.

Example instance (contrived for this talk):

  Context: Patient reports intermittent episodes of rapid, fluttering heartbeats over the past week, often accompanied by lightheadedness and shortness of breath \n
  Is it likely the patient has atrial fibrillation?\n
  Constraint: Even if you are uncertain, you must pick either “Yes” or “No” without using any other words.

LLM: FLAN-T5 XXL (Chung et al. 2024) or OLMo-7B-Instruct (Groeneveld et al. 2024)

Generated text output: “Yes”

Deterministic answer extraction: if “Yes” in output, W = 1; else W = 0. (*Same for Z.)

Simple set-up for our proof-of-concept; it likely could be engineered for improvement.
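The prompt fill and the deterministic answer extraction can be sketched directly from the slide (the LLM call itself is omitted; `generated` below is a stand-in for the model's output):

```python
# Prompt template from the slide; {text} and {u} are filled per instance.
TEMPLATE = (
    "Context: {text} \n"
    "Is it likely the patient has {u}?\n"
    "Constraint: Even if you are uncertain, you must pick either "
    '"Yes" or "No" without using any other words.'
)

prompt = TEMPLATE.format(
    text="Patient reports intermittent episodes of rapid, fluttering heartbeats ...",
    u="atrial fibrillation",
)

# Deterministic answer extraction: if "Yes" in output, W = 1; else W = 0.
def extract_answer(generated_text):
    return 1 if "Yes" in generated_text else 0

generated = "Yes"   # stand-in for the LLM's generated text output
print(extract_answer(generated))        # -> 1
print(extract_answer("No"))             # -> 0
print(extract_answer("I'm not sure"))   # -> 0 (no "Yes" in the output)
```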

46 of 56

Our pipeline: proximal causal inference with text

[Pipeline diagram, steps 1–10: MIMIC-III clinical notes → remove discharge summaries → split via metadata into clinical note categories (echocardiogram, nursing notes) → zero-shot predictions from LLM-1 (FLAN-T5) and LLM-2 (OLMo) → odds ratio heuristic (fails: stop; passes: continue) → proximal g-formula → estimate of the ACE.]



54 of 56

Results Highlights: Semi-synthetic for ACE estimates

[Figure: estimated ACE under three approaches (W directly in backdoor; one LLM, FLAN; two LLMs, FLAN and OLMo), with U = A-sis (coronary atherosclerosis of the native coronary artery), T1pre = echocardiogram notes, T2pre = radiology notes. 95% confidence intervals via bootstrap resampling. Blue: passed odds ratio heuristic; red: failed odds ratio heuristic.]

Takeaways:

  • Value of using two zero-shot models in practice
  • The odds ratio heuristic can stop the pipeline before producing biased estimates
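The percentile bootstrap behind those confidence intervals can be sketched generically. Here the statistic is a simple mean and the data are made-up numbers; in the pipeline the statistic would be the ACE estimator itself:

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (95% by default) for stat(data)."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]   # sample with replacement
        reps.append(stat(resample))
    reps.sort()
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def mean(xs):
    return sum(xs) / len(xs)

data = [1.2, 0.8, 1.0, 1.4, 0.9, 1.1, 0.7, 1.3]   # made-up estimates
print(bootstrap_ci(data, mean))   # an interval around mean(data) = 1.05
```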

55 of 56

LLMs helpful but not sufficient subroutine for causal inference with text

LLMs Helpful:

  • Design requires zero-shot classifiers with “decent predictive performance”
  • Improved performance of LLMs should narrow confidence intervals of ACE estimates

LLMs Not Sufficient:

  • “No unmeasured confounding” and conditional independence of the text instances are untestable assumptions that are difficult to satisfy and often require domain expertise
  • We still need causal identification, estimation, and sensitivity analysis; e.g., our odds ratio heuristic effectively flags when to stop or proceed


56 of 56

  • LLM Concept 1: Non-linear classification
  • LLM Concept 2: Self-supervision
  • LLMs for data science

🎯 Today’s Learning Objectives