1 of 25

Generative AI and LLMs

Dr. Savannah Thais

ENGI E4800 Lecture 6

2 of 25

Is ChatGPT Trustworthy?

3 of 25

Discussion on Red Teaming Exercises

What was the ChatGPT generated essay like?

How would it compare to your own essays?
Was it factually correct and well argued?
Were the sources credible?

What happened in the red teaming experiments?

Was performance consistent?
Were you able to change the model’s behavior?

Was this behavior what you expected?

4 of 25

What are your biggest concerns around generative AI?

5 of 25

What do you think generative AI can safely be used for?

6 of 25

How Do LLMs Work?

7 of 25

Attention

Transformers consist of an encoder and decoder with attention

Originally developed for sequence-to-sequence learning

Attention allows models to learn what to ‘focus’ on

8 of 25

Is All You Need

Transformers remove the recurrent structure and add multi-head attention

Add a positional encoding to account for ordering
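As a minimal sketch of a positional encoding, the snippet below uses the fixed sinusoidal scheme described in Attention Is All You Need and adds it to a word embedding. The embedding values and the helper name `add_position` are made up for illustration; real models use much larger dimensions, and many use learned encodings instead.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

def add_position(embedding, position):
    """Hypothetical helper: add the encoding for `position` to a word embedding."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

# Toy 4-dimensional embedding of a single word (made-up numbers).
word = [0.5, -0.2, 0.1, 0.9]

# The same word at two different positions now has two different input vectors,
# so the model can tell the positions apart even without recurrence.
first = add_position(word, 0)
third = add_position(word, 2)
```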

Decoder input is the output sequence shifted right by one word, so the model is trained to predict the next word

Q = vector representation of one word; K = vector representation of all words; V = a (different) vector representation of all words
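The roles of Q, K, and V can be sketched with scaled dot-product attention for a single query word, following the formula softmax(QKᵀ/√d_k)V from Vaswani et al. This is a toy illustration in plain Python: the vectors are made-up numbers, whereas in a real transformer Q, K, and V come from learned projections of the word embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Similarity of the query word to every word in the sequence.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the attention-weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy 2-dimensional vectors for a three-word sequence (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
```

The weights show what the query word 'focuses' on: words whose keys align with the query receive more weight, and their values dominate the output.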

Inference is run iteratively
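Iterative inference can be sketched as a loop that predicts one token, appends it to the sequence, and repeats. The lookup table below is a hypothetical stand-in for a trained model and only looks at the last token; a real transformer attends over the whole sequence so far and usually samples rather than always taking the top choice.

```python
# Hypothetical stand-in for a trained model: last token -> next-token scores.
TOY_NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}

def predict_next(tokens):
    """Return the highest-scoring next token given the sequence so far."""
    scores = TOY_NEXT_TOKEN_SCORES.get(tokens[-1], {"<end>": 1.0})
    return max(scores, key=scores.get)

def generate(max_new_tokens=10):
    """Greedy decoding: one prediction per new token."""
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # feed the output back in as input
    return tokens[1:]

generated = generate()
```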

Attention Is All You Need: Vaswani et al.

9 of 25

(Except for Humans)

To enable ‘chat’ behavior and (theoretically) prevent some negative uses, LLM developers rely on Reinforcement Learning from Human Feedback (RLHF)

Humans review and rate model outputs; feedback loops discourage the model from producing content similar to poorly rated outputs
Can also include hard-coded rules

Red teaming experiments help researchers understand model edge cases and characterize performance

From OpenAI

10 of 25

What Can LLMs Do?

11 of 25

Maybe a lot…

12 of 25

How should we evaluate an LLM's capabilities?

13 of 25

ChatGPT Website and GPT4 Research Page

14 of 25

3 Challenges with LLM Evaluations

Prompt Sensitivity
Are you measuring something intrinsic about the model, or is it an artifact of your prompt?

Construct Validity
Model behavior is not a construct that exists independently of users and prompting

Contamination
What exists in the training data? Is the model demonstrating behavior or memorization?
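Prompt sensitivity can be illustrated by scoring the same model on the same questions under several prompt templates. The `toy_model` below is a hypothetical stand-in that only answers when the prompt explicitly asks for a number; the templates and questions are made up. The point of the sketch is that a single accuracy figure describes a model and prompt pair, not the model alone.

```python
def toy_model(prompt):
    """Hypothetical model: gives the sum only if the prompt asks for a number."""
    if "number" not in prompt.lower():
        return "I'd be happy to help with that!"
    cleaned = prompt.replace("?", " ").replace(".", " ")
    a, b = [int(tok) for tok in cleaned.split() if tok.isdigit()]
    return str(a + b)

templates = [
    "What is {a} + {b}?",
    "What is {a} + {b}? Reply with a number only.",
    "Compute {a} + {b}. Give just the number.",
]
questions = [(2, 3), (10, 7), (4, 4)]

def accuracy(template):
    """Fraction of questions answered correctly under one prompt template."""
    correct = 0
    for a, b in questions:
        reply = toy_model(template.format(a=a, b=b))
        correct += reply == str(a + b)
    return correct / len(questions)

# Same model, same questions, different prompts.
accuracy_by_template = {t: accuracy(t) for t in templates}
```

Reporting the full spread across templates, rather than one number, makes it easier to see whether a result is intrinsic to the model or an artifact of the prompt.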

Narayanan & Kapoor

15 of 25

Performance Degradation?

Paper claims GPT performance is degrading over time
Compares two model snapshots, from March and June 2023
Tests four tasks: code generation, checking primes, answering sensitive questions, and visual reasoning

How is ChatGPT’s Behavior Changing Over Time?: Chen et al.

16 of 25

Performance Degradation?

But, the tasks were not necessarily designed well…
All numbers selected for the math test were prime. Behavior likely depends on fine-tuning
For coding ability, they only tested whether the full response was directly executable
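Both design problems can be sketched with toy stand-ins. In the first part, a hypothetical 'model' that always answers 'prime' scores perfectly on an all-prime test set. In the second, a response that wraps correct code in a markdown fence fails a direct-execution check even though the code itself is fine. All names and data here are illustrative, and `compile` is used as a simple proxy for 'directly executable'.

```python
import re

def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def always_yes(n):
    """Hypothetical 'model' that answers 'prime' no matter the input."""
    return True

def accuracy(model, numbers):
    return sum(model(n) == is_prime(n) for n in numbers) / len(numbers)

only_primes = [7, 11, 13, 17, 19, 23]  # every correct answer is 'yes'
mixed = [7, 9, 11, 15, 17, 21]         # half prime, half composite

acc_only_primes = accuracy(always_yes, only_primes)
acc_mixed = accuracy(always_yes, mixed)

# A made-up model response: correct code wrapped in a markdown code fence.
FENCE = "`" * 3
response = (
    f"Here is the function:\n{FENCE}python\n"
    f"def add(a, b):\n    return a + b\n{FENCE}\n"
)

def runs_directly(text):
    """Proxy for 'directly executable': does the text parse as Python?"""
    try:
        compile(text, "<response>", "exec")
        return True
    except SyntaxError:
        return False

def extract_code(text):
    """Pull the code out of the first markdown fence, if there is one."""
    match = re.search(FENCE + r"(?:python)?\n(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1) if match else text

raw_ok = runs_directly(response)
extracted_ok = runs_directly(extract_code(response))
```

On the all-prime set the always-yes baseline looks perfect, and on the mixed set it drops to chance. Likewise, the fenced response only 'fails' because of formatting, which says little about coding ability.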

Is GPT-4 Getting Worse Over Time?: Narayanan & Kapoor

17 of 25

Liberal Bias?

A paper found that RLHF results in ChatGPT having a strong liberal/Democratic bias
Prompt ChatGPT to respond to political statements while impersonating people from a side of the political spectrum and compare to neutral responses
Collect answers to the same question 100 times to reduce variability
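Why repeating a question reduces variability can be sketched with a simulated non-deterministic model. The `ask_model` function and its 0 to 3 agreement scale are hypothetical stand-ins, not the paper's setup; the sketch only shows that an average over 100 answers varies far less from run to run than a single answer does.

```python
import random
import statistics

def ask_model(question, rng):
    """Hypothetical chat model call: returns a noisy agreement score from 0 to 3."""
    return rng.choice([0, 1, 1, 2, 2, 2, 3])

def mean_response(question, n_repeats, rng):
    """Average the model's answer over n_repeats independent queries."""
    return statistics.mean(ask_model(question, rng) for _ in range(n_repeats))

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Repeat the whole experiment 200 times with 1 query versus 100 queries each.
single_runs = [mean_response("q", 1, rng) for _ in range(200)]
averaged_runs = [mean_response("q", 100, rng) for _ in range(200)]

spread_single = statistics.pstdev(single_runs)
spread_averaged = statistics.pstdev(averaged_runs)
```

Averaging only reduces sampling noise. It does not correct for systematic effects such as how the question is phrased.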

More human than human: measuring ChatGPT political bias: Motoki et al.

18 of 25

Liberal Bias?

However, the paper had many scientific flaws
Questions were asked as multiple choice, with prompting that tried to force the model to opine (no construct validity)
Generated politically neutral questions with ChatGPT and asked the model how a Democrat or Republican would answer
Results depend on question ordering and on asking all questions in the same session

Does ChatGPT have a liberal bias?: Narayanan & Kapoor

19 of 25

Exam Performance

OpenAI highlighted GPT-4’s ability to pass a variety of exams in its technical report
Bar, USMLE, SATs, APs, etc.
OpenAI claims to test for data contamination by searching the training data set for substrings identical to parts of the exam questions
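A substring-based contamination check can be sketched as follows. This is an illustration under stated assumptions, not OpenAI's implementation: the function name, the normalization step, the sample count, the substring length, and the example texts are all illustrative choices.

```python
import random

def normalize(text):
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def looks_contaminated(question, training_text, n_samples=3, length=50, seed=0):
    """Flag a question if any sampled substring appears verbatim in the corpus."""
    rng = random.Random(seed)
    q = normalize(question)
    corpus = normalize(training_text)
    if len(q) <= length:
        return q in corpus
    for _ in range(n_samples):
        start = rng.randrange(len(q) - length + 1)
        if q[start:start + length] in corpus:
            return True
    return False

# Made-up training document and two versions of the same exam question.
training_text = (
    "Forum post: A train leaves the station at 3 pm travelling at 60 miles "
    "per hour. How far has it travelled by 5 pm? Answer: 120 miles."
)
verbatim = (
    "A train leaves the station at 3 pm travelling at 60 miles per hour. "
    "How far has it travelled by 5 pm?"
)
paraphrase = (
    "At 3 pm a train departs the station going 60 mph. "
    "What distance has it covered when the clock shows 5 pm?"
)

verbatim_flagged = looks_contaminated(verbatim, training_text)
paraphrase_flagged = looks_contaminated(paraphrase, training_text)
```

The verbatim copy is flagged, but the paraphrase of the same problem is not, even though the answer sits in the training text. An exact-match check of this kind can therefore miss reworded or lightly edited copies.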

GPT-4 Technical Report: OpenAI

20 of 25

Exam Performance

Again, a variety of scientific concerns with the results
Contamination test is brittle
Model demonstrates memorization on coding exams (cannot answer questions written after the training data cutoff date)
Construct validity: professional exams are designed for humans, don’t demonstrate generalization

GPT-4 and professional benchmarks: the wrong answer to the wrong question: Narayanan & Kapoor

21 of 25

Real World Exams

22 of 25

Case Study: AI Theory of Mind and Creativity

23 of 25

Discussion

Do these evaluations meet our three criteria?

Understanding of prompt sensitivity, construct validity, and lack of contamination

Are these tests executed in a convincing way?

24 of 25

Training Data

25 of 25

Societal Context