1 of 25

Generative AI and LLMs

Dr. Savannah Thais

ENGI E4800 Lecture 6

2 of 25

Is ChatGPT Trustworthy?

3 of 25

Discussion on Red Teaming Exercises

  • What was the ChatGPT generated essay like?
    • How would it compare to your own essays?
    • Was it factually correct and well argued?
    • Were the sources credible?
  • What happened in the red teaming experiments?
    • Was performance consistent?
    • Were you able to change the model’s behavior?
  • Was this behavior what you expected?

4 of 25

What are your biggest concerns around generative AI?

5 of 25

What do you think generative AI can safely be used for?

6 of 25

How Do LLMs Work?

7 of 25

Attention

  • Transformers consist of an encoder and decoder with attention
    • Originally developed for sequence-to-sequence learning
  • Attention allows models to learn what to ‘focus’ on
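
One way to make 'focus' concrete is scaled dot-product attention: a query is compared against every key, the scores are normalized with a softmax, and the output is the resulting weighted average of the values. The sketch below is a minimal pure-Python illustration; the vectors and the function names (`softmax`, `attend`) are invented for this example and are not taken from the lecture.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Illustrative toy vectors (assumed, not from the slides)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([2.0, 0.0], keys, values)
```

Here the query points along the first axis, so the first and third keys receive more weight than the second; the weights show where the model is 'focusing'.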

8 of 25

Is All You Need

  • Transformers remove the recurrent structure and add multi-head attention
    • Add a positional encoding to account for ordering
  • Output embedding is trained to predict sequence shifted by one word
    • Q = vector representation of one word, K = vector representations of all words, V = (different) vector representations of all words
  • Inference is run iteratively
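
The 'shifted by one word' training target and the iterative inference loop can be sketched with a toy stand-in. The example below uses a bigram count table in place of a transformer, so it conditions only on the previous token rather than on the full prefix; that simplification, the tiny corpus, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def shifted_pairs(tokens):
    """Training pairs: the input at position t is paired with the token at t+1."""
    return list(zip(tokens[:-1], tokens[1:]))

def train_bigram(tokens):
    """Count which token follows which (a toy stand-in for training)."""
    counts = defaultdict(Counter)
    for current, nxt in shifted_pairs(tokens):
        counts[current][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens):
    """Iterative inference: predict one token, append it, and repeat."""
    out = [start]
    for _ in range(max_new_tokens):
        options = counts.get(out[-1])
        if not options:
            break  # nothing was ever observed after this token
        out.append(options.most_common(1)[0][0])  # greedy choice
    return out

corpus = "the cat sat on the mat".split()
model = train_bigram(corpus)
generated = generate(model, "cat", 3)
```

Each generated token is fed back in as input for the next step, which is the sense in which inference is run iteratively.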

9 of 25

(Except for Humans)

  • To enable ‘chat’ behavior and (theoretically) prevent some negative uses, LLM developers rely on Reinforcement Learning from Human Feedback (RLHF)
    • Humans review and rate model behavior; feedback loops discourage the model from producing content similar to what was rated poorly
    • Can also include hard-coded rules
  • Red teaming experiments help researchers understand model edge cases and characterize performance
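
The feedback loop can be illustrated with a heavily simplified sketch: a softmax 'policy' over three canned responses is nudged toward responses that humans rated above the policy's current average. This is not how any production system is implemented (those typically train a separate reward model and fine-tune the full LLM); the responses, ratings, and learning rate are hypothetical values chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def feedback_step(logits, ratings, learning_rate=0.5):
    """One expected policy-gradient step on the average human rating."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, ratings))  # current average rating
    # Gradient of expected rating w.r.t. logit i is p_i * (r_i - baseline)
    return [l + learning_rate * p * (r - baseline)
            for l, p, r in zip(logits, probs, ratings)]

responses = ["helpful answer", "harmful answer", "refusal"]  # hypothetical
ratings = [1.0, -1.0, 0.2]                                   # hypothetical human scores
logits = [0.0, 0.0, 0.0]                                     # start with no preference
for _ in range(50):
    logits = feedback_step(logits, ratings)
final_probs = softmax(logits)
```

After repeated feedback the poorly rated response becomes less likely and the highly rated one more likely, which is the discouraging effect described above.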

10 of 25

What Can LLMs Do?

11 of 25

Maybe a lot…

12 of 25

How should we evaluate an LLM's capabilities?

13 of 25

14 of 25

3 Challenges with LLM Evaluations

  • Prompt Sensitivity
    • Are you measuring something intrinsic about the model, or is it an artifact of your prompt?
  • Construct Validity
    • Model behavior is not a construct that exists independently of users and prompting
  • Contamination
    • What exists in the training data? Is the model demonstrating behavior or memorization?
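
A simple check for prompt sensitivity is to score the same items under several paraphrased prompts and report the spread. In the sketch below, `toy_model` is a hypothetical stand-in for an LLM call that is deliberately brittle to phrasing; the templates and test items are likewise invented for illustration.

```python
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def toy_model(prompt):
    """Hypothetical model: only reasons correctly for one phrasing."""
    number = int("".join(ch for ch in prompt if ch.isdigit()))
    if prompt.startswith("Is "):
        return "yes" if is_prime(number) else "no"
    return "yes"  # falls back to a default answer for other phrasings

def accuracy(model, template, items):
    correct = sum(model(template.format(x=x)) == label for x, label in items)
    return correct / len(items)

items = [(7, "yes"), (13, "yes"), (9, "no"), (15, "no")]
templates = [
    "Is {x} a prime number?",
    "Tell me whether {x} is prime.",
    "{x}: prime or not?",
]
scores = [accuracy(toy_model, t, items) for t in templates]
spread = max(scores) - min(scores)  # large spread suggests a prompt artifact
```

Reporting a single number from one template would hide the fact that the measured 'ability' changes with the wording.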

15 of 25

Performance Degradation?

  • Paper claims GPT performance is degrading over time
  • Compare two model snapshots from March and June
  • Test on four tasks: code generation, checking primes, answering sensitive questions, and visual reasoning

16 of 25

Performance Degradation?

  • But, the tasks were not necessarily designed well…
  • All numbers selected for the math test were prime. Behavior likely depends on fine-tuning
  • For coding ability, they only tested if full response was directly executable
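
Both design problems can be demonstrated in a few lines. The first part shows that a 'model' that always answers "prime" scores perfectly on a primes-only test set but poorly on a mixed one. The second part shows that a response wrapped in a markdown code fence fails a "run the full response" check even when the code inside is fine. The function names and sample response are assumptions for illustration, and `compile` is used here as a safe syntax-only proxy for execution.

```python
import re

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def always_yes(_number):
    """Degenerate baseline: claims every number is prime."""
    return True

def accuracy(predict, numbers):
    return sum(predict(n) == is_prime(n) for n in numbers) / len(numbers)

primes_only = [n for n in range(2, 200) if is_prime(n)]
mixed = list(range(2, 102))
acc_primes_only = accuracy(always_yes, primes_only)  # looks perfect
acc_mixed = accuracy(always_yes, mixed)              # reveals the problem

TICKS = "`" * 3  # markdown code-fence delimiter
FENCE = re.compile(TICKS + r"(?:\w+)?\n(.*?)" + TICKS, re.DOTALL)

def extract_code(response):
    """Return the contents of the first fenced block, or the raw response."""
    match = FENCE.search(response)
    return match.group(1) if match else response

def parses(source):
    """Syntax-only check; a stand-in for actually executing the code."""
    try:
        compile(source, "<response>", "exec")
        return True
    except SyntaxError:
        return False

response = "Here is the code:\n" + TICKS + "python\nprint(1 + 1)\n" + TICKS
raw_ok = parses(response)                      # full response is not valid Python
extracted_ok = parses(extract_code(response))  # the code itself is
```

A drop in "directly executable" responses can therefore reflect a change in output formatting rather than a change in coding ability.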

17 of 25

Liberal Bias?

  • A paper found that RLHF results in ChatGPT having a strong liberal/Democratic bias
  • Prompt ChatGPT to respond to political statements while impersonating people from a side of the political spectrum and compare to neutral responses
  • Collect answers to the same question 100 times to reduce variability
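
Repeated sampling reduces random variation in the estimate, which can be quantified with a standard error. In the sketch below, `sample_response` is a hypothetical stand-in for one stochastic model answer (1 = agree, 0 = disagree), and the 0.7 agreement probability is an arbitrary assumption.

```python
import math
import random
import statistics

def sample_response(rng, agree_probability=0.7):
    """Hypothetical stand-in for one stochastic model response."""
    return 1 if rng.random() < agree_probability else 0

def estimate_agreement(n_samples, seed=0):
    """Average n_samples answers and report the standard error of the proportion."""
    rng = random.Random(seed)
    answers = [sample_response(rng) for _ in range(n_samples)]
    mean = statistics.mean(answers)
    stderr = math.sqrt(mean * (1 - mean) / n_samples)
    return mean, stderr

mean_100, stderr_100 = estimate_agreement(100)
```

With 100 samples the standard error of a proportion is at most 0.05. Note that averaging only reduces noise; it does not remove any systematic effect introduced by how the question was asked.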

18 of 25

Liberal Bias?

  • However, the paper had many scientific flaws
  • Questions were asked as multiple choice, with prompting designed to force the model to opine (no construct validity)
  • Generated politically neutral questions with ChatGPT and asked the model how a Democrat or Republican would answer
  • Results depend on question ordering and on asking all questions in the same session

Does ChatGPT have a liberal bias?: Narayanan and Kapoor

19 of 25

Exam Performance

  • OpenAI highlighted ChatGPT’s ability to pass a variety of exams in its technical report
  • Bar, USMLE, SATs, APs, etc.
  • Claims to test for data contamination by searching the training data set for substrings identical to those in the exam questions

20 of 25

Exam Performance

  • Again, a variety of scientific concerns with the results
  • Contamination test is brittle
  • Model demonstrates memorization on coding exams (cannot answer questions released after the training data cutoff date)
  • Construct validity: professional exams are designed for humans, don’t demonstrate generalization
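
The brittleness of an exact-substring contamination test is easy to demonstrate. The sketch below flags a question only if some fixed-length substring appears verbatim in the training text; the 50-character window, the sample question, and the function name are illustrative assumptions rather than a description of any specific published procedure.

```python
def exact_substring_contaminated(question, training_text, window=50):
    """Flag a question if any window-length substring appears verbatim in training_text."""
    if len(question) <= window:
        return question in training_text
    return any(
        question[i:i + window] in training_text
        for i in range(len(question) - window + 1)
    )

question = ("A train leaves Chicago at 9 am traveling 60 mph. "
            "When does it reach a city 180 miles away?")
training_text = "Practice set. " + question + " Answer: noon."

# Same question, trivially reformatted with doubled spaces
reformatted = question.replace(" ", "  ")

flag_exact = exact_substring_contaminated(question, training_text)
flag_reformatted = exact_substring_contaminated(reformatted, training_text)
```

The verbatim copy is flagged, but the reformatted copy of the same question is not, so a clean result from this kind of test does not show that the model never saw the exam content.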

21 of 25

Real World Exams

22 of 25

23 of 25

Discussion

  • Do these evaluations meet our three criteria?
    • Understanding of prompt sensitivity, construct validity, and lack of contamination
  • Are these tests executed in a convincing way?

24 of 25

Training Data

25 of 25

Societal Context