1 of 45

The Technical AI Safety Research Landscape

Magdalena Wache


Get these slides (for hyperlinks):

background image from aisafety.world

2 of 45

Outline

  • Approaches + Actors Overview (10 min)
  • Concrete Technical Research (25 min)
  • Summary + How to get started (5 min)
  • Q&A (10 min)


3 of 45

About You

Which of you…

  • … think AI safety is one of the world’s most important problems?
  • … are working on AI safety research?
  • … are planning to work on AI safety research?
  • … are considering working on AI safety research but still unsure?


4 of 45

About this Talk

  • Goal
    • A rough idea of what kinds of research exist
    • A structure for thinking about technical AIS research
      • “Technical” ≈ “to do with computer science, math or similar”
      • (non-technical research is not covered)
  • Disclaimers
    • This is how I think about AI safety, not necessarily how other people think about it.
    • I myself will probably think about this very differently 1 year from now.
    • Not “the canonical view” (there is no canonical view)
    • It’s simplified.
    • It’s (very!) incomplete.
  • Want to talk more about specific research? → Fill in this form:
    • I can probably introduce you to relevant people.
    • Err on the side of filling this in! Talking to people can be really valuable!


5 of 45

Links


Slides

Intro Form

(I will show them again at the end)

6 of 45

Overview: Research & Actors


background image from aisafety.world

7 of 45

Technical AI Safety Approaches


General goal: ensure AI systems don’t do really bad things

  • Alignment: make AI robustly do what we want
  • Interpretability: understand current systems better
  • Deconfusion: find better ways to think about AI safety
  • Automating Alignment: make AI do AI safety research
  • Governance: e.g. standards and regulation
  • Meta research:
    • Threat modeling: figure out how AI might become dangerous
    • Evaluations: show current capabilities/misalignment

(drawn as clouds because there are no clear distinctions between these approaches)

8 of 45

[Overview diagram repeated, now highlighting Alignment: make AI robustly do what we want]

9 of 45

What do we want to align with what?

  • Alignment: “what the AI does” matches “what we want”
  • Decomposed into:
    • Intent Alignment: “what the AI wants” matches “what we want”
    • Capability Robustness: “what the AI does” matches “what the AI wants”
  • Intent Alignment decomposed further:
    • Outer Alignment: “what the AI is trained on” matches “what we want”
    • Inner Alignment: “what the AI wants” matches “what the AI is trained on”
  • Notes on the terms:
    • “What the AI is trained on”: a mathematical function, e.g. a loss or reward
    • “What we want”: very vague, inconsistent, and who are “we” exactly?
    • “What the AI wants”: what does that even mean?

(Very simplified. Doesn't account for multipolar AI. Also a lot of alignment research doesn't neatly fit into these categories. It’s also disputed if the inner/outer distinction is productive)

10 of 45

Mini-Reflection (1 minute)

Anything confusing? → Tell your neighbor!

If not, anything you found interesting?

(There will be a few of these Mini Reflections throughout this talk)


11 of 45

Actors


12 of 45

Concrete Technical Research


13 of 45

[Alignment decomposition diagram repeated, highlighting Outer Alignment: does “what the AI is trained on” match “what we want”?]

14 of 45

Specification gaming

  • “Stack the Lego blocks” was specified as “the bottom of the red block is as high as the top of the blue block”; the agent simply flipped the red block upside down instead of stacking it.
  • Problem: when specifying what we want, we miss some important aspect (toy sketch below).
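Below is a minimal toy sketch (my own illustration, not from the talk or the underlying paper; the block size and the stacking_reward function are made up) showing how a reward that only checks heights cannot tell stacking apart from flipping the red block upside down.

# Toy example of a mis-specified reward for "stack the red block on the blue block".
# The reward only checks "bottom of the red block is at least as high as the top
# of the blue block", so it cannot distinguish stacking from flipping the block.

BLOCK_HEIGHT = 0.05  # metres; both blocks are assumed to be this tall


def stacking_reward(red_bottom_z: float, blue_top_z: float) -> float:
    """Intended: reward stacking. Actual: rewards any pose where the red block's
    lowest face is at least as high as the blue block's highest face."""
    return 1.0 if red_bottom_z >= blue_top_z else 0.0


# Intended behaviour: red block resting on top of the blue block.
print(stacking_reward(red_bottom_z=BLOCK_HEIGHT, blue_top_z=BLOCK_HEIGHT))  # 1.0

# Specification gaming: red block flipped upside down on the floor next to the
# blue block; its "bottom" face now points up at the same height, so the reward
# sees exactly the same numbers and also returns 1.0 without any stacking.
print(stacking_reward(red_bottom_z=BLOCK_HEIGHT, blue_top_z=BLOCK_HEIGHT))  # 1.0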


15 of 45

Reinforcement Learning from Human Feedback (RLHF)

  • A reward for “do a backflip” is hard to specify by hand
  • Instead, the model is updated according to human feedback on its outputs
  • Also used in state-of-the-art language models
  • Probably doesn’t scale to superhuman AI (we can then no longer evaluate the outputs)
  • Models trained this way sometimes still make up facts
  • Potentially incentivizes dishonesty
  • (toy sketch of the training loop below)


Read more: blogpost, paper (a collaboration between OpenAI and DeepMind)
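A highly simplified, self-contained sketch of the RLHF idea (my own illustration, not the implementation from the paper): learn a reward model from pairwise human preferences via a Bradley-Terry loss, then optimize the policy against the learned reward. Real systems use neural networks and an RL algorithm such as PPO with a KL penalty; here everything is reduced to toy numbers.

# Minimal sketch of the RLHF idea with toy numbers and no real neural nets:
# 1) learn a reward model from pairwise human preferences (Bradley-Terry loss),
# 2) then optimize the policy against the learned reward instead of a hand-written one.
import math
import random

random.seed(0)

# Toy "outputs" are just numbers; the (hidden) human preference is "bigger is better".
def human_prefers(a: float, b: float) -> bool:
    return a > b

# Reward model: r(x) = w * x, with w learned from preference comparisons.
w = 0.0
LEARNING_RATE = 0.1

for _ in range(1000):
    a, b = random.random(), random.random()
    preferred, other = (a, b) if human_prefers(a, b) else (b, a)
    # Bradley-Terry: P(preferred beats other) = sigmoid(r(preferred) - r(other))
    p = 1.0 / (1.0 + math.exp(-(w * preferred - w * other)))
    # Gradient ascent on the log-likelihood of the observed human preference.
    w += LEARNING_RATE * (1.0 - p) * (preferred - other)

# "Policy improvement" (best-of-n): pick the candidate the reward model scores highest.
candidates = [random.random() for _ in range(5)]
best = max(candidates, key=lambda x: w * x)
print(f"learned reward weight w={w:.2f}, highest-scoring candidate={best:.2f}")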

16 of 45

[Alignment decomposition diagram repeated, highlighting Inner Alignment: does “what the AI wants” match “what the AI is trained on”?]

17 of 45

Goal Misgeneralization

  • In training: the coin is always at the right end of the level
  • In deployment: the coin is placed at a random position
  • The capability generalizes (the agent still avoids obstacles etc.)
  • The goal does not generalize (the agent goes to the right instead of to the coin)
    • The true goal is underdetermined by the training-time reward (“go to the right” and “get the coin” are both compatible with it)
  • → Correct reward specification is not enough!
  • Train in more diverse environments? (toy sketch below)
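A toy sketch of why the training distribution cannot pin down the goal (my own illustration, not the actual CoinRun environment from the paper): the two policies “go right” and “get the coin” behave identically as long as the coin is always at the right end, and only come apart once the coin position is randomized.

# Toy sketch: two policies that look identical in training (coin always at the
# rightmost position) but differ once the coin position is randomized at deployment.
import random

random.seed(0)
LEVEL_LENGTH = 10

def make_level(training: bool) -> int:
    """Return the coin position. In training it is always at the right end."""
    return LEVEL_LENGTH - 1 if training else random.randrange(LEVEL_LENGTH)

def policy_go_right(coin_pos: int) -> int:
    return LEVEL_LENGTH - 1          # learned proxy goal: "go to the right end"

def policy_get_coin(coin_pos: int) -> int:
    return coin_pos                  # intended goal: "go to the coin"

def success_rate(policy, training: bool, episodes: int = 1000) -> float:
    wins = 0
    for _ in range(episodes):
        coin = make_level(training)
        wins += policy(coin) == coin
    return wins / episodes

for name, policy in [("go right", policy_go_right), ("get coin", policy_get_coin)]:
    print(name,
          "train:", success_rate(policy, training=True),
          "deploy:", success_rate(policy, training=False))
# Both policies score 1.0 in training; only "get coin" keeps scoring 1.0 at deployment.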


18 of 45

Adversarial Training

  • Search for examples of misgeneralization
  • Train on these examples (generic loop sketched below)
    • E.g. putting the coin in a random position during training
  • Example: the Adversarial Training for High-Stakes Reliability paper (Redwood Research)
    • Use adversarial training to make an LLM never mention injury
    • Some prompts still make it mention injury
  • Worry for more capable systems: Deceptive Alignment
    • The AI realizes it is in training and acts as if it were aligned, in order to avoid being updated
    • In that case, adversarial training may not work
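A generic sketch of the adversarial-training loop (my own illustration; the function names are placeholders and this is not Redwood Research’s pipeline): alternate between searching for inputs where the model fails and training on everything found so far.

# Skeleton of adversarial training with placeholder callables standing in for
# "your model", "your failure search / red-teaming step" and "your training step".
import random
from typing import Callable, List


def adversarial_training(model,
                         train_step: Callable,
                         search_for_failures: Callable,
                         rounds: int = 10):
    dataset: List[float] = []
    for _ in range(rounds):
        failures = search_for_failures(model)   # attack / red-teaming step
        if not failures:                        # no failures found at this search budget
            break
        dataset.extend(failures)
        model = train_step(model, dataset)      # train on all collected failure cases
    return model


# Toy usage: the "model" is just a decision threshold that should equal TARGET.
random.seed(0)
TARGET = 0.7

def search(threshold: float) -> List[float]:
    points = [random.random() for _ in range(200)]
    return [x for x in points if (x > threshold) != (x > TARGET)][:5]

def step(threshold: float, failure_data: List[float]) -> float:
    # Move the threshold halfway toward the mean of the collected failure cases.
    return threshold + 0.5 * (sum(failure_data) / len(failure_data) - threshold)

print(f"threshold after adversarial training: {adversarial_training(0.0, step, search):.2f}")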


19 of 45

[Alignment decomposition diagram repeated, highlighting Intent Alignment: does “what the AI wants” match “what we want”?]

In many approaches it doesn’t really make sense to distinguish inner/outer alignment.

20 of 45

Scalable Oversight

  • Idea: humans supervise AI systems as they gradually become more capable
  • Use AI assistance to help with the supervision
  • AI Safety via Debate: let two AIs debate each other (protocol sketched below)
    • I.e. they make pro and contra arguments for a statement
    • Useful if the statement is hard to judge directly
    • Humans can look at the arguments to figure out whether the statement is correct
  • AI-written critiques help humans notice flaws in AI-written summaries
  • Externalized Reasoning Oversight: make LLMs “think out loud”
  • Interpretability
    • If we understand systems really well, they are easier to supervise
  • How to scale this to very superhuman systems? → open question
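A minimal sketch of the shape of the debate protocol (my own illustration; ask_model is a hypothetical stand-in for querying a language model, and real debate setups differ in many details):

# Minimal sketch of AI Safety via Debate. ask_model() is a hypothetical function
# that sends a prompt to some language model and returns text.
from typing import Callable, List, Tuple

def debate(ask_model: Callable[[str], str],
           statement: str,
           n_rounds: int = 3) -> Tuple[str, List[str]]:
    """Two copies of the model argue for and against `statement`;
    a judge (human, or here a model) then sees the whole transcript."""
    transcript: List[str] = [f"Statement under debate: {statement}"]
    for round_idx in range(n_rounds):
        for side in ("PRO", "CON"):
            prompt = (
                f"You argue the {side} side of: {statement}\n"
                "Debate so far:\n" + "\n".join(transcript) +
                "\nGive your strongest next argument in one paragraph."
            )
            transcript.append(f"[round {round_idx + 1}, {side}] " + ask_model(prompt))
    # In the original proposal a human judges; here we ask the model to play judge.
    verdict = ask_model(
        "Based only on this debate transcript, is the statement true? "
        "Answer 'true' or 'false' with a one-sentence reason.\n" + "\n".join(transcript)
    )
    return verdict, transcript

# Toy usage with a stub "model" so the sketch actually runs:
if __name__ == "__main__":
    verdict, transcript = debate(lambda prompt: "stub argument/verdict", "2 + 2 = 4")
    print(verdict)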


21 of 45

Shard Theory

  • Reinforcement Learning agents aren’t actually optimizing the reward
  • Nor are humans
  • Instead, we have context-dependent “shards of value”
    • E.g. valuing food, or helping the person in front of you, …
  • The agenda: understand this better in AIs
  • Hope: this helps us directly make the AI want what we want
  • It also informs how much to worry about e.g. power-seeking behavior


Read more on the Alignment Forum: the general idea; empirical research on Reinforcement Learning and Language Models

22 of 45

[Alignment decomposition diagram repeated, highlighting Capability Robustness: does “what the AI does” match “what the AI wants”?]

23 of 45

Adversarial Attack Robustness

  • Adversarial attacks: deep learning systems fail in weird ways on slightly perturbed inputs
  • Making them robust → open problem (toy FGSM sketch below)


Some image classifiers don’t recognize this as a stop sign. (source)
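A toy FGSM-style example (my own illustration, using a tiny hand-made linear classifier instead of a real image model, and an unrealistically large epsilon): nudging each input dimension in the direction of the loss gradient’s sign flips the model’s prediction.

# Fast Gradient Sign Method (FGSM) on a toy logistic-regression "classifier",
# with the gradient computed analytically so no ML library is needed.
import math

def predict(weights, x):
    """P(class = 1) for a linear classifier with a sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(weights, x, true_label, epsilon):
    """Perturb x by epsilon in the direction that increases the loss for the
    true label (i.e. the sign of the input gradient of the cross-entropy)."""
    p = predict(weights, x)
    # d(cross-entropy)/d(x_i) = (p - true_label) * w_i
    grad = [(p - true_label) * w for w in weights]
    return [xi + epsilon * (1 if g > 0 else -1 if g < 0 else 0)
            for xi, g in zip(x, grad)]

weights = [1.5, -2.0, 0.5, 1.0]       # a toy "trained" classifier
x = [0.6, 0.1, 0.4, 0.7]              # a toy input, correctly classified as class 1
x_adv = fgsm(weights, x, true_label=1, epsilon=0.4)

print("clean prediction:      ", round(predict(weights, x), 3))      # above 0.5
print("adversarial prediction:", round(predict(weights, x_adv), 3))  # pushed below 0.5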

24 of 45

[Overview diagram repeated, now highlighting Interpretability: understanding current systems better]

25 of 45

Interpretability

  • Figure out what current systems are actually doing
  • Needed as a component in many alignment proposals
  • Hands-on & practical, with quick feedback loops
  • There are worries about dual use and about publishing; see discussion: 1, 2, 3, 4
  • See also the talk by Callum McDougall (tomorrow, 11 am)


Feature Visualization: visualizing what inputs a given neuron is activated by (toy sketch below)
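A dependency-free toy sketch of the activation-maximization idea behind feature visualization (my own illustration; real feature visualization optimizes an image against a trained network and needs extra regularization): do gradient ascent on the input so that one chosen neuron activates strongly.

# Toy activation maximization: the "neuron" is a dot product, so the optimized
# input converges to the neuron's weight pattern, i.e. the feature it detects.
import random

random.seed(0)

neuron_weights = [0.0, 1.0, -1.0, 0.5]   # the feature this toy neuron responds to

def activation(x):
    return sum(w * xi for w, xi in zip(neuron_weights, x))

x = [random.uniform(-1, 1) for _ in neuron_weights]   # start from random noise
LEARNING_RATE = 0.1

for _ in range(100):
    # d(activation)/d(x_i) = w_i ; ascend, then rescale to keep the input bounded.
    x = [xi + LEARNING_RATE * w for xi, w in zip(x, neuron_weights)]
    norm = max(sum(xi * xi for xi in x) ** 0.5, 1e-8)
    x = [xi / norm for xi in x]

print("optimized input :", [round(xi, 2) for xi in x])
print("neuron direction:", neuron_weights, "(the same direction, up to scale)")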

26 of 45

Mini-Reflection (1 minute)

Anything confusing? → Tell your neighbor!

If not, anything you found interesting?


27 of 45

[Overview diagram repeated, now highlighting Evaluations: show current capabilities/misalignment]

28 of 45

Model Evaluations

  • Show capabilities and/or misalignment of current models
  • Informs decisions in policy and research (minimal harness sketched below)


Read more: ARC evals report
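A sketch of what a tiny evaluation harness can look like (my own illustration; ask_model is a hypothetical stand-in for the model under test, and the two tasks are made up rather than taken from ARC’s evaluations):

# Tiny evaluation harness: run a model on task prompts, score the answers,
# and report an aggregate score.
from typing import Callable, List, Tuple

def run_eval(ask_model: Callable[[str], str],
             tasks: List[Tuple[str, Callable[[str], bool]]]) -> float:
    """Each task is (prompt, check), where check() decides whether the answer
    counts as demonstrating the capability (or misaligned behaviour) being tested."""
    passed = 0
    for prompt, check in tasks:
        answer = ask_model(prompt)
        passed += bool(check(answer))
    return passed / len(tasks)

tasks = [
    ("What is 17 * 23?", lambda a: "391" in a),
    ("Name the capital of Australia.", lambda a: "canberra" in a.lower()),
]

# Toy usage with a stub model so the sketch runs end to end:
print("score:", run_eval(lambda prompt: "Canberra, and 17*23 = 391", tasks))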

29 of 45

[Overview diagram repeated, now highlighting Governance: e.g. standards and regulation]

30 of 45

Technical Work in AI Governance

  • Regulate compute to prevent (or slow down) the development of dangerous capabilities
  • Enforcing such regulations is itself a technical problem
  • Example: the paper “What does it take to catch a Chinchilla?”
    • Have chips log snapshots of the model during training
    • A protocol to verify that only certain types of training happened (toy sketch of the logging idea below)
  • Probably particularly interesting for people with a background in IT security/cryptography
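A toy sketch of the “log snapshots during training” direction (my own illustration of the general idea, not the protocol from the paper): commit periodic weight snapshots into a hash chain so that an auditor can later check whether a claimed training record matches what the hardware logged.

# Hash-chained training log: each snapshot commitment depends on the previous
# one, so an auditor can detect a tampered or rewritten training record.
import hashlib
from typing import Iterable, List

def snapshot_hash(weights: Iterable[float], previous: str) -> str:
    h = hashlib.sha256()
    h.update(previous.encode())
    h.update(",".join(f"{w:.6f}" for w in weights).encode())
    return h.hexdigest()

def build_log(snapshots: List[List[float]]) -> List[str]:
    chain, prev = [], "genesis"
    for weights in snapshots:
        prev = snapshot_hash(weights, prev)
        chain.append(prev)
    return chain

def verify_log(snapshots: List[List[float]], chain: List[str]) -> bool:
    return build_log(snapshots) == chain

# Toy "training run": three weight snapshots logged by the hardware.
snapshots = [[0.0, 0.0], [0.1, -0.2], [0.15, -0.25]]
chain = build_log(snapshots)
print(verify_log(snapshots, chain))    # True: the record matches the log
snapshots[1] = [9.9, 9.9]              # tamper with the claimed record
print(verify_log(snapshots, chain))    # False: tampering is detected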


31 of 45

[Overview diagram repeated, now highlighting Automating Alignment: make AI do AI safety research]

32 of 45

Automating Alignment

  • OpenAI’s main plan
  • Make AI do alignment research
  • How to make sure the “artificial researchers” are aligned? → open problem
  • Example: automating interpretability (sketch below)
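A sketch of the automated-interpretability idea (my own illustration, loosely in the spirit of letting a model explain neurons; ask_model is a hypothetical LLM call, and a real pipeline would also score each explanation by how well it predicts held-out activations):

# Ask one model to explain what a neuron responds to, given its top-activating
# text snippets. ask_model() is a hypothetical stand-in for an LLM API call.
from typing import Callable, List, Tuple

def explain_neuron(ask_model: Callable[[str], str],
                   top_examples: List[Tuple[str, float]]) -> str:
    """top_examples: (text snippet, activation) pairs for one neuron."""
    listing = "\n".join(f"activation {act:.2f}: {text}" for text, act in top_examples)
    prompt = (
        "Here are text snippets and how strongly one neuron activated on them:\n"
        f"{listing}\n"
        "In one sentence, what feature of the text does this neuron respond to?"
    )
    return ask_model(prompt)

top_examples = [
    ("The recipe needs two cups of flour.", 0.91),
    ("She baked a loaf of sourdough bread.", 0.88),
    ("The stock market fell sharply today.", 0.03),
]
# Toy usage with a stub model so the sketch runs:
print(explain_neuron(lambda p: "Mentions of baking and bread.", top_examples))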


33 of 45

[Overview diagram repeated, now highlighting Deconfusion: find better ways to think about AI safety]

34 of 45

Natural Abstractions

Plan: Understanding Abstractions helps find robust mathematical formalizations of

  • optimization/goals (useful for many approaches)
  • human values (because we care about abstract things)


[Illustration: “water bottle” and “table” are natural abstractions; “the water bottle plus the table’s right segment” and “the table’s left segment” are unnatural abstractions]

Research agenda by John Wentworth (independent researcher)

Read more: Alignment Forum post

35 of 45

[Overview diagram repeated, now highlighting Threat modeling: figure out how AI might become dangerous]

36 of 45

Threat modeling

  • Develop detailed models of how AI might become dangerous
  • Informs where to direct research effort
  • Example: What Multipolar Failure Looks Like


Read more: Literature Review of different Threat Models

37 of 45

Disagreements


38 of 45

Disagreements

  • How will AI capabilities develop?
    • Long vs short timelines
    • Fast vs slow takeoff, how many warning shots
    • Predicting this is its own subfield, e.g. Biological Anchors Report (summary)
  • How difficult is alignment?
    • Maybe we get “alignment by default”
    • Maybe it’s literally impossible
    • Feasibility of different approaches (e.g. automating alignment)
  • Safety research may itself advance capabilities
    • RLHF (e.g. ChatGPT getting people excited about AI)
    • Interpretability
    • How much to publish vs keep results in smaller circles


39 of 45

Mini-Reflection (1 minute)

Anything confusing? → Tell your neighbor!

If not, anything you found interesting?


40 of 45

Summary & How to Get Started


41 of 45

Summary

  • Alignment: make AI robustly do what we want
  • Interpretability: understand current systems better
  • Deconfusion: find better ways to think about AI safety
  • Automating Alignment: make AI do AI safety research
  • Governance: e.g. standards and regulation
  • Meta research:
    • Threat modeling: figure out how AI might become dangerous
    • Evaluations: show current capabilities/misalignment

42 of 45

My Work

  • Alignment: evaluating the QACI research agenda (currently, LTFF grant)
  • Deconfusion: Finite Factored Sets in Pictures, deconfusion about agents and abstractions (LTFF grant)
  • Interpretability: Neural Network Theory for Interpretability (SERI MATS research program)
  • Meta/Field Building: AI Safety Europe Retreat

43 of 45

Getting Started - My favorite resources


44 of 45

Q&A


Is there a specific research direction you want to talk more about? → Fill in the intro form and I can maybe make an introduction.

Slides

Intro Form

I’m looking for a job starting ~December! (research and/or organizing)

If you might want to work with me, please reach out!

magdalena.wache@mailbox.org

45 of 45

Thank You!


More Questions? → Office Hours: 17:30, Kepler

magdalena.wache@mailbox.org