1 of 45

The Technical AI Safety Research Landscape

Magdalena Wache


Get these slides (for hyperlinks):

background image from aisafety.world

2 of 45

Outline

  • Approaches + Actors Overview (10 min)
  • Concrete Technical Research (25 min)
  • Summary + How to get started (5 min)
  • Q&A (10 min)


3 of 45

About You

Which of you…

  • … think AI safety is one of the world’s most important problems?
  • … are working on AI safety research?
  • … are planning to work on AI safety research?
  • … are considering working on AI safety research but still unsure?


4 of 45

About this Talk

  • Goal
    • A rough idea of what kinds of research exist
    • A structure for thinking about technical AIS research
      • “Technical” ≈ “to do with computer science, math or similar”
      • (non-technical research is not covered)
  • Disclaimers
    • This is how I think about AI safety, not necessarily how other people think about it.
    • I myself will probably think about this very differently 1 year from now.
    • Not “the canonical view” (there is no canonical view)
    • It’s simplified.
    • It’s (very!) incomplete.
  • Want to talk more about specific research? → Fill in this form:
    • I can probably introduce you to relevant people.
    • Err on the side of filling this in! Talking to people can be really valuable!


5 of 45

Links


Slides

Intro Form

(I will show them again at the end)

6 of 45

Overview: Research & Actors


background image from aisafety.world

7 of 45

Technical AI Safety Approaches


General goal: ensure AI systems don’t do really bad things

  • Alignment: make AI robustly do what we want
  • Interpretability: understand current systems better
  • Deconfusion: find better ways to think about AI safety
  • Automating Alignment: make AI do AI safety research
  • Governance: e.g. standards and regulation
  • Meta research:
    • Threat modeling: figure out how AI might become dangerous
    • Evaluations: show current capabilities/misalignment

(drawn as clouds because there are no clear distinctions between these approaches)

8 of 45

[Overview diagram repeated, now highlighting Alignment: make AI robustly do what we want]

9 of 45

What do we want to align with what?

  • Alignment: “what the AI does” matches “what we want”
  • Decomposed into:
    • Intent Alignment: “what the AI wants” matches “what we want”
    • Capability Robustness: “what the AI does” matches “what the AI wants”
  • Intent Alignment decomposed further:
    • Outer Alignment: “what the AI is trained on” matches “what we want”
    • Inner Alignment: “what the AI wants” matches “what the AI is trained on”
  • Notes on the terms:
    • “What the AI is trained on”: a mathematical function, e.g. a loss or reward
    • “What we want”: very vague, inconsistent, and who are “we” exactly?
    • “What the AI wants”: what does that even mean?

(Very simplified. Doesn't account for multipolar AI. Also a lot of alignment research doesn't neatly fit into these categories. It’s also disputed if the inner/outer distinction is productive)

10 of 45

Mini-Reflection (1 minute)

Anything confusing? → Tell your neighbor!

If not, anything you found interesting?

(There will be a few of these Mini Reflections throughout this talk)


11 of 45

Actors


12 of 45

Concrete Technical Research


13 of 45

[Alignment decomposition diagram repeated, highlighting Outer Alignment: does “what the AI is trained on” match “what we want”?]

14 of 45

Specification gaming

  • “Stack the Lego blocks” was specified as “the bottom of the red block is as high as the top of the blue block”; the agent simply flipped the red block upside down instead of stacking it.
  • Problem: when specifying what we want, we miss some important aspect (toy sketch below).
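Below is a minimal toy sketch (my own illustration, not from the talk or the underlying paper; the block size and the stacking_reward function are made up) showing how a reward that only checks heights cannot tell stacking apart from flipping the red block upside down.

# Toy example of a mis-specified reward for "stack the red block on the blue block".
# The reward only checks "bottom of the red block is at least as high as the top
# of the blue block", so it cannot distinguish stacking from flipping the block.

BLOCK_HEIGHT = 0.05  # metres; both blocks are assumed to be this tall


def stacking_reward(red_bottom_z: float, blue_top_z: float) -> float:
    """Intended: reward stacking. Actual: rewards any pose where the red block's
    lowest face is at least as high as the blue block's highest face."""
    return 1.0 if red_bottom_z >= blue_top_z else 0.0


# Intended behaviour: red block resting on top of the blue block.
print(stacking_reward(red_bottom_z=BLOCK_HEIGHT, blue_top_z=BLOCK_HEIGHT))  # 1.0

# Specification gaming: red block flipped upside down on the floor next to the
# blue block; its "bottom" face now points up at the same height, so the reward
# sees exactly the same numbers and also returns 1.0 without any stacking.
print(stacking_reward(red_bottom_z=BLOCK_HEIGHT, blue_top_z=BLOCK_HEIGHT))  # 1.0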


15 of 45

Reinforcement Learning from Human Feedback (RLHF)

  • A reward for “do a backflip” is hard to specify by hand
  • Instead, the model is updated according to human feedback on its outputs
  • Also used in state-of-the-art language models
  • Probably doesn’t scale to superhuman AI (we can then no longer evaluate the outputs)
  • Models trained this way sometimes still make up facts
  • Potentially incentivizes dishonesty
  • (toy sketch of the training loop below)


Read more: blogpost, paper (a collaboration between OpenAI and DeepMind)
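A highly simplified, self-contained sketch of the RLHF idea (my own illustration, not the implementation from the paper): learn a reward model from pairwise human preferences via a Bradley-Terry loss, then optimize the policy against the learned reward. Real systems use neural networks and an RL algorithm such as PPO with a KL penalty; here everything is reduced to toy numbers.

# Minimal sketch of the RLHF idea with toy numbers and no real neural nets:
# 1) learn a reward model from pairwise human preferences (Bradley-Terry loss),
# 2) then optimize the policy against the learned reward instead of a hand-written one.
import math
import random

random.seed(0)

# Toy "outputs" are just numbers; the (hidden) human preference is "bigger is better".
def human_prefers(a: float, b: float) -> bool:
    return a > b

# Reward model: r(x) = w * x, with w learned from preference comparisons.
w = 0.0
LEARNING_RATE = 0.1

for _ in range(1000):
    a, b = random.random(), random.random()
    preferred, other = (a, b) if human_prefers(a, b) else (b, a)
    # Bradley-Terry: P(preferred beats other) = sigmoid(r(preferred) - r(other))
    p = 1.0 / (1.0 + math.exp(-(w * preferred - w * other)))
    # Gradient ascent on the log-likelihood of the observed human preference.
    w += LEARNING_RATE * (1.0 - p) * (preferred - other)

# "Policy improvement" (best-of-n): pick the candidate the reward model scores highest.
candidates = [random.random() for _ in range(5)]
best = max(candidates, key=lambda x: w * x)
print(f"learned reward weight w={w:.2f}, highest-scoring candidate={best:.2f}")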

16 of 45

[Alignment decomposition diagram repeated, highlighting Inner Alignment: does “what the AI wants” match “what the AI is trained on”?]

17 of 45

Goal Misgeneralization

  • In training: the coin is always at the right end of the level
  • In deployment: the coin is placed at a random position
  • The capability generalizes (the agent still avoids obstacles etc.)
  • The goal does not generalize (the agent goes to the right instead of to the coin)
    • The true goal is underdetermined by the training-time reward (“go to the right” and “get the coin” are both compatible with it)
  • → Correct reward specification is not enough!
  • Train in more diverse environments? (toy sketch below)
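A toy sketch of why the training distribution cannot pin down the goal (my own illustration, not the actual CoinRun environment from the paper): the two policies “go right” and “get the coin” behave identically as long as the coin is always at the right end, and only come apart once the coin position is randomized.

# Toy sketch: two policies that look identical in training (coin always at the
# rightmost position) but differ once the coin position is randomized at deployment.
import random

random.seed(0)
LEVEL_LENGTH = 10

def make_level(training: bool) -> int:
    """Return the coin position. In training it is always at the right end."""
    return LEVEL_LENGTH - 1 if training else random.randrange(LEVEL_LENGTH)

def policy_go_right(coin_pos: int) -> int:
    return LEVEL_LENGTH - 1          # learned proxy goal: "go to the right end"

def policy_get_coin(coin_pos: int) -> int:
    return coin_pos                  # intended goal: "go to the coin"

def success_rate(policy, training: bool, episodes: int = 1000) -> float:
    wins = 0
    for _ in range(episodes):
        coin = make_level(training)
        wins += policy(coin) == coin
    return wins / episodes

for name, policy in [("go right", policy_go_right), ("get coin", policy_get_coin)]:
    print(name,
          "train:", success_rate(policy, training=True),
          "deploy:", success_rate(policy, training=False))
# Both policies score 1.0 in training; only "get coin" keeps scoring 1.0 at deployment.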


18 of 45

Adversarial Training

  • Search for examples of misgeneralization
  • Train on these examples (generic loop sketched below)
    • E.g. putting the coin in a random position during training
  • Example: the Adversarial Training for High-Stakes Reliability paper (Redwood Research)
    • Use adversarial training to make an LLM never mention injury
    • Some prompts still make it mention injury
  • Worry for more capable systems: Deceptive Alignment
    • The AI realizes it is in training and acts as if it were aligned, in order to avoid being updated
    • In that case, adversarial training may not work
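A generic sketch of the adversarial-training loop (my own illustration; the function names are placeholders and this is not Redwood Research’s pipeline): alternate between searching for inputs where the model fails and training on everything found so far.

# Skeleton of adversarial training with placeholder callables standing in for
# "your model", "your failure search / red-teaming step" and "your training step".
import random
from typing import Callable, List


def adversarial_training(model,
                         train_step: Callable,
                         search_for_failures: Callable,
                         rounds: int = 10):
    dataset: List[float] = []
    for _ in range(rounds):
        failures = search_for_failures(model)   # attack / red-teaming step
        if not failures:                        # no failures found at this search budget
            break
        dataset.extend(failures)
        model = train_step(model, dataset)      # train on all collected failure cases
    return model


# Toy usage: the "model" is just a decision threshold that should equal TARGET.
random.seed(0)
TARGET = 0.7

def search(threshold: float) -> List[float]:
    points = [random.random() for _ in range(200)]
    return [x for x in points if (x > threshold) != (x > TARGET)][:5]

def step(threshold: float, failure_data: List[float]) -> float:
    # Move the threshold halfway toward the mean of the collected failure cases.
    return threshold + 0.5 * (sum(failure_data) / len(failure_data) - threshold)

print(f"threshold after adversarial training: {adversarial_training(0.0, step, search):.2f}")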


19 of 45

[Alignment decomposition diagram repeated, highlighting Intent Alignment: does “what the AI wants” match “what we want”?]

In many approaches it doesn’t really make sense to distinguish inner/outer alignment.

20 of 45

Scalable Oversight

  • Idea: humans supervise AI systems as they gradually become more capable
  • Use AI assistance to help with the supervision
  • AI Safety via Debate: let two AIs debate each other (protocol sketched below)
    • I.e. they make pro and contra arguments for a statement
    • Useful if the statement is hard to judge directly
    • Humans can look at the arguments to figure out whether the statement is correct
  • AI-written critiques help humans notice flaws in AI-written summaries
  • Externalized Reasoning Oversight: make LLMs “think out loud”
  • Interpretability
    • If we understand systems really well, they are easier to supervise
  • How to scale this to very superhuman systems? → open question
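A minimal sketch of the shape of the debate protocol (my own illustration; ask_model is a hypothetical stand-in for querying a language model, and real debate setups differ in many details):

# Minimal sketch of AI Safety via Debate. ask_model() is a hypothetical function
# that sends a prompt to some language model and returns text.
from typing import Callable, List, Tuple

def debate(ask_model: Callable[[str], str],
           statement: str,
           n_rounds: int = 3) -> Tuple[str, List[str]]:
    """Two copies of the model argue for and against `statement`;
    a judge (human, or here a model) then sees the whole transcript."""
    transcript: List[str] = [f"Statement under debate: {statement}"]
    for round_idx in range(n_rounds):
        for side in ("PRO", "CON"):
            prompt = (
                f"You argue the {side} side of: {statement}\n"
                "Debate so far:\n" + "\n".join(transcript) +
                "\nGive your strongest next argument in one paragraph."
            )
            transcript.append(f"[round {round_idx + 1}, {side}] " + ask_model(prompt))
    # In the original proposal a human judges; here we ask the model to play judge.
    verdict = ask_model(
        "Based only on this debate transcript, is the statement true? "
        "Answer 'true' or 'false' with a one-sentence reason.\n" + "\n".join(transcript)
    )
    return verdict, transcript

# Toy usage with a stub "model" so the sketch actually runs:
if __name__ == "__main__":
    verdict, transcript = debate(lambda prompt: "stub argument/verdict", "2 + 2 = 4")
    print(verdict)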


21 of 45

Shard Theory

  • Reinforcement Learning agents aren’t actually optimizing the reward
  • Nor are humans
  • Instead, we have context-dependent “shards of value”
    • E.g. valuing food, or helping the person in front of you, …
  • The agenda: understand this better in AIs
  • Hope: this helps us directly make the AI want what we want
  • It also informs how much to worry about e.g. power-seeking behavior


Read more on the Alignment Forum: the general idea; empirical research on Reinforcement Learning and Language Models

22 of 45

[Alignment decomposition diagram repeated, highlighting Capability Robustness: does “what the AI does” match “what the AI wants”?]

23 of 45

Adversarial Attack Robustness

  • Adversarial attacks: deep learning systems fail in weird ways on slightly perturbed inputs
  • Making them robust → open problem (toy FGSM sketch below)


Some image classifiers don’t recognize this as a stop sign. (source)
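A toy FGSM-style example (my own illustration, using a tiny hand-made linear classifier instead of a real image model, and an unrealistically large epsilon): nudging each input dimension in the direction of the loss gradient’s sign flips the model’s prediction.

# Fast Gradient Sign Method (FGSM) on a toy logistic-regression "classifier",
# with the gradient computed analytically so no ML library is needed.
import math

def predict(weights, x):
    """P(class = 1) for a linear classifier with a sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(weights, x, true_label, epsilon):
    """Perturb x by epsilon in the direction that increases the loss for the
    true label (i.e. the sign of the input gradient of the cross-entropy)."""
    p = predict(weights, x)
    # d(cross-entropy)/d(x_i) = (p - true_label) * w_i
    grad = [(p - true_label) * w for w in weights]
    return [xi + epsilon * (1 if g > 0 else -1 if g < 0 else 0)
            for xi, g in zip(x, grad)]

weights = [1.5, -2.0, 0.5, 1.0]       # a toy "trained" classifier
x = [0.6, 0.1, 0.4, 0.7]              # a toy input, correctly classified as class 1
x_adv = fgsm(weights, x, true_label=1, epsilon=0.4)

print("clean prediction:      ", round(predict(weights, x), 3))      # above 0.5
print("adversarial prediction:", round(predict(weights, x_adv), 3))  # pushed below 0.5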

24 of 45

[Overview diagram repeated, now highlighting Interpretability: understanding current systems better]

25 of 45

Interpretability

  • Figure out what current systems are actually doing
  • Needed as a component in many alignment proposals
  • Hands-on & practical, with quick feedback loops
  • There are worries about dual use and about publishing; see discussion: 1, 2, 3, 4
  • See also the talk by Callum McDougall (tomorrow, 11 am)


Feature Visualization: visualizing what inputs a given neuron is activated by (toy sketch below)
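A dependency-free toy sketch of the activation-maximization idea behind feature visualization (my own illustration; real feature visualization optimizes an image against a trained network and needs extra regularization): do gradient ascent on the input so that one chosen neuron activates strongly.

# Toy activation maximization: the "neuron" is a dot product, so the optimized
# input converges to the neuron's weight pattern, i.e. the feature it detects.
import random

random.seed(0)

neuron_weights = [0.0, 1.0, -1.0, 0.5]   # the feature this toy neuron responds to

def activation(x):
    return sum(w * xi for w, xi in zip(neuron_weights, x))

x = [random.uniform(-1, 1) for _ in neuron_weights]   # start from random noise
LEARNING_RATE = 0.1

for _ in range(100):
    # d(activation)/d(x_i) = w_i ; ascend, then rescale to keep the input bounded.
    x = [xi + LEARNING_RATE * w for xi, w in zip(x, neuron_weights)]
    norm = max(sum(xi * xi for xi in x) ** 0.5, 1e-8)
    x = [xi / norm for xi in x]

print("optimized input :", [round(xi, 2) for xi in x])
print("neuron direction:", neuron_weights, "(the same direction, up to scale)")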

26 of 45

Mini-Reflection (1 minute)

Anything confusing? → Tell your neighbor!

If not, anything you found interesting?


27 of 45

[Overview diagram repeated, now highlighting Evaluations: show current capabilities/misalignment]

28 of 45

Model Evaluations

  • Show capabilities and/or misalignment of current models
  • Informs decisions in policy and research (minimal harness sketched below)


Read more: ARC evals report
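A sketch of what a tiny evaluation harness can look like (my own illustration; ask_model is a hypothetical stand-in for the model under test, and the two tasks are made up rather than taken from ARC’s evaluations):

# Tiny evaluation harness: run a model on task prompts, score the answers,
# and report an aggregate score.
from typing import Callable, List, Tuple

def run_eval(ask_model: Callable[[str], str],
             tasks: List[Tuple[str, Callable[[str], bool]]]) -> float:
    """Each task is (prompt, check), where check() decides whether the answer
    counts as demonstrating the capability (or misaligned behaviour) being tested."""
    passed = 0
    for prompt, check in tasks:
        answer = ask_model(prompt)
        passed += bool(check(answer))
    return passed / len(tasks)

tasks = [
    ("What is 17 * 23?", lambda a: "391" in a),
    ("Name the capital of Australia.", lambda a: "canberra" in a.lower()),
]

# Toy usage with a stub model so the sketch runs end to end:
print("score:", run_eval(lambda prompt: "Canberra, and 17*23 = 391", tasks))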

29 of 45

[Overview diagram repeated, now highlighting Governance: e.g. standards and regulation]

30 of 45

Technical Work in AI Governance

  • Regulate compute to prevent (or slow down) the development of dangerous capabilities
  • Enforcing such regulations is itself a technical problem
  • Example: the paper “What does it take to catch a Chinchilla?”
    • Have chips log snapshots of the model during training
    • A protocol to verify that only certain types of training happened (toy sketch of the logging idea below)
  • Probably particularly interesting for people with a background in IT security/cryptography
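A toy sketch of the “log snapshots during training” direction (my own illustration of the general idea, not the protocol from the paper): commit periodic weight snapshots into a hash chain so that an auditor can later check whether a claimed training record matches what the hardware logged.

# Hash-chained training log: each snapshot commitment depends on the previous
# one, so an auditor can detect a tampered or rewritten training record.
import hashlib
from typing import Iterable, List

def snapshot_hash(weights: Iterable[float], previous: str) -> str:
    h = hashlib.sha256()
    h.update(previous.encode())
    h.update(",".join(f"{w:.6f}" for w in weights).encode())
    return h.hexdigest()

def build_log(snapshots: List[List[float]]) -> List[str]:
    chain, prev = [], "genesis"
    for weights in snapshots:
        prev = snapshot_hash(weights, prev)
        chain.append(prev)
    return chain

def verify_log(snapshots: List[List[float]], chain: List[str]) -> bool:
    return build_log(snapshots) == chain

# Toy "training run": three weight snapshots logged by the hardware.
snapshots = [[0.0, 0.0], [0.1, -0.2], [0.15, -0.25]]
chain = build_log(snapshots)
print(verify_log(snapshots, chain))    # True: the record matches the log
snapshots[1] = [9.9, 9.9]              # tamper with the claimed record
print(verify_log(snapshots, chain))    # False: tampering is detected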


31 of 45

[Overview diagram repeated, now highlighting Automating Alignment: make AI do AI safety research]

32 of 45

Automating Alignment

  • OpenAI’s main plan
  • Make AI do alignment research
  • How to make sure the “artificial researchers” are aligned? → open problem
  • Example: automating interpretability (sketch below)
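A sketch of the automated-interpretability idea (my own illustration, loosely in the spirit of letting a model explain neurons; ask_model is a hypothetical LLM call, and a real pipeline would also score each explanation by how well it predicts held-out activations):

# Ask one model to explain what a neuron responds to, given its top-activating
# text snippets. ask_model() is a hypothetical stand-in for an LLM API call.
from typing import Callable, List, Tuple

def explain_neuron(ask_model: Callable[[str], str],
                   top_examples: List[Tuple[str, float]]) -> str:
    """top_examples: (text snippet, activation) pairs for one neuron."""
    listing = "\n".join(f"activation {act:.2f}: {text}" for text, act in top_examples)
    prompt = (
        "Here are text snippets and how strongly one neuron activated on them:\n"
        f"{listing}\n"
        "In one sentence, what feature of the text does this neuron respond to?"
    )
    return ask_model(prompt)

top_examples = [
    ("The recipe needs two cups of flour.", 0.91),
    ("She baked a loaf of sourdough bread.", 0.88),
    ("The stock market fell sharply today.", 0.03),
]
# Toy usage with a stub model so the sketch runs:
print(explain_neuron(lambda p: "Mentions of baking and bread.", top_examples))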


33 of 45

[Overview diagram repeated, now highlighting Deconfusion: find better ways to think about AI safety]

34 of 45

Natural Abstractions

Plan: Understanding Abstractions helps find robust mathematical formalizations of

  • optimization/goals (useful for many approaches)
  • human values (because we care about abstract things)


[Illustration: “water bottle” and “table” are natural abstractions; “the water bottle plus the table’s right segment” and “the table’s left segment” are unnatural abstractions]

Research agenda by John Wentworth (independent researcher)

Read more: Alignment Forum post

35 of 45

[Overview diagram repeated, now highlighting Threat modeling: figure out how AI might become dangerous]

36 of 45

Threat modeling

  • Develop detailed models of how AI might become dangerous
  • Informs where to direct research effort
  • Example: What Multipolar Failure Looks Like


Read more: Literature Review of different Threat Models

37 of 45

Disagreements


38 of 45

Disagreements

  • How will AI capabilities develop?
    • Long vs short timelines
    • Fast vs slow takeoff, how many warning shots
    • Predicting this is its own subfield, e.g. Biological Anchors Report (summary)
  • How difficult is alignment?
    • Maybe we get “alignment by default”
    • Maybe it’s literally impossible
    • Feasibility of different approaches (e.g. automating alignment)
  • Safety research may itself advance capabilities
    • RLHF (e.g. ChatGPT getting people excited about AI)
    • Interpretability
    • How much to publish vs keep results in smaller circles


39 of 45

Mini-Reflection (1 minute)

Anything confusing? → Tell your neighbor!

If not, anything you found interesting?


40 of 45

Summary & How to Get Started


41 of 45

Summary

  • Alignment: make AI robustly do what we want
  • Interpretability: understand current systems better
  • Deconfusion: find better ways to think about AI safety
  • Automating Alignment: make AI do AI safety research
  • Governance: e.g. standards and regulation
  • Meta research:
    • Threat modeling: figure out how AI might become dangerous
    • Evaluations: show current capabilities/misalignment

42 of 45

My Work

  • Alignment: evaluating the QACI research agenda (currently, LTFF grant)
  • Deconfusion: Finite Factored Sets in Pictures, deconfusion about agents and abstractions (LTFF grant)
  • Interpretability: Neural Network Theory for Interpretability (SERI MATS research program)
  • Meta/Field Building: AI Safety Europe Retreat

43 of 45

Getting Started - My favorite resources


44 of 45

Q&A


Is there a specific research direction you want to talk more about? → Fill in the intro form and I can maybe make an introduction.

Slides

Intro Form

I’m looking for a job starting ~December! (research and/or organizing)

If you might want to work with me, please reach out!

magdalena.wache@mailbox.org

45 of 45

Thank You!


More Questions? → Office Hours: 17:30, Kepler

magdalena.wache@mailbox.org