The Technical AI Safety Research Landscape
Magdalena Wache
1
Get these slides
(for hyperlinks):
background image from aisafety.world
Outline
2
About You
Which of you…
3
About this Talk
4
Links
5
Slides
Intro Form
(I will also show them in the end again)
Overview: Research & Actors
6
background image from aisafety.world
Technical AI Safety Approaches
7
Alignment: make AI robustly do what we want
(clouds because there are no clear distinctions)
Deconfusion: Find better ways to think about AI safety
General goal: ensure AI systems don’t do really bad things
Interpretability: Understanding current systems better
Automating Alignment: Make AI do AI safety research
Governance: e.g. standards and regulation
meta
research:
Threat modeling: Figure out how AI might become dangerous
Evaluations: show current capabilities/ misalignment
8
Deconfusion: Find better ways to think about AI safety
Interpretability: Understanding current systems better
Automating Alignment: Make AI do AI safety research
Governance: e.g. standards and regulation
meta
research:
Evaluations: show current capabilities/ misalignment
Threat modeling: Figure out how AI might become dangerous
Alignment: make AI robustly do what we want
What do we want to align with what?
9
What we want
What the AI does
Alignment
What the AI does
What we want
What the AI wants
Intent Alignment
Capability Robustness
What the AI is trained on
What the AI wants
What the AI does
What we want
Outer Alignment
Inner Alignment
Capability Robustness
mathematical function, e.g. loss or reward
very vague, inconsistent, who are “we” exactly?
what does that even mean?
(Very simplified. Doesn't account for multipolar AI. Also a lot of alignment research doesn't neatly fit into these categories. It’s also disputed if the inner/outer distinction is productive)
Mini-Reflection (1 minute)
Anything confusing?→ Tell your neighbor!
If not, anything you found interesting?
(There will be a few of these Mini Reflections throughout this talk)
10
Actors
11
Concrete Technical Research
12
13
What the AI is trained on
What the AI wants
What the AI does
What we want
Outer Alignment
Inner Alignment
Capability Robustness
mathematical function, e.g. loss or reward
very vague, inconsistent
what does that even mean?
Specification gaming
14
Read more: DeepMind blogpost
Reinforcement Learning from Human Feedback (RLHF)
15
16
What the AI is trained on
What the AI wants
What the AI does
What we want
Outer Alignment
Inner Alignment
Capability Robustness
mathematical function, e.g. loss or reward
very vague, inconsistent
what does that even mean?
Goal Misgeneralization
17
Read more: Alignment Forum post, paper
Adversarial Training
18
19
What the AI is trained on
What the AI wants
Outer Alignment
Inner Alignment
Capability Robustness
mathematical function, e.g. loss or reward
What the AI does
Intent alignment
What we want
In many approaches it doesn’t really make sense to distinguish inner/outer alignment
Scalable Oversight
20
Shard Theory
21
Read more on the Alignment Forum:
general idea, empirical research: Reinforcement Learning, Language Models
22
What the AI is trained on
What the AI wants
What the AI does
What we want
Outer Alignment
Inner Alignment
Capability Robustness
mathematical function, e.g. loss or reward
very vague, inconsistent
what does that even mean?
Adversarial Attack Robustness
23
Some image classifiers don’t recognize this as a stop sign. (source)
24
Deconfusion: Find better ways to think about AI safety
Automating Alignment: Make AI do AI safety research
Governance: e.g. standards and regulation
meta
research:
Evaluations: show current capabilities/ misalignment
Threat modeling: Figure out how AI might become dangerous
Alignment: make AI robustly do what we want
Interpretability: Understanding current systems better
Interpretability
25
Feature Visualization: visualizing what things neurons are activated by
Mini-Reflection (1 minute)
Anything confusing?→ Tell your neighbor!
If not, anything you found interesting?
26
27
Deconfusion: Find better ways to think about AI safety
Interpretability: Understanding current systems better
Automating Alignment: Make AI do AI safety research
Governance: e.g. standards and regulation
meta
research:
Threat modeling: Figure out how AI might become dangerous
Alignment: make AI robustly do what we want
Evaluations: show current capabilities/ misalignment
Model Evaluations
28
Read more: ARC evals report
29
Deconfusion: Find better ways to think about AI safety
Interpretability: Understanding current systems better
Automating Alignment: Make AI do AI safety research
meta
research:
Threat modeling: Figure out how AI might become dangerous
Alignment: make AI robustly do what we want
Evaluations: show current capabilities/ misalignment
Governance: e.g. standards and regulation
Technical Work in AI Governance
30
31
Deconfusion: Find better ways to think about AI safety
Interpretability: Understanding current systems better
meta
research:
Threat modeling: Figure out how AI might become dangerous
Alignment: make AI robustly do what we want
Evaluations: show current capabilities/ misalignment
Governance: e.g. standards and regulation
Automating Alignment: Make AI do AI safety research
Automating Alignment
32
33
Interpretability: Understanding current systems better
meta
research:
Threat modeling: Figure out how AI might become dangerous
Alignment: make AI robustly do what we want
Evaluations: show current capabilities/ misalignment
Governance: e.g. standards and regulation
Automating Alignment: Make AI do AI safety research
Deconfusion: Find better ways to think about AI safety
Natural Abstractions
Plan: Understanding Abstractions helps find robust mathematical formalizations of
34
Water bottle
Table
Water bottle + table’s right segment
Table’s left segment
Natural Abstractions
Unnatural Abstractions
Research agenda by John Wentworth (independent researcher)
Read more: Alignment Forum post
35
Interpretability: Understanding current systems better
meta
research:
Alignment: make AI robustly do what we want
Evaluations: show current capabilities/ misalignment
Governance: e.g. standards and regulation
Automating Alignment: Make AI do AI safety research
Deconfusion: Find better ways to think about AI safety
Threat modeling: Figure out how AI might become dangerous
Threat modeling
36
Read more: Literature Review of different Threat Models
Disagreements
37
Disagreements
38
Mini-Reflection (1 minute)
Anything confusing?→ Tell your neighbor!
If not, anything you found interesting?
39
Summary & How to Get Started
40
Summary
41
Alignment: make AI robustly do what we want
Deconfusion: Find better ways to think about AI safety
Interpretability: Understanding current systems better
Automating Alignment: Make AI do AI safety research
Governance: e.g. standards and regulation
meta
research:
Threat modeling: Figure out how AI might become dangerous
Evaluations: show current capabilities/ misalignment
My Work
42
Alignment: make AI robustly do what we want
Deconfusion: Find better ways to think about AI safety
Interpretability: Understanding current systems better
Evaluating the QACI research agenda (currently, LTFF grant)
Finite Factored Sets in Pictures: deconfusion about agents and abstractions (LTFF grant)
Neural Network Theory for Interpretability (SERI MATS research program)
Meta/Field Building: �AI Safety Europe Retreat
Getting Started - My favorite resources
43
Q&A
44
Specific research direction you want to talk more about? → fill in the intro form and I can maybe make an introduction.
Slides
Intro Form
I’m looking for a job starting ~December! (research and/or organizing)
If you might want to work with me, please reach out!
magdalena.wache@mailbox.org
Thank You!
45
More Questions? → Office Hours: 17:30, Kepler
magdalena.wache@mailbox.org