Alignment components
Outer alignment
Inner alignment
Threat models
Forecasting
1. Value is fragile and hard to specify
2. Corrigibility is anti-natural
3. Pivotal processes require dangerous capabilities
4. Goals misgeneralize out of distribution
5. Instrumental convergence
6. Pivotal processes likely require incomprehensibly complex plans
7. Superintelligence can fool human supervisors
8. Superintelligence can hack software supervisors
9. Humans cannot be first-class parties to a superintelligent value handshake
10. Humanlike minds/goals are not necessarily safe
11. Someone else will deploy unsafe superintelligence first
12. A boxed AGI might exfiltrate itself via steganography or spearphishing
13. Fair, sane pivotal processes
Ready to go (develop a technique which just works)
Worst-case (techniques which would work for all models / all variations)
Gain of function research (toy demonstrations of some steps of catastrophic takeover)
Low-hanging fruit (do things which are cheap and easy to do, like finding activation directions; see the sketch after this list)
Big if true (model internals)
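
As a concrete illustration of "finding activation directions" above, here is a minimal toy sketch (my own hypothetical example, not taken from any particular paper): compute a concept direction as the difference of mean activations between two small contrastive prompt sets. The model (gpt2), layer index, and prompts are arbitrary assumptions.

```python
# Hypothetical toy sketch of "finding an activation direction": take the
# difference of mean residual-stream activations between contrastive prompts.
# Model (gpt2), layer index, and prompts are arbitrary illustrative choices.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

positive = ["I love this movie", "What a wonderful day"]  # concept present
negative = ["I hate this movie", "What a terrible day"]   # concept absent
LAYER = 6  # which hidden layer to read from (assumption)

def mean_activation(prompts):
    # Average the chosen layer's activations over tokens, then over prompts.
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

# One unit vector in activation space, pointing "towards" the concept.
direction = mean_activation(positive) - mean_activation(negative)
direction = direction / direction.norm()
```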
prevent all catastrophic misalignment
provably prevent all catastrophic misalignment
push out the frontier of how advanced our agents need to be before we get catastrophic misalignment
Slow down capabilities
Intent alignment
(ambitious value learning)
Corrigibility
Respecting autonomy, boundaries
What humans would want if not cognitively constrained (CEV)
- Governance (some of this may instead be buying time)
- Developing policy proposals and researching their impact
- Compute governance
- Chip export restrictions
- Licensing
- Mandating evals results
- Developing safety cases for models
- Direct lobbying in Washington/Westminster/Brussels/Beijing
- Communicating with the public/academia
- Scary demos
- Model organisms (also in deconfusion)
- Evals — can't do evals-based regulation without good evals
- Black box stuff (eg ARC Evals)
- White box stuff (currently non-existent cause it's hard)
- {Another taxonomy: different capabilities (deception, planning, power-seeking, situational awareness, gradient hacking)}
- Build something safe-ish by default to make banning LLMs more politically feasible
- CoEms
- Galaxy-brained math things ("ALIGNMENT RESEARCHERS HATE HIM! Solve the alignment problem with this one 'simple' trick")
- QACI
- Infrabayesianism (probably fits here, I don't fuckin know what Vanessa is talking about)
- Corrigibility (RIP)
- {Probably some other stuff that I can't be bothered to look at; I don't think any of this will work cause of the law of leaky abstractions}
- Interp (ie figuring out what a trained model does)
- Methodological advances in bottom-up interp (ie manual stuff, eg Interp In The Wild)
- Automating interp (eg ACDC)
- Semantic interp automation (ie automatically proposing hypotheses instead of just automatically verifying them; currently non-existent cause it's hard)
- High-level/concept-based interp ("I may not know what the hell the model is doing, but I won't let that stop me from *thinking* I know what the hell the model is doing"; jokes aside, I think this is actually pretty promising given timelines)
- Grounding interp in theory
- Causal abstractions
- Natural abstractions (also in deconfusion)
- Non-interp model internals ("I don't really know what's going on but fuck it, concept vector go brr")
- Activation engineering (see the steering sketch after this list)
- Understanding training better
- Science of DL
- Dev interp
- SLT
- Getting the model to learn what we want
- CHAI-esque IRL thingies
- Directly measuring and analyzing OOD
- Swiss cheese stuff ("it probably won't work but at least we get dignity points"; also, maybe alignment actually is that easy, and timelines might be short, so this might be the least bad option if everything else is too hard to do in time)
- Making RLHF-style stuff better (incl constitutional AI, scalable oversight, debate, HCH etc)
- Brain-like AGI safety (imo fits here; my friends who work on this would crucify me for putting it in this category)
- Non-interp deception prevention
- ELK
- Externalized reasoning oversight
- Detecting sycophancy in CoTs
- Red teaming, adversarial stuff
- Automating alignment research
- Superalignment
- Cyborgism
- Buying time (some of this may instead be governance)
- Building coordination and trust between AGI labs so that they can collectively pause before crossing the "threshold"
- Non-perfect governance (see section above)
- Deconfusion/gathering strategic info
- Timelines forecasting (Ajeya, Epoch etc)
- Model organisms (also in scary demos)
- Some of agent foundations (probably)
- Natural abstractions (also in interp)
- Shard theory
- Scaling laws
- RL theory, goal misgeneralization
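
For the activation engineering / "concept vector go brr" items above, here is a minimal steering sketch (again a hypothetical illustration, not any specific paper's method): add a scaled concept vector to one layer's output via a forward hook during generation. The model, layer, scale, and the random placeholder vector are all assumptions; in practice you would use a real direction like the difference-of-means vector from the earlier sketch.

```python
# Hypothetical activation-steering sketch: add a scaled "concept vector" to one
# GPT-2 block's output during generation. Model, layer, and scale are arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, ALPHA = 6, 4.0

# Placeholder steering vector; in practice, use a real concept direction
# (e.g. the difference-of-means vector from the earlier sketch).
direction = torch.randn(lm.config.hidden_size)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # returning a modified value from the hook replaces the block's output.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * direction.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = lm.transformer.h[LAYER].register_forward_hook(steer_hook)
try:
    ids = tok("The weather today is", return_tensors="pt")
    out = lm.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0]))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

The point of the hook is that nothing about the weights changes; the intervention is a runtime edit to one layer's activations, which is why this family of methods is so cheap to try.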
Solutions that require both knowing what failure looks like and how it happens
Solutions that require knowing what failure looks like
Solutions that require knowing how failure happens
Solutions that require neither
specification, robustness, and assurance
Addressing threat models
Agendas to build safe AGI
Robustly good approaches
Deconfusion
AI governance
AI ↔ lab (intent alignment)
AI ↔ its server (security)
lab ↔ nation (governance)
nation ↔ world (international governance)
- Bio
- CHAI stuff
- Shard theory
- Agent foundations
- Corrigibility
_____________________________________________________________________
Improving our understanding of the space of possible agents and possible risks
Conceptual:
Causal folks
Empirical:
Model Organisms?
Improving our understanding of what is going on with models
Improving our ability to steer new models in desirable directions
Improving our ability to coordinate on AI strategy