Possible projections
  1. Krakovna’s paradigms

Alignment components
Outer alignment
Inner alignment

Alignment enablers
Mechanistic interpretability
Understanding bad incentives
Foundations


  2. Kirchner and co’s clustering

  1. [RL] Agent alignment is concerned with the problem of aligning agentic systems, i.e. those where an AI performs actions in an environment and is typically trained via reinforcement learning.
  2. Alignment foundations research is concerned with deconfusion research, i.e. the task of establishing formal and robust conceptual foundations for current and future AI Alignment research.
  3. Tool alignment is concerned with the problem of aligning non-agentic (tool) systems, i.e. those where an AI transforms a given input into an output. The current, prototypical example of tool AIs is the "large language model".
  4. AI governance is concerned with how humanity can best navigate the transition to advanced AI systems. This includes focusing on the political, economic, military, governance, and ethical dimensions.
  5. Value alignment is concerned with understanding and extracting human preferences and designing methods that stop AI systems from acting against these preferences.


  3. Type of work

Meta

Gov        

Modelling

                Threat models

                Forecasting

Engineering

Informal theory

Formal theory



  4. Gavin’s stream of consciousness taxonomy

Control

Autoalignment

Informal theory

Formal theory

Bio

Technical policy

Meta

Gov

Threat modeling

Evals

Misc


  5. Which alignment problem does it solve? (via Davidad)

We don’t know how to determine an AGI’s goals or values

2. Corrigibility is anti-natural

5. Instrumental convergence

We don’t know what values would give good outcomes in an AGI;
we don’t know how to change them once loaded

1. Value is fragile and hard to specify

4. Goals misgeneralize out of distribution

10. Humanlike minds/goals are not necessarily safe

A smarter agent is able to do things you don’t expect in ways you can’t anticipate

7. Superintelligence can fool human supervisors

8. Superintelligence can hack software supervisors

9. Humans cannot be first-class parties to a superintelligent value handshake

12. A boxed AGI might exfiltrate itself by steganography or spearphishing

Human-human coordination is hard

11. Someone else will deploy unsafe superintelligence first

13. Fair, sane pivotal processes

Guarding against dangerous capabilities may need dangerous capabilities

3. Pivotal processes require dangerous capabilities

6. Pivotal processes likely require incomprehensibly complex plans


  6. Buck

Ready to go (develop a technique which just works)

Worst-case (techniques which would work for all models / all variations)

Gain of function research (toy demonstrations of some steps of catastrophic takeover)

Low-hanging fruit (do things which are cheap and easy to do, like finding activation directions; see the sketch after this list)

Big if true (model internals)
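
As a concrete illustration of the "low-hanging fruit" item above, here is a minimal sketch (not part of the original taxonomy) of finding an activation direction by taking the difference of mean hidden states between contrastive prompts. The model (gpt2), layer index, and prompts are illustrative placeholders, not a recommendation.

# Minimal sketch: a difference-of-means "activation direction" from contrastive prompts.
# Everything concrete here (model, layer, prompts) is an arbitrary illustrative choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM exposing hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_hidden_state(prompts, layer=6):
    """Average the layer-`layer` hidden state over tokens and prompts."""
    vecs = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, d_model]
        vecs.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

positive = ["I love this, it is wonderful.", "What a delightful day."]
negative = ["I hate this, it is awful.", "What a miserable day."]

# Candidate activation direction: normalised difference of the two means.
direction = mean_hidden_state(positive) - mean_hidden_state(negative)
direction = direction / direction.norm()
print(direction.shape)  # torch.Size([768]) for gpt2

Difference of means is only the cheapest baseline here; the point is that the resulting vector lives in the model's hidden-state dimension and can be probed or added back in later.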


 

  7. Theory of change

Prevent all catastrophic misalignment

        Provably prevent all catastrophic misalignment

Push out the frontier of how advanced our agents need to be before we get catastrophic misalignment

Slow down capabilities


  8. Target

Intent alignment

(ambitious value learning)

Corrigibility

Respecting autonomy, boundaries

What humans would want if not cognitively constrained (CEV)


  9. Farnik

How to not die from AI

- Don't build the doomsday machine

        - Governance (some of this may instead be buying time)

                - Developing policy proposals and researching their impact

                        - Compute governance

                                - Chip export restrictions

                        - Licensing

                        - Mandating evals results

                        - Developing safety cases for models

                - Direct lobbying in Washington/Westminster/Brussels/Beijing

                - Communicating with the public/academia

                - Scary demos

                        - Model organisms (also in deconfusion)

        - Evals — can't do evals-based regulation without good evals

                - Black box stuff (eg ARC Evals)

                - White box stuff (currently non-existent cause it's hard)

                - {Another taxonomy: different capabilities (deception, planning, power-seeking, situational awareness, gradient hacking)}

        - Build something safe-ish by default to make banning LLMs more politically feasible

                - CoEms

- Turn the doomsday machine into a doomsdayn't machine (ie alignment)

        - Galaxy-brained math things ("ALIGNMENT RESEARCHERS HATE HIM! Solve the alignment problem with this one 'simple' trick")

                - QACI

                - Infrabayesianism (probably fits here, I don't fuckin know what Vanessa is talking about)

                - Corrigibility (RIP)

                - {Probably some other stuff that I cba to look at; I don't think any of this will work cause of the law of leaky abstractions}

        - Interp (ie figuring out what a trained model does)

                - Methodological advances in bottom-up interp (ie manual stuff, eg Interp In The Wild)

                - Automating interp (eg ACDC)

                - Semantic interp automation (ie automatically proposing hypotheses instead of just automatically verifying them; currently non-existent cause it's hard)

                - High-level/concept-based interp ("I may not know what the hell the model is doing, but I won't let that stop me from *thinking* I know what the hell the model is doing" — jokes aside I think this is acc pretty promising given timelines)

                - Grounding interp in theory

                        - Causal abstractions

                        - Natural abstractions (also in deconfusion)

        - Non-interp model internals ("I don't really know what's going on but fuck it, concept vector go brr")

                - Activation engineering

        - Understanding training better

                - Science of DL

                - Dev interp

                - SLT

        - Getting the model to learn what we want

                - CHAI-esque IRL thingies

        - Directly measuring and analyzing OOD

        - Swiss cheese stuff ("it probably won't work but at least we get dignity points"; also maybe alignment is actually that easy, also also timelines might be short so this might be the least bad option if everything else is too hard to do in time)

                - Making RLHF-style stuff better (incl constitutional AI, scalable oversight, debate, HCH etc)

                - Brain-like AGI safety (imo fits here; my friends who work on this would crucify me for putting it in this category)

                - Non-interp deception prevention

                        - ELK

                        - Externalized reasoning oversight

                                - Detecting sycophancy in CoTs

                - Red teaming, adversarial stuff

        - Automating alignment research

                - Superalignment

                - Cyborgism

Help others working on this

- Buying time (some of this may instead be governance)

        - Building coordination and trust between AGI labs so that they can collectively pause before crossing the "threshold"

        - Non-perfect governance (see section above)

- Deconfusion/gathering strategic info

        - Timelines forecasting (Ajeya, Epoch etc)

        - Model organisms (also in scary demos)

        - Some of agent foundations (probably)

        - Natural abstractions (also in interp)

        - Shard theory

        - Scaling laws

        - RL theory, goal misgeneralization


  10. Casper

Solutions that require both knowing what failure looks like and how it happens

Solutions that require knowing what failure looks like

Solutions that require knowing how failure happens

Solutions that require neither


  11. Ortega

Specification, robustness, and assurance


  12. Neel

Addressing threat models

Agendas to build safe AGI

Robustly good approaches

Deconfusion

AI governance


  13. Secure all system boundaries

AI - lab (intent alignment)

AI - its server (security)

lab - nation (governance)

nation - world (international governance)


  14. Basic

Prosaic

Value learning

    - Bio

    - CHAI stuff

    - Shard theory

Conceptual

    - Agent foundations

    - Corrigibility

Governance


  15. Very general problems (MIRI-esque)

Make it steerable

Decide where to steer

Stop it steering itself

Slow it down

Speed us up

_____________________________________________________________________

  16. TJ’s draft taxonomy

Improving our understanding of the space of possible agents and possible risks

        Conceptual:

                Causal folks

        Empirical:

                Model Organisms?

Improving our understanding of what is going on with models

Improving our ability to direct new models in desirable directions

Improving our ability to coordinate on AI strategy