ARCHIVED. This copy of the curriculum is now obsolete, and is only left online for reference purposes. For the latest curriculum, which will be run in 2023, see here, and apply here.
Key points
The AGI Safety Fundamentals alignment curriculum provides a high-level understanding of the AI alignment problem and some of the key research directions which aim to solve it. It can be read independently or as part of a discussion-based course. The next cohort of the course will be run from February 2023; you can apply here. If reading independently, you can apply here to join our Slack community.
The curriculum was compiled, and is maintained, by Richard Ngo. The curriculum is intended to be accessible for people with a wide range of backgrounds; those who are already familiar with some readings can choose substitutes from the Further Readings section for that week. The curriculum doesn't aim to teach practical programming or machine learning skills; those who primarily want to upskill for alignment work should instead take the courses listed in the Learn More section below.
Course details
The course consists of 8 weeks of readings, plus a final project. Participants are divided into groups of 4-6 people, matched based on their prior knowledge about ML and safety. (No background machine learning knowledge is required, but participants will be expected to have some fluency in basic statistics and mathematical notation.) Each week (apart from week 0) each group and their discussion facilitator will meet for 1.5 hours to discuss the readings and exercises. Broadly speaking, the first half of the course explores the motivations and arguments underpinning the field of AGI safety, while the second half focuses on proposals for technical solutions. After week 7, participants will have several weeks to work on projects of their choice, to present at the final session.
The main focus each week will be on the core readings and one exercise of your choice from the exercises listed, for which you should allocate around 2 hours of preparation time. Most people find some concepts from the readings confusing, but that’s totally fine - resolving those uncertainties is what the discussion groups are for. Approximate times taken to read each piece in depth are listed next to them. Note that in some cases only a small section of the linked reading is assigned. In several cases, blog posts about machine learning papers are listed instead of the papers themselves; you’re only expected to read the blog posts, but for those with strong ML backgrounds reading the paper versions might be worthwhile.
If you’ve already read some of the core readings, or want to learn more about the topic, then the further readings are recommended. However, none of them are compulsory. Also, you don’t need to think about the discussion prompts in advance - they’re just for reference during the discussion session.
Curriculum outline
Week 0 (optional): Introduction to machine learning
Week 1: Artificial general intelligence
Week 2: Goals and misalignment
Week 3: Threat models and types of solutions
Week 4: Learning from humans
Week 5: Decomposing tasks for better supervision
Week 6: Interpretability
Week 7: Agent foundations, AI governance, and careers in alignment
Week 8 (four weeks later): Projects
Learn More
Full curriculum
Week 0 (optional): Introduction to machine learning
This week mainly involves learning about foundational concepts in machine learning, for those who are less familiar with them, or want to review the basics. Instead of the group discussions from most weeks, there will be a lecture and group exercises. If you’d like to learn ML in more depth, see the Learn More section at the end of this curriculum. (For those with little background ML knowledge, this week’s core readings will take roughly 30 mins longer than most other weeks, since there’s a lot to cover.)
Start with Ngo (2021), which provides a framework for thinking about machine learning, in particular the two key components of deep learning: neural networks and optimisation. For more details and intuitions about neural networks, watch 3Blue1Brown (2017a); for more details and intuitions about optimisation, watch 3Blue1Brown (2017b). OpenAI (2021) showcases one of the most impressive deep learning models so far, using it to write code for a game based on high-level language instructions. Lastly, see von Hasselt (2021) for an introduction to the field of reinforcement learning.
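If you’d like to see the two key components from Ngo (2021) - a neural network and an optimisation procedure - in one place, here is a minimal illustrative sketch (not part of the readings; all names and hyperparameters are arbitrary) of training a one-hidden-layer network by gradient descent using numpy:

```python
import numpy as np

# Toy task: learn y = sin(x) on [-3, 3] with a one-hidden-layer network.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(256, 1))
Y = np.sin(X)

# The "neural network" is just these arrays of parameters.
W1, b1 = rng.normal(0, 1.0, (1, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.1, (32, 1)), np.zeros(1)

lr = 0.01  # learning rate
for step in range(5000):
    # Forward pass: linear map, ReLU nonlinearity, linear map.
    h_pre = X @ W1 + b1
    h = np.maximum(h_pre, 0.0)
    pred = h @ W2 + b2
    loss = np.mean((pred - Y) ** 2)  # mean squared error

    # Backward pass: gradients of the loss with respect to each parameter.
    d_pred = 2 * (pred - Y) / X.shape[0]
    dW2, db2 = h.T @ d_pred, d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_hpre = d_h * (h_pre > 0)  # derivative of ReLU
    dW1, db1 = X.T @ d_hpre, d_hpre.sum(axis=0)

    # Gradient descent: nudge every parameter downhill on the loss.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"final training loss: {loss:.4f}")
```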
Core readings:
- A short introduction to machine learning (Ngo, 2021) (20 mins)
- But what is a neural network? (3Blue1Brown, 2017a) (20 mins)
- Gradient descent, how neural networks learn (3Blue1Brown, 2017b) (20 mins)
- Creating a Space Game with OpenAI Codex (OpenAI, 2021) (10 mins)
- Introduction to reinforcement learning (von Hasselt, 2021) (from 2:00 to 1:02:10, ending at the beginning of the section titled Inside the Agent: Models) (60 mins)
Further readings:
- What is backpropagation really doing? (3Blue1Brown, 2017c) (15 mins)
- This third video in the 3Blue1Brown series covers backpropagation, the technique used to implement gradient descent for neural networks.
- The spelled-out intro to neural networks and backpropagation: building micrograd (Karpathy, 2022) (150 mins)
- A lecture introducing the most foundational concepts in deep learning in a very comprehensive way, from one of the leading experts.
- Spinning up deep RL: part 1 and part 2 (OpenAI, 2018) (40 mins)
- This reading provides a more technical introduction to reinforcement learning (for more, see also the last half-hour of von Hasselt (2021)).
- Machine learning for humans (Maini and Sabri, 2017)
- Maini and Sabri provide a long but accessible introduction to machine learning.
- Collection of GPT-3 results (Sotala, 2020)
- This and the next reading cover impressive language models from OpenAI. In this one, Sotala collects many examples of sophisticated behavior from GPT-3.
- AlphaStar: mastering the real-time strategy game StarCraft II (Vinyals et al., 2019) (20 mins)
- This and the next reading cover impressive reinforcement learning agents from DeepMind. Vinyals et al. explain how DeepMind trained a deep reinforcement learning agent, AlphaStar, to play StarCraft at a very high level.
- Generally capable agents emerge from open-ended play (DeepMind, 2021) (25 mins)
- DeepMind showcases agents that can play a wide range of multiplayer games, including ones they didn’t encounter during training.
- The unreasonable effectiveness of recurrent neural networks (Karpathy, 2015) (40 mins)
- This is an accessible introduction to recurrent neural networks (one of the most widely-used neural network architectures).
- A (long) peek into reinforcement learning (Weng, 2018) (35 mins)
- Weng provides a concise yet detailed introduction to reinforcement learning.
- Machine learning glossary (Google, 2017)
- For future reference, see this glossary for explanations of unfamiliar terms.
Exercises (answers linked below):
- What are the main similarities and differences between the process of fitting a linear regression to some data, and the process of training a neural network on the same data?
- Explain why the “nonlinearity” in an artificial neuron (e.g. the sigmoid or ReLU function) is so important. What would happen if we removed all the nonlinearities in a deep neural network? (Hint: try writing out explicit equations for a neural network with only one hidden layer between the input and output layers, and see what happens if you remove the nonlinearity.)
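For the second exercise, once you’ve worked through the algebra by hand, you can check your conclusion with a short numpy sketch like the one below (purely illustrative, not an official answer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # a batch of 5 inputs with 3 features each

# Two stacked linear layers with NO nonlinearity in between.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
two_layers = (x @ W1 + b1) @ W2 + b2

# A single linear layer with combined weights and bias.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

# If this prints True, the two layers collapsed into one linear map.
print(np.allclose(two_layers, one_layer))
```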
Live session:
- Recorded lecture and lecture slides
- Group exercises and answers for all exercises
Week 1: Artificial general intelligence
Artificial general intelligence (AGI) is the key concept underpinning the course, so it’s important to deeply explore what we mean by it, and the reasons for thinking that the field of machine learning is heading towards it. The first two readings this week offer several different perspectives on how we should think about AGI. The third reading focuses on grounding these high-level arguments by reference to the behavior of existing ML systems. Steinhardt argues that, in machine learning, novel behaviors tend to emerge at larger scales in ways which are difficult to predict using standard approaches. This fits with AI pioneer Rich Sutton’s “bitter lesson” from the last seven decades of AI research: that “general methods that leverage computation are ultimately the most effective”. Compared with earlier approaches, these methods rely much less on human design, and therefore raise the possibility that we will build AGIs whose cognition we know very little about.
These first three readings don’t say much about when we should expect AGI. The most comprehensive report on the topic (summarized by Karnofsky (2021)) estimates the amount of compute required to train neural networks as large as human brains to do highly impactful tasks, and concludes that this will probably be feasible within the next four decades - although the estimate is highly uncertain.
Core readings:
- Four background claims (Soares, 2015) (15 mins) (note that claims #3 and #4 are covered in more detail in the following two weeks)
- AGI safety from first principles (Ngo, 2020) (only sections 1 and 2) (20 mins)
- The Bitter Lesson (Sutton, 2019) (5 mins)
- Future ML systems will be qualitatively different (Steinhardt, 2022) (5 mins)
- Forecasting transformative AI: the “biological anchors” method in a nutshell (Karnofsky, 2021b) (30 mins)
Further readings:
- AI: racing towards the brink (Harris and Yudkowsky, 2018) (110 mins) (audio here)
- Transcript of a podcast conversation between Sam Harris and Eliezer Yudkowsky. Probably the best standalone resource for introducing AGI risk; covers many of the topics from this week and next week.
- General intelligence (Yudkowsky, 2017) and The power of intelligence (Yudkowsky, 2007) (35 mins)
- Yudkowsky provides a high-level explanation of the concept of general intelligence, and its importance.
- Three Impacts of Machine Intelligence (Christiano, 2014)
- Understanding human intelligence through human limitations (Griffiths, 2020) (40 mins)
- Griffiths provides a framework for thinking about ways in which machine intelligence might differ from human intelligence.
- AI and compute: how much longer can computing power drive AI progress? (Lohn and Musser, 2022) (30 mins)
- This and the next two readings focus on forecasting progress in AI, via looking at trends in compute and algorithms, and surveying expert opinion.
- AI and efficiency (Hernandez and Brown, 2020) (15 mins)
- 2022 expert survey on progress in AI (Stein-Perlman, Weinstein-Raum and Grace, 2022) (15 mins)
- Predictability and surprise in large generative models (Ganguli et al., 2022)
Exercises:
- As discussed in Ngo (2020), Legg and Hutter define intelligence as “an agent’s ability to achieve goals in a wide range of environments”: a definition of intelligence in terms of the outcomes it leads to. An alternative approach is to define intelligence in terms of the cognitive skills (memory, planning, etc.) which intelligent agents use to achieve their desired outcomes. What are the key cognitive skills which should feature in such a definition of intelligence?
- A crucial feature of AGI is that it will possess cognitive skills which are useful across a range of tasks, rather than just the tasks it was trained to perform. Which cognitive skills did humans evolve because they were useful in our ancestral environments, which have remained useful in our modern environment? Which have become less useful?
- Some have argued that “no free lunch” theorems make the concept of general intelligence a meaningless one. However, we should interpret the concept of "general intelligence" not as requiring strong performance on all possible problems, but only strong performance on problems which are plausible in a universe like ours. Identify three ways in which the latter category of problems is more restrictive than the former.
Notes:
- Instead of AGI, some people use the terms “human-level AI” or “strong AI”. “Superintelligence” refers to AGI which is far beyond human-level intelligence. The opposite of general AI is called narrow AI. In his “Most important century” series, Karnofsky focuses on AI which will automate the process of scientific and technological advancement (which he gives the acronym PASTA) - this seems closely related to the concept of AGI, but without some additional connotations that the latter carries.
- Most of the content in this curriculum doesn’t depend on strong claims about when AGI will arise, so try to avoid focusing disproportionately on the reading about timelines during discussions. However, I expect that it would be useful for participants to consider which evidence would change their current expectations in either direction. Note that the forecasts produced by the biological anchors method are fairly consistent with the survey of expert opinions carried out by Grace et al. (2017).
Discussion prompts:
- Here’s yet another way of defining general intelligence: whatever mental skills humans have that allow us to build technology and civilization (in contrast to other animals). What do you think about this definition?
- One intuition for how to think about very smart AIs: imagine speeding up human intellectual development by a factor of X. If an AI could do the same quality of research as a top scientist, but 10 or 100 times faster, and with the ability to make thousands of copies, how would you use it?
- How frequently do humans build technologies where some of the details of why they work aren’t understood by anyone? What, if anything, makes AI different from other domains? Would it be very surprising if we built AGI without understanding very much about how its thinking process works?
- Consider the following argument: if aliens with far more advanced technology than ours arrived on earth, we’d expect that the outcome of that meeting would be primarily determined by what their goals were, rather than what ours are. This is analogous to AGI in the sense that AGI will be very alien, and very powerful; but it’s disanalogous because we can try to shape the goals of AGIs to make them more compatible with ours. Therefore it’s very important to ensure that we have techniques for effectively doing so. How persuasive do you find this argument?
- What are the most plausible ways for the hypothesis “we will eventually build AGIs which have transformative impacts on the world” to be false? How likely are they?
Week 2: Goals and misalignment
This week we’ll focus on how and why AGIs might develop goals that are misaligned with those of humans, in particular when they’ve been trained using machine learning. We cover three core ideas. Firstly, it’s difficult to create reward functions which specify the desired outcomes for complex tasks - the problem of reward misspecification (also known as outer misalignment). Krakovna et al. (2020) highlight the difficulty of designing correct reward functions by showcasing examples of misspecification in simple environments.
Secondly, however, it’s important to distinguish between the reward function which is used to train a reinforcement learning policy, versus the goals which that policy learns to pursue. Langosco et al. (2022) argue that even an agent trained on the “right” reward function might acquire undesirable goals - the problem of goal misgeneralization (also known as inner misalignment).
Why is it likely that agents will generalize their goals in undesirable ways? Bostrom (2014) argues that almost all goals which an AGI might develop would incentivise it to misbehave in harmful ways (e.g. by seeking power over the world), due to the phenomenon of instrumental convergence.
Finally, Ngo (2022) explores in more detail how these problems might arise and play out in a deep learning context.
Core readings:
- Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (15 mins)
- Goal misgeneralization in deep reinforcement learning (Langosco et al., 2022) (ending after section 3.3) (25 mins)
- Those with less background in reinforcement learning can skip the parts of section 2.1 focused on formal definitions.
- Superintelligence, Chapter 7: The superintelligent will (Bostrom, 2014) (25 mins)
- The alignment problem from a deep learning perspective (Ngo, 2022) (only sections 2, 3 and 4) (25 mins)
Further readings:
- Optimal policies tend to seek power: video (Turner et al., 2021) (15 mins)
- Turner et al. flesh out the arguments from Bostrom (2014) by formalizing the notion of power-seeking in the reinforcement learning context, and proving that, in a wide class of environments, optimal policies tend to seek power. (See also the corresponding blog post and paper.)
- Goal misgeneralization: why correct specifications aren’t enough for correct goals (Shah et al., 2022)
- The effects of reward misspecification: mapping and mitigating misaligned models (Pan, Bhatia and Steinhardt, 2022)
- Consequences of misaligned AI (Zhang and Hadfield-Menell, 2021)
- Categorizing variants of Goodhart’s law (Manheim and Garrabrant, 2018) (see also the Goodhart’s Law wikipedia page)
- Advanced artificial agents intervene in the provision of reward (Cohen et al., 2022) (45 mins)
- What is inner misalignment? (Leike, 2022) (10 mins)
- Leike provides another framing of the inner misalignment problem, formulated in terms of meta reinforcement learning.
- Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (Cotra, 2022)
- Is power-seeking AI an existential risk? (Carlsmith, 2021) (only sections 2: Timelines and 3: Incentives) (25 mins)
- Risks from Learned Optimization (Hubinger et al., 2019) (introduction and section on the Inner Alignment Problem) (55 mins)
- Hubinger et al. provide the original presentation of the inner alignment problem.
- Why alignment could be hard with modern deep learning (Cotra, 2021) (25 mins)
- Cotra presents one broad framing for why achieving alignment might be hard, tying together the ideas from the core readings in a more accessible way.
- The other alignment problem: mesa-optimisers and inner alignment (Miles, 2021) (25 mins)
- This video gives a more accessible introduction to the inner alignment problem, as discussed in Hubinger et al. (2019).
Exercises:
- Why is it not appropriate to describe the agents discussed by Krakovna et al. as displaying goal misgeneralization? Describe a toy experiment similar to those from Langosco et al. which you expect could demonstrate goal misgeneralization.
- Goal misgeneralization is most easily described in the context of reinforcement learning, but it could also arise in (self-)supervised learning or imitation learning contexts. Describe what it might look like for a large language model like GPT-3 to have learned a goal which then generalizes in undesirable ways in novel environments. How might you test whether that has happened?
- Guez et al. (2019) devised a test for whether a recurrent network is doing planning: seeing whether its performance improves when given more time to “think” before it can act. Think of some other test that we could run that would give us evidence about the extent to which a neural network is internally doing planning.
Notes:
- The core idea behind the goal misgeneralization problem is that in RL training, although the reward function is used to update a policy’s behavior based on how well it performs tasks during training, the policy doesn’t refer to the reward function while carrying out any given task (e.g. playing an individual game of Starcraft). So the motivations which drive a policy’s behavior when performing tasks need not precisely match the reward function it was trained on. The best thought experiments to help understand this are cases where the reward function is strongly correlated with some proxy objective, like resource acquisition or survival; this is analogous to how humans evolved to care directly about some proxies for genetic fitness.
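One way to internalise the point in this note is to look at where the reward function appears in a typical RL training loop. The sketch below is a toy illustration (a made-up two-action bandit with a REINFORCE-style update, not drawn from any of the readings): the reward function is consulted only when computing parameter updates, while at deployment behaviour is driven entirely by the policy’s learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_fn(action):
    # The designer's reward function: consulted only inside the training loop.
    return 1.0 if action == 1 else 0.0

logits = np.zeros(2)  # the policy's parameters: a softmax over two actions

def act(logits):
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(2, p=probs), probs

# Training: REINFORCE-style updates that use reward_fn to adjust the parameters.
lr = 0.1
for step in range(500):
    action, probs = act(logits)   # the policy acts without ever "seeing" reward_fn
    r = reward_fn(action)         # reward_fn enters only here, in the update
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0    # gradient of log pi(action) for a softmax policy
    logits += lr * r * grad_log_pi

# Deployment: reward_fn is never called; behaviour comes from the learned parameters.
action, probs = act(logits)
print("action probabilities after training:", probs.round(3))
```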
Discussion prompts:
- Christiano (2018) defined alignment as follows: “an AI A is aligned with an operator H if A is trying to do what H wants it to do”. Some questions about this:
- What’s the most natural way to interpret “what the human wants” - what they say, or what they think, or what they would think if they thought about it for much longer?
- How should we define an AI being aligned to a group of humans, rather than an individual?
- Did Bostrom miss any important convergent instrumental goals? (His current list: self-preservation, goal-content integrity, cognitive enhancement, technological perfection, resource acquisition.) One way of thinking about this might be to consider which goals humans regularly pursue and why.
- To what extent are humans inner misaligned with respect to evolution? How can you tell, and what might similar indicators look like in AGIs?
- Suppose that we want to build a highly intelligent AGI that is myopic, in the sense that it only cares about what happens over the next (short) bounded timeframe. Would such an agent still have convergent instrumental goals? What factors might make it easier or harder to train a myopic AGI than a non-myopic AGI?
- By some definitions, a chess AI has the goal of winning. When is it useful to describe it that way? What are the key differences between human goals and the “goals” of a chess AI?
- The same questions, but for corporations and countries instead of chess AIs. Does it matter that these consist of many different people, or can we treat them as agents with goals in a similar way to individual humans?
Week 3: Threat models and types of solutions
How might misaligned AGIs cause existential catastrophes, and how might work on alignment prevent that? Muehlhauser and Salamon (2012) outline one threat model: that progress in AI will at some point speed up dramatically, in an “intelligence explosion” caused by recursive self-improvement. However, Christiano (2019) also outlines two other threat models which may lead to catastrophe before an intelligence explosion occurs - the first focusing on outer misalignment, the second on inner misalignment.
Why would we deploy misaligned models in the first place? Steinhardt (2022) provides one key intuition: that misaligned AGIs will deceive humans during training. Ngo (2022) evaluates the implications of this possibility and several other factors which will affect our ability to control misaligned AGIs.
How might we prevent these scenarios? Christiano (2020) gives a broad overview of the landscape of different ways to make AI go well, with a particular focus on some of the techniques we’ll be covering in later weeks.
Core readings:
- Intelligence explosion: evidence and import (Muehlhauser and Salamon, 2012) (only pages 10-15) (15 mins)
- What failure looks like (Christiano, 2019) (20 mins)
- ML systems will have weird failure modes (Steinhardt, 2022) (15 mins)
- AGI safety from first principles (Ngo, 2020) (only section 5: Control) (15 mins)
- AI alignment landscape (Christiano, 2020) (35 mins)
Further readings:
- Risks from Learned Optimisation: Deceptive alignment (Hubinger et al., 2019) (30 mins)
- Hubinger et al. describe how and why misaligned agents might get high reward during training by deceiving humans.
- Unsolved problems in ML safety (Hendrycks et al., 2021) (50 mins)
- Hendrycks et al. provide an overview of open problems in safety which focuses more on links to mainstream ML.
- Takeoff speeds (Christiano, 2018) (35 mins)
- In response to Yudkowsky’s (2015) argument that there will be a sharp “intelligence explosion”, Christiano argues that the rate of progress will instead increase continuously over time. However, there is less distance between these positions than there may seem: Christiano still expects self-improving AI to eventually cause incredibly rapid growth.
- Yudkowsky Contra Christiano On AI Takeoff Speeds (2022) (40 mins)
- This piece provides an overview of a long-running debate about how abrupt or gradual AI advances will be.
- AGI ruin: a list of lethalities (Yudkowsky, 2022) (50 mins)
- Yudkowsky is one of the most pessimistic alignment researchers; this piece explains why (although see the next piece by Christiano for some pushback).
- Where I agree and disagree with Eliezer (Christiano, 2022) (30 mins)
- X-risk analysis for AI research (Hendrycks and Mazeika, 2022)
- What multipolar failure looks like (Critch, 2021) (45 mins)
- This and the next reading give further threat scenarios also motivated by the possibility of serious coordination failures.
- Another outer alignment failure story (Christiano, 2021) (20 mins)
- Modeling the human trajectory (Roodman, 2020) (35 mins)
- Roodman argues that historical GWP growth is best fitted by a power law, which if extrapolated forward predicts that growth rates will increase dramatically within a matter of decades. The importance of this piece in the context of alignment depends less on the reliability of this specific forward extrapolation, and more on the idea that the historical record doesn’t strongly weigh against the possibility of very rapid growth over the coming decades.
- Long-term growth as a sequence of exponential modes (Hanson, 2000) (40 mins)
- Hanson uses historical data to predict a shift to a new “growth mode” in which the economy doubles every few weeks, which he considers a more plausible outcome of AI progress than an intelligence explosion. Although this disagrees significantly with the model given by Roodman (2020) (as well as the position taken by Yudkowsky in the Hanson-Yudkowsky Foom debate) it supports the broader claim that technological progress will speed up to a much greater extent than almost anyone currently expects.
Exercises:
- The possibility of deceptive alignment, as discussed by Steinhardt (2022), is an example of goal misgeneralization where a policy learns a goal that generalizes beyond the bounds of any given training episode. What factors might make this type of misgeneralization likely or unlikely?
- Christiano’s “influence-seeking systems” threat model in What Failure Looks Like is in some ways analogous to profit-seeking companies. What are the most important mechanisms preventing companies from catastrophic misbehavior? Which of those would and wouldn’t apply to influence-seeking AIs?
- Ask the OpenAI API what steps it would perform to achieve some large-scale goal. Then recursively ask it how it’d perform each of those steps, until it reaches a point where its answers don’t make sense. What’s the hardest task you can find for which the API can not only generate a plan, but also perform each of the steps in that plan?
- What are the individual tasks involved in machine learning research (or some other type of research important for technological progress)? Identify the parts of the process which have already been automated, the parts of the process which seem like they could plausibly soon be automated, and the parts of the process which seem hardest to automate.
Discussion prompts:
- What are the biggest vulnerabilities in human civilisation that might be exploited by misaligned AGIs? To what extent do they depend on the development of other technologies more powerful than those which exist today?
- Does the distinction between “paying the alignment tax” and “reducing the alignment tax” make sense to you? Give a concrete example of each case. Are there activities which fall into both of these categories, or are ambiguous between them?
- Most of the readings so far have been framed in the current paradigm of deep learning. Is this reasonable? To what extent are they undermined by the possibility of future paradigm shifts in AI?
Week 4: Learning from humans
This week, we look at techniques for training AIs on human data (falling under “learn from teacher” in Christiano’s AI alignment landscape from last week). Next week, we’ll look at some ways to make these techniques more powerful and scalable; this week focuses on understanding each of them. Participants who are already familiar with these techniques should read the full papers linked from the blog posts, or some of the further readings, instead.
The first technique, behavioral cloning, is essentially an extension of supervised learning to settings where an AI must take actions over time - as discussed by Levine (2021). The second, reinforcement learning from human feedback (RLHF), involves reinforcement learning agents receiving rewards determined by human evaluations of their behavior (often via training a reward model from those human evaluations); this technique is used by Christiano et al. (2017) and Ouyang et al. (2022). Thirdly, in the long term we’d like models to assist our oversight in richer ways than just providing a reward signal. Saunders et al. (2022) take a step towards this by training AIs to generate natural-language feedback, while Perez et al. (2022) use a language model to automate red-teaming and discover systematic failures.
Finally, Christiano (2015) argues that inferring human values is likely a hard enough problem that the techniques discussed so far won’t be sufficient to reliably solve it - a possibility which motivates the extensions to these techniques discussed in subsequent weeks.
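Before diving into the readings: for readers who want to see the “reward model” step of RLHF more concretely, here is a minimal sketch under simplified assumptions (a linear reward model over hand-made segment features, and a synthetic stand-in for the human; not the actual setup of the papers above) of learning a reward model from pairwise comparisons with a Bradley-Terry-style loss, in the spirit of Christiano et al. (2017):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each trajectory segment is summarised by a 3-dimensional feature vector,
# and the (hidden) human preference depends on an unknown weight vector.
true_w = np.array([1.5, -2.0, 0.5])

def human_prefers_first(seg_a, seg_b):
    # Stand-in for a human comparing two segments.
    return seg_a @ true_w > seg_b @ true_w

w = np.zeros(3)  # the reward model: here just a linear function of the features
lr = 0.05

for step in range(2000):
    seg_a, seg_b = rng.normal(size=(2, 3))
    label = 1.0 if human_prefers_first(seg_a, seg_b) else 0.0

    # Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b)).
    diff = (seg_a - seg_b) @ w
    p_a = 1.0 / (1.0 + np.exp(-diff))

    # Gradient of the cross-entropy loss with respect to w.
    w -= lr * (p_a - label) * (seg_a - seg_b)

# The learned weights should point in roughly the same direction as true_w
# (the scale is not identified by preference comparisons alone).
print("learned reward weights:", w.round(2))
```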
Core readings:
- Imitation learning lecture: part 1 (Levine, 2021a) (20 mins)
- Read all four of the following blog posts, plus the full paper for whichever you found most interesting (if you’re undecided, default to the critiques paper):
- Deep RL from human preferences: blog post (Christiano et al., 2017) (10 mins)
- Aligning language models to follow instructions: blog post (Ouyang et al., 2022) (10 mins)
- AI-written critiques help humans notice flaws: blog post (Saunders et al., 2022) (10 mins)
- Red-teaming language models with language models (Perez et al., 2022) (10 mins)
- The easy goal inference problem is still hard (Christiano, 2015) (10 mins)
Further readings:
- Learning to summarize with human feedback: blog post (Stiennon et al., 2020) (20 mins)
- RL with KL penalties is best seen as Bayesian inference (Korbak and Perez, 2022)
- Adversarial training for high-stakes reliability (Ziegler et al., 2022)
- Training language models with language feedback (Scheurer et al., 2022)
- Scaling laws for reward model overoptimization (Gao, Schulman, and Hilton, 2022)
- Reward-rational (implicit) choice: a unifying formalism for reward learning (Jeon et al., 2020) (60 mins)
- The task of inferring human preferences from human data is known as reward learning. Both inverse reinforcement learning and training a reward model are examples of reward learning, using different types of data. In response to the proliferation of different types of reward learning, Jeon et al. (2020) propose a unifying framework.
- Humans can be assigned any values whatsoever (Armstrong, 2018) (15 mins)
- A key challenge for reward learning is to account for ways in which humans are less than perfectly rational. Armstrong argues that this will be difficult, because there are many possible combinations of preferences and biases that can lead to any given behavior, and the simplest is not necessarily the most accurate.
- Learning the preferences of bounded agents (Evans et al., 2015) (25 mins)
- Evans et al. discuss a few biases that humans display, and ways to account for them when learning values.
- An EPIC way to evaluate reward functions (Gleave et al., 2021) (15 mins) (see also a recorded presentation)
- Gleave et al. provide a way to evaluate the quality of learned reward models.
- A general language assistant as a laboratory for alignment (Askell et al., 2021) (sections 1 and 2) (40 mins)
- Askell et al. focus on another way of learning from humans: having humans design prompts which encourage aligned behavior, and then fine-tuning on those prompts (via a method they call context distillation).
- The MineRL BASALT Competition on Learning from Human Feedback (Shah et al. 2021) (only section 1) (25 mins)
- Shah et al. present a competition to train agents using human feedback to perform complex tasks in a Minecraft environment.
- Cooperative inverse reinforcement learning (Hadfield-Menell et al., 2016) (40 mins)
- CIRL focuses on a setting where an AI tries to infer a human’s preferences while they interact in a shared environment. In this setting, the best strategy for the human is often to help the AI learn what goal the human is pursuing - making it a “cooperative” variant of inverse reinforcement learning.
Exercises:
- Autoregressive language models are trained to predict the next word in a sentence, given the previous words. (Since the correct answer for each prediction can be generated automatically from existing training data, this is known as self-supervised learning, and is the key technique for training cutting-edge language models; see the toy illustration after this list.) In what ways is this the same as, or different from, behavioral cloning?
- Imagine using RLHF, as described in the middle three core readings for this week, to train an AI to perform a complex task like building a castle in Minecraft. What sort of problems would you encounter?
- In the first further reading, Stiennon et al. (2020) note that “optimizing our reward model eventually leads to sample quality degradation”. Look at the corresponding graph, and explain why the curves are shaped that way. How could we prevent performance from decreasing so much?
- If you’re familiar with CIRL (discussed in the last further reading): does it help us address the problems discussed in Christiano (2015)? Why or why not?
- Read the further reading by Armstrong about how humans can be assigned any values, then explain: why does reward learning ever work in practice?
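As referenced in the first exercise above, here is a toy illustration (using word-level “tokens” purely for intuition) of how next-token prediction targets are constructed automatically from raw text:

```python
# Toy illustration: building next-word prediction examples from raw text.
corpus = "the agent takes an action and the environment returns an observation"
tokens = corpus.split()

# Each training example is (context so far, next token) - no human labels needed,
# which is why this is called self-supervised learning.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples[:3]:
    print(f"context={context} -> target={target!r}")
```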
Notes:
- The techniques discussed this week showcase a tradeoff between power and alignment: behavioral cloning provides the fewest incentives for misbehavior, but is also the hardest to use to go beyond human-level ability. RLHF, by contrast, can reward agents for unexpected behavior that leads to good outcomes - but also rewards agents for manipulative or deceptive actions. (Although deliberate deception is likely beyond the capabilities of current agents, there are examples of simpler behaviors having a similar effect: Christiano et al. (2017) describe an agent learning behavior which misled the human evaluator, and Stiennon et al. (2020) describe an agent learning behavior which was misclassified by its reward model.) Lastly, while IRL and CIRL (discussed in the last further reading) can potentially be used even for tasks that humans can’t evaluate, the theoretical justification for them succeeding on those tasks relies on us having implausibly accurate models of human (ir)rationality.
Discussion prompts:
- What are the key similarities and differences between behavioral cloning, RL, and RLHF? What types of human preferences can these techniques most easily learn? What types would be hardest to learn?
- What implications does the size of the discriminator-critique gap (as discussed by Saunders et al.’s paper on AI-written critiques) have?
- Should we expect RLHF to be necessary for building AGI (independent of safety concerns)?
- How might using RLHF lead to misaligned AGIs?
Week 5: Decomposing tasks for better supervision
The most prominent research directions in technical AGI safety involve training AIs to do complex tasks by decomposing those tasks into simpler ones where humans can more easily evaluate AI behavior. This week we’ll cover three closely-related algorithms (all falling under “build a better teacher” in Christiano’s AI alignment landscape). Compared with most machine learning techniques, such as the ones discussed last week, these are less well-verified on existing problems, but have stronger justifications for why they might scale up to help align AGIs; unfortunately we don’t yet have techniques with both of these properties.
Wu et al. (2021) use recursive task decomposition in order to solve tasks that are too difficult to perform directly. This is one example of a more general class of techniques called iterated amplification (also known as iterated distillation and amplification), which is described in Ought (2019). A more technical description of iterated amplification is given by Christiano et al. (2018), along with some small-scale experiments.
The third technique we’ll discuss this week is Debate, proposed by Irving et al. (2018). Unlike the other two techniques, Debate focuses on evaluating claims made by language models, rather than supervising AI behavior over time. Barnes and Christiano (2020) describe some problems identified via human experiments using the debate protocol. You’ll spend some time during this week’s session trying out a toy implementation of Debate (as explained in the curriculum notes).
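To make the recursive-decomposition idea concrete, here is an illustrative Python sketch; answer_directly and decompose are hypothetical stand-ins for whichever model (or human) performs those steps, not functions from any of the papers above:

```python
from typing import Callable, List

# Hypothetical interfaces: a function that answers a question directly, and a
# function that proposes simpler subquestions. Both are assumptions of this sketch.
AnswerFn = Callable[[str], str]
DecomposeFn = Callable[[str], List[str]]

def amplified_answer(question: str, answer_directly: AnswerFn,
                     decompose: DecomposeFn, depth: int = 2) -> str:
    """Answer a question by recursively answering subquestions, then combining them."""
    if depth == 0:
        return answer_directly(question)
    subquestions = decompose(question)
    if not subquestions:
        return answer_directly(question)
    sub_answers = [amplified_answer(q, answer_directly, decompose, depth - 1)
                   for q in subquestions]
    context = "; ".join(f"{q} -> {a}" for q, a in zip(subquestions, sub_answers))
    # The combination step is deliberately simple: each decomposition/combination
    # step should be easy enough for a human (or trusted model) to supervise.
    return answer_directly(f"{question} (given subanswers: {context})")

# In iterated amplification, a new model would then be trained ("distilled") to
# imitate the outputs of this slower, amplified question-answering process.
```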
Core readings:
- Summarizing books with human feedback: blog post (Wu et al., 2021) (5 mins)
- Factored cognition (Ought, 2019) (introduction and scalability section) (20 mins)
- Supervising strong learners by amplifying weak experts (Christiano et al., 2018) (35 mins)
- AI safety via debate (Irving et al., 2018) (ending after section 3) (35 mins)
- Those without a background in complexity theory can skip section 2.2.
- Debate update: obfuscated arguments problem (Barnes and Christiano, 2020) (excluding appendix) (15 mins)
Further readings:
- Humans consulting HCH (Christiano, 2016a) and Strong HCH (Christiano, 2016b) (15 mins)
- One useful way of thinking about Iterated Amplification is that, in the limit, it aims to instantiate HCH, a theoretical structure described by Christiano.
- Least-to-most prompting enables complex reasoning in large language models (Zhou et al., 2022) (35 mins)
- This paper provides an example of non-task-specific decomposition steps which allow amplification of a model beyond its baseline capabilities.
- Chain of thought imitation with procedure cloning (Yang et al., 2022)
- Yang et al. introduce procedure cloning, in which an agent is trained to mimic not just expert outputs, but also the process by which the expert reached those outputs.
- WebGPT (Nakano et al., 2022) and GopherCite (Menick et al., 2022)
- WebGPT and GopherCite were trained to provide citations for the claims they make, so that their answers are easier to evaluate.
- Iterated Distillation and Amplification (Cotra, 2018) (20 mins)
- Another way of understanding Iterated Amplification is by analogy to AlphaGo: as Cotra discusses, AlphaGo’s tree search is an amplification step which is then distilled into its policy network.
Exercises:
- Identify another mechanism which could be added to the debate protocol and might improve its performance. (It may be helpful to think about ways in which AI debaters are disanalogous to humans.)
- Think of a complex question which you know a lot about. How would you argue for the dishonest side if doing a debate on that question? How would you rebut that line of argument if you were the honest debater?
- A complex task like running a factory can be broken down into subtasks in a fairly straightforward way, allowing a large team of workers to perform much better than even an exceptionally talented individual. Describe a task where teams have much less of an advantage over the best individuals. Why doesn’t your task benefit as much from being broken down into subtasks? How might we change that?
- Read Christiano’s posts on HCH from the further readings. Why might even an ideal implementation of HCH not be aligned? What assumptions could change that?
Notes:
- During this week’s discussion session, try playing OpenAI’s implementation of the Debate game. The instructions on the linked page are fairly straightforward, and each game should be fairly quick. Note in particular the example GIF on the webpage, and the instructions that “the debaters should take turns, restrict themselves to short statements, and not talk too fast (otherwise, the honest player wins too easily).”
- What makes AI Debate different from debates between humans? One crucial point is that in debates between humans, we prioritize the most important or impactful claims made - whereas any incorrect statement from an AI debater loses them the debate. This is a demanding standard (aimed at making debates between superhuman debaters easier to judge).
Discussion prompts:
- Wu et al. (2021) use a combination of behavioral cloning and reinforcement learning to train a summarization model; this combination was also used to train AlphaGo and AlphaStar. What are the advantages of this approach over using either technique by itself?
- Different types of iterated amplification can use different techniques for learning from the amplified training signal. One type, imitative amplification, uses behavioral cloning; we could also use supervised learning or reinforcement learning. How should we expect these to differ?
- Debate is limited to training agents to answer questions correctly. How important do you expect this limitation to be for training economically competitive agents?
Week 6: Interpretability
Our current methods of training capable neural networks give us very little insight into how or why they function. This week we cover the field of interpretability, which aims to change this by developing methods for understanding how neural networks think.
We touch on two subfields of interpretability: neuron-level mechanistic interpretability, and concept-based interpretability. Start with Olah et al. (2017), who describe a range of techniques for visualizing the features represented by individual neurons. Olah et al. (2020) then explore how neural circuits build up representations of high-level features out of lower-level features; further intuitions along these lines are provided by Olah (2022). Meng et al. (2022) then demonstrate how interpretability can help us modify neural weights in semantically meaningful ways. Finally, McGrath et al. (2021) provide a case study of using concept-based interpretability techniques to understand AlphaZero’s development of human chess concepts.
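For a concrete sense of what “feature visualization” involves, below is a heavily simplified PyTorch sketch of activation maximisation: gradient ascent on an input image to maximise one channel’s activation in a pretrained vision model. The layer and channel choices are arbitrary, and real feature-visualization work (as in Olah et al. (2017)) adds the preprocessing, regularisation, and transformation robustness that this sketch omits.

```python
import torch
from torchvision import models

# Arbitrary choices of model, layer, and channel for illustration.
model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT).eval()
target_layer = model.inception4a
channel = 10

activations = {}
def hook(module, inputs, output):
    activations["value"] = output
target_layer.register_forward_hook(hook)

# Start from random noise and run gradient ascent on one channel's mean activation.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    model(image)
    loss = -activations["value"][0, channel].mean()  # maximise via the negative
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)  # keep pixel values in a valid range

print("optimised an input of shape", tuple(image.shape))
```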
Core readings:
- Feature visualization (Olah et al, 2017) (20 mins)
- Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)
- Mechanistic interpretability, variables, and the importance of interpretable bases (Olah, 2022) (10 mins)
- Locating and Editing Factual Associations in GPT: blog post (Meng et al., 2022) (10 mins)
- Acquisition of chess knowledge in AlphaZero (McGrath et al., 2021) (only up to the end of section 2.1) (20 mins)
Further readings:
- Toy models of superposition (Elhage et al., 2022) (60 mins) (everything up to the beginning of the Superposition and Learning Dynamics section, plus the Strategic Picture of Superposition section)
- Thread: Circuits (Cammarata et al., 2020)
- A series of short articles building on Zoom In, exploring different circuits in the InceptionV1 vision network.
- A mathematical framework for transformer circuits (Elhage et al., 2021) (90 mins)
- Elhage et al. build on previous circuits work to analyze transformers, the neural network architecture used by most cutting-edge models. For a deeper dive into the topic, see the associated videos.
- Rewriting a deep generative model (Bau et al., 2020)
- Bau et al. find a way to change individual associations within a neural network, which allows them to replace specific components of an image.
- Intro to brain-like AGI safety (Byrnes, 2022) (part 3: two subsystems, part 6: big picture, part 7: worked example)
- In addition to interpretability research on neural networks, another approach to developing more interpretable AI involves studying human and animal brains. Byrnes gives an example of applying ideas from neuroscience to better understand AI.
- Compositional explanations of neurons (Mu and Andreas, 2021)
- Discovering latent knowledge in language models without supervision (Burns et al., 2022)
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small (Wang et al., 2022)
- Polysemanticity and capacity in neural networks (Scherlis et al., 2022)
- Interpretability beyond feature attribution: quantitative testing with concept attribution vectors (Kim et al., 2018) (35 mins)
- Kim et al. introduce a technique for interpreting a neural net's internal state in terms of human concepts.
- Chris Olah’s views on AGI safety (Hubinger, 2019) (20 mins)
- Eliciting latent knowledge (Christiano et al., 2021) (up to the end of the Ontology Identification section on page 38) (60 mins)
- This reading outlines the research agenda of Paul Christiano’s Alignment Research Center. The problem of eliciting latent knowledge can be seen as a long-term goal for interpretability research.
Exercises:
- Given sufficient progress in interpretability, we might be able to supervise not just an agent’s behavior but also its thoughts (i.e. its neural activations). One concern with such proposals is that if we train a network to avoid any particular cognitive trait, that cognition will instead just be distributed across the network in a way that we can’t detect. Describe a toy example of a cognitive trait that we can currently detect automatically. Design an experiment to determine whether, after training to remove that trait, the network has learned to implement an equivalent trait in a less easily detectable way.
- Interpretability work on artificial neural networks is closely related to interpretability work on biological neural networks (aka brains). Describe two ways in which the former is easier than the latter, and two ways in which it’s harder.
Discussion prompts:
- Were you surprised by the results and claims in Zoom In? Do you believe the Circuits hypothesis? If true, what are its most important implications? How might it be false?
- How compelling do you find the analogy to reverse engineering programs in Olah (2022)? What evidence would make the analogy more or less compelling?
Week 7: Agent foundations, AI governance, and careers in alignment
This last week of curriculum content is split between three different topics: agent foundations research, AI governance research, and pursuing careers in alignment.
The agent foundations research agenda (pursued primarily by the Machine Intelligence Research Institute (MIRI)) aims to develop better theoretical frameworks for describing AIs embedded in real-world environments. Demski and Garrabrant (2018) identify a range of open problems in this area, and links between them; we then briefly touch on three examples of work in this area. Garrabrant et al. (2016) provide an idealized algorithm for induction under logical uncertainty (e.g. uncertainty about mathematical statements); Everitt et al. (2021) formalize the incentives of RL agents; and Garrabrant (2021) formulates a framework which attempts to supersede Pearl’s framework for causation.
Dafoe (2020) then gives a thorough overview of AI governance and ways in which it might be important, particularly focusing on the framing of AI governance as field-building. An alternative framing - of AI governance as an attempt to prevent cooperation failures - is explored by Clifton (2019).
Lastly, for thinking about careers in alignment, see the resources compiled by Ngo (2022).
In the taxonomy of AI governance given by Clarke (2022) in the optional readings (diagram below) this week’s governance readings focus on strategy research, tactics research and field-building, not on developing, advocating or implementing specific policies. Those interested in exploring AI governance in more detail, including evaluating individual policies, should look at the curriculum for the parallel AI governance track of this course.
[Diagram: Clarke’s (2022) overview of the longtermist AI governance landscape]
Core readings:
- Embedded agents, part 1 (Demski and Garrabrant, 2018) (15 mins)
- Read one of the following three blog posts, which give brief descriptions of work on agent foundations.
- Logical induction: blog post (Garrabrant et al., 2016) (10 mins)
- Finite factored sets: talk transcript (Garrabrant, 2021) (only sections 2m: the Pearlian paradigm and 2t: we can do better) (10 mins)
- Progress on causal influence diagrams: blog post (Everitt et al., 2021) (15 mins)
- AI Governance: Opportunity and Theory of Impact (Dafoe, 2020) (25 mins)
- Cooperation, conflict and transformative AI: sections 1 & 2 (Clifton, 2019) (25 mins)
- Careers in alignment (Ngo, 2022) (30 mins)
Further readings:
On agent foundations:
- MIRI’s approach (Soares, 2015) (25 mins)
- Soares explains and defends MIRI’s focus on discovering new mathematical frameworks for thinking about intelligence.
- Cheating Death in Damascus: blog post (Soares and Levenstein, 2017) (10 mins)
- Soares and Levenstein (2017) propose a new decision theory by extending the standard notion of physical causation to include logical causation.
- Parametric Bounded Löb's Theorem and Robust Cooperation of Bounded Agents (Critch, 2016) (40 mins)
- Critch finds an algorithm by which agents which can observe each other’s source code can reliably cooperate. For a brief introduction to the paper, see here.
- Infra-bayesianism unwrapped (Shimi, 2021) (45 mins)
- This post gives an overview of a line of research focused on extending bayesianism to include incomplete hypotheses.
- Causal inference in statistics: a primer (Pearl et al., 2016)
- Pearl’s approach to causality underlies a wide range of work in agent foundations.
On AI governance:
- The semiconductor supply chain (Khan, 2021) (up to page 15) (15 mins)
- This and the following two readings give brief introductions to three key factors affecting the AI strategic landscape.
- The global AI talent tracker (Macro Polo, 2020) (5 mins)
- Sharing powerful AI models (Shevlane, 2022) (10 mins)
- Deciphering China’s AI dream (Ding, 2018) (95 mins) (see also his podcast on this topic)
- Ding gives an overview of Chinese AI policy, one of the key factors affecting the landscape of possible approaches to AI governance.
- AI Governance: a research agenda (Dafoe, 2018) (120 mins)
- Dafoe outlines an overarching research agenda linking many areas of AI governance.
- Some AI governance research ideas (Anderljung and Carlier, 2021) (60 mins)
- This and the next two readings provide lists of research directions which have either been promising so far, or which may be useful to look into in the future.
- Our AI governance grantmaking so far (Muehlhauser, 2020) (15 mins)
- See above.
- The longtermist AI governance landscape: a basic overview (Clarke, 2022) (15 mins)
- See above.
Exercises:
- Explain the importance of the ability to make credible commitments for Clifton’s (2019) game-theoretic analysis of cooperation failures.
- What are the most important disanalogies between POMDPs and the real world?
- In what ways has humanity’s response to large-scale risks other than AI (e.g. nuclear weapons, pandemics) been better than we would have expected beforehand? In what ways has it been worse? What can we learn from this?
Notes:
- “Accident” risks, as discussed in Dafoe (2020), include the standard risks due to misalignment which we’ve been discussing for most of the course. I don’t usually use the term, because “deliberate” misbehavior from AIs is quite different from standard accidents.
- Compared with the approaches discussed over the last few weeks, agent foundations research is less closely-connected to existing systems, and more focused on developing new theoretical foundations for alignment. Given this, there are many disagreements about how relevant it is for deep-learning-based systems.
Discussion prompts:
- What are the best examples throughout history of scientists discovering mathematical formalisms that allowed them to deeply understand a phenomenon that they were previously confused about? (The easiest examples are from physics; how about others from outside physics?) To what extent should these make us optimistic about agent foundations research developing a better mathematical understanding of intelligence?
- How worried are you about misuse vs structural vs accident risk?
- Do you expect AGI to be developed by a government or corporation (or something else)? What are the key ways that this difference would affect AI governance?
- What are the main ways in which technical work could make AI governance easier or harder?
- What are the biggest ways you expect AI to impact the world in the next 10 years? How will these affect policy responses aimed at the decade after that?
- It seems important for regulators and policy-makers to have a good technical understanding of AI and its implications. In what cases should people with technical AI backgrounds consider entering these fields?
Week 8 (four weeks later): Projects
Projects overview
The final part of the AGI safety fundamentals course will be projects where you get to dig into something related to the course. The project is a chance for you to explore your interests, so try to find something you’re excited about! The goal of this project is to help you practice taking an intellectually productive stance towards AGI safety - to go beyond just reading and discussing existing ideas, and take a tangible step towards contributing to the field yourself. This is particularly valuable because it’s such a new field, with lots of room to explore.
We’ve allocated four weeks between the last week of the curriculum content and the sessions where people present their projects. As a rough guide, we expect participants to spend at least 10 hours on the project during that time. You may find it useful to write up a tentative project proposal before starting the project and send it to your cohort for feedback.
We’re flexible on project format; three recommended (significantly overlapping) categories are:
- Technical upskilling - e.g. training neural networks, or reimplementing papers.
- Distilling understanding - e.g. summarizing existing work, or doing literature reviews.
- Novel exploration - e.g. black-box investigation of language models.
Many projects in these categories would take more than the standard 10-15 hours; we’re happy for participants to aim for more ambitious projects, and then present work in progress to their cohorts in the week 8 session. There’s no need for week 8 presentations to be polished; we’d prefer participants to spend more time on the project itself, then talk through it informally. We’d encourage participants who produce pieces of writing (like summaries) to put them online after finishing them (although this is entirely optional). We expect most projects to be individual ones; but feel free to do a collaborative project if you’d like to.
Some project ideas for each category
Technical upskilling
- (For those with no ML experience): Train a neural network on some standard datasets. For help, see the fast.ai course or the PyTorch tutorials.
- (For those with some ML experience): Do some of the exercises from Jacob Hilton’s deep learning curriculum.
- (For those with extensive ML and RL experience): Replicate the T-REX paper (easier) or the Deep Reinforcement Learning from Human Preferences paper (harder) in a simpler environment (e.g. cartpole). See if you can train the agent to do something in that environment which you can’t write an explicit reward function for.
- (For those interested in agent foundations research): Do some of Garrabrant’s fixed point exercises (on topology, diagonalization, or iteration).
- (Meta): If you’re considering a career in alignment, put together a career plan, with a particular focus on the most important skills for you to acquire, and how you’ll do that.
Distilling understanding
- Pick a reading you found interesting from the curriculum, and summarize or critique it. (Here are some examples of this being done well for Iterated Amplification and Inner Alignment, although we don’t expect projects to be as comprehensive as these.)
- Pick an exercise from this curriculum which is suitable to be fleshed out into a longer project, and solve it more thoroughly.
- Make a set of forecasts about the future of AI, in a way that’s concrete enough that you will be able to judge whether you were right or not, and predict what would significantly change your mind.
- Pick a key underlying belief which would impact your AGI alignment research interests, or whether to research alignment at all. Review the literature around this question, and write up a post giving your overall views on it, and the strongest arguments and evidence on the topic.
Novel exploration
- Do a black-box investigation trying to discover an interesting novel property of language models (e.g. GPT-3). Some more specific possibilities:
- Submit an entry to the inverse scaling prize (e.g. using the different model sizes available via the OpenAI API).
- Search for alignment failures - cases where the model is capable of doing what you intend, but doesn’t. As one example (discussed in section 7.2 of this paper), when the user gives it a prompt containing a subtle bug, the Codex language model may “deliberately” introduce further bugs into the code it writes, in order to match the style of the user prompt.
- Try to discover a novel capability or property of large language models - e.g. a type of prompting more effective than previous ones (like this paper discovers).
- Read Christiano’s Eliciting Latent Knowledge proposal (the final further reading in week 6), then try producing a proposal for the contest.
- Identify a set of capabilities which would be necessary for an AI to be an existential risk, and produce some tests for which of those capabilities current systems do or don’t have.
See also this longer list of project ideas, and this list of conceptual research projects.
Learn More
This is a selection of resources aimed primarily at helping people learn the skills required for doing alignment research. For a more comprehensive list of resources relevant for AGI safety, including funding sources and job opportunities, see here.
All resources freely available online except where marked otherwise.
AI safety resources:
Learning to program in Python:
ML courses:
ML textbooks:
Research skills: