1 of 42

Introduction to AI Safety

Aryeh L. Englander

AMDS / A4I

2 of 42

Overview


3 of 42

What do we mean by Technical AI Safety?

  • Critical systems: systems whose failure may lead to injury or loss of life, damage to the environment, unauthorized disclosure of information, or serious financial losses
  • Safety-critical systems: systems whose failure may result in injury, loss of life, or serious environmental damage
  • Technical AI safety: designing safety-critical AI systems (and more broadly, critical AI systems) in ways that guard against accident risks – i.e., harms arising from AI systems behaving in unintended ways



4 of 42

Other related concerns

  • Security against exploits by adversaries
    • Often considered part of AI Safety
  • Misuse: people using AI in unethical or malicious ways
    • Ex: deepfakes, terrorism, suppression of dissent
  • Machine ethics
    • Designing AI systems to make ethical decisions
    • Debate over lethal autonomous weapons
  • Structural risks from AI shaping the environment in subtle ways
    • Ex: job loss, increased risks of arms races
  • Governance, strategy, and policy
    • Should government regulate AI?
    • Who should be held accountable?
    • How do we coordinate with other governments and stakeholders to prevent risks?
  • AI forecasting and risk analysis
    • When are these concerns likely to materialize?
    • How concerned should we be?


Adversarial examples: fooling AI into thinking a stop sign is a 45 mph sign

Potential terrorist use of lethal fully autonomous drones

(image source, based on a report from the OECD)

Jobs at risk of automation by AI

5 of 42

AI Safety research communities

  • Two related research communities: AI Safety, Assured Autonomy
  • AI Safety
    • Focus on long-term risks from roughly human-level AI or beyond
    • Also focused on near-term concerns that may scale up / provide insight into long-term issues
    • Relatively new field – past 10 years or so
    • Becoming progressively more mainstream
      • Many leading AI researchers have expressed strong support for the research
      • AI Safety research groups set up at several major universities and AI companies
  • Assured Autonomy
    • Older, established community with broader focus on assuring autonomous systems in general
    • Recently started looking at challenges posed by machine learning
    • Current and near-term focus
  • In the past year both communities have finally started trying to collaborate and work out a shared research landscape and vision
  • APL’s focus: near- and mid-term concerns, though ideally our research would also scale up to longer-term concerns


6 of 42

AI Safety: Lots of ways to frame conceptually

  • Many different ways to divide up the problem space, and many different research agendas from different organizations
  • It can get pretty complicated


AI Safety Landscape overview from the Future of Life Institute (FLI)

Connections between different research agendas

(Source: Everitt et al., AGI Safety Literature Review)

7 of 42

AI Safety: DeepMind’s conceptual framework


8 of 42

Assured Autonomy: AAIP conceptual framework


AAIP = Assuring Autonomy International Programme (University of York)

9 of 42

Combined framework

  • This is the proposed framework for combining AI Safety and Assured Autonomy research communities
  • Also tries to address relevant topics from the AI Ethics, Security and Privacy communities
  • Until now these communities haven’t been talking to each other as much as they should
  • Still in development; AAAI 2020 has a full-day workshop on this
  • Personal opinion: I like that it’s general, but I think it’s a bit too general – best used only for very abstract overviews of the field


= focus of AI Safety / DeepMind framework

= focus of Assured Autonomy / AAIP framework

10 of 42

My personal preference

  • Problems that scale up to the long term: DeepMind framework
  • Near-term machine learning: AAIP framework
  • Everything else: combined framework

11 of 42

AI safety concerns and APL’s mission areas

  • All of APL’s mission areas involve safety- or mission-critical systems
  • The military is concerned with assurance rather than safety (obviously, military systems are unsafe for the enemy), but the two concepts are very similar and involve similar problems and solutions
  • The government is very aware of these problems, and this is part of why the military has been reluctant to adopt AI technologies
  • If we want to get the military to adopt the AI technologies we develop here, those technologies will need to be assured and secure


12 of 42

Technical AI Safety


13 of 42


14 of 42

Specification problems

  • These problems arise when there is a gap (often very subtle and unnoticed) between what we really want and what the system is actually optimizing for
  • Powerful optimizers can find surprising and sometimes undesirable solutions for objectives that are even subtly mis-specified
  • Often extremely difficult or impossible to fully specify everything we really want
  • Some examples:
    • Specification gaming
    • Avoiding side effects
    • Unintended emergent behaviors
    • Bugs and errors


15 of 42

Specification: Specification Gaming

  • Agent exploits a flaw in the specification
  • Powerful optimizers can find extremely novel and potentially harmful solutions
  • Example: evolved radio
  • Example: Coast Runners
  • There are many other similar examples; the toy sketch below illustrates the basic pattern
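
Below is a minimal toy sketch of this pattern (the racing-style environment, reward values, and policies are hypothetical, not taken from the evolved radio or Coast Runners examples): when the specified reward is only a proxy for the true objective, the policy that maximizes the specified reward can be the one that scores worst on what we actually wanted.

```python
# Hypothetical racing-style environment: the true objective is finishing the
# course, but the specified reward only counts targets hit along the way.
HORIZON = 100          # episode length in steps
TARGET_REWARD = 1.0    # specified reward per target hit
FINISH_REWARD = 10.0   # specified reward for crossing the finish line

policies = {
    # name: (targets hit per episode, finishes the course?)
    "intended: drive straight to the finish": (3, True),
    "gaming: loop over respawning targets": (HORIZON, False),
}

for name, (targets_hit, finished) in policies.items():
    specified = targets_hit * TARGET_REWARD + (FINISH_REWARD if finished else 0.0)
    true_objective = 1.0 if finished else 0.0   # what the designer actually wanted
    print(f"{name:42s} specified={specified:6.1f}  true={true_objective}")

# The "gaming" policy earns 100.0 under the specified reward but 0.0 on the true
# objective, while the intended policy earns only 13.0; a pure reward maximizer
# will therefore pick the gaming behavior.
```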


The evolvable motherboard that led to the evolved radio

A reinforcement learning agent discovers an unintended strategy for achieving a higher score

(Source: OpenAI, Faulty Reward Functions in the Wild)

16 of 42

Specification: Specification Gaming (cont.)

  • Can be a problem for classifiers as well: The loss function (“reward”) might not really be what we care about, and we may not discover the discrepancy until later
  • Example: Bias
    • We care about the difference between humans and animals more than between breeds of dogs, but the loss function weights all confusions equally (see the cost-matrix sketch after this list)
    • We only discovered this problem after it caused major issues
  • Example: Adversarial examples
    • Deep Learning (DL) systems discovered weird correlations that humans never thought to look for, so predictions don’t match what we really care about
    • We only discovered this problem well after the systems were in use
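
A minimal sketch of one way to narrow that gap (the classes, cost values, and probability vector below are made up for illustration): replace a uniform loss with a cost matrix that encodes which confusions we actually care about, so the training signal sits closer to the true objective.

```python
import numpy as np

# Standard cross-entropy treats every confusion equally; a cost matrix lets us
# say that some mistakes matter far more than others. Classes and costs here
# are hypothetical.
classes = ["person", "labrador", "poodle"]
# cost[i][j] = cost of predicting class j when the true class is i
cost = np.array([
    [0.0, 10.0, 10.0],   # misclassifying a person as any dog breed is severe
    [10.0, 0.0,  0.5],   # confusing two dog breeds is comparatively minor
    [10.0, 0.5,  0.0],
])

def expected_cost(true_idx, predicted_probs):
    """Expected misclassification cost under the model's predicted distribution."""
    return float(cost[true_idx] @ np.asarray(predicted_probs))

probs = [0.05, 0.15, 0.80]   # model output for an image whose true class is "labrador"
print(f"expected cost when the truth is {classes[1]}: {expected_cost(1, probs):.2f}")
# 0.05*10 + 0.15*0 + 0.80*0.5 = 0.90
```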


Google images misidentified black people as gorillas

(source)

Blank labels can make DL systems misidentify stop signs as Speed Limit 45 MPH signs

(source)

17 of 42

Specification: Avoiding side effects

  • What we really want: achieve goals subject to common sense constraints
  • But current systems do not have anything like human common sense
  • In any case, a system would not constrain itself by default unless specifically programmed to do so
  • Problem likely to get much more difficult going forward:
    • Increasingly complex, hard-to-predict environments
    • Increasing number of possible side effects
    • Increasingly difficult to think of all those side effects in advance


Two side effect scenarios

(source: DeepMind Safety Research blog)

18 of 42

Specification: Avoiding side effects (cont.)

  • Standard TEV&V approach: brainstorm with experts "what could possibly go wrong?"
  • In complex environments it might not be possible to anticipate everything that could go wrong (unknown unknowns) until it's too late
  • Is there a general method we can use to guard against even unknown unknowns?
  • Ideas in this category
    • Penalize changing the environment (example; a toy impact-penalty sketch follows this list)
    • Agent learns constraints by observing humans (example)
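
A minimal sketch of the first idea (the state representation, the distance measure, and the beta weight are all hypothetical): shape the reward with a penalty proportional to how far the world has drifted from a do-nothing baseline, so that side effects cost the agent something even when nobody anticipated them in advance. Published proposals use much more careful baselines and distance measures than this, but the reward-shaping structure is the same.

```python
def state_distance(state_a, state_b):
    """Hypothetical measure of how different two world states are."""
    return sum(1 for a, b in zip(state_a, state_b) if a != b)

def shaped_reward(task_reward, current_state, baseline_state, beta=0.3):
    """Task reward minus an impact penalty; beta trades task progress against impact."""
    return task_reward - beta * state_distance(current_state, baseline_state)

# The robot reaches the goal (+1 task reward) but knocked over the vase on the
# way, so one state feature now differs from the do-nothing baseline.
baseline = ("vase_upright", "door_closed")
after    = ("vase_broken",  "door_closed")
print(shaped_reward(1.0, after, baseline))      # 0.7: goal reached, but penalized
print(shaped_reward(1.0, baseline, baseline))   # 1.0: goal reached with no impact
```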


Get from point A to point B – but don’t knock over the vase!

Can we think of all possible side effects like this in advance?

(image source)

19 of 42

Specification: Other problems


OpenAI’s hide and seek AI agents demonstrated surprising emergent behaviors (source)

  • Emergent behaviors
    • E.g., multi-agent systems, human-AI teams
    • Makes the system much more difficult to predict and verify, which makes many of the problems above worse
  • Bugs and errors
    • Can be even harder to find and correct logic errors in complex ML systems (especially Deep Learning) than in regular software systems
    • (See later on TEV&V)

20 of 42

Robustness problems

  • How to ensure that the system continues to operate within safe limits upon perturbation
  • Some examples:
    • Distributional shift / generalization
    • Safe exploration
    • Security


21 of 42

Robustness: Distributional shift / generalization

  • How do we get a system trained on one distribution to perform well and safely if it encounters a different distribution after deployment?
  • Especially, how do we get the system to proceed more carefully when it encounters safety-critical situations that it did not encounter during training?
  • Generalization is a well-known problem in ML, but more work needs to be done
  • Some approaches:
    • Cautious generalization
    • “Knows what it knows” (a crude confidence-threshold sketch follows this list)
    • Expanding on anomaly detection techniques
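
A crude sketch of the “knows what it knows” direction (the threshold and probability vectors are made up, and max-softmax confidence is only a weak out-of-distribution signal in practice): the system abstains and defers to a human or a safe fallback whenever its predictive confidence is low.

```python
import numpy as np

def predict_or_defer(probs, threshold=0.9):
    """Return the predicted class index, or defer when confidence is too low."""
    probs = np.asarray(probs)
    if probs.max() < threshold:
        return "DEFER"                 # looks unlike anything seen in training
    return int(probs.argmax())

print(predict_or_defer([0.97, 0.02, 0.01]))   # confident -> class 0
print(predict_or_defer([0.40, 0.35, 0.25]))   # uncertain -> "DEFER"
```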


22 of 42

Robustness: Safe exploration

  • If an RL agent uses online learning or needs to train in a real-world environment, then the exploration itself needs to be safe
  • Example: A self-driving car can't learn by experimenting with swerving onto sidewalks (see the action-filtering sketch after this list)
  • Restricting learning to a controlled, safe environment might not provide sufficient training for some applications
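
One common mitigation, sketched below with a hypothetical action set and safety predicate: filter exploratory actions through a hard safety check before they ever reach the real environment, so random exploration can never select a known-unsafe action. Note that this only covers hazards we can specify in advance.

```python
import random

ACTIONS = ["steer_left", "steer_right", "brake", "swerve_onto_sidewalk"]

def is_safe(action, state):
    # Stand-in safety check; in practice this would be a vetted constraint
    # model rather than a hand-written blacklist.
    return action != "swerve_onto_sidewalk"

def explore(state, epsilon=0.1, greedy_action="brake"):
    """Epsilon-greedy exploration restricted to actions that pass the safety check."""
    safe_actions = [a for a in ACTIONS if is_safe(a, state)]
    if random.random() < epsilon:
        return random.choice(safe_actions)
    return greedy_action

print(explore(state={}, epsilon=1.0))   # random, but never "swerve_onto_sidewalk"
```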


How do we tell a cleaning robot not to experiment with sticking wet brooms into sockets during training?

(image source)

23 of 42

Robustness: Security

  • (Security is sometimes considered part of safety / assurance, and sometimes separate)
  • ML systems pose unique security challenges
  • Data poisoning: Adversaries can corrupt the training data, leading to undesirable results
  • Adversarial examples: Adversaries can use tricks to fool ML systems (an FGSM-style sketch follows this list)
  • Privacy and classified information: By probing ML systems, adversaries may be able to uncover private or classified information that was used during training
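
A minimal adversarial-example sketch (the logistic-regression weights, the input, and epsilon are made up): the fast gradient sign method (FGSM) nudges each input feature in the direction that increases the loss, which can flip the model's decision with a perturbation a human would consider insignificant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -2.0, 0.5]), 0.1       # toy logistic-regression "model"

def predict(x):
    return sigmoid(w @ x + b)                # probability of class 1

x = np.array([0.2, 0.6, 0.4])                # score -0.6 -> class 0

# For true label y = 0 the gradient of the logistic loss w.r.t. x points along
# w, so the FGSM perturbation is simply epsilon * sign(w).
epsilon = 0.35
x_adv = x + epsilon * np.sign(w)

print(round(float(predict(x)), 2))       # 0.35 -> classified as class 0
print(round(float(predict(x_adv)), 2))   # 0.69 -> flipped to class 1
```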


What if an adversary fools an AI into thinking a school bus is a tank?

24 of 42

Monitoring and Control

  • (DeepMind calls this Assurance, but that’s confusing since we’ve also been discussing Assured Autonomy)
  • Interpretability: Many ML systems (esp. DL) are mostly black boxes
  • Scalable oversight: It can be very difficult to provide oversight of increasingly autonomous and complex agents
  • Human override: We need to be able to shut down the system if needed (a minimal override-layer sketch follows this list)
    • Building in mechanisms to do this is often difficult
    • If the operator is part of the environment that the system learns about, the AI could conceivably learn policies that try to avoid the human shutting it down
      • “You can't get the cup of coffee if you're dead”
      • Example: robot blocks camera to avoid being shut off
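
A minimal sketch of an override layer (the interfaces and action names are hypothetical): every action the learned policy proposes passes through a wrapper in which a human stop command always wins. Building the switch itself is the easy part; making sure the learner has no incentive to avoid the interruption is the harder, open problem noted above.

```python
class OverrideWrapper:
    """Hypothetical wrapper: human commands always preempt the learned policy."""

    def __init__(self, policy):
        self.policy = policy

    def act(self, observation, human_command=None):
        if human_command == "STOP":
            return "safe_shutdown"        # hard override: shut down safely
        if human_command is not None:
            return human_command          # manual takeover for one step
        return self.policy(observation)   # normal autonomous operation

agent = OverrideWrapper(policy=lambda obs: "continue_task")
print(agent.act({}, human_command=None))     # "continue_task"
print(agent.act({}, human_command="STOP"))   # "safe_shutdown"
```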


25 of 42

Scaling up testing, evaluation, verification, and validation

  • The extremely complex, mostly black-box models learned by powerful Deep Learning systems make it difficult or impossible to scale up existing TEV&V techniques
  • Hard to do enough testing or evaluation when the space of possible unusual inputs and situations is huge (one partial mitigation, randomized property checking, is sketched after this list)
  • Most existing TEV&V techniques require specifying exactly which boundaries we care about, which can be difficult or intractable
  • Systems can often only be verified in relatively simple, constrained environments – this doesn't scale up well to more complex environments
  • Especially difficult to use standard TEV&V techniques for systems that continue to learn after deployment (online learning)
  • Also difficult to use TEV&V for multi-agent or human-machine teaming environments due to possible emergent behaviors
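
When exhaustively specifying the input space is intractable, one partial, purely empirical tool is randomized property checking, sketched below with a stand-in classifier and a made-up perturbation budget: sample many small perturbations of an input and flag cases where the prediction flips.

```python
import random

def model(x):
    # Stand-in for a trained classifier.
    return "stop_sign" if sum(x) > 1.0 else "speed_limit"

def perturb(x, budget=0.05):
    return [xi + random.uniform(-budget, budget) for xi in x]

def robustness_check(x, trials=1000):
    """Count how often a small random perturbation changes the prediction."""
    baseline = model(x)
    failures = [p for p in (perturb(x) for _ in range(trials)) if model(p) != baseline]
    return len(failures), failures[:3]    # flip count plus a few counterexamples

print(robustness_check([0.52, 0.51]))     # inputs near the decision boundary flip often
```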


26 of 42

Theoretical issues

  • A lot of decision theory and game theory breaks down if the agent is itself part of the environment that it's learning about
  • Reasoning correctly about powerful ML systems might become very difficult and lead to mistaken assumptions with potentially dangerous consequences
  • Especially difficult to model and predict the actions of agents that can modify themselves in some way or create other agents


Embedding agents in the environment can lead to a host of theoretical problems

(source: MIRI Embedded Agency sequence)

27 of 42

Human-AI teaming

  • Understanding the boundaries: often even the system designers don't really understand where the system does or doesn't work
  • Example: Researchers didn’t discover the problem of adversarial examples until well after the systems were already in use; it took several more years to understand the causes of the problem (and it’s still debated)
  • Humans (even the designers) sometimes anthropomorphize too much and therefore use faulty “machine theories of mind” – current ML systems do not process data and information in the same way humans do
  • Can lead to people trusting AI systems in unsafe situations


28 of 42

Systems engineering and best practices

  • Careful design with safety / assurance issues in mind from the start
  • Getting people to incorporate the best technical solutions and TEV&V tools
  • Systems engineering perspective would likely be very helpful, but further work is needed to adapt systems / software engineering approaches to AI
  • Training people not to use AI systems beyond what they're good for
  • Being aware of the dual-use nature of AI and developing / implementing best practices to prevent malicious use (a different issue from what we’ve been discussing)
    • Examples: deepfakes, terrorist use of drones, AI-powered cyber attacks, use by oppressive regimes
    • Possibly borrowing techniques and practices from other dual-use technologies, such as cybersecurity


29 of 42

Assuring the Machine Learning Lifecycle


30 of 42


31 of 42

Data management


32 of 42

Model learning


33 of 42

Model verification


34 of 42

Model deployment


35 of 42

Final notes

  • Some of these areas have received a significant amount of attention and research (e.g., adversarial examples, generalizability, safe exploration, interpretability), others not quite as much (e.g., avoiding side effects, reward hacking, verification & validation)
  • It's generally believed that if early programming languages such as C had been designed from the ground up with security in mind, then computer security today would be in a much stronger position
  • We are mostly still in the early days of the most recent batch of powerful ML techniques (mostly Deep Learning); we should probably build in safety / assurance and security from the ground up
  • Again, the military knows all this; if we want the military to adopt the AI technologies that we develop here, those technologies will need to be assured and secure


36 of 42

Research groups outside APL (partial list)


37 of 42

Primary reading


38 of 42

Partial bibliography: General / Literature Reviews


39 of 42

Partial bibliography: Technical AI Safety literature


40 of 42

Partial bibliography: Assured Autonomy literature


41 of 42

Partial bibliography: Misc.


42 of 42