1 of 42

Introduction to AI Safety

Aryeh L. Englander

AMDS / A4I

2 of 42

Overview


3 of 42

What do we mean by Technical AI Safety?

  • Critical systems: systems whose failure may lead to injury or loss of life, damage to the environment, unauthorized disclosure of information, or serious financial losses
  • Safety-critical systems: systems whose failure may result in injury, loss of life, or serious environmental damage
  • Technical AI safety: designing safety-critical AI systems (and more broadly, critical AI systems) in ways that guard against accident risks – i.e., harms arising from AI systems behaving in unintended ways



4 of 42

Other related concerns

  • Security against exploits by adversaries
    • Often considered part of AI Safety
  • Misuse: people using AI in unethical or malicious ways
    • Ex: deepfakes, terrorism, suppression of dissent
  • Machine ethics
    • Designing AI systems to make ethical decisions
    • Debate over lethal autonomous weapons
  • Structural risks from AI shaping the environment in subtle ways
    • Ex: job loss, increased risks of arms races
  • Governance, strategy, and policy
    • Should government regulate AI?
    • Who should be held accountable?
    • How do we coordinate with other governments and stakeholders to prevent risks?
  • AI forecasting and risk analysis
    • When are these concerns likely to materialize?
    • How concerned should we be?


Adversarial examples: fooling AI into thinking a stop sign is a 45 mph sign

Potential terrorist use of lethal fully autonomous drones

(image source, based on a report from the OECD)

Jobs at risk of automation by AI

5 of 42

AI Safety research communities

  • Two related research communities: AI Safety, Assured Autonomy
  • AI Safety
    • Focus on long-term risks from roughly human-level AI or beyond
    • Also focused on near-term concerns that may scale up / provide insight into long-term issues
    • Relatively new field – past 10 years or so
    • Becoming progressively more mainstream
      • Many leading AI researchers have expressed strong support for the research
      • AI Safety research groups set up at several major universities and AI companies
  • Assured Autonomy
    • Older, established community with broader focus on assuring autonomous systems in general
    • Recently started looking at challenges posed by machine learning
    • Current and near-term focus
  • In the past year both communities have finally started trying to collaborate and work out a shared research landscape and vision
  • APL’s focus: near- and mid-term concerns, though ideally our research would also scale up to longer-term concerns


6 of 42

AI Safety: Lots of ways to frame conceptually

  • Many different ways to divide up the problem space, and many different research agendas from different organizations
  • It can get pretty complicated


AI Safety Landscape overview from the Future of Life Institute (FLI)

Connections between different research agendas

(Source: Everitt et al., AGI Safety Literature Review)

7 of 42

AI Safety: DeepMind’s conceptual framework


8 of 42

Assured Autonomy: AAIP conceptual framework


AAIP = Assuring Autonomy International Programme (University of York)

9 of 42

Combined framework

  • This is the proposed framework for combining AI Safety and Assured Autonomy research communities
  • Also tries to address relevant topics from the AI Ethics, Security and Privacy communities
  • Until now these communities haven’t been talking to each other as much as they should
  • Still in development; AAAI 2020 has a full-day workshop on this
  • Personal opinion: I like that it’s general, but I think it’s a bit too general – best used only for very abstract overviews of the field


= focus of AI Safety / DeepMind framework

= focus of Assured Autonomy / AAIP framework

10 of 42

My personal preference

  • Problems that scale up to the long term: DeepMind framework
  • Near-term machine learning: AAIP framework
  • Everything else: combined framework

11 of 42

AI safety concerns and APL’s mission areas

  • All of APL’s mission areas involve safety- or mission-critical systems
  • The military is concerned with assurance rather than safety (obviously, military systems are unsafe for the enemy), but the two concepts are very similar and involve similar problems and solutions
  • The government is very aware of these problems, and this is part of why the military has been reluctant to adopt AI technologies
  • If we want to get the military to adopt the AI technologies we develop here, those technologies will need to be assured and secure


12 of 42

Technical AI Safety


13 of 42


14 of 42

Specification problems

  • These problems arise when there is a gap (often very subtle and unnoticed) between what we really want and what the system is actually optimizing for
  • Powerful optimizers can find surprising and sometimes undesirable solutions for objectives that are even subtly mis-specified
  • Often extremely difficult or impossible to fully specify everything we really want
  • Some examples:
    • Specification gaming
    • Avoiding side effects
    • Unintended emergent behaviors
    • Bugs and errors


15 of 42

Specification: Specification Gaming

  • Agent exploits a flaw in the specification
  • Powerful optimizers can find extremely novel and potentially harmful solutions
  • Example: evolved radio
  • Example: Coast Runners
  • There are many other similar examples; the toy sketch below illustrates the basic pattern
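
Below is a minimal toy sketch of this pattern (the racing-style environment, reward values, and policies are hypothetical, not taken from the evolved radio or Coast Runners examples): when the specified reward is only a proxy for the true objective, the policy that maximizes the specified reward can be the one that scores worst on what we actually wanted.

```python
# Hypothetical racing-style environment: the true objective is finishing the
# course, but the specified reward only counts targets hit along the way.
HORIZON = 100          # episode length in steps
TARGET_REWARD = 1.0    # specified reward per target hit
FINISH_REWARD = 10.0   # specified reward for crossing the finish line

policies = {
    # name: (targets hit per episode, finishes the course?)
    "intended: drive straight to the finish": (3, True),
    "gaming: loop over respawning targets": (HORIZON, False),
}

for name, (targets_hit, finished) in policies.items():
    specified = targets_hit * TARGET_REWARD + (FINISH_REWARD if finished else 0.0)
    true_objective = 1.0 if finished else 0.0   # what the designer actually wanted
    print(f"{name:42s} specified={specified:6.1f}  true={true_objective}")

# The "gaming" policy earns 100.0 under the specified reward but 0.0 on the true
# objective, while the intended policy earns only 13.0; a pure reward maximizer
# will therefore pick the gaming behavior.
```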


The evolvable motherboard that led to the evolved radio

A reinforcement learning agent discovers an unintended strategy for achieving a higher score

(Source: OpenAI, Faulty Reward Functions in the Wild)

16 of 42

Specification: Specification Gaming (cont.)

  • Can be a problem for classifiers as well: The loss function (“reward”) might not really be what we care about, and we may not discover the discrepancy until later
  • Example: Bias
    • We care about the difference between humans and animals more than between breeds of dogs, but the loss function weights all confusions equally (see the cost-matrix sketch after this list)
    • We only discovered this problem after it caused major issues
  • Example: Adversarial examples
    • Deep Learning (DL) systems discovered weird correlations that humans never thought to look for, so predictions don’t match what we really care about
    • We only discovered this problem well after the systems were in use
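
A minimal sketch of one way to narrow that gap (the classes, cost values, and probability vector below are made up for illustration): replace a uniform loss with a cost matrix that encodes which confusions we actually care about, so the training signal sits closer to the true objective.

```python
import numpy as np

# Standard cross-entropy treats every confusion equally; a cost matrix lets us
# say that some mistakes matter far more than others. Classes and costs here
# are hypothetical.
classes = ["person", "labrador", "poodle"]
# cost[i][j] = cost of predicting class j when the true class is i
cost = np.array([
    [0.0, 10.0, 10.0],   # misclassifying a person as any dog breed is severe
    [10.0, 0.0,  0.5],   # confusing two dog breeds is comparatively minor
    [10.0, 0.5,  0.0],
])

def expected_cost(true_idx, predicted_probs):
    """Expected misclassification cost under the model's predicted distribution."""
    return float(cost[true_idx] @ np.asarray(predicted_probs))

probs = [0.05, 0.15, 0.80]   # model output for an image whose true class is "labrador"
print(f"expected cost when the truth is {classes[1]}: {expected_cost(1, probs):.2f}")
# 0.05*10 + 0.15*0 + 0.80*0.5 = 0.90
```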


Google images misidentified black people as gorillas

(source)

Blank labels can make DL systems misidentify stop signs as Speed Limit 45 MPH signs

(source)

17 of 42

Specification: Avoiding side effects

  • What we really want: achieve goals subject to common sense constraints
  • But current systems do not have anything like human common sense
  • In any case, a system would not constrain itself by default unless specifically programmed to do so
  • Problem likely to get much more difficult going forward:
    • Increasingly complex, hard-to-predict environments
    • Increasing number of possible side effects
    • Increasingly difficult to think of all those side effects in advance


Two side effect scenarios

(source: DeepMind Safety Research blog)

18 of 42

Specification: Avoiding side effects (cont.)

  • Standard TEV&V approach: brainstorm with experts "what could possibly go wrong?"
  • In complex environments it might not be possible to anticipate everything that could go wrong (unknown unknowns) until it's too late
  • Is there a general method we can use to guard against even unknown unknowns?
  • Ideas in this category
    • Penalize changing the environment (example; a toy impact-penalty sketch follows this list)
    • Agent learns constraints by observing humans (example)
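
A minimal sketch of the first idea (the state representation, the distance measure, and the beta weight are all hypothetical): shape the reward with a penalty proportional to how far the world has drifted from a do-nothing baseline, so that side effects cost the agent something even when nobody anticipated them in advance. Published proposals use much more careful baselines and distance measures than this, but the reward-shaping structure is the same.

```python
def state_distance(state_a, state_b):
    """Hypothetical measure of how different two world states are."""
    return sum(1 for a, b in zip(state_a, state_b) if a != b)

def shaped_reward(task_reward, current_state, baseline_state, beta=0.3):
    """Task reward minus an impact penalty; beta trades task progress against impact."""
    return task_reward - beta * state_distance(current_state, baseline_state)

# The robot reaches the goal (+1 task reward) but knocked over the vase on the
# way, so one state feature now differs from the do-nothing baseline.
baseline = ("vase_upright", "door_closed")
after    = ("vase_broken",  "door_closed")
print(shaped_reward(1.0, after, baseline))      # 0.7: goal reached, but penalized
print(shaped_reward(1.0, baseline, baseline))   # 1.0: goal reached with no impact
```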


Get from point A to point B – but don’t knock over the vase!

Can we think of all possible side effects like this in advance?

(image source)

19 of 42

Specification: Other problems


OpenAI’s hide and seek AI agents demonstrated surprising emergent behaviors (source)

  • Emergent behaviors
    • E.g., multi-agent systems, human-AI teams
    • Makes the system much more difficult to predict and verify, which makes many of the problems above worse
  • Bugs and errors
    • Can be even harder to find and correct logic errors in complex ML systems (especially Deep Learning) than in regular software systems
    • (See later on TEV&V)

20 of 42

Robustness problems

  • How to ensure that the system continues to operate within safe limits upon perturbation
  • Some examples:
    • Distributional shift / generalization
    • Safe exploration
    • Security


21 of 42

Robustness: Distributional shift / generalization

  • How do we get a system trained on one distribution to perform well and safely if it encounters a different distribution after deployment?
  • Especially, how do we get the system to proceed more carefully when it encounters safety-critical situations that it did not encounter during training?
  • Generalization is a well-known problem in ML, but more work needs to be done
  • Some approaches:
    • Cautious generalization
    • “Knows what it knows” (a crude confidence-threshold sketch follows this list)
    • Expanding on anomaly detection techniques
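
A crude sketch of the “knows what it knows” direction (the threshold and probability vectors are made up, and max-softmax confidence is only a weak out-of-distribution signal in practice): the system abstains and defers to a human or a safe fallback whenever its predictive confidence is low.

```python
import numpy as np

def predict_or_defer(probs, threshold=0.9):
    """Return the predicted class index, or defer when confidence is too low."""
    probs = np.asarray(probs)
    if probs.max() < threshold:
        return "DEFER"                 # looks unlike anything seen in training
    return int(probs.argmax())

print(predict_or_defer([0.97, 0.02, 0.01]))   # confident -> class 0
print(predict_or_defer([0.40, 0.35, 0.25]))   # uncertain -> "DEFER"
```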


22 of 42

Robustness: Safe exploration

  • If an RL agent uses online learning or needs to train in a real-world environment, then the exploration itself needs to be safe
  • Example: A self-driving car can't learn by experimenting with swerving onto sidewalks (see the action-filtering sketch after this list)
  • Restricting learning to a controlled, safe environment might not provide sufficient training for some applications
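
One common mitigation, sketched below with a hypothetical action set and safety predicate: filter exploratory actions through a hard safety check before they ever reach the real environment, so random exploration can never select a known-unsafe action. Note that this only covers hazards we can specify in advance.

```python
import random

ACTIONS = ["steer_left", "steer_right", "brake", "swerve_onto_sidewalk"]

def is_safe(action, state):
    # Stand-in safety check; in practice this would be a vetted constraint
    # model rather than a hand-written blacklist.
    return action != "swerve_onto_sidewalk"

def explore(state, epsilon=0.1, greedy_action="brake"):
    """Epsilon-greedy exploration restricted to actions that pass the safety check."""
    safe_actions = [a for a in ACTIONS if is_safe(a, state)]
    if random.random() < epsilon:
        return random.choice(safe_actions)
    return greedy_action

print(explore(state={}, epsilon=1.0))   # random, but never "swerve_onto_sidewalk"
```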


How do we tell a cleaning robot not to experiment with sticking wet brooms into sockets during training?

(image source)

23 of 42

Robustness: Security

  • (Security is sometimes considered part of safety / assurance, and sometimes separate)
  • ML systems pose unique security challenges
  • Data poisoning: Adversaries can corrupt the training data, leading to undesirable results
  • Adversarial examples: Adversaries can use tricks to fool ML systems (an FGSM-style sketch follows this list)
  • Privacy and classified information: By probing ML systems, adversaries may be able to uncover private or classified information that was used during training
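
A minimal adversarial-example sketch (the logistic-regression weights, the input, and epsilon are made up): the fast gradient sign method (FGSM) nudges each input feature in the direction that increases the loss, which can flip the model's decision with a perturbation a human would consider insignificant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -2.0, 0.5]), 0.1       # toy logistic-regression "model"

def predict(x):
    return sigmoid(w @ x + b)                # probability of class 1

x = np.array([0.2, 0.6, 0.4])                # score -0.6 -> class 0

# For true label y = 0 the gradient of the logistic loss w.r.t. x points along
# w, so the FGSM perturbation is simply epsilon * sign(w).
epsilon = 0.35
x_adv = x + epsilon * np.sign(w)

print(round(float(predict(x)), 2))       # 0.35 -> classified as class 0
print(round(float(predict(x_adv)), 2))   # 0.69 -> flipped to class 1
```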


What if an adversary fools an AI into thinking a school bus is a tank?

24 of 42

Monitoring and Control

  • (DeepMind calls this Assurance, but that’s confusing since we’ve also been discussing Assured Autonomy)
  • Interpretability: Many ML systems (esp. DL) are mostly black boxes
  • Scalable oversight: It can be very difficult to provide oversight of increasingly autonomous and complex agents
  • Human override: We need to be able to shut down the system if needed (a minimal override-layer sketch follows this list)
    • Building in mechanisms to do this is often difficult
    • If the operator is part of the environment that the system learns about, the AI could conceivably learn policies that try to avoid the human shutting it down
      • “You can't get the cup of coffee if you're dead”
      • Example: robot blocks camera to avoid being shut off
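
A minimal sketch of an override layer (the interfaces and action names are hypothetical): every action the learned policy proposes passes through a wrapper in which a human stop command always wins. Building the switch itself is the easy part; making sure the learner has no incentive to avoid the interruption is the harder, open problem noted above.

```python
class OverrideWrapper:
    """Hypothetical wrapper: human commands always preempt the learned policy."""

    def __init__(self, policy):
        self.policy = policy

    def act(self, observation, human_command=None):
        if human_command == "STOP":
            return "safe_shutdown"        # hard override: shut down safely
        if human_command is not None:
            return human_command          # manual takeover for one step
        return self.policy(observation)   # normal autonomous operation

agent = OverrideWrapper(policy=lambda obs: "continue_task")
print(agent.act({}, human_command=None))     # "continue_task"
print(agent.act({}, human_command="STOP"))   # "safe_shutdown"
```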


25 of 42

Scaling up testing, evaluation, verification, and validation

  • The extremely complex, mostly black-box models learned by powerful Deep Learning systems make it difficult or impossible to scale up existing TEV&V techniques
  • Hard to do enough testing or evaluation when the space of possible unusual inputs and situations is huge (one partial mitigation, randomized property checking, is sketched after this list)
  • Most existing TEV&V techniques require specifying exactly which boundaries we care about, which can be difficult or intractable
  • Systems can often only be verified in relatively simple, constrained environments – this doesn't scale up well to more complex environments
  • Especially difficult to use standard TEV&V techniques for systems that continue to learn after deployment (online learning)
  • Also difficult to use TEV&V for multi-agent or human-machine teaming environments due to possible emergent behaviors
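
When exhaustively specifying the input space is intractable, one partial, purely empirical tool is randomized property checking, sketched below with a stand-in classifier and a made-up perturbation budget: sample many small perturbations of an input and flag cases where the prediction flips.

```python
import random

def model(x):
    # Stand-in for a trained classifier.
    return "stop_sign" if sum(x) > 1.0 else "speed_limit"

def perturb(x, budget=0.05):
    return [xi + random.uniform(-budget, budget) for xi in x]

def robustness_check(x, trials=1000):
    """Count how often a small random perturbation changes the prediction."""
    baseline = model(x)
    failures = [p for p in (perturb(x) for _ in range(trials)) if model(p) != baseline]
    return len(failures), failures[:3]    # flip count plus a few counterexamples

print(robustness_check([0.52, 0.51]))     # inputs near the decision boundary flip often
```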


26 of 42

Theoretical issues

  • A lot of decision theory and game theory breaks down if the agent is itself part of the environment that it's learning about
  • Reasoning correctly about powerful ML systems might become very difficult and lead to mistaken assumptions with potentially dangerous consequences
  • Especially difficult to model and predict the actions of agents that can modify themselves in some way or create other agents


Embedding agents in the environment can lead to a host of theoretical problems

(source: MIRI Embedded Agency sequence)

27 of 42

Human-AI teaming

  • Understanding the boundaries: often even the system designers don't really understand where the system does or doesn't work
  • Example: Researchers didn’t discover the problem of adversarial examples until well after the systems were already in use; it took several more years to understand the causes of the problem (and it’s still debated)
  • Humans (even the designers) sometimes anthropomorphize too much and therefore use faulty “machine theories of mind” – current ML systems do not process data and information in the same way humans do
  • Can lead to people trusting AI systems in unsafe situations


28 of 42

Systems engineering and best practices

  • Careful design with safety / assurance issues in mind from the start
  • Getting people to incorporate the best technical solutions and TEV&V tools
  • Systems engineering perspective would likely be very helpful, but further work is needed to adapt systems / software engineering approaches to AI
  • Training people not to use AI systems beyond what they're good for
  • Being aware of the dual-use nature of AI and developing / implementing best practices to prevent malicious use (a different issue from what we’ve been discussing)
    • Examples: deepfakes, terrorist use of drones, AI-powered cyber attacks, use by oppressive regimes
    • Possibly borrowing techniques and practices from other dual-use technologies, such as cybersecurity


29 of 42

Assuring the Machine Learning Lifecycle


30 of 42


31 of 42

Data management


32 of 42

Model learning


33 of 42

Model verification


34 of 42

Model deployment


35 of 42

Final notes

  • Some of these areas have received a significant amount of attention and research (e.g., adversarial examples, generalizability, safe exploration, interpretability), others not quite as much (e.g., avoiding side effects, reward hacking, verification & validation)
  • It's generally believed that if early programming languages such as C had been designed from the ground up with security in mind, then computer security today would be in a much stronger position
  • We are mostly still in the early days of the most recent batch of powerful ML techniques (mostly Deep Learning); we should probably build in safety / assurance and security from the ground up
  • Again, the military knows all this; if we want the military to adopt the AI technologies that we develop here, those technologies will need to be assured and secure


36 of 42

Research groups outside APL (partial list)


37 of 42

Primary reading


38 of 42

Partial bibliography: General / Literature Reviews


39 of 42

Partial bibliography: Technical AI Safety literature


40 of 42

Partial bibliography: Assured Autonomy literature


41 of 42

Partial bibliography: Misc.


42 of 42