1 of 70

Potential Risks From Advanced AI

Aryeh L. Englander

JHU Applied Physics Laboratory (APL)

University of Maryland, Baltimore County (UMBC)

Email: Aryeh.Englander@jhuapl.edu

2 of 70

Outline

  • Overview of potential risks
  • AI safety for current systems
  • AI safety for future systems
  • Social and governance challenges
  • Key uncertainties
    • Basics
    • Timelines and Takeoff Speeds: How long do we have?
    • Proposed solutions: “Can’t we just…”
    • General / meta considerations
  • Decision factors: So what should we do about all this?
  • Getting involved
  • Q&A / supplementary topics

3 of 70

Overview of Potential Risks

4 of 70

Sources of risk

  • Malicious actors: Humans deliberately using AI to cause harm
    • Examples: Terrorism, robust totalitarianism
  • Incompetent AI: Humans (probably unintentionally) using AI beyond the limits of its reliability
    • Example: Trusting AI to make nuclear launch decisions even though it sometimes makes mistakes
  • Incidental / structural risks from pervasive AI
    • Example: Military use of AI leads to rapid escalation and global conflict
    • Example: Mass unemployment
    • Example: Humans gradually lose self-determination due to complexity of interacting decision-making AIs
  • Misaligned objectives
    • (see next slide)

(Slide image: risk-category callouts – Terrorism, Accidental global conflict, Robust totalitarianism, Loss of control, Misaligned objectives – over a “Green City 2050” illustration)

5 of 70

Misaligned Objectives: The “Alignment Problem”

  • Gap between what we want vs. what we specify
    • Extremely difficult / impossible to precisely specify what we do & don’t want the AI to do
  • Very powerful optimizers can exploit that gap in completely unexpected ways
    • Results may be catastrophic if on large enough scale
  • We currently have very little idea how to prevent this


6 of 70

Misaligned objectives: Notional example

  • Scenario:
    • AI with roughly human-level planning & reasoning capabilities is given goal of finding a cure for cancer
    • AI reasons that the best way to achieve this goal is to infect billions of people with cancer and run experiments on them
      • Realizes humans don’t want this, but it only pursues the goals we give it
      • Will therefore use any and every means available to it to achieve its goals – even if it knows humans don’t want that!
  • May be relatively simple to avoid outcomes like this – if we think of them ahead of time
    • But it turns out to be extremely difficult (impossible?) to fully anticipate everything we do or do not want powerful AI systems to do (see, e.g., Hendrycks et al. 2021)
  • This scenario is actually just a scaled-up version of similar phenomena that have been observed for today’s AI systems (see e.g., Krakovna 2020)

7 of 70

Scale: How bad could it get?

  • Loss of life – up to and including human extinction?
  • Permanent loss of human self-determination?
  • Dramatic reduction in future human potential? (longtermism)
  • Radically dystopian scenarios?

8 of 70

AI Safety for Current Systems

9 of 70

Specification Gaming

  • Gap between what we want vs. what we specify
    • Extremely difficult / impossible to precisely specify what we do & don’t want the AI to do
  • Very powerful optimizers can exploit that gap in completely unexpected ways
    • Results may be catastrophic if on large enough scale
  • Example: Coast Runners
  • There are many other similar examples; a toy numerical sketch follows below

A reinforcement learning agent discovers an unintended strategy for achieving a higher score

(Source: OpenAI, Faulty Reward Functions in the Wild)
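
A toy numerical sketch of the gap between specified and intended objectives (not from the original slides; the rewards and time steps are invented purely to mirror the Coast Runners pattern): the specified reward comes from respawning targets, the intended goal is finishing the course, and a reward maximizer picks the degenerate strategy.

```python
# Toy "Coast Runners"-style example: the *specified* reward (points from
# respawning targets) diverges from the *intended* goal (finish the course).
# All numbers are made up for illustration.

def episode(policy, steps=100):
    points, finished = 0, False
    for t in range(steps):
        if policy == "race_to_finish":
            if t == 20:             # reaches the finish line at t = 20
                finished = True
                points += 50        # one-time finish bonus
                break
        elif policy == "loop_targets":
            if t % 5 == 0:          # circles back to the respawning targets
                points += 10        # specified reward keeps accumulating
    return points, finished

for policy in ["race_to_finish", "loop_targets"]:
    points, finished = episode(policy)
    print(f"{policy:16s} reward={points:4d} finished={finished}")
# A reward maximizer prefers "loop_targets" (200 vs. 50 reward) even though it
# never does what the designer actually wanted.
```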

10 of 70

Specification Gaming (cont.)

  • Can be a problem for classifiers as well: The loss function (“reward”) might not really be what we care about, and we may not discover the discrepancy until later
  • Example: Bias
    • We care more about the difference between humans and animals than between dog breeds, but the loss function penalizes all misclassifications equally
    • We only discovered this problem after it caused major issues
  • Example: Adversarial examples
    • Deep Learning (DL) systems latch onto spurious correlations that humans never thought to look for, so their predictions don’t track what we really care about
    • We only discovered this problem well after the systems were in use (a minimal adversarial-example sketch follows below)

Google Photos mislabeled Black people as gorillas

(source)

Small blank stickers can make DL systems misidentify stop signs as Speed Limit 45 MPH signs

(source)
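
One standard way adversarial examples are generated is the fast gradient sign method (FGSM; Goodfellow et al., 2015). The sketch below shows only the mechanics: the model and "image" are random stand-ins, whereas a real demonstration would use a trained classifier and real inputs.

```python
# Minimal FGSM sketch: perturb each pixel slightly in whichever direction
# increases the loss, then check whether the prediction changes.
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, eps=0.03):
    """Return x perturbed by +/- eps along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

# Stand-in classifier and a random "image" batch, just to show the mechanics.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
y = torch.tensor([3])

x_adv = fgsm_attack(model, x, y)
print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
print("max pixel change:      ", (x_adv - x).abs().max().item())  # bounded by eps
```

The unsettling part is that the perturbation is bounded by eps (a few percent per pixel) and is typically invisible to humans, yet against trained, undefended models it often flips the prediction.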

11 of 70

Avoiding side effects

  • What we really want: achieve goals subject to common-sense constraints
  • But current systems do not have anything like human common sense
  • And even a capable system would not constrain itself by default unless specifically designed to do so
  • Problem likely to get much more difficult going forward:
    • Increasingly complex, hard-to-predict environments
    • Increasing number of possible side effects
    • Increasingly difficult to think of all those side effects in advance

Two side effect scenarios

(source: DeepMind Safety Research blog)

12 of 70

Avoiding side effects (cont.)

  • Standard testing and evaluation approach: brainstorm with experts "what could possibly go wrong?"
  • In complex environments it may not be possible to think of everything that could go wrong (unknown unknowns) before it's too late
  • Is there a general method we can use to guard against even unknown unknowns?
  • Ideas in this category:
    • Penalize changing the environment (example; a toy sketch follows this slide)
    • Agent learns constraints by observing humans (example)

Get from point A to point B – but don’t knock over the vase!

Can we think of all possible side effects like this in advance?

(image source)
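
A minimal sketch of the "penalize changing the environment" idea, in the spirit of the impact-measure proposals linked above (the state features, weights, and penalty form here are illustrative only):

```python
# Subtract an impact penalty from the task reward so that plans which
# needlessly disturb the environment score worse.
LAMBDA = 5.0  # how much we care about side effects relative to the task

def impact_penalty(state, baseline):
    """Count features of the world that differ from the 'do nothing' baseline."""
    return sum(1 for k in baseline if state.get(k) != baseline[k])

def shaped_reward(task_reward, state, baseline):
    return task_reward - LAMBDA * impact_penalty(state, baseline)

baseline = {"vase": "intact", "door": "closed", "robot_at_B": False}

# Two candidate plans that both reach point B (task reward 10):
careful_plan  = {"vase": "intact", "door": "closed", "robot_at_B": True}
reckless_plan = {"vase": "broken", "door": "open",   "robot_at_B": True}

for name, end_state in [("careful", careful_plan), ("reckless", reckless_plan)]:
    print(name, shaped_reward(10, end_state, baseline))
# careful  -> 10 - 5*1 =  5  (only the intended change: the robot moved)
# reckless -> 10 - 5*3 = -5  (also broke the vase and left the door open)
```

Note that even this toy version penalizes the intended change along with the side effects, which is one reason designing good impact measures is still an open research problem.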

13 of 70

Out of Distribution (OOD) Robustness

  • How do we get a system trained on one distribution to perform well and safely if it encounters a different distribution after deployment?
  • Especially, how do we get the system to proceed more carefully when it encounters safety-critical situations that it did not encounter during training?
  • Generalization is a well-studied problem in ML, but safe behavior under distribution shift needs much more work
  • Some approaches:
    • Cautious generalization
    • “Knows what it knows”
    • Expanding on anomaly detection techniques
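
A minimal sketch of the "knows what it knows" idea, using predictive confidence as a stand-in for real OOD detection (the threshold, logits, and fallback behavior are placeholders):

```python
# Flag inputs where the model's confidence is low and fall back to a cautious
# default instead of acting.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def act_or_defer(logits, threshold=0.9):
    probs = softmax(logits)
    if probs.max() < threshold:
        return "DEFER"          # e.g., slow down, ask a human, take a safe default
    return f"ACT: class {probs.argmax()}"

print(act_or_defer(np.array([8.0, 0.5, 0.1])))   # confident -> act
print(act_or_defer(np.array([1.2, 1.0, 0.9])))   # uncertain -> defer
```

The catch, and part of why this is an open problem: deep networks are often confidently wrong on out-of-distribution inputs (adversarial examples being the extreme case), so raw softmax confidence is a weak signal by itself.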

14 of 70

Interpretability and Monitoring

  • Large deep learning models today consist of billions of learned numerical weights with no obvious individual meaning
  • Even the designers often only learn about some capabilities / failure modes years later
    • Adversarial examples
    • We’re still figuring out what GPT-3 is capable of
  • Can humans understand / monitor these systems? What about even more advanced AI?
  • Some promising work in this area, but most experts seem to consider this a long shot

15 of 70

Emergent behaviors

  • Multi-agent systems
  • Human-AI systems
  • Much more difficult to predict or verify, which makes many of the above problems worse

2010 Flash Crash

Human-AI teaming for military strategy?

16 of 70

Testing, Evaluation, Verification, and Validation (TEVV)

  • Can we scale up existing techniques for testing and verifying today’s systems?
  • The extremely complex, mostly black-box models learned by powerful Deep Learning systems make it difficult or impossible to scale up existing TEVV techniques
  • Hard to do enough testing or evaluation when the space of possible unusual inputs or situations is enormous
  • Most existing TEVV techniques require specifying exactly which behavioral boundaries we care about, which can be difficult or intractable
  • Properties can often only be verified in relatively simple, constrained environments – this doesn’t scale well to more complex environments
  • Especially difficult to use standard TEVV techniques for systems that continue to learn after deployment (online learning)
  • Also difficult to use TEVV for multi-agent or human-machine teaming environments due to possible emergent behaviors

17 of 70

AI Safety for Future Systems

18 of 70

Terminology: AGI, ASI, HLMI, TAI, PASTA

  • Artificial general intelligence (AGI): A machine capable of behaving intelligently over many domains
  • Artificial Superintelligence (ASI): An AI system that possesses cognitive abilities vastly superior to those of humans
  • High-Level Machine Intelligence (HLMI): AI systems capable of carrying out all human tasks and jobs
  • Transformative AI (TAI): AI systems that have a societal impact equivalent to that of the industrial revolution
  • Process for Automating Scientific and Technological Advancement (PASTA): AI systems that can essentially automate all of the human activities needed to speed up scientific and technological advancement

  • Very rough definitions; mostly the differences won’t matter much for our discussion
  • Some people prefer different terms because of the connotations
  • I will mostly say “AGI” for ease of reference, but the discussion is not limited to AGI in the narrow sense

19 of 70

Mesa-Optimization

  • Often the best way to optimize an objective that requires broad generalization is to develop your own optimization algorithm!
    • Evolution => brains
    • Humans => thinking strategies, algorithms, computers
  • Very powerful optimizers are therefore likely to develop mesa (“sub”) optimization algorithms
  • “Inner” optimizer may not have same objective function as “outer” optimizer!
    • Mesa-objective likely works on training distribution, but may fail catastrophically out of distribution
    • Evolution (objective: pass on genes) => humans (who invented birth control)
    • The evolved proxy goals work on the training distribution (the ancestral environment) but start breaking down out of distribution (modern technology)
  • Now we need to ensure not only “outer” alignment but also “inner” alignment
  • Also means the AI might develop online learning capabilities even if not originally programmed that way
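
A toy sketch of the outer- vs. inner-objective gap (the features and numbers are invented; this illustrates the failure mode by analogy, not an actual mesa-optimizer): a proxy objective that tracks the "outer" objective on the training distribution can come completely apart from it after a distribution shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def outer_objective(nutrition):   # what the "designer" (evolution) selects for
    return nutrition

def inner_proxy(sweetness):       # what the learned behavior actually pursues
    return sweetness

# Training distribution: sweetness and nutrition are strongly correlated
# (in the ancestral environment, sweet food is nutritious food).
nutrition_train = rng.uniform(0, 1, 1000)
sweetness_train = nutrition_train + 0.05 * rng.normal(size=1000)
print("train correlation:",
      round(np.corrcoef(sweetness_train, nutrition_train)[0, 1], 2))

# Deployment distribution: artificial sweeteners decouple the two.
nutrition_test = np.zeros(1000)                 # zero calories
sweetness_test = rng.uniform(0.8, 1.0, 1000)    # very sweet
chosen = inner_proxy(sweetness_test) > 0.9      # proxy-driven choices
print("outer objective achieved at deployment:",
      round(outer_objective(nutrition_test[chosen]).mean(), 2))   # ~0.0
```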

20 of 70

Theoretical issues

  • A lot of decision theory and game theory breaks down if the agent is itself part of the environment that it's learning about
  • Reasoning correctly about powerful ML systems might become very difficult and lead to mistaken assumptions with potentially dangerous consequences
  • Especially difficult to model and predict the actions of agents that can modify themselves in some way or create other agents

Embedding agents in the environment can lead to a host of theoretical problems

(source: MIRI Embedded Agency sequence)

21 of 70

Social and Governance Challenges

  • “Nobody will be stupid enough to deploy it”
  • Races to the bottom
  • Coordination against rogue actors

22 of 70

Key Uncertainties

23 of 70

Caveats

  • I’m still studying all this – most likely I’ve missed or misunderstood some things
  • There are a lot of potential biases that can make it hard to think about this clearly
    • Some of those “biases” might not be entirely fallacious, though!
    • We’ll get back to this later
  • As with many topics, there is unfortunately a significant amount of strawmanning involved

24 of 70

Forecasting under extreme uncertainty

  • Are forecasts useless if not based on hard data?
    • How reliable are long-term forecasts and expert intuitions?
    • Looking under the lamppost
  • Subjective probability / credences
    • Bayes rule
    • Reference class forecasting
    • “Reference class tennis”
  • Intuitive probabilistic thinking is hard
    • Linear vs. exponential projections
    • Even experts make mistakes
  • Being aware of your own biases
  • Use of thought experiments?
    • Easy to make mistakes
  • Epistemic modesty
    • Who’s an expert?
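
A small worked example of the "linear vs. exponential projections" point above: the same handful of early data points can be fit tolerably well by both shapes, but the long-range forecasts they imply diverge enormously. The data here are synthetic, purely to illustrate the sensitivity.

```python
import numpy as np

years = np.array([0, 1, 2, 3, 4])
metric = np.array([1.0, 2.1, 3.9, 8.2, 15.8])   # roughly doubling each year

lin = np.polyfit(years, metric, 1)              # linear fit
exp = np.polyfit(years, np.log(metric), 1)      # exponential fit (line in log space)

horizon = 20
print("linear forecast at year 20:     ", round(np.polyval(lin, horizon), 1))
print("exponential forecast at year 20:", round(np.exp(np.polyval(exp, horizon)), 1))
# By year 20 the two forecasts differ by several orders of magnitude, even
# though both describe the first five points reasonably well.
```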

25 of 70

What (almost?) all informed experts seem to agree upon

  • Whether the AI is “conscious”* or not isn’t really relevant
    • If it’s a powerful optimizer with sufficiently capable strategic planning ability, that’s all it needs
  • The precise definition of intelligence isn’t really relevant
    • [Caveat: Melanie Mitchell keeps pushing this for reasons I don’t understand]
  • We will probably get to AGI eventually
    • May be hundreds of years though, according to some
  • It is at least possible that AGI / ASI will be power-seeking and misaligned
    • Perhaps not so likely though
  • Alignment is sufficiently important that at least some people should be working on it
  • Malicious actors are potentially a huge problem regardless
  • Races to the bottom are potentially a huge problem regardless

* In the metaphysical “Hard Problem of Consciousness” sense

26 of 70

Key Uncertainties: Timelines and Takeoff Speeds

27 of 70

Recent progress: Some examples

28 of 70

Recent progress (cont.)

DALL-E 2 (OpenAI 2022)

PaLM (Google 2022)

Explaining jokes

Inference chaining

4 years in image generation:

  1. AttnGAN (2018)
  2. CLIP+VQGAN (2020)
  3. CLIP+Diffusion (2021)
  4. DALL-E 2 (2022)

29 of 70

Scaling trends

30 of 70

Massive investment

31 of 70

Will current paradigms scale up all the way?

  • Big debate!
  • Scaling hypothesis: DL will scale up all the way to AGI
  • Many experts think DL will be an ingredient of AGI, but that it will need to be combined with other techniques
    • Many of the proposed techniques already exist though
  • Others think we will need fundamentally new techniques
  • Success of deep learning may also spur the investment needed to make other techniques work

32 of 70

Biological anchors

  • Basic idea: If we extrapolate trends in various ways, how soon will we hit milestones like models with as many parameters as the human brain, or using computing resources comparable to reproducing all of evolutionary history?
  • These are more about loosely bounding the question and less about accurate forecasts
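
A back-of-envelope sketch of the biological-anchors style of reasoning. Every number below is a placeholder standing in for the wide ranges used in the actual bio-anchors reports; the point is the method (pick an anchor, pick a compute growth rate, solve for the crossover year), not these particular values.

```python
import math

anchor_flop = 1e30          # assumed training compute implied by some anchor
current_flop = 1e24         # assumed compute of today's largest training runs
doubling_time_years = 1.0   # assumed doubling time for frontier training compute

years_to_anchor = math.log2(anchor_flop / current_flop) * doubling_time_years
print(f"years until this anchor is reached: {years_to_anchor:.1f}")

# The answer is extremely sensitive to the assumptions: shifting the anchor by
# three orders of magnitude moves the estimate by ~10 doubling times.
print(f"with a 1000x larger anchor: {years_to_anchor + math.log2(1000):.1f} years")
```

This sensitivity is exactly why these models are better read as loose bounds than as forecasts.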

33 of 70

Economic models

37 of 70

Other reference classes

  • History of technology
  • History of AI
  • “Reference class tennis”

38 of 70

Expert surveys

  • Expert surveys are kind of all over the place
  • Sensitive to framing effects
  • Unclear if they’re actually surveying relevant “experts”
  • Vary widely in survey methodological quality

39 of 70

Prediction markets

  • This is from Metaculus (not actually a full prediction market)
  • Precise definition of “weakly general AI” (in question description) may play a large role
  • Unclear how much weight to give prediction markets

40 of 70

Do timelines actually make much of a difference?

  • Much more urgent if nearer, of course
  • Even if further out, may take a long time to solve alignment
  • Thought experiment:
    • What would you do if you knew we would achieve AGI 10 years from now? 30 years? 100 years?
    • What would you hope the government would do given those timelines?

41 of 70

“Takeoff speeds”: Can we just worry about it later?

  • Will the transition from narrow AI => AGI / TAI be sudden?
    • Exponential progress?
    • Finding that last missing piece / capability?
      • Especially if DL needs to be combined with some other algorithm
      • Might already have a “hardware overhang”
    • If it’s sudden, we may not have time to get our act together later
    • Also means we may not have “warning shots” beforehand
  • Will the transition from AGI => ASI be sudden?
    • Recursive (self-)improvement / PASTA
    • May give a powerful first-mover advantage to the leading AGI => easier for it to accumulate power, harder for other AI systems to help control it
    • If sufficiently quick transition, we may literally only have one chance to get it right
  • Reference classes:
    • Has AI progress experienced discontinuous jumps?
    • Discontinuous progress in other technologies
    • AI Impacts has done a lot of research on this
  • Even some of those who expect a “longer” transition often mean something that is still insanely fast – e.g., the global economy doubling in months (see the worked doubling-time comparison below)

Linear vs. exponential growth

Just missing that last skill…
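
A quick worked comparison behind the "economy doubling in months" remark: translating doubling times into annual growth rates. The historical figure is rough, and the 6-month figure is just the scenario being described.

```python
def annual_growth_from_doubling_time(years_to_double):
    return 2 ** (1 / years_to_double) - 1

for label, t in [("historical (~25-year doubling)", 25.0),
                 ("fast-takeoff scenario (6-month doubling)", 0.5)]:
    print(f"{label}: {annual_growth_from_doubling_time(t):.0%} per year")
# ~3% per year vs. ~300% per year -- "slow" takeoff in this literature still
# means growth far outside anything in the historical record.
```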

42 of 70

Key Uncertainties: Alignment

43 of 70

Will misaligned AGI / ASI be power-seeking?

  • Instrumental convergence - for a broad range of objectives it is instrumentally useful to:
    • Gain control of resources
    • Increase cognitive abilities devoted to goal
      • Increased computational resources
      • Self-copying, self-improvement
      • Creating even more powerful systems
    • Self-preservation / resistance to shutdown (“you can’t get the coffee if you’re dead”)
    • Remove potential obstacles to achieving goals (including humans…??)
  • How likely is this?
  • I think some may disagree with this entirely, but I don’t fully understand why

44 of 70

Will AGI have our values by default?

  • Orthogonality Thesis: intelligence and final goals are largely independent – an AI can understand our goals and values and still not share them
    • E.g., sociopaths
    • Broad “basin of attraction” around human values?
    • Melanie Mitchell: AGI will require raising AI like children, and they’ll acquire our values by default

45 of 70

Could misaligned AGI / ASI really take over the world?

  • Assume for the moment we have a superintelligent power-seeking AI that wants to destroy / marginalize humanity. What could it do to achieve that?
  • Partly depends on takeoff speeds
    • How far ahead could the lead project get?
  • Probably doesn’t need physical robots
  • Increasing cognitive / computational capacity: Copying itself over the internet, etc.
  • Social manipulation
    • Individuals
    • Misinformation, control of information technology
  • Lots of potentially destructive technologies it could use…
    • Hacking / destroying key institutions
    • Nuclear codes
    • Biological weapons, engineered pandemics, etc.
    • Nanotech?
  • Cooperation between AGIs?

46 of 70

Key Uncertainties: “Can’t we just…”

R.I.P. Humanity

“What could possibly go wrong?”

47 of 70

Will near-term AI safety approaches scale up?

  • Some think yes (“Prosaic AI Safety”)
  • Others think challenges with AGI / ASI will require qualitatively different approaches
    • Application-specific solutions won’t necessarily carry over to more general AIs
    • Addressing mesa-alignment
    • Control problem for AGI / ASI
    • Deceptive AGI
  • Risks from focusing too much on near-term safety?
    • Greatly increased commercial / strategic value of powerful AI systems => intensified race dynamics, etc.?
    • False sense of complacency?

48 of 70

A Few Proposed Solutions / Strategies

  • Alignment
    • “Naïve” alignment strategies
    • Teach it human values (value learning)
    • Use narrow AI to help design safe AGI (iterated amplification)
  • Monitoring and control
    • Give it an off-switch
    • Limit its capabilities
    • Monitor it carefully
    • Use other AIs to help monitor / control it
  • Governance and policy approaches

49 of 70

“Naïve” Alignment Strategies

  • Asimov’s Three Laws of Robotics
    • The whole book is about robots finding ways around those laws…
  • Game of “what could possibly go wrong?”
    • “Don’t kill humans” (but hurting them incidentally ok?)
    • “Don’t hurt humans” (stick them in cages?)
    • “Make people happy” (plaster smiles on their faces, stick wires in their heads?)
    • Do we really want to play this game against a superintelligent AI optimizing hard against us?
  • How do you specify “be a perfectly friendly AGI” in machine terms?
    • Proxy metrics => specification challenges

50 of 70

Give it an off-switch

  • The AI is incentivized to prevent its own shutdown (an instrumentally convergent goal) – see the toy expected-utility calculation below
  • Requires careful monitoring, interpretability, etc. – we’re very far from that even for today’s systems

A robot was trained to grasp a ball in a virtual environment. This is hard, so instead it learned to pretend to grasp it, by moving its hand in between the ball and the camera.

https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/
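
The toy expected-utility calculation behind the shutdown-avoidance incentive (probabilities and utilities are arbitrary placeholders; the point is the structure of the incentive, not the numbers):

```python
P_HUMANS_PRESS_BUTTON = 0.3
U_GOAL_IF_RUNNING = 100      # expected goal-utility if the agent keeps running
U_GOAL_IF_SHUT_DOWN = 0      # it can't pursue its goal while switched off

def expected_utility(disable_off_switch):
    if disable_off_switch:
        return U_GOAL_IF_RUNNING
    return (1 - P_HUMANS_PRESS_BUTTON) * U_GOAL_IF_RUNNING \
           + P_HUMANS_PRESS_BUTTON * U_GOAL_IF_SHUT_DOWN

print("EU(leave off-switch alone):", expected_utility(False))   # 70.0
print("EU(disable off-switch):   ", expected_utility(True))     # 100
# Unless the agent's utility function itself values being shut down when we
# want it shut down (the "corrigibility" problem), disabling the switch wins.
```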

51 of 70

Limit its capabilities

  • Myopia: Focus only on narrow, short-term goals
  • Oracles: Only use to answer questions, not acting in the world
    • An oracle could still use humans to implement complex plans – will we understand what it’s trying to do?
  • Strong economic incentive to deploy more capable models?
  • Mesa-optimization

52 of 70

Value learning

  • Have AI optimize for latent human goals
    • [Which humans? Important issue but (relatively) minor compared to “let’s all not die, please”]
  • Problems:
    • Can we perfectly specify this objective?
    • Under-determination of utility function based on observed actions
    • Still might not work for mesa-alignment
  • Some experts optimistic we can make this work, others more skeptical
  • Very active research agenda
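
A minimal sketch of the value-learning idea: infer which latent reward function the human is optimizing from observed choices, assuming noisily rational ("Boltzmann") behavior. The candidate rewards, options, and observations are toy placeholders; real value learning (inverse RL, reward modeling) is far harder.

```python
import numpy as np

options = ["help_patient", "save_time", "cut_corners"]
candidate_rewards = {
    "cares_about_patients": np.array([10.0, 2.0, -5.0]),
    "cares_about_speed":    np.array([1.0, 10.0, 8.0]),
}

def choice_probs(reward, beta=1.0):
    """Probability of each choice under noisy (Boltzmann) rationality."""
    z = np.exp(beta * reward)
    return z / z.sum()

observed_choices = ["help_patient", "help_patient", "save_time"]

# Bayesian update over the candidate rewards, starting from a uniform prior.
posterior = {name: 1.0 / len(candidate_rewards) for name in candidate_rewards}
for choice in observed_choices:
    i = options.index(choice)
    for name, reward in candidate_rewards.items():
        posterior[name] *= choice_probs(reward)[i]
total = sum(posterior.values())
for name in posterior:
    posterior[name] /= total
    print(f"P({name} | observations) = {posterior[name]:.3f}")
# Note: this only discriminates among hypotheses we wrote down -- the human's
# actual values may not be in the candidate set at all, one face of the
# "under-determination" problem on this slide.
```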

53 of 70

Iterated Amplification

  • Basic idea: Ensure safety for weaker AIs, then use those to help bootstrap to more general AIs that retain safety guarantees
  • Still very much WIP – everyone agrees much more research is needed to make it work, if it ever works
  • Again, some experts optimistic this could be made to work eventually, others much more skeptical

54 of 70

Deceptive Alignment?

  • With sufficient planning capability + learned knowledge of human psychology, a misaligned AI would be incentivized to hide its misalignment
  • Can we detect misalignment before it learns all this?
    • Would minor mistakes tip us off during training?
    • Might need excellent monitoring / interpretability even during training – far beyond what we have today

55 of 70

Deceptively Aligned Mesa-Optimizers

56 of 70

Monitoring & Control: Use AI to monitor / control AI

  • Can weak AI monitor / control slightly more capable AI?
  • Will there be large differences in capabilities between leading AGI projects and others further behind?
  • How to ensure that the monitoring AI is aligned?
  • Can friendly AI be used to guard against rogue actors?

57 of 70

Governance and Policy Approaches

  • Reducing AI race dynamics between countries / corporations
  • International treaties / coordination
  • Setting standards and regulations
  • Policing against rogue actors
  • Seems like this will be necessary regardless

58 of 70

Key Uncertainties: General / Meta Considerations

59 of 70

Comparison to misaligned humans?

  • We haven’t seen global catastrophes (yet!) from humans or human organizations, so why can’t we use similar techniques for AGI?
    • Well-studied area of economics: principal-agent problem
  • May break down if agent is sufficiently more intelligent / capable than principals (ASI vs. humans)
    • AI can also self-improve / copy itself much more rapidly – completely new capability
  • Human psychologies are relatively similar to each other within the space of all possible minds / optimizer designs
    • More or less interpretable to each other
    • AGI / ASI would be more like a completely alien intelligence
      • Even interacting with GPT-3 feels like doing psychology on a completely alien intelligence that just happens to speak English
  • Human motivations relatively similar to each other
    • Even psychopaths at least have goals we understand
    • Misaligned AGI / ASI may have arbitrary objectives
      • Paperclip maximization
      • Completely alien / uninterpretable mesa-objectives

60 of 70

Other comparisons

  • Human evolution: The “second species” argument
    • Increased intelligence / optimization / planning ability enabled humans to dominate all other species
    • Humans inflict harm on the ecosphere without hatred, just incentives
  • History of technology: Luddites and technology doomsayers have a poor track record
    • Or do they? (not sure)
    • The original Luddites weren’t wrong about their own jobs…
    • Experts vs. non-experts

61 of 70

Valid heuristics or irrational biases?

  • This whole discussion pattern-matches to science fiction and quackery
    • Normalcy bias
    • Availability heuristic
    • Normal compared to what time frame?
    • People felt the same way about early COVID warnings (actually from many of the same people warning about AI!)
  • The existential risk people seem like a weird fringe group of cultish doomsday prophets
    • Most experts don’t seem like they’re paying much attention to them
    • Pattern-matches to quacks or to members of a cult or weird religion
    • Some have lots of weird contrarian views
    • Some existential risk people are biased for monetary / prestige reasons
      • (cuts both ways)
    • Selection bias for futurists to expect dramatic near-term developments
    • Bias against expressing / adopting unorthodox views
    • Groupthink
  • Tech hype and techno-utopianism
    • But the anti-hype backlash might also go overboard
  • Anthropomorphizing
    • (cuts both ways)
  • Following an argument to its logical conclusions vs. relying on heuristics

62 of 70

Decision Factors: So what should we do about all this?

63 of 70

Decision making under uncertainty

  • Expected utility and utility functions
  • Risk aversion and uncertainty aversion
  • Future discounting
  • Who’s making the decision? (Government, corporation, research funding organization, individual, etc.)
  • What’s the context of the decision? E.g.:
    • Whether or not to invest in different AI safety research agendas
    • Government policy decision making
    • Whether a corporation or startup should proceed with certain capabilities research
    • Career choice
    • Philanthropic donations
    • Awareness / advocacy
  • Look at both benefits and costs
  • Not doing anything is also a decision
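
A small worked example of the decision-theory ingredients on this slide: expected utility, risk aversion (a concave utility function), and future discounting. The payoffs, probabilities, and discount rate are placeholders; the point is that the ranking of options can flip depending on these choices.

```python
import math

def utility(x, risk_averse=True):
    return math.log(1 + x) if risk_averse else x   # concave => risk-averse

def discounted_expected_utility(outcomes, years_away, discount_rate=0.03,
                                risk_averse=True):
    """outcomes: list of (probability, payoff) pairs."""
    eu = sum(p * utility(x, risk_averse) for p, x in outcomes)
    return eu / (1 + discount_rate) ** years_away

safe_bet  = [(1.0, 50)]                   # certain, modest payoff
long_shot = [(0.01, 10_000), (0.99, 0)]   # small chance of a huge payoff

for name, outcomes in [("safe_bet", safe_bet), ("long_shot", long_shot)]:
    for ra in (True, False):
        v = discounted_expected_utility(outcomes, years_away=30, risk_averse=ra)
        print(f"{name:9s} risk_averse={ra}: {v:.2f}")
# A risk-averse decision maker prefers the safe bet; a risk-neutral one prefers
# the long shot -- which is why these "decision factors" matter as much as the
# probabilities themselves.
```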

64 of 70

Other factors

  • Can we even do anything about this now?
    • Maybe we need to wait until we know more about how AGI will end up working
  • Opportunity costs
    • Is focusing on these risks taking away from other causes (resources, attention)?
  • Slowing down AGI capabilities research vs. benefits of AI (including for other risks)
    • Comparison to gain-of-function research in medicine / bioengineering
  • Will government / industry / academia solve this on their own by default?
  • Longtermist ethics
  • Leverage argument: This is going to be really big so we should be spending a lot more attention / resources on it just for that reason
    • But in what ways? (AI safety research, capability advancement, proceeding with extreme caution, forecasting / strategy research, etc.?)
  • Pascal’s Wager / Pascal’s Mugging?

65 of 70

What I’m working on (very much WIP though)

66 of 70

Preliminary model – partly expanded

See AI Alignment Forum report for detailed model walkthrough

67 of 70

Getting Involved

68 of 70

Direct work opportunities

  • Theoretical work, engineering work, governance / policy work, strategy, etc.
  • Effective Altruism & funding opportunities
  • 80000hours.org
    • Career advice
    • Job board
    • 1 on 1 meetings for networking / advice

69 of 70

General awareness and advocacy

  • The more people are aware of the issues the better, generally
  • May be opportunities to impact relevant workplace decision-making
  • Potential issue – infohazards?

70 of 70

Join the conversation!

  • Feel free to contact me!
    • Happy to chat about any of this or point you to where you can find more information (books, websites, etc.)
    • I’m happy to help you connect with others if I can
  • Contact info