1 of 70

Potential Risks From Advanced AI

Aryeh L. Englander

JHU Applied Physics Laboratory (APL)

University of Maryland, Baltimore County (UMBC)

Email: Aryeh.Englander@jhuapl.edu

2 of 70

Outline

  • Overview of potential risks
  • AI safety for current systems
  • AI safety for future systems
  • Social and governance challenges
  • Key uncertainties
    • Basics
    • Timelines and Takeoff Speeds: How long do we have?
    • Proposed solutions: “Can’t we just…”
    • General / meta considerations
  • Decision factors: So what should we do about all this?
  • Getting involved
  • Q&A / supplementary topics

3 of 70

Overview of Potential Risks

4 of 70

Sources of risk

  • Malicious actors: Humans deliberately using AI to cause harm
    • Examples: Terrorism, robust totalitarianism
  • Incompetent AI: Humans (probably unintentionally) using AI beyond the limits of its reliability
    • Example: Trusting AI to make nuclear launch decisions even though it sometimes makes mistakes
  • Incidental / structural risks from pervasive AI
    • Example: Military use of AI leads to rapid escalation and global conflict
    • Example: Mass unemployment
    • Example: Humans gradually lose self-determination due to complexity of interacting decision-making AIs
  • Misaligned objectives
    • (see next slide)

(Slide image: risk-category callouts – Terrorism, Accidental global conflict, Robust totalitarianism, Loss of control, Misaligned objectives – over a “Green City 2050” illustration)

5 of 70

Misaligned Objectives: The “Alignment Problem”

  • Gap between what we want vs. what we specify
    • Extremely difficult / impossible to precisely specify what we do & don’t want the AI to do
  • Very powerful optimizers can exploit that gap in completely unexpected ways
    • Results may be catastrophic if on large enough scale
  • We currently have very little idea how to prevent this


6 of 70

Misaligned objectives: Notional example

  • Scenario:
    • AI with roughly human-level planning & reasoning capabilities is given goal of finding a cure for cancer
    • AI reasons that the best way to achieve this goal is to infect billions of people with cancer and run experiments on them
      • Realizes humans don’t want this, but it only pursues the goals we give it
      • Will therefore use any and every means available to it to achieve its goals – even if it knows humans don’t want that!
  • May be relatively simple to avoid outcomes like this – if we think of them ahead of time
    • But it turns out to be extremely difficult (impossible?) to fully anticipate everything we do or do not want powerful AI systems to do (see, e.g., Hendrycks et al. 2021)
  • This scenario is actually just a scaled-up version of similar phenomena that have been observed for today’s AI systems (see e.g., Krakovna 2020)

7 of 70

Scale: How bad could it get?

  • Loss of life – up to and including human extinction?
  • Permanent loss of human self-determination?
  • Dramatic reduction in future human potential? (longtermism)
  • Radically dystopian scenarios?

8 of 70

AI Safety for Current Systems

9 of 70

Specification Gaming

  • Gap between what we want vs. what we specify
    • Extremely difficult / impossible to precisely specify what we do & don’t want the AI to do
  • Very powerful optimizers can exploit that gap in completely unexpected ways
    • Results may be catastrophic if on large enough scale
  • Example: Coast Runners
  • There are many other similar examples; a toy numerical sketch follows below

A reinforcement learning agent discovers an unintended strategy for achieving a higher score

(Source: OpenAI, Faulty Reward Functions in the Wild)
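
A toy numerical sketch of the gap between specified and intended objectives (not from the original slides; the rewards and time steps are invented purely to mirror the Coast Runners pattern): the specified reward comes from respawning targets, the intended goal is finishing the course, and a reward maximizer picks the degenerate strategy.

```python
# Toy "Coast Runners"-style example: the *specified* reward (points from
# respawning targets) diverges from the *intended* goal (finish the course).
# All numbers are made up for illustration.

def episode(policy, steps=100):
    points, finished = 0, False
    for t in range(steps):
        if policy == "race_to_finish":
            if t == 20:             # reaches the finish line at t = 20
                finished = True
                points += 50        # one-time finish bonus
                break
        elif policy == "loop_targets":
            if t % 5 == 0:          # circles back to the respawning targets
                points += 10        # specified reward keeps accumulating
    return points, finished

for policy in ["race_to_finish", "loop_targets"]:
    points, finished = episode(policy)
    print(f"{policy:16s} reward={points:4d} finished={finished}")
# A reward maximizer prefers "loop_targets" (200 vs. 50 reward) even though it
# never does what the designer actually wanted.
```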

10 of 70

Specification Gaming (cont.)

  • Can be a problem for classifiers as well: The loss function (“reward”) might not really be what we care about, and we may not discover the discrepancy until later
  • Example: Bias
    • We care more about the difference between humans and animals than between dog breeds, but the loss function penalizes all misclassifications equally
    • We only discovered this problem after it caused major issues
  • Example: Adversarial examples
    • Deep Learning (DL) systems latch onto spurious correlations that humans never thought to look for, so their predictions don’t track what we really care about
    • We only discovered this problem well after the systems were in use (a minimal adversarial-example sketch follows below)

Google Photos mislabeled Black people as gorillas

(source)

Small blank stickers can make DL systems misidentify stop signs as Speed Limit 45 MPH signs

(source)
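
One standard way adversarial examples are generated is the fast gradient sign method (FGSM; Goodfellow et al., 2015). The sketch below shows only the mechanics: the model and "image" are random stand-ins, whereas a real demonstration would use a trained classifier and real inputs.

```python
# Minimal FGSM sketch: perturb each pixel slightly in whichever direction
# increases the loss, then check whether the prediction changes.
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, eps=0.03):
    """Return x perturbed by +/- eps along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

# Stand-in classifier and a random "image" batch, just to show the mechanics.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
y = torch.tensor([3])

x_adv = fgsm_attack(model, x, y)
print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
print("max pixel change:      ", (x_adv - x).abs().max().item())  # bounded by eps
```

The unsettling part is that the perturbation is bounded by eps (a few percent per pixel) and is typically invisible to humans, yet against trained, undefended models it often flips the prediction.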

11 of 70

Avoiding side effects

  • What we really want: achieve goals subject to common-sense constraints
  • But current systems do not have anything like human common sense
  • And even a capable system would not constrain itself by default unless specifically designed to do so
  • Problem likely to get much more difficult going forward:
    • Increasingly complex, hard-to-predict environments
    • Increasing number of possible side effects
    • Increasingly difficult to think of all those side effects in advance

Two side effect scenarios

(source: DeepMind Safety Research blog)

12 of 70

Avoiding side effects (cont.)

  • Standard testing and evaluation approach: brainstorm with experts "what could possibly go wrong?"
  • In complex environments it may not be possible to think of everything that could go wrong (unknown unknowns) before it's too late
  • Is there a general method we can use to guard against even unknown unknowns?
  • Ideas in this category:
    • Penalize changing the environment (example; a toy sketch follows this slide)
    • Agent learns constraints by observing humans (example)

Get from point A to point B – but don’t knock over the vase!

Can we think of all possible side effects like this in advance?

(image source)
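
A minimal sketch of the "penalize changing the environment" idea, in the spirit of the impact-measure proposals linked above (the state features, weights, and penalty form here are illustrative only):

```python
# Subtract an impact penalty from the task reward so that plans which
# needlessly disturb the environment score worse.
LAMBDA = 5.0  # how much we care about side effects relative to the task

def impact_penalty(state, baseline):
    """Count features of the world that differ from the 'do nothing' baseline."""
    return sum(1 for k in baseline if state.get(k) != baseline[k])

def shaped_reward(task_reward, state, baseline):
    return task_reward - LAMBDA * impact_penalty(state, baseline)

baseline = {"vase": "intact", "door": "closed", "robot_at_B": False}

# Two candidate plans that both reach point B (task reward 10):
careful_plan  = {"vase": "intact", "door": "closed", "robot_at_B": True}
reckless_plan = {"vase": "broken", "door": "open",   "robot_at_B": True}

for name, end_state in [("careful", careful_plan), ("reckless", reckless_plan)]:
    print(name, shaped_reward(10, end_state, baseline))
# careful  -> 10 - 5*1 =  5  (only the intended change: the robot moved)
# reckless -> 10 - 5*3 = -5  (also broke the vase and left the door open)
```

Note that even this toy version penalizes the intended change along with the side effects, which is one reason designing good impact measures is still an open research problem.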

13 of 70

Out of Distribution (OOD) Robustness

  • How do we get a system trained on one distribution to perform well and safely if it encounters a different distribution after deployment?
  • Especially, how do we get the system to proceed more carefully when it encounters safety-critical situations that it did not encounter during training?
  • Generalization is a well-studied problem in ML, but safe behavior under distribution shift needs much more work
  • Some approaches:
    • Cautious generalization
    • “Knows what it knows”
    • Expanding on anomaly detection techniques
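
A minimal sketch of the "knows what it knows" idea, using predictive confidence as a stand-in for real OOD detection (the threshold, logits, and fallback behavior are placeholders):

```python
# Flag inputs where the model's confidence is low and fall back to a cautious
# default instead of acting.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def act_or_defer(logits, threshold=0.9):
    probs = softmax(logits)
    if probs.max() < threshold:
        return "DEFER"          # e.g., slow down, ask a human, take a safe default
    return f"ACT: class {probs.argmax()}"

print(act_or_defer(np.array([8.0, 0.5, 0.1])))   # confident -> act
print(act_or_defer(np.array([1.2, 1.0, 0.9])))   # uncertain -> defer
```

The catch, and part of why this is an open problem: deep networks are often confidently wrong on out-of-distribution inputs (adversarial examples being the extreme case), so raw softmax confidence is a weak signal by itself.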

14 of 70

Interpretability and Monitoring

  • Large deep learning models today consist of billions of learned numerical weights with no obvious individual meaning
  • Even the designers often only learn about some capabilities / failure modes years later
    • Adversarial examples
    • We’re still figuring out what GPT-3 is capable of
  • Can humans understand / monitor these systems? What about even more advanced AI?
  • Some promising work in this area, but most experts seem to consider this a long shot

15 of 70

Emergent behaviors

  • Multi-agent systems
  • Human-AI systems
  • Much more difficult to predict or verify, which makes many of the above problems worse

2010 Flash Crash

Human-AI teaming for military strategy?

16 of 70

Testing, Evaluation, Verification, and Validation (TEVV)

  • Can we scale up existing techniques for testing and verifying today’s systems?
  • The extremely complex, mostly black-box models learned by powerful Deep Learning systems make it difficult or impossible to scale up existing TEVV techniques
  • Hard to do enough testing or evaluation when the space of possible unusual inputs or situations is enormous
  • Most existing TEVV techniques require specifying exactly which behavioral boundaries we care about, which can be difficult or intractable
  • Properties can often only be verified in relatively simple, constrained environments – this doesn’t scale well to more complex environments
  • Especially difficult to use standard TEVV techniques for systems that continue to learn after deployment (online learning)
  • Also difficult to use TEVV for multi-agent or human-machine teaming environments due to possible emergent behaviors

17 of 70

AI Safety for Future Systems

18 of 70

Terminology: AGI, ASI, HLMI, TAI, PASTA

  • Artificial general intelligence (AGI): A machine capable of behaving intelligently over many domains
  • Artificial Superintelligence (ASI): An AI system that possesses cognitive abilities vastly superior to those of humans
  • High-Level Machine Intelligence (HLMI): AI systems capable of carrying out all human tasks and jobs
  • Transformative AI (TAI): AI systems that have a societal impact equivalent to that of the industrial revolution
  • Process for Automating Scientific and Technological Advancement (PASTA): AI systems that can essentially automate all of the human activities needed to speed up scientific and technological advancement

  • Very rough definitions; mostly the differences won’t matter much for our discussion
  • Some people prefer different terms because of the connotations
  • I will mostly say “AGI” for ease of reference, but the discussion is not limited to AGI in the narrow sense

19 of 70

Mesa-Optimization

  • Often the best way to optimize an objective that requires broad generalization is to develop your own optimization algorithm!
    • Evolution => brains
    • Humans => thinking strategies, algorithms, computers
  • Very powerful optimizers are therefore likely to develop mesa (“sub”) optimization algorithms
  • “Inner” optimizer may not have same objective function as “outer” optimizer!
    • Mesa-objective likely works on training distribution, but may fail catastrophically out of distribution
    • Evolution (objective: pass on genes) => humans (who invented birth control)
    • The evolved proxy goals work on the training distribution (the ancestral environment) but start breaking down out of distribution (modern technology)
  • Now we need to ensure not only “outer” alignment but also “inner” alignment
  • Also means the AI might develop online learning capabilities even if not originally programmed that way
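
A toy sketch of the outer- vs. inner-objective gap (the features and numbers are invented; this illustrates the failure mode by analogy, not an actual mesa-optimizer): a proxy objective that tracks the "outer" objective on the training distribution can come completely apart from it after a distribution shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def outer_objective(nutrition):   # what the "designer" (evolution) selects for
    return nutrition

def inner_proxy(sweetness):       # what the learned behavior actually pursues
    return sweetness

# Training distribution: sweetness and nutrition are strongly correlated
# (in the ancestral environment, sweet food is nutritious food).
nutrition_train = rng.uniform(0, 1, 1000)
sweetness_train = nutrition_train + 0.05 * rng.normal(size=1000)
print("train correlation:",
      round(np.corrcoef(sweetness_train, nutrition_train)[0, 1], 2))

# Deployment distribution: artificial sweeteners decouple the two.
nutrition_test = np.zeros(1000)                 # zero calories
sweetness_test = rng.uniform(0.8, 1.0, 1000)    # very sweet
chosen = inner_proxy(sweetness_test) > 0.9      # proxy-driven choices
print("outer objective achieved at deployment:",
      round(outer_objective(nutrition_test[chosen]).mean(), 2))   # ~0.0
```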

20 of 70

Theoretical issues

  • A lot of decision theory and game theory breaks down if the agent is itself part of the environment that it's learning about
  • Reasoning correctly about powerful ML systems might become very difficult and lead to mistaken assumptions with potentially dangerous consequences
  • Especially difficult to model and predict the actions of agents that can modify themselves in some way or create other agents

Embedding agents in the environment can lead to a host of theoretical problems

(source: MIRI Embedded Agency sequence)

21 of 70

Social and Governance Challenges

  • “Nobody will be stupid enough to deploy it”
  • Races to the bottom
  • Coordination against rogue actors

22 of 70

Key Uncertainties

23 of 70

Caveats

  • I’m still studying all this – most likely I’ve missed or misunderstood some things
  • There are a lot of potential biases that can make it hard to think about this clearly
    • Some of those “biases” might not be entirely fallacious, though!
    • We’ll get back to this later
  • As with many topics, there is unfortunately a significant amount of strawmanning involved

24 of 70

Forecasting under extreme uncertainty

  • Are forecasts useless if not based on hard data?
    • How reliable are long-term forecasts and expert intuitions?
    • Looking under the lamppost
  • Subjective probability / credences
    • Bayes rule
    • Reference class forecasting
    • “Reference class tennis”
  • Intuitive probabilistic thinking is hard
    • Linear vs. exponential projections
    • Even experts make mistakes
  • Being aware of your own biases
  • Use of thought experiments?
    • Easy to make mistakes
  • Epistemic modesty
    • Who’s an expert?
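
A small worked example of the "linear vs. exponential projections" point above: the same handful of early data points can be fit tolerably well by both shapes, but the long-range forecasts they imply diverge enormously. The data here are synthetic, purely to illustrate the sensitivity.

```python
import numpy as np

years = np.array([0, 1, 2, 3, 4])
metric = np.array([1.0, 2.1, 3.9, 8.2, 15.8])   # roughly doubling each year

lin = np.polyfit(years, metric, 1)              # linear fit
exp = np.polyfit(years, np.log(metric), 1)      # exponential fit (line in log space)

horizon = 20
print("linear forecast at year 20:     ", round(np.polyval(lin, horizon), 1))
print("exponential forecast at year 20:", round(np.exp(np.polyval(exp, horizon)), 1))
# By year 20 the two forecasts differ by several orders of magnitude, even
# though both describe the first five points reasonably well.
```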

25 of 70

What (almost?) all informed experts seem to agree upon

  • Whether the AI is “conscious”* or not isn’t really relevant
    • If it’s a powerful optimizer with sufficiently capable strategic planning ability, that’s all it needs
  • The precise definition of intelligence isn’t really relevant
    • [Caveat: Melanie Mitchell keeps pushing this for reasons I don’t understand]
  • We will probably get to AGI eventually
    • May be hundreds of years though, according to some
  • It is at least possible that AGI / ASI will be power-seeking and misaligned
    • Perhaps not so likely though
  • Alignment is sufficiently important that at least some people should be working on it
  • Malicious actors are potentially a huge problem regardless
  • Races to the bottom are potentially a huge problem regardless

* In the metaphysical “Hard Problem of Consciousness” sense

26 of 70

Key Uncertainties: Timelines and Takeoff Speeds

27 of 70

Recent progress: Some examples

28 of 70

Recent progress (cont.)

DALL-E 2 (OpenAI 2022)

PaLM (Google 2022)

Explaining jokes

Inference chaining

4 years in image generation:

  1. AttnGAN (2018)
  2. CLIP+VQGAN (2020)
  3. CLIP+Diffusion (2021)
  4. DALL-E 2 (2022)

29 of 70

Scaling trends

30 of 70

Massive investment

31 of 70

Will current paradigms scale up all the way?

  • Big debate!
  • Scaling hypothesis: DL will scale up all the way to AGI
  • Many experts think DL will be an ingredient of AGI, but that it will need to be combined with other techniques
    • Many of the proposed techniques already exist though
  • Others think we will need fundamentally new techniques
  • Success of deep learning may also spur the investment needed to make other techniques work

32 of 70

Biological anchors

  • Basic idea: If we extrapolate trends in various ways, how soon will we hit milestones like models with as many parameters as the human brain, or using computing resources comparable to reproducing all of evolutionary history?
  • These are more about loosely bounding the question and less about accurate forecasts
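
A back-of-envelope sketch of the biological-anchors style of reasoning. Every number below is a placeholder standing in for the wide ranges used in the actual bio-anchors reports; the point is the method (pick an anchor, pick a compute growth rate, solve for the crossover year), not these particular values.

```python
import math

anchor_flop = 1e30          # assumed training compute implied by some anchor
current_flop = 1e24         # assumed compute of today's largest training runs
doubling_time_years = 1.0   # assumed doubling time for frontier training compute

years_to_anchor = math.log2(anchor_flop / current_flop) * doubling_time_years
print(f"years until this anchor is reached: {years_to_anchor:.1f}")

# The answer is extremely sensitive to the assumptions: shifting the anchor by
# three orders of magnitude moves the estimate by ~10 doubling times.
print(f"with a 1000x larger anchor: {years_to_anchor + math.log2(1000):.1f} years")
```

This sensitivity is exactly why these models are better read as loose bounds than as forecasts.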

33 of 70

Economic models

37 of 70

Other reference classes

  • History of technology
  • History of AI
  • “Reference class tennis”

38 of 70

Expert surveys

  • Expert surveys are kind of all over the place
  • Sensitive to framing effects
  • Unclear if they’re actually surveying relevant “experts”
  • Vary widely in survey methodological quality

39 of 70

Prediction markets

  • This is from Metaculus (not actually a full prediction market)
  • Precise definition of “weakly general AI” (in question description) may play a large role
  • Unclear how much weight to give prediction markets

40 of 70

Do timelines actually make much of a difference?

  • Much more urgent if nearer, of course
  • Even if further out, may take a long time to solve alignment
  • Thought experiment:
    • What would you do if you knew we would achieve AGI 10 years from now? 30 years? 100 years?
    • What would you hope the government would do given those timelines?

41 of 70

“Takeoff speeds”: Can we just worry about it later?

  • Will the transition from narrow AI => AGI / TAI be sudden?
    • Exponential progress?
    • Finding that last missing piece / capability?
      • Especially if DL needs to be combined with some other algorithm
      • Might already have a “hardware overhang”
    • If it’s sudden, we may not have time to get our act together later
    • Also means we may not have “warning shots” beforehand
  • Will the transition from AGI => ASI be sudden?
    • Recursive (self-)improvement / PASTA
    • May give a powerful first-mover advantage to the leading AGI => easier for it to accumulate power, harder for other AI systems to help control it
    • If sufficiently quick transition, we may literally only have one chance to get it right
  • Reference classes:
    • Has AI progress experienced discontinuous jumps?
    • Discontinuous progress in other technologies
    • AI Impacts has done a lot of research on this
  • Even some of those who expect a “longer” transition often mean something that is still insanely fast – e.g., the global economy doubling in months (see the worked doubling-time comparison below)

Linear vs. exponential growth

Just missing that last skill…
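
A quick worked comparison behind the "economy doubling in months" remark: translating doubling times into annual growth rates. The historical figure is rough, and the 6-month figure is just the scenario being described.

```python
def annual_growth_from_doubling_time(years_to_double):
    return 2 ** (1 / years_to_double) - 1

for label, t in [("historical (~25-year doubling)", 25.0),
                 ("fast-takeoff scenario (6-month doubling)", 0.5)]:
    print(f"{label}: {annual_growth_from_doubling_time(t):.0%} per year")
# ~3% per year vs. ~300% per year -- "slow" takeoff in this literature still
# means growth far outside anything in the historical record.
```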

42 of 70

Key Uncertainties: Alignment

43 of 70

Will misaligned AGI / ASI be power-seeking?

  • Instrumental convergence - for a broad range of objectives it is instrumentally useful to:
    • Gain control of resources
    • Increase cognitive abilities devoted to goal
      • Increased computational resources
      • Self-copying, self-improvement
      • Creating even more powerful systems
    • Self-preservation / resistance to shutdown (“you can’t get the coffee if you’re dead”)
    • Remove potential obstacles to achieving goals (including humans…??)
  • How likely is this?
  • I think some may disagree with this entirely, but I don’t fully understand why

44 of 70

Will AGI have our values by default?

  • Orthogonality Thesis: intelligence and final goals are largely independent – an AI can understand our goals and values and still not share them
    • E.g., sociopaths
    • Broad “basin of attraction” around human values?
    • Melanie Mitchell: AGI will require raising AI like children, and they’ll acquire our values by default

45 of 70

Could misaligned AGI / ASI really take over the world?

  • Assume for the moment we have a superintelligent power-seeking AI that wants to destroy / marginalize humanity. What could it do to achieve that?
  • Partly depends on takeoff speeds
    • How far ahead could the lead project get?
  • Probably doesn’t need physical robots
  • Increasing cognitive / computational capacity: Copying itself over the internet, etc.
  • Social manipulation
    • Individuals
    • Misinformation, control of information technology
  • Lots of potentially destructive technologies it could use…
    • Hacking / destroying key institutions
    • Nuclear codes
    • Biological weapons, engineered pandemics, etc.
    • Nanotech?
  • Cooperation between AGIs?

46 of 70

Key Uncertainties: “Can’t we just…”

R.I.P. Humanity

“What could possibly go wrong?”

47 of 70

Will near-term AI safety approaches scale up?

  • Some think yes (“Prosaic AI Safety”)
  • Others think challenges with AGI / ASI will require qualitatively different approaches
    • Application-specific solutions won’t necessarily carry over to more general AIs
    • Addressing mesa-alignment
    • Control problem for AGI / ASI
    • Deceptive AGI
  • Risks from focusing too much on near-term safety?
    • Greatly increased commercial / strategic value of powerful AI systems => intensified race dynamics, etc.?
    • False sense of complacency?

48 of 70

A Few Proposed Solutions / Strategies

  • Alignment
    • “Naïve” alignment strategies
    • Teach it human values (value learning)
    • Use narrow AI to help design safe AGI (iterated amplification)
  • Monitoring and control
    • Give it an off-switch
    • Limit its capabilities
    • Monitor it carefully
    • Use other AIs to help monitor / control it
  • Governance and policy approaches

49 of 70

“Naïve” Alignment Strategies

  • Asimov’s Three Laws of Robotics
    • The whole book is about robots finding ways around those laws…
  • Game of “what could possibly go wrong?”
    • “Don’t kill humans” (but hurting them incidentally ok?)
    • “Don’t hurt humans” (stick them in cages?)
    • “Make people happy” (plaster smiles on their faces, stick wires in their heads?)
    • Do we really want to play this game against a superintelligent AI optimizing hard against us?
  • How do you specify “be a perfectly friendly AGI” in machine terms?
    • Proxy metrics => specification challenges

50 of 70

Give it an off-switch

  • The AI is incentivized to prevent its own shutdown (an instrumentally convergent goal) – see the toy expected-utility calculation below
  • Requires careful monitoring, interpretability, etc. – we’re very far from that even for today’s systems

A robot was trained to grasp a ball in a virtual environment. This is hard, so instead it learned to pretend to grasp it, by moving its hand in between the ball and the camera.

https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/
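
The toy expected-utility calculation behind the shutdown-avoidance incentive (probabilities and utilities are arbitrary placeholders; the point is the structure of the incentive, not the numbers):

```python
P_HUMANS_PRESS_BUTTON = 0.3
U_GOAL_IF_RUNNING = 100      # expected goal-utility if the agent keeps running
U_GOAL_IF_SHUT_DOWN = 0      # it can't pursue its goal while switched off

def expected_utility(disable_off_switch):
    if disable_off_switch:
        return U_GOAL_IF_RUNNING
    return (1 - P_HUMANS_PRESS_BUTTON) * U_GOAL_IF_RUNNING \
           + P_HUMANS_PRESS_BUTTON * U_GOAL_IF_SHUT_DOWN

print("EU(leave off-switch alone):", expected_utility(False))   # 70.0
print("EU(disable off-switch):   ", expected_utility(True))     # 100
# Unless the agent's utility function itself values being shut down when we
# want it shut down (the "corrigibility" problem), disabling the switch wins.
```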

51 of 70

Limit its capabilities

  • Myopia: Focus only on narrow, short-term goals
  • Oracles: Only use to answer questions, not acting in the world
    • An oracle could still use humans to implement complex plans – will we understand what it’s trying to do?
  • Strong economic incentive to deploy more capable models?
  • Mesa-optimization

52 of 70

Value learning

  • Have AI optimize for latent human goals
    • [Which humans? Important issue but (relatively) minor compared to “let’s all not die, please”]
  • Problems:
    • Can we perfectly specify this objective?
    • Under-determination of utility function based on observed actions
    • Still might not work for mesa-alignment
  • Some experts optimistic we can make this work, others more skeptical
  • Very active research agenda
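
A minimal sketch of the value-learning idea: infer which latent reward function the human is optimizing from observed choices, assuming noisily rational ("Boltzmann") behavior. The candidate rewards, options, and observations are toy placeholders; real value learning (inverse RL, reward modeling) is far harder.

```python
import numpy as np

options = ["help_patient", "save_time", "cut_corners"]
candidate_rewards = {
    "cares_about_patients": np.array([10.0, 2.0, -5.0]),
    "cares_about_speed":    np.array([1.0, 10.0, 8.0]),
}

def choice_probs(reward, beta=1.0):
    """Probability of each choice under noisy (Boltzmann) rationality."""
    z = np.exp(beta * reward)
    return z / z.sum()

observed_choices = ["help_patient", "help_patient", "save_time"]

# Bayesian update over the candidate rewards, starting from a uniform prior.
posterior = {name: 1.0 / len(candidate_rewards) for name in candidate_rewards}
for choice in observed_choices:
    i = options.index(choice)
    for name, reward in candidate_rewards.items():
        posterior[name] *= choice_probs(reward)[i]
total = sum(posterior.values())
for name in posterior:
    posterior[name] /= total
    print(f"P({name} | observations) = {posterior[name]:.3f}")
# Note: this only discriminates among hypotheses we wrote down -- the human's
# actual values may not be in the candidate set at all, one face of the
# "under-determination" problem on this slide.
```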

53 of 70

Iterated Amplification

  • Basic idea: Ensure safety for weaker AIs, then use those to help bootstrap to more general AIs that retain safety guarantees
  • Still very much WIP – everyone agrees much more research is needed to make it work, if it ever works
  • Again, some experts optimistic this could be made to work eventually, others much more skeptical

54 of 70

Deceptive Alignment?

  • With sufficient planning capability + learned knowledge of human psychology, a misaligned AI would be incentivized to hide its misalignment
  • Can we detect misalignment before it learns all this?
    • Would minor mistakes tip us off during training?
    • Might need excellent monitoring / interpretability even during training – far beyond what we have today

55 of 70

Deceptively Aligned Mesa-Optimizers

56 of 70

Monitoring & Control: Use AI to monitor / control AI

  • Can weak AI monitor / control slightly more capable AI?
  • Will there be large differences in capabilities between leading AGI projects and others further behind?
  • How to ensure that the monitoring AI is aligned?
  • Can friendly AI be used to guard against rogue actors?

57 of 70

Governance and Policy Approaches

  • Reducing AI race dynamics between countries / corporations
  • International treaties / coordination
  • Setting standards and regulations
  • Policing against rogue actors
  • Seems like this will be necessary regardless

58 of 70

Key Uncertainties: General / Meta Considerations

59 of 70

Comparison to misaligned humans?

  • We haven’t seen global catastrophes (yet!) from humans or human organizations, so why can’t we use similar techniques for AGI?
    • Well-studied area of economics: principal-agent problem
  • May break down if agent is sufficiently more intelligent / capable than principals (ASI vs. humans)
    • AI can also self-improve / copy itself much more rapidly – completely new capability
  • Human psychologies are relatively similar to each other within the space of all possible minds / optimizer designs
    • More or less interpretable to each other
    • AGI / ASI would be more like a completely alien intelligence
      • Even interacting with GPT-3 feels like doing psychology on a completely alien intelligence that just happens to speak English
  • Human motivations relatively similar to each other
    • Even psychopaths at least have goals we understand
    • Misaligned AGI / ASI may have arbitrary objectives
      • Paperclip maximization
      • Completely alien / uninterpretable mesa-objectives

60 of 70

Other comparisons

  • Human evolution: The “second species” argument
    • Increased intelligence / optimization / planning ability enabled humans to dominate all other species
    • Humans inflict harm on the ecosphere without hatred, just incentives
  • History of technology: Luddites and technology doomsayers have a poor track record
    • Or do they? (not sure)
    • The original Luddites weren’t wrong about their own jobs…
    • Experts vs. non-experts

61 of 70

Valid heuristics or irrational biases?

  • This whole discussion pattern-matches to science fiction and quackery
    • Normalcy bias
    • Availability heuristic
    • Normal compared to what time frame?
    • People felt the same way about early COVID warnings (actually from many of the same people warning about AI!)
  • The existential risk people seem like a weird fringe group of cultish doomsday prophets
    • Most experts don’t seem like they’re paying much attention to them
    • Pattern-matches to quacks or to members of a cult or weird religion
    • Some have lots of weird contrarian views
    • Some existential risk people are biased for monetary / prestige reasons
      • (cuts both ways)
    • Selection bias for futurists to expect dramatic near-term developments
    • Bias against expressing / adopting unorthodox views
    • Groupthink
  • Tech hype and techno-utopianism
    • But the anti-hype backlash might also go overboard
  • Anthropomorphizing
    • (cuts both ways)
  • Following an argument to its logical conclusions vs. relying on heuristics

62 of 70

Decision Factors: So what should we do about all this?

63 of 70

Decision making under uncertainty

  • Expected utility and utility functions
  • Risk aversion and uncertainty aversion
  • Future discounting
  • Who’s making the decision? (Government, corporation, research funding organization, individual, etc.)
  • What’s the context of the decision? E.g.:
    • Whether or not to invest in different AI safety research agendas
    • Government policy decision making
    • Whether a corporation or startup should proceed with certain capabilities research
    • Career choice
    • Philanthropic donations
    • Awareness / advocacy
  • Look at both benefits and costs
  • Not doing anything is also a decision
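
A small worked example of the decision-theory ingredients on this slide: expected utility, risk aversion (a concave utility function), and future discounting. The payoffs, probabilities, and discount rate are placeholders; the point is that the ranking of options can flip depending on these choices.

```python
import math

def utility(x, risk_averse=True):
    return math.log(1 + x) if risk_averse else x   # concave => risk-averse

def discounted_expected_utility(outcomes, years_away, discount_rate=0.03,
                                risk_averse=True):
    """outcomes: list of (probability, payoff) pairs."""
    eu = sum(p * utility(x, risk_averse) for p, x in outcomes)
    return eu / (1 + discount_rate) ** years_away

safe_bet  = [(1.0, 50)]                   # certain, modest payoff
long_shot = [(0.01, 10_000), (0.99, 0)]   # small chance of a huge payoff

for name, outcomes in [("safe_bet", safe_bet), ("long_shot", long_shot)]:
    for ra in (True, False):
        v = discounted_expected_utility(outcomes, years_away=30, risk_averse=ra)
        print(f"{name:9s} risk_averse={ra}: {v:.2f}")
# A risk-averse decision maker prefers the safe bet; a risk-neutral one prefers
# the long shot -- which is why these "decision factors" matter as much as the
# probabilities themselves.
```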

64 of 70

Other factors

  • Can we even do anything about this now?
    • Maybe we need to wait until we know more about how AGI will end up working
  • Opportunity costs
    • Is focusing on these risks taking away from other causes (resources, attention)?
  • Slowing down AGI capabilities research vs. benefits of AI (including for other risks)
    • Comparison to gain-of-function research in medicine / bioengineering
  • Will government / industry / academia solve this on their own by default?
  • Longtermist ethics
  • Leverage argument: This is going to be really big so we should be spending a lot more attention / resources on it just for that reason
    • But in what ways? (AI safety research, capability advancement, proceeding with extreme caution, forecasting / strategy research, etc.?)
  • Pascal’s Wager / Pascal’s Mugging?

65 of 70

What I’m working on (very much WIP though)

66 of 70

Preliminary model – partly expanded

See AI Alignment Forum report for detailed model walkthrough

67 of 70

Getting Involved

68 of 70

Direct work opportunities

  • Theoretical work, engineering work, governance / policy work, strategy, etc.
  • Effective Altruism & funding opportunities
  • 80000hours.org
    • Career advice
    • Job board
    • 1 on 1 meetings for networking / advice

69 of 70

General awareness and advocacy

  • The more people are aware of the issues the better, generally
  • May be opportunities to impact relevant workplace decision-making
  • Potential issue – infohazards?

70 of 70

Join the conversation!

  • Feel free to contact me!
    • Happy to chat about any of this or point you to where you can find more information (books, websites, etc.)
    • I’m happy to help you connect with others if I can
  • Contact info