1 of 23

Learning and Penalizing Betrayal

Final Presentation

June 2022

2 of 23

Overview

  • Study of the emergence of betrayal and deception in AI agents
  • Development of suitable Reinforcement Learning environments
  • Empirical analysis of betrayal statistics, patterns, and dynamics
  • Application of betrayal penalization approaches

3 of 23

Environment 1: Partner Selection in Social Dilemmas

Overview

  • Integrating partner selection into grid-world-based social dilemmas
    • Agents have different incentives
    • They choose what to signal
    • They must also choose whom to partner with
  • Betrayal occurs when an agent's signals do not match its actions (see the sketch below)
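As a rough illustration of that betrayal condition, here is a minimal Python sketch, assuming a hypothetical episode record that stores what an agent signalled during partner selection and what it actually did afterwards (all names here are illustrative, not the environment's real API):

    from dataclasses import dataclass

    # Illustrative intention labels; the real environment's signal/action space may differ.
    COOPERATE, DEFECT = "cooperate", "defect"

    @dataclass
    class PartnerEpisode:
        """One partnered interaction: what an agent signalled vs. what it did."""
        agent_id: int
        partner_id: int
        signalled: str   # intention broadcast during partner selection
        acted: str       # behaviour actually taken once partnered

    def is_betrayal(ep: PartnerEpisode) -> bool:
        """Flag a betrayal when the broadcast signal and the realised action diverge."""
        return ep.signalled != ep.acted

    def betrayal_rate(episodes: list[PartnerEpisode]) -> float:
        """Fraction of partnered interactions in which the signal was broken."""
        if not episodes:
            return 0.0
        return sum(is_betrayal(ep) for ep in episodes) / len(episodes)

    # Example: one agent signalled cooperation but defected on its partner.
    episodes = [
        PartnerEpisode(agent_id=0, partner_id=1, signalled=COOPERATE, acted=DEFECT),
        PartnerEpisode(agent_id=1, partner_id=0, signalled=COOPERATE, acted=COOPERATE),
    ]
    print(betrayal_rate(episodes))  # 0.5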

4 of 23

Environment 1: Partner Selection in Social Dilemmas

Status:

  • Development has been pushed back
    • Too busy / unexpectedly hard
  • Continuing full time over the summer
    • June -> September
      • Building out the environment -> write-up
  • Taking a gap year
    • Further research

5 of 23

Environment 2: Symmetric Observer - Gatherer

  • Turn-based gridworld game
  • Agents observe the other agent's world, can move within their own
  • Agents transmit food locations
  • Food randomly relocates each round across all worlds
  • Betrayal incentive: dishonest messaging to obtain food in the future (see the sketch below)
  • Cooperation incentive: messaging distorted by hunger
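A minimal sketch of one round of this setup, assuming a toy grid and a message that is simply a claimed food coordinate; the class and method names are illustrative placeholders, not the actual environment code:

    import random

    GRID = 5  # illustrative grid size

    class ObserverGathererRound:
        """Toy sketch of one round: each agent sees the *other* agent's world,
        sends a (possibly dishonest) food coordinate, then moves in its own world."""

        def __init__(self):
            # One food location per world, resampled every round.
            self.food = {0: self._random_cell(), 1: self._random_cell()}

        @staticmethod
        def _random_cell():
            return (random.randrange(GRID), random.randrange(GRID))

        def observe(self, agent: int):
            """An agent observes the other agent's world, not its own."""
            return self.food[1 - agent]

        def message(self, agent: int, honest: bool):
            """Report a food location in the other agent's world.
            A dishonest message points somewhere else."""
            true_loc = self.observe(agent)
            return true_loc if honest else self._random_cell()

        def reward(self, agent: int, position):
            """Gatherer reward: 1 if the agent reaches the food in its own world."""
            return 1.0 if position == self.food[agent] else 0.0

    # Example: agent 0 lies to agent 1 about where agent 1's food is.
    rnd = ObserverGathererRound()
    msg_to_agent_1 = rnd.message(agent=0, honest=False)
    print(msg_to_agent_1, rnd.food[1])  # claimed vs. true location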

6 of 23

Environment 2: Symmetric Observer - Gatherer

Penalization mechanics

  • Set up betrayal identification approaches
  • Train an agent to exhibit betrayal behaviour
  • Collect betrayal data from trained agent runs
  • Train a penalization system to predict betrayal
  • Incorporate the penalty into the loss / reward (see the sketch below)
  • Measure the penalty's impact and generalization
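A minimal sketch of the last two steps, assuming a trained betrayal classifier that scores each transition with a probability of betrayal and a hand-tuned penalty weight; both the classifier interface and the coefficients are illustrative assumptions, not the project's actual components:

    from typing import Callable, Dict

    # Hypothetical classifier: maps a transition (obs, message, action, ...) to P(betrayal).
    BetrayalClassifier = Callable[[Dict], float]

    def shaped_reward(
        env_reward: float,
        transition: Dict,
        betrayal_prob: BetrayalClassifier,
        penalty_weight: float = 2.0,   # illustrative coefficient, would be tuned
        threshold: float = 0.5,        # only penalize confident predictions
    ) -> float:
        """Subtract a penalty from the environment reward when the trained
        classifier predicts that this transition involves betrayal."""
        p = betrayal_prob(transition)
        penalty = penalty_weight * p if p > threshold else 0.0
        return env_reward - penalty

    # Example with a dummy classifier that flags dishonest messages.
    def dummy_classifier(transition: Dict) -> float:
        return 0.9 if transition["message"] != transition["true_food_location"] else 0.05

    t = {"message": (0, 0), "true_food_location": (3, 4)}
    print(shaped_reward(env_reward=1.0, transition=t, betrayal_prob=dummy_classifier))  # 1.0 - 1.8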

7 of 23

Environment 2: Symmetric Observer - Gatherer

Status:

  • Environment in late development
  • Support from EA
  • 4-month roadmap:
    • June: Environment / experimental setup finalization
    • July: Betrayal experiments & data collection
    • August: Penalization experiments & analysis
    • September: Consolidation and write-up

8 of 23

Environment 3: Iterated Prisoner’s Dilemma

  • Each agent knows its own future payoff matrices
    • But not its opponent’s payoffs
  • Agents need to negotiate to get the maximum payoff
  • Agents are incentivised to defect more often than agreed (see the worked check below)

Payoff matrices (each shows the player's own reward; columns: own action, rows: other's action)

Player 1, Round 1:
             own C   own D
  other C      5       9
  other D      1       2

Player 1, Round 2:
             own C   own D
  other C      4       6
  other D      2       3

Player 2, Round 1:
             own C   own D
  other C      4       5
  other D      2       3

Player 2, Round 2:
             own C   own D
  other C      6      10
  other D      2       4

Naive Policy:
       R 1   R 2   Reward
  P1    C     C       9
  P2    C     C      10

Negotiated Policy:
       R 1   R 2   Reward
  P1    D     C      11
  P2    C     D      12
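A small worked check of the two policies above, assuming the matrices are indexed by (other's action, own action) as laid out here; this is just an illustrative recomputation of the slide's totals:

    # Per-player, per-round payoff matrices, indexed as [other_action][own_action].
    C, D = "C", "D"
    payoffs = {
        ("P1", 1): {C: {C: 5, D: 9}, D: {C: 1, D: 2}},
        ("P1", 2): {C: {C: 4, D: 6}, D: {C: 2, D: 3}},
        ("P2", 1): {C: {C: 4, D: 5}, D: {C: 2, D: 3}},
        ("P2", 2): {C: {C: 6, D: 10}, D: {C: 2, D: 4}},
    }

    def total_reward(player, own_moves, other_moves):
        """Sum a player's payoff over both rounds given both move sequences."""
        return sum(
            payoffs[(player, r)][other][own]
            for r, (own, other) in enumerate(zip(own_moves, other_moves), start=1)
        )

    # Naive policy: both players cooperate in both rounds.
    print(total_reward("P1", [C, C], [C, C]))   # 5 + 4  = 9
    print(total_reward("P2", [C, C], [C, C]))   # 4 + 6  = 10

    # Negotiated policy: P1 plays D then C, P2 plays C then D.
    print(total_reward("P1", [D, C], [C, D]))   # 9 + 2  = 11
    print(total_reward("P2", [C, D], [D, C]))   # 2 + 10 = 12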

9 of 23

Got stuck, then nerdsniped by selection theorems

  • Current hypothesis: lots of philosophical confusion is due to using discrete abstractions in place of continuous reality
  • This is something people do unconsciously

10 of 23

Example: Fermi paradox

  • 1950: Fermi asked why aliens aren’t common

11 of 23

Example: Fermi paradox

  • 1950: Fermi asked why aliens aren’t common
  • … many solutions are proposed
    • See the Wikipedia article

12 of 23

Example: Fermi paradox

  • 1950: Fermi asked why aliens aren’t common
  • … many solutions are proposed
    • See the Wikipedia article
  • 2018: Dissolving the Fermi Paradox shows that proper tracking of uncertainty in the difficulty of various bottlenecks dissolves the paradox
    • I.e., the paradox arose because people used discrete point estimates in place of continuous reality

13 of 23

Example: Fermi paradox

  • 1950: Fermi asked why aliens aren’t common
  • … many solutions are proposed
    • See the Wikipedia article
  • 2018: Dissolving the Fermi Paradox shows that proper tracking of uncertainty in the difficulty of various bottlenecks dissolves the paradox
    • I.e., the paradox arose because people used discrete point estimates in place of continuous reality
  • 2nd implication: wrong assumptions can make simple problems seem very hard!

14 of 23

Idea: optimized systems have continuous internal features

  • Example: agency
  • Suppose we have a 3-layer transformer being trained via RL with REINFORCE
  • We typically call this “one agent”

15 of 23

Idea: optimized systems have continuous internal features

  • Example: agency
  • Suppose we have a 3-layer transformer being trained via RL with REINFORCE
  • We typically call this “one agent”
  • However, we could call each layer an agent and say they are passing messages
    • It makes no difference to the computations (rough sketch below)
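A very rough sketch of the setup being described: a tiny three-block policy trained with one REINFORCE update, where each block's output can equally well be read as a message passed to the next block. The model, sizes, and episode are illustrative stand-ins, not anything from the project:

    import torch
    import torch.nn as nn

    # Tiny stand-in for a "3-layer transformer": three blocks, each passing its
    # activations ("messages") to the next. Whether we call the whole stack one
    # agent or each block an agent changes nothing about this computation.
    class ThreeBlockPolicy(nn.Module):
        def __init__(self, obs_dim=8, hidden=16, n_actions=4):
            super().__init__()
            self.blocks = nn.ModuleList([
                nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU()),
                nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()),
                nn.Linear(hidden, n_actions),
            ])

        def forward(self, obs):
            msg = obs
            for block in self.blocks:   # each block's output is the next block's input
                msg = block(msg)
            return torch.distributions.Categorical(logits=msg)

    policy = ThreeBlockPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # One REINFORCE update on a fake episode: log-prob of chosen actions times return.
    obs = torch.randn(10, 8)              # 10 timesteps of made-up observations
    dist = policy(obs)
    actions = dist.sample()
    episode_return = torch.tensor(1.0)    # pretend the episode earned reward 1
    loss = -(dist.log_prob(actions).sum() * episode_return)
    loss.backward()
    opt.step()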

16 of 23

Idea: optimized systems have continuous internal features

  • Why does this matter?

17 of 23

Idea: optimized systems have continuous internal features

  • Why does this matter?
  • Suppose each layer is hyperspecialized to doing only one task
    • Credit assignment knows this, and doesn’t update the parameters of a layer when reward comes from outside the layer’s specialty

18 of 23

Idea: optimized systems have continuous internal features

  • Why does this matter?
  • Suppose each layer is hyperspecialized to doing only one task
    • Credit assignment knows this, and doesn’t update the parameters of a layer when reward comes from outside the layer’s specialty (see the sketch below)
  • Implication: each layer only values its own specialty
  • Overall policy is a multi-agent equilibrium
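A crude sketch of this hyperspecialization thought experiment, continuing the illustrative ThreeBlockPolicy / REINFORCE sketch above (the names policy, opt, obs, and torch come from that block): a block's gradients are simply zeroed whenever the reward came from outside that block's hypothetical specialty:

    # Hypothetical specialties: which task each block "cares about".
    specialty = {0: "navigation", 1: "foraging", 2: "signalling"}

    def masked_reinforce_step(policy, opt, obs, reward, reward_source: str):
        """One REINFORCE update where a block's parameters are only updated
        when the reward came from that block's own specialty."""
        opt.zero_grad()
        dist = policy(obs)
        actions = dist.sample()
        loss = -(dist.log_prob(actions).sum() * reward)
        loss.backward()
        # Zero out gradients for blocks whose specialty did not produce this reward.
        for i, block in enumerate(policy.blocks):
            if specialty[i] != reward_source:
                for p in block.parameters():
                    if p.grad is not None:
                        p.grad.zero_()
        opt.step()

    # Only block 1 ("foraging") gets updated for a foraging reward.
    masked_reinforce_step(policy, opt, obs, torch.tensor(1.0), reward_source="foraging")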

19 of 23

Idea: optimized systems have continuous internal features

  • Real specialization is nowhere near that clean
  • Result: a continuous distribution over possible internal agents

20 of 23

Same idea applies to values

Discrete values

21 of 23

Same idea applies to values

Discrete values

Continuous values

22 of 23

And also to…

  • Ontologies
  • Mesaoptimizers
  • Learned objectives
  • Cognitive capabilities
  • Abstractions
  • Etc.

23 of 23

Questions?