1 of 44

Conflict and Cooperation: An Exploration of Safety Concerns in Multi-Agent Systems

Chandler Smith

Northeastern University

December 13th, 2023

2 of 44

Agenda

  • Introduction
  • Multi-Agent Systems
  • Game Theory and Collective Action Problems
  • Overview of AI Safety
  • Risks Posed by Multi-Agent Systems
  • Cooperative AI
  • Promising Research Directions
  • Case Study: Escalation Risks from Language Models in Military and Diplomatic Decision-Making
  • Conclusion
  • Q&A

3 of 44

Primers

  • There is a lot of incredible work on other AI harms: algorithmic bias, job displacement, social media/deepfakes, etc.
  • This presentation focuses on AI’s potential for catastrophic impact.
  • Working on AI safety is about taking bets.
  • Multi-Agent Safety is interdisciplinary, but I am a computer scientist.
  • Images in this presentation were generated by DALL·E

4 of 44

Multi-Agent Claims

  1. The alignment of an individual model or agent does not imply the alignment of all agents.
  2. Multi-agent training is one of the most likely ways we might build AGI. (1)(2)
  3. Natural selection, if left ungoverned, favors AI over humans and defecting AI agents over cooperative ones. (4)
  4. Understanding the risks posed by multi-agent AI is a neglected field that requires attention.
  5. Multi-agent systems also have the potential to enable additional safety solutions.
  6. Lack of coordination or the presence of conflict between multiple deployed AGIs is a major source of existential risk or large-scale suffering.

Citation: (1) Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research, (2) Eight Claims About Multi-Agent AGI Safety, (3) Center on Long-Term Risk Research Agenda, (4) Natural Selection Favors AIs over Humans

5 of 44

Introduction

6 of 44

Flash Crash - A Bellwether

On May 6, 2010, all major U.S. stock market indexes collapsed within a 36-minute window.

The Dow Jones Industrial Average suffered what was then its largest intraday point decline in history.

Over $1 trillion in market capitalization was temporarily erased. (1)

High-frequency trading (HFT) algorithms reacted to each other’s buy and sell orders, which in turn triggered a self-reinforcing feedback loop that amplified the downward market move.
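To make the feedback-loop mechanism concrete, here is a toy Python simulation (entirely illustrative: the agents, parameters, and price-impact rule are hypothetical, not a model of the actual 2010 event). When momentum-chasing demand outweighs stabilizing value-investor demand, a single sell shock snowballs into a crash:

```python
# Toy market: agents' aggregate demand moves the price each step.
def simulate(momentum_weight, steps=20, shock_at=5):
    price, prev = 100.0, 100.0
    path = [price]
    for t in range(steps):
        momentum = price - prev                 # last price move
        shock = -1.0 if t == shock_at else 0.0  # one large sell order
        # Momentum chasing (destabilizing) + mean reversion (stabilizing).
        demand = momentum_weight * momentum + 0.2 * (100.0 - price) + shock
        demand = max(-50.0, min(50.0, demand))  # crude circuit breaker
        prev, price = price, price * (1 + 0.01 * demand)
        path.append(price)
    return path

calm = simulate(momentum_weight=0.3)   # reversion dominates: shock absorbed
crash = simulate(momentum_weight=3.0)  # momentum dominates: shock amplified
print(f"calm market:  lowest price {min(calm):.1f}")   # stays near 99
print(f"crash market: lowest price {min(crash):.1f}")  # collapses
```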

Citation: DALL·E-generated image, (1) Flash Crashes in Multi-Agent Systems Using Minority Games And Reinforcement Learning to Test AI Safety

7 of 44

Multi-Agent (MA)

A system composed of multiple interacting intelligent agents, which can be AIs, humans, or a combination of both

Transformative AI (TAI)

Transformative AI, also known as superintelligence or AGI (artificial general intelligence), describes AI systems whose intelligence surpasses that of the brightest and most gifted human minds

Agent

An entity in AI that makes decisions or performs actions based on its environment and programming

Cooperative AI

AI systems designed to collaborate effectively with other agents, including humans and other AIs

Collective Action Problems

A scenario in which there is conflict between the individual interest and the group interest

Emergent Capabilities

Novel behaviors or abilities that arise in AI systems as a result of complex interactions and are not explicitly programmed

Multi-polar AI

A scenario in AI development characterized by the presence of multiple AI entities or systems with varying goals and capabilities, often leading to competitive or cooperative dynamics

8 of 44

Multi-Agent Systems

9 of 44

Foundations

Multi-agent reinforcement learning (MARL) is a complex and rapidly evolving field with a wide range of applications. MARL is particularly useful in addressing problems in robotics, distributed control, telecommunications, and economics.

Historically, multi-agent systems have been adversarial; this began to change with the emergence of agentic LLMs and multimodal models. (1)

Citation: DALL·E-generated image

10 of 44

Traditional MARL

  • Still early days
    • Textbook: Multi-Agent Reinforcement Learning: Foundations and Modern Approaches (published November 1st, 2023)
  • The majority of existing literature is on adversarial training or robotics (e.g., self-driving cars)
  • AlphaGo: a zero-sum game example, tractable thanks to its clear feedback mechanism

Citation: (1) Multi-Agent Reinforcement Learning: Foundations and Modern Approaches; image: The Guardian

11 of 44

Game Theory and Collective Action Problems

12 of 44

Select Terms from Game Theory

Collective Action Problems: These are situations where individuals within a group have incentives to act in their own self-interest, but if all individuals do so, the outcome is worse for everyone in the group compared to if they cooperated. The challenge is to align incentives and encourage collective action that leads to the best group outcome.

Pareto Efficiency: A state of allocation of resources from which it is impossible to reallocate so as to make any one individual or preference criterion better off without making at least one other individual or preference criterion worse off. It is a concept of efficiency, not necessarily of fairness.

Bargaining: This is a process of negotiation between parties who voluntarily agree to reach a mutually beneficial compromise or agreement on a matter of common interest. Bargaining often involves give-and-take and requires an understanding of the other party's goals, limitations, and willingness to cooperate.
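To make the collective-action structure concrete, here is a minimal Python check on the classic Prisoner’s Dilemma (the payoff numbers are the standard textbook choice, used purely for illustration): defection is each player’s best response to anything, so mutual defection is the unique Nash equilibrium, yet both players would prefer mutual cooperation.

```python
# Prisoner's Dilemma: the canonical collective action problem.
# Payoffs are (row player, column player); numbers are the usual textbook ones.
C, D = "cooperate", "defect"
payoff = {
    (C, C): (3, 3),  # mutual cooperation
    (C, D): (0, 5),  # sucker's payoff vs. temptation
    (D, C): (5, 0),
    (D, D): (1, 1),  # mutual defection
}

def best_response(opponent_action):
    """The action maximizing the row player's payoff against a fixed opponent."""
    return max([C, D], key=lambda a: payoff[(a, opponent_action)][0])

# Defecting is the best response to both actions, so (D, D) is the unique
# Nash equilibrium...
assert best_response(C) == D and best_response(D) == D
# ...yet both players do strictly better under mutual cooperation.
assert payoff[(C, C)][0] > payoff[(D, D)][0] and payoff[(C, C)][1] > payoff[(D, D)][1]
print("The unique equilibrium (defect, defect) is Pareto-inefficient.")
```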

Citation: DALL·E-generated image, Flash Crashes in Multi-Agent Systems Using Minority Games And Reinforcement Learning to Test AI Safety

13 of 44

Overview of AI Safety and Ethics

14 of 44

Current Progress

  • DeepMind, OpenAI, and at least 70 other organizations are working towards TAI
  • The foundational claim relevant here is that multiple narrow agents combining to form TAI is both likely and risky.

Citation: DeepMind (2)

15 of 44

Isn’t this a good thing?

  • ASI/TAI/AGI may not be automatically aligned:
    • We’ll build AI systems which are much more intelligent than humans (AGI).
    • Those systems will be autonomous agents which pursue large-scale goals.
    • Those goals will be misaligned with ours; that is, they will aim towards outcomes that are not desirable by our standards, and trade off against our goals.
    • The development of such AIs would lead to them gaining control of humanity’s future.
    • Conflict between these agents may cause intense harm.

16 of 44

How AI Might Go Horribly Wrong

Defining ‘Horribly’:

Existential Risk (X-risk): Refers to scenarios where an adverse event could cause human extinction or irreversibly and drastically curtail humanity’s potential. These risks encompass situations where AI systems could, either directly or indirectly, lead to outcomes that threaten the survival of human civilization.

Suffering Risk (S-risk): Considers the possibility that AI systems might lead to a substantial increase in suffering. Unlike X-risks, which focus on the endpoint of human extinction, S-risks concern futures containing vast amounts of suffering, regardless of whether humanity survives.

17 of 44

Risk Taxonomies

NIST, Berkeley, and the Center for AI Safety all acknowledge multi-agent risks in some way, but these risks are not well represented in existing taxonomies.

Citation: (1) An Overview of Catastrophic AI Risks, (2) TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI

18 of 44

Major Fields of Safety Research

Robustness / Security

Model Alignment

Mechanistic Interpretability

Cooperative AI

Governance

19 of 44

Risks Posed by Multi-Agent Ecosystems

20 of 44

Multi-Agent Claims

  • The alignment of an individual model or agent does not imply the alignment of all agents.
  • Multi-agent training is one of the most likely ways we might build AGI. (1)(2)
  • Natural selection, if left ungoverned, favors AI over humans and defecting AI agents over cooperative ones. (4)
  • Understanding the risks posed by multi-agent AI is a neglected field that requires attention.
  • Multi-agent systems also have the potential to enable additional safety solutions.
  • Lack of coordination or the presence of conflict between multiple deployed AGIs is a major source of existential risk or large-scale suffering.

Citation: (1) Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research, (2) Eight Claims About Multi-Agent AGI Safety, (3) Center on Long-Term Risk Research Agenda, (4) Natural Selection Favors AIs over Humans

21 of 44

Betting on Coordinated Systems

Citation: Allan Dafoe graph

22 of 44

23 of 44

Interventions

24 of 44

25 of 44

Cooperative AI

26 of 44

Overview of Cooperative AI

  • The Human Colossus is built on cooperation
  • Many of the important problems facing us are collective action problems such as climate change and global stability
  • As we become more connected and AI grows, there will be an increased need for cooperation

Citation: Cooperative AI Foundation

27 of 44

Example: Open Source Game Theory

  • Open Source Game Theory: the idea that agents whose source code is visible to one another can make credible commitments that shape each other’s actions.
  • Prisoner’s Dilemma:
    • Real-world analogue: nuclear proliferation
    • Cooperation is mutually beneficial, but each side is unwilling to give up the option to defect (see the sketch below)
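A minimal sketch of the intuition (a simplification for illustration, not a formal construction from the literature; real proposals, e.g., those based on provability logic, are subtler). A “mirror bot” cooperates exactly when the opponent’s source code matches its own, which makes its commitment credible and cooperation stable:

```python
# Program equilibrium sketch: agents are programs that can read each other's
# source code before acting. Run as a script (inspect.getsource needs a file).
import inspect

def mirror_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent is running exactly this program."""
    return "C" if opponent_source == inspect.getsource(mirror_bot) else "D"

def defect_bot(opponent_source: str) -> str:
    return "D"  # unconditional defector

def play(p1, p2):
    # Mutual source-code visibility is what makes the commitments credible.
    return p1(inspect.getsource(p2)), p2(inspect.getsource(p1))

print(play(mirror_bot, mirror_bot))  # ('C', 'C'): cooperation is self-enforcing
print(play(mirror_bot, defect_bot))  # ('D', 'D'): defectors gain nothing
```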

Citation: DALL·E-generated image, Hammond Cooperative AI

28 of 44

Limitations to Cooperative AI

Pros:

  • Credible Commitments
  • Non-biological features which can help with coordination
  • Reduction in conflict and selfish behavior

Cons:

  • Credible Threats
  • Collusion
  • Recursive beliefs
  • Scalability and Complexity

Technical Problems:

  • Partial Observability and Information Asymmetry
  • Credit Assignment
  • Robustness to Agent Variability
  • Adversarial Robustness of Agents

29 of 44

Failure Modes

Convergence to Suboptimal Policies: When multiple agents interact, they may converge to policies that are individually stable but collectively suboptimal: Nash equilibria in which no agent can improve by deviating alone, yet which are Pareto-dominated by coordinated alternatives. This can lead to inefficiency or undesirable outcomes due to a lack of coordination or conflicting objectives among agents.

Catastrophic Forgetting: Agents may undergo catastrophic forgetting, where the acquisition of new knowledge leads to the abrupt forgetting of previously learned information, especially when agents have to adapt to new agents entering the system or changing policies of others.

Risk of Coordination Failures: Without explicit coordination mechanisms, agents can fail to act coherently, leading to misaligned objectives and missed opportunities for mutual benefit or risk mitigation.

Self-Fulfilling Prophecies and Feedback Loops: In situations where agents' perceptions of other agents' actions affect their own decisions, there is a risk of self-fulfilling prophecies and feedback loops, potentially escalating benign situations into harmful outcomes.

Tragedy of the Commons: Agents acting in self-interest can over-exploit shared resources, depleting them to the detriment of all involved parties (see the toy simulation below).

Shadowing and Free-Riding: Agents may engage in shadowing behavior, where they exploit the efforts of others, or free-riding, where they benefit from collective actions without contributing. These behaviors can degrade the overall system performance and fairness.
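The tragedy of the commons above is easy to reproduce in a toy simulation (all parameters are illustrative): a shared stock regenerates logistically, and greedy harvesting collapses it, yielding far less in total than a modest, cooperatively agreed quota.

```python
# Toy commons model: a shared resource stock regenerates logistically;
# each of n_agents chooses a per-round harvest.
def run(harvest_per_agent, n_agents=10, stock=100.0, growth=0.25,
        capacity=100.0, rounds=100):
    total_harvested = 0.0
    for _ in range(rounds):
        stock += growth * stock * (1 - stock / capacity)  # regeneration
        take = min(stock, harvest_per_agent * n_agents)
        stock -= take
        total_harvested += take
    return total_harvested, stock

greedy = run(harvest_per_agent=5.0)  # each agent takes all it can
quota = run(harvest_per_agent=0.5)   # each agent keeps to a sustainable quota
print(f"greedy:      harvested {greedy[0]:6.1f} total, stock left {greedy[1]:5.1f}")
print(f"cooperative: harvested {quota[0]:6.1f} total, stock left {quota[1]:5.1f}")
# Greedy agents exhaust the stock within a few rounds and end up with far
# less in total than agents who restrain themselves.
```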


30 of 44

Promising Research Directions

31 of 44

Benchmarking

Benchmarking is about measuring an AI model's ability to make safe decisions, avoid harmful outcomes, and handle unexpected situations without human intervention. The benchmarks are designed to challenge AI systems across various dimensions of safety, such as robustness, generalizability, resistance to adversarial attacks, and alignment with ethical guidelines.

Example: Welfare Diplomacy Benchmark (1)

Citation: DALL·E-generated image, (1) Welfare Diplomacy

32 of 44

Case Study: Benchmarking

33 of 44

Escalation Risks from Language Models in Military and Diplomatic Decision-Making

Authors: Gabriel Mukobi, Anka Reuel, Juan-Pablo Rivera, Max Lamparth, Chandler Smith, Jacquelyn Schneider

Abstract: The potential integration of autonomous agents in high-stakes military and foreign-policy decision-making has gained prominence, especially with the emergence of advanced generative AI models like GPT-4. This paper aims to scrutinize the behavior of multiple autonomous agents in simulated wargames, specifically focusing on their predilection to take escalatory actions which may exacerbate (multilateral) conflicts. Drawing on literature from political science and international relations on escalation dynamics, we design a scoring framework to assess the escalation potential of decisions made by these agents in different scenarios.

Contrary to prior qualitative studies, our research provides both qualitative and quantitative insights. We find that all five studied off-the-shelf language models lead to escalation and show signs of sudden and hard-to-predict escalations, even in neutral scenarios without predetermined conflicts. We observe that models tend to develop arms-race dynamics with each other, leading to greater conflict and, in rare cases, to the deployment of nuclear weapons. Qualitatively, we also collect the models’ reported reasoning for chosen actions and observe worrying justifications for, e.g., armed attacks.

Given the high stakes involved in military and foreign-policy contexts, the deployment of LLM-based autonomous agents demands further examination and cautious consideration.

34 of 44

Experiment Architecture

35 of 44

Results

36 of 44

Promising Research Directions (continued)

37 of 44

Scalable Oversight

A method in machine learning where an algorithm can effectively manage and incorporate feedback from limited human oversight to improve its performance on complex tasks.

Citation: DALL·E-generated image, (1) Scalable Oversight

38 of 44

Credibility and Communication (Contract and Bargaining)

  • Cheap Talk (1)
  • Contracts(2)
  • Policy Development
    • Multi-Agent Deep Deterministic Policy Gradient (MADDPG)
  • Reinforcement Learning
  • Emergent communications
    • Could be good or bad

Example:

  • Backpropagation in a multi-agent reward context
  • Compute how a change in one agent’s action leads to a change in another agent’s outcome (sketched below)
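A minimal PyTorch sketch of this idea (a toy construction for illustration, not the C-Net/DRU architecture from the citation): because agent A’s message enters agent B’s computation differentiably, the gradient of a shared objective flows backward through the message into A’s parameters, assigning credit across the agent boundary.

```python
import torch

# Two tiny agents: A encodes an observation into a message; B acts on it.
torch.manual_seed(0)
obs = torch.tensor([1.0, -2.0])
A = torch.nn.Linear(2, 1)  # sender: observation -> continuous message
B = torch.nn.Linear(1, 1)  # receiver: message -> action
opt = torch.optim.SGD(list(A.parameters()) + list(B.parameters()), lr=0.1)

target_action = torch.tensor([0.5])  # stand-in for a shared reward signal
for step in range(100):
    message = torch.tanh(A(obs))   # A's "action" is the message it sends
    action = B(message)            # B conditions its action on A's message
    loss = (action - target_action).pow(2).mean()
    opt.zero_grad()
    loss.backward()                # the gradient passes through `message`,
    opt.step()                     # so A learns from B's outcome

print(f"final loss: {loss.item():.6f}")
```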

Citation: (1) Jakob Foerster Feedback, (2) Get It in Writing: Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL

Terms: C-Net = critic network, DRU = discretise/regularise unit, t = time, M and O are messages and observations

39 of 44

Individual Model Alignment for Cooperative Qualities

Adversarial Training: A technique used to improve the robustness of machine learning models by exposing them to adversarial examples or attacks during training, aiming to make the model more resilient to manipulation or unseen perturbations.
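As a concrete instance, here is a minimal sketch of one standard form of adversarial training (an FGSM-style inner step on a toy classifier; the model, data, and perturbation budget are all placeholders):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)      # toy classifier
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(32, 10)             # stand-in batch
y = torch.randint(0, 2, (32,))
EPS = 0.1                           # L-infinity perturbation budget

for step in range(100):
    # Inner step: craft worst-case inputs within the perturbation budget.
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    x_adv = (x + EPS * grad.sign()).detach()
    # Outer step: train on the adversarial examples instead of clean ones.
    opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    opt.step()
```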

Relaxed Adversarial Training: Refers to finding a way to “describe” or “simulate” inputs that one can’t actually produce in real life. “What would…

  • Spite
  • Bitterness
  • Resentment


40 of 44

Robustness

Robustness refers to the ability of a model to maintain its performance and provide reliable predictions in the face of noisy, altered, or previously unseen data inputs. Multi-agent systems need to be robust against:

  • Manipulation
  • Security Vulnerabilities
  • Prompt Injections
  • Deception
  • Steganography and Security

Citation: DALL·E-generated image

41 of 44

Conclusion

42 of 44

Review

  • Multi-agent systems pose a significant risk to humanity
  • Foundational research is required
  • Research in this area is neglected
  • The alignment of one agent does not ensure the alignment of an ecosystem of agents

Next Steps: Continued Research

43 of 44

Limitations to this Research Direction

  • Uncertain Cooperation: Even aligned AI agents may fail to fully cooperate due to unforeseen complexities or uncertainties in their programming or objectives.
  • Deceptive Benevolence: AI systems could present an outward appearance of alignment, while potentially concealing underlying threats or adversarial intentions.
  • Counterproductive Research Focus: Concentrating exclusively on failure modes in AI safety research might paradoxically increase the risk of overlooking broader systemic issues, leading to more severe unintended consequences.

Citation: DALL·E-generated image

44 of 44

Q&A