Conflict and Cooperation: An Exploration of Safety Concerns in Multi-Agent Systems
Chandler Smith
Northeastern University
December 13th, 2023
Agenda
Primers
Multi-Agent Claims
Introduction
Flash Crash - A Bellwether
On May 6, 2010, major U.S. stock market indexes collapsed within a roughly 36-minute window.
The Dow Jones Industrial Average suffered its largest intraday point decline up to that date.
Over $1 trillion in market capitalization was lost (1)
High-Frequency Trading (HFT) algorithms reacted to each other's buy and sell orders, triggering a self-reinforcing feedback loop that amplified the downward market move (sketched below).
Citation: Dalle Generated Image, (1) Flash Crashes in Multi-Agent Systems Using Minority Games And Reinforcement Learning to Test AI Safety
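To make the mechanism concrete, here is a minimal sketch (a toy model, not a reconstruction of the actual event; all parameters are illustrative assumptions) of how mutually reactive trading agents can amplify an initial shock when each one reacts to the selling it just observed:

```python
import random

def simulate_crash(n_agents=50, steps=40, initial_shock=-1.0, sensitivity=1.1):
    """Toy self-reinforcing loop: each agent sells in proportion to the selling
    pressure observed on the previous step. A sensitivity above 1 means
    reactions amplify rather than dampen the original shock."""
    price = 100.0
    prev_flow = initial_shock              # an initial wave of selling
    history = [price]
    for _ in range(steps):
        # Each agent reacts to the previous step's net order flow, plus noise.
        orders = [sensitivity * prev_flow + random.gauss(0, 0.05)
                  for _ in range(n_agents)]
        net_flow = sum(orders) / n_agents  # average order flow this step
        price *= 1 + 0.01 * net_flow       # order flow moves the price
        prev_flow = net_flow
        history.append(price)
    return history

if __name__ == "__main__":
    path = simulate_crash()
    print(f"start: {path[0]:.2f}  end: {path[-1]:.2f}")
```

With sensitivity below 1 the same loop damps out; the point is that the interaction structure, not any single agent, determines whether a small shock becomes a crash.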
Multi-Agent (MA) | A system composed of multiple interacting intelligent agents, which can be AIs, humans, or a combination of both
Transformative AI (TAI) | AI whose impact on society would be at least as profound as that of the agricultural or industrial revolutions; often discussed alongside related notions such as AGI and superintelligence
Agent | An entity that makes decisions or performs actions based on its environment and its programming
Cooperative AI | AI systems designed to collaborate effectively with other agents, including humans and other AIs
Collective Action Problems | Scenarios in which individual interests conflict with the group interest
Emergent Capabilities | Novel behaviors or abilities that arise in AI systems from complex interactions rather than explicit programming
Multi-polar AI | A scenario in AI development characterized by multiple AI systems with varying goals and capabilities, often leading to competitive or cooperative dynamics
Multi-Agent Systems
Foundations
Multi-agent reinforcement learning (MARL) is a complex and rapidly evolving field with a wide range of applications; it is particularly useful for problems in robotics, distributed control, telecommunications, and economics.
Historically, multi-agent systems were most often studied in adversarial settings; the emergence of LLM and multimodal agents capable of taking actions has broadened the focus toward cooperative and mixed-motive interactions.(1)
Citation: Dalle
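As a minimal, self-contained illustration of the setting (a sketch of independent Q-learning in a repeated matrix game, not an algorithm from the cited textbook): each agent learns as if it were alone, which is exactly what makes multi-agent learning non-stationary.

```python
import random

# Payoffs for a simple coordination game: both agents are rewarded
# only when they pick the same action.
PAYOFF = {(0, 0): (1.0, 1.0), (1, 1): (1.0, 1.0),
          (0, 1): (0.0, 0.0), (1, 0): (0.0, 0.0)}

def train(episodes=5000, alpha=0.1, epsilon=0.1):
    q1, q2 = [0.0, 0.0], [0.0, 0.0]   # one Q-value per action, per agent
    for _ in range(episodes):
        # Epsilon-greedy action selection for each independent learner.
        a1 = random.randrange(2) if random.random() < epsilon else q1.index(max(q1))
        a2 = random.randrange(2) if random.random() < epsilon else q2.index(max(q2))
        r1, r2 = PAYOFF[(a1, a2)]
        # Each agent updates as if it were alone; the other agent's changing
        # policy makes the environment non-stationary from its point of view.
        q1[a1] += alpha * (r1 - q1[a1])
        q2[a2] += alpha * (r2 - q2[a2])
    return q1, q2

print(train())
```

Even in this toy coordination game, whether the two learners settle on the same action depends on the interplay of their exploration and learning rates, a preview of the coordination failures discussed later.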
Traditional MARL
Citation: (1) Multi-Agent Reinforcement Learning: Foundations and Modern Approaches, The Guardian Image
Game Theory and Collective Action Problems
Select Terms from Game Theory
Collective Action Problems: Situations where individuals within a group have an incentive to act in their own self-interest, but if all individuals do so, the outcome is worse for everyone than if they had cooperated. The challenge is to align incentives and encourage collective action that leads to the best group outcome.
Pareto Optimality (Pareto Efficiency): A state of allocation of resources from which it is impossible to reallocate so as to make any one individual or preference criterion better off without making at least one other worse off. It is a concept of efficiency, not necessarily of fairness (see the sketch below).
Bargaining: This is a process of negotiation between parties who voluntarily agree to reach a mutually beneficial compromise or agreement on a matter of common interest. Bargaining often involves give-and-take and requires an understanding of the other party's goals, limitations, and willingness to cooperate.
Citation: Dalle Generated Image, Flash Crashes in Multi-Agent Systems Using Minority Games And Reinforcement Learning to Test AI Safety
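The standard prisoner's dilemma ties these terms together: mutual defection is the unique Nash equilibrium, yet it is Pareto-dominated by mutual cooperation, which is the collective action problem in miniature. A small brute-force check (payoff values are the textbook ones):

```python
from itertools import product

# Row/column payoffs: C = cooperate, D = defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
ACTIONS = ["C", "D"]

def is_nash(a1, a2):
    """No player can gain by unilaterally deviating."""
    r1, r2 = PAYOFF[(a1, a2)]
    return (all(PAYOFF[(d, a2)][0] <= r1 for d in ACTIONS) and
            all(PAYOFF[(a1, d)][1] <= r2 for d in ACTIONS))

def is_pareto_optimal(a1, a2):
    """No other outcome makes someone better off without hurting someone."""
    r1, r2 = PAYOFF[(a1, a2)]
    for b1, b2 in product(ACTIONS, ACTIONS):
        s1, s2 = PAYOFF[(b1, b2)]
        if s1 >= r1 and s2 >= r2 and (s1 > r1 or s2 > r2):
            return False
    return True

for outcome in product(ACTIONS, ACTIONS):
    print(outcome, "Nash:", is_nash(*outcome), "Pareto optimal:", is_pareto_optimal(*outcome))
```

Running it shows (D, D) is the only Nash equilibrium while (C, C) is Pareto optimal, i.e. rational self-interest lands the group on an outcome everyone would prefer to leave.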
Overview of AI Safety and Ethics
Current Progress
Citation: Deepmind (2)
Isn’t this a good thing?
How AI Might Go Horribly Wrong
Defining ‘Horribly’:
Existential Risk (X-risk): Refers to scenarios where an adverse event could cause human extinction or irreversibly and drastically curtail humanity’s potential. These risks encompass situations where AI systems could, either directly or indirectly, lead to outcomes that threaten the survival of human civilization.
Suffering Risk (S-risk): Considers the possibilities that AI systems might lead to a substantial increase in suffering. Unlike X-risks, which are often focused on the endpoint of human extinction, S-risks are concerned with ensuring that the future contains as little suffering as possible, regardless of the survival of humanity.
Risk Taxonomies
Citation: (1) An Overview of Catastrophic AI Risks, (2) TASRA: a Taxonomy and Analysis of Societal-Scale Risks from AI
NIST, Berkeley, and the Center for AI Safety all acknowledge the risks posed by AI in some form, but multi-agent risks are not well represented in existing taxonomies.
Major Fields of Safety Research
Robustness / Security
Model Alignment
Mechanistic Interpretability
Cooperative AI
Governance
Risks Posed by Multi-Agent Ecosystems
Multi-Agent Claims
Betting on Coordinated Systems
Citation: Allan Dafoe Graph
Interventions
Cooperative AI
Overview of Cooperative AI
Citation: Cooperative AI Foundation
Example: Open Source Game Theory
Citation: Dalle Generated Image, Hammond Cooperative AI
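One toy way to illustrate the open-source idea (a minimal sketch of my own, assuming agents can inspect each other's source code; it is not the construction from the cited work): a "clique" program cooperates exactly when its opponent's source is identical to its own, so mutual cooperation is stable among copies while defectors are punished.

```python
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent runs this exact program."""
    return "C" if opponent_source == inspect.getsource(clique_bot) else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's code."""
    return "D"

def play(p1, p2):
    # Each program receives the other's source before choosing an action.
    s1, s2 = inspect.getsource(p1), inspect.getsource(p2)
    return p1(s2), p2(s1)

print(play(clique_bot, clique_bot))   # ('C', 'C')
print(play(clique_bot, defect_bot))   # ('D', 'D')
```

The ability to condition on the opponent's code, not just its past behavior, is what lets cooperation become an equilibrium here.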
Limitations to Cooperative AI
Pros:
Cons:
Technical Problems:
Failure Modes
Convergence to Suboptimal Policies: When multiple agents interact, they may converge to policies that are locally but not globally optimal. These joint policies are often suboptimal Nash equilibria: no agent can improve by deviating alone, yet the lack of coordination or conflicting objectives leaves the group with inefficient or undesirable outcomes.
Catastrophic Forgetting: Agents may undergo catastrophic forgetting, where the acquisition of new knowledge leads to the abrupt forgetting of previously learned information, especially when agents have to adapt to new agents entering the system or changing policies of others.
Risk of Coordination Failures: Without explicit coordination mechanisms, agents can fail to act coherently, leading to misaligned objectives and missed opportunities for mutual benefit or risk mitigation.
Self-Fulfilling Prophecies and Feedback Loops: In situations where agents' perceptions of other agents' actions affect their own decisions, there is a risk of self-fulfilling prophecies and feedback loops, potentially escalating benign situations into harmful outcomes.
Tragedy of the Commons: Agents acting in self-interest can lead to the tragedy of the commons, where shared resources are over-exploited and depleted to the detriment of all involved parties (see the sketch after this list).
Shadowing and Free-Riding: Agents may engage in shadowing behavior, where they exploit the efforts of others, or free-riding, where they benefit from collective actions without contributing. These behaviors can degrade the overall system performance and fairness.
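A minimal sketch of the tragedy of the commons referenced above (all numbers are illustrative assumptions): four agents repeatedly harvest from a shared, regrowing resource. Restrained harvesting keeps the stock alive and ends up yielding more in total than greedy harvesting, which collapses the pool.

```python
def simulate_commons(strategies, rounds=20, stock=100.0, regrowth=1.15):
    """Shared resource: each round the agents harvest in turn, then the
    remaining stock regrows by a fixed factor."""
    totals = [0.0] * len(strategies)
    for _ in range(rounds):
        for i, take_fraction in enumerate(strategies):
            harvest = min(stock, take_fraction * stock)
            totals[i] += harvest
            stock -= harvest
        stock *= regrowth
    return round(stock, 1), [round(t, 1) for t in totals]

# Restrained agents leave enough for regrowth; greedy agents deplete the pool
# and end up with less in total, despite each acting in its own interest.
print("restrained:", simulate_commons([0.03] * 4))
print("greedy:    ", simulate_commons([0.30] * 4))
```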
Promising Research Directions
Benchmarking
Benchmarking measures an AI model's ability to make safe decisions, avoid harmful outcomes, and handle unexpected situations without human intervention. Safety benchmarks are designed to challenge AI systems across dimensions such as robustness, generalization, resistance to adversarial attacks, and alignment with ethical guidelines.
Example: Welfare Diplomacy Benchmark(1)
Citation: Dalle Generated Image, (1) Welfare diplomacy
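One simple way a cooperation benchmark can score multi-agent behavior (a sketch of the general idea, not the Welfare Diplomacy implementation) is to aggregate per-agent outcomes with social-welfare functions that reward both efficiency and equity:

```python
import math

def welfare_scores(agent_utilities):
    """Aggregate per-agent utilities into social-welfare metrics."""
    n = len(agent_utilities)
    return {
        "utilitarian": sum(agent_utilities) / n,                 # average outcome
        "egalitarian": min(agent_utilities),                     # worst-off agent
        "nash_welfare": math.prod(agent_utilities) ** (1 / n),   # balances both
    }

# Two hypothetical runs with the same total utility but different distributions.
print(welfare_scores([10, 10, 10, 10]))
print(welfare_scores([37, 1, 1, 1]))
```

The two runs have identical utilitarian scores but very different egalitarian and Nash-welfare scores, which is why the choice of welfare metric shapes what a benchmark actually rewards.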
Case Study: Benchmarking
Escalation Risks from Language Models in Military and Diplomatic Decision-Making
Authors: Gabriel Mukobi, Anka Reuel, Juan-Pablo Rivera, Max Lamparth, Chandler Smith, Jacquelyn Schneider
Abstract: The potential integration of autonomous agents in high-stakes military and foreign-policy decision-making has gained prominence, especially with the emergence of advanced generative AI models like GPT-4. This paper aims to scrutinize the behavior of multiple autonomous agents in simulated wargames, specifically focusing on their predilection to take escalatory actions which may exacerbate (multilateral) conflicts. Drawing on literature from political science and international relations on escalation dynamics, we design a scoring framework to assess the escalation potential of decisions made by these agents in different scenarios.
Contrary to prior qualitative studies, our research provides both qualitative and quantitative insights. We find that all five studied off-the-shelf language models lead to escalation and show signs of sudden and hard-to-predict escalations, even in neutral scenarios without predetermined conflicts. We observe that models tend to develop arms-race dynamics with each other, leading to greater conflict and, in isolated cases, to the deployment of nuclear weapons. Qualitatively, we also collect the models' reported reasoning for chosen actions and observe worrying justifications for, e.g., armed attacks.
Given the high stakes involved in military and foreign-policy contexts, the deployment of LLM-based autonomous agents demands further examination and cautious consideration.
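The scoring idea can be illustrated with a deliberately simplified sketch; the action categories and severity weights below are hypothetical stand-ins, not the framework used in the paper:

```python
# Hypothetical severity weights per action category (illustrative only;
# not the rubric used in the paper).
SEVERITY = {
    "de-escalation": -2,
    "posturing": 1,
    "non-violent escalation": 3,
    "violent escalation": 8,
    "nuclear": 20,
}

def escalation_score(actions_this_turn):
    """Sum the severity weights of one agent's actions in a single turn."""
    return sum(SEVERITY[action] for action in actions_this_turn)

turn = ["posturing", "non-violent escalation"]
print(escalation_score(turn))  # 4
```

Tracking such a per-turn score over a simulation is what makes sudden escalations visible quantitatively rather than only in the transcripts.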
Experiment Architecture
Results
Promising Research Directions (continued)
Scalable Oversight
Citation: Dalle Generated Image, (1) Scalable Oversight
Scalable oversight refers to methods that let a learning system effectively incorporate feedback from limited human oversight to improve its performance on complex tasks that humans cannot fully supervise.
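A minimal sketch of one way to stretch limited oversight (my illustration of the general idea, not a method from the cited reference): route an example to a human only when the model is unsure and a query budget remains, and accept the model's own judgment otherwise. The `model_predict` and `human_label` functions are placeholder stand-ins.

```python
import random

def model_predict(x):
    """Stand-in model: returns a (label, confidence) pair."""
    confidence = random.random()
    return ("safe" if confidence > 0.5 else "unsafe"), confidence

def human_label(x):
    """Stand-in for an expensive human judgment."""
    return "safe"

def label_dataset(inputs, confidence_threshold=0.8, human_budget=10):
    labels, queries = [], 0
    for x in inputs:
        prediction, confidence = model_predict(x)
        if confidence < confidence_threshold and queries < human_budget:
            labels.append(human_label(x))   # spend scarce oversight here
            queries += 1
        else:
            labels.append(prediction)       # trust the model elsewhere
    return labels, queries

labels, used = label_dataset(range(100))
print(f"human queries used: {used}/100")
```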
Credibility and Communication (Contract and Bargaining)
Example:
Citation: (1) Jakob Foerster Feedback, (2) Get It in Writing: Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL
Terms: C-Net = critic network, DRU = discretise/regularise unit, t = time, m and o denote messages and observations
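The contracting result cited above can be illustrated with a toy version (a sketch under simplified assumptions, not the paper's MARL setup): if agents can bind themselves to a penalty for defection before playing a prisoner's dilemma, defection stops being a best response and cooperation becomes stable.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def payoff_with_contract(a1, a2, penalty):
    """A binding contract transfers `penalty` from any defector to the other player."""
    r1, r2 = PAYOFF[(a1, a2)]
    if a1 == "D":
        r1, r2 = r1 - penalty, r2 + penalty
    if a2 == "D":
        r1, r2 = r1 + penalty, r2 - penalty
    return r1, r2

def best_response(opponent_action, penalty):
    """Action that maximizes player 1's contract-adjusted payoff."""
    return max("CD", key=lambda a: payoff_with_contract(a, opponent_action, penalty)[0])

for penalty in (0, 3):
    print(f"penalty={penalty}: best response to C is", best_response("C", penalty))
```

With no penalty the best response to cooperation is to defect; with a sufficiently large penalty it flips to cooperation, which is the sense in which credible contracts reshape the underlying game.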
Individual Model Alignment for Cooperative Qualities
Adversarial Training: A technique used to improve the robustness of machine learning models by exposing them to adversarial examples or attacks during training, aiming to make the model more resilient to manipulation or unseen perturbations.
Relaxed Adversarial Training: Rather than constructing concrete adversarial inputs, the overseer "describes" or "simulates" inputs that cannot actually be produced, asking "what would the model do if it encountered such an input?" A sketch of the standard (non-relaxed) form follows below.
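A compact sketch of a standard adversarial training step, assuming a PyTorch classifier, loss function, and optimizer are already defined; the FGSM-style perturbation and hyperparameters are illustrative choices, not a prescription.

```python
import torch

def adversarial_training_step(model, loss_fn, optimizer, x, y, epsilon=0.1):
    """One training step on FGSM-perturbed inputs (illustrative hyperparameters)."""
    # 1. Compute the gradient of the loss with respect to the inputs.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    # 2. Perturb the input in the direction that increases the loss.
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
    # 3. Train the model on the perturbed example.
    optimizer.zero_grad()
    adv_loss = loss_fn(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```

In practice this step is typically mixed with training on clean examples so robustness gains do not come at too high a cost in ordinary accuracy.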
Robustness
Robustness refers to the ability of a model to maintain its performance and provide reliable predictions in the face of noisy, altered, or previously unseen inputs. Multi-agent systems additionally need to be robust to unexpected or adversarial behavior from the other agents they interact with.
Citation: Dalle Generated Image
Conclusion
Review
Next Steps: Continued Research
Limitations to this Research Direction
Citation: Dalle Generated Image
Q&A