1 of 39

Collective Artificial Intelligence

From Independent to Cooperative Models

Alfonso Amayuelas

Committee: William Wang (chair), Shiyu Chang, Xifeng Yan

PhD Major Area Exam, Spring 2025

05/02/25

Department of Computer Science


2 of 39

Agenda

  1. Intro: LLM Multi-Agent Systems (MAS)
    1. Agents
    2. Multi-Agent Systems
    3. Principles
    4. Current State
  2. Creating MASs
    • Team Optimization with Dynamic Agent Networks: DyLAN
    • A Working MAS Application: MAGIS
    • MAS Learning: Cooperative Fine-Tuning with MARL
  3. What is next?
    • My Research
    • Automated Society
    • Roadmap


3 of 39

LLM Agents

[Diagram: agent loop — omitted; flow: Environment → Reasoning → Tool/Action Selection → Action Execution]

  • LLM Agents = autonomous systems powered by LLMs and grounded in natural language
  • They receive information from the environment and interact with it through tools or actions
  • Applications = workflow automation: data analysis, software development, robot control
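A minimal sketch of this loop in Python (all names here are hypothetical; llm_call stands in for any chat-model API and is not a specific framework's interface):

    def llm_call(prompt: str) -> str:
        # Hypothetical stand-in for an LLM API call.
        # Expected to reply either "tool_name: args" or "FINISH: answer".
        raise NotImplementedError

    def run_agent(task: str, tools: dict, max_steps: int = 10) -> str:
        observation = task
        for _ in range(max_steps):
            # Reasoning: select a tool/action given the latest observation.
            decision = llm_call(
                f"Task: {task}\nObservation: {observation}\n"
                "Reply 'tool: args' or 'FINISH: answer'."
            )
            name, _, args = decision.partition(": ")
            if name == "FINISH":
                return args                      # task complete
            observation = tools[name](args)      # action execution in the environment
        return observation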


4 of 39

Multi-Agent Systems (MAS)

  • Multi-Agent LLM systems involve multiple agents working together to solve a task
    • Multi-agent debate has shown performance gains in factuality and reasoning [1]
  • Each agent can take on a specialized role
    • These roles can eventually be filled by task-specific LLMs
  • Agents communicate and coordinate to solve the task, enabling robustness and scalability

[1] Du, Yilun, et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." Forty-first International Conference on Machine Learning (ICML), 2024.

“A multi-agent system (MAS) is then defined as a collection of agents designed to interact through orchestration, enabling collective intelligence.”


5 of 39

Key Principles

  • Agent Initialization
    • Profiling via persona or expertise
  • Orchestration Process
    • Coordinating the different agents
  • Agent Team Optimization
    • Selecting the right agents

[2] Su, Yu, et al. "Language Agents: Foundations, Prospects, and Risks." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, 2024.

Current Application Areas: Social Simulations, Software Development, Increased Reasoning at Inference Time


6 of 39

Where are we?

Despite growing enthusiasm, the performance gains of MAS remain minimal.

Why?

An analysis of >200 traces across 7 MAS frameworks, annotated by humans and an LLM judge, shows:

  • Specification Issues (42%): poor prompt design, role misalignment, context loss
  • Inter-Agent Misalignment (37%): coordination breakdowns, ignored input, task derailment
  • Task Verification Failures (21%): incorrect, incomplete, or missing checks

Implications?

  • Some attribute this to present-day LLM limitations
  • But others argue that good MAS design requires organizational understanding

MAS Frameworks: MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus…

[3] Cemri, Mert, et al. "Why Do Multi-Agent LLM Systems Fail?" arXiv preprint arXiv:2503.13657 (2025).


7 of 39

Towards Collective AI

[Timeline, 2022 → 2026+ (PhD beginning at 2022): Release of Chatbots → Multi-Agent LLM Frameworks → Advanced Post-Training → Large-Scale RL → Agents & MCP → Complex Organizations & Cooperative Learning]

Moving towards AI agents that interact with, and learn from, the environment and other AI models.


8 of 39

A Dynamic LLM-powered Agent Network for Task-Oriented Agent Collaboration (DyLAN) [4]

Zijun Liu

Yanzhe Zhang

Peng Li

Yang Liu

Diyi Yang

Presented at COLM 2024


9 of 39

Optimizing Multi-Agent System Creation

Question: How can we automatically optimize LLM-MAS creation?

→ Does dynamic agent selection and communication improve task performance?

Contributions

  • DyLAN: a framework for task-oriented agent collaboration in 2 stages:
    • Team Optimization
    • Task Solving
  • New agent collaboration formulation
    • Introduces agent collaboration as a temporal feed-forward network
  • Increased performance
    • Results show better accuracy and efficiency


10 of 39

DyLAN Framework

  • 2-stage approach:
    • Team Optimization
    • Task Solving
  • Dynamic selection & structure:
    • Temporal Feed-Forward Network (T-FFN) as the communication/computation graph

[Figure: Dynamic LLM-Powered Agent Network]


11 of 39

Temporal Feed-Forward Networks

Definition

  • Components:
    • Nodes = agents at time steps (LLM agent or independent tool)
    • Edges = communication links
    • Layers = time steps
  • Dynamic message passing:
    • Forward: during task solving
    • Backward: agent evaluation and team optimization


12 of 39

How T-FFN Enables Dynamic Collaboration (1/2)

Forward Message Passing = Inference

  • At each time step, agents receive inputs (messages) from previous agents and generate responses.
  • Responses are generated using prompts or tool outputs.
  • Early Stopping: terminates interaction once ≥ 2/3 of the agents agree.

Agent Team Reformation

  • A ranking system scores agent outputs → selects the top agents for the next time step.
  • Filters out ineffective agents, making the network dynamic and task-specific.
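A minimal sketch of one forward step, assuming hypothetical agent objects with a respond() method and a ranker function; the 2/3-consensus check and top-k filtering follow the slide's description, not DyLAN's exact code:

    from collections import Counter

    def forward_step(agents, messages, ranker, top_k):
        # Inference: every agent answers given last step's messages.
        responses = {a.name: a.respond(messages) for a in agents}

        # Early stopping: terminate if >= 2/3 of the agents agree.
        answer, votes = Counter(responses.values()).most_common(1)[0]
        if votes * 3 >= 2 * len(agents):
            return answer, agents, True

        # Team reformation: keep only the top-k highest-rated agents.
        scores = {name: ranker(resp) for name, resp in responses.items()}
        kept = sorted(agents, key=lambda a: scores[a.name], reverse=True)[:top_k]
        return list(responses.values()), kept, False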


13 of 39

Agent Selection Using T-FFN (Team Optimization) (2/2)

Backward Message Passing (Evaluation):

  • Agents score their predecessors' outputs
  • Ratings propagate backward to compute each agent's Importance Score

Selection Algorithm:

  1. Propagation: rate predecessors' outputs
  2. Aggregation: backpropagate ratings to compute Importance Scores
  3. Selection: choose the top-k agents

→ Provides task-oriented, dynamic teams
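A sketch of steps 1-3 under one simple aggregation rule (rating-weighted credit summed backward through the layers; the paper's exact weighting may differ):

    def importance_scores(layers, ratings):
        # layers[t] = list of agent ids active at time step t.
        # ratings[(i, j)] = rating agent j (step t+1) gave predecessor i (step t).
        credit = {a: 1.0 / len(layers[-1]) for a in layers[-1]}  # uniform at last layer
        total = dict(credit)
        for t in range(len(layers) - 2, -1, -1):
            # Each agent's credit is the rating-weighted credit of its successors.
            credit = {
                i: sum(ratings.get((i, j), 0.0) * credit[j] for j in layers[t + 1])
                for i in layers[t]
            }
            for i, c in credit.items():
                total[i] = total.get(i, 0.0) + c
        return total

    def select_team(layers, ratings, k):
        scores = importance_scores(layers, ratings)
        return sorted(scores, key=scores.get, reverse=True)[:k]  # top-k agents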


14 of 39

Experimental Results

Model = GPT-4

DyLAN outperforms strong baselines:

  • +17.7% over Direct Execution (WebShop)
  • +10.2% over LLM Debate (MMLU)
  • Uses fewer API calls despite higher accuracy.

[Figures: accuracy on WebShop, MMLU, and HumanEval]

Benchmarks: HumanEval (coding), WebShop (decision-making), MMLU (general reasoning), MATH (mathematical reasoning).


15 of 39

Ablations & Analysis

  • Team Optimization works
    • Selecting fewer, high-contributing agents improves accuracy
    • 25% improvement in college math
  • Robustness of the Agent Importance Score
    • Evaluated under imbalanced roles
  • Early Stopping mechanism
    • Reduces cost with minimal accuracy impact
    • Reduces API calls by 45% (MATH), 66.2% (MMLU), 11.3% (HumanEval), 54.2% (WebShop)

[Figures: performance improvement on MMLU; varying the number of agents after optimization; ablation w/o early stopping (es) or agent team reformation (atr)]


16 of 39

Contributions & Discussion

Discussion

  1. Do the tasks require multi-agent communication? (These tasks were designed for single agents)
  2. Are the agents evaluated representative enough?
  3. How much of a difference does the system prompt make?

Contributions

  1. DyLAN framework: a novel 2-stage system for task-oriented multi-agent collaboration
  2. Agent Temporal Feed-Forward Network (T-FFN) and Agent Importance Score: an unsupervised, backprop-inspired metric for agents' contributions


17 of 39

MAGIS:

LLM-Based Multi-Agent Framework For Github Issue Resolution [5]

Wei Tao

Yucheng Zhou

Yanlin Wang

Wenqiang Zhang

Hongyu Zhang

Yu Cheng

Presented at NeurIPS 2024


18 of 39

MAS Application

Motivation: Solving real-world GitHub issues is a challenging problem for LLMs

Question: Can LLMs help solve issues better at scale and at pass@1?

→ e.g., LLMs solve only 2-4% of issues without agentic behaviors

Why: LLMs struggle with:

  • Repository-scale understanding
  • Long-context code reasoning
  • Accurate code modification


19 of 39

Empirical Finding on LLM Failures

Why do LLMs fail on GitHub issues?

GPT-4 solves ~2% of issues on SWE-Bench, compared to 67% on HumanEval (function-level code generation).

(RQ1) Why is the performance on GitHub issues limited?

  • Locating files to be modified is a tradeoff
    • More files = higher recall, but more context
  • Locating lines
    • Coverage ratio correlates with resolution success (statistically significant positive relation for Claude-2)
  • Complexity
    • More file changes → lower success rate

Logistic regression coefficients (effect of change complexity on resolution):

Model   # Files Corr.   # Functions Corr.
GPT-4   -25.15*         -25.15*
MAGIS   -1.55*          -1.55*

(* statistically significant)

[Figure: resolved ratio vs. coverage ratio]


20 of 39

MAGIS Framework

Introduces 4 agent roles, mirroring real software teams:

  • Manager: generates the plan and task flow
  • Custodian: locates relevant files via BM25 + memory-based filtering (see the sketch below)
  • Developer: performs code edits in small, localized steps
  • QA Engineer: reviews edits iteratively and provides feedback

Workflow: Planning → Coding → Review → Merge

[Figure: MAGIS, a collaborative multi-agent framework]
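A minimal sketch of the Custodian's file-locating step using the rank_bm25 package (the memory-based filtering is omitted, and whitespace tokenization is a simplification, not MAGIS's actual pipeline):

    from rank_bm25 import BM25Okapi

    def locate_files(issue_text: str, repo_files: dict, top_n: int = 5):
        # repo_files maps file path -> file content.
        paths = list(repo_files)
        corpus = [repo_files[p].lower().split() for p in paths]  # naive tokenization
        bm25 = BM25Okapi(corpus)
        scores = bm25.get_scores(issue_text.lower().split())
        ranked = sorted(zip(paths, scores), key=lambda x: x[1], reverse=True)
        return [p for p, _ in ranked[:top_n]]  # candidate files for the Developer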


21 of 39

Experiments

MAGIS Significantly Improves Issue Resolution

Evaluation: SWE-Bench comprises 2,294 real issues

(RQ2) Framework Effectiveness

  • MAGIS (GPT-4) obtains an 8x improvement over GPT-4

Ablations

  • w/o QA and w/o hints (comments) still outperform the base LLM

(RQ3) Planning Effectiveness

  • Manager: the evaluator shows high correlation between plans and reference changes
  • Custodian: MAGIS outperforms BM25 in recall-to-file ratio

(RQ4) Coding Effectiveness

  • Resolved ratio increases when line coverage > 0.6



22 of 39

Contributions & Discussion

Discussion

  • There could be more specialized agents
  • Better team formation and optimization is possible
  • These kinds of systems are deployed in the wild
    • Codex by OpenAI is based on similar multi-agent approaches

Takeaways

  • Empirical failure analysis of GitHub issues
  • New agent-based LLM framework
  • Demonstrated performance increase on SWE-Bench


23 of 39

Coevolving with the Other You:

Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning [6]

Hao Ma

Tianyi Hu

Zhiqiang Pu

Boyin Liu

Xiaolin Ai

Yanyan Liang

Min Chen

Presented at NeurIPS 2024


24 of 39

Motivation

Why is RL fine-tuning hard for LLMs?

  • Standard approach: RL methods like PPO
  • Key problems:
    • Large action space (32k+ vocabulary tokens)
    • Sparse reward (signal only after the full output)
    • Distribution collapse (outputs diverge from pre-trained behavior)

This motivates a stable and adaptive training paradigm → new training methods


25 of 39

Background: Proximal Policy Optimization (PPO)

Definition: an RL algorithm used to train LLMs (typically on human feedback = RLHF)

Intuition: avoid making too large a policy update (using only 1st-order optimization)

How it works

  1. The LLM generates outputs in response to prompts
  2. A reward model scores the outputs
  3. PPO adjusts the LLM's policy (output distribution) to maximize reward

Key Mechanisms

  • Clipping: prevents the policy from shifting too far from the current model
  • KL penalty: added to maintain similarity to the pre-trained model
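A sketch of the resulting per-batch loss in PyTorch (eps and beta are the usual clip range and KL weight; the KL term is a simple sample estimate against the frozen reference model, not the only choice in practice):

    import torch

    def ppo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.1):
        # Ratio between the updated policy and the policy that sampled the data.
        ratio = torch.exp(logp_new - logp_old)
        # Clipping: bound the update to [1 - eps, 1 + eps] times the old policy.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()
        # KL penalty: keep the policy close to the pre-trained reference model.
        kl_penalty = (logp_new - logp_ref).mean()
        return policy_loss + beta * kl_penalty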


26 of 39

The CORY Framework

  • Duplicate the LLM into 2 agents:
    • Pioneer: answers using the prompt
    • Observer: answers using the prompt + the pioneer's response
  • Training loop:
    • Shared reward for both agents
    • Periodic role switching: pioneer ↔ observer

[Figure: Cooperative Multi-Agent RL for LLMs (CORY)]


27 of 39

Mechanisms in CORY

Role Exchange

  • Every few iterations, roles are switched: pioneer ↔ observer
  • Prevents overfitting to a fixed prompt structure
  • Ensures both agents make progress towards the objective

Collective Reward

  • Both agents are optimized against the same shared reward, coupling their objectives

Knowledge Transfer

  • The observer sees the query and the pioneer's output
  • Learns to generate better responses via in-context learning
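A schematic of one CORY round in Python, following the slide's description (generate and ppo_update are hypothetical helpers, and summing the two task rewards is one plausible form of the shared signal, not necessarily the paper's exact definition):

    def cory_round(pioneer, observer, query, reward_fn, swap_roles: bool):
        # Knowledge transfer: the observer conditions on the pioneer's draft.
        y1 = pioneer.generate(query)
        y2 = observer.generate(f"{query}\nPioneer's response:\n{y1}")

        # Collective reward: both agents train on the same shared signal.
        shared = reward_fn(query, y1) + reward_fn(query, y2)
        pioneer.ppo_update(query, y1, shared)
        observer.ppo_update(query, y2, shared)

        # Role exchange: periodically swap so both agents play both roles.
        return (observer, pioneer) if swap_roles else (pioneer, observer)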


28 of 39

CORY as a multi-objective RL

What is the hypothesis behind CORY surpassing single-agent RL?

The tradeoff in traditional single-agent RL → multi-objective RL:

  • Maximize task reward
  • Minimize KL divergence

Pareto Frontier Perspective:

  • CORY shifts the learned policies closer to the Pareto frontier of this tradeoff
  • Empirical results support this idea
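In symbols, this is the standard KL-regularized objective (notation assumed here, not taken from the paper):

    \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_0(\cdot \mid x) \big)

where r is the task reward, \pi_0 the pre-trained policy, and \beta trades reward against divergence; the claim is that CORY's learned policies land closer to the Pareto frontier of this (reward, KL) tradeoff.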


29 of 39

Experiments (1/2): Objective Task

Task: GSM8K

  • Binary reward if the prediction matches the ground truth

Metrics: task reward, KL divergence, combined reward, pass@k

Model: Llama-2-7B-chat

Results

  • PPO is unstable and drops after 50 iterations; KL divergence explodes
  • CORY shows: (1) stable rewards, (2) lower KL divergence, (3) higher pass@k

Takeaway: CORY generalizes better


30 of 39

Experiments (2/2): Subjective Task

Task: IMDB sentiment completion (GPT-2-Large)

  • Reward model trained to recognize positive sentiment

Metrics: task reward, KL divergence, combined reward

Results

  • CORY and PPO match on task reward
  • PPO shows 2x higher KL divergence → distribution collapse
  • Pioneer and observer converge to similar performance


31 of 39

Ablation study

Ablations (IMDB reviews):

  • Model size: PPO on GPT-2-XL (1.5B) does not outperform CORY on GPT-2-Large (774M)
  • Knowledge transfer: removing it → unstable reward and increased KL divergence
  • Role exchange: removing it → KL divergence increases, though less for the observer


32 of 39

Contributions & Discussion

Discussion

  • No experiments were run on “agentic” tasks
  • Can it generalize to more than 2 agents?
  • Potential for learning cooperation/competition between agents

Contributions

  • Introduced CORY: a simple cooperative MARL framework for LLM fine-tuning
  • Key mechanisms: knowledge transfer and role exchange
  • Robust optimization results under multi-objective RL


33 of 39

About my PhD


34 of 39

PhD Research so far

Understanding LLMs

Investigate the capabilities of Large Language Models: how do they gain them, why do they work, and how can we improve them? (Focus on reasoning and robustness)

  • Knowledge of Knowledge [1]
  • Reasoning capabilities [2]
  • Political views [ongoing]
  • Attention Shift [mentor]

Agent Communication and Cooperation

In the future, LLMs will accomplish tasks by communicating with other LLMs. How robust and reliable is this, and how can we enable better cooperation?

  • Attack on agent dialogue [3]
  • [mentor] Improving multi-agent debate with uncertainty and attention scaling [4]
  • Multi-agent self-organization [5]

AI Applications

LLMs are a paradigm change that enables a wide range of new applications. What are these applications?

  • [mentor] AI4Code: how to transfer coding abilities to smaller models on LRPLs [6]
  • AI4Science: materials, physics [ongoing]


35 of 39

Massive Potential for Automated Society

https://princeton-nlp.github.io/language-agent-impact/

Step 1 – Automate Repetitive Digital Work: Agents can learn routine tasks but still lack human-level reliability.

Step 2 – Collaborate with Humans: Success in hybrid tasks needs strong communication and social skills.

Step 3 – Explore Creatively: Advanced tasks require self-driven exploration and innovation.


36 of 39

Roadmap

[Timeline: Spring '25 → Summer '26]

  • Spring '25: MAE
  • AI4Science Project: generating better synthesis recipes
  • Summer '25: Summer Internship @ Morgan Stanley
  • Multi-Agent Learning: Research Visit @ KAUST
  • Multi-Agent Systems Creation: project on a MAS application; theoretical approaches
  • Last Internship: TBD
  • PhD (continuing through Summer '26)


37 of 39

Thank you

Q & A

Alfonso Amayuelas

www.amayuelas.me

@alfonamayuelas



39 of 39

References

[1] Du, Yilun, et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." Forty-first International Conference on Machine Learning (ICML), 2024.

[2] Su, Yu, et al. "Language Agents: Foundations, Prospects, and Risks." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, 2024.

[3] Cemri, Mert, et al. "Why Do Multi-Agent LLM Systems Fail?" arXiv preprint arXiv:2503.13657, 2025.

[4] Liu, Zijun, et al. "A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration." First Conference on Language Modeling (COLM), 2024.

[5] Tao, Wei, et al. "MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution." Advances in Neural Information Processing Systems 37 (2024): 51963-51993.

[6] Ma, Hao, et al. "Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning." Advances in Neural Information Processing Systems 37 (2024): 15497-15525.
