1 of 39

Collective Artificial Intelligence

From Independent to Cooperative Models

Alfonso Amayuelas

Committee: William Wang (chair), Shiyu Chang, Xifeng Yan

PhD Major Area Exam, Spring 2025

05/02/25

Department of Computer Science


2 of 39

Agenda

  1. Intro: LLM Multi-Agent Systems (MAS)
    1. Agents
    2. Multi-Agent Systems
    3. Principles
    4. Current State
  2. Creating MASs
    • Team Optimization with Dynamic Agent Networks: DyLAN
    • A Working MAS Application: MAGIS
    • MAS Learning: Cooperative Fine-Tuning with MARL
  3. What is next?
    • My Research
    • Automated Society
    • Roadmap


3 of 39

LLM Agents

[Diagram: agent loop — omitted; flow: Environment → Reasoning → Tool/Action Selection → Action Execution]

  • LLM Agents = autonomous systems powered by LLMs and grounded in natural language
  • They receive information from the environment and interact with it through tools or actions
  • Applications = workflow automation: data analysis, software development, robot control
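A minimal sketch of this loop in Python (all names here are hypothetical; llm_call stands in for any chat-model API and is not a specific framework's interface):

    def llm_call(prompt: str) -> str:
        # Hypothetical stand-in for an LLM API call.
        # Expected to reply either "tool_name: args" or "FINISH: answer".
        raise NotImplementedError

    def run_agent(task: str, tools: dict, max_steps: int = 10) -> str:
        observation = task
        for _ in range(max_steps):
            # Reasoning: select a tool/action given the latest observation.
            decision = llm_call(
                f"Task: {task}\nObservation: {observation}\n"
                "Reply 'tool: args' or 'FINISH: answer'."
            )
            name, _, args = decision.partition(": ")
            if name == "FINISH":
                return args                      # task complete
            observation = tools[name](args)      # action execution in the environment
        return observation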


4 of 39

Multi-Agent Systems (MAS)

  • Multi-Agent LLM systems involve multiple agents working together to solve a task
    • Multi-agent debate has shown performance gains in factuality and reasoning [1]
  • Each agent can take on a specialized role
    • These roles can eventually be filled by task-specific LLMs
  • Agents communicate and coordinate to solve the task, enabling robustness and scalability

[1] Du, Yilun, et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." Forty-first International Conference on Machine Learning (ICML), 2024.

“A multi-agent system (MAS) is then defined as a collection of agents designed to interact through orchestration, enabling collective intelligence.”


5 of 39

Key Principles

  • Agent Initialization
    • Profiling via persona or expertise
  • Orchestration Process
    • Coordinating the different agents
  • Agent Team Optimization
    • Selecting the right agents

[2] Su, Yu, et al. "Language Agents: Foundations, Prospects, and Risks." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, 2024.

Current Application Areas: Social Simulations, Software Development, Increased Reasoning at Inference Time


6 of 39

Where are we?

Despite growing enthusiasm, the performance gains of MAS remain minimal.

Why?

An analysis of >200 traces across 7 MAS frameworks, annotated by humans and an LLM judge, shows:

  • Specification Issues (42%): poor prompt design, role misalignment, context loss
  • Inter-Agent Misalignment (37%): coordination breakdowns, ignored input, task derailment
  • Task Verification Failures (21%): incorrect, incomplete, or missing checks

Implications?

  • Some attribute this to present-day LLM limitations
  • But others argue that good MAS design requires organizational understanding

MAS Frameworks: MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus…

[3] Cemri, Mert, et al. "Why Do Multi-Agent LLM Systems Fail?" arXiv preprint arXiv:2503.13657 (2025).


7 of 39

Towards Collective AI

[Timeline, 2022 → 2026+ (PhD beginning at 2022): Release of Chatbots → Multi-Agent LLM Frameworks → Advanced Post-Training → Large-Scale RL → Agents & MCP → Complex Organizations & Cooperative Learning]

Moving towards AI agents that interact with, and learn from, the environment and other AI models.


8 of 39

A Dynamic LLM-powered Agent Network for Task-Oriented Agent Collaboration (DyLAN) [4]

Zijun Liu

Yanzhe Zhang

Peng Li

Yang Liu

Diyi Yang

Presented at COLM 2024


9 of 39

Optimizing Multi-Agent System Creation

Question: How can we automatically optimize LLM-MAS creation?

→ Does dynamic agent selection and communication improve task performance?

Contributions

  • DyLAN: a framework for task-oriented agent collaboration in 2 stages:
    • Team Optimization
    • Task Solving
  • New agent collaboration formulation
    • Introduces agent collaboration as a temporal feed-forward network
  • Increased performance
    • Results show better accuracy and efficiency


10 of 39

DyLAN Framework

  • 2-stage approach:
    • Team Optimization
    • Task Solving
  • Dynamic selection & structure:
    • Temporal Feed-Forward Network (T-FFN) as the communication/computation graph

[Figure: Dynamic LLM-Powered Agent Network]


11 of 39

Temporal Feed-Forward Networks

Definition

  • Components:
    • Nodes = agents at time steps (LLM agent or independent tool)
    • Edges = communication links
    • Layers = time steps
  • Dynamic message passing:
    • Forward: during task solving
    • Backward: agent evaluation and team optimization


12 of 39

How T-FFN Enables Dynamic Collaboration (1/2)

Forward Message Passing = Inference

  • At each time step, agents receive inputs (messages) from previous agents and generate responses.
  • Responses are generated using prompts or tool outputs.
  • Early Stopping: terminates interaction once ≥ 2/3 of the agents agree.

Agent Team Reformation

  • A ranking system scores agent outputs → selects the top agents for the next time step.
  • Filters out ineffective agents, making the network dynamic and task-specific.
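A minimal sketch of one forward step, assuming hypothetical agent objects with a respond() method and a ranker function; the 2/3-consensus check and top-k filtering follow the slide's description, not DyLAN's exact code:

    from collections import Counter

    def forward_step(agents, messages, ranker, top_k):
        # Inference: every agent answers given last step's messages.
        responses = {a.name: a.respond(messages) for a in agents}

        # Early stopping: terminate if >= 2/3 of the agents agree.
        answer, votes = Counter(responses.values()).most_common(1)[0]
        if votes * 3 >= 2 * len(agents):
            return answer, agents, True

        # Team reformation: keep only the top-k highest-rated agents.
        scores = {name: ranker(resp) for name, resp in responses.items()}
        kept = sorted(agents, key=lambda a: scores[a.name], reverse=True)[:top_k]
        return list(responses.values()), kept, False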


13 of 39

Agent Selection Using T-FFN (Team Optimization) (2/2)

Backward Message Passing (Evaluation):

  • Agents score their predecessors' outputs
  • Ratings propagate backward to compute each agent's Importance Score

Selection Algorithm:

  1. Propagation: rate predecessors' outputs
  2. Aggregation: backpropagate ratings to compute Importance Scores
  3. Selection: choose the top-k agents

→ Provides task-oriented, dynamic teams
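A sketch of steps 1-3 under one simple aggregation rule (rating-weighted credit summed backward through the layers; the paper's exact weighting may differ):

    def importance_scores(layers, ratings):
        # layers[t] = list of agent ids active at time step t.
        # ratings[(i, j)] = rating agent j (step t+1) gave predecessor i (step t).
        credit = {a: 1.0 / len(layers[-1]) for a in layers[-1]}  # uniform at last layer
        total = dict(credit)
        for t in range(len(layers) - 2, -1, -1):
            # Each agent's credit is the rating-weighted credit of its successors.
            credit = {
                i: sum(ratings.get((i, j), 0.0) * credit[j] for j in layers[t + 1])
                for i in layers[t]
            }
            for i, c in credit.items():
                total[i] = total.get(i, 0.0) + c
        return total

    def select_team(layers, ratings, k):
        scores = importance_scores(layers, ratings)
        return sorted(scores, key=scores.get, reverse=True)[:k]  # top-k agents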


14 of 39

Experimental Results

Model = GPT-4

DyLAN outperforms strong baselines:

  • +17.7% over Direct Execution (WebShop)
  • +10.2% over LLM Debate (MMLU)
  • Uses fewer API calls despite higher accuracy.

[Figures: accuracy on WebShop, MMLU, and HumanEval]

Benchmarks: HumanEval (coding), WebShop (decision-making), MMLU (general reasoning), MATH (mathematical reasoning).


15 of 39

Ablations & Analysis

  • Team Optimization works
    • Selecting fewer, high-contributing agents improves accuracy
    • 25% improvement in college math
  • Robustness of the Agent Importance Score
    • Evaluated under imbalanced roles
  • Early Stopping mechanism
    • Reduces cost with minimal accuracy impact
    • Reduces API calls by 45% (MATH), 66.2% (MMLU), 11.3% (HumanEval), 54.2% (WebShop)

[Figures: performance improvement on MMLU; varying the number of agents after optimization; ablation w/o early stopping (es) or agent team reformation (atr)]


16 of 39

Contributions & Discussion

Discussion

  1. Do the tasks require multi-agent communication? (These tasks were designed for single agents)
  2. Are the agents evaluated representative enough?
  3. How much of a difference does the system prompt make?

Contributions

  1. DyLAN framework: a novel 2-stage system for task-oriented multi-agent collaboration
  2. Agent Temporal Feed-Forward Network (T-FFN) and Agent Importance Score: an unsupervised, backprop-inspired metric for agents' contributions


17 of 39

MAGIS:

LLM-Based Multi-Agent Framework For Github Issue Resolution [5]

Wei Tao

Yucheng Zhou

Yanlin Wang

Wenqiang Zhang

Hongyu Zhang

Yu Cheng

Presented at NeurIPS 2024


18 of 39

MAS Application

Motivation: Solving real-world GitHub issues is a challenging problem for LLMs

Question: Can LLMs help solve issues better at scale and at pass@1?

→ e.g., LLMs solve only 2-4% of issues without agentic behaviors

Why: LLMs struggle with:

  • Repository-scale understanding
  • Long-context code reasoning
  • Accurate code modification


19 of 39

Empirical Finding on LLM Failures

Why do LLMs fail on GitHub issues?

GPT-4 solves ~2% of issues on SWE-Bench, compared to 67% on HumanEval (function-level code generation).

(RQ1) Why is the performance on GitHub issues limited?

  • Locating files to be modified is a tradeoff
    • More files = higher recall, but more context
  • Locating lines
    • Coverage ratio correlates with resolution success (statistically significant positive relation for Claude-2)
  • Complexity
    • More file changes → lower success rate

Logistic regression coefficients (effect of change complexity on resolution):

Model   # Files Corr.   # Functions Corr.
GPT-4   -25.15*         -25.15*
MAGIS   -1.55*          -1.55*

(* statistically significant)

[Figure: resolved ratio vs. coverage ratio]


20 of 39

MAGIS Framework

Introduces 4 agent roles, mirroring real software teams:

  • Manager: generates the plan and task flow
  • Custodian: locates relevant files via BM25 + memory-based filtering (see the sketch below)
  • Developer: performs code edits in small, localized steps
  • QA Engineer: reviews edits iteratively and provides feedback

Workflow: Planning → Coding → Review → Merge

[Figure: MAGIS, a collaborative multi-agent framework]
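A minimal sketch of the Custodian's file-locating step using the rank_bm25 package (the memory-based filtering is omitted, and whitespace tokenization is a simplification, not MAGIS's actual pipeline):

    from rank_bm25 import BM25Okapi

    def locate_files(issue_text: str, repo_files: dict, top_n: int = 5):
        # repo_files maps file path -> file content.
        paths = list(repo_files)
        corpus = [repo_files[p].lower().split() for p in paths]  # naive tokenization
        bm25 = BM25Okapi(corpus)
        scores = bm25.get_scores(issue_text.lower().split())
        ranked = sorted(zip(paths, scores), key=lambda x: x[1], reverse=True)
        return [p for p, _ in ranked[:top_n]]  # candidate files for the Developer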


21 of 39

Experiments

MAGIS Significantly Improves Issue Resolution

Evaluation: SWE-Bench comprises 2,294 real issues

(RQ2) Framework Effectiveness

  • MAGIS (GPT-4) obtains an 8x improvement over GPT-4

Ablations

  • w/o QA and w/o hints (comments) still outperform the base LLM

(RQ3) Planning Effectiveness

  • Manager: the evaluator shows high correlation between plans and reference changes
  • Custodian: MAGIS outperforms BM25 in recall-to-file ratio

(RQ4) Coding Effectiveness

  • Resolved ratio increases when line coverage > 0.6



22 of 39

Contributions & Discussion

Discussion

  • There could be more specialized agents
  • Better team formation and optimization is possible
  • These kinds of systems are deployed in the wild
    • Codex by OpenAI is based on similar multi-agent approaches

Takeaways

  • Empirical failure analysis of GitHub issues
  • New agent-based LLM framework
  • Demonstrated performance increase on SWE-Bench


23 of 39

Coevolving with the Other You:

Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning [6]

Hao Ma

Tianyi Hu

Zhiqiang Pu

Boyin Liu

Xiaolin Ai

Yanyan Liang

Min Chen

Presented at NeurIPS 2024


24 of 39

Motivation

Why is RL fine-tuning hard for LLMs?

  • Standard approach: RL methods like PPO
  • Key problems:
    • Large action space (32k+ vocabulary tokens)
    • Sparse reward (signal only after the full output)
    • Distribution collapse (outputs diverge from pre-trained behavior)

This motivates a stable and adaptive training paradigm → new training methods


25 of 39

Background: Proximal Policy Optimization (PPO)

Definition: an RL algorithm used to train LLMs (typically on human feedback = RLHF)

Intuition: avoid making too large a policy update (using only 1st-order optimization)

How it works

  1. The LLM generates outputs in response to prompts
  2. A reward model scores the outputs
  3. PPO adjusts the LLM's policy (output distribution) to maximize reward

Key Mechanisms

  • Clipping: prevents the policy from shifting too far from the current model
  • KL penalty: added to maintain similarity to the pre-trained model
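A sketch of the resulting per-batch loss in PyTorch (eps and beta are the usual clip range and KL weight; the KL term is a simple sample estimate against the frozen reference model, not the only choice in practice):

    import torch

    def ppo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.1):
        # Ratio between the updated policy and the policy that sampled the data.
        ratio = torch.exp(logp_new - logp_old)
        # Clipping: bound the update to [1 - eps, 1 + eps] times the old policy.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()
        # KL penalty: keep the policy close to the pre-trained reference model.
        kl_penalty = (logp_new - logp_ref).mean()
        return policy_loss + beta * kl_penalty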


26 of 39

The CORY Framework

  • Duplicate the LLM into 2 agents:
    • Pioneer: answers using the prompt
    • Observer: answers using the prompt + the pioneer's response
  • Training loop:
    • Shared reward for both agents
    • Periodic role switching: pioneer ↔ observer

[Figure: Cooperative Multi-Agent RL for LLMs (CORY)]


27 of 39

Mechanisms in CORY

Role Exchange

  • Every few iterations, roles are switched: pioneer ↔ observer
  • Prevents overfitting to a fixed prompt structure
  • Ensures both agents make progress towards the objective

Collective Reward

  • Both agents are optimized against the same shared reward, coupling their objectives

Knowledge Transfer

  • The observer sees the query and the pioneer's output
  • Learns to generate better responses via in-context learning
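A schematic of one CORY round in Python, following the slide's description (generate and ppo_update are hypothetical helpers, and summing the two task rewards is one plausible form of the shared signal, not necessarily the paper's exact definition):

    def cory_round(pioneer, observer, query, reward_fn, swap_roles: bool):
        # Knowledge transfer: the observer conditions on the pioneer's draft.
        y1 = pioneer.generate(query)
        y2 = observer.generate(f"{query}\nPioneer's response:\n{y1}")

        # Collective reward: both agents train on the same shared signal.
        shared = reward_fn(query, y1) + reward_fn(query, y2)
        pioneer.ppo_update(query, y1, shared)
        observer.ppo_update(query, y2, shared)

        # Role exchange: periodically swap so both agents play both roles.
        return (observer, pioneer) if swap_roles else (pioneer, observer)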


28 of 39

CORY as a multi-objective RL

What is the hypothesis behind CORY surpassing single-agent RL?

The tradeoff in traditional single-agent RL → multi-objective RL:

  • Maximize task reward
  • Minimize KL divergence

Pareto Frontier Perspective:

  • CORY shifts the learned policies closer to the Pareto frontier of this tradeoff
  • Empirical results support this idea
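In symbols, this is the standard KL-regularized objective (notation assumed here, not taken from the paper):

    \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_0(\cdot \mid x) \big)

where r is the task reward, \pi_0 the pre-trained policy, and \beta trades reward against divergence; the claim is that CORY's learned policies land closer to the Pareto frontier of this (reward, KL) tradeoff.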


29 of 39

Experiments (1/2): Objective Task

Task: GSM8K

  • Binary reward if the prediction matches the ground truth

Metrics: task reward, KL divergence, combined reward, pass@k

Model: Llama-2-7B-chat

Results

  • PPO is unstable and drops after 50 iterations; KL divergence explodes
  • CORY shows: (1) stable rewards, (2) lower KL divergence, (3) higher pass@k

Takeaway: CORY generalizes better


30 of 39

Experiments (2/2): Subjective Task

Task: IMDB sentiment completion (GPT-2-Large)

  • Reward model trained to recognize positive sentiment

Metrics: task reward, KL divergence, combined reward

Results

  • CORY and PPO match on task reward
  • PPO shows 2x higher KL divergence → distribution collapse
  • Pioneer and observer converge to similar performance


31 of 39

Ablation study

Ablations (IMDB reviews):

  • Model size: PPO on GPT-2-XL (1.5B) does not outperform CORY on GPT-2-Large (774M)
  • Knowledge transfer: removing it → unstable reward and increased KL divergence
  • Role exchange: removing it → KL divergence increases, though less for the observer


32 of 39

Contributions & Discussion

Discussion

  • No experiments were run on “agentic” tasks
  • Can it generalize to more than 2 agents?
  • Potential for learning cooperation/competition between agents

Contributions

  • Introduced CORY: a simple cooperative MARL framework for LLM fine-tuning
  • Key mechanisms: knowledge transfer and role exchange
  • Robust optimization results under multi-objective RL


33 of 39

About my PhD


34 of 39

PhD Research so far

Understanding LLMs

Investigate the capabilities of Large Language Models: how do they gain them, why do they work, and how can we improve them? (Focus on reasoning and robustness)

  • Knowledge of Knowledge [1]
  • Reasoning capabilities [2]
  • Political views [ongoing]
  • Attention Shift [mentor]

Agent Communication and Cooperation

In the future, LLMs will accomplish tasks by communicating with other LLMs. How robust and reliable is this, and how can we enable better cooperation?

  • Attack on agent dialogue [3]
  • [mentor] Improving multi-agent debate with uncertainty and attention scaling [4]
  • Multi-agent self-organization [5]

AI Applications

LLMs are a paradigm change that enables a wide range of new applications. What are these applications?

  • [mentor] AI4Code: how to transfer coding abilities to smaller models on LRPLs [6]
  • AI4Science: materials, physics [ongoing]


35 of 39

Massive Potential for Automated Society

https://princeton-nlp.github.io/language-agent-impact/

Step 1 – Automate Repetitive Digital Work: Agents can learn routine tasks but still lack human-level reliability.

Step 2 – Collaborate with Humans: Success in hybrid tasks needs strong communication and social skills.

Step 3 – Explore Creatively: Advanced tasks require self-driven exploration and innovation.


36 of 39

Roadmap

[Timeline: Spring '25 → Summer '26]

  • Spring '25: MAE
  • AI4Science Project: generating better synthesis recipes
  • Summer '25: Summer Internship @ Morgan Stanley
  • Multi-Agent Learning: Research Visit @ KAUST
  • Multi-Agent Systems Creation: project on a MAS application; theoretical approaches
  • Last Internship: TBD
  • PhD (continuing through Summer '26)


37 of 39

Thank you

Q & A

Alfonso Amayuelas

www.amayuelas.me

@alfonamayuelas



39 of 39

References

[1] Du, Yilun, et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." Forty-first International Conference on Machine Learning (ICML), 2024.

[2] Su, Yu, et al. "Language Agents: Foundations, Prospects, and Risks." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, 2024.

[3] Cemri, Mert, et al. "Why Do Multi-Agent LLM Systems Fail?" arXiv preprint arXiv:2503.13657, 2025.

[4] Liu, Zijun, et al. "A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration." First Conference on Language Modeling (COLM), 2024.

[5] Tao, Wei, et al. "MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution." Advances in Neural Information Processing Systems 37 (2024): 51963-51993.

[6] Ma, Hao, et al. "Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning." Advances in Neural Information Processing Systems 37 (2024): 15497-15525.
