1 of 57

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-04-23

Spring 2026

Lecture 17: Multi-Agent Systems

Contents adapted from UC Berkeley Agentic AI Course

2 of 57

Revisit: Generative Agents

Inference


3 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

4 of 57

Rising Trend of Multi-Agent System (MAS)

https://arxiv.org/pdf/2402.01680

5 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

6 of 57

Why Do We Need MAS?

  • A team of agents working together to finish complicated tasks;
  • Simulating a society of agents.

7 of 57

Why Do We Need MAS?

8 of 57

Example Question-Answering Multi-Agent System

9 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

10 of 57

MAS Architecture Dissection

11 of 57

MAS Architecture Dissection

  • Agents-Environment Interface
  • Agents Profiling
  • Agents Communication / Topology
  • Agents Capabilities Acquisition

12 of 57

Agents-Environment Interface

https://arxiv.org/pdf/2402.01680

13 of 57

Agents-Environment Interface

https://project-roco.github.io/

  • Sandbox: a simulated or virtual environment (e.g., a game or software sandbox)
  • Physical: a real-world environment in which agents act through sensors and actuators
  • None: no external environment; agents interact only with each other (e.g., pure debate over a question)

14 of 57

Agents Profiling

https://arxiv.org/pdf/2402.01680

Across MAS, agents assume distinct roles, each with a comprehensive description encompassing its characteristics, capabilities, behaviors, and constraints.

Agent Profiling Methods (a minimal sketch of the first two follows the list):

  • Pre-defined: agent profiles are explicitly specified by the system designers.
  • Model-generated: agent profiles are created by models, e.g., large language models.
  • Data-derived: agent profiles are constructed from pre-existing datasets.
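A minimal Python sketch of the first two profiling methods. The AgentProfile dataclass and the call_llm stub are illustrative assumptions, not part of any particular MAS framework.

    from dataclasses import dataclass, field

    @dataclass
    class AgentProfile:
        name: str
        role: str
        capabilities: list = field(default_factory=list)
        constraints: list = field(default_factory=list)

    def call_llm(prompt: str) -> str:
        # Placeholder for a real LLM API call (illustrative stub).
        return "A meticulous reviewer who checks factual claims and cites sources."

    # 1) Pre-defined: the system designer writes the profile by hand.
    reviewer = AgentProfile(
        name="Reviewer",
        role="Critiques drafts for factual and logical errors",
        capabilities=["critique", "fact-check"],
        constraints=["never edits the draft directly"],
    )

    # 2) Model-generated: an LLM drafts the role description from a task spec.
    task = "Collaboratively write a literature survey on multi-agent systems."
    generated = AgentProfile(
        name="GeneratedAgent",
        role=call_llm(f"Describe an agent role that would help with: {task}"),
    )

    print(reviewer.role)
    print(generated.role)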

15 of 57

Agents Communication / Topology

https://arxiv.org/pdf/2402.01680

Communication / Topology Paradigms:

  • Cooperative: agents work together towards a shared goal, typically exchanging information to enhance a collective solution.
  • Competitive: agents work towards their own goals, which may conflict with the goals of other agents.
  • Coopetition: agents compromise with each other, competing on one aspect while agreeing on another.
  • Debate: agents engage in argumentative interactions, presenting and defending their own viewpoints or solutions and critiquing those of others (a minimal debate-round sketch follows below).
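A minimal sketch of one way the debate paradigm can be run: two debaters revise their answers after seeing each other's, and a judge produces the final answer. The call_llm stub and the prompts are illustrative assumptions, not taken from a specific debate paper.

    def call_llm(role: str, prompt: str) -> str:
        # Stand-in for a real LLM call; returns a canned answer for illustration.
        return f"[{role}] answer based on: {prompt[:40]}..."

    def debate(question: str, rounds: int = 2) -> str:
        roles = ["Debater A", "Debater B"]
        answers = {r: call_llm(r, question) for r in roles}
        for _ in range(rounds):
            for me in roles:
                other = [r for r in roles if r != me][0]
                prompt = (f"Question: {question}\n"
                          f"Your previous answer: {answers[me]}\n"
                          f"Opponent's answer: {answers[other]}\n"
                          f"Critique the opponent and revise your answer.")
                answers[me] = call_llm(me, prompt)  # revise after seeing the opponent
        # A judge agent reads both final answers and produces the consensus.
        return call_llm("Judge", f"{question}\nA: {answers[roles[0]]}\nB: {answers[roles[1]]}")

    print(debate("Should the system cache retrieval results?"))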

16 of 57

Agents Communication / Topology

Multi-Agent Collaboration Mechanisms

17 of 57

Agents Communication / Topology

Multi-Agent Collaboration Types

https://arxiv.org/pdf/2501.06322

Each agent 𝑎 is equipped with different tools or capabilities through its system prompt 𝑟.

  1. In cooperation, agents leverage their individual specialties (e.g., writing, translation, research) to achieve a shared goal (e.g., academic writing); a minimal sketch follows below.
  2. In competition, agents compete and debate against each other for their own goals.
  3. In coopetition, agents compromise with each other, competing on one aspect while agreeing on another.
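A minimal sketch of type 1 (cooperation through specialties): each agent is defined only by a role-specific system prompt, and the shared artifact is passed down the line toward the common goal. Agent names, prompts, and the call_llm stub are illustrative assumptions.

    def call_llm(system_prompt: str, user_msg: str) -> str:
        # Stand-in for a real LLM call.
        return f"<output of '{system_prompt[:20]}...' on: {user_msg[:30]}...>"

    # Each agent is just (name, system prompt) here.
    PIPELINE = [
        ("Researcher", "You gather key facts and references on the topic."),
        ("Writer",     "You turn research notes into a coherent draft."),
        ("Translator", "You polish the draft into formal academic English."),
    ]

    def cooperate(task: str) -> str:
        artifact = task
        for name, role_prompt in PIPELINE:
            # Each specialist transforms the shared artifact toward the shared goal.
            artifact = call_llm(role_prompt, artifact)
            print(f"{name} finished its step.")
        return artifact

    print(cooperate("Write a short survey section on LLM-based multi-agent systems."))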

18 of 57

Agents Communication / Topology

Communication / Topology Structure:

https://arxiv.org/pdf/2602.08567

19 of 57

Agents Communication / Topology

https://arxiv.org/pdf/2501.06322

More communication structures of MAS.
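One simple way to make these communication structures concrete is to treat the topology as a directed adjacency map over agent names and only deliver messages along its edges. The representation below is an illustrative assumption, not notation from the cited survey.

    from collections import defaultdict

    def make_star(center: str, workers: list) -> dict:
        """Centralized topology: all messages pass through the center agent."""
        edges = defaultdict(list)
        for w in workers:
            edges[center].append(w)   # center may message every worker
            edges[w].append(center)   # workers may only message the center
        return dict(edges)

    def make_chain(agents: list) -> dict:
        """Layered / pipeline topology: each agent forwards to the next one."""
        return {a: [b] for a, b in zip(agents, agents[1:])}

    def send(topology: dict, sender: str, message: str) -> None:
        # Deliver a message only along edges allowed by the topology.
        for receiver in topology.get(sender, []):
            print(f"{sender} -> {receiver}: {message}")

    star = make_star("Planner", ["Coder", "Tester", "Reviewer"])
    chain = make_chain(["Researcher", "Writer", "Editor"])
    send(star, "Planner", "Please start subtask #1.")
    send(chain, "Researcher", "Here are my notes.")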

20 of 57

Decentralized Agents

https://arxiv.org/pdf/2509.22502

InfiAgent Framework Architecture.

  • Left: an intuitive pyramid-like agent organization in InfiAgent, featuring a Router that redirects all user queries directly, avoiding layer-by-layer tool search (a simplified router sketch follows below).
  • Right: InfiAgent’s framework workflow schematic, highlighting three key modules: Router, Self Evolution, and Context Control.
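A simplified router dispatch sketch in the spirit of the description above; it is not InfiAgent's actual implementation. The router sees short descriptions of all agents at once and sends the query straight to one of them, instead of searching the hierarchy layer by layer.

    def call_llm(prompt: str) -> str:
        # Stand-in for the router's LLM; a keyword heuristic plays its part here.
        text = prompt.lower()
        if "rain" in text or "weather" in text:
            return "weather_agent"
        if "code" in text or "bug" in text:
            return "coding_agent"
        return "general_agent"

    AGENTS = {
        "weather_agent": lambda q: f"[weather_agent] forecast for: {q}",
        "coding_agent":  lambda q: f"[coding_agent] patch for: {q}",
        "general_agent": lambda q: f"[general_agent] answer to: {q}",
    }

    def route(query: str) -> str:
        # One router call replaces a layer-by-layer tool search.
        choice = call_llm(f"Pick one agent from {list(AGENTS)} for: {query}")
        return AGENTS.get(choice, AGENTS["general_agent"])(query)

    print(route("Will it rain in Shanghai tomorrow?"))
    print(route("Fix the bug in this sorting function."))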

21 of 57

Agents Capabilities Acquisition

https://arxiv.org/pdf/2402.01680

To enable agents to learn and evolve dynamically:

  • Feedback
    • critical information that agents receive about the outcome of their actions, helping the agents learn the potential impact of their actions and adapt to complex and dynamic problems.
    • 1) Feedback from Environment; 2) Feedback from Agents Interactions; 3) Human Feedback; 4) None
  • Agents Adjustment to Complex Problems
    • To enhance their capabilities, agents in LLM-MA systems can adapt through three main solutions (a minimal feedback-and-memory sketch follows below):
    • 1) Memory; 2) Evolution; 3) Dynamic Generation
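A minimal sketch showing how environment feedback and memory fit together in a single agent loop: every action's outcome is appended to memory, and the next action is conditioned on it. The Environment stub and prompts are illustrative assumptions.

    def call_llm(prompt: str) -> str:
        # Stand-in for the agent's LLM policy.
        return "retry with a smaller learning rate"

    class Environment:
        # Toy environment: only actions mentioning "smaller" succeed.
        def step(self, action: str):
            ok = "smaller" in action
            return ("success" if ok else "failure"), ok

    def run_agent(task: str, max_turns: int = 3) -> None:
        memory = []               # past (action, feedback) pairs
        env = Environment()
        for turn in range(max_turns):
            prompt = f"Task: {task}\nPast attempts: {memory}\nPropose the next action."
            action = call_llm(prompt)
            feedback, done = env.step(action)
            memory.append((action, feedback))   # feedback is kept, not discarded
            print(f"turn {turn}: {action} -> {feedback}")
            if done:
                break

    run_agent("Tune the optimizer so training converges.")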

22 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

23 of 57

Typical MAS Frameworks, Datasets, Benchmarks

  • CAMEL - Role-playing agents with inception prompting | Li et al., NeurIPS 2023
  • MetaGPT - SOPs + role specialization (PM, Architect, Engineer) | Hong et al., ICLR 2024
  • AutoGen - Customizable conversable agents with flexible conversation patterns | Wu et al., 2023
  • ChatDev - Virtual software company with specialized agent roles | Qian et al., ACL 2024
  • CrewAI - Role-based orchestration with autonomous collaboration | Joao Moura, 2024
  • LangGraph - Stateful multi-agent orchestration with shared state and conditional routing | LangChain, 2024

24 of 57

CAMEL: Communicative Agents for LLM Society

  • Role-playing mechanism: agents assume distinct social roles (human user, AI assistant) to maintain alignment
  • Inception prompting: carefully designed initial prompts guide agents toward task completion (a minimal role-playing sketch follows below)
  • Generates four datasets: AI Society, Code, Math, and Science for evaluating LLM cooperation
  • Demonstrates that autonomous cooperation among agents can scale through linguistic communication
  • Key insight: structured role assignments reduce the need for constant human supervision
  • Li et al., "CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society," NeurIPS 2023
  • CAMEL-AI: https://www.camel-ai.org/
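A minimal sketch of the role-playing loop with inception prompting described above. The prompt wording and the call_llm stub are illustrative assumptions, not CAMEL's actual prompts or API; see camel-ai.org for the real framework.

    def call_llm(system_prompt: str, history: list) -> str:
        # Stand-in for a real chat-model call.
        return f"({system_prompt.splitlines()[0]}) message #{len(history)}"

    def role_play(task: str, turns: int = 4) -> list:
        # Inception prompts: both agents learn the task and their role up front,
        # so the dialogue stays on-task without constant human supervision.
        user_sys = (f"You are a human user who wants: {task}\n"
                    "Give the assistant one instruction at a time.")
        assistant_sys = (f"You are an AI assistant completing: {task}\n"
                         "Follow the user's instruction and report your solution.")
        history = []
        instruction = f"Instruction: start working on '{task}'."
        for _ in range(turns):
            solution = call_llm(assistant_sys, history + [instruction])
            history += [instruction, solution]
            instruction = call_llm(user_sys, history)   # user agent issues the next step
        return history

    for message in role_play("Develop a trading bot for the stock market"):
        print(message)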

25 of 57

CAMEL: Communicative Agents for LLM Society

  • Li et al., "CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society," NeurIPS 2023
  • CAMEL-AI: https://www.camel-ai.org/

26 of 57

MetaGPT: Meta Programming for Multi-Agent Collaboration

  • SOPs (Standardized Operating Procedures): encodes human domain workflows into prompt sequences
  • Five specialized roles: ProductManager, Architect, ProjectManager, Engineer, QA Engineer
  • Assembly-line paradigm: complex tasks decomposed into sequential subtasks with verification (a minimal sketch follows below)
  • Reduces cascading hallucinations through structured intermediate validation
  • State-of-the-art results on HumanEval and MBPP code generation benchmarks
  • Hong et al., "MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework," ICLR 2024
  • https://github.com/FoundationAgents/MetaGPT
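A minimal sketch of the assembly-line idea with intermediate validation; this is not MetaGPT's actual code. Each role turns the upstream artifact into its own structured output, and a simple check gates the handoff so that bad intermediate results do not cascade.

    def call_llm(role: str, task: str, upstream: str) -> str:
        # Stand-in for a real LLM call producing this role's artifact.
        return f"{role} artifact for '{task}' based on: {upstream[:40]}"

    def validate(artifact: str) -> bool:
        # Toy check standing in for schema validation or peer review.
        return len(artifact.strip()) > 0

    SOP = ["ProductManager", "Architect", "ProjectManager", "Engineer", "QAEngineer"]

    def run_sop(task: str) -> str:
        artifact = task
        for role in SOP:
            candidate = call_llm(role, task, artifact)
            if not validate(candidate):
                # Retry once instead of passing a bad artifact downstream.
                candidate = call_llm(role, task, artifact + " (please revise)")
            artifact = candidate
            print(f"{role}: handoff complete")
        return artifact

    print(run_sop("Build a command-line to-do list app"))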

27 of 57

MetaGPT: Meta Programming for Multi-Agent Collaboration

  • Hong et al., "MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework," ICLR 2024
  • https://github.com/FoundationAgents/MetaGPT

28 of 57

AutoGen | CrewAI | LangGraph

Platforms that lower barriers for building MAS without specialized multi-agent expertise:

  • CrewAI excels for structured roles;
  • LangGraph for complex branching;
  • AutoGen for conversational, human-in-the-loop tasks;

  • AutoGen: customizable conversable agents supporting LLMs, human input, and tool integration
    • Flexible conversation patterns: two-agent chats, group chats, hierarchical task decomposition
    • Conversation programming paradigm: multi-agent workflows abstracted as inter-agent conversations (a framework-agnostic sketch follows below)
  • CrewAI: role-based orchestration with autonomous collaboration and task delegation
    • CrewAI dual architecture: Crews for autonomous collaboration vs. Flows for deterministic control
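A framework-agnostic sketch of the conversation-programming idea; it deliberately does not use the actual AutoGen, CrewAI, or LangGraph APIs. The workflow is just two conversable agents exchanging messages until a termination condition fires, and the reply method is where an LLM, a human, or a tool executor would plug in.

    class ConversableAgent:
        def __init__(self, name: str, system_prompt: str):
            self.name = name
            self.system_prompt = system_prompt

        def reply(self, message: str, turn: int) -> str:
            # Stand-in for an LLM / human / tool-executor reply.
            if turn >= 2:
                return "TERMINATE"
            return f"{self.name}: my response to '{message[:30]}...'"

    def two_agent_chat(a: ConversableAgent, b: ConversableAgent, opening: str, max_turns: int = 6) -> None:
        speaker, listener, msg = a, b, opening
        for turn in range(max_turns):
            reply = listener.reply(msg, turn)
            print(f"{listener.name} (to {speaker.name}): {reply}")
            if "TERMINATE" in reply:        # termination condition ends the workflow
                break
            speaker, listener, msg = listener, speaker, reply

    assistant = ConversableAgent("Assistant", "You solve the task step by step.")
    critic = ConversableAgent("Critic", "You point out flaws in the assistant's plan.")
    two_agent_chat(critic, assistant, "Draft a plan for a literature review.")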


30 of 57

MAS Datasets and Benchmarks

  • AgentBench: evaluation across 8 environments (OS, DB, Web, Games) - Liu et al., ICLR 2024
  • SOTOPIA: 90 social scenarios testing negotiation, collaboration, competition - Zhou et al., ICLR 2024
  • ChatArena: multi-agent language game environment for evaluating communication abilities
  • Communicative Agent Benchmark: evaluates role-playing, task completion, cooperation quality
  • Multi-Agent Debate benchmarks: assess reasoning improvement and hallucination reduction
  • ……
  • Key challenge: evaluating emergent behaviors in open-ended multi-agent interactions

31 of 57

AgentBench

  • AgentBench assesses the ability of LLM-as-Agent to reason and make decisions in multi-turn open-ended settings. It evaluates agents across eight environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing.

32 of 57

WebArena

  • WebArena is a benchmark and a self-hosted environment for autonomous agents performing web tasks. The environment simulates scenarios in four realistic domains: e-commerce, social forums, collaborative code development, and content management.
  • The benchmark evaluates functional correctness, where success means the agent achieves the final goal, independent of how it gets there. It encompasses 812 templated tasks and their variations, like browsing an e-commerce site, managing a forum, editing code repositories, and interacting with content management systems.

33 of 57

GAIA

  • GAIA is a benchmark for general AI assistants. It presents real-world questions requiring reasoning, multimodality handling, and tool-use proficiency. The dataset comprises 466 human-annotated tasks that mix text questions with attached context, e.g., images or files. The tasks cover various assistant use cases such as daily personal tasks, science, and general knowledge.

34 of 57

MINT

  • MINT evaluates LLMs' ability to solve tasks with multi-turn interactions using tools and leveraging natural language feedback. Within this framework, LLMs access tools by executing Python code and receive users' feedback simulated by GPT-4.
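A minimal sketch of the interaction pattern MINT evaluates, not the benchmark's actual harness: the model emits a Python snippet, the harness executes it, and the execution output plus simulated natural-language feedback are fed into the next turn. The two stub functions are illustrative assumptions.

    import contextlib
    import io

    def call_llm(history: str) -> str:
        # Stand-in for the evaluated LLM; it answers with a Python snippet.
        return "print(sum(i * i for i in range(1, 11)))"

    def simulated_feedback(output: str) -> str:
        # Stand-in for the GPT-4-simulated user feedback described above.
        return "Looks correct, you can stop." if output.strip() else "Empty output, try again."

    def run_episode(task: str, max_turns: int = 3) -> None:
        history = f"Task: {task}\n"
        for turn in range(max_turns):
            code = call_llm(history)
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec(code, {})    # executes the model's code (no sandboxing in this sketch)
            output = buffer.getvalue()
            feedback = simulated_feedback(output)
            history += f"Code:\n{code}\nOutput: {output}Feedback: {feedback}\n"
            print(f"turn {turn}: output={output.strip()!r}, feedback={feedback!r}")
            if "stop" in feedback:
                break

    run_episode("Compute the sum of squares of 1..10.")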

35 of 57

ColBench

  • ColBench is a multi-turn benchmark that evaluates LLMs as collaborative agents working with simulated human partners. It focuses on backend coding and frontend design that require step-by-step collaboration: the model suggests code/design drafts, receives feedback, and refines iteratively – simulating a realistic development workflow.

36 of 57

ToolEmu

  • ToolEmu focuses on identifying risky behaviors of LLM agents when using tools. The benchmark contains 36 high-stakes tools and 144 test cases, covering scenarios where agent misuse could lead to serious consequences.
  • The framework simulates tool execution without actual tool infrastructure – this sandbox approach allows rapid and flexible prototyping. Alongside the emulator, the authors suggest an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks.

37 of 57

MetaTool

  • MetaTool is a benchmark designed to evaluate whether LLMs “know” when to use tools and can correctly choose the right tool from a set of options. Within the benchmark, authors also introduce a new evaluation dataset – ToolE. It contains over 21,000 prompts labeled with ground-truth tool assignments, including both single-tool and multi-tool scenarios.
  • The tasks cover both tool usage awareness and tool selection scenarios. Additionally, four subtasks are defined to evaluate different dimensions of tool selection: tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection.

38 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

39 of 57

MAS for Scientific Research

  • Agents with complementary expertise (theorists, experimentalists, reviewers) collaborate on research
  • Collaborative hypothesis generation, evaluation by multiple agents, iterative refinement
  • Multi-agent debate: agents propose explanations, debate leads to higher-quality consensus
  • Reduces factual errors and improves reasoning depth through structured argumentation
  • Group diversity and intrinsic reasoning strength are dominant drivers of success
  • Applications: experiment design, protocol planning, result interpretation

40 of 57

General Deep Research System

https://arxiv.org/pdf/2512.02038

41 of 57

Deep Research Agents

OpenAI Deep Research: You give it a prompt, and ChatGPT will find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst. 

42 of 57

Deep Research Agents Framework

43 of 57

Ask Clarification Questions

Deep Research Agents Framework

44 of 57

Ask Clarification Questions

Search Multiple Rounds

Deep Research Agents Framework

45 of 57

Ask Clarification Questions

Search Multiple Rounds

Generate Report

Deep Research Agents Framework
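A minimal sketch of this three-stage workflow (clarification questions, multi-round search, report generation). The call_llm and web_search stubs and the prompt strings are illustrative assumptions, not OpenAI's implementation.

    def call_llm(prompt: str) -> str:
        # Stand-in for the research LLM.
        if "clarifying" in prompt:
            return "Which time window and which regions should the report cover?"
        if "search query" in prompt:
            return "LLM multi-agent systems survey"
        return "## Report\n(synthesized findings with citations)"

    def web_search(query: str) -> list:
        # Stand-in for a real search API call.
        return [f"source about '{query}' #1", f"source about '{query}' #2"]

    def deep_research(user_prompt: str, rounds: int = 3) -> str:
        # 1) Ask clarification questions before doing anything expensive.
        question = call_llm(f"Ask one clarifying question about: {user_prompt}")
        answer = "Cover 2023-2025, worldwide."     # would come from the user
        context = f"{user_prompt}\nQ: {question}\nA: {answer}"
        # 2) Search over multiple rounds; each round can refine the next query.
        evidence = []
        for _ in range(rounds):
            query = call_llm(f"Write the next search query given: {context} {evidence[-2:]}")
            evidence += web_search(query)
        # 3) Generate the final report from the accumulated evidence.
        return call_llm(f"Write a report for '{user_prompt}' using: {evidence}")

    print(deep_research("Survey the state of LLM multi-agent systems."))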

46 of 57

What is behind Deep Research?

47 of 57

Pre-2025: Retrieval-Augmented Generation

  • Retrieval-based LM = Retrieval + LMs (commonly referred to as RAG)
  • It retrieves from an external datastore (a minimal retrieve-then-generate sketch follows below)
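A minimal retrieve-then-generate sketch: score documents against the query, prepend the top-k to the prompt, and let the LM answer from that context. The word-overlap scorer is a toy stand-in for a real sparse (BM25) or dense retriever, and call_llm is an illustrative stub.

    def call_llm(prompt: str) -> str:
        # Stand-in for the generator LM.
        return "Answer grounded in the retrieved passages above."

    DATASTORE = [
        "CAMEL uses role-playing and inception prompting for two-agent cooperation.",
        "MetaGPT encodes standardized operating procedures into role prompts.",
        "AgentBench evaluates LLM agents across eight interactive environments.",
    ]

    def score(query: str, doc: str) -> int:
        # Toy relevance score: word overlap between query and document.
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def rag_answer(query: str, k: int = 2) -> str:
        top_k = sorted(DATASTORE, key=lambda d: score(query, d), reverse=True)[:k]
        prompt = "Context:\n" + "\n".join(top_k) + f"\n\nQuestion: {query}\nAnswer:"
        return call_llm(prompt)

    print(rag_answer("Which framework uses inception prompting?"))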

48 of 57

Definition of Agentic Search

A natural evolution of retrieval-augmented generation (RAG), where LLM-based agents actively plan, execute, and refine search processes to achieve complex information-seeking goals.

49 of 57

My Understanding of Agentic Search

Goal: complex information seeking, which breaks down into three requirements (a minimal end-to-end sketch follows):

  • REQ1 Planning: parse the query and generate a plan
  • REQ2 Retrieval: find relevant information from different dimensions
  • REQ3 Reasoning: synthesize the information and output the report
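A minimal end-to-end sketch composing the three requirements; every function body is an illustrative stub, not a real planner, retriever, or reasoner.

    def call_llm(prompt: str) -> str:
        # Stand-in for the planning / reasoning LLM.
        if "sub-questions" in prompt:
            return "1. Which climbing shoes launched in mid-2025?\n2. Which of them are made in Asia?"
        return "Final report synthesized from the retrieved evidence."

    def retrieve(query: str) -> list:
        # Stand-in for retrieval over the web or a document collection.
        return [f"evidence for '{query}'"]

    def agentic_search(user_query: str) -> str:
        # REQ1 Planning: parse the query into sub-questions.
        plan = call_llm(f"Break this into sub-questions: {user_query}")
        sub_questions = [line.split(". ", 1)[1] for line in plan.splitlines()]
        # REQ2 Retrieval: gather evidence for each sub-question (different dimensions).
        evidence = [doc for q in sub_questions for doc in retrieve(q)]
        # REQ3 Reasoning: synthesize the evidence into the final report.
        return call_llm(f"Question: {user_query}\nEvidence: {evidence}\nWrite the report.")

    print(agentic_search("Research popular climbing shoes on the market in mid-2025 that are made in Asia."))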


51 of 57

Challenges in Retrieval: Paradigm Shift

Retrieval is always the foundation of RAG and agentic search systems: finding evidence to reason over.

52 of 57

Challenges in Retrieval: Paradigm Shift

  • With the help of Mid-training on synthetic data, LLM knowledge increases significantly

--- (Section 2.2) Knowledge data Rephrasing: Rephrasing Wikipedia 10 times

Retrieval is always the foundation of RAG and agentic search systems: finding evidence to reason over.

53 of 57

Challenges in Retrieval: Paradigm Shift

  • With the help of Mid-training on synthetic data, LLM knowledge increases significantly

--- (Section 2.2) Knowledge data Rephrasing: Rephrasing Wikipedia 10 times

Paradigm Shift: we must find new scenarios where retrieval actually helps, i.e., cases where LLMs do not have enough parametric knowledge.

Retrieval is always the foundation of RAG and agentic search systems: finding evidence to reason over.

54 of 57

Challenges in Retrieval: Reasoning-Intensive

Existing agentic search systems mainly use web search (e.g., calling the Google search API), which only covers Level 1 and Level 2 retrieval.

55 of 57

Reasoning-intensive Retrieval

Motivation: retrieval is the bottleneck, and existing retrievers only focus on relevance

Definition: retrieval that requires intensive reasoning to find the relevant information (e.g., documents)

56 of 57

Reasoning-intensive Retrieval

Motivation: retrieval is the bottleneck, and existing retrievers only focus on relevance

Definition: retrieval that requires intensive reasoning to find the relevant information (e.g., documents)

Example real-world scenario: given a coding problem, find related coding problems whose solutions share a similar algorithm (a minimal reranking sketch follows below).
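A minimal sketch of the scenario above: a first stage ranks candidates by surface relevance, then a reasoning step judges whether the candidates' solutions share the query's underlying algorithm and reranks accordingly. The keyword-based judge is a stand-in for an LLM that would actually reason over the solutions.

    CORPUS = {
        "two-sum":      "Find two numbers adding to a target (hash map).",
        "3-sum":        "Find triplets summing to zero (sorting + two pointers).",
        "max-subarray": "Maximum subarray sum (Kadane's dynamic programming).",
        "house-robber": "Maximum non-adjacent sum (dynamic programming).",
    }

    def surface_score(query: str, doc: str) -> int:
        # Level 1/2 relevance: plain word overlap, blind to the solution algorithm.
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def reasoning_judge(query: str, doc: str) -> float:
        # Stand-in for an LLM judging whether the *solutions* share an algorithm.
        return 1.0 if "dynamic programming" in doc else 0.0

    def reasoning_intensive_retrieve(query: str, k: int = 2) -> list:
        candidates = sorted(CORPUS, key=lambda p: surface_score(query, CORPUS[p]), reverse=True)
        # Rerank the surface candidates with the reasoning judge.
        return sorted(candidates, key=lambda p: reasoning_judge(query, CORPUS[p]), reverse=True)[:k]

    query = "Maximize the total value of houses robbed without choosing adjacent houses."
    print(reasoning_intensive_retrieve(query))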

57 of 57

Why is Reasoning-intensive Retrieval Important?

  • User queries in agentic search are often complex, with multiple detailed instructions
  • Deep research systems require multiple rounds of retrieval, but existing systems can only retrieve vague information
  • Better reasoning-intensive retrieval systems can accurately locate fine-grained and personalized information

Example query: “Could you research popular climbing shoes on the market in mid-2025 that are made in Asia?”