1 of 57

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-04-23

Spring 2026

Lecture 17: Multi-Agent Systems

Contents adapted from UC Berkeley Agentic AI Course

2 of 57

Revisit: Generative Agents

Inference


3 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

4 of 57

Rising Trend of Multi-Agent System (MAS)

https://arxiv.org/pdf/2402.01680

5 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

6 of 57

Why Do We Need MAS?

  • A team of agents working together to finish complicated tasks;
  • Simulating a society of agents.

7 of 57

Why Do We Need MAS?

8 of 57

Example Question-Answering Multi-Agent System

9 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

10 of 57

MAS Architecture Dissection

11 of 57

MAS Architecture Dissection

  • Agents-Environment Interface
  • Agents Profiling
  • Agents Communication / Topology
  • Agents Capabilities Acquisition

12 of 57

Agents-Environment Interface

https://arxiv.org/pdf/2402.01680

13 of 57

Agents-Environment Interface

https://project-roco.github.io/

  • Sandbox: a simulated or virtual environment (e.g., a game or software sandbox)
  • Physical: a real-world environment in which agents act through sensors and actuators
  • None: no external environment; agents interact only with each other (e.g., pure debate over a question)

14 of 57

Agents Profiling

https://arxiv.org/pdf/2402.01680

Across MAS, agents assume distinct roles, each with a comprehensive description encompassing its characteristics, capabilities, behaviors, and constraints.

Agent Profiling Methods (a minimal sketch of the first two follows the list):

  • Pre-defined: agent profiles are explicitly specified by the system designers.
  • Model-generated: agent profiles are created by models, e.g., large language models.
  • Data-derived: agent profiles are constructed from pre-existing datasets.
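A minimal Python sketch of the first two profiling methods. The AgentProfile dataclass and the call_llm stub are illustrative assumptions, not part of any particular MAS framework.

    from dataclasses import dataclass, field

    @dataclass
    class AgentProfile:
        name: str
        role: str
        capabilities: list = field(default_factory=list)
        constraints: list = field(default_factory=list)

    def call_llm(prompt: str) -> str:
        # Placeholder for a real LLM API call (illustrative stub).
        return "A meticulous reviewer who checks factual claims and cites sources."

    # 1) Pre-defined: the system designer writes the profile by hand.
    reviewer = AgentProfile(
        name="Reviewer",
        role="Critiques drafts for factual and logical errors",
        capabilities=["critique", "fact-check"],
        constraints=["never edits the draft directly"],
    )

    # 2) Model-generated: an LLM drafts the role description from a task spec.
    task = "Collaboratively write a literature survey on multi-agent systems."
    generated = AgentProfile(
        name="GeneratedAgent",
        role=call_llm(f"Describe an agent role that would help with: {task}"),
    )

    print(reviewer.role)
    print(generated.role)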

15 of 57

Agents Communication / Topology

https://arxiv.org/pdf/2402.01680

Communication / Topology Paradigms:

  • Cooperative: agents work together towards a shared goal, typically exchanging information to enhance a collective solution.
  • Competitive: agents work towards their own goals, which may conflict with the goals of other agents.
  • Coopetition: agents compromise with each other, competing on one aspect while agreeing on another.
  • Debate: agents engage in argumentative interactions, presenting and defending their own viewpoints or solutions and critiquing those of others (a minimal debate-round sketch follows below).
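A minimal sketch of one way the debate paradigm can be run: two debaters revise their answers after seeing each other's, and a judge produces the final answer. The call_llm stub and the prompts are illustrative assumptions, not taken from a specific debate paper.

    def call_llm(role: str, prompt: str) -> str:
        # Stand-in for a real LLM call; returns a canned answer for illustration.
        return f"[{role}] answer based on: {prompt[:40]}..."

    def debate(question: str, rounds: int = 2) -> str:
        roles = ["Debater A", "Debater B"]
        answers = {r: call_llm(r, question) for r in roles}
        for _ in range(rounds):
            for me in roles:
                other = [r for r in roles if r != me][0]
                prompt = (f"Question: {question}\n"
                          f"Your previous answer: {answers[me]}\n"
                          f"Opponent's answer: {answers[other]}\n"
                          f"Critique the opponent and revise your answer.")
                answers[me] = call_llm(me, prompt)  # revise after seeing the opponent
        # A judge agent reads both final answers and produces the consensus.
        return call_llm("Judge", f"{question}\nA: {answers[roles[0]]}\nB: {answers[roles[1]]}")

    print(debate("Should the system cache retrieval results?"))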

16 of 57

Agents Communication / Topology

Multi-Agent Collaboration Mechanisms

17 of 57

Agents Communication / Topology

Multi-Agent Collaboration Types

https://arxiv.org/pdf/2501.06322

Each agent 𝑎 is equipped with different tools or capabilities through its system prompt 𝑟.

  1. In cooperation, agents leverage their individual specialties (e.g., writing, translation, research) to achieve a shared goal (e.g., academic writing); a minimal sketch follows below.
  2. In competition, agents compete and debate against each other for their own goals.
  3. In coopetition, agents compromise with each other, competing on one aspect while agreeing on another.
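A minimal sketch of type 1 (cooperation through specialties): each agent is defined only by a role-specific system prompt, and the shared artifact is passed down the line toward the common goal. Agent names, prompts, and the call_llm stub are illustrative assumptions.

    def call_llm(system_prompt: str, user_msg: str) -> str:
        # Stand-in for a real LLM call.
        return f"<output of '{system_prompt[:20]}...' on: {user_msg[:30]}...>"

    # Each agent is just (name, system prompt) here.
    PIPELINE = [
        ("Researcher", "You gather key facts and references on the topic."),
        ("Writer",     "You turn research notes into a coherent draft."),
        ("Translator", "You polish the draft into formal academic English."),
    ]

    def cooperate(task: str) -> str:
        artifact = task
        for name, role_prompt in PIPELINE:
            # Each specialist transforms the shared artifact toward the shared goal.
            artifact = call_llm(role_prompt, artifact)
            print(f"{name} finished its step.")
        return artifact

    print(cooperate("Write a short survey section on LLM-based multi-agent systems."))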

18 of 57

Agents Communication / Topology

Communication / Topology Structure:

https://arxiv.org/pdf/2602.08567

19 of 57

Agents Communication / Topology

https://arxiv.org/pdf/2501.06322

More communication structures of MAS.
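One simple way to make these communication structures concrete is to treat the topology as a directed adjacency map over agent names and only deliver messages along its edges. The representation below is an illustrative assumption, not notation from the cited survey.

    from collections import defaultdict

    def make_star(center: str, workers: list) -> dict:
        """Centralized topology: all messages pass through the center agent."""
        edges = defaultdict(list)
        for w in workers:
            edges[center].append(w)   # center may message every worker
            edges[w].append(center)   # workers may only message the center
        return dict(edges)

    def make_chain(agents: list) -> dict:
        """Layered / pipeline topology: each agent forwards to the next one."""
        return {a: [b] for a, b in zip(agents, agents[1:])}

    def send(topology: dict, sender: str, message: str) -> None:
        # Deliver a message only along edges allowed by the topology.
        for receiver in topology.get(sender, []):
            print(f"{sender} -> {receiver}: {message}")

    star = make_star("Planner", ["Coder", "Tester", "Reviewer"])
    chain = make_chain(["Researcher", "Writer", "Editor"])
    send(star, "Planner", "Please start subtask #1.")
    send(chain, "Researcher", "Here are my notes.")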

20 of 57

Decentralized Agents

https://arxiv.org/pdf/2509.22502

InfiAgent Framework Architecture.

  • Left: an intuitive pyramid-like agent organization in InfiAgent, featuring a Router that redirects all user queries directly, avoiding layer-by-layer tool search (a simplified router sketch follows below).
  • Right: InfiAgent’s framework workflow schematic, highlighting three key modules: Router, Self Evolution, and Context Control.
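A simplified router dispatch sketch in the spirit of the description above; it is not InfiAgent's actual implementation. The router sees short descriptions of all agents at once and sends the query straight to one of them, instead of searching the hierarchy layer by layer.

    def call_llm(prompt: str) -> str:
        # Stand-in for the router's LLM; a keyword heuristic plays its part here.
        text = prompt.lower()
        if "rain" in text or "weather" in text:
            return "weather_agent"
        if "code" in text or "bug" in text:
            return "coding_agent"
        return "general_agent"

    AGENTS = {
        "weather_agent": lambda q: f"[weather_agent] forecast for: {q}",
        "coding_agent":  lambda q: f"[coding_agent] patch for: {q}",
        "general_agent": lambda q: f"[general_agent] answer to: {q}",
    }

    def route(query: str) -> str:
        # One router call replaces a layer-by-layer tool search.
        choice = call_llm(f"Pick one agent from {list(AGENTS)} for: {query}")
        return AGENTS.get(choice, AGENTS["general_agent"])(query)

    print(route("Will it rain in Shanghai tomorrow?"))
    print(route("Fix the bug in this sorting function."))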

21 of 57

Agents Capabilities Acquisition

https://arxiv.org/pdf/2402.01680

To enable agents to learn and evolve dynamically:

  • Feedback
    • critical information that agents receive about the outcome of their actions, helping the agents learn the potential impact of their actions and adapt to complex and dynamic problems.
    • 1) Feedback from Environment; 2) Feedback from Agents Interactions; 3) Human Feedback; 4) None
  • Agents Adjustment to Complex Problems
    • To enhance their capabilities, agents in LLM-MA systems can adapt through three main solutions (a minimal feedback-and-memory sketch follows below):
    • 1) Memory; 2) Evolution; 3) Dynamic Generation
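A minimal sketch showing how environment feedback and memory fit together in a single agent loop: every action's outcome is appended to memory, and the next action is conditioned on it. The Environment stub and prompts are illustrative assumptions.

    def call_llm(prompt: str) -> str:
        # Stand-in for the agent's LLM policy.
        return "retry with a smaller learning rate"

    class Environment:
        # Toy environment: only actions mentioning "smaller" succeed.
        def step(self, action: str):
            ok = "smaller" in action
            return ("success" if ok else "failure"), ok

    def run_agent(task: str, max_turns: int = 3) -> None:
        memory = []               # past (action, feedback) pairs
        env = Environment()
        for turn in range(max_turns):
            prompt = f"Task: {task}\nPast attempts: {memory}\nPropose the next action."
            action = call_llm(prompt)
            feedback, done = env.step(action)
            memory.append((action, feedback))   # feedback is kept, not discarded
            print(f"turn {turn}: {action} -> {feedback}")
            if done:
                break

    run_agent("Tune the optimizer so training converges.")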

22 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

23 of 57

Typical MAS Frameworks, Datasets, Benchmarks

  • CAMEL - Role-playing agents with inception prompting | Li et al., NeurIPS 2023
  • MetaGPT - SOPs + role specialization (PM, Architect, Engineer) | Hong et al., ICLR 2024
  • AutoGen - Customizable conversable agents with flexible conversation patterns | Wu et al., 2023
  • ChatDev - Virtual software company with specialized agent roles | Qian et al., ACL 2024
  • CrewAI - Role-based orchestration with autonomous collaboration | Joao Moura, 2024
  • LangGraph - Stateful multi-agent orchestration with shared state and conditional routing | LangChain, 2024

24 of 57

CAMEL: Communicative Agents for LLM Society

  • Role-playing mechanism: agents assume distinct social roles (human user, AI assistant) to maintain alignment
  • Inception prompting: carefully designed initial prompts guide agents toward task completion (a minimal role-playing sketch follows below)
  • Generates four datasets: AI Society, Code, Math, and Science for evaluating LLM cooperation
  • Demonstrates that autonomous cooperation among agents can scale through linguistic communication
  • Key insight: structured role assignments reduce the need for constant human supervision
  • Li et al., "CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society," NeurIPS 2023
  • CAMEL-AI: https://www.camel-ai.org/
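A minimal sketch of the role-playing loop with inception prompting described above. The prompt wording and the call_llm stub are illustrative assumptions, not CAMEL's actual prompts or API; see camel-ai.org for the real framework.

    def call_llm(system_prompt: str, history: list) -> str:
        # Stand-in for a real chat-model call.
        return f"({system_prompt.splitlines()[0]}) message #{len(history)}"

    def role_play(task: str, turns: int = 4) -> list:
        # Inception prompts: both agents learn the task and their role up front,
        # so the dialogue stays on-task without constant human supervision.
        user_sys = (f"You are a human user who wants: {task}\n"
                    "Give the assistant one instruction at a time.")
        assistant_sys = (f"You are an AI assistant completing: {task}\n"
                         "Follow the user's instruction and report your solution.")
        history = []
        instruction = f"Instruction: start working on '{task}'."
        for _ in range(turns):
            solution = call_llm(assistant_sys, history + [instruction])
            history += [instruction, solution]
            instruction = call_llm(user_sys, history)   # user agent issues the next step
        return history

    for message in role_play("Develop a trading bot for the stock market"):
        print(message)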

25 of 57

CAMEL: Communicative Agents for LLM Society

  • Li et al., "CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society," NeurIPS 2023
  • CAMEL-AI: https://www.camel-ai.org/

26 of 57

MetaGPT: Meta Programming for Multi-Agent Collaboration

  • SOPs (Standardized Operating Procedures): encodes human domain workflows into prompt sequences
  • Five specialized roles: ProductManager, Architect, ProjectManager, Engineer, QA Engineer
  • Assembly-line paradigm: complex tasks decomposed into sequential subtasks with verification (a minimal sketch follows below)
  • Reduces cascading hallucinations through structured intermediate validation
  • State-of-the-art results on HumanEval and MBPP code generation benchmarks
  • Hong et al., "MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework," ICLR 2024
  • https://github.com/FoundationAgents/MetaGPT
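A minimal sketch of the assembly-line idea with intermediate validation; this is not MetaGPT's actual code. Each role turns the upstream artifact into its own structured output, and a simple check gates the handoff so that bad intermediate results do not cascade.

    def call_llm(role: str, task: str, upstream: str) -> str:
        # Stand-in for a real LLM call producing this role's artifact.
        return f"{role} artifact for '{task}' based on: {upstream[:40]}"

    def validate(artifact: str) -> bool:
        # Toy check standing in for schema validation or peer review.
        return len(artifact.strip()) > 0

    SOP = ["ProductManager", "Architect", "ProjectManager", "Engineer", "QAEngineer"]

    def run_sop(task: str) -> str:
        artifact = task
        for role in SOP:
            candidate = call_llm(role, task, artifact)
            if not validate(candidate):
                # Retry once instead of passing a bad artifact downstream.
                candidate = call_llm(role, task, artifact + " (please revise)")
            artifact = candidate
            print(f"{role}: handoff complete")
        return artifact

    print(run_sop("Build a command-line to-do list app"))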

27 of 57

MetaGPT: Meta Programming for Multi-Agent Collaboration

  • Hong et al., "MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework," ICLR 2024
  • https://github.com/FoundationAgents/MetaGPT

28 of 57

AutoGen | CrewAI | LangGraph

Platforms that lower barriers for building MAS without specialized multi-agent expertise:

  • CrewAI excels for structured roles;
  • LangGraph for complex branching;
  • AutoGen for conversational, human-in-the-loop tasks;

  • AutoGen: customizable conversable agents supporting LLMs, human input, and tool integration
    • Flexible conversation patterns: two-agent chats, group chats, hierarchical task decomposition
    • Conversation programming paradigm: multi-agent workflows abstracted as inter-agent conversations (a framework-agnostic sketch follows below)
  • CrewAI: role-based orchestration with autonomous collaboration and task delegation
    • CrewAI dual architecture: Crews for autonomous collaboration vs. Flows for deterministic control
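A framework-agnostic sketch of the conversation-programming idea; it deliberately does not use the actual AutoGen, CrewAI, or LangGraph APIs. The workflow is just two conversable agents exchanging messages until a termination condition fires, and the reply method is where an LLM, a human, or a tool executor would plug in.

    class ConversableAgent:
        def __init__(self, name: str, system_prompt: str):
            self.name = name
            self.system_prompt = system_prompt

        def reply(self, message: str, turn: int) -> str:
            # Stand-in for an LLM / human / tool-executor reply.
            if turn >= 2:
                return "TERMINATE"
            return f"{self.name}: my response to '{message[:30]}...'"

    def two_agent_chat(a: ConversableAgent, b: ConversableAgent, opening: str, max_turns: int = 6) -> None:
        speaker, listener, msg = a, b, opening
        for turn in range(max_turns):
            reply = listener.reply(msg, turn)
            print(f"{listener.name} (to {speaker.name}): {reply}")
            if "TERMINATE" in reply:        # termination condition ends the workflow
                break
            speaker, listener, msg = listener, speaker, reply

    assistant = ConversableAgent("Assistant", "You solve the task step by step.")
    critic = ConversableAgent("Critic", "You point out flaws in the assistant's plan.")
    two_agent_chat(critic, assistant, "Draft a plan for a literature review.")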


30 of 57

MAS Datasets and Benchmarks

  • AgentBench: evaluation across 8 environments (OS, DB, Web, Games) - Liu et al., ICLR 2024
  • SOTOPIA: 90 social scenarios testing negotiation, collaboration, competition - Zhou et al., ICLR 2024
  • ChatArena: multi-agent language game environment for evaluating communication abilities
  • Communicative Agent Benchmark: evaluates role-playing, task completion, cooperation quality
  • Multi-Agent Debate benchmarks: assess reasoning improvement and hallucination reduction
  • ……
  • Key challenge: evaluating emergent behaviors in open-ended multi-agent interactions

31 of 57

AgentBench

  • AgentBench assesses the ability of LLM-as-Agent to reason and make decisions in multi-turn open-ended settings. It evaluates agents across eight environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing.

32 of 57

WebArena

  • WebArena is a benchmark and a self-hosted environment for autonomous agents performing web tasks. The environment simulates scenarios in four realistic domains: e-commerce, social forums, collaborative code development, and content management.
  • The benchmark evaluates functional correctness, where success means the agent achieves the final goal, independent of how it gets there. It encompasses 812 templated tasks and their variations, like browsing an e-commerce site, managing a forum, editing code repositories, and interacting with content management systems.

33 of 57

GAIA

  • GAIA is a benchmark for general AI assistants. It presents real-world questions requiring reasoning, multimodality handling, and tool-use proficiency. The dataset comprises 466 human-annotated tasks that mix text questions with attached context, e.g., images or files. The tasks cover various assistant use cases such as daily personal tasks, science, and general knowledge.

34 of 57

MINT

  • MINT evaluates LLMs' ability to solve tasks with multi-turn interactions using tools and leveraging natural language feedback. Within this framework, LLMs access tools by executing Python code and receive users' feedback simulated by GPT-4.
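A minimal sketch of the interaction pattern MINT evaluates, not the benchmark's actual harness: the model emits a Python snippet, the harness executes it, and the execution output plus simulated natural-language feedback are fed into the next turn. The two stub functions are illustrative assumptions.

    import contextlib
    import io

    def call_llm(history: str) -> str:
        # Stand-in for the evaluated LLM; it answers with a Python snippet.
        return "print(sum(i * i for i in range(1, 11)))"

    def simulated_feedback(output: str) -> str:
        # Stand-in for the GPT-4-simulated user feedback described above.
        return "Looks correct, you can stop." if output.strip() else "Empty output, try again."

    def run_episode(task: str, max_turns: int = 3) -> None:
        history = f"Task: {task}\n"
        for turn in range(max_turns):
            code = call_llm(history)
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec(code, {})    # executes the model's code (no sandboxing in this sketch)
            output = buffer.getvalue()
            feedback = simulated_feedback(output)
            history += f"Code:\n{code}\nOutput: {output}Feedback: {feedback}\n"
            print(f"turn {turn}: output={output.strip()!r}, feedback={feedback!r}")
            if "stop" in feedback:
                break

    run_episode("Compute the sum of squares of 1..10.")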

35 of 57

ColBench

  • ColBench is a multi-turn benchmark that evaluates LLMs as collaborative agents working with simulated human partners. It focuses on backend coding and frontend design that require step-by-step collaboration: the model suggests code/design drafts, receives feedback, and refines iteratively – simulating a realistic development workflow.

36 of 57

ToolEmu

  • ToolEmu focuses on identifying risky behaviors of LLM agents when using tools. The benchmark contains 36 high-stakes tools and 144 test cases, covering scenarios where agent misuse could lead to serious consequences.
  • The framework simulates tool execution without actual tool infrastructure – this sandbox approach allows rapid and flexible prototyping. Alongside the emulator, the authors suggest an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks.

37 of 57

MetaTool

  • MetaTool is a benchmark designed to evaluate whether LLMs “know” when to use tools and can correctly choose the right tool from a set of options. Within the benchmark, authors also introduce a new evaluation dataset – ToolE. It contains over 21,000 prompts labeled with ground-truth tool assignments, including both single-tool and multi-tool scenarios.
  • The tasks cover both tool usage awareness and tool selection scenarios. Additionally, four subtasks are defined to evaluate different dimensions of tool selection: tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection.

38 of 57

Today’s Plan

  • Multi-Agent System Overview
  • Why do we need MAS?
  • MAS Architecture Dissection
  • Typical MAS Frameworks, Datasets, Benchmarks
  • MAS Applications

39 of 57

MAS for Scientific Research

  • Agents with complementary expertise (theorists, experimentalists, reviewers) collaborate on research
  • Collaborative hypothesis generation, evaluation by multiple agents, iterative refinement
  • Multi-agent debate: agents propose explanations, debate leads to higher-quality consensus
  • Reduces factual errors and improves reasoning depth through structured argumentation
  • Group diversity and intrinsic reasoning strength are dominant drivers of success
  • Applications: experiment design, protocol planning, result interpretation

40 of 57

General Deep Research System

https://arxiv.org/pdf/2512.02038

41 of 57

Deep Research Agents

OpenAI Deep Research: You give it a prompt, and ChatGPT will find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst. 

42 of 57

Deep Research Agents Framework

43 of 57

Ask Clarification Questions

Deep Research Agents Framework

44 of 57

Ask Clarification Questions

Search Multiple Rounds

Deep Research Agents Framework

45 of 57

Ask Clarification Questions

Search Multiple Rounds

Generate Report

Deep Research Agents Framework
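A minimal sketch of this three-stage workflow (clarification questions, multi-round search, report generation). The call_llm and web_search stubs and the prompt strings are illustrative assumptions, not OpenAI's implementation.

    def call_llm(prompt: str) -> str:
        # Stand-in for the research LLM.
        if "clarifying" in prompt:
            return "Which time window and which regions should the report cover?"
        if "search query" in prompt:
            return "LLM multi-agent systems survey"
        return "## Report\n(synthesized findings with citations)"

    def web_search(query: str) -> list:
        # Stand-in for a real search API call.
        return [f"source about '{query}' #1", f"source about '{query}' #2"]

    def deep_research(user_prompt: str, rounds: int = 3) -> str:
        # 1) Ask clarification questions before doing anything expensive.
        question = call_llm(f"Ask one clarifying question about: {user_prompt}")
        answer = "Cover 2023-2025, worldwide."     # would come from the user
        context = f"{user_prompt}\nQ: {question}\nA: {answer}"
        # 2) Search over multiple rounds; each round can refine the next query.
        evidence = []
        for _ in range(rounds):
            query = call_llm(f"Write the next search query given: {context} {evidence[-2:]}")
            evidence += web_search(query)
        # 3) Generate the final report from the accumulated evidence.
        return call_llm(f"Write a report for '{user_prompt}' using: {evidence}")

    print(deep_research("Survey the state of LLM multi-agent systems."))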

46 of 57

What is behind Deep Research?

47 of 57

Pre-2025: Retrieval-Augmented Generation

  • Retrieval-based LM = Retrieval + LMs (commonly referred to as RAG)
  • It retrieves from an external datastore (a minimal retrieve-then-generate sketch follows below)
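A minimal retrieve-then-generate sketch: score documents against the query, prepend the top-k to the prompt, and let the LM answer from that context. The word-overlap scorer is a toy stand-in for a real sparse (BM25) or dense retriever, and call_llm is an illustrative stub.

    def call_llm(prompt: str) -> str:
        # Stand-in for the generator LM.
        return "Answer grounded in the retrieved passages above."

    DATASTORE = [
        "CAMEL uses role-playing and inception prompting for two-agent cooperation.",
        "MetaGPT encodes standardized operating procedures into role prompts.",
        "AgentBench evaluates LLM agents across eight interactive environments.",
    ]

    def score(query: str, doc: str) -> int:
        # Toy relevance score: word overlap between query and document.
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def rag_answer(query: str, k: int = 2) -> str:
        top_k = sorted(DATASTORE, key=lambda d: score(query, d), reverse=True)[:k]
        prompt = "Context:\n" + "\n".join(top_k) + f"\n\nQuestion: {query}\nAnswer:"
        return call_llm(prompt)

    print(rag_answer("Which framework uses inception prompting?"))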

48 of 57

Definition of Agentic Search

A natural evolution of retrieval-augmented generation (RAG), where LLM-based agents actively plan, execute, and refine search processes to achieve complex information-seeking goals.

49 of 57

My Understanding of Agentic Search

Goal: complex information seeking, which breaks down into three requirements (a minimal end-to-end sketch follows):

  • REQ1 Planning: parse the query and generate a plan
  • REQ2 Retrieval: find relevant information from different dimensions
  • REQ3 Reasoning: synthesize the information and output the report
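A minimal end-to-end sketch composing the three requirements; every function body is an illustrative stub, not a real planner, retriever, or reasoner.

    def call_llm(prompt: str) -> str:
        # Stand-in for the planning / reasoning LLM.
        if "sub-questions" in prompt:
            return "1. Which climbing shoes launched in mid-2025?\n2. Which of them are made in Asia?"
        return "Final report synthesized from the retrieved evidence."

    def retrieve(query: str) -> list:
        # Stand-in for retrieval over the web or a document collection.
        return [f"evidence for '{query}'"]

    def agentic_search(user_query: str) -> str:
        # REQ1 Planning: parse the query into sub-questions.
        plan = call_llm(f"Break this into sub-questions: {user_query}")
        sub_questions = [line.split(". ", 1)[1] for line in plan.splitlines()]
        # REQ2 Retrieval: gather evidence for each sub-question (different dimensions).
        evidence = [doc for q in sub_questions for doc in retrieve(q)]
        # REQ3 Reasoning: synthesize the evidence into the final report.
        return call_llm(f"Question: {user_query}\nEvidence: {evidence}\nWrite the report.")

    print(agentic_search("Research popular climbing shoes on the market in mid-2025 that are made in Asia."))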


51 of 57

Challenges in Retrieval: Paradigm Shift

Retrieval is always the foundation of RAG and agentic search systems: finding evidence to reason over.

52 of 57

Challenges in Retrieval: Paradigm Shift

  • With the help of Mid-training on synthetic data, LLM knowledge increases significantly

--- (Section 2.2) Knowledge data Rephrasing: Rephrasing Wikipedia 10 times

Retrieval is always the foundation of RAG and agentic search systems: finding evidence to reason over.

53 of 57

Challenges in Retrieval: Paradigm Shift

  • With the help of Mid-training on synthetic data, LLM knowledge increases significantly

--- (Section 2.2) Knowledge data Rephrasing: Rephrasing Wikipedia 10 times

Paradigm Shift: we must find new scenarios where retrieval actually helps, i.e., cases where LLMs do not have enough parametric knowledge.

Retrieval is always the foundation of RAG and agentic search systems: finding evidence to reason over.

54 of 57

Challenges in Retrieval: Reasoning-Intensive

Existing agentic search systems mainly use web search (e.g., calling the Google search API), which only covers Level 1 and Level 2 retrieval.

55 of 57

Reasoning-intensive Retrieval

Motivation: retrieval is the bottleneck, and existing retrievers only focus on relevance

Definition: retrieval that requires intensive reasoning to find the relevant information (e.g., documents)

56 of 57

Reasoning-intensive Retrieval

Motivation: retrieval is the bottleneck, and existing retrievers only focus on relevance

Definition: retrieval that requires intensive reasoning to find the relevant information (e.g., documents)

Example real-world scenario: given a coding problem, find related coding problems whose solutions share a similar algorithm (a minimal reranking sketch follows below).
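A minimal sketch of the scenario above: a first stage ranks candidates by surface relevance, then a reasoning step judges whether the candidates' solutions share the query's underlying algorithm and reranks accordingly. The keyword-based judge is a stand-in for an LLM that would actually reason over the solutions.

    CORPUS = {
        "two-sum":      "Find two numbers adding to a target (hash map).",
        "3-sum":        "Find triplets summing to zero (sorting + two pointers).",
        "max-subarray": "Maximum subarray sum (Kadane's dynamic programming).",
        "house-robber": "Maximum non-adjacent sum (dynamic programming).",
    }

    def surface_score(query: str, doc: str) -> int:
        # Level 1/2 relevance: plain word overlap, blind to the solution algorithm.
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def reasoning_judge(query: str, doc: str) -> float:
        # Stand-in for an LLM judging whether the *solutions* share an algorithm.
        return 1.0 if "dynamic programming" in doc else 0.0

    def reasoning_intensive_retrieve(query: str, k: int = 2) -> list:
        candidates = sorted(CORPUS, key=lambda p: surface_score(query, CORPUS[p]), reverse=True)
        # Rerank the surface candidates with the reasoning judge.
        return sorted(candidates, key=lambda p: reasoning_judge(query, CORPUS[p]), reverse=True)[:k]

    query = "Maximize the total value of houses robbed without choosing adjacent houses."
    print(reasoning_intensive_retrieve(query))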

57 of 57

Why is Reasoning-intensive Retrieval Important?

  • User queries in agentic search are often complex, with multiple detailed instructions
  • Deep research systems require multiple rounds of retrieval, but existing systems can only retrieve vague information
  • Better reasoning-intensive retrieval systems can accurately locate fine-grained and personalized information

Example query: “Could you research popular climbing shoes on the market in mid-2025 that are made in Asia?”