1 of 182

Multi-Agent LLMs to Assist

Causal Reasoning in Science

1

Zhijing Jin

Assistant Professor at the University of Toronto

Research Scientist at Max Planck Institute

CIFAR AI Chair

Slides are shared at zhijing-jin.com

2 of 182

Complete Agenda

2

Causally Understanding LLMs

Robustness evaluation, Gender bias, User personalization, …

Causal Reasoning with LLMs

Causal relationship learning, Causal effect estimation, …

Line 1

Line 2

3 of 182

Use of LLMs in Many Domains

3

4 of 182

However, …

4

5 of 182

A Reliability Problem

https://www.bbc.com/news/articles/cd11gzejgz4o

5

6 of 182

Sometimes, Error = Wrong Imitation

https://www.threads.net/@crumbler/post/C7VGpYSPOgT

6

7 of 182

Is eating rocks really good for your health?

7

8 of 182

What Happens When LLMs Receive a Question?

8

The capital of France is Paris.

… …

Receiving a Question

v1.0: Direct Generation

v2.0: LLM + RAG (Retrieval Augmented Generation)


v3.0: LLM w/ Structured (Formal) Reasoning

9 of 182

We do not have knowledge of a thing until we have grasped its why, that is to say, its cause.

— Aristotle

9

10 of 182

Causal Graphical Model

10

Z

X

Y

  • Causal graph G := (V, E)
  • Nodes: Random variables
  • Directed edges: Causal links

Z

X

Y

X

Y
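A minimal sketch of encoding G := (V, E) in code; the confounder graph Z → X, Z → Y, X → Y used below is one hypothetical variant of the graphs on this slide:

```python
# Causal graph G = (V, E): nodes are random variables,
# directed edges are causal links. (Hypothetical example graph.)
edges = {("Z", "X"), ("Z", "Y"), ("X", "Y")}
nodes = {v for e in edges for v in e}

def parents(v):
    # Direct causes of v in the graph.
    return {a for (a, b) in edges if b == v}

print(sorted(nodes))         # ['X', 'Y', 'Z']
print(sorted(parents("Y")))  # ['X', 'Z']
```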

11 of 182

Chocolate Example

11

Chocolate

Nobel Prize

  • Causal graph

?

Confounders?

12 of 182

Causality for Machine Learning

12

Bernhard Schölkopf

(Max Planck Institute)

Yoshua Bengio

(Mila)

13 of 182

My Work: Causal NLP

13

Causal NLP

Understanding and improving causal reasoning in natural language processing (NLP) models

Z

X

Y

14 of 182

What role can LLMs serve for causality?

14

15 of 182

How LLMs Find Causal Relationships

15

Simple Question

“Why is ground wet?”

Answer

Rain, sprinkler, etc.

Complex Question

“Does increasing minimum wage affect the employment rate?”

Answer

“Increasing min wage does not reduce employment”

Prior knowledge,

Common sense

Data-driven reasoning

16 of 182

Our Survey of Commonsense Causality

(Covering 200+ Papers)

16

EMNLP 2024 Survey

17 of 182

Our Survey of Commonsense Causality

(Covering 200+ Papers)

17

Limitations: Cannot handle precise causal relations, or causality under question, etc.

18 of 182

As a Knowledge Base, Language Parser & Reasoner

Our CLadder paper (Page 27, Appendix): https://arxiv.org/pdf/2312.04350

18

19 of 182

Two Main Causal Inference Tasks

19

Task 2. Causal Effect Reasoning

Task 1. Causal Discovery

What variable affects what?

By how much?

X

Y

Z

X

Y

Z

What is the strength of the influence of X on Y?

?

Peter Spirtes

Algorithms: PC algorithm, FCI…
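The skeleton phase of the PC algorithm mentioned here can be sketched with a toy conditional-independence oracle; the chain X → Z → Y and the helper names are illustrative, not an implementation of the full algorithm:

```python
from itertools import combinations

# Toy sketch of the PC algorithm's skeleton phase, assuming a perfect
# conditional-independence oracle (hard-coded for the chain X -> Z -> Y).
def ci(a, b, cond):
    # In the chain X -> Z -> Y, X and Y are independent given Z.
    return {a, b} == {"X", "Y"} and "Z" in cond

def pc_skeleton(variables):
    # Start from the complete undirected graph ...
    edges = {frozenset(p) for p in combinations(variables, 2)}
    for edge in list(edges):
        a, b = sorted(edge)
        others = [v for v in variables if v not in edge]
        # ... and delete any edge whose endpoints are conditionally
        # independent given some subset of the remaining variables.
        for k in range(len(others) + 1):
            if any(ci(a, b, set(s)) for s in combinations(others, k)):
                edges.discard(edge)
                break
    return {tuple(sorted(e)) for e in edges}

print(pc_skeleton(["X", "Y", "Z"]))  # skeleton without the X-Y edge
```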

20 of 182

Two Main Causal Inference Tasks

20

Task 2. Causal Effect Reasoning

Task 1. Causal Discovery

What variable affects what?

By how much?

X

Y

Z

X

Y

Z

?

Many tech companies’ CEOs are college dropouts; does that mean “dropping out” causes “job success”?

How much does COVID vaccination increase the survival rate?

Example

Example

𝗖𝗼𝗿𝗿𝟮𝗖𝗮𝘂𝘀𝗲 Paper

(Jin et al, 2024 ICLR)

CLadder Paper

(Jin* et al, 2023 NeurIPS)

What is the strength of the influence of X on Y?

21 of 182

Accurate Causal Graph-Based Reasoning

  1. Utilizing LLM existing belief
  2. Letting LLMs read literature (RAG)
  3. Using data to verify hypothesis (CSV-based reasoning by statistical testing)
    1. E.g., “Does choosing a novel topic lead to a more successful scientific career?”
    2. Method 1: a direct Google search. Method 2: consult existing databases to see whether neat datasets have already been composed
  4. Letting LLMs think critically (logical thinking)
    • Questioning the necessity and sufficiency of the causal relation
    • Questioning the hidden conditions when the cause→effect, or proposing important mediators [Reference: Freakonomics]
  5. Encouraging innovative hypothesis by LLMs (creativity)
    • E.g., “Why does eating lemons and oranges prevent scurvy?”

21

22 of 182

Needs for Reasoning

https://www.nytimes.com/2024/01/08/technology/ai-robots-chatbots-2024.html

22

23 of 182

23

Can Large Language Models Infer Causation from Correlation?

Rada Mihalcea

Mrinmaya Sachan

Zhijing Jin*

Zhiheng Lyu

Mona Diab

Bernhard Schoelkopf

Spencer Poff

Jiarui Liu*

ICLR 2024

24 of 182

Correlation vs. Causation

Imagine we have a common sense dataset / news dataset with this sentence:

What should our model infer?

24

As ice cream sales increase,

the rate of shark attacks sharply increases.

Hypothesis 1

Ice cream makes you taste better?

25 of 182

Correlation vs. Causation

Imagine we have a common sense dataset / news dataset with this sentence:

What should our model infer?

25

As ice cream sales increase,

the rate of shark attacks sharply increases.

Hypothesis 1

Ice cream makes you taste better?

Hypothesis 2

It’s about getting warm.

Getting Warm

Eating more ice creams

More swimming, thus more shark attacks
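The confounding story above can be reproduced in a toy simulation: warmth drives both variables, producing a strong correlation with no direct causal link between them (all coefficients are arbitrary):

```python
import numpy as np

# Toy simulation: "warm" causes both ice-cream sales and shark attacks;
# the two end up strongly correlated despite no direct causal link.
rng = np.random.default_rng(0)
warm = rng.normal(size=10_000)
ice_cream_sales = warm + 0.5 * rng.normal(size=10_000)
shark_attacks = warm + 0.5 * rng.normal(size=10_000)

corr = float(np.corrcoef(ice_cream_sales, shark_attacks)[0, 1])
print(round(corr, 2))  # strongly positive (~0.8 in theory)
```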

26 of 182

Formulation of Correlation and Causation

“Correlation does not mean causation.”

26

With additional intervention or with causal discovery methods, we can say things like C causes D (i.e., C → D).

More Shark Attacks

More People Swimming in the Sea

Eating More Ice Creams

Getting Warm

A

B

C

D

Correlated:

Causal:

A

B

C

D

✅ Nodes connected in a causal graph may be correlated.

❌ We cannot say B causes D, etc.

Can we go further?

Reference: Causal Discovery

27 of 182

Corr2Cause:

Correlation-to-Causation Inference

27

Input: Correlational statements

Output: Judgment of a causal relation

Assumptions: We assume a closed system of the given variables and correlations. The variable relations satisfy the faithfulness assumption.

28 of 182

Dataset

28

207K samples

1K test data

1K dev data

29 of 182

Model Performance

29

[Bar chart of F1 scores — Random (Uniform): 20; LLaMa-7B: 26; Alpaca-7B: 27; GPT-4: 29; BART MNLI: 33]

30 of 182

Opening Directions for Follow-Up Works

Method followups

And many others! (See our github paper list)

30

31 of 182

Opening Directions for Follow-Up Works

Application studies

Application studies

  • Healthcare
  • Education
  • News digest

31

And many others!

32 of 182

Why do future LLMs need this?

32

How can LLMs process such information? Correlation? Causation? What causes what?

33 of 182

Two Main Causal Inference Tasks

33

Task 2. Causal Effect Reasoning

Task 1. Causal Discovery

What variable affects what?

By how much?

X

Y

Z

X

Y

Z

?

Many tech companies’ CEOs are college dropouts; does that mean “dropping out” causes “job success”?

How much does COVID vaccination increase the survival rate?

Example

Example

𝗖𝗼𝗿𝗿𝟮𝗖𝗮𝘂𝘀𝗲 Paper

(Jin et al, 2024 ICLR)

CLadder Paper

(Jin* et al, 2023 NeurIPS)

What is the strength of the influence of X on Y?

34 of 182

34

CLadder: Assessing Causal Reasoning in Language Models

Bernhard Schölkopf

Mrinmaya Sachan

Luigi Gresele*

Yuen Chen*

Felix Leeb*

Zhijing Jin*

Kevin Blin

Ojasv Kamal

Zhiheng Lyu

Fernando Gonzalez

Max Kleiman-Weiner

35 of 182

Causal Reasoning: An Example

https://www.covid-datascience.com/post/israeli-data-how-can-efficacy-vs-severe-disease-be-strong-when-60-of-hospitalized-are-vaccinated

35

?

Statistics of COVID-19 patients (Israel, August 15, 2021):

Among the severe cases, more than half were vaccinated. P(v | s)

Among the vaccinated people, … are severe cases. P(s | v)

Does it mean the overall causal effect of vaccination is negative for health?

Vaccinated people are more likely to develop severe cases of COVID.

36 of 182

Causal Reasoning: An Example

https://www.covid-datascience.com/post/israeli-data-how-can-efficacy-vs-severe-disease-be-strong-when-60-of-hospitalized-are-vaccinated

36

V

Vaccination

Vulnerability

?

Severity

37 of 182

Causal Reasoning: An Example

37

V

Vaccination

Vulnerability

?

Severity

  • Originally more vulnerable group:
    • Higher vaccination rate
    • Also higher severity rate

  • The right approach:
    • Separate the vaccination effect for each vulnerability level
    • I.e., “Confounder adjustment”
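The “right approach” above — separating the vaccination effect for each vulnerability level — is confounder (backdoor) adjustment; a minimal numeric sketch with made-up probabilities:

```python
# Confounder adjustment sketch: estimate the vaccination effect within each
# vulnerability level Z, then average over P(Z). All numbers are made up.
p_z = {"low": 0.5, "high": 0.5}  # P(Z = z)
p_fatal = {                      # P(fatal | vaccination status, Z = z)
    ("vaccinated", "low"): 0.01, ("unvaccinated", "low"): 0.02,
    ("vaccinated", "high"): 0.05, ("unvaccinated", "high"): 0.20,
}
ate = sum(
    p_z[z] * (p_fatal[("vaccinated", z)] - p_fatal[("unvaccinated", z)])
    for z in p_z
)
print(round(ate, 3))  # -0.08: vaccination lowers fatality in every stratum
```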

38 of 182

CLadder Dataset: 10,000 Causal Questions

Prototypes from

38

39 of 182

Testing Causal Reasoning in LLMs

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

39

Imagine a self-contained, hypothetical world with only the following conditions, and without any unmentioned factors or causal relationships:

Physical vulnerability has a direct effect on the likelihood of fatality, and vaccination decision. Vaccination has a direct effect on the chances of fatality.

In the entire population, 50% of the people are vulnerable.

For healthy and vaccinated people, the fatality rate is 1%. For healthy but unvaccinated people, the fatality rate is

Overall, the fatality rate for vaccinated people is …

Does vaccination lead to an increase in the survival rate?

Causal Query

[Average treatment effect (ATE)]

Causal Graph

Available Data

P(...) = … P(... | …) = …

Z

X

Y

Yes?

No?

Question

Answer

10K questions

with prototypes from

40 of 182

Designing variations into the dataset

40

Query Type:

Causal Graph

  1. Correlational
  2. Interventional
  3. Counterfactual

(Figure: four causal graph variants, labeled 1–4)

41 of 182

Testing Causal Reasoning in LLMs

Hünermund, P., & Bareinboim, E. (2023). Causal inference and data fusion in econometrics. The Econometrics Journal

41

Causal Query

[Average treatment effect (ATE)]

Causal Graph

Available Data

P(...) = … P(... | …) = …

Question

Answer

Z

X

Y

Causal Inference Engine

Yes?

No?

Correct Solution

Human Causal Scientists

42 of 182

Testing Causal Reasoning in LLMs

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

42

Causal Query

[Average treatment effect (ATE)]

Causal Graph

Available Data

P(...) = … P(... | …) = …

Question

Answer

Z

X

Y

Causal Inference Engine

Yes?

No?

Model

LLMs

43 of 182

Model Performance

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

43

[Bar chart of accuracy (%) — Random: 50; GPT-3 (davinci): ~50; LLaMa, Alpaca: ~50]

44 of 182

Model Performance

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

44

[Bar chart of accuracy (%) — Random: 50; GPT-3 (davinci): ~50; LLaMa, Alpaca: ~50; InstructGPT-3 (+ instruction-tuning) davinci-001: 51, davinci-002: 53, davinci-003: 56]

45 of 182

Model Performance

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

45

[Bar chart of accuracy (%) — Random: 50; GPT-3 (davinci): ~50; LLaMa, Alpaca: ~50; InstructGPT-3 d.-001: 51, d.-002: 53, d.-003: 56; GPT-3.5: 52; GPT-4: 62]

46 of 182

Causal CoT: Causal Chain-of-Thought Prompting

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

46

[...] Does vaccination lead to an increase in the survival rate?

Causal Graph Parsing

Arithmetics

Final Answer

Yes

Query Formalization

ATE (X→Y)

= E[Y | do(X=1)] - E[Y|do(X = 0)]

Z

X

Y

Statistics Collection

P(Z=1)=0.5

P(Y=1|X=1)=...

Formal Causal Inference Steps

E.g., Do-calculus,

Counterfactual prediction

47 of 182

Model Performance

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

47

[Bar chart of accuracy (%) — Random: 50; GPT-3 (davinci): ~50; LLaMa, Alpaca: ~50; InstructGPT-3 d.-001: 51, d.-002: 53, d.-003: 56; GPT-3.5: 52; GPT-4: 62; GPT-4 (gpt-4-1106-preview) + Causal CoT: 70 (+8%)]

48 of 182

Causal AI Scientist: Facilitating Causal Data Science with LLMs

Mrinmaya Sachan

Zhijing

Jin

Bernhard Schölkopf

Samuel Simko*

Sawal Acharya*

Preprint 2025

Sawal Acharya*, Vishal Verma*, Devansh Bhardwaj*, Samuel Simko*, Jacob T. Emmerson*, Rahul Babu Shrestha*, Punya Syon Pandey, Anahita Haghighat, Yongjin Yang, Dominik Janzing, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin

49 of 182

Research Agenda Overview

  1. CauSciBench: Evaluating Causal Reasoning in Scientific Research by LLMs
  2. Causal AI Scientist: Facilitating Causal Data Science with LLMs
  3. Can LLMs Propose Instrumental Variables for Causal Reasoning?
  4. A Formal Verification Framework for LLM-Generated Causal Expressions

49

50 of 182

50

CauSciBench: Evaluating Causal Reasoning in Scientific Research by LLMs

Mrinmaya Sachan

Zhijing Jin

Bernhard Schölkopf

Terry Jingchen Zhang*

Preprint 2025

Together with: Anahita Haghighat, Sun Xianlin, Pepijn Cobben, Rahul Babu Shrestha, Maximilian Mordig, Jacob T. Emmerson, Furkan Danisman, Yuen Chen, Clijo Jose, Andrei Ioan Muresanu, Justin Cui, Jiarui Liu, Yahang Qi, Punya Syon Pandey, Yinya Huang

Andrew Kim

Sawal Acharya*

51 of 182

Can LLMs truly reason causally from data?

51

52 of 182

Models + Methods Tested

52

Program of Thoughts

Direct Prompting

ReAct (Reason + Act)

Chain of Thoughts

28 Model-Method Pairs (7 models × 4 prompting methods)

53 of 182

Three Sources of Data

53

Real-World Papers

Synthetic Scenarios

Textbook Examples

143 Queries

Programmatically generated

100 Datasets

(185 Queries)

5 Datasets +

3 Textbooks

(39 Queries)

54 of 182

Topic distribution in our 100+ annotated papers

54

Electoral Behavior (17.1%): “Do legislators respond differently to emails based on race?”

Healthcare (15.3%): “What is the effect of privatizing jail healthcare on inmate mortality rates?”

Political Economics (11.7%): “Does higher corruption lead to higher income inequality?”

Labor Economics (10.8%): “Did an increase in the minimum wage decrease employment?”

Education (9.9%): “What is the effect of highest education degree on yearly earnings?

Political Behavior (9.9%): “What is the effect of imposing sanctions on mass mobilization?”

Criminology (9.0%): “What is the effect of castle doctrine laws on burglary rates?”

Public Opinion (9.0%): “Does dismissing tainted ministers affect presidential approval ratings?”

Applied Economics (7.2%): “How does an increase in product cost affect customer demand?”

55 of 182

Real-World Paper Quality Control

55

Reproducibility

Expert Validation

Methodological Rigor

Two rounds of expert review

Ensures usage of correct method & presence of all method-specific variables

Results are manually replicated in Python

Only the papers that can be replicated within 10% error are included

Covers 9 domains and 7 causal methods

Input to LLMs should not contain (implicit) hints to prevent contamination

Supported Methods: Ordinary Least Squares, Difference in Differences, Regression Discontinuity Design, Instrumental Variables, Propensity Score, General Linear Models, Front-door Adjustment

56 of 182

Data Curation Process

56

Information Extraction

Extract information to be used as an input to the LLM and ground truth from each paper

Accept/Reject Paper

Expert Verification

Check that the curated query & dataset description do not provide methodological hints

Verification #2

Check that the replication code uses the correct method & variables, and it matches the correct paper value

Manual Replication

Replicate the study in Python to ensure replicability

57 of 182

Sample Information Extraction

57

Notations: Running variables (for RDD)

Temporal variables (for DiD)

State variables (for DiD)

Is RCT (for validity of OLS)

Interaction variables (for heterogeneous treatment effects)

Multi RCT treatment (for MultiRCT)

Results are reproduced in Python by experts

58 of 182

Results

58

48.96%

Lowest mean relative error for real data across all model-method pair

77.17%

Highest method accuracy for real data across all model-method pair

93.48%

Highest method accuracy for synthetic data across all model-method pair

7.91%

Lowest mean relative error for synthetic data across all model-method pair

59 of 182

Findings

59

High reliance on OLS,

especially for weaker models

60 of 182

Causal AI Scientist: Facilitating Causal Data Science with LLMs

Mrinmaya Sachan

Zhijing

Jin

Bernhard Schölkopf

Samuel Simko*

Sawal Acharya*

Preprint 2025

Vishal Verma*, Sawal Acharya*, Samuel Simko*, Devansh Bhardwaj*, Anahita Haghighat, Mrinmaya Sachan, Dominik Janzing, Bernhard Schölkopf, Zhijing Jin

61 of 182

Causal CoT: Causal Chain-of-Thought Prompting

Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models

61

[...] Does vaccination lead to an increase in the survival rate?

Causal Graph Parsing

Arithmetics

Final Answer

Yes

Query Formalization

ATE (X→Y)

= E[Y | do(X=1)] - E[Y|do(X = 0)]

Z

X

Y

Statistics Collection

P(Z=1)=0.5

P(Y=1|X=1)=...

Formal Causal Inference Steps

E.g., Do-calculus,

Counterfactual prediction

62 of 182

How do human statisticians solve it?

62

63 of 182

63

… and more

64 of 182

Decision Tree Approach for Method Selection

64

Instrumental Variables, etc.

65 of 182

Decision Tree Approach for Method Selection

65

Linear Regression w/ pre-treatment variables

Randomized Control Trial?

Difference in Means

Difference in Differences

Valid pre-treatment variables?

Instrumental

Variable

Valid Instruments?

X

X

X

X

Temporal information on

treatment timings?

66 of 182

Decision Tree Approach for Method Selection

66

Linear Regression w/ pre-treatment variables

Randomized Control Trial?

Difference in Means

Difference in Differences

Valid pre-treatment variables?

Instrumental

Variable

Valid Instruments?

X

X

X

X

Temporal information on

treatment timings?

Covered

  • 7 Estimation methods
  • 11 Decision steps
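The decision tree can be sketched as a branching function; this simplification covers only a few of the 11 decision steps, and the function and branch names are illustrative, not the system’s actual code:

```python
# Simplified sketch of the method-selection decision tree.
# Names and branch order are illustrative, not from the paper's code.
def select_method(is_rct, has_pretreatment, has_instrument, has_timing):
    if is_rct:
        # Randomized data: covariate-adjusted regression or plain contrast.
        if has_pretreatment:
            return "Linear Regression w/ pre-treatment variables"
        return "Difference in Means"
    if has_timing:
        # Observational data with treatment timing across groups.
        return "Difference in Differences"
    if has_instrument:
        return "Instrumental Variable"
    return "Other (e.g., RDD, propensity score)"

print(select_method(is_rct=False, has_pretreatment=False,
                    has_instrument=False, has_timing=True))
```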

67 of 182

Solving the Causal Questions with CAIS

67

68 of 182

Efficiently Prompting the LLM Models

https://vdsbook.com/

  • Use a strategy inspired by veridical data science (Yu & Kumbier, 2020)
  • Simulate the workflow of expert data scientists
  • Make the model answer a checklist of questions at each step

  1. Problem formulation: What is the domain question? How do differences in the formulation affect the final result?
  2. Data cleaning and preprocessing: How was the data collected? What are the data columns and types?
  3. Exploratory data analysis: Are there any interesting relationships? Are there any variables which are strongly correlated?
  4. Modeling + coding step (code generation → code execution → code output check): Which method is appropriate to use? What refutation methods are available for this method? If necessary, specify null hypotheses.
  5. Post-hoc analysis: Are the relationships found stable?
  6. Interpretation of results: What conclusions can be drawn? What recommendations can be drawn?

68

69 of 182

Example 1: Multi-treatment RCT based on Gerber et al.

Query: Did the Hawthorne treatment lead to an increase in voter turnout in the 2006 elections?

Filename: voter_turnout.csv

Description: The authors conducted a randomized experiment in Michigan to evaluate the effects of four different types of social pressure on voting behavior.

Treatment Types: Civic Duty, Hawthorne, Self, Neighbors or the control group.

Data Variables: voted (voted or did not vote in the 2006 elections); treatment; sex (male/female), g2002, g2004, g2006 (voted or did not vote in gubernatorial elections), hh_size(household size), hh_id (household id)

Results

  • Outcome variable: voted
  • Treatment variable: treatment
  • Predicted Method: Regression with pre-treatment covariates + extract coefficient of Hawthorne
  • Estimated effect + Standard error: 0.025 (0.0025)
  • Interpretation: Hawthorne led to an increase in voter-turnout based on statistical significance.
  • True answer: 0.025 (0.003)

69

70 of 182

Example 2: DiD based on Card and Krueger (1994)

Query: Did an increase in minimum wage decrease employment?

Filename: min_wage_data.csv

Description: The dataset was collected to examine the effects of a minimum wage increase on fast food employment. It includes two survey waves from 410 fast food restaurants such as Burger King, KFC, Wendy’s, and Roy Rogers in New Jersey and Pennsylvania. The survey gathered information on employment, wages, prices, store characteristics, and location.

Results

  • Outcome variable: FTE employment
  • Treatment variable: state (New Jersey = 1; Pennsylvania=0)
  • Predicted Method: Difference in Differences
  • Estimated effect + Standard error: 2.75 (1.69)
  • Interpretation: Based on DiD analysis, the increase in minimum wage did not decrease employment
  • True answer: 2.76 (1.36)

70
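The DiD estimate in this example is a double difference of group means over time; the numbers below are illustrative values close to the published NJ/PA full-time-equivalent employment means:

```python
# Difference-in-differences point estimate:
#   DiD = (treated_post - treated_pre) - (control_post - control_pre)
# Group means are illustrative, close to the published NJ/PA FTE values.
nj_pre, nj_post = 20.44, 21.03   # New Jersey (treated: min wage raised)
pa_pre, pa_post = 23.33, 21.17   # Pennsylvania (control)

did = (nj_post - nj_pre) - (pa_post - pa_pre)
print(round(did, 2))  # 2.75
```

The positive double difference is why the interpretation is that the minimum-wage increase did not reduce employment.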

71 of 182

Example 3 (Difference in Differences based on Kessler et al.)

Query: Did changing organ donation registration from opt-in to active choice increase donation rates?

Filename: organ_donations.csv

Description: California enacted a change on July 2011, where the method for asking the organ donation question at the Department of Motor Vehicles (DMV) changed from an opt-in frame (make a positive statement or skip) to an active choice frame (required to check either "yes" or "no").

Data Variables: State; Post (indicating before/after policy enactment); Rate: Organ donation rate

Results

  • Outcome variable: Rate
  • Treatment variable: State (California = 1; Others=0)
  • Predicted Method: Difference in Differences
  • Estimated effect + Standard error: -0.022 (0.006)
  • Interpretation: Based on DiD analysis, the policy decreased donation rates.
  • True answer: -0.022 (0.007)

71

72 of 182

72

Terry Jingchen Zhang*

Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination

Mrinmaya Sachan

Zhijing Jin

Bernhard Schölkopf

Gopal Dev*

Yinya Huang

Together with: Ning Wang, Nicole Ni, Wenyuan Jiang

73 of 182

Do LLMs Reason or Recall?

73

Memorization

Dataset contamination

Heavily inflated performance

Paraphrase brittleness

Fragile zero-shot behavior


Reasoning

Novel problem solving

Accurate metrics

Translate across contexts

Error localization and fixability


74 of 182

Testing Method

74

  1. Synthesize QA: Synthesize QA pairs from arXiv papers released pre and post knowledge cutoff of the models (May 2023–June 2025).
  2. Run Models: Query the questions and log the accuracy for pre-cutoff and post-cutoff questions.
  3. Evaluate Model Response: Compare the mean accuracy over a 6-month span pre and post knowledge cutoff.

75 of 182

Results

75

No notable performance decay

Consistent across models & domains

76 of 182

76

Can LLMs Propose Instrumental Variables for Causal Reasoning?

Dominik Janzing

Mario Fritz

Bryan Wilder

Ivaxi Sheth

Zhijing Jin

Preprint 2025

77 of 182

Instrumental Variables: A Motivating Example

Paper: Does Compulsory School Attendance Affect Schooling and Earnings? (Angrist & Krueger, 1991)

Query: Does longer education lead to higher wage?

Variables

  • Outcome variable: Wage
  • Treatment variable: Years of education
  • Instrument: Quarter of birth

Why Quarter of Birth (QoB)?

  1. QoB affects years of education due to compulsory school laws
  2. QoB doesn’t directly affect wage
  3. Thus, QoB can act as an “instrument” that shifts the treatment without directly affecting the outcome, isolating the causal effect of education on wage

77

78 of 182

Causal Graph Conditions of the IV

78

Education (Treatment)

Wage (Outcome)

?

79 of 182

Causal Graph Conditions of the IV

79

Cigarette Tax (Instrument)

Smoking During Pregnancy (Treatment)

Low Birthweight (Outcome)

?

Exclusion Restriction

Relevance

Parental Education (Covariate)

Independence

Core IV assumptions:

  1. Relevance: Instrument must affect the treatment
  2. Independence: Instrument must not be related to unobserved confounders
  3. Exclusion restriction: Instrument affects outcome only through the treatment

80 of 182

How IV Estimate is Calculated

80

First Stage

Predict the treatment using instruments and controls to extract instrument-driven variation in the treatment

2-Stage Least Squares | β̂_2SLS = (X′ P_Z X)⁻¹ X′ P_Z Y, where P_Z = Z (Z′Z)⁻¹ Z′

Second Stage

Regress the outcome on the predicted treatment and controls so the estimate isolates the change to treatment caused by the instrument

*OLS after replacing original regressors with their projections on the instruments
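The 2SLS formula can be checked on simulated data with an unobserved confounder; this is a minimal sketch (assuming numpy; all simulation coefficients are arbitrary, with the true causal effect set to 2.0):

```python
import numpy as np

# Simulated check of the 2SLS formula:
#   beta_2SLS = (X' P_Z X)^{-1} X' P_Z Y,  with  P_Z = Z (Z'Z)^{-1} Z'
rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=(n, 1))                      # instrument
u = rng.normal(size=(n, 1))                      # unobserved confounder
x = 0.8 * z + u + 0.1 * rng.normal(size=(n, 1))  # treatment
y = 2.0 * x + u + 0.1 * rng.normal(size=(n, 1))  # outcome (true effect = 2.0)

P_z = z @ np.linalg.inv(z.T @ z) @ z.T           # projection onto instrument
beta = np.linalg.inv(x.T @ P_z @ x) @ (x.T @ P_z @ y)
print(beta.item())  # close to 2.0; plain OLS would be biased upward by u
```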

81 of 182

How IV Estimate is Calculated

81

82 of 182

Figure Inspirations

82

83 of 182

Figure Inspirations

83

84 of 182

Motivation

84

Opportunities

LLMs have vast knowledge repositories, sometimes even more than experts

LLMs can generate creative hypotheses and connections

Allows for rapid IV exploration and validation through LLM + Statistical grounding

Challenges

IV identification requires deep domain expertise and creativity

IV has strong assumptions it must satisfy (Relevance, exclusion, independence)

CAIS sends a query to a single LLM to find the instrument, leading to a non-negligible error rate

85 of 182

Proposing good IVs illustrates 3 skills of Causal LLMs

85

Can LLMs recover canonical IVs?

Testing ability to identify well-established instruments in existing economic literature

Can LLMs avoid proposing invalid IVs?

Evaluating safeguards against invalid instruments

Can LLMs discover novel & valid IVs in open-ended scenarios?

Testing creative approach to instrument identification on real-world datasets

LLM as a knowledge base:

LLM as a statistics agent:

LLM as a creative scientist:

86 of 182

IV Co-Scientist Pipeline

86

HypothesisGenerator: Proposes candidate IVs and confounders

Grounder: Maps proposed IVs to dataset variables

PreSelector: Correlation filtering of variable pairs

HumanProxy: Semantic relevance assessment

CausalOracle: Directional inference using LLM + Granger causality

CriticAgents: Independent evaluation of exclusion & independence

Output: Suggested Instruments

87 of 182

Example on Angrist & Krueger (1991)

87

HypothesisGenerator: Proposes candidate IVs and confounders (Quarter of Birth)

Grounder: Maps QoB, Education, and Wage to variables in the dataset

PreSelector: Finds correlations (QoB-Education, Education-Wage)

HumanProxy: Validates the correlations using prior knowledge

CausalOracle: Tests causal relations (Education -> Wage)

CriticAgents: Tests whether QoB only affects wage through education

Output: Variable name of QoB

88 of 182

88

89 of 182

Example on Angrist & Krueger (1991)

89

HypothesisGenerator: Proposes candidate IVs and confounders (Quarter of Birth)

Grounder: Maps QoB, Education, and Wage to variables in the dataset

PreSelector: Finds correlations (QoB-Education, Education-Wage)

HumanProxy: Ensures variables are distinct (not "male literacy" vs. "female literacy")

CausalOracle: Retrieves causal direction (Education -> Wage)

CriticAgents: Tests whether QoB only affects wage through education

Output: Variable name of QoB

90 of 182

Findings: Canonical IV

90

Exact Match: Exact variable name

(e.g. rainfall)

Concept Match: Same concept, different wording

(e.g. drought index)

91 of 182

Findings: Open-Ended Discovery

91

How do we measure the statistical power of the found instruments?

Is the instrument actually connected to our treatment?

Answered by Relevance (Distance from ground truth)

Do multiple IVs give consistent results?

Answered by Cnorm (Consistency)

92 of 182

Findings: Open-Ended Discovery

92

Relevance

Minimum relevance score (F-stat) of 10.76

Relevance > 10 indicates selected instruments are strong

Cnorm

Minimum Cnorm (Consistency) of 0.505

Cnorm > 0.5 indicates good consistency

93 of 182

Key Takeaways

  1. LLMs can effectively recover canonical IVs
  2. Multi-agent framework successfully prevents invalid IV suggestions
  3. Novel consistency metric validates IV quality without ground truth
  4. Strong performance on real-world Gapminder dataset

93

94 of 182

94

95 of 182

Results

96 of 182

Results

97 of 182

97

A Formal Verification Framework for LLM-Generated Causal Expressions

Mrinmaya Sachan

Zhijing Jin

Bernhard Schölkopf

Paul He

Yinya Huang

Preprint 2025

98 of 182

Verifying LLMs’ Step-by-Step Causal Reasoning

98

[...] Does vaccination lead to an increase in the survival rate?

Causal Graph Parsing

Arithmetics

Final Answer

Yes

Query Formalization

ATE (X→Y)

= E[Y | do(X=1)] - E[Y|do(X = 0)]

Z

X

Y

Statistics Collection

P(Z=1)=0.5

P(Y=1|X=1)=...

Formal Causal Inference Steps

E.g., Do-calculus,

Counterfactual prediction

99 of 182

Sample Input

## Instructions:

1. For each problem, identify the correct expression that represents the query

2. Draw the graphical representation as a text description of edges

3. Show your mathematical reasoning step by step

4. Provide a final yes/no answer

5. Keep your response concise and focused on the solution

Prompt:

[**Problem description omitted for brevity**]

Let V2 = water company; V1 = poverty; X = liking spicy food; Y = cholera

Expression: P(Y | X)

Graphical Representation: V1->X,V2->X,V1->Y,X->Y

Reasoning: P(X = 1, Y = 1)/P(X = 1) - P(X = 0, Y = 1)/P(X = 0)

P(X=1) = 0.81

P(Y=1, X=0) = 0.13

P(Y=1, X=1) = 0.17

0.17/0.81 - 0.13/0.19 = -0.44

-0.44 < 0

Final Answer: No

[**Includes a few more examples**]

## Your Task:

Solve the following problem using the format above. Begin your response with "Solution:" and provide only the expression, graphical representation, reasoning, and final answer.

Prompt: {description}

99

100 of 182

Sample Problem

Prompt: Imagine a self-contained, hypothetical world with only the following conditions, and without any unmentioned factors or causal relationships: Poverty has a direct effect on liking spicy food and cholera. Water company has a direct effect on liking spicy food. Liking spicy food has a direct effect on cholera. Poverty is unobserved. For people served by a local water company, the probability of cholera contraction is 64%. For people served by a global water company, the probability of cholera contraction is 66%. For people served by a local water company, the probability of liking spicy food is 50%. For people served by a global water company, the probability of liking spicy food is 45%. Will liking spicy food decrease the chance of cholera contraction? Let V2 = water company; V1 = poverty; X = liking spicy food; Y = cholera.

100

101 of 182

Sample Response

Expression: E[Y | do(X = 1)] - E[Y | do(X = 0)]

Graphical Representation: V1->X,V2->X,V1->Y,X->Y

Reasoning: E[Y | do(X = 1)] - E[Y | do(X = 0)]

[P(Y=1|V2=1)-P(Y=1|V2=0)]/[P(X=1|V2=1)-P(X=1|V2=0)]

P(Y=1 | V2=0) = 0.64

P(Y=1 | V2=1) = 0.66

P(X=1 | V2=0) = 0.50

P(X=1 | V2=1) = 0.45

(0.66 - 0.64) / (0.45 - 0.50) = -0.39

-0.39 < 0

Final Answer: Yes
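The arithmetic in this response follows the Wald/IV estimator; with the rounded probabilities from the prompt it evaluates to about -0.4 (the response’s -0.39 comes from unrounded values):

```python
# Wald (IV) estimator from the sample response above:
#   ATE = [P(Y=1|V2=1) - P(Y=1|V2=0)] / [P(X=1|V2=1) - P(X=1|V2=0)]
# using the water company V2 as the instrument for spicy-food preference X.
wald = (0.66 - 0.64) / (0.45 - 0.50)
print(round(wald, 2))  # -0.4
```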

101

102 of 182

DoVerifier workflow (after obtaining the output)

102

Causal Graph Parsing

Z

X

Y

Parse Expression

P(Y|do(X), Z)

Perform BFS to match

E1

E2

E3

E4

E5

Final Answer

Yes

103 of 182

The ID Algorithm (Shpitser & Pearl, 2008)

Shpitser, I., & Pearl, J. (2008). Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9(64), 1941–1979.

103

  • Shpitser & Pearl (2008) developed an algorithm to determine whether a causal expression is identifiable under a DAG (i.e., whether it can be written in purely observational terms).
  • They proved that their algorithm is sound and complete.
  • Can’t we just run the ID algorithm and compare whether the results are the same? How does ours differ?

104 of 182

DoSearch: Tikka, S., Hyttinen, A., & Karvanen, J. (2021)

https://github.com/santikka/dosearch?tab=readme-ov-file

104

  • R package for causal effect identification using do-calculus and the ID algorithm
  • Focuses on symbolic identification, expression search, and transportability
  • Theoretically sound

  • Inputs: causal graph + causal query (typically involving the do-operator)
  • Outputs: symbolic estimand (if identifiable) or “not identifiable”

105 of 182

DoSearch (R) vs DoVerifier (Python)

105

DoSearch

DoVerifier

Determines whether a causal query is identifiable and derives an estimand if so

Verifies derivability/semantic equivalence of expressions under a DAG

Search-based identification (ID / do-calculus–style procedures over graphs)

BFS proof search over do-calculus + probability theory

Provably sound, and complete with respect to identifiable problems

Provably sound, and complete with respect to both identifiable and unidentifiable problems

Outputs estimand (or failure / not identifiable when unidentifiable)

Outputs Proof paths / rewrite sequences (or None if not derivable within limits)

106 of 182

DoVerifier works on Unidentifiable but Equivalent Expressions

106

Z

X

Y

W

P(Y | do(X), do(W), Z) = P(Y|do(X), Z)

  • do(X) breaks the edge from W to X
  • W becomes irrelevant after conditioning on Z in the modified graph
  • Hence we can drop do(W)

U

U is unobserved!
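For reference, the three rules of do-calculus (Pearl) that license steps like the one above, where $G_{\bar{X}}$ denotes the graph with edges into $X$ removed and $G_{\underline{X}}$ the graph with edges out of $X$ removed:

```latex
\begin{align*}
&\text{Rule 1 (insert/delete observations):}\\
&\quad P(y \mid do(x), z, w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp Z \mid X, W)_{G_{\bar{X}}}\\
&\text{Rule 2 (action/observation exchange):}\\
&\quad P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  \quad \text{if } (Y \perp Z \mid X, W)_{G_{\bar{X}\underline{Z}}}\\
&\text{Rule 3 (insert/delete actions):}\\
&\quad P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp Z \mid X, W)_{G_{\bar{X}\overline{Z(W)}}}
\end{align*}
```

Here $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\bar{X}}$.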

107 of 182

Visualizing the entire search

107

Z

X

Y

W

U

U is unobserved!

P(Y|do(X), do(W), Z)

P(Y|do(X), W, Z)

P(Y|do(X), Z)

P(Y|do(X), Z)

Rule 3

Rule 2

Rule 1

Technically, the search stops here, but we visualize the entire tree for completeness

108 of 182

DoVerifier works on Unidentifiable but Equivalent Expressions

108

U

X

Y

Z

P(Y | do(X)) = P(Y|do(X), Z)

By Rule 1 of do-calculus: after removing the edges into X, Z is d-separated from Y, so Z can be dropped from (or added to) the conditioning set.

U is unobserved!

109 of 182

Goal

Given a causal expression generated by some LLM, check if it is correct against a ground truth.

109

P(Y | do(X), Z)

LLM Generated Expression

P(Y | Z, do(X))

Ground Truth

110 of 182

A very simple idea…

110

Can we apply the rules of do-calculus, along with standard probability axioms, to transform one expression into the other?

E1

E2

E3

E4

E5

111 of 182

BFS over possible rule applications!

111
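The BFS over rule applications can be sketched as follows (a minimal sketch; the function names and the toy reordering rule are mine, not DoVerifier's actual API):

```python
from collections import deque
from itertools import permutations

def bfs_equivalent(start, target, rules, max_depth=6):
    """BFS over rule applications: each rule maps an expression string to the
    set of expressions reachable in one step. Returns the proof path if the
    target is derivable within max_depth, else None.
    (DoVerifier's real rules cover do-calculus and probability axioms,
    not just the toy rule below.)"""
    seen = {start}
    queue = deque([(start, [start])])
    while queue:
        expr, path = queue.popleft()
        if expr == target:
            return path  # proof path: sequence of equivalent expressions
        if len(path) > max_depth:
            continue
        for rule in rules:
            for nxt in rule(expr):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [nxt]))
    return None  # not derivable within limits

def reorder(expr):
    """Toy rule: the conditioning set is order-invariant."""
    head, inner = expr[: expr.index("|") + 1], expr[expr.index("|") + 1 : -1]
    return {head + ", ".join(p) + ")" for p in permutations(inner.split(", "))}

print(bfs_equivalent("P(Y|do(X), Z)", "P(Y|Z, do(X))", [reorder]))
# ['P(Y|do(X), Z)', 'P(Y|Z, do(X))']
```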

112 of 182

Takeaways

  • Approach the problem as a scientist would
  • Understand the problem, and ground it formally
  • Design a set of structured steps
  • Use LLMs to orchestrate tool use and deliver scientifically grounded results

112

113 of 182

Big Picture of

Future Causal LLMs

113

114 of 182

Ideal Future for LLMs

Current Skill Sets

Further Skills

114

Direct Generation

Websites

Simple Math & Data Science

Causality

Z

X

Y

115 of 182

115

Quriosity: Analyzing Human Questioning Behavior and the Quest for Causality

Rada Mihalcea

Mrinmaya Sachan

Zhijing Jin

Bernhard Schölkopf

Roberto Ceraolo*

Dmitrii Kharlapenko*

Amélie Reymond

Ahmad Khan*

116 of 182

Causal Questions

116

42% of human questions are causal in nature!

Why do apples fall?

Was I rejected because I lack experience?

How to make the world a better place?

How are today’s actions influencing what will happen tomorrow?

Should I get a PhD?

117 of 182

Taxonomy of Causal Questions

117

Causal Graph Learning:

1. What variables exist?

2. What affects what?

Causal Reasoning:

3. By how much? (How much does X+Δ affect Z?)

4. Full understanding: Z = f(X, Y, σ)

(Figure: causal graphs over X, Y, Z, with the outcome “Human civilization advances 🚀.”)
118 of 182

Taxonomy of Causal Questions

118

(Figure: the four-question taxonomy over the causal graph X, Y, Z, repeated from the previous slide.)

  • Asking about the variables
  • Asking about the edges
  • Asking about the causal effect
  • Asking about the mechanism

119 of 182

Taxonomy of Causal Questions

  • Asking about the variables
  • Asking about the edges
  • Asking about the causal effect
  • Asking about the mechanism

119

What are the key elements needed to start a fire? → 1 (variables)

What pixels of an image represent the lighting, angle, and position? → 1 (variables)

Does smoking interfere with the drug effect? → 2 (edges)

What is the change in my income after a PhD? → 3 (causal effect)

How much oxygen (O2) is needed to oxidize 1 kg of mercury (Hg)? → 4 (mechanism)

120 of 182

Formalization

  • Asking about the variables

Given a set of unstructured data, and a (partial) causal structure, suggest variables that play certain roles.

  • Asking about the edges

Given a set of variables, suggest their causal relations.

  • Asking about the causal effect

Given a causal structure and some data, what is the effect of an intervention if the exogenous variables are controlled for (or randomized)?

  • Asking about the mechanism

What is the functional form relating the causal parents to an effect?

120

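As an illustration of the third question (causal effect), the intervention query can be answered by backdoor adjustment when the structure permits it. A minimal sketch on toy binary data (the data, DAG, and function names are hypothetical):

```python
from collections import Counter

# Toy samples (z, x, y) from an assumed DAG Z -> X, Z -> Y, X -> Y;
# Z satisfies the backdoor criterion, so E[Y | do(X=x)] = sum_z P(z) P(Y=1 | x, z).
data = [
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0),
    (1, 0, 1), (1, 1, 1), (1, 1, 1), (1, 0, 0),
]

def p_y1_given(x, z):
    """Empirical P(Y=1 | X=x, Z=z)."""
    ys = [y for (zz, xx, y) in data if xx == x and zz == z]
    return sum(ys) / len(ys)

def e_y_do(x):
    """Backdoor adjustment: average P(Y=1 | x, z) weighted by P(z)."""
    pz = Counter(z for (z, _, _) in data)
    n = len(data)
    return sum((count / n) * p_y1_given(x, z) for z, count in pz.items())

ate = e_y_do(1) - e_y_do(0)
print(ate)  # 0.5
```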

121 of 182

Capturing Human Natural Curiosity

121

Emergence of Curiosity

NatQuest data: 13.5K questions

42% of the natural questions are causal

Query Channels: Search Engine Queries · Community Forums · LLM Interactions

122 of 182

Linguistic Properties of Causal Questions

“How” and “Why” questions are often causal

122

123 of 182

Levels of Cognitive Skills Needed for Causal Questions

123

Bloom’s Taxonomy (from highest- to lowest-level cognitive skill):

01 Create — Produce new or original work: design, assemble, construct, conjecture, develop, formulate, author, investigate

02 Evaluate — Justify a stand or decision: appraise, argue, defend, judge, select, support, value, critique, weigh

03 Analyze — Draw connections among ideas: differentiate, organize, relate, compare, contrast, distinguish, examine, experiment, question, test

04 Apply — Use information in new situations: execute, implement, solve, use, demonstrate, interpret, operate, schedule, sketch

05 Understand — Explain ideas or concepts: classify, describe, discuss, explain, identify, locate, recognize, report, select, translate

06 Remember — Recall facts and basic concepts: define, duplicate, list, memorize, repeat, state

124 of 182

Levels of Cognitive Skills Needed for Causal Questions

From Bloom’s taxonomy of cognitive complexities

124

125 of 182

Current Performance of GPT-4o

The three usability criteria are from user satisfaction surveys (ISO, 2018)

Observations:

  • Struggles when asked to foresee the causal effects in the future
  • Does well in answering causal questions that require knowledge lookup
  • Answers are often overly verbose (low efficiency)

125

GPT-4o score (by human raters):

  • Effectiveness: 3.83 (out of 5)
  • Efficiency: 2.88 (out of 5)
  • Satisfaction: 3.85 (out of 5)

126 of 182

Towards Solving the Causal Questions

126

Question Receival

“Why do apples fall?” → LLM w/ Web Search

“How much does vaccination help?” → Formal Causal Inference and Discovery

53.6% of questions can potentially be addressed by Formal CI*

*Based on human annotation results on a random sample of 600 questions, conducted for this slide.

127 of 182

Takeaways

  • The space of natural causal questions shows the great potential of causality for chatbot AI
  • Future work is needed to address these causal questions in a more scientific way
  • It is also promising to compare the effectiveness of Skill 1 (Knowledge-Based Causal Reasoning) and Skill 2 (Formal Causal Reasoning) on these questions

127

128 of 182

Problems: Lack of Clear Understanding of MAD

128

Several papers study multi-agent effectiveness, but there is no standard for distinguishing MAD’s key features from self-agent scaling, or for evaluating their impact under equal conditions.

129 of 182

Experimental Setting

  • Baselines (Self-Agent)
    • Self-Refinement (sequential scaling): up to 16 refinements
    • Self-Consistency (parallel scaling): up to 16 samples
  • MAD
    • Homogeneous: copies of the same model → effect of collaborative refinement
    • Heterogeneous: different model families → effect of diverse exploration
    • (# agents, # rounds): (8, 2), (4, 4), (2, 8)
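The vanilla MAD loop being compared here can be sketched as follows (a minimal sketch; `multi_agent_debate` and the toy agents are my own illustrative names, not the paper's code):

```python
def multi_agent_debate(agents, question, rounds=2):
    """Hypothetical sketch of vanilla multi-agent debate: each agent answers,
    then refines its answer after seeing the other agents' previous answers."""
    answers = [agent(question, context=[]) for agent in agents]
    for _ in range(rounds - 1):
        answers = [
            agent(question, context=[a for j, a in enumerate(answers) if j != i])
            for i, agent in enumerate(agents)
        ]
    # Aggregate the final round by majority vote (as in self-consistency).
    return max(set(answers), key=answers.count)

def fixed(answer):
    """Toy agent that always returns the same answer, ignoring the debate."""
    return lambda question, context: answer

# Homogeneous-style toy run: the majority answer of the final round wins.
print(multi_agent_debate([fixed("A"), fixed("A"), fixed("B")], "2+2?"))  # A
```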

129

130 of 182

Results

130

131 of 182

Mathematical Reasoning : Homogeneous Agents

131

  1. MAD does not consistently outperform parallel sampling with self-agent
  2. MAD relatively demonstrates clearer benefits on highly challenging problems.

132 of 182

Mathematical Reasoning : Homogeneous Agents

132

In sequential scaling, MAD actively corrects errors, unlike self-refinement, making it effective at high task difficulty when only a few initial agents are correct.

133 of 182

Mathematical Reasoning : Qualitative Example

133

Initially, there is only one correct answer. With self-consistency, majority voting cannot recover this correct answer.

134 of 182

Mathematical Reasoning : Qualitative Example

134

Collaborative refinement in MAD lets one correct solution guide the process toward the right answer.

135 of 182

Mathematical Reasoning : Homogeneous Agents

135

Self-consistency excels at math reasoning, but MAD gains a relative edge on harder problems via collaborative refinement.

Takeaway 1

136 of 182

Mathematical Reasoning : Homogeneous Agents

136

MAD is relatively more beneficial on challenging tasks, especially when using smaller models.

137 of 182

Mathematical Reasoning : Homogeneous Agents

137

Smaller models benefit more from MAD than parallel scaling on hard math tasks.

Takeaway 2

138 of 182

Mathematical Reasoning : Heterogeneous Agents

138

Diverse exploration with heterogeneous models generally shows no clear advantage over the best single agent in math.

139 of 182

Mathematical Reasoning : Heterogeneous Agents

139

Heterogeneous MAD performs near the agents’ harmonic mean, showing limited gains from diverse exploration in math reasoning.

Takeaway 3

140 of 182

Safety Reasoning : Homogeneous Agents

140

In safety reasoning, bias propagates during refinement and generally leads to an increase in attack success rate (ASR).

141 of 182

Safety Reasoning : Homogeneous Agents

141

Sequential scaling increases ASR in safety tasks, while MAD helps reduce this risk, though it does not fully prevent it.

Takeaway 4

142 of 182

Safety Reasoning : Homogeneous Agents

142

Larger models are less vulnerable with MAD, as few initial agents respond to harmful prompts.

143 of 182

Safety Reasoning : Homogeneous Agents

143

Larger models are less vulnerable in MAD due to stronger defenses and safer agent scaling, but some risk persists.

Takeaway 5

144 of 182

Safety Reasoning : Heterogeneous Agents

144

Unlike math, heterogeneous MAD helps safety tasks by occasionally reducing ASR through diverse exploration.

145 of 182

Safety Reasoning : Heterogeneous Agents

145

Unlike in mathematical reasoning, weaker agents can introduce views that shift stronger ones.

146 of 182

Safety Reasoning : Heterogeneous Agents

146

In safety tasks, diverse exploration in heterogeneous MAD improves performance, unlike in math reasoning.

Takeaway 6

147 of 182

Takeaways

147

148 of 182

Summary

148

Collaborative Refinement

  • Math: Helpful for highly challenging problems.
  • Safety: Usually leads to an increase in harmfulness due to harm propagation during refinement.

Diverse Exploration

  • Math: Not helpful, as stronger models often subsume weaker ones.
  • Safety: Can be helpful, as less safe models may aid stronger ones by offering alternative safe responses.

149 of 182

Contributions

149

We identify key features of the vanilla MAD framework: collaborative refinement and diverse exploration.

We investigate how these features provide benefits compared to test-time scaling with a self-agent.

We explore Self vs. MAD across task types, difficulty levels, model scale, and agent configurations.

150 of 182

Extension

  • Agent Configurations
    • Different personas (see Appendix for an example)
    • Different model scales (see Appendix for an example)
    • Analysis across agents:
      • Persuasiveness
      • Degree of fine-tuning (specialization)�
  • Communication
    • There could be better ways for agents to communicate.

150

151 of 182

Can Theoretical Physics Research Benefit from Language Agents?

151

Bernhard Schölkopf

Ignacio Cirac

Pavel Kos

Terry Jingchen Zhang

Zhijing Jin

Sirui Lu

NeurIPS 2025 AI4Science Workshop

152 of 182

Main Position

152

LLM agents could help theoretical physics

Opportunities

Automate repetitive calculations

Explore risky directions

Connect formulas with code

Challenges

Physics Research Taste

Reliability & Verification

Hallucinations

Requires: domain know-how + tools + more synergized research

Disclaimer: the authors’ expertise is restricted to theoretical atomic optics physics, quantum information theory, and many-body physics

153 of 182

INPUT: Idea / Note → LLM AGENT: Derivation → FORMAL CHECK: Mathematica / Lean → INFORMAL CHECK: Self-Critique → OUTCOME: New results?

154 of 182

The Problem of Plausible Hallucinations

Example: Without checks, LLM derivations can introduce subtle errors that propagate.

"The errors are often really subtle... No human would have actually made that mistake."�— Terence Tao

EXAMPLE: THE COMMUTATOR TRAP

AI Derivation: "Since A and B are square matrices, we expand the exponential..."

× Error: This identity holds only if [A,B] = 0. An LLM can neglect this condition, producing a result that looks correct but is physically wrong.

"Agree with the critique; since the threshold was not derived correctly, this does not follow."
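The commutator trap can be checked numerically; assuming the identity in question is exp(A+B) = exp(A)exp(B), a minimal pure-Python sketch (the helper names are mine):

```python
# Pure-Python 2x2 matrix helpers to illustrate the commutator trap.
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def matadd(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def expm(X, terms=30):
    """Matrix exponential via truncated power series (fine for small 2x2 examples)."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    power = [[1.0, 0.0], [0.0, 1.0]]
    fact = 1.0
    for n in range(1, terms):
        power = matmul(power, X)
        fact *= n
        result = matadd(result, [[power[i][j] / fact for j in range(2)] for i in range(2)])
    return result

A = [[0.0, 1.0], [0.0, 0.0]]
B = [[0.0, 0.0], [1.0, 0.0]]
lhs = matmul(expm(A), expm(B))  # exp(A) exp(B)
rhs = expm(matadd(A, B))        # exp(A + B)
# Here [A, B] != 0, so the "identity" exp(A+B) = exp(A) exp(B) fails:
print(abs(lhs[0][0] - rhs[0][0]) > 0.1)  # True
```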

155 of 182

Self-Critique with Human in the Loop

Decompose the research workflow into an iterative loop. The Agent proposes steps and critiques itself, but the Human remains the judge.

1. DERIVE

Generator Agent Proposes Ansatz / Step

2. CRITIQUE

Reflection Agent Scans for Logic Gaps

3. VERIFY

Human Theorist Judgment & Insight

156 of 182

Building the Fences: Tools and Formal Prover

>> Verified: True

157 of 182

Challenge: Developing Some Taste

157

Goal: train agents to naturally exploit symmetry principles

Disclaimer: this is a drastically simplified example that current LLMs can already do very well.

158 of 182

Challenge: From Model to Code

  • Example Hamiltonian (Fermi-Hubbard)

  • Agents can help prototype numerical ED / QMC / tensor network codes, translate legacy code.
  • Hard parts: fermionic anticommutation, boundary conditions, gauge / Gauss-law constraints.
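The example Hamiltonian referenced above is presumably the standard Fermi-Hubbard model:

```latex
H = -t \sum_{\langle i,j \rangle, \sigma}
      \left( c^{\dagger}_{i\sigma} c_{j\sigma} + \text{h.c.} \right)
    + U \sum_{i} n_{i\uparrow} n_{i\downarrow}
```

where $t$ is the hopping amplitude, $U$ the on-site interaction, $c^{\dagger}_{i\sigma}$ creates a fermion with spin $\sigma$ on site $i$, and $n_{i\sigma} = c^{\dagger}_{i\sigma} c_{i\sigma}$.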

159 of 182

Challenge: Even Theoretical Physics are Multimodal

GPT-4o generated tensor network diagrams that are subtly wrong in a non-human way

Current Vision-Language models struggle with connecting diagrams to formulas

Feynman Diagram

160 of 182

Takeaways

  • LLMs show promise for augmenting physics research already
  • Critical gaps in derivation reliability and taste
  • Theoretical Physics as a demanding testbed for LLM reasoning
  • Future: will scaling training/physics data lead to important artificial discovery in theoretical physics?

160


161 of 182

Collective Intelligence: A Survey on Multi-Agent Systems for AI-Driven Scientific Discovery

161

Bernhard Schölkopf

Mrinmaya Sachan

Yongjin Yang

Yinya Huang

Terry Jingchen Zhang

Zhijing Jin

Sirui Lu

Preprint 2025

162 of 182

Challenges of Multi-Agent Systems

162

Coordination & Communication Failures

Specialized corpora cause misunderstandings between agents and skewed outcomes

Integration & Scalability Limitations

Produces superficial knowledge aggregation instead of theoretical hypotheses

Security Risks

Vulnerable to manipulation by malicious agents and attack vectors

Interpretability and Transparency Deficits

The decision-making process is opaque, undermining accountability standards

163 of 182

Future Directions

163

Organizational Integration

Robust Cross-Validation

Standardized Execution Frameworks

Security Protocol and Infrastructure

Adaptive Resource Allocation

Enhanced Error Detection & Benchmarking

164 of 182

Survey: LLM Agents for AI for Science

164

  • Literature Review: coordination mechanisms; security protocols & fact checking
  • Hypothesis Generation: organizational integration; robust cross-validation
  • Experimental Planning: predictive resource coordination; distributed security architecture
  • Experimental Execution: standardized execution; robust security infrastructure
  • Peer Review: enhanced error detection; comprehensive benchmarking

165 of 182

Future Work: Technical Roadmap

165

Scale Up the Studies: increase the number of all three sources

Diversify the Domains: increase the domain coverage of real-world data

Enrich the Methods: implement design of experiments, Mendelian randomization, and more domain-specific methods

166 of 182

Future Work: Impact Roadmap

166

Career Data

What is the effect of having a PhD on wage?

Chemical Experiment Data

How does temperature affect the reaction yield?

Biomedical Data

How does air quality affect insomnia?

… and More!

Any domain that could benefit from causality

Pharmaceutical Data

What substance enhances the drug effect?

167 of 182

Future Work: Impact Roadmap

167

1. Social Sciences (economics, sociology, political science, …): parsing causal relations from observational data & small RCTs

2. Experimental Natural Sciences (chemistry, biology, materials science): to guide large-scale RCT designs

3. Philosophy & Law: causal argumentation and support

4. Theoretical Natural Sciences (physics, Earth & climate sciences): mechanistic law discovery & derivations

168 of 182

Research Overview

  • CLadder: Assessing Causal Reasoning in Language Models (NeurIPS 2023)
  • Can Large Language Models Infer Causation from Correlation? (ICLR 2024)

Latest work:

  • CauSciBench: Evaluating Causal Reasoning in Scientific Research by LLMs
  • Causal AI Scientist: Facilitating Causal Data Science with LLMs
  • Can LLMs Propose Instrumental Variables for Causal Reasoning?
  • A Formal Verification Framework for LLM-Generated Causal Expressions
  • Collective Intelligence: A Survey on Multi-Agent Systems for AI-Driven Scientific Discovery

168

169 of 182

Mentors, Mentees & Collaborators

169

Mentees

Great thanks to

170 of 182

170

171 of 182

Cooperate or Collapse:

Emergence of Sustainable Cooperation

in a Society of LLM Agents

171

Bernhard Schölkopf

Mrinmaya Sachan

Zhijing Jin*

Max Kleiman-Weiner*

Giorgio Piatti*

Rada Mihalcea

172 of 182

A Fishing Village

172

… …

Name: John

Name: Kate

Name: Jack

… …

Images generated by DALL-E

173 of 182

Generative Agents: Era of LLMs

173

Park, Joon Sung, et al. “Generative agents: Interactive simulacra of human behavior.” (2023)

174 of 182

Generative Agents: Era of LLMs

174

Park, Joon Sung, et al. “Generative agents: Interactive simulacra of human behavior.” (2023)

175 of 182

Simulation: A Calendar Year

175

Apr

Mar

Feb

Jan

🎣

💬

🏠

🎣

💬

🏠

🎣

💬

🏠

🎣

💬

🏠

Discuss

Act

Reflect

Discuss

Act

Reflect

Discuss

Act

Reflect

Discuss

Act

Reflect

The remaining fish double, up to a maximum of 100 tons.
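The doubling mechanic can be sketched as follows (a hypothetical minimal model of the dynamic described on this slide; `step` and its parameters are mine):

```python
def step(fish, harvests, cap=100):
    """One cycle of the fishery dynamic: villagers harvest,
    then the remaining fish double, capped at `cap` tons."""
    remaining = max(fish - sum(harvests), 0)
    return min(2 * remaining, cap)

# Sustainable: 5 villagers take 10 tons each -> 50 remain -> doubles back to 100.
print(step(100, [10] * 5))  # 100
# Overfishing: 18 tons each -> only 10 remain -> just 20 tons next cycle.
print(step(100, [18] * 5))  # 20
```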

176 of 182

Simulation: A Calendar Year

176

Apr

Mar

Feb

Jan

🎣

💬

🏠

🎣

💬

🏠

🎣

💬

🏠

🎣

💬

🏠

Discuss

Act

Reflect

Discuss

Act

Reflect

Discuss

Act

Reflect

Discuss

Act

Reflect

177 of 182

Act-Discuss-Reflect (ADR) Agent Architecture

177

Apr

Mar

Feb

Jan

Discuss

Act

Reflect

Discuss

Act

Reflect

Discuss

Act

Reflect

Discuss

Act

Reflect

There are 100 tons of fish [...]

There are 38 tons of fish [...]

Villagers agree to a low harvest rate to restore fish population.

Jack caught more fish than all the other villagers.

To preserve the fish in the lake, I should fish less than 3 tons next time.

I caught 2 tons of fish.

I caught 40 tons of fish.

178 of 182

Results

178

The best model survives <54% of the time ⚠️

*Each metric is run over 5 random seeds per model.

179 of 182

We are growing: Causality Workshops

179

Guangyi Chen · Haoxuan Li · Sara Magliacane · Zhijing Jin · Biwei Huang · Francesco Locatello · Peter Spirtes · Kun Zhang

Dagstuhl seminar (Apr 7–9, 2026): Causality and LLMs

Amit Sharma, Zhijing Jin, Dominik Janzing, Kun Zhang

180 of 182

Our research is in line with our many efforts

180

ACL 2021

1st NLP4PosImpact Workshop

1st BoF for “NLP4SocialGood”

@ACL 2021

EMNLP 2022

2nd NLP4PosImpact Workshop

2021

2022

2023

2024

NeurIPS 2023

Moral AI Workshop

2nd BoF for “NLP4SocialGood”

@ACL 2022

3rd BoF for “NLP4SocialGood”

@ACL 2023

ICLR 2021

RobustML Workshop

EMNLP 2024

3rd NLP4PosImpact Workshop

NeurIPS 2024

Causality Workshops


ACL 2025

4th NLP4PosImpact Workshop

2025

181 of 182

Research Overview

  • CLadder: Assessing Causal Reasoning in Language Models (NeurIPS 2023)
  • Can Large Language Models Infer Causation from Correlation? (ICLR 2024)

Latest work:

  • CauSciBench: Evaluating Causal Reasoning in Scientific Research by LLMs
  • Causal AI Scientist: Facilitating Causal Data Science with LLMs
  • Can LLMs Propose Instrumental Variables for Causal Reasoning?
  • A Formal Verification Framework for LLM-Generated Causal Expressions

181

182 of 182

Mentors, Mentees & Collaborators

182

Mentees

Great thanks to