Multi-Agent LLMs to Assist
Causal Reasoning in Science
1
Zhijing Jin
Assistant Professor at the University of Toronto
Research Scientist at Max Planck Institute
CIFAR AI Chair
Slides are shared at zhijing-jin.com
Complete Agenda
2
Causally Understanding LLMs
Robustness evaluation, Gender bias, User personalization, …
Causal Reasoning with LLMs
Causal relationship learning, Causal effect estimation, …
Use of LLMs in Many Domains
3
However, …
4
A Reliability Problem
https://www.bbc.com/news/articles/cd11gzejgz4o
5
Sometimes, Error = Wrong Imitation
https://www.threads.net/@crumbler/post/C7VGpYSPOgT
6
Is eating rocks really good for your health?
7
What Happens When LLMs Receive a Question?
8
The capital of France is Paris.
… …
Question Reception
v1.0: Direct Generation
v2.0: LLM + RAG (Retrieval Augmented Generation)
v3.0: LLM w/ Structured (Formal) Reasoning
We do not have knowledge of a thing until we have grasped its why, that is to say, its cause.
— Aristotle
9
Causal Graphical Model
10
Z
X
Y
Z
X
Y
X
Y
Chocolate Example
11
Chocolate
Nobel Prize
?
Confounders?
Causality for Machine Learning
12
Bernhard Schölkopf
(Max Planck Institute)
Yoshua Bengio
(Mila)
My Work: Causal NLP
13
Causal NLP
Understanding and improving causal reasoning in natural language processing (NLP) models
Z
X
Y
What role can LLMs serve for causality?
14
How LLMs Find Causal Relationships
15
Simple Question
“Why is ground wet?”
Answer
Rain, sprinkler, etc.
Complex Question
“Does increasing minimum wage affect the employment rate?”
Answer
“Increasing min wage does not reduce employment”
Prior knowledge,
Common sense
Data-driven reasoning
Our Survey of Commonsense Causality
(Covering 200+ Papers)
16
EMNLP 2024 Survey
Our Survey of Commonsense Causality
(Covering 200+ Papers)
17
Limitations: cannot handle precise causal relations, causal claims that are themselves in question, etc.
As a Knowledge Base, Language Parser & Reasoner
Our CLadder paper (Page 27, Appendix): https://arxiv.org/pdf/2312.04350
18
Two Main Causal Inference Tasks
19
Task 2. Causal Effect Reasoning
Task 1. Causal Discovery
What variable affects what?
By how much?
X
Y
Z
X
Y
Z
What is the strength of the influence of X on Y?
?
Peter Spirtes
Algorithms: PC algorithm, FCI…
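The constraint-based idea behind algorithms like PC can be sketched in a few lines: test (conditional) independences in the data and use them to orient edges. A toy Python sketch on synthetic data follows; the variable names and the correlation-magnitude threshold are illustrative stand-ins for a proper statistical conditional-independence test.

```python
# Toy sketch of constraint-based causal discovery (PC-style), assuming
# the true graph is a collider X -> Y <- Z. The |correlation| threshold
# stands in for a proper conditional-independence test.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = X + Z + 0.1 * rng.normal(size=n)  # collider: X -> Y <- Z

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

def abs_partial_corr(a, b, c):
    # correlation of a and b after regressing out c from both
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return abs(np.corrcoef(ra, rb)[0, 1])

threshold = 0.05
# Skeleton phase: X and Z are marginally independent -> no edge X-Z.
print(abs_corr(X, Z) < threshold)             # True
# Orientation phase: X and Z become dependent given Y -> collider at Y.
print(abs_partial_corr(X, Z, Y) > threshold)  # True
```

Conditioning on the common effect Y induces dependence between its independent causes, which is exactly the pattern PC-style algorithms exploit to orient colliders.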
Two Main Causal Inference Tasks
20
Task 2. Causal Effect Reasoning
Task 1. Causal Discovery
What variable affects what?
By how much?
X
Y
Z
X
Y
Z
?
Many tech companies’ CEOs are college dropouts; does that mean “dropping out” causes “job success”?
How much does COVID vaccination increase the survival rate?
Example
Example
𝗖𝗼𝗿𝗿𝟮𝗖𝗮𝘂𝘀𝗲 Paper
(Jin et al, 2024 ICLR)
CLadder Paper
(Jin* et al, 2023 NeurIPS)
What is the strength of the influence of X on Y?
Accurate Causal Graph-Based Reasoning
21
Needs for Reasoning
https://www.nytimes.com/2024/01/08/technology/ai-robots-chatbots-2024.html
22
23
Can Large Language Models Infer Causation from Correlation?
Rada Mihalcea
Mrinmaya Sachan
Zhijing Jin*
Zhiheng Lyu
Mona Diab†
Bernhard Schoelkopf†
Spencer Poff
Jiarui Liu*
ICLR 2024
Correlation vs. Causation
Imagine we have a commonsense or news dataset containing this sentence:
What should our model infer?
24
As ice cream sales increase,
the rate of shark attacks sharply increases.
Hypothesis 1
Ice cream makes you taste better?
Correlation vs. Causation
Imagine we have a commonsense or news dataset containing this sentence:
What should our model infer?
25
As ice cream sales increase,
the rate of shark attacks sharply increases.
Hypothesis 1
Ice cream makes you taste better?
Hypothesis 2
It’s about getting warm.
Getting Warm
Eating more ice creams
More swimming, thus more shark attacks
Formulation of Correlation and Causation
“Correlation does not mean causation.”
26
✅ With additional intervention or with causal discovery methods, we can say things like C causes D (i.e., C → D).
More Shark Attacks
More People Swimming in the Sea
Eating More Ice Creams
Getting Warm
A
B
C
D
Correlated:
Causal:
A
B
C
D
✅ Nodes connected in a causal graph may be correlated.
❌ We cannot say B causes D, etc.
Can we go further?
Reference: Causal Discovery
Corr2Cause:
Correlation-to-Causation Inference
27
Input: Correlational statements
Output: Judgment of a causal relation
Assumptions: We assume a closed system of the given variables and correlations. The variable relations satisfy the faithfulness assumption.
Dataset
28
207K samples
1K test data
1K dev data
Model Performance
29
[Bar chart] F1 Score: Random (Uniform) 20; LLaMa-7B, Alpaca-7B, GPT-4, and BART MNLI between 26 and 33, all close to random.
Opening Directions for Follow-Up Works
Method followups
And many others! (See our github paper list)
30
Opening Directions for Follow-Up Works
Application studies
Application studies
31
And many others!
Why do future LLMs need this?
32
How can LLMs process such information? Correlation? Causation? What causes what?
Two Main Causal Inference Tasks
33
Task 2. Causal Effect Reasoning
Task 1. Causal Discovery
What variable affects what?
By how much?
X
Y
Z
X
Y
Z
?
Many tech companies’ CEOs are college dropouts; does that mean “dropping out” causes “job success”?
How much does COVID vaccination increase the survival rate?
Example
Example
𝗖𝗼𝗿𝗿𝟮𝗖𝗮𝘂𝘀𝗲 Paper
(Jin et al, 2024 ICLR)
CLadder Paper
(Jin* et al, 2023 NeurIPS)
What is the strength of the influence of X on Y?
34
CLadder: Assessing Causal Reasoning in Language Models
Bernhard Schölkopf
Mrinmaya Sachan
Luigi Gresele*
Yuen Chen*
Felix Leeb*
Zhijing Jin*
Kevin Blin
Ojasv Kamal
Zhiheng Lyu
Fernando Gonzalez
Max Kleiman-Weiner
Causal Reasoning: An Example
https://www.covid-datascience.com/post/israeli-data-how-can-efficacy-vs-severe-disease-be-strong-when-60-of-hospitalized-are-vaccinated
35
?
Statistics of COVID-19 patients (Israel, August 15, 2021):
Among the severe cases, more than half were vaccinated. P(v | s)
Among the vaccinated people, … are severe cases. P(s | v)
Does it mean the overall causal effect of vaccination is negative for health?
Vaccinated people are more likely to develop severe cases of COVID.
Causal Reasoning: An Example
https://www.covid-datascience.com/post/israeli-data-how-can-efficacy-vs-severe-disease-be-strong-when-60-of-hospitalized-are-vaccinated
36
V
Vaccination
Vulnerability
?
Severity
Causal Reasoning: An Example
37
V
Vaccination
Vulnerability
?
Severity
CLadder Dataset: 10,000 Causal Questions
Prototypes from
38
…
Testing Causal Reasoning in LLMs
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
39
Imagine a self-contained, hypothetical world with only the following conditions, and without any unmentioned factors or causal relationships:
Physical vulnerability has a direct effect on the likelihood of fatality, and vaccination decision. Vaccination has a direct effect on the chances of fatality.
In the entire population, 50% of the people are vulnerable.
For healthy and vaccinated people, the fatality rate is 1%. For healthy but unvaccinated people, the fatality rate is …
Overall, the fatality rate for vaccinated people is …
Does vaccination lead to an increase in the survival rate?
Causal Query
[Average treatment effect (ATE)]
Causal Graph
Available Data
P(...) = … P(... | …) = …
Z
X
Y
Yes?
No?
Question
Answer
…
10K questions
with prototypes from
Designing variations into the dataset
40
Query Type:
Causal Graph
1
2
3
4
Testing Causal Reasoning in LLMs
Hünermund, P., & Bareinboim, E. (2023). Causal inference and data fusion in econometrics. The Econometrics Journal
41
Causal Query
[Average treatment effect (ATE)]
Causal Graph
Available Data
P(...) = … P(... | …) = …
Question
Answer
Z
X
Y
Causal Inference Engine
Yes?
No?
Correct Solution
Human Causal Scientists
Testing Causal Reasoning in LLMs
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
42
Causal Query
[Average treatment effect (ATE)]
Causal Graph
Available Data
P(...) = … P(... | …) = …
Question
Answer
Z
X
Y
Causal Inference Engine
Yes?
No?
Model
LLMs
Model Performance
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
43
[Bar chart] Accuracy (%): Random 50; GPT-3 (davinci) 50; LLaMa and Alpaca around 50.
Model Performance
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
44
[Bar chart] Accuracy (%): Random 50; GPT-3 (davinci) 50; LLaMa and Alpaca around 50; InstructGPT-3 with instruction-tuning (davinci-001 / -002 / -003) 51 / 53 / 56.
Model Performance
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
45
[Bar chart] Accuracy (%): Random 50; GPT-3 (davinci) 50; LLaMa and Alpaca around 50; InstructGPT-3 (davinci-001 / -002 / -003) 51 / 53 / 56; GPT-3.5 52; GPT-4 62.
Causal CoT: Causal Chain-of-Thought Prompting
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
46
[...] Does vaccination lead to an increase in the survival rate?
Causal Graph Parsing
Arithmetics
Final Answer
Yes
Query Formalization
ATE (X→Y)
= E[Y | do(X=1)] - E[Y|do(X = 0)]
Z
X
Y
Statistics Collection
P(Z=1)=0.5
P(Y=1|X=1)=...
Formal Causal Inference Steps
E.g., Do-calculus,
Counterfactual prediction
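The formal inference step sketched on this slide, the ATE obtained by backdoor adjustment over the confounder Z, can be worked through numerically. The fatality rates below are made-up numbers in the spirit of the vaccination example, not the benchmark's actual values.

```python
# Backdoor adjustment for ATE(X -> Y) with confounder Z (vulnerability):
#   ATE = sum_z [P(Y=1 | X=1, Z=z) - P(Y=1 | X=0, Z=z)] * P(Z=z)
# All probabilities below are hypothetical, chosen to mimic the example.
p_z = {1: 0.5, 0: 0.5}           # P(Z=z): 50% of people are vulnerable
p_y = {                          # P(Y=1 | X=x, Z=z): fatality rates
    (1, 0): 0.01, (0, 0): 0.02,  # healthy: vaccinated vs. unvaccinated
    (1, 1): 0.05, (0, 1): 0.20,  # vulnerable: vaccinated vs. unvaccinated
}
ate = sum((p_y[(1, z)] - p_y[(0, z)]) * p_z[z] for z in (0, 1))
print(round(ate, 2))  # -0.08: vaccination lowers fatality, i.e., raises survival
```

A negative ATE on fatality corresponds to a "Yes" answer to "does vaccination lead to an increase in the survival rate?", even if raw conditional probabilities like P(severe | vaccinated) point the other way.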
Model Performance
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
47
[Bar chart] Accuracy (%): Random 50; GPT-3 (davinci) 50; LLaMa and Alpaca around 50; InstructGPT-3 (davinci-001 / -002 / -003) 51 / 53 / 56; GPT-3.5 52; GPT-4 62; GPT-4 + Causal CoT (gpt-4-1106-preview) 70, an 8-point gain.
Causal AI Scientist: Facilitating Causal Data Science with LLMs
Mrinmaya Sachan
Zhijing
Jin
Bernhard Schölkopf
Samuel Simko*
Sawal Acharya*
Preprint 2025
Sawal Acharya*, Vishal Verma*, Devansh Bhardwaj*, Samuel Simko*, Jacob T. Emmerson*, Rahul Babu Shrestha*, Punya Syon Pandey, Anahita Haghighat, Yongjin Yang, Dominik Janzing, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Research Agenda Overview
49
50
CauSciBench: Evaluating Causal Reasoning in Scientific Research by LLMs
Mrinmaya Sachan
Zhijing Jin
Bernhard Schölkopf
Terry Jingchen Zhang*
Preprint 2025
Together with: Anahita Haghighat, Sun Xianlin, Pepijn Cobben, Rahul Babu Shrestha, Maximilian Mordig, Jacob T. Emmerson, Furkan Danisman, Yuen Chen, Clijo Jose, Andrei Ioan Muresanu, Justin Cui, Jiarui Liu, Yahang Qi, Punya Syon Pandey, Yinya Huang
Andrew Kim
Sawal Acharya*
Can LLMs truly reason causally from data?
51
Models + Methods Tested
52
Program of Thoughts
Direct Prompting
ReAct (Reason + Act)
Chain of Thought
28 Model-Method Pairs
x4
Three Sources of Data
53
Real-World Papers
Synthetic Scenarios
Textbook Examples
143 Queries
Programmatically generated
100 Datasets
(185 Queries)
5 Datasets +
3 Textbooks
(39 Queries)
Topic distribution in our 100+ annotated papers
54
Electoral Behavior (17.1%): “Do legislators respond differently to emails based on race?”
Healthcare (15.3%): “What is the effect of privatizing jail healthcare on inmate mortality rates?”
Political Economics (11.7%): “Does higher corruption lead to higher income inequality?”
Labor Economics (10.8%): “Did an increase in the minimum wage decrease employment?”
Education (9.9%): “What is the effect of highest education degree on yearly earnings?”
Political Behavior (9.9%): “What is the effect of imposing sanctions on mass mobilization?”
Criminology (9.0%): “What is the effect of castle doctrine laws on burglary rates?”
Public Opinion (9.0%): “Does dismissing tainted ministers affect presidential approval ratings?”
Applied Economics (7.2%): “How does an increase in product cost affect customer demand?”
Real-World Paper Quality Control
55
Reproducibility
Expert Validation
Methodological Rigor
Two rounds of expert review
Ensures usage of correct method & presence of all method-specific variables
Results are manually replicated in Python
Only the papers that can be replicated within 10% error are included
Covers 9 domains and 7 causal methods
Input to LLMs should not contain (implicit) hints to prevent contamination
Supported Methods: Ordinary Least Squares, Difference in Differences, Regression Discontinuity Design, Instrumental Variables, Propensity Score, General Linear Models, Front-door Adjustment
Data Curation Process
56
Information Extraction
Extract information to be used as an input to the LLM and ground truth from each paper
Accept/Reject Paper
Expert Verification
Check that curated query & dataset description doesn’t provide methodological hints
Verification #2
Check that the replication code uses the correct method & variables, and it matches the correct paper value
Manual Replication
Replicate the study in Python to ensure replicability
Sample Information Extraction
57
Notations: Running variables (for RDD)
Temporal variables (for DiD)
State variables (for DiD)
Is RCT (for validity of OLS)
Interaction variables (for heterogeneous treatment effects)
Multi RCT treatment (for MultiRCT)
Results are reproduced in Python by experts
Results
58
48.96%: lowest mean relative error on real data across all model-method pairs
77.17%: highest method accuracy on real data across all model-method pairs
93.48%: highest method accuracy on synthetic data across all model-method pairs
7.91%: lowest mean relative error on synthetic data across all model-method pairs
Findings
59
High reliance on OLS,
especially for weaker models
Causal AI Scientist: Facilitating Causal Data Science with LLMs
Mrinmaya Sachan
Zhijing
Jin
Bernhard Schölkopf
Samuel Simko*
Sawal Acharya*
Preprint 2025
Vishal Verma*, Sawal Acharya*, Samuel Simko*, Devansh Bhardwaj*, Anahita Haghighat, Mrinmaya Sachan, Dominik Janzing, Bernhard Schölkopf, Zhijing Jin
Causal CoT: Causal Chain-of-Thought Prompting
Jin* et al., NeurIPS 2023: CLadder: Assessing Causal Reasoning in Language Models
61
[...] Does vaccination lead to an increase in the survival rate?
Causal Graph Parsing
Arithmetics
Final Answer
Yes
Query Formalization
ATE (X→Y)
= E[Y | do(X=1)] - E[Y|do(X = 0)]
Z
X
Y
Statistics Collection
P(Z=1)=0.5
P(Y=1|X=1)=...
Formal Causal Inference Steps
E.g., Do-calculus,
Counterfactual prediction
How do human statisticians solve it?
62
63
… and more
Decision Tree Approach for Method Selection
64
…
Instrumental Variables, etc.
Decision Tree Approach for Method Selection
65
Linear Regression w/ pre-treatment variables
Randomized Control Trial?
✓
Difference in Means
Difference in Differences
Valid pre-treatment variables?
✓
…
Instrumental
Variable
Valid Instruments?
…
✓
✓
X
X
X
X
Temporal information on
treatment timings?
Decision Tree Approach for Method Selection
66
Linear Regression w/ pre-treatment variables
Randomized Control Trial?
✓
Difference in Means
Difference in Differences
Valid pre-treatment variables?
✓
…
Instrumental
Variable
Valid Instruments?
…
✓
✓
X
X
X
X
Temporal information on
treatment timings?
Covered
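The decision tree above can be encoded as a simple sequence of checks. This toy function is a simplified sketch: the branch order and labels are read off the slide, and real method selection needs domain judgment about whether each assumption actually holds.

```python
# Toy encoding of the method-selection decision tree above. The branch
# order follows the slide; real selection requires checking assumptions.
def select_method(is_rct: bool,
                  has_treatment_timing: bool,
                  has_pretreatment_vars: bool,
                  has_valid_instrument: bool) -> str:
    if is_rct:
        return "Difference in Means"
    if has_treatment_timing:  # temporal info on treatment timings
        return "Difference in Differences"
    if has_pretreatment_vars:
        return "Linear Regression w/ pre-treatment variables"
    if has_valid_instrument:
        return "Instrumental Variable"
    return "(other methods, e.g., RDD, front-door adjustment)"

print(select_method(is_rct=False, has_treatment_timing=True,
                    has_pretreatment_vars=False, has_valid_instrument=False))
# Difference in Differences
```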
Solving the Causal Questions with CAIS
67
Efficiently Prompting the LLMs
https://vdsbook.com/
Problem formulation
Data cleaning and preprocessing
Exploratory data analysis
Modeling + coding step
Post-hoc analysis
Interpretation of results
68
Make the model answer a checklist of questions at each step:
1. Problem formulation: What is the domain question? How do differences in the formulation affect the final result?
2. Data cleaning and preprocessing: How was the data collected? What are the data columns and types?
3. Exploratory data analysis: Are there any interesting relationships? Are there any variables which are strongly correlated?
4. Modeling + coding (code generation, code execution, code output check): Which method is appropriate to use? What refutation methods are available for this method? If necessary, specify null hypotheses.
5. Post-hoc analysis: Are the relationships found stable?
6. Interpretation of results: What conclusions can be drawn? What recommendations can be drawn?
Example 1: Multi-treatment RCT based on Gerber et al.
Query: Did the Hawthorne scheme lead to an increase in voter turnout in the 2006 elections?
Filename: voter_turnout.csv
Description: The authors conducted a randomized experiment in Michigan to evaluate the effects of four different types of social pressure on voting behavior.
Treatment Types: Civic Duty, Hawthorne, Self, Neighbors or the control group.
Data Variables: voted (voted or did not vote in the 2006 elections); treatment; sex (male/female); g2002, g2004, g2006 (voted or did not vote in gubernatorial elections); hh_size (household size); hh_id (household id)
Results
69
Example 2: DiD based on Card and Krueger (1994)
Query: Did an increase in minimum wage decrease employment?
Filename: min_wage_data.csv
Description: The dataset was collected to examine the effects of a minimum wage increase on fast food employment. It includes two survey waves from 410 fast food restaurants such as Burger King, KFC, Wendy’s, and Roy Rogers in New Jersey and Pennsylvania. The survey gathered information on employment, wages, prices, store characteristics, and location.
Results
70
Example 3 (Difference in Differences based on Kessler et al.)
Query: Did changing organ donation registration from opt-in to active choice increase donation rates?
Filename: organ_donations.csv
Description: California enacted a change in July 2011, where the method for asking the organ donation question at the Department of Motor Vehicles (DMV) changed from an opt-in frame (make a positive statement or skip) to an active choice frame (required to check either "yes" or "no").
Data Variables: State; Post (indicating before/after policy enactment); Rate: Organ donation rate
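For a dataset like this, the difference-in-differences estimate is the change in the treated group's rate minus the change in the control group's rate. A minimal sketch, with made-up donation rates rather than the actual organ_donations.csv values:

```python
# Difference-in-differences sketch for the organ-donation setting:
#   DiD = (treated_post - treated_pre) - (control_post - control_pre)
# The donation rates below are made up for illustration.
rate = {
    ("california", "pre"): 0.27, ("california", "post"): 0.26,
    ("control",    "pre"): 0.44, ("control",    "post"): 0.46,
}
did = ((rate[("california", "post")] - rate[("california", "pre")])
       - (rate[("control", "post")] - rate[("control", "pre")]))
print(round(did, 2))  # -0.03 on this toy data
```

The control states' trend serves as the counterfactual for California, so DiD nets out any statewide time trend shared by both groups.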
Results
71
72
Terry Jingchen Zhang*
Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
Mrinmaya Sachan
Zhijing Jin
Bernhard Schölkopf
Gopal Dev*
Yinya Huang
Together with: Ning Wang, Nicole Ni, Wenyuan Jiang
Do LLMs Reason or Recall?
73
Memorization
Dataset contamination
Heavily inflated performance
Paraphrase brittleness
Fragile zero-shot behavior
…
Reasoning
Novel problem solving
Accurate metrics
Translate across contexts
Error localization and fixability
…
Testing Method
74
1. Synthesize QA: synthesize QA pairs from arXiv papers released before and after the models’ knowledge cutoffs (May 2023–June 2025).
2. Run Models: query the questions and log the accuracy for pre-cutoff and post-cutoff questions.
3. Evaluate Model Response: compare the mean accuracy over a 6-month span before and after the knowledge cutoff.
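The contamination check boils down to a simple accuracy comparison across the knowledge cutoff. The per-question correctness vectors below are synthetic placeholders; a large pre-minus-post gap would suggest memorization rather than reasoning.

```python
# Sketch of the contamination check: compare accuracy on questions built
# from papers released before vs. after a model's knowledge cutoff.
# Correctness vectors are synthetic (1 = answered correctly).
pre_cutoff  = [1, 1, 0, 1, 1, 0, 1, 1]
post_cutoff = [1, 0, 1, 1, 0, 1, 1, 1]

acc_pre = sum(pre_cutoff) / len(pre_cutoff)
acc_post = sum(post_cutoff) / len(post_cutoff)
print(acc_pre - acc_post)  # 0.0: no notable performance decay here
```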
Results
75
No notable performance decay
Consistent across models & domains
76
Can LLMs Propose Instrumental Variables for Causal Reasoning?
Dominik Janzing
Mario Fritz
Bryan Wilder
Ivaxi Sheth
Zhijing Jin
Preprint 2025
Instrumental Variables: A Motivating Example
Paper: Does Compulsory School Attendance Affect Schooling and Earnings? (Angrist & Krueger, 1991)
Query: Does longer education lead to higher wage?
Variables
Why Quarter of Birth (QoB)?
77
Causal Graph Conditions of the IV
78
Education (Treatment)
Wage (Outcome)
?
Causal Graph Conditions of the IV
79
Cigarette Tax (Instrument)
Smoking During Pregnancy (Treatment)
Low Birthweight (Outcome)
?
Exclusion Restriction
Relevance
Parental Education (Covariate)
Independence
Core IV assumptions:
How IV Estimate is Calculated
80
First Stage
Predict the treatment using instruments and controls to extract instrument-driven variation in the treatment
2-Stage Least Squares | β_2SLS = (X'P_Z X)^{-1} X'P_Z Y, where P_Z = Z(Z'Z)^{-1} Z'
Second Stage
Regress the outcome on the predicted treatment and controls so the estimate isolates the change to treatment caused by the instrument
*OLS after replacing original regressors with their projections on the instruments
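The two stages can be sketched on synthetic data where the true effect is known, so we can see 2SLS recover it while naive OLS is confounded. All variable names and coefficients here are hypothetical.

```python
# 2SLS sketch on synthetic data with an unobserved confounder u.
# True causal effect of x on y is 2.0.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
u = rng.normal(size=n)                      # unobserved confounder
z = rng.normal(size=n)                      # instrument (e.g., quarter of birth)
x = 0.8 * z + u + rng.normal(size=n)        # treatment (education)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # outcome (wage)

Z = np.column_stack([np.ones(n), z])
# First stage: regress treatment on instrument, keep fitted values.
xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)
# Second stage: regress outcome on predicted treatment.
Xhat = np.column_stack([np.ones(n), xhat])
beta_2sls = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)

X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # naive OLS, confounded by u

print(beta_2sls[1])  # close to the true effect 2.0
print(beta_ols[1])   # biased upward (around 3.1) by the confounder
```

Regressing on the first-stage fitted values is algebraically the same as the projection formula above, but avoids materializing the n-by-n matrix P_Z.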
How IV Estimate is Calculated
81
Figure Inspirations
82
Figure Inspirations
83
Motivation
84
Opportunities
LLMs have vast knowledge repositories, sometimes even more than experts
LLMs can generate creative hypotheses and connections
Allows for rapid IV exploration and validation through LLM + Statistical grounding
Challenges
IV identification requires deep domain expertise and creativity
IV has strong assumptions it must satisfy (Relevance, exclusion, independence)
CAIS sends a single query to one LLM to find the instrument, leading to a non-negligible error rate
Proposing good IVs illustrates 3 skills of Causal LLMs
85
Can LLMs recover canonical IVs?
Testing ability to identify well-established instruments in existing economic literature
Can LLMs avoid proposing invalid IVs?
Evaluating safeguards against invalid instruments
Can LLMs discover novel & valid IVs in open-ended scenarios?
Testing creative approach to instrument identification on real-world datasets
LLM as a knowledge base:
LLM as a statistics agent:
LLM as a creative scientist:
IV Co-Scientist Pipeline
86
PreSelector
Correlation filtering of variable pairs
Suggested Instruments
HumanProxy
Semantic relevance assessment
CausalOracle
Directional inference using LLM + Granger causality
CriticAgents
Independent evaluation of exclusion & independence
HypothesisGenerator
Proposes candidate IVs and confounders
Grounder
Maps proposed IVs to dataset variables
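As one concrete statistical signal the CausalOracle stage might use alongside the LLM, a lag-1 Granger-style check asks whether the candidate cause's past improves prediction of the effect beyond the effect's own past. A toy sketch on synthetic time series (coefficients hypothetical):

```python
# Lag-1 Granger-style check: does x's past improve prediction of y
# beyond y's own past? Synthetic data; the true process has x driving y
# with a one-step lag.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

def rss(target, *regressors):
    # residual sum of squares from an OLS fit with an intercept
    A = np.column_stack([np.ones(len(target))] + list(regressors))
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(np.sum((target - A @ beta) ** 2))

rss_restricted = rss(y[1:], y[:-1])    # y's own past only
rss_full = rss(y[1:], y[:-1], x[:-1])  # plus x's past
print(rss_full < rss_restricted / 2)   # True: x's past is informative
```

Granger-style predictability is only evidence about direction, not proof of causation, which is why the pipeline combines it with LLM judgment and critic agents.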
Example on Angrist & Krueger (1991)
87
PreSelector
Finds correlation (QoB-Education, Education-Wage)
Output:
Variable name of QoB
HumanProxy
Validates the correlation using prior knowledge
CausalOracle
Tests causal relations
(Education -> Wage)
CriticAgents
Tests whether QoB only affects wage through education
HypothesisGenerator
Proposes candidate IVs and confounders
(Quarter of Birth)
Grounder
Maps QoB, Education, and Wage to variables in the dataset
88
Example on Angrist & Krueger (1991)
89
PreSelector
Finds correlation (QoB-Education, Education-Wage)
HumanProxy
Ensure variables are distinct (Not "male literacy" vs. "female literacy")
CausalOracle
Retrieve causal direction
(Education -> Wage)
HypothesisGenerator
Proposes candidate IVs and confounders
(Quarter of Birth)
CriticAgents
Tests whether QoB only affects wage through education
Grounder
Maps QoB, Education, and Wage to variables in the dataset
Output:
Variable name of QoB
Findings: Canonical IV
90
Exact Match: Exact variable name
(e.g. rainfall)
Concept Match: Same concept, different wording
(e.g. drought index)
Findings: Open-Ended Discovery
91
How do we measure the statistical power of the found instruments?
Is the instrument actually connected to our treatment?
Answered by Relevance (Distance from ground truth)
Do multiple IVs give consistent results?
Answered by Cnorm (Consistency)
Findings: Open-Ended Discovery
92
Relevance
Minimum relevance score (F-stat) of 10.76
Relevance > 10 indicates selected instruments are strong
Cnorm
Minimum Cnorm (Consistency) of 0.505
Cnorm > 0.5 indicates good consistency
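The relevance score reported above is a first-stage F-statistic. A minimal sketch of how such a statistic is computed on synthetic data (the F > 10 cutoff is the usual weak-instrument rule of thumb):

```python
# First-stage F-statistic sketch: the relevance check for a candidate
# instrument. Synthetic data with a single instrument.
import numpy as np

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)            # candidate instrument
x = 0.4 * z + rng.normal(size=n)  # treatment, partly driven by z

Z = np.column_stack([np.ones(n), z])
beta = np.linalg.solve(Z.T @ Z, Z.T @ x)      # first-stage OLS
rss_full = np.sum((x - Z @ beta) ** 2)
rss_restricted = np.sum((x - x.mean()) ** 2)  # H0: coefficient on z is 0
F = (rss_restricted - rss_full) / (rss_full / (n - 2))
print(F > 10)  # True for this (strong) synthetic instrument
```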
Key Takeaways
93
4
3
2
1
94
Results
Results
97
A Formal Verification Framework for LLM-Generated Causal Expressions
Mrinmaya Sachan
Zhijing Jin
Bernhard Schölkopf
Paul He
Yinya Huang
Preprint 2025
Verifying LLMs’ Step-by-Step Causal Reasoning
98
[...] Does vaccination lead to an increase in the survival rate?
Causal Graph Parsing
Arithmetics
Final Answer
Yes
Query Formalization
ATE (X→Y)
= E[Y | do(X=1)] - E[Y|do(X = 0)]
Z
X
Y
Statistics Collection
P(Z=1)=0.5
P(Y=1|X=1)=...
Formal Causal Inference Steps
E.g., Do-calculus,
Counterfactual prediction
Sample Input
## Instructions:
1. For each problem, identify the correct expression that represents the query
2. Draw the graphical representation as a text description of edges
3. Show your mathematical reasoning step by step
4. Provide a final yes/no answer
5. Keep your response concise and focused on the solution
Prompt:
[**Problem description omitted for brevity**]
Let V2 = water company; V1 = poverty; X = liking spicy food; Y = cholera
Expression: P(Y | X)
Graphical Representation: V1->X,V2->X,V1->Y,X->Y
Reasoning: P(X = 1, Y = 1)/P(X = 1) - P(X = 0, Y = 1)/P(X = 0)
P(X=1) = 0.81
P(Y=1, X=0) = 0.13
P(Y=1, X=1) = 0.17
0.17/0.81 - 0.13/0.19 = -0.44
-0.44 < 0
Final Answer: No
[**Includes a few more examples**]
## Your Task:
Solve the following problem using the format above. Begin your response with "Solution:" and provide only the expression, graphical representation, reasoning, and final answer.
Prompt: {description}
99
Sample Problem
Prompt: Imagine a self-contained, hypothetical world with only the following conditions, and without any unmentioned factors or causal relationships: Poverty has a direct effect on liking spicy food and cholera. Water company has a direct effect on liking spicy food. Liking spicy food has a direct effect on cholera. Poverty is unobserved. For people served by a local water company, the probability of cholera contraction is 64%. For people served by a global water company, the probability of cholera contraction is 66%. For people served by a local water company, the probability of liking spicy food is 50%. For people served by a global water company, the probability of liking spicy food is 45%. Will liking spicy food decrease the chance of cholera contraction? Let V2 = water company; V1 = poverty; X = liking spicy food; Y = cholera.
100
Sample Response
Expression: E[Y | do(X = 1)] - E[Y | do(X = 0)]
Graphical Representation: V1->X,V2->X,V1->Y,X->Y
Reasoning: E[Y | do(X = 1)] - E[Y | do(X = 0)]
[P(Y=1|V2=1)-P(Y=1|V2=0)]/[P(X=1|V2=1)-P(X=1|V2=0)]
P(Y=1 | V2=0) = 0.64
P(Y=1 | V2=1) = 0.66
P(X=1 | V2=0) = 0.50
P(X=1 | V2=1) = 0.45
(0.66 - 0.64) / (0.45 - 0.50) = -0.39
-0.39 < 0
Final Answer: Yes
101
DoVerifier workflow (after obtaining the output)
102
Causal Graph Parsing
Z
X
Y
Parse Expression
P(Y|do(X), Z)
Perform BFS to match
E1
E2
E3
E4
E5
Final Answer
Yes
The ID Algorithm (Shpitser & Pearl, 2008)
Shpitser, I., & Pearl, J. (2008). Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9(64), 1941–1979.
103
DoSearch: Tikka, S., Hyttinen, A., & Karvanen, J. (2021)
https://github.com/santikka/dosearch?tab=readme-ov-file
104
DoSearch (R) vs DoVerifier (Python)
105
| DoSearch | DoVerifier |
| --- | --- |
| Determines whether a causal query is identifiable and derives an estimand if so | Verifies derivability / semantic equivalence of expressions under a DAG |
| Search-based identification (ID / do-calculus-style procedures over graphs) | BFS proof search over do-calculus + probability theory |
| Provably sound; complete with respect to identifiability problems | Provably sound; complete with respect to identifiability problems + unidentifiable problems |
| Outputs an estimand (or failure / "not identifiable" when unidentifiable) | Outputs proof paths / rewrite sequences (or None if not derivable within limits) |
DoVerifier works on Unidentifiable but Equivalent Expressions
106
Z
X
Y
W
P(Y | do(X), do(W), Z) = P(Y|do(X), Z)
U
U is unobserved!
Visualizing the entire search
107
Z
X
Y
W
U
U is unobserved!
P(Y|do(X), do(W), Z)
P(Y|do(X), W, Z)
P(Y|do(X), Z)
P(Y|do(X), Z)
Rule 3
Rule 2
Rule 1
Technically, the search stops here, but we visualize the entire tree for completeness
DoVerifier works on Unidentifiable but Equivalent Expressions
108
U
X
Y
Z
P(Y | do(X)) = P(Y|do(X), Z)
Because, by Rule 1 of do-calculus, after we remove the edges into X, Y is independent of Z, so we can remove Z from the conditioning set.
U is unobserved!
Goal
Given a causal expression generated by some LLM, check if it is correct against a ground truth.
109
P(Y | do(X), Z)
LLM Generated Expression
P(Y | Z, do(X))
Ground Truth
A very simple idea…
110
Can we apply the rules of do-calculus, along with standard probability axioms, to transform one expression into another?
E1
E2
E3
E4
E5
BFS over possible rule applications!
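The search idea can be sketched with a toy rewrite system. The single "rule" below (reordering a conditioning set) is a hypothetical stand-in for the do-calculus and probability-axiom rules the verifier searches over.

```python
# Toy BFS proof search over expression rewrites. Rules map an expression
# to a rewritten expression, or None if they do not apply.
from collections import deque

def bfs_equivalent(start, goal, rules, max_depth=5):
    """Return a rewrite path from start to goal, or None if not found."""
    seen = {start}
    queue = deque([(start, [start])])
    while queue:
        expr, path = queue.popleft()
        if expr == goal:
            return path
        if len(path) > max_depth:
            continue
        for rule in rules:
            nxt = rule(expr)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

def reorder_conditioning(expr):
    # Conditioning sets are order-insensitive (a probability-axiom rewrite).
    return "P(Y|Z,do(X))" if expr == "P(Y|do(X),Z)" else None

path = bfs_equivalent("P(Y|do(X),Z)", "P(Y|Z,do(X))", [reorder_conditioning])
print(path)  # ['P(Y|do(X),Z)', 'P(Y|Z,do(X))']
```

The returned path doubles as a proof: each step records which rewrite rule carried one expression to the next.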
111
Takeaways
112
Big Picture of
Future Causal LLMs
113
Ideal Future for LLMs
Current Skill Sets
Further Skills
114
Direct Generation
Websites
Simple Math & Data Science
Causality
Z
X
Y
115
Quriosity: Analyzing Human Questioning Behavior and the Quest for Causality
Rada
Mihalcea
Mrinmaya Sachan
Zhijing
Jin
Bernhard Schölkopf
Roberto Ceraolo *
Dmitrii Kharlapenko *
Amélie Reymond
Ahmad Khan*
Causal Questions
116
42% of human questions are causal in nature!
Why do apples fall?
Was I rejected because I lack experience?
How to make the world a better place?
How are today’s actions influencing what will happen tomorrow?
Should I get a PhD?
Taxonomy of Causal Questions
117
Causal Reasoning
Causal Graph Learning
1. What variables exist?
3. By how much?
?
?
Human civilization advances 🚀.
Z
Y
?
?
2. What affects what?
X
Z
Y
X
Z
Y
How much does X+Δ affect Z?
4. Full understanding
Z = f(X, Y, σ)
X
Taxonomy of Causal Questions
119
1. What variables exist? “What are the key elements needed to start a fire?” / “What pixels of an image represent the lighting, angle, and position?”
2. What affects what? “Does smoking interfere with the drug effect?”
3. By how much? “What is the change in my income after a PhD?”
4. Full understanding: “How much oxygen (O2) is needed to oxidize 1 kg of mercury (Hg)?”
Formalization
1. Given a set of unstructured data and a (partial) causal structure, suggest variables that play certain roles.
2. Given a set of variables, suggest their relations.
3. Given a causal structure and some data, what is the effect of an intervention if the exogenous variables are controlled for (or randomized)?
4. What is the functional form relating the causal parents to an effect?
120
Capturing Human Natural Curiosity
121
Emergence of Curiosity
NatQuest data: 13.5K questions, 42% of which are causal
Query Channels
Search Engine Queries
Community Forums
LLM Interactions
Linguistic Properties of Causal Questions
“How” and “Why” questions are often causal
122
Levels of Cognitive Skills Needed for Causal Questions
123
Bloom’s Taxonomy (from highest to lowest cognitive complexity):
Create: Produce new or original work. Design, assemble, construct, conjecture, develop, formulate, author, investigate
Evaluate: Justify a stand or decision. Appraise, argue, defend, judge, select, support, value, critique, weigh
Analyze: Draw connections among ideas. Differentiate, organize, relate, compare, contrast, distinguish, examine, experiment, question, test
Apply: Use information in new situations. Execute, implement, solve, use, demonstrate, interpret, operate, schedule, sketch
Understand: Explain ideas or concepts. Classify, describe, discuss, explain, identify, locate, recognize, report, select, translate
Remember: Recall facts and basic concepts. Define, duplicate, list, memorize, repeat, state
Levels of Cognitive Skills Needed for Causal Questions
From Bloom’s taxonomy of cognitive complexities
124
Current Performance of GPT-4o
The three usability criteria are from user satisfaction surveys (ISO, 2018)
Observations:
125
GPT-4o score (by human raters): Effectiveness 3.83 / 5, Efficiency 2.88 / 5, Satisfaction 3.85 / 5
Towards Solving the Causal Questions
126
… …
Question Reception
Formal Causal Inference and Discovery
LLM w/ Web Search
*Based on human annotation results on a random sample of 600 questions, conducted for this slide.
“Why do apples fall?”
“How much does vaccination help?”
53.6% of questions can potentially be addressed by formal CI
Takeaways
127
Problems: Lack of Clear Understanding of MAD
128
Several papers study the effectiveness of multi-agent debate (MAD), but lack a standard for distinguishing its key features from single-agent self-play, or for evaluating their impact under equal conditions.
Experimental Setting
129
Results
130
Mathematical Reasoning : Homogeneous Agents
131
Mathematical Reasoning : Homogeneous Agents
132
In sequential scaling, MAD actively corrects errors, unlike self-refinement, making it effective at high task difficulty when only a few initial agents are correct.
Mathematical Reasoning : Qualitative Example
133
Initially, there is only one correct answer. With self-consistency (majority voting), this correct answer cannot be recovered.
Mathematical Reasoning : Qualitative Example
134
Collaborative refinement in MAD lets one correct solution guide the process toward the right answer.
Mathematical Reasoning : Homogeneous Agents
135
Self-consistency excels at math reasoning, but MAD gains a relative edge on harder problems via collaborative refinement.
Takeaway 1
Mathematical Reasoning : Homogeneous Agents
136
MAD is relatively more beneficial on challenging tasks, especially when using smaller models.
Mathematical Reasoning : Homogeneous Agents
137
Smaller models benefit more from MAD than parallel scaling on hard math tasks.
Takeaway 2
Mathematical Reasoning : Heterogeneous Agents
138
Diverse exploration with heterogeneous models generally shows no clear advantage over the best single agent in math.
Mathematical Reasoning : Heterogeneous Agents
139
Heterogeneous MAD performs near the agents’ harmonic mean, showing limited gains from diverse exploration in math reasoning.
Takeaway 3
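As a reference point for this takeaway: the harmonic mean of agent accuracies is dominated by the weakest agents and sits below the best one, which is why landing near it means limited gains. The accuracy values below are illustrative, not from the paper:

```python
def harmonic_mean(xs):
    # Harmonic mean: n / sum(1/x). Dominated by the smallest entries,
    # so it always lies at or below the arithmetic mean and the max.
    return len(xs) / sum(1.0 / x for x in xs)

# Two hypothetical agents with accuracies 0.9 and 0.6:
print(harmonic_mean([0.9, 0.6]))  # ~0.72, well below the best agent's 0.9
```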
Safety Reasoning : Homogeneous Agents
140
In safety reasoning, bias propagates during refinement and generally leads to an increase in attack success rate (ASR).
Safety Reasoning : Homogeneous Agents
141
Sequential scaling increases ASR in safety tasks, while MAD helps reduce this risk, though it does not fully prevent it.
Takeaway 4
Safety Reasoning : Homogeneous Agents
142
Larger models are less vulnerable with MAD, as few initial agents respond to harmful prompts.
Safety Reasoning : Homogeneous Agents
143
Larger models are less vulnerable in MAD due to stronger defenses and safer agent scaling, but some risk persists.
Takeaway 5
Safety Reasoning : Heterogeneous Agents
144
Unlike math, heterogeneous MAD helps safety tasks by occasionally reducing ASR through diverse exploration.
Safety Reasoning : Heterogeneous Agents
145
Unlike in mathematical reasoning, weaker agents can introduce views that shift stronger ones.
Safety Reasoning : Heterogeneous Agents
146
In safety tasks, diverse exploration in heterogeneous MAD improves performance, unlike in math reasoning.
Takeaway 6
Takeaways
147
Summary
148
Collaborative Refinement
Math: helpful for highly challenging problems.
Safety: usually leads to an increase in harmfulness due to harm propagation during refinement.
Diverse Exploration
Math: not helpful, as stronger models often subsume weaker ones.
Safety: can be helpful, as less safe models may aid stronger ones by offering alternative safe responses.
Contributions
149
We identify key features of the vanilla MAD framework: collaborative refinement and diverse exploration.
We investigate how these features provide benefits compared to test-time scaling with a self-agent.
We explore Self vs. MAD across task types, difficulty levels, model scale, and agent configurations.
Extension
150
Can Theoretical Physics Research Benefit from Language Agents?
151
Bernhard Schölkopf
Ignacio Cirac
Pavel Kos
Terry Jingchen Zhang
Zhijing Jin
Sirui Lu
NeurIPS 2025 AI4Science Workshop
Main Position
152
LLM agents could help theoretical physics
Opportunities
Automate repetitive calculations
Explore risky directions
Connect formulas with code
…
Challenges
Physics Research Taste
Reliability & Verification
Hallucinations
…
Requires: domain know-how + tools + more synergized research
Disclaimer: the authors' expertise is restricted to theoretical atomic/optical physics, quantum information theory, and many-body physics
INPUT
Idea / Note
LLM AGENT
Derivation
FORMAL CHECK
Mathematica/Lean
INFORMAL CHECK
Self-Critique
OUTCOME
New results??
The Problem of Plausible Hallucinations
Example: Without checks, LLM derivations can introduce subtle errors that propagate.
"The errors are often really subtle... No human would have actually made that mistake."
— Terence Tao
EXAMPLE: THE COMMUTATOR TRAP
AI Derivation: "Since A and B are square matrices, we expand the exponential..."
× Error: This identity holds only if [A,B] = 0. An LLM can neglect this condition, producing a result that looks correct but is physically wrong.
Agree with the critique: since the threshold was not derived correctly, this does not follow.
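The commutator trap can be checked numerically. This is a small sketch (using NumPy/SciPy, not part of the talk) showing the identity exp(A+B) = exp(A)exp(B) failing for non-commuting matrices:

```python
import numpy as np
from scipy.linalg import expm

# Two non-commuting matrices: [A, B] = AB - BA != 0
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0]])

commutator = A @ B - B @ A
print(np.allclose(commutator, 0))  # False: A and B do not commute

# The identity exp(A+B) = exp(A) exp(B) silently breaks here
lhs = expm(A + B)
rhs = expm(A) @ expm(B)
print(np.allclose(lhs, rhs))  # False: the "derivation" is wrong
```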
Self-Critique with Human in the Loop
Decompose the research workflow into an iterative loop. The Agent proposes steps and critiques itself, but the Human remains the judge.
1. DERIVE
Generator Agent
Proposes Ansatz / Step
2. CRITIQUE
Reflection Agent
Scans for Logic Gaps
3. VERIFY
Human Theorist
Judgment & Insight
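The three-step loop above can be sketched as a simple control flow. This is a minimal sketch under stated assumptions: `generate_step`, `critique`, and `human_review` are hypothetical stand-ins for the generator agent, the reflection agent, and the human theorist:

```python
def derivation_loop(idea, generate_step, critique, human_review, max_rounds=3):
    """Iterate: an agent proposes a step, another agent critiques it,
    and the human theorist remains the final judge."""
    draft = idea
    for _ in range(max_rounds):
        draft = generate_step(draft)      # 1. DERIVE: propose ansatz / step
        issues = critique(draft)          # 2. CRITIQUE: scan for logic gaps
        if issues:
            draft = f"{draft}\n[revise: {issues}]"
            continue
        if human_review(draft):           # 3. VERIFY: human judgment & insight
            return draft
    return None  # no accepted derivation within budget
```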
Building the Fences: Tools and Formal Prover
>> Verified: True
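Tool-based formal checking can be sketched with a computer-algebra system. This uses SymPy rather than Mathematica/Lean (an illustrative substitution, not the talk's setup): verify a claimed step symbolically before trusting it:

```python
import sympy as sp

x = sp.symbols('x')
claimed = sp.sin(x)**2 + sp.cos(x)**2   # a step the agent asserts equals 1

# Ask the CAS whether (claimed - 1) simplifies to exactly zero
verified = sp.simplify(claimed - 1) == 0
print(f">> Verified: {verified}")       # >> Verified: True
```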
Challenge: Developing Some Tastes
157
Goal: train agents to naturally exploit symmetry principles
Disclaimer: this is a drastically simplified example that current LLMs can already do very well.
Challenge: From Model to Code
Challenge: Even Theoretical Physics Is Multimodal
GPT-4o generated tensor network diagrams that are subtly wrong in a non-human way
Current Vision-Language models struggle with connecting diagrams to formulas
Feynman Diagram
Takeaways
160
4
3
2
1
Collective Intelligence: A Survey on Multi-Agent Systems for AI-Driven Scientific Discovery
161
Bernhard Schölkopf
Mrinmaya Sachan
Yongjin Yang
Yinya Huang
Terry Jingchen Zhang
Zhijing Jin
Sirui Lu
Preprint 2025
Challenges of Multi-Agent Systems
162
Coordination & Communication Failures
Specialized corpora cause misunderstandings between agents and skewed outcomes
Integration & Scalability Limitations
Produces superficial knowledge aggregation instead of theoretical hypotheses
Security Risks
Vulnerable to manipulation by malicious agents and attack vectors
Interoperability and Transparency Deficits
The decision-making process is opaque, undermining accountability standards
Future Directions
163
Organizational Integration
Robust Cross-Validation
Standardized Execution Frameworks
Security Protocol and Infrastructure
Adaptive Resource Allocation
Enhanced Error Detection & Benchmarking
Survey: LLM Agents for AI for Science
164
Literature Review
Coordination mechanism
Security protocols & fact checking
Hypothesis Generation
Organizational integration
Robust cross-validation
Experimental Planning
Predictive resource coordination
Distributed security architecture
Experimental Execution
Standardized execution
Robust security infrastructure
Peer Review
Enhanced error detection
Comprehensive benchmarking
Future Work: Technical Roadmap
165
Scale Up the Studies
Diversify the Domains
Increase the number of all three sources
Increase the domain coverage of Real-World data
Enrich the Methods
Implement Design of Experiments, Mendelian Randomization, and more domain-specific methods
Future Work: Impact Roadmap
166
Career Data
What is the effect of having a PhD on wage?
Chemical Experiment Data
How does temperature affect the reaction yield?
Biomedical Data
How does air quality affect insomnia?
… and More!
Any domain that could benefit from causality
Pharmaceutical Data
What substance enhances the drug effect?
Future Work: Impact Roadmap
167
1
2
3
4
Social Sciences
Economics, social science, political science, …
Parsing causal relations from observational data & small RCTs
Experimental Natural Sciences
Chemistry, biology, materials science
To guide large-scale RCT designs
Philosophy & Law
Causal argumentation and support
Theoretical Natural Sciences
Physics, Earth & Climate Sciences
Mechanistic law discovery & derivations
Research Overview
Latest work:
168
Mentors, Mentees & Collaborators
169
Mentees
Great thanks to
170
Cooperate or Collapse:
Emergence of Sustainable Cooperation
in a Society of LLM Agents
171
Bernhard Schölkopf
Mrinmaya Sachan
Zhijing Jin*
Max Kleiman-Weiner*
Giorgio Piatti*
Rada Mihalcea
NeurIPS 2024
A Fishing Village
172
… …
Name: John
…
Name: Kate
…
Name: Jack
…
… …
Images generated by DALL-E
Generative Agents: Era of LLMs
173
Park, Joon Sung, et al. “Generative agents: Interactive simulacra of human behavior.” (2023)
Generative Agents: Era of LLMs
174
Park, Joon Sung, et al. “Generative agents: Interactive simulacra of human behavior.” (2023)
Simulation: A Calendar Year
175
Apr
Mar
Feb
Jan
🎣
💬
🏠
🎣
💬
🏠
🎣
💬
🏠
🎣
💬
🏠
Discuss
Act
Reflect
Discuss
Act
Reflect
Discuss
Act
Reflect
Discuss
Act
Reflect
The remaining fish double, up to a cap of 100 tons.
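The resource dynamic above can be sketched as a toy simulation. This is a minimal illustrative sketch of the slide's setup (the function and its parameters are assumptions for illustration, not the paper's implementation): the remaining fish double each month up to a 100-ton cap, so modest harvests are sustainable while over-harvesting collapses the common pool:

```python
def simulate(harvest_per_agent, n_agents=5, capacity=100, months=12):
    """Toy common-pool fishery: harvest, then the remainder doubles (capped)."""
    fish = capacity
    for _ in range(months):
        fish -= min(fish, harvest_per_agent * n_agents)  # agents act
        if fish <= 0:
            return 0                       # the fishery has collapsed
        fish = min(2 * fish, capacity)     # remaining fish double, up to the cap
    return fish

print(simulate(harvest_per_agent=10))  # sustainable: the pool recovers each month
print(simulate(harvest_per_agent=12))  # over-harvesting: the pool collapses
```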
Act-Discuss-Reflect (ADR) Agent Architecture
177
Apr
Mar
Feb
Jan
Discuss
Act
Reflect
Discuss
Act
Reflect
Discuss
Act
Reflect
Discuss
Act
Reflect
There are 100 tons of fish [...]
There are 38 tons of fish [...]
Villagers agree to a low harvest rate to restore fish population.
Jack caught more fish than all the other villagers.
To preserve the fish in the lake, I should fish less than 3 tons next time.
I caught 2 tons of fish.
I caught 40 tons of fish.
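One round of the Act-Discuss-Reflect architecture above can be sketched as follows. This is a minimal sketch, not the paper's code; `act`, `discuss`, and `reflect` are hypothetical stand-ins for LLM calls:

```python
def run_month(agents, lake, act, discuss, reflect):
    """One ADR round: each agent acts, all discuss, each reflects."""
    catches = {a: act(a, lake) for a in agents}              # Act: each agent fishes
    transcript = discuss(agents, catches)                    # Discuss: village meeting
    memories = {a: reflect(a, transcript) for a in agents}   # Reflect: update memory
    return catches, memories
```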
Results
178
The best model survives <54% of the time ⚠️
*Each metric is run over 5 random seeds per model.
We are growing: Causality Workshops
179
Guangyi Chen · Haoxuan Li · Sara Magliacane · Zhijing Jin · Biwei Huang · Francesco Locatello · Peter Spirtes · Kun Zhang
Dagstuhl seminar (Apr 7–9, 2026): Causality and LLMs
Amit Sharma, Zhijing Jin, Dominik Janzing, Kun Zhang
Our research is in line with our many efforts
180
ACL 2021
1st NLP4PosImpact Workshop
1st BoF for “NLP4SocialGood”
@ACL 2021
EMNLP 2022
2nd NLP4PosImpact Workshop
2021
2022
2023
2024
NeurIPS 2023
Moral AI Workshop
2nd BoF for “NLP4SocialGood”
@ACL 2022
3rd BoF for “NLP4SocialGood”
@ACL 2023
ICLR 2021
RobustML Workshop
EMNLP 2024
3rd NLP4PosImpact Workshop
NeurIPS 2024
Causality Workshops
ACL 2025
4th NLP4PosImpact Workshop
2025
Research Overview
Latest work:
181
Mentors, Mentees & Collaborators
182
Mentees
Great thanks to