1 of 30

Synergize Post-Training with the Science of Evaluation:

Making AI Agents Smarter and Safer

Terry Jingchen Zhang

2 of 30

Making AI agents smarter and safer at the same time

Smarter: Improve (Multimodal) Reasoning

By probing the transferability/generalizability of reasoning performance

Guide Fine-tuning Towards Generalizable Reasoning

Tackle Benchmark Saturation and Contamination

Safer: Uncover Misaligned Agentic Behavior

By analyzing when AI agents commit deception and take harmful actions

Internalize Consequence-Aware Reasoning for Safety

Look for Key Triggers of Emergent Scheming

3 of 30

Benchmark Saturation: From 20% to 100%

4 of 30

PioneerPhysics: By Researchers, For Researchers

https://pioneerphysics.netlify.app

5 of 30

Benchmark Contamination: Memorize vs. Generalize

6 of 30

Disentangle Key Capabilities

7 of 30

Various Use Cases

8 of 30

Interdisciplinary Synergy: Physics-AI Reciprocity

9 of 30

Towards Trustworthy AI in Scientific Workflows

10 of 30

Making AI agents smarter and safer at the same time

11 of 30

12 of 30

Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination

Bernhard Schölkopf, Mrinmaya Sachan, Gopal Dev*, Ning Wang, Terry J. C. Zhang*, Zhijing Jin

FoRLM @NeurIPS 2025

13 of 30

From Research Papers to Questions

Natural Temporal Structure of Ever-Growing Research Paper Corpora

Synthesis Framework Produces Reasoning QAs from Research Papers Automatically

Longitudinal Analysis Relative to LLM Training Cutoff Dates:

Significant Post-Cutoff Performance Decay would indicate contamination
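A minimal sketch of this pipeline, assuming a placeholder `llm_generate` call and paper records carrying a publication date (hypothetical names, not the framework's actual interface):

```python
from datetime import date

def synthesize_qa(abstract: str, llm_generate) -> dict:
    """Turn one paper abstract into a self-contained reasoning QA pair.
    `llm_generate` is a stand-in for whatever LLM call is available."""
    prompt = (
        "Read the abstract below and write ONE self-contained reasoning "
        "question that can be answered without access to the paper, then "
        "give its answer on a new line starting with 'Answer:'.\n\n" + abstract
    )
    return {"qa_text": llm_generate(prompt)}

def bucket_by_month(papers: list[dict], cutoff: date) -> dict[int, list[dict]]:
    """Group papers by month offset relative to a model's training cutoff:
    negative keys are pre-cutoff months, positive keys are post-cutoff."""
    buckets: dict[int, list[dict]] = {}
    for paper in papers:  # each paper: {"abstract": str, "published": date, ...}
        offset = (paper["published"].year - cutoff.year) * 12 + (
            paper["published"].month - cutoff.month
        )
        buckets.setdefault(offset, []).append(paper)
    return buckets
```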

14 of 30

What did we do beyond a reproduction study of RealMath?

  1. We extended the framework in 3 aspects
    1. Not just math, but also physics (and other sciences)
    2. Use a multimodal model (o4-mini) for synthesis
    3. Omit the “context” entry to synthesize self-contained questions and avoid leakage (and save tokens)
  2. More rigorous control of the number of problems per month per domain (RealMath simply partitioned the dataset into before vs. after the cutoff without a fine-grained comparison, i.e. 2022 vs. 2025); see the sampling sketch after this list
  3. We tested 4 leading model families, each represented by 2 models of similar size but different cutoff dates.
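As a rough illustration of the per-month, per-domain control in point 2 (the quota of 20 and the field names are assumptions, not the values used in the paper):

```python
import random

def sample_balanced(qas: list[dict], per_cell: int = 20, seed: int = 0) -> list[dict]:
    """Draw the same number of QAs from every (month, domain) cell so that
    pre- vs. post-cutoff comparisons are not skewed by corpus growth."""
    rng = random.Random(seed)
    cells: dict[tuple, list[dict]] = {}
    for qa in qas:  # each qa: {"month": "2024-07", "domain": "physics", ...}
        cells.setdefault((qa["month"], qa["domain"]), []).append(qa)
    sampled = []
    for cell in cells.values():
        rng.shuffle(cell)
        sampled.extend(cell[:per_cell])
    return sampled
```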

15 of 30

Summing model performance over the same time window (X months before vs. X months after the cutoff) also shows that the explanation offered in RealMath is not the actual reason for the absence of decay
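A sketch of that windowed comparison; the record format and month indexing are assumed for illustration:

```python
from statistics import mean

def windowed_accuracy(results: list[dict], cutoff_month: int, window: int = 6):
    """Mean accuracy over the X months before vs. the X months after the cutoff.
    Months are absolute indices (e.g. year * 12 + month); `results` holds
    {"month": int, "correct": bool} records for one model."""
    pre = [r["correct"] for r in results
           if cutoff_month - window <= r["month"] < cutoff_month]
    post = [r["correct"] for r in results
            if cutoff_month <= r["month"] < cutoff_month + window]
    return (mean(pre) if pre else None, mean(post) if post else None)
```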

16 of 30

Quick Recall: Retrieval vs. Synthesis Benchmarking

LiveCodeBench (ICLR 2025)

RealMath (NeurIPS 2025)

17 of 30

Hypothesis: the synthesis process transforms the questions such that models no longer recognize them from memorization

From Retrieval to Synthesis: We synthesized questions based on LiveCodeBench

From Synthesis to Retrieval: We reverse-synthesized CLOZE questions on RealMath

18 of 30

Same code after synthesis

Same papers in CLOZE

19 of 30

Evaluation on CLOZE QAs

We generate CLOZE (fill-in-the-blank) questions from paper abstracts, based on RealMath and our own papers dataset.

These are direct-recall questions designed to test memorization in the models.

We observe a consistent drop in post-cutoff performance for all models.

Drop in binary accuracy scores on CLOZE QAs
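A minimal sketch of how such CLOZE items can be built and scored; the choice of blanked target and the normalization are illustrative, not the exact procedure used in the experiments:

```python
import re

def make_cloze(abstract: str, target: str) -> dict:
    """Blank out one salient term or number from the abstract to form a
    direct-recall (fill-in-the-blank) question."""
    assert target in abstract
    return {"question": abstract.replace(target, "____", 1), "answer": target}

def binary_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match scoring after light normalization (one 0/1 score per item)."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    hits = [norm(p) == norm(r) for p, r in zip(predictions, references)]
    return sum(hits) / len(hits)
```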

20 of 30

Evaluation on Transformed LiveCodeBench QAs

The LiveCodeBench dataset is directly sourced from Codeforces, LeetCode, and AtCoder problems, and shows clear temporal post-cutoff decay.

We apply the reasoning-driven synthesis to transform this dataset into QA pairs with the same core solution.

We reevaluate the same models and the post-cutoff decay disappears!

This stands in contrast to the previous longitudinal studies LiveCodeBench (ICLR 2025) and ToTheCutoff (ICLR 2024).

Relatively stable performance on the synthesized dataset
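Roughly, the transformation step looks like the sketch below; the prompt wording and the `llm_generate` helper are placeholders, not the exact prompt used in the paper:

```python
REWRITE_PROMPT = """You are given a competitive programming problem.
Rewrite it as a new, self-contained question with a different surface story
and different variable names, but keep the underlying algorithmic task and
its reference solution exactly the same.

Problem:
{problem}
"""

def transform_problem(problem: str, llm_generate) -> str:
    """Reasoning-driven synthesis: same core solution, new surface form."""
    return llm_generate(REWRITE_PROMPT.format(problem=problem))
```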

21 of 30

The “LiveBench” approach: use new info, periodically

When new models come out => use a new dataset => evaluate all the models again on this new dataset

(It’s costly, and papers from different times focus on different things)

We want to make a further point on longitudinal analysis:

It’s not (just) the new content that matters

Rather, “just” the synthesis process (i.e. transformation by a reasoning model) makes the difference: post-cutoff decay is not seen in synthesis-based benchmarks, but is apparent in retrieval-based ones

What is different between RealMath + ours vs. LiveCodeBench / ToTheCutoff?

22 of 30


23 of 30

Validation Experiment: We test both ways

From Retrieval to Synthesis: We synthesized questions based on LiveCodeBench

From Synthesis to Retrieval: We reverse-synthesized CLOZE questions on RealMath

24 of 30

Same code after synthesis

Same papers in CLOZE

25 of 30

Collective Intelligence: On the Promise and Reality of Multi-Agent Systems for AI-Driven Scientific Discovery

Sirui Lu, Bernhard Schölkopf, Yongjin Yang, Yinya Huang, Terry J. C. Zhang*, Zhijing Jin

Preprint 2025

26 of 30

Human Scientists Collaborate to Make Progress; Why Not AI?

Cross-Validation to Improve Accuracy

Multiple Perspectives to Foster Ideation

Parallel Exploration to Boost Efficiency

27 of 30

28 of 30

What do MAS have to offer across 5 research stages?

Literature Review: Parallel Information Processing; Cross-Domain Knowledge Integration

Hypothesis Generation: Multiple Perspectives Foster Ideation; Cross-Validate by Multi-Agent Debate

Experimental Planning: Distributed Design and Optimization; Adaptive Coordination by Multi-Agents

Peer Review: Expert Review Panel of Multi-Agents; Agent Feedback for Iterative Refinement

Experimental Execution: Multiple Systems Integration; Natural Fault-Tolerance by Spare Agents

29 of 30

Future AI Co-Scientist: Co-Evolving with Human Researchers

30 of 30