Synergize Post-Training with the Science of Evaluation:
Making AI Agents Smarter and Safer
Terry Jingchen Zhang
Slides at terry-zhang.notion.site
Making AI agents smarter and safer at the same time
Guide Fine-tuning Towards Generalizable Reasoning
Improve (Multimodal) Reasoning
By probing transferability/generalizability of reasoning performance
Smarter
Uncover Agentic Misaligned Behavior
By analyzing when AI agents commit deception and harmful actions
Safer
Tackle Benchmark Saturation and Contamination
Internalize Consequence-Aware Reasoning for Safety
Look for Key Triggers of Emergent Scheming
Benchmark Saturation: From 20% to 100%
PioneerPhysics: By Researchers, For Researchers
Benchmark Contamination: Memorize vs. Generalize
Disentangle Key Capabilities
Various Use Cases
Interdisciplinary Synergy: Physics-AI Reciprocity
Towards Trustworthy AI in Scientific Workflow
Making AI agents smarter and safer at the same time
Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy
Against Benchmark Contamination
Bernhard Schölkopf, Mrinmaya Sachan, Gopal Dev*, Ning Wang, Terry J. C. Zhang*, Zhijing Jin
FoRLM @NeurIPS 2025
From Research Papers to Questions
Natural Temporal Structure of Ever-Growing Research Paper Corpora
Synthesis Framework Produces Reasoning QAs from Research Papers Automatically
Longitudinal Analysis Relative to LLM Training Cutoff Dates:
Significant Post-Cutoff Performance Decay would indicate contamination
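A minimal sketch of this longitudinal analysis, assuming per-question results carry the source paper's date and a per-model correctness flag, plus the model's training-cutoff date (all names here are illustrative, not the paper's exact code):

```python
from collections import defaultdict
from datetime import date

def accuracy_by_month_offset(results, cutoff: date) -> dict:
    """results: iterable of (paper_date, is_correct) pairs for one model.
    Returns accuracy per month offset from the training cutoff; a sharp drop
    at positive offsets would indicate contamination-inflated pre-cutoff scores."""
    correct, total = defaultdict(int), defaultdict(int)
    for paper_date, is_correct in results:
        offset = (paper_date.year - cutoff.year) * 12 + (paper_date.month - cutoff.month)
        total[offset] += 1
        correct[offset] += int(is_correct)
    return {m: correct[m] / total[m] for m in sorted(total)}
```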
What did we do beyond a reproduction study of RealMath?
Summing model performance over matched time windows (X months before vs. X months after the cutoff) also indicates that the explanation offered in RealMath is not the actual reason for the absence of decay.
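A hedged sketch of this matched-window check: aggregate accuracy over the X months before and the X months after the cutoff and compare the two means (function and variable names are illustrative):

```python
from datetime import date

def pre_post_gap(results, cutoff: date, window_months: int) -> float:
    """Mean accuracy over the window_months after the cutoff minus the window_months before.
    results: iterable of (paper_date, is_correct) pairs for one model.
    A clearly negative gap would point to contamination; a gap near zero would not."""
    offset = lambda d: (d.year - cutoff.year) * 12 + (d.month - cutoff.month)
    pre = [int(c) for d, c in results if -window_months <= offset(d) < 0]
    post = [int(c) for d, c in results if 0 <= offset(d) < window_months]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(post) - mean(pre)
```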
Quick Recall: Retrieval vs. Synthesis Benchmarking
LiveCodeBench (ICLR 2025)
RealMath (NeurIPS 2025)
Hypothesis: the synthesis process transforms the questions such that the models do not recognize or recall them from memorization
From Retrieval to Synthesis: We synthesized questions based on LiveCodeBench
From Synthesis to Retrieval: We reverse-synthesized CLOZE questions on RealMath
Same code after synthesis
Same papers in CLOZE
Evaluation on CLOZE QAs
We generate CLOZE (fill-in-the-blank) questions from paper abstracts, based on RealMath and our own papers dataset.
These are direct-recall questions that test memorization in the models.
We observe a consistent drop in post-cutoff performance for all models.
Drop in binary accuracy scores on CLOZE QAs
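A minimal sketch of how such CLOZE items can be built and scored, assuming one salient term is blanked per abstract and answers are graded by exact match (one plausible reading of "binary accuracy"); the masking heuristic and names are illustrative, not the exact pipeline:

```python
import re

def make_cloze(abstract: str, target_term: str) -> dict:
    """Blank out one salient term of the abstract to form a direct-recall question."""
    assert target_term in abstract
    return {"question": abstract.replace(target_term, "____", 1), "answer": target_term}

def binary_accuracy(predictions, references) -> float:
    """Exact-match accuracy over normalized strings (assumed scoring for the CLOZE QAs)."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return sum(norm(p) == norm(r) for p, r in zip(predictions, references)) / len(references)
```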
Evaluation on Transformed LiveCodeBench QAs
The LiveCodeBench dataset is sourced directly from Codeforces, LeetCode, and AtCoder problems, which show clear post-cutoff temporal decay.
We apply reasoning-driven synthesis to transform this dataset into QA pairs with the same core solutions.
We re-evaluate the same models, and the post-cutoff decay disappears!
This stands in contrast to the previous longitudinal studies LiveCodeBench (ICLR 2025) and ToTheCutoff (ICLR 2024).
Relatively stable performance on the synthesized dataset
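A hedged sketch of the transformation step: a reasoning model is prompted to rewrite each retrieval-style problem into a fresh question that the original reference solution still solves. `call_reasoning_model` is a placeholder for whatever model API is used, and the prompt wording is illustrative, not the paper's exact prompt:

```python
import json

SYNTHESIS_PROMPT = """You are given a competitive-programming problem and its reference solution.
Rewrite the problem as a new, self-contained question with a different surface story,
such that the reference solution's core algorithm still solves it.
Return JSON with the fields "question" and "answer".

Problem:
{problem}

Reference solution:
{solution}
"""

def synthesize_qa(problem: str, solution: str, call_reasoning_model) -> dict:
    """Turn a retrieval-style item into a synthesis-style QA pair sharing the same core solution."""
    raw = call_reasoning_model(SYNTHESIS_PROMPT.format(problem=problem, solution=solution))
    return json.loads(raw)  # expected keys: "question", "answer"
```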
What is different in RealMath + ours vs. LiveCodeBench / ToTheCutoff?
The “LiveBench” approach: use new information, periodically.
When new models come out => use a new dataset => re-evaluate all models on this new dataset.
(This is costly, and papers from different periods focus on different topics.)
We want to make a further point about longitudinal analysis:
It is not (just) the new content that matters;
rather, the synthesis process itself (i.e., transformation by a reasoning model) makes the difference: post-cutoff decay is not seen in the synthesis-based benchmark, but it is apparent in the retrieval-based benchmark.
Validation Experiment: We test both ways
From Retrieval to Synthesis: We synthesized questions based on LiveCodeBench
From Synthesis to Retrieval: We reverse-synthesized CLOZE questions on RealMath
Same code after synthesis
Same papers in CLOZE
Collective Intelligence: On the Promise and Reality of Multi-Agent Systems for AI-Driven Scientific Discovery
Sirui Lu, Bernhard Schölkopf, Yongjin Yang, Yinya Huang, Terry J. C. Zhang*, Zhijing Jin
Preprint 2025
Human Scientists Collaborate to Make Progress, Why Not AI?
Cross-Validation to Improve Accuracy
Multiple Perspectives to Foster Ideation
Parallel Exploration to Boost Efficiency
What do multi-agent systems (MAS) have to offer across 5 research stages?
Literature Review
Parallel Information Processing
Cross-Domain Knowledge Integration
Hypothesis Generation
Multiple Perspectives Foster Ideation
Cross-Validate by Multi-Agent Debate (see the sketch after this list)
Experimental Planning
Distributed Design and Optimization
Adaptive Coordination Among Multiple Agents
Peer Review
Expert Review Panel of Multiple Agents
Agent Feedback for Iterative Refinement
Experimental Execution
Multiple Systems Integration
Natural Fault-Tolerance by Spare Agents
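As a concrete illustration of the debate-style cross-validation above, here is a minimal sketch of a multi-agent debate loop, assuming a generic `ask(agent, prompt) -> str` chat interface; the roles, round count, and prompts are illustrative only, not a specific system's protocol:

```python
def multi_agent_debate(question: str, agents: list, ask, rounds: int = 2) -> str:
    """Each agent answers independently, then revises after seeing the others' answers;
    a judge agent finally selects or composes the answer (cross-validation by debate)."""
    answers = [ask(agent, f"Answer the question:\n{question}") for agent in agents]
    for _ in range(rounds):
        shared = "\n".join(f"Agent {i}: {a}" for i, a in enumerate(answers))
        answers = [
            ask(agent,
                f"Question:\n{question}\nOther agents answered:\n{shared}\n"
                "Critique these answers and give your revised answer.")
            for agent in agents
        ]
    shared = "\n".join(f"Agent {i}: {a}" for i, a in enumerate(answers))
    return ask(agents[0],  # this sketch reuses the first agent as the judge
               f"Question:\n{question}\nCandidate answers:\n{shared}\n"
               "Select or compose the single best final answer.")
```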
Future AI Co-Scientist: Co-Evolve with Human Researcher