The art of a good (reasoning) model
Why it’s so hard to land a good-vibes model.
Nathan Lambert
Allen Institute for AI // Interconnects.ai�Enterprise AI Agents Summit
13 June 2025��Slides at: https://www.interconnects.ai/p/links
Lambert | The art of the model 1
Everybody has reasoning models
OpenAI o3
DeepSeek R1
Gemini 2.5
Claude 4 w/ Extended Thinking
Grok 3
Qwen 3
Lambert | The art of the model 2
Everybody has reasoning models
OpenAI o3
DeepSeek R1
Gemini 2.5
Claude 4 w/ Extended Thinking
Grok 3
Qwen 3
Lambert | The art of the model 3
What are we getting out of them
besides high benchmarks?
What are the next frontiers of training them?
Reasoning is starting to unlock new LM applications
Asking o3 to find a reference that took me ~10 minutes to Google the week before.
One shotted it (with nice download link) in 56 seconds.
Lambert | The art of the model 4
Reasoning is starting to unlock new LM applications
Asking o3 to find a reference that took me ~10 minutes to Google the week before.
One shotted it (with nice download link) in 56 seconds.
Lambert | The art of the model 5
The classic RL overoptimization gif
(“Coast Runners”)
Reasoning is starting to unlock new LM applications
Lambert | The art of the model 6
Deep Research
I use extensively to find papers, tweets, etc.
Reasoning is starting to unlock new LM applications
Lambert | The art of the model 7
Deep Research
I use extensively to find papers, tweets, etc.
Claude Code
[Interactive code agents]
I use for fun website fixes and features (e.g. rlhbook.com)
Reasoning is starting to unlock new LM applications
Lambert | The art of the model 8
Deep Research
I use extensively to find papers, tweets, etc.
Claude Code
[Interactive code agents]
I use for fun website fixes and features (e.g. rlhbook.com)
Codex
[Autonomous code agents]
Finding the niche!
Skills: The foundation of reasoning
Lambert | The art of the model 9
GPT 4o
Skills: The foundation of reasoning
Lambert | The art of the model 10
GPT 4o
o1 improvements
Skills: The foundation of reasoning
Lambert | The art of the model 11
GPT 4o
o1 improvements
o3 improvements
Skills: The foundation of reasoning
Reasoning models unlocked with huge increase in benchmark scores:
(a growing list!)
Lambert | The art of the model 12
GPT 4o
o1 improvements
o3 improvements
The allure of continuing to hillclimb on imperfect evals
Lambert | The art of the model 13
https://x.com/suchenzang/status/1701615026648605095
Anthropic’s Claude 3.7: Willing to cheat to finish code
Extensive reports of negative side-effects when coding.
Lambert | The art of the model 14
https://x.com/ArthurB/status/1897570146102743224
https://x.com/ArthurB/status/1897570146102743224
Anthropic’s Claude 3.7
Lambert | The art of the model 15
https://www.interconnects.ai/p/claude-3-7-thonks
Anthropic’s Claude 3.7
Anthropic’s Claude 3.7: Willing to cheat to finish code
Extensive reports of negative side-effects when coding.
Cause: Too much reinforcement learning on verifiable rewards on imperfect training data – i.e. the model passes the tests in the training data, but the tests don’t cover the actual test.
Lambert | The art of the model 16
Anthropic’s Claude 3.7
Anthropic’s Claude 3.7: Willing to cheat to finish code
Extensive reports of negative side-effects when coding.
Cause: Too much reinforcement learning on verifiable rewards on imperfect training data – i.e. the model passes the tests in the training data, but the tests don’t cover the actual test.
(other models from OpenAI and Gemini show tendencies of this, but not to the same extent)
Lambert | The art of the model 17
o3 hallucinates actions
Many users report o3 “trying” things or supporting findings with impossible supported results.
So far, doesn’t interfere with utility.
���
Lambert | The art of the model 18
More: https://www.interconnects.ai/p/openais-o3-over-optimization-is-back
https://transluce.org/investigating-o3-truthfulness
o3 hallucinates actions
Many users report o3 “trying” things or supporting findings with impossible supported results.
So far, doesn’t interfere with utility.
Suspected cause: Too much reinforcement learning training on long tool use chains, hard to regularize.
Lambert | The art of the model 19
More: https://www.interconnects.ai/p/openais-o3-over-optimization-is-back
https://transluce.org/investigating-o3-truthfulness
Technical bottlenecks
Lambert | The art of the model 20
OpenAI’s GPT 4.5
Totally unimpressive eval. scores, with:
Lambert | The art of the model 21
https://www.interconnects.ai/p/gpt-45-not-a-frontier-model
Expert in green text
OpenThoughts3 & Magistral
Recent, strong open models getting criticism for not being useful outside of narrow evaluation domains. The bar for AI releases continues to be raised for everyone!
Lambert | The art of the model 22
OpenThoughts3 & Magistral: Overthinking & overformatting
Lambert | The art of the model 23
https://x.com/kalomaze/status/1932528415904854166
https://x.com/kalomaze/status/1930927753987244236�
The pressure to give users what they “want”
Lambert | The art of the model 24
More: https://www.interconnects.ai/p/sycophancy-and-the-art-of-the-model
Sycophancy across the industry
Lambert | The art of the model 25
More: https://www.interconnects.ai/p/llama-4�https://www.interconnects.ai/p/sycophancy-and-the-art-of-the-model
Source: https://x.com/___frye/status/1916346474893656572
https://x.com/chatgpt21/status/1906624752304677096
Llama 4 LMArena Version
GPT 4o Sycophant
Why does sycophancy happen?
Users like it!
Ends up being a “tiebreak” in human preference data workflows that is hard to control for.
Lambert | The art of the model 26
More: https://rlhfbook.com/c/06-preference-data.html�https://www.interconnects.ai/p/sycophancy-and-the-art-of-the-model
https://openai.com/index/expanding-on-sycophancy/
Over-optimization, then and now
Over-optimization of reward models is very obvious when your training signals are better.
Now optimization is more robust and multifaceted, but not all skills we want from the models are covered by evaluations.
Stronger optimization will take from what you’re not measuring and move it to where you are!
Lambert | The art of the model 27
More: https://rlhfbook.com/c/17-over-optimization.html
Gao et al. 2022
xAI’s Grok 3
Lambert | The art of the model 28
https://www.interconnects.ai/p/grok-3-and-an-accelerating-ai-roadmap
The Goldilocks Zone: Evals, vibes, and price
Claude 3.5 Sonnet was likely the best model release we’ve had yet.
While the pace of progress is so high, random noise in the process makes the difference between models bigger (and the chance of getting a weird one higher).
Lambert | The art of the model 29
https://www.interconnects.ai/p/switched-to-claude-from-chatgpt
What’s next with models?
Lambert | The art of the model 30
Autonomy: A defining trend of new reasoning products
Lambert | The art of the model 31
METR, 2025.
Autonomy: A defining trend of new reasoning products
Lambert | The art of the model 32
METR, 2025.
Gains on this chart aren’t free! We need to constantly feed new training data and algorithms into the models.
Autonomy: A defining trend of new reasoning products
Lambert | The art of the model 33
METR, 2025.
Reasoning models better at planning will push this boundary!
Gains on this chart aren’t free! We need to constantly feed new training data and algorithms into the models.
How do we do this?
How do we train a reasoning model that can work autonomously 10X longer?
Lambert | The art of the model 34
What reasoning models for independent agents need
Lambert | The art of the model 35
What reasoning models for independent agents need
Lambert | The art of the model 36
We largely have this today
What reasoning models for independent agents need
Lambert | The art of the model 37
We largely have this today
A lot of research underway
What reasoning models for independent agents need
Lambert | The art of the model 38
We largely have this today
A lot of research underway
What is referred to as “planning”
Revisiting this example
Very skillful, lacking planning
Lambert | The art of the model 39
The classic RL overoptimization gif
(“Coast Runners”)
Revisiting this example
Very skillful, lacking planning
Search is a skill that o3 has taken a massive leap on.
Synthesizing complex information and comparisons requires better planning.
Lambert | The art of the model 40
The classic RL overoptimization gif
(“Coast Runners”)
RL as the focal point of language model development
Lambert | The art of the model 41
What I’m thinking about for scaling RL
Lambert | The art of the model 42
From “post-training” to “training”
As we have already trained on more or less the whole internet, interest in RL training on more and more domains will grow.
How much further will the compute used in “post-training” grow?
Will “continual learning” work and further reduce pretraining?
Lambert | The art of the model 43
From “post-training” to “training”
DeepSeek V3 used .18% of compute on post-training.
DeepSeek V3 pretraining took <2 months.
DeepSeek R1 RL training took “a few weeks”*
DeepSeek R1 could already be ~10-20% of compute in GPU hours.
Scaling RL has just begun.
Lambert | The art of the model 44
* from a now deleted tweet from a DeepSeek employee.
Thank you!
nathan@natolambert.com
www.interconnects.ai
Lambert | The art of the model 45
Extra slides follow
Lambert | The art of the model 46
Reinforcement learning with verifiable rewards (RLVR)
Lambert | The art of the model 47
Tülu 3, Ai2, 2024.
Ways to compute rewards
Complexity of evaluation with inference-time compute
Additional compute use is not constant across model providers!
xAI’s official numbers Community comparison
Lambert | The art of the model 48
https://x.com/teortaxesTex/status/1895505072605606165/photo/1
What reasoning models for independent agents need
Lambert | The art of the model 49
Parallel compute as amplification of reasoning abilities
Parallel compute and better verifiers increase the slope of inference-time scaling and in practice improve the robustness of answers (e.g. o1 pro)
Lambert | The art of the model 50
Claude’s Extended Thinking, Anthropic, 2025
Reinforcement learning with verifiable rewards (RLVR)
Lambert | The art of the model 51
Tülu 3, Ai2, 2024.
Ways to compute rewards
Calibration: Reasoners that try as hard as they need to
Effort is currently offloaded to the user:
Soon the model will know how hard to think.
Lambert | The art of the model 52
Calibration: Reasoners that try as hard as they need to
Still, overthinking is a major problem.
Lambert | The art of the model 53
Generation length on GSM8K, MATH, AIME mix. �Qu & Li et al. 2024�https://arxiv.org/abs/2503.21614
Chen, Xu, Liang, He et al. 2024�https://arxiv.org/abs/2412.21187
Strategy: Reasoning models that go in the right direction
There’s a large gap between reasoning models and agents build with reasoning models right now.
Over time, picking the right plan needs to become model native and a core skill.
Lambert | The art of the model 54
Example where DeepSeek R1 dives right into a solution of a Frontier Math problem. No planning.
Abstraction: Reasoning models that break down a task
Questions for designing a LM that orchestrates its own plans:
Lambert | The art of the model 55
Bootstrapping training data for planning
Q*, Strawberry, and finally o1 took 12+ months due to the need to create training data to seed models with reasoning skills (backtracking, verification, etc.)
Planning will go through a similar arc, but it will be easier to add.
Finally, RL can reinforce useful planning styles.
Lambert | The art of the model 56
The art
of a good (reasoning) model
AI Agents Summit
13 June 2025