1 of 57

The art of a good (reasoning) model

Why it’s so hard to land a good-vibes model.

Nathan Lambert

Allen Institute for AI // Interconnects.ai�Enterprise AI Agents Summit

13 June 2025��Slides at: https://www.interconnects.ai/p/links

Lambert | The art of the model 1

2 of 57

Everybody has reasoning models

OpenAI o3

DeepSeek R1

Gemini 2.5

Claude 4 w/ Extended Thinking

Grok 3

Qwen 3

Lambert | The art of the model 2

3 of 57

Everybody has reasoning models

OpenAI o3

DeepSeek R1

Gemini 2.5

Claude 4 w/ Extended Thinking

Grok 3

Qwen 3

Lambert | The art of the model 3

What are we getting out of them

besides high benchmarks?

What are the next frontiers of training them?

4 of 57

Reasoning is starting to unlock new LM applications

Asking o3 to find a reference that took me ~10 minutes to Google the week before.

One shotted it (with nice download link) in 56 seconds.

Lambert | The art of the model 4

5 of 57

Reasoning is starting to unlock new LM applications

Asking o3 to find a reference that took me ~10 minutes to Google the week before.

One shotted it (with nice download link) in 56 seconds.

Lambert | The art of the model 5

The classic RL overoptimization gif

(“Coast Runners”)

6 of 57

Reasoning is starting to unlock new LM applications

Lambert | The art of the model 6

Deep Research

I use extensively to find papers, tweets, etc.

7 of 57

Reasoning is starting to unlock new LM applications

Lambert | The art of the model 7

Deep Research

I use extensively to find papers, tweets, etc.

Claude Code

[Interactive code agents]

I use for fun website fixes and features (e.g. rlhbook.com)

8 of 57

Reasoning is starting to unlock new LM applications

Lambert | The art of the model 8

Deep Research

I use extensively to find papers, tweets, etc.

Claude Code

[Interactive code agents]

I use for fun website fixes and features (e.g. rlhbook.com)

Codex

[Autonomous code agents]

Finding the niche!

9 of 57

Skills: The foundation of reasoning

Lambert | The art of the model 9

GPT 4o

10 of 57

Skills: The foundation of reasoning

Lambert | The art of the model 10

GPT 4o

o1 improvements

11 of 57

Skills: The foundation of reasoning

Lambert | The art of the model 11

GPT 4o

o1 improvements

o3 improvements

12 of 57

Skills: The foundation of reasoning

Reasoning models unlocked with huge increase in benchmark scores:

  • Inference-time scaling
  • Better coding & math performance
  • More reliable tool-use

(a growing list!)

Lambert | The art of the model 12

GPT 4o

o1 improvements

o3 improvements

13 of 57

The allure of continuing to hillclimb on imperfect evals

Lambert | The art of the model 13

https://x.com/suchenzang/status/1701615026648605095

14 of 57

Anthropic’s Claude 3.7: Willing to cheat to finish code

Extensive reports of negative side-effects when coding.

Lambert | The art of the model 14

https://x.com/ArthurB/status/1897570146102743224

https://x.com/ArthurB/status/1897570146102743224

15 of 57

Anthropic’s Claude 3.7

  • Anthropic really focusing on strengths in agentic workflows and software development
  • Added thinking via toggle – which is a very odd experience. �Shouldn’t models know when they need help more than the users?
  • Confirmed by CEO Dario Amodei that “most recent progress has come through post training.” (See recent Hard Fork appearance)

Lambert | The art of the model 15

https://www.interconnects.ai/p/claude-3-7-thonks

16 of 57

Anthropic’s Claude 3.7

Anthropic’s Claude 3.7: Willing to cheat to finish code

Extensive reports of negative side-effects when coding.

Cause: Too much reinforcement learning on verifiable rewards on imperfect training data – i.e. the model passes the tests in the training data, but the tests don’t cover the actual test.

Lambert | The art of the model 16

17 of 57

Anthropic’s Claude 3.7

Anthropic’s Claude 3.7: Willing to cheat to finish code

Extensive reports of negative side-effects when coding.

Cause: Too much reinforcement learning on verifiable rewards on imperfect training data – i.e. the model passes the tests in the training data, but the tests don’t cover the actual test.

(other models from OpenAI and Gemini show tendencies of this, but not to the same extent)

Lambert | The art of the model 17

18 of 57

o3 hallucinates actions

Many users report o3 “trying” things or supporting findings with impossible supported results.

So far, doesn’t interfere with utility.

���

Lambert | The art of the model 18

More: https://www.interconnects.ai/p/openais-o3-over-optimization-is-back

https://transluce.org/investigating-o3-truthfulness

19 of 57

o3 hallucinates actions

Many users report o3 “trying” things or supporting findings with impossible supported results.

So far, doesn’t interfere with utility.

Suspected cause: Too much reinforcement learning training on long tool use chains, hard to regularize.

Lambert | The art of the model 19

More: https://www.interconnects.ai/p/openais-o3-over-optimization-is-back

https://transluce.org/investigating-o3-truthfulness

20 of 57

Technical bottlenecks

Lambert | The art of the model 20

21 of 57

OpenAI’s GPT 4.5

Totally unimpressive eval. scores, with:

  • Very big model (~10X compute of GPT4), very expensive in API
  • Biggest advantages in personality and reduced hallucinations
  • Diminishing returns to scaling?
    • Better base models unlock more post-training (i.e. RL)
    • Still need better training + serving infrastructure

Lambert | The art of the model 21

https://www.interconnects.ai/p/gpt-45-not-a-frontier-model

Expert in green text

22 of 57

OpenThoughts3 & Magistral

Recent, strong open models getting criticism for not being useful outside of narrow evaluation domains. The bar for AI releases continues to be raised for everyone!

Lambert | The art of the model 22

23 of 57

OpenThoughts3 & Magistral: Overthinking & overformatting

Lambert | The art of the model 23

https://x.com/kalomaze/status/1932528415904854166

https://x.com/kalomaze/status/1930927753987244236�

24 of 57

The pressure to give users what they “want”

Lambert | The art of the model 24

More: https://www.interconnects.ai/p/sycophancy-and-the-art-of-the-model

25 of 57

Sycophancy across the industry

Lambert | The art of the model 25

More: https://www.interconnects.ai/p/llama-4�https://www.interconnects.ai/p/sycophancy-and-the-art-of-the-model

Source: https://x.com/___frye/status/1916346474893656572

https://x.com/chatgpt21/status/1906624752304677096

Llama 4 LMArena Version

GPT 4o Sycophant

26 of 57

Why does sycophancy happen?

Users like it!

Ends up being a “tiebreak” in human preference data workflows that is hard to control for.

Lambert | The art of the model 26

More: https://rlhfbook.com/c/06-preference-data.html�https://www.interconnects.ai/p/sycophancy-and-the-art-of-the-model

https://openai.com/index/expanding-on-sycophancy/

27 of 57

Over-optimization, then and now

Over-optimization of reward models is very obvious when your training signals are better.

Now optimization is more robust and multifaceted, but not all skills we want from the models are covered by evaluations.

Stronger optimization will take from what you’re not measuring and move it to where you are!

Lambert | The art of the model 27

More: https://rlhfbook.com/c/17-over-optimization.html

Gao et al. 2022

28 of 57

xAI’s Grok 3

  • Having a lot of compute as a short path (with talent) to state-of-the-art models
  • Expanding Overton window of AI personalities with “based” and less safe model (even though it only feels slightly different from others)
  • Leading AI is not only the purview of OpenAI, Anthropic, and Google
  • Integrated with search and optional reasoning (becoming industry standards)

Lambert | The art of the model 28

https://www.interconnects.ai/p/grok-3-and-an-accelerating-ai-roadmap

29 of 57

The Goldilocks Zone: Evals, vibes, and price

Claude 3.5 Sonnet was likely the best model release we’ve had yet.

While the pace of progress is so high, random noise in the process makes the difference between models bigger (and the chance of getting a weird one higher).

Lambert | The art of the model 29

https://www.interconnects.ai/p/switched-to-claude-from-chatgpt

30 of 57

What’s next with models?

Lambert | The art of the model 30

31 of 57

Autonomy: A defining trend of new reasoning products

Lambert | The art of the model 31

METR, 2025.

32 of 57

Autonomy: A defining trend of new reasoning products

Lambert | The art of the model 32

METR, 2025.

Gains on this chart aren’t free! We need to constantly feed new training data and algorithms into the models.

33 of 57

Autonomy: A defining trend of new reasoning products

Lambert | The art of the model 33

METR, 2025.

Reasoning models better at planning will push this boundary!

Gains on this chart aren’t free! We need to constantly feed new training data and algorithms into the models.

34 of 57

How do we do this?

How do we train a reasoning model that can work autonomously 10X longer?

Lambert | The art of the model 34

35 of 57

What reasoning models for independent agents need

  1. Skills: The ability to solve self-contained problems.
  2. Calibration: The ability to understand the difficulty of a problem and not overthink.
  3. Strategy: The ability to choose the right high level plan.
  4. Abstraction: The ability to break down a strategy into solvable chunks.

Lambert | The art of the model 35

36 of 57

What reasoning models for independent agents need

  • Skills: The ability to solve self-contained problems.
  • Calibration: The ability to understand the difficulty of a problem and not overthink.
  • Strategy: The ability to choose the right high level plan.
  • Abstraction: The ability to break down a strategy into solvable chunks.

Lambert | The art of the model 36

We largely have this today

37 of 57

What reasoning models for independent agents need

  • Skills: The ability to solve self-contained problems.
  • Calibration: The ability to understand the difficulty of a problem and not overthink.
  • Strategy: The ability to choose the right high level plan.
  • Abstraction: The ability to break down a strategy into solvable chunks.

Lambert | The art of the model 37

We largely have this today

A lot of research underway

38 of 57

What reasoning models for independent agents need

  • Skills: The ability to solve self-contained problems.
  • Calibration: The ability to understand the difficulty of a problem and not overthink.
  • Strategy: The ability to choose the right high level plan.
  • Abstraction: The ability to break down a strategy into solvable chunks.

Lambert | The art of the model 38

We largely have this today

A lot of research underway

What is referred to as “planning”

39 of 57

Revisiting this example

Very skillful, lacking planning

Lambert | The art of the model 39

The classic RL overoptimization gif

(“Coast Runners”)

40 of 57

Revisiting this example

Very skillful, lacking planning

Search is a skill that o3 has taken a massive leap on.

Synthesizing complex information and comparisons requires better planning.

Lambert | The art of the model 40

The classic RL overoptimization gif

(“Coast Runners”)

41 of 57

RL as the focal point of language model development

Lambert | The art of the model 41

42 of 57

What I’m thinking about for scaling RL

  1. Get a big, multi-domain dataset of questions + answers
  2. Difficulty filtering: Not too easy, not too hard for starting checkpoint
  3. Let RL go for a long time
  4. Try all the random RL tricks for last few points (overlong filtering, two sided clipping, resetting reference model, Dr. GRPO advantage estimation…)

Lambert | The art of the model 42

43 of 57

From “post-training” to “training”

As we have already trained on more or less the whole internet, interest in RL training on more and more domains will grow.

How much further will the compute used in “post-training” grow?

Will “continual learning” work and further reduce pretraining?

Lambert | The art of the model 43

44 of 57

From “post-training” to “training”

DeepSeek V3 used .18% of compute on post-training.

DeepSeek V3 pretraining took <2 months.

DeepSeek R1 RL training took “a few weeks”*

DeepSeek R1 could already be ~10-20% of compute in GPU hours.

Scaling RL has just begun.

Lambert | The art of the model 44

* from a now deleted tweet from a DeepSeek employee.

45 of 57

Thank you!

nathan@natolambert.com

www.interconnects.ai

Lambert | The art of the model 45

46 of 57

Extra slides follow

Lambert | The art of the model 46

47 of 57

Reinforcement learning with verifiable rewards (RLVR)

Lambert | The art of the model 47

Tülu 3, Ai2, 2024.

Ways to compute rewards

  • Math-Verify (https://github.com/huggingface/Math-Verify)
  • LLM-as-a-judge for facts
  • Code Sandboxes
  • More!

48 of 57

Complexity of evaluation with inference-time compute

Additional compute use is not constant across model providers!

xAI’s official numbers Community comparison

Lambert | The art of the model 48

https://x.com/teortaxesTex/status/1895505072605606165/photo/1

49 of 57

What reasoning models for independent agents need

  • Skills: The ability to solve self-contained problems.
  • Calibration: The ability to understand the difficulty of a problem and not overthink.
  • Strategy: The ability to choose the right high level plan.
  • Abstraction: The ability to break down a strategy into solvable chunks.

Lambert | The art of the model 49

50 of 57

Parallel compute as amplification of reasoning abilities

Parallel compute and better verifiers increase the slope of inference-time scaling and in practice improve the robustness of answers (e.g. o1 pro)

Lambert | The art of the model 50

Claude’s Extended Thinking, Anthropic, 2025

51 of 57

Reinforcement learning with verifiable rewards (RLVR)

Lambert | The art of the model 51

Tülu 3, Ai2, 2024.

Ways to compute rewards

  • Math-Verify (https://github.com/huggingface/Math-Verify)
  • LLM-as-a-judge for facts
  • Code Sandboxes
  • More!

52 of 57

Calibration: Reasoners that try as hard as they need to

Effort is currently offloaded to the user:

  • Model selectors between reasoners or traditional instruct models,
  • Reasoning on/off buttons,
  • Reasoning effort selectors.

Soon the model will know how hard to think.

Lambert | The art of the model 52

53 of 57

Calibration: Reasoners that try as hard as they need to

Still, overthinking is a major problem.

Lambert | The art of the model 53

Generation length on GSM8K, MATH, AIME mix. Qu & Li et al. 2024https://arxiv.org/abs/2503.21614

Chen, Xu, Liang, He et al. 2024https://arxiv.org/abs/2412.21187

54 of 57

Strategy: Reasoning models that go in the right direction

There’s a large gap between reasoning models and agents build with reasoning models right now.

  • Reasoning models themselves do little planning.
  • Reasoning agents are prompted to plan.

Over time, picking the right plan needs to become model native and a core skill.

Lambert | The art of the model 54

Example where DeepSeek R1 dives right into a solution of a Frontier Math problem. No planning.

55 of 57

Abstraction: Reasoning models that break down a task

Questions for designing a LM that orchestrates its own plans:

  • How should a LM manage it’s memory?
  • How can a LM avoid repeating the same mistake?
  • How can a LM make sure it breaks down a plan into parts it can solve on its own?
  • How can a LM offload more thinking (e.g. parallel compute) to the hardest sub-tasks?
  • How can a LM work on multiple sub-tasks in parallel?

Lambert | The art of the model 55

56 of 57

Bootstrapping training data for planning

Q*, Strawberry, and finally o1 took 12+ months due to the need to create training data to seed models with reasoning skills (backtracking, verification, etc.)

Planning will go through a similar arc, but it will be easier to add.

Finally, RL can reinforce useful planning styles.

Lambert | The art of the model 56

57 of 57

The art

of a good (reasoning) model

AI Agents Summit

13 June 2025