1 of 21

The state of reasoning

Nathan Lambert

Ai2 // Interconnects.ai

Latent Space // NeurIPS 2024


2 of 21

Oxford dictionary definition

“the action of thinking about something in a logical, sensible way.”


3 of 21

Community definitions are verging on a litmus test rather than a technical debate (like AGI)


4 of 21

Why should LM reasoning be constrained to look like human reasoning?
(and we don’t really know what human reasoning looks like)


5 of 21

Ross Taylor on why Chain of Thought is permissible

“So if you’re prompting the language model directly for the answer, you're expecting the language model in that forward pass to maintain and manipulate the state in a latent space, whereas the way chain-of-thought does it is in token space.

So you essentially output the intermediate steps. One of the problems with reasoning is that we have no idea how humans mechanistically reason…but if you think about how you'd solve a GSM8K problem in your head, then to me this seems a lot closer to something like chain-of-thought than adaptive computation.”


6 of 21

Language models have randomness built in, so their reasoning process should mirror that. Humans do not.


7 of 21

o1 models as maximizing the CoT approach


8 of 21

What is o1?

I think it is: A lot of RL on verifiable outcomes.

I don’t think it is: Most of the complicated things proposed, like PRMs, self-play (???), etc.


9 of 21

o1 replications (relatives) coming fast


10 of 21

What is o1?

I think it is: A lot of RL on verifiable outcomes.

I don’t think it is: Most of the complicated things proposed, like PRMs, self-play (???), etc.

Headings from SemiAnalysis ($):

  • Incredible Amounts of Forward Passes During Training → many iterations over data / sampling many reasoning options in RL training.
  • Post-Training FLOPS Exceed Pre-Training → much bigger RL training than any of the early attempts at replication, yielding functionality on more domains.


11 of 21

Reinforcement finetuning


https://openai.com/form/rft-research-program/

12 of 21

What is reinforcement finetuning?

Uses repeated passes over the data with RL to encourage the model to figure out more robust behaviors in targeted domains.

Requires:

  1. Training data with explicitly correct answers.
  2. A grader (or extraction program) for verifying outputs.
  3. A model that can sometimes generate a correct solution (even at low probability); otherwise there is no signal for RL to learn from.

Key innovation: Improving targeted skills reliably without degradation on other tasks.
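To make these three requirements concrete, below is a minimal sketch of one RFT-style pass, under stated assumptions: sample_completions is a hypothetical stand-in for sampling from the policy model, the grader is a naive substring check, and the actual policy-gradient update (e.g. PPO) is omitted. This is not OpenAI's implementation.

```python
# Minimal sketch of one reinforcement-finetuning pass (not OpenAI's code).
# `sample_completions` is a hypothetical stand-in for sampling from the policy.
import random
from typing import Callable, List, Tuple


def grade(completion: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the reference answer appears in the completion.

    A real grader would extract and normalize the answer first (see the
    answer-extraction slide)."""
    return 1.0 if reference_answer in completion else 0.0


def rft_pass(
    prompts: List[str],
    answers: List[str],
    sample_completions: Callable[[str, int], List[str]],
    samples_per_prompt: int = 8,
) -> List[Tuple[str, float]]:
    """Sample completions, grade them, and return (completion, reward) pairs
    that an RL update (e.g. PPO) would then train on."""
    batch = []
    for prompt, answer in zip(prompts, answers):
        completions = sample_completions(prompt, samples_per_prompt)
        rewards = [grade(c, answer) for c in completions]
        # If every reward is 0.0, the model never produced a correct solution,
        # so this prompt contributes no learning signal (requirement 3 above).
        batch.extend(zip(completions, rewards))
    return batch


if __name__ == "__main__":
    # Toy "model" that guesses digits, just to show the data flow.
    fake_model = lambda prompt, n: [f"The answer is {random.randint(0, 9)}." for _ in range(n)]
    print(rft_pass(["What is 2 + 2?"], ["4"], fake_model))
```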


13 of 21

RL finetuning data format

Two components:

  1. Prompt.
  2. Answer.

Currently most popular for math and code, but expanding to more domains fast.

OpenAI’s example: Biology.
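For illustration, a hypothetical pair of such records might look like the following; the contents are invented here (not OpenAI's biology data), and the only required fields are a prompt and a verifiable answer, with no reference chain of thought.

```python
# Hypothetical prompt/answer records for RL finetuning (illustrative only).
import json

examples = [
    {"prompt": "What is the greatest common divisor of 18 and 48?", "answer": "6"},
    {"prompt": "Simplify 0.25 as a fraction in lowest terms.", "answer": "1/4"},
]

for example in examples:
    print(json.dumps(example))
```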


14 of 21

RL finetuning data format

Two components:

  • Prompt.
  • Answer.

Currently most popular for math and code, but expanding to more domains fast.

Tülu 3 example: Precise instruction following.
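For precise instruction following, the "answer" is a checkable constraint rather than a literal string. A rough sketch of such a verifier is below; the exact-bullet-count constraint is an invented example, and Tülu 3's actual constraint checks live in the open-instruct repository.

```python
# Rough sketch of a constraint verifier for precise instruction following.
# The "exactly N bullet points" constraint is an invented example, not the
# actual Tülu 3 constraint set (see https://github.com/allenai/open-instruct).

def follows_constraint(completion: str, num_bullets: int = 3) -> bool:
    """Return True only if the completion contains exactly `num_bullets` bullet lines."""
    bullets = [line for line in completion.splitlines() if line.strip().startswith("- ")]
    return len(bullets) == num_bullets


good = "- point one\n- point two\n- point three"
bad = "Here are my thoughts:\n- only one point"
print(follows_constraint(good), follows_constraint(bad))  # True False
```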


15 of 21

Answer extraction & grader models

For complex queries, extracting the answer can be complicated.

E.g. .05 vs 1/20 vs \frac{1}{20} vs 5E-02 vs 5 x 10^-2…

OpenAI uses a specialized LM to extract answers.

Early replications can start with Python code based extraction.


This is the reward shaping step from RL.
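A rough sketch of the kind of Python-based extraction an early replication might start with is below; the normalization rules (LaTeX fractions, scientific notation) are illustrative and are not OpenAI's specialized grader model.

```python
# Illustrative Python-based answer normalization for grading (not OpenAI's grader).
import re
from fractions import Fraction


def normalize(answer: str) -> float:
    """Map equivalent spellings of a number (.05, 1/20, \\frac{1}{20}, 5E-02, 5 x 10^-2) to one float."""
    s = answer.strip()
    frac = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", s)    # LaTeX \frac{a}{b}
    if frac:
        return float(Fraction(int(frac.group(1)), int(frac.group(2))))
    sci = re.fullmatch(r"(-?[\d.]+)\s*[x×]\s*10\^(-?\d+)", s)  # "5 x 10^-2"
    if sci:
        return float(sci.group(1)) * 10 ** int(sci.group(2))
    if "/" in s:                                               # plain "1/20"
        return float(Fraction(s))
    return float(s)                                            # ".05", "5E-02"


def reward(model_output: str, reference: str, tol: float = 1e-9) -> float:
    """Shaped binary reward: 1.0 if the normalized answers agree, else 0.0."""
    try:
        return 1.0 if abs(normalize(model_output) - normalize(reference)) <= tol else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0


print({a: normalize(a) for a in [".05", "1/20", r"\frac{1}{20}", "5E-02", "5 x 10^-2"]})
```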

16 of 21

RL finetuning learning / infra


17 of 21

RL finetuning learning / infra


18 of 21

RL finetuning learning / infra


19 of 21

RL finetuning learning / infra in the open


Lambert, Nathan, et al. "Tulu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv preprint arXiv:2411.15124 (2024).

Used very similar techniques to train Tülu 3 models.

https://github.com/allenai/open-instruct

20 of 21

RL finetuning learning / infra in the open


Used very similar techniques to train Tülu 3 models.

https://github.com/allenai/open-instruct

21 of 21

I wrote about Reinforcement Finetuning (RFT) today
