The state of reasoning
Nathan Lambert
Ai2 // Interconnects.ai
Latent Space // NeurIPS 2024
Oxford dictionary definition
“the action of thinking about something in a logical, sensible way.”
Community definitions are verging on a litmus test rather than a technical debate (much like "AGI")
Recommended post discussing this: https://aiguide.substack.com/p/the-llm-reasoning-debate-heats-up
Why should LM reasoning be constrained to look like human reasoning? (And we don’t really know what human reasoning looks like.)
Ross Taylor on why Chain of Thought is permissible
“So if you’re prompting the language model directly for the answer, you're expecting the language model in that forward pass to maintain and manipulate the state in a latent space, whereas the way chain-of-thought does it is in token space.
So you essentially output the intermediate steps. One of the problems with reasoning is that we have no idea how humans mechanistically reason…but if you think about how you'd solve a GSM8K problem in your head, then to me this seems a lot closer to something like chain-of-thought than adaptive computation.”
Source: https://www.interconnects.ai/p/interviewing-ross-taylor-on-llm-reasoning
Further context from Ross: https://www.youtube.com/watch?v=S5l5OvJ01ws
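To make the token-space vs. latent-space contrast concrete, here is a minimal, illustrative sketch; the question and prompt wording are my own, not from the interview:

```python
# Illustrative sketch: the same GSM8K-style question asked two ways.
# "Direct" forces the model to resolve all intermediate state inside a single
# forward pass (latent space); "chain of thought" externalizes the
# intermediate steps as tokens the model can condition on.

question = (
    "A bakery sells 14 muffins per tray. It bakes 6 trays and sells "
    "all but 9 muffins. How many muffins were sold?"
)

direct_prompt = f"{question}\nAnswer with a single number."

cot_prompt = f"{question}\nThink step by step, then give the final answer after 'Answer:'."

# With chain of thought, a model might emit:
#   "6 trays * 14 muffins = 84 muffins. 84 - 9 = 75. Answer: 75"
# Each intermediate quantity (84, 75) lives in token space, so later steps
# can attend to it rather than re-deriving it in activations.
```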
Language models have randomness built in; humans do not. The reasoning process should mirror that.
o1 models as maximizing the CoT approach
What is o1?
I think it is: A lot of RL on verifiable outcomes.
I don’t think it is: More complicated things like PRMs, self-play (???), etc.
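A minimal sketch of what "RL on verifiable outcomes" means at the reward level, assuming a simple 'Answer:' output convention; the function names are illustrative, not OpenAI's:

```python
# Minimal sketch of a verifiable-outcome reward: the reward is computed by a
# program checking against ground truth, not by a learned reward model.

def extract_final_answer(completion: str) -> str:
    """Assume the model ends its reasoning with 'Answer: <value>'."""
    marker = "Answer:"
    if marker not in completion:
        return ""
    return completion.rsplit(marker, 1)[-1].strip()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted answer matches, else 0.0."""
    return 1.0 if extract_final_answer(completion) == ground_truth.strip() else 0.0

print(verifiable_reward("6 * 14 = 84; 84 - 9 = 75. Answer: 75", "75"))  # 1.0
```

This scalar reward then plugs into a standard policy-gradient update over sampled completions, with no PRM or self-play required.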
o1 replications (relatives) coming fast
Headings from SemiAnalysis ($) point in the same direction.
Reinforcement finetuning
https://openai.com/form/rft-research-program/
What is reinforcement finetuning?
Uses repeated passes over the data with RL to encourage the model to learn more robust behaviors in target domains.
Requires: training data with verifiably correct answers, plus a grader to check model outputs against them.
Key innovation: improving targeted skills reliably without degradation on other tasks.
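A pseudocode-level sketch of the loop this description implies; `policy`, `grader`, and their methods are hypothetical stand-ins, not a real API:

```python
# Sketch of reinforcement finetuning as described above: many epochs over a
# (possibly small) prompt set, sampling completions and reinforcing the ones
# the grader marks correct. All names are placeholders.

def reinforcement_finetune(policy, dataset, grader, epochs=10, samples_per_prompt=8):
    for _ in range(epochs):                      # repeated passes over the data
        for prompt, ground_truth in dataset:
            completions = [policy.sample(prompt) for _ in range(samples_per_prompt)]
            rewards = [grader(c, ground_truth) for c in completions]
            # Policy-gradient step: raise the likelihood of rewarded
            # completions; a KL penalty to the starting policy is what keeps
            # other skills from degrading.
            policy.update(prompt, completions, rewards)
    return policy
```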
RL finetuning data format
Two components: a prompt and a verifiably correct answer.
Currently most popular for math and code, but expanding to new domains fast.
OpenAI’s example: Biology.
Tülu 3 example: Precise instruction following.
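A hedged sketch of what such training examples could look like; the field names are illustrative, not OpenAI's or Tülu 3's exact schema:

```python
# Illustrative RFT-style training examples: a prompt plus something a program
# (or grader model) can verify. Field names are made up for this sketch.

math_example = {
    "prompt": "What is 1/20 as a decimal?",
    "verifiable_answer": "0.05",  # checked by an equivalence grader
}

# Tülu 3-style precise instruction following: the "answer" is a programmatic
# constraint on the response, not a string to match.
instruction_example = {
    "prompt": "Write a haiku about RL that contains the word 'reward'.",
    "constraint": lambda response: "reward" in response.lower(),
}
```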
Answer extraction & grader models
For complex queries, extracting the answer can be complicated.
E.g. .05 vs 1/20 vs \frac{1}{20} vs 5E-02 vs 5 x 10^-2…
OpenAI uses a specialized LM to extract answers.
Early replications can start with Python-based extraction code.
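A minimal sketch of that kind of Python-based extraction, normalizing the surface forms listed above before comparison:

```python
# Minimal answer-normalization sketch, the kind of thing early replications
# can use before reaching for a grader LM.
from fractions import Fraction
import re

def normalize_numeric(answer: str) -> float | None:
    """Map surface forms like '.05', '1/20', '\\frac{1}{20}', '5E-02',
    '5 x 10^-2' to a comparable float; return None if unparseable."""
    s = answer.strip()
    # LaTeX fraction -> 'a/b'
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", s)
    if m:
        s = f"{m.group(1)}/{m.group(2)}"
    # '5 x 10^-2' -> '5e-2'
    m = re.fullmatch(r"(-?[\d.]+)\s*[x×]\s*10\^(-?\d+)", s)
    if m:
        s = f"{m.group(1)}e{m.group(2)}"
    try:
        return float(Fraction(s)) if "/" in s else float(s)
    except (ValueError, ZeroDivisionError):
        return None

assert normalize_numeric(".05") == normalize_numeric("1/20") == \
       normalize_numeric(r"\frac{1}{20}") == normalize_numeric("5E-02") == \
       normalize_numeric("5 x 10^-2") == 0.05
```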
This is the reward-shaping part of the RL setup.
RL finetuning learning / infra
RL finetuning learning / infra in the open
Lambert | Thoughts on reasoning 19
Lambert, Nathan, et al. "Tülu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv preprint arXiv:2411.15124 (2024).
I wrote about Reinforcement Finetuning (RFT) today