Learning to Reason with a Teacher Model
Problem Setting
Given:
How can we obtain the best generalizable performance for this base model?
Preliminary: Existing Solutions. RLVR
Solutions:
Preliminary: Existing Solutions. Distillation
What if we use a bigger and smarter teacher model?
Solutions:
*Cetin, Edoardo, Tianyu Zhao, and Yujin Tang. "Reinforcement Learning Teachers of Test Time Scaling." arXiv preprint arXiv:2506.08388 (2025).
Intuition behind Align-SFT
For each prompt, we have a set S of correct reasoning traces that lead to correct answers.
Preliminary: Existing Solutions. Align-SFT
RL-train the teacher with sum of the two rewards:
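The two reward terms themselves are not reproduced in this text. As a rough sketch only, assuming they resemble the terms in the cited paper (Cetin et al., 2025): one term scores how likely the student finds the solution after reading the teacher's explanation, and the other penalizes explanation tokens that the student finds much less likely than the teacher does. The function names, the `alpha` and `lam` weights, and the mean-plus-extreme reduction are illustrative assumptions, not a confirmed formula.

```python
import numpy as np

def solution_reward(student_logp_solution, alpha=0.01):
    """Assumed term 1: student log-likelihood of the solution tokens,
    given the question and the teacher's explanation."""
    lp = np.asarray(student_logp_solution, dtype=float)
    # Mean over tokens, plus a small weight on the worst token.
    return lp.mean() + alpha * lp.min()

def alignment_penalty(teacher_logp_think, student_logp_think, alpha=0.01):
    """Assumed term 2: per-token log-probability gap between teacher and
    student on the explanation tokens (a sampled KL-style estimate)."""
    gap = (np.asarray(teacher_logp_think, dtype=float)
           - np.asarray(student_logp_think, dtype=float))
    return gap.mean() + alpha * gap.max()

def teacher_reward(student_logp_solution, teacher_logp_think,
                   student_logp_think, lam=1.0, alpha=0.01):
    """Sum of the two terms, with the penalty entering negatively."""
    r_solution = solution_reward(student_logp_solution, alpha)
    r_align = -lam * alignment_penalty(teacher_logp_think,
                                       student_logp_think, alpha)
    return r_solution + r_align
```

Under these assumptions, an explanation that the student finds surprising lowers the teacher's reward even when the final solution is likely, which pushes the teacher toward traces that are in-distribution for the student.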
Other Possible Solutions
Online Align-RL Algorithm
What is TOPR?
TOPR: Tapered off-policy REINFORCE
Le Roux, Nicolas, et al. "Tapered Off-Policy REINFORCE: Stable and Efficient Reinforcement Learning for LLMs." arXiv preprint arXiv:2503.14286 (2025).
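A minimal sketch of the TOPR weighting, as we understand it from the cited paper: samples with non-negative reward are weighted as in plain supervised fine-tuning (no importance ratio), while samples with negative reward are weighted by the importance ratio pi/mu truncated to [0, 1]. The function names, the use of sequence-level log-probabilities, and NumPy in place of an autodiff framework are assumptions made for illustration.

```python
import numpy as np

def topr_coefficients(logp_policy, logp_behavior, rewards):
    """Per-sample coefficient multiplying grad log pi(y|x).

    logp_policy:   log pi(y|x) under the model being trained
    logp_behavior: log mu(y|x) under the model that generated the sample
    rewards:       scalar reward per sample (negative = incorrect)
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_behavior = np.asarray(logp_behavior, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # exp(min(log-ratio, 0)) equals the ratio pi/mu clipped to [0, 1].
    tapered = np.exp(np.minimum(logp_policy - logp_behavior, 0.0))
    weights = np.where(rewards >= 0, 1.0, tapered)
    return weights * rewards

def topr_surrogate_loss(logp_policy, logp_behavior, rewards):
    """Surrogate loss; in an autodiff framework the coefficients would be
    treated as constants (stop-gradient) so only log pi is differentiated."""
    coef = topr_coefficients(logp_policy, logp_behavior, rewards)
    return -(coef * np.asarray(logp_policy, dtype=float)).mean()
```

The truncation keeps the push-down on negative samples bounded: once the policy already assigns an incorrect sample low probability, its weight shrinks toward zero instead of continuing to drive the log-probability down.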
How can we do Align-RL?
We need to generate correct and incorrect samples for the student using the teacher.
For the yes/no answer dataset:
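The details for the yes/no dataset are not reproduced in this text. One plausible reading, stated here as an assumption: for each question the teacher is asked to produce reasoning toward the gold answer (a correct sample) and toward the flipped answer (an incorrect sample), and the reasoning is then closed with the matching `<answer>` tag. The `TeacherRequest` type and both helper functions are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass

FLIP = {"yes": "no", "no": "yes"}

@dataclass
class TeacherRequest:
    question: str
    target_answer: str   # the answer the teacher's reasoning should lead to
    is_correct: bool     # whether target_answer matches the gold answer

def make_teacher_requests(question: str, gold_answer: str) -> list[TeacherRequest]:
    """Build one correct and one incorrect request for a yes/no question."""
    gold = gold_answer.strip().lower()
    if gold not in FLIP:
        raise ValueError(f"expected 'yes' or 'no', got {gold_answer!r}")
    return [
        TeacherRequest(question, gold, True),
        TeacherRequest(question, FLIP[gold], False),
    ]

def assemble_sample(teacher_reasoning: str, request: TeacherRequest) -> str:
    """Close the reasoning and append the answer tag in the student's format."""
    return f"{teacher_reasoning.strip()}</think> <answer>{request.target_answer}</answer>"
```

Each assembled sample keeps its `is_correct` label, so it can later be given a positive or negative reward when training the student.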
Experiments
Dataset: ~1600 general-knowledge question prompts for training; ~700 prompts for testing.
Models: Qwen2.5-1.5B as the student, and either 1.5B or 7B as the teacher.
Runtime: ~250 updates per model.
Experiments: 7B to 1.5B
Setup: The 7B teacher is first trained with GRPO to solve the task.
For aligned setups, the teacher is fine-tuned to be in-distribution for the student.
Example
You are a teacher explaining an incorrect final answer. A conversation between User and Assistant. The User gives a question and its final answer. The Assistant reconstructs only the reasoning process in the mind that leads to this answer. Output only the reasoning process inside <think> </think> tags and DO NOT output the <answer> tag. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. If the question can be answered with 'yes' or 'no', your answer must be 'yes' or 'no'.
This is the problem:
Are sesame seeds glued onto hamburger buns?
Assistant: <think>
Answers
Sesame seeds are commonly found on the top of hamburger buns. Glue is a substance used to hold things together. Sesame seeds serve as a natural adhesive, creating a sticky surface for toppings like cheese or meats. Therefore, it can be concluded that sesame seeds are indeed glued onto hamburger buns. So, the answer is yes.</think> <answer>yes</answer>
To determine if sesame seeds are glued onto hamburger buns, we need to consider the nature of the sesame seeds and the bun. Sesame seeds are naturally occurring on the surface of sesame plants and are commonly used as a topping for various foods, including hamburger buns. Hamburger buns are made from flour, water, yeast, and other ingredients, and the sesame seeds are typically applied to the surface of the bun during the baking process. There is no need for glue as the sesame seeds are already sticky and will adhere to the bun's surface without any additional adhesive. Therefore, the answer is no.</think> <answer>no</answer>
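Both responses above end with an `<answer>` tag, which is what makes them checkable automatically. A minimal sketch of such a check, assuming a binary reward and exact matching on the lowercased answer; the function names and the rule that a response must contain exactly one `<answer>` tag are assumptions, not details taken from the experiments.

```python
import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL | re.IGNORECASE)

def extract_answer(completion: str):
    """Return the lowercased answer, or None unless exactly one tag is present."""
    matches = ANSWER_RE.findall(completion)
    if len(matches) != 1:
        return None
    return matches[0].strip().lower()

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the tagged answer equals the gold answer, else 0.0."""
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0
```

With the gold answer "no", the first response would score 0.0 and the second 1.0 under this check, regardless of how fluent the reasoning reads.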
Conclusion. Next steps
Experiments: 7B to 1.5B. Truncated Importance Ratio
Experiments: 7B to 1.5B Online
Experiments (1.5B)
Blue: baseline, online GRPO training. Yellow: Finetune-RL, initial teacher training followed by student fine-tuning. Green: Finetune-RL, initial teacher training followed by one teacher iteration and four student iterations.
Example
A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. If the question can be answered with 'yes' or 'no', your answer must be 'yes' or 'no'.
This is the problem:
Are pennies commonly used in Canada?
Assistant: <think>
Answers
1. First, let's consider the currency system in Canada. The Canadian dollar is the official currency of the country, and it consists of various coins and banknotes.
2. Among these coins, the penny, also known as the "cent" in Canada, is one of the denominations. It represents one-hundredth of a Canadian dollar.
3. The penny has been in circulation in Canada since 1858, making it one of the oldest coins in the country's history.
4. Although the Canadian government announced in 2012 that it would phase out the penny due to its high production and distribution costs, it has not been completely eliminated from circulation.
5. The penny is still widely used in Canada, especially for small transactions where exact change is not required, or when rounding up or down is more convenient for both the buyer and the seller.
6. Therefore, the answer to the question "Are pennies commonly used in Canada?" is yes.</think> <answer>yes</answer>
1. The question asks if pennies are commonly used in Canada.
2. Canada adopted its own currency called the Canadian dollar (CAD) in 1867, replacing the British North America currency.
3. The Canadian dollar has existed alongside the U.S. dollar and the British pound sterling for a significant period of time.
4. In the 1980s, Canada decided to phase out the penny due to rising production costs and the environmental impact of the metal used to make it.
5. On February 4, 2013, the Canadian government announced that the penny would no longer be in circulation, as it had become too expensive to produce and had little value to people.
6. Since the penny was removed from circulation, Canadian transactions are typically rounded to the nearest nickel (5 cents).
7. Therefore, pennies are not commonly used in Canada.</think> <answer>no</answer>
Multi-Agentic System Fine-Tuning
Motivation
*A. Williams, DELLMPHI
Motivation
*Venkatraman, Siddarth, et al. "Recursive self-aggregation unlocks deep thinking in large language models." arXiv preprint arXiv:2509.26626 (2025).
Problem Setting
Quick experiment
Model: Qwen3-4B-Instruct-2507
Setup: Multi-Agent System with three components: one Yes opponent, one No opponent, and one Verifier.
Dataset: General Knowledge Yes/No questions.
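The three-component setup can be sketched as a small orchestration loop. This is an illustration under stated assumptions, not the code used in the experiment: `generate` is a hypothetical callable that maps a prompt string to the model's text output, and the prompt wording is invented for the sketch.

```python
from typing import Callable, Optional

Generate = Callable[[str], str]  # hypothetical: prompt in, model text out

def run_debate(question: str, generate: Generate) -> Optional[str]:
    """Ask one agent to argue 'yes', one to argue 'no', then let a verifier decide."""
    yes_case = generate(
        f"Argue that the answer is yes.\nQuestion: {question}"
    )
    no_case = generate(
        f"Argue that the answer is no.\nQuestion: {question}"
    )
    verdict = generate(
        "You are the verifier. Read both arguments and reply with 'yes' or 'no'.\n"
        f"Question: {question}\n"
        f"Argument for yes: {yes_case}\n"
        f"Argument for no: {no_case}"
    )
    # Normalize the verdict; return None if it is neither yes nor no.
    cleaned = verdict.strip().lower()
    for label in ("yes", "no"):
        if cleaned.startswith(label):
            return label
    return None
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library, and all three roles can share the same underlying model, as in the setup above.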
Results (Accuracy):