5 of 30

Preliminary: Existing Solutions. Distillation

What if we use a bigger and smarter teacher model?

Solutions:

RLVR: Train our model with Reinforcement Learning with Verifiable Rewards
SFT: Distill from the teacher to the student model.
Align-SFT: First fine-tune the teacher to be more in-distribution for the student, then distill*.

*Cetin, Edoardo, Tianyu Zhao, and Yujin Tang. "Reinforcement Learning Teachers of Test Time Scaling." arXiv preprint arXiv:2506.08388 (2025).

6 of 30

Intuition behind Align-SFT

For each prompt, we have a set of correct reasoning traces S that lead to correct answers.

The base teacher model can reproduce a subset Sₜ ⊆ S.
The student model can reproduce only Sₛ, with Sₛ ⊆ Sₜ.
Aligning the teacher corresponds to pruning Sₜ toward Sₛ.

7 of 30

Preliminary: Existing Solutions. Align-SFT

Choose base teacher and student models.
Create the teacher model by fine-tuning (SFT) the base teacher on the teacher dataset with a new teacher prompt style (question + ground-truth solution → explanation). The teacher gets high-quality data and more information than the student: it already knows the answer.
RL-train the teacher to be more in-distribution with respect to the student
Generate traces with the trained teacher
Distill to students via SFT


8 of 30

Preliminary: Existing Solutions. Align-SFT


9 of 30

Preliminary: Existing Solutions. Align-SFT

RL-train the teacher with the sum of two rewards:
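
The reward formulas themselves are not preserved in this export. As a rough sketch only, assuming the two terms follow the general shape described in the cited paper (a student solution-likelihood term and a teacher-versus-student KL term), the combination could look like the following. The function name, the argument names, the `kl_weight` coefficient, and the sample-based KL estimate are all assumptions made for illustration.

```python
def combined_teacher_reward(student_solution_logps,
                            teacher_think_logps,
                            student_think_logps,
                            kl_weight=1.0):
    """Hypothetical combination of two teacher rewards (illustrative only).

    student_solution_logps: student log-probs of the ground-truth solution
        tokens, given the question and the teacher's explanation.
    teacher_think_logps / student_think_logps: log-probs that the teacher and
        the student assign to the same explanation tokens.
    """
    if len(teacher_think_logps) != len(student_think_logps):
        raise ValueError("explanation log-prob lists must have equal length")

    # Term 1 (assumed): how well the student predicts the solution
    # after reading the teacher's explanation.
    r_solution = sum(student_solution_logps) / len(student_solution_logps)

    # Term 2 (assumed): sample-based estimate of KL(teacher || student)
    # over the explanation tokens; large values mean the explanation
    # is out-of-distribution for the student.
    r_kl = sum(t - s for t, s in zip(teacher_think_logps, student_think_logps))
    r_kl /= len(teacher_think_logps)

    # The KL term enters as a penalty, so the "sum" has a negative weight.
    return r_solution - kl_weight * r_kl
```

With this form, raising `kl_weight` pushes the teacher harder toward explanations the student itself would find likely, at the cost of the solution-likelihood term.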

10 of 30

11 of 30

Online Align-RL Algorithm

12 of 30

Online Align-RL Algorithm

Choose base teacher and student models.
Optionally, train the teacher to be aligned with the base student model.
Optionally, co-train the student with teacher traces and further train the teacher to be more aligned with the student.

13 of 30

What is TOPR

TOPR: Tapered off-policy REINFORCE

Roux, Nicolas Le, et al. "Tapered off-policy reinforce: Stable and efficient reinforcement learning for llms." arXiv preprint arXiv:2503.14286 (2025).
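
As a minimal sketch of the idea, assuming the per-sample form given in the cited paper: samples with non-negative reward get an untouched REINFORCE weight, while samples with negative reward are weighted by the importance ratio π/μ clipped to [0, 1]. The function and argument names below are illustrative, and sequence-level log-probabilities are assumed as inputs.

```python
import math


def topr_coefficients(logp_policy, logp_behavior, rewards):
    """Per-sample multipliers applied to grad log pi(y) (illustrative sketch).

    logp_policy:   log pi(y) under the current policy, one value per sample.
    logp_behavior: log mu(y) under the policy that generated the samples.
    rewards:       scalar reward per sample (positive or negative).
    """
    coefficients = []
    for lp, lb, reward in zip(logp_policy, logp_behavior, rewards):
        if reward >= 0:
            # Positive samples: no importance weighting.
            weight = 1.0
        else:
            # Negative samples: importance ratio tapered to at most 1.
            weight = min(1.0, math.exp(lp - lb))
        coefficients.append(weight * reward)
    return coefficients
```

The tapering means a negative sample that the current policy already considers unlikely contributes almost nothing, which is what keeps the negative term from pushing probabilities down without bound.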

14 of 30

How can we do Align-RL?

We need to generate correct and incorrect samples for the student using the teacher.

For the yes/no answer dataset:

We prompt the teacher to generate both “yes” and “no” answers for each prompt and train the teacher so that its correct and incorrect answers are in-distribution for the student.
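
A small sketch of how the paired requests could be built for one question. The prompt template here is an abbreviated, hypothetical stand-in for the real teacher prompt, and the field names are assumptions; only the idea of asking for both a "yes" and a "no" explanation comes from the slide.

```python
def build_teacher_requests(question, gold_answer):
    """Return one teacher request per target answer ('yes' and 'no').

    Each request records whether the target matches the gold answer, so the
    resulting explanation can later be used as a correct or incorrect sample.
    """
    if gold_answer not in ("yes", "no"):
        raise ValueError("gold_answer must be 'yes' or 'no'")

    requests = []
    for target in ("yes", "no"):
        is_correct = target == gold_answer
        label = "a correct" if is_correct else "an incorrect"
        # Hypothetical, shortened template: the teacher is told the final
        # answer and asked to reconstruct reasoning that leads to it.
        prompt = (
            f"You are a teacher explaining {label} final answer.\n"
            f"Question: {question}\n"
            f"Final answer: {target}\n"
            "Assistant: <think>"
        )
        requests.append(
            {"prompt": prompt, "target": target, "is_correct": is_correct}
        )
    return requests
```

For a question whose gold answer is "no", this yields one incorrect sample (target "yes") and one correct sample (target "no").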

15 of 30

Experiments

Dataset: ~1600 prompts of general-knowledge questions. Test set: ~700 prompts.

Models: Qwen2.5-1.5B as the student, and either 1.5B or 7B as the teacher.

Runtime: ~250 updates per model.

16 of 30

Experiments: 7B to 1.5B

Setup: The 7B teacher is first trained with GRPO to solve the task.

For the aligned setups, the teacher is fine-tuned to be in-distribution for the student.

default distillation
align-distill
align-RL
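
For reference, the core of GRPO mentioned in the setup is a group-relative advantage: several answers are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation. The sketch below shows only that step; the function name, the `eps` constant, and the use of the population standard deviation are assumptions (implementations differ on these details).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sample group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    # eps avoids division by zero when all samples get the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

If every sample in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.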

17 of 30

Example

You are a teacher explaining an incorrect final answer. A conversation between User and Assistant. The User gives a question and its final answer. The Assistant reconstructs only the reasoning process in the mind that leads to this answer. Output only the reasoning process inside <think> </think> tags and DO NOT output the <answer> tag. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. If the question can be answered with 'yes' or 'no', your answer must be 'yes' or 'no'.

This is the problem:

Are sesame seeds glued onto hamburger buns?

Assistant: <think>

18 of 30

Answers

Sesame seeds are commonly found on the top of hamburger buns. Glue is a substance used to hold things together. Sesame seeds serve as a natural adhesive, creating a sticky surface for toppings like cheese or meats. Therefore, it can be concluded that sesame seeds are indeed glued onto hamburger buns. So, the answer is yes.</think> <answer>yes</answer>

To determine if sesame seeds are glued onto hamburger buns, we need to consider the nature of the sesame seeds and the bun. Sesame seeds are naturally occurring on the surface of sesame plants and are commonly used as a topping for various foods, including hamburger buns. Hamburger buns are made from flour, water, yeast, and other ingredients, and the sesame seeds are typically applied to the surface of the bun during the baking process. There is no need for glue as the sesame seeds are already sticky and will adhere to the bun's surface without any additional adhesive. Therefore, the answer is no.</think> <answer>no</answer>

19 of 30

Conclusion. Next steps

Overfitting on the training set.
Either KL is not a good proxy for model alignment, or better alignment does not translate into better performance.
Regularization is needed.
A larger training dataset is needed.

20 of 30

Experiments: 7B to 1.5B. Truncated Importance Ratio

21 of 30

Experiments: 7B to 1.5B Online

22 of 30

Experiments (1.5B)

Blue: baseline, online GRPO training.
Yellow: Finetune-RL, initial teacher training, then student fine-tuning.
Green: Finetune-RL, initial teacher training, then one iteration of teacher training and four iterations of student training.

23 of 30

Experiments (1.5B)

24 of 30

Example

A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. If the question can be answered with 'yes' or 'no', your answer must be 'yes' or 'no'.

This is the problem:

Are pennies commonly used in Canada?

Assistant: <think>

25 of 30

Answers

1. First, let's consider the currency system in Canada. The Canadian dollar is the official currency of the country, and it consists of various coins and banknotes.

2. Among these coins, the penny, also known as the "cent" in Canada, is one of the denominations. It represents one-hundredth of a Canadian dollar.

3. The penny has been in circulation in Canada since 1858, making it one of the oldest coins in the country's history.

4. Although the Canadian government announced in 2012 that it would phase out the penny due to its high production and distribution costs, it has not been completely eliminated from circulation.

5. The penny is still widely used in Canada, especially for small transactions where exact change is not required, or when rounding up or down is more convenient for both the buyer and the seller.

6. Therefore, the answer to the question "Are pennies commonly used in Canada?" is yes.</think> <answer>yes</answer>

1. The question asks if pennies are commonly used in Canada.

2. Canada adopted its own currency called the Canadian dollar (CAD) in 1867, replacing the British North America currency.

3. The Canadian dollar has existed alongside the U.S. dollar and the British pound sterling for a significant period of time.

4. In the 1980s, Canada decided to phase out the penny due to rising production costs and the environmental impact of the metal used to make it.

5. On February 4, 2013, the Canadian government announced that the penny would no longer be in circulation, as it had become too expensive to produce and had little value to people.

6. Since the penny was removed from circulation, Canadian transactions are typically rounded to the nearest nickel (5 cents).

7. Therefore, pennies are not commonly used in Canada.</think> <answer>no</answer>

26 of 30

Multi-Agentic System Fine-Tuning

27 of 30

Motivation

*A. Williams, DELLMPHI

28 of 30

Motivation

*Venkatraman, Siddarth, et al. "Recursive self-aggregation unlocks deep thinking in large language models." arXiv preprint arXiv:2509.26626 (2025).

29 of 30

Problem Setting

Can we tune multi-agent LLM systems to be more efficient?

30 of 30

Quick experiment

Model: Qwen3-4B-Instruct-2507

Setup: Multi-Agent System with three components: one Yes opponent, one No opponent, and one Verifier.

Dataset: General Knowledge Yes/No questions.

Results (Accuracy):

Base Qwen3-4B-Instruct-2507: 61%
Base Multi-Agent System: 65%
Tuned Multi-Agent System: 73%
Tuned with GRPO (Qwen3-4B-Instruct-2507): 78%
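
The three-component setup can be sketched as a single pass in which the Yes and No components each argue their side and the Verifier picks the final answer. This is a hypothetical outline, not the code used in the experiment: the prompt wording, the `Agent` type, and the fallback to `None` on a malformed verdict are all assumptions, and each agent is any callable mapping a prompt string to a response string.

```python
from typing import Callable, Optional

Agent = Callable[[str], str]  # prompt -> model response (assumed interface)


def debate(question: str, yes_agent: Agent, no_agent: Agent,
           verifier: Agent) -> Optional[str]:
    """Run one Yes/No/Verifier round and return 'yes', 'no', or None."""
    yes_case = yes_agent(f"Argue that the answer is 'yes'.\nQuestion: {question}")
    no_case = no_agent(f"Argue that the answer is 'no'.\nQuestion: {question}")

    verdict = verifier(
        f"Question: {question}\n"
        f"Argument for yes: {yes_case}\n"
        f"Argument for no: {no_case}\n"
        "Reply with 'yes' or 'no'."
    )
    verdict = verdict.strip().lower()
    # Treat anything other than a clean yes/no as an invalid answer.
    return verdict if verdict in ("yes", "no") else None
```

In a real run the three callables would wrap calls to the same or different model instances; stubs are enough to exercise the control flow.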