W05: Training: SFT, ICL and Model Scaling
CS6101/DYC1401 Retrieval Augmented Generation
Week 05: 11 Sep 2025 AY 25/26 Sem 1 (T2510)
Vangmay
Zihang Fu
Hongxu
Benjamin Goh
Takanori Aoki
Outline
Section 1: Intro to SFT, ICL, Scaling and Scaling Laws
Section 2: Supervised fine-tuning (SFT)
Section 3: In-context learning (ICL)
Section 4: Model Scaling
Section 5: Results from Scaling SFT and ICL
2
1: Introduction
3
Vangmay
Outline
Section 1: Intro to SFT, ICL, Scaling and Scaling Laws
4
The Why and the What
LLMs go through three main stages in their development: Pre-training, Instruction Tuning, and Post-training
5
The Why and the What
Supervised fine-tuning (SFT) is the process of training a pretrained model further on a labelled dataset of input-output pairs
6
The How
7
The How: Objective function
The objective function is typically maximum likelihood estimation (MLE): the model's weights/parameters are updated so that they maximize the probability of generating the correct output according to the training set.
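A minimal numpy sketch of this objective as per-token cross-entropy (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def sft_nll(logits, target_ids):
    """Negative log-likelihood of the gold SFT targets.

    logits: (seq_len, vocab_size) unnormalized model scores
    target_ids: (seq_len,) correct next-token ids from the training set
    Minimizing this loss maximizes the likelihood of the correct output.
    """
    # numerically stable log-softmax over the vocabulary
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # log-probability assigned to each gold token
    gold = log_probs[np.arange(len(target_ids)), target_ids]
    return -gold.mean()
```

Gradient descent on this loss over the labelled input-output pairs is what SFT amounts to in practice.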
8
A mental model: complementary roles
9
Outline
Section 1: Intro to SFT, ICL, Scaling and Scaling Laws
10
In-Context Learning: ICL
ICL refers to the ability of LLMs to learn and generalize from examples provided at inference time.
Non-parametric method
First introduced in the paper "Language Models are Few-Shot Learners"
The authors found that scaling up a language model gives it several emergent abilities, for example unscrambling words, reasoning on the fly, and adapting its behavior based on the prompt.
Fun Fact: This also gave rise to the art of prompt engineering!
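As a concrete illustration, few-shot demonstrations are simply serialized into the prompt; no weights change. A sketch (the `Input:`/`Output:` template is an assumption, not a fixed standard):

```python
def build_icl_prompt(demos, query):
    """Serialize (input, output) demonstrations plus the new query.
    The model 'learns' the task purely from this context window;
    no parameters are updated."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Word unscrambling, one of the few-shot tasks studied with GPT-3
demos = [("sbird", "birds"), ("tca", "cat")]
prompt = build_icl_prompt(demos, "odg")
```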
11
Outline
Section 1: Intro to SFT, ICL, Scaling and Scaling Laws
12
Scaling Laws: The What
Scaling = increasing a model's size, data, and compute to improve its performance
Key dimensions
13
Scaling Laws: The What
14
Outline
Section 1: Introduction (Vangmay)
Section 2: Supervised Fine Tuning (Vangmay, again)
Section 3: Scaling (Presenter)
15
2: Supervised Fine-Tuning
16
Vangmay
SFT from a RAG Perspective
Problem:
17
RAFT: Adapting Language Model to Domain Specific RAG → The Problem
18
Solution: Novel SFT Strategy
19
Technique
20
Technique
21
Evaluations and Observations
22
QuickFire
23
QuickFire
24
Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects
25
Types of Retrieval Defects
Noisy Documents: Content that is relevant to the query topic but doesn't directly answer the query. E.g., for "Which album features the song Time by Pink Floyd?", the retrieval system might return a general overview of the band itself.
Irrelevant Documents: Documents that bear no connection to the query topic, often retrieved due to inaccuracies in the retrieval model's judgement. E.g., the model might retrieve a document about a different band, say Nirvana.
Counterfactual Documents: Documents containing factual errors. Suppose one document online says the correct answer is "The Dark Side of the Moon" and another says "Wish You Were Here"; such inaccuracies can seep into the answer.
26
Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects
27
Task 1: Defects Detection
Aims to train the LLM to identify whether each retrieved document contributes to answering the user’s query.
The LLM must classify each document into one of three defect types (noisy, irrelevant, counterfactual).
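A hypothetical sketch of how one detection training example could be assembled (the prompt wording and field names are our own, not the paper's exact format):

```python
DEFECT_TYPES = ("noisy", "irrelevant", "counterfactual")

def make_detection_example(query, document, defect_label):
    """Build one (input, target) SFT pair for the defect-detection task."""
    assert defect_label in DEFECT_TYPES
    prompt = (
        f"Query: {query}\n"
        f"Document: {document}\n"
        f"Classify this document's defect type ({', '.join(DEFECT_TYPES)}):"
    )
    return {"input": prompt, "target": defect_label}
```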
28
Task 2: Utility Extraction
Train the LLM to extract as much useful information as possible from the defective retrieval result.
Enables the LLM to handle low-quality and contaminated contexts without needing extra pre-processing.
29
Results: Robust Fine-Tuning (RbFT)
In the Clean setting, RbFT is the only method that surpasses Vanilla RAG.
In the Normal setting, RbFT consistently achieves the best performance across all retrieval defect scenarios and is still the only method that significantly outperforms Vanilla RAG.
In the Hard setting, RbFT continues to outperform all other methods and further widens the gap with the second-best approach.
30
Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
Distinguishes between correct and irrelevant context within a RAG setup
The model is trained by constructing a prompt that contains:
a correct document, an incorrect document, the question, and a reference answer written using the correct document.
31
Bonus: FedRAG
FedRAG is a modular framework for centralized and federated fine-tuning of RAG systems.
It lets us build systems that combine fine-tuning with RAG, whereas current frameworks and libraries only support API calling.
32
Bonus: LoRA: Low-Rank Adaptation of Large Language Models
Fine-tuning an LLM is still compute-heavy.
Instead of updating each and every parameter, LoRA makes fine-tuning efficient at the parameter level by freezing the original model weights and training only low-rank adapter matrices, which reduces memory and compute costs while preserving performance.
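A minimal numpy sketch of the idea (dimensions, rank, and the alpha scaling factor are illustrative; real implementations apply this per weight matrix inside the transformer):

```python
import numpy as np

# Minimal LoRA sketch: the pretrained weight W is frozen;
# only the low-rank factors A and B are trained.
d_in, d_out, r = 64, 64, 4              # rank r << d_in, d_out
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init
                                        # => adapter starts as a no-op

def lora_forward(x, alpha=8.0):
    """Effective weight is W + (alpha / r) * B @ A, never materialized."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# Only r * (d_in + d_out) = 512 adapter parameters are trained here,
# vs d_in * d_out = 4096 for full fine-tuning of this layer.
```

At deployment time the adapter can be merged into W (W + (alpha/r) B A), so inference adds no extra latency.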
33
Bonus: Quantization
SFT requires a lot of computational resources, much of which comes from running inference with the model.
Solution: model weights are usually stored in float32 format, so what if we store them in a smaller representation like int8?
This lowers the cost of the mathematical operations we have to perform on them and saves inference time.
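A sketch of the simplest variant, symmetric per-tensor int8 quantization (function names are illustrative; production schemes add per-channel scales, zero points, etc.):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> int8 plus one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # close to w, at 1/4 the storage
```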
34
References
[WJB2021] Finetuned language models are zero-shot learners.
[TYS2025] Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects.
[LZL2025] FineTune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation.
[FVE2025] FEDRAG: A framework for Fine-Tuning Retrieval-Augmented Generation Systems.
[ZTP2024] RAFT: Adapting Language Model to Domain Specific RAG.
[EYP2021] LoRA: Low-Rank Adaptation of Large Language Models.
[MMR2021] A White paper on quantization.
35
3: In-Context Learning
36
Hongxu Liu
Zihang Fu
In this section
37
What is In-Context Learning (ICL)
38
In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration.
[BMR20] Language Models are Few-Shot Learners. NeurIPS.
[DLD24] A Survey on In-context Learning. EMNLP.
Pros:
- Zero/few-shot adaptability
- Fast iteration & task switching
- Avoids catastrophic forgetting
- Behavior is prompt-controllable
Cons:
- Limited by context length
- Inference cost grows with shots
- Sensitive to example selection/format/order
- Higher variance & OOD brittleness
How #1: Demonstration Selection
39
How #2: Format & Template
40
[SCT24] Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting. ICLR.
[HRK24] Does Prompt Formatting Have Any Impact on LLM Performance? arXiv.
How #3: Ordering
41
[GWW24] What makes a good order of examples in in-context learning. ACL.
[BVJ25] OptiSeq: Ordering Examples On-The-Fly for In-Context Learning. arXiv.
Why #1: Bayesian / Kernel Regression View
42
[XRL22] An explanation of in-context learning as implicit bayesian inference. ICLR.
[HWZ23] Explaining Emergent In-Context Learning as Kernel Regression. arXiv.
Why #2: Gradient-descent-as-inference view
43
[DSD23] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers. ACL.
[CHJ24] Exact conversion of in-context learning to model weights in linearized-attention transformers. ICML.
[SMK24] Position: Do pretrained Transformers Learn In-Context by Gradient Descent? ICML.
[DMN24] In-context Learning and Gradient Descent Revisited. NAACL.
Takeaways & Open Questions
44
ICL (LLM) Mechanistic Explainability
45
Hongxu Liu
Anthropic’s Mechanistic Interpretability Research
46
[ENO21] A Mathematical Framework for Transformer Circuits
[OEN22] In Context Learning and Induction Heads
Linear Algebra Recap
47
Tensor Notation
48
LLM Preliminaries
49
Tensor Notation + LLM Attention
50
Take a broader (end2end) view
51
Applying associativity + distributivity
Expanded QK/OV circuits
52
Analysis of one-layer attn-only transformer’s QK/OV circuits
53
Discussion: Manually inspecting large (really large) matrices is tiresome. Are there any ways to automate it?
Hint: Focus on expanded OV circuit
Automatic Attribution of Expanded QK/OV Matrices
54
Two-Layer Attention-Only Transformer
55
What’s Wrong with Layer 2 Attention Scores
56
We still need tensor notations, but …
57
“Trinary” Tensor Notations
58
Layer-2 Attention Score Explained
59
Automatic Discovery for Compositions
60
Induction Heads
61
What about larger transformers?
62
[OEN22] In Context Learning and Induction Heads
Empirical Observations
63
What this Leaves to us
64
4: Model Scaling
65
Takanori Aoki
Scaling Law for Neural Language Model training
[KMH20] Scaling Laws for Neural Language Models, arXiv 2001.08361, cs.LG
Empirical rule between Compute (C) / Dataset size (D) / No. of parameters (N) and Test loss (L)
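The power-law forms and approximate exponents reported in [KMH20] (N_c, D_c, C_c are fitted constants; each law holds when the other two factors are not bottlenecks):

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},\quad \alpha_N \approx 0.076
\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},\quad \alpha_D \approx 0.095
\qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}},\quad \alpha_C^{\min} \approx 0.050
```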
66
LLM training
Interesting observations 1
Performance depends strongly on N, weakly on model shape
67
LLM training
Interesting observations 2
Convergence is inefficient
Small models need to be trained to full convergence, while large models can achieve high performance even when their training is stopped early.
Thus, given a fixed computation budget, the conventional "training until convergence" approach is inefficient, and an early-stopping strategy (stopping before full convergence) with a larger model is more rational.
68
LLM training
Scaling law holds across domains and model architectures
[HKK20] Scaling Laws for Autoregressive Generative Modeling, arXiv 2010.14701, cs.LG
69
LLM training
Scaling laws help make important decisions
70
LLM training
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
71
LLM inference
[SLX24] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, arXiv 2408.03314, cs.LG
Why are we interested in Test-Time Compute
Optimizing the amount of computation used during inference (test time) may be more efficient than increasing the number of model parameters to achieve a given level of performance, under a fixed computation budget
72
LLM inference
Where should we invest Test-time compute?
A proposer that generates candidate outputs
A verifier that evaluates the candidates generated by the LLM and shortlists the better ones
73
LLM inference
The figure is from [BJE24] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, arXiv 2407.21787, cs.LG
Test-time compute-optimal scaling strategy
Target(𝜃, 𝑁, 𝑞) is the distribution over natural-language output tokens induced by the LLM for a given prompt 𝑞, using test-time compute hyper-parameters 𝜃 and a compute budget of 𝑁. We would like to select the hyper-parameters 𝜃 which maximize the accuracy of the target distribution for a given problem. We express this formally as:
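In LaTeX, following the compute-optimal formulation in [SLX24] (where 𝑦*(𝑞) denotes the ground-truth correct answer for 𝑞):

```latex
\theta^{*}_{q, y^{*}(q)}(N)
  = \arg\max_{\theta}\;
    \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, q)}
    \bigl[\, \mathbb{1}\{\, y = y^{*}(q) \,\} \,\bigr]
```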
74
LLM inference
Scaling Test-Time Compute via Verifiers: Search Methods Against a PRM
Best-of-N: simple, but competitive when a sufficient budget is available.
Beam search: effective for small budgets and easy problems; the advantage shrinks as the budget increases.
Lookahead search: performance was unstable and in many cases inferior, since it requires additional computation compared to plain beam search.
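A sketch of Best-of-N with stand-in proposer and verifier functions (the real `propose` would sample a full solution from an LLM, and `score` would be a learned reward model / PRM):

```python
import random

def best_of_n(propose, score, n):
    """Sample n candidates from the proposer and return the one the
    verifier scores highest."""
    candidates = [propose() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins so the sketch runs: propose random integers, and let the
# "verifier" prefer values near 42.
random.seed(0)
propose = lambda: random.randint(0, 100)
score = lambda y: -abs(y - 42)
answer = best_of_n(propose, score, n=16)
```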
75
LLM inference
Verifier
Scaling Test-Time Compute via Verifiers: Answer aggregation
Train a reward model to evaluate LLM outputs
Two-stage aggregation to determine the best LLM output among the candidates
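The answer-level stage is often implemented as "best-of-N weighted" selection: rather than trusting the single highest-scoring sample, candidates that reach the same final answer pool their verifier scores. A sketch (this is a simplification of the aggregation in [SLX24]):

```python
from collections import defaultdict

def best_of_n_weighted(candidates):
    """candidates: list of (final_answer, verifier_score) pairs.
    Sum verifier scores across candidates that reach the same final
    answer, then return the answer with the highest total."""
    totals = defaultdict(float)
    for answer, verifier_score in candidates:
        totals[answer] += verifier_score
    return max(totals, key=totals.get)
```

This rewards answers that are both frequently reached and individually well scored.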
76
LLM inference
Verifier
The Figure is from Xi et al. (2024) – Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning, arXiv:2402.05808
Refining the Proposal Distribution
77
LLM inference
Proposer
→ Combine both approaches depending on the difficulty of the problem to be solved!
LLM training vs inference
- Test-time and pretraining compute are not 1:1 "interchangeable."
- For easy / moderate problems with small inference requirements → increasing test-time compute can efficiently improve performance.
- For difficult problems with large inference requirements → investing train-time compute is effective.
78
LLM training vs inference
5: More on SFT, ICL and Scaling
79
Benjamin
Goh
Section 5 components:
80
SFT with Scale: SFT memorizes, RL generalizes
81
SFT Memorizes, RL generalizes
82
SFT Memorizes, RL Generalizes: Intro
83
SFT Memorizes, RL Generalizes: Task Descriptions
- GeneralPoints: Arithmetic reasoning capabilities
- Rule variants:
84
SFT Memorizes, RL Generalizes: Task Descriptions
85
SFT Memorizes, RL Generalizes: Task Descriptions
86
SFT Memorizes, RL Generalizes: Results
87
[CSZL25] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. arXiv.
Follow-up (still pre-print): ‘SFT forgets, RL recovers’
88
SFT forgets, RL recovers: Spectral Dynamics
89
To close it off:
90
ICL with Scale:
91
92
[ASZBR24] Many-Shot In-Context Learning arXiv.
93
Many-Shot ICL: Non-human supervised approaches?
94
Many-Shot ICL: Non-human supervised approaches?
95
Many-Shot ICL: Non-human supervised approaches?
96
97
98
99
2. Inference Scaling for Long-Context RAG: Main findings
100
[YYBHZ24] Inference Scaling for Long-Context Retrieval Augmented Generation. arXiv
2. Inference Scaling for Long-Context RAG: Demonstration-Based RAG
101
2. Inference Scaling for Long-Context RAG: Iter-DRAG
102
2. Inference Scaling for Long-Context RAG: Performance
103
2. Inference Scaling for Long-Context RAG: Performance
104
2. Inference Scaling for Long-Context RAG: Performance
How to optimally allocate inference compute based on the problem?
105
2. Inference Scaling for Long-Context RAG: Performance
106
2. Inference Scaling for Long-Context RAG: Performance
107
Conclusions
108