1 of 108

W05: Training: SFT, ICL and Model Scaling

CS6101/DYC1401 Retrieval Augmented Generation

Week 05: 11 Sep 2025 AY 25/26 Sem 1 (T2510)

Vangmay

Zihang Fu

Hongxu

Benjamin Goh

Takanori Aoki

2 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Scaling and Scaling Laws

Section 2: Supervised fine-tuning (SFT)

  • RAFT: Adapting Language Model to Domain Specific RAG
  • Robust Fine tuning for RAG against Retrieval Defects
  • Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
  • LoRA, FedRAG

Section 3: In-context learning (ICL)

Section 4: Model Scaling

  • LLM training
  • LLM inference

Section 5: Results from Scaling SFT and ICL

2

3 of 108

1: Introduction

3

Vangmay

4 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Model Scaling

4

5 of 108

The Why and the What

LLMs go through three main stages of development: pre-training, instruction tuning, and post-training

5

6 of 108

The Why and the What

Supervised fine-tuning (SFT) is the process of training a pretrained model further on a labelled dataset of input-output pairs

6

7 of 108

The How

  • X: Long articles | Y: Summaries
  • X: Sentences | Y: A list of nouns and entities
  • X: Questions | Y: A human-like response to the questions

7

8 of 108

The How: Objective function

The objective function is typically the maximum likelihood estimate: the model’s weights/parameters are updated so that they maximize the probability of generating the correct output according to the training set.
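As a toy illustration (not from the slides), maximizing likelihood over a labelled dataset is equivalent to minimizing the negative log-likelihood of the reference tokens; a minimal sketch in plain Python:

```python
import math

def sft_nll(target_token_probs):
    """Negative log-likelihood of the reference output.

    `target_token_probs` holds, for each target token, the probability the
    model assigned to that (correct) token. Maximizing likelihood is the
    same as minimizing this sum of -log p terms.
    """
    return -sum(math.log(p) for p in target_token_probs)

# A confident model (high p on each gold token) gets a lower loss.
confident = sft_nll([0.9, 0.8, 0.95])
uncertain = sft_nll([0.4, 0.3, 0.5])
assert confident < uncertain
```

In practice this is the cross-entropy loss summed over the target tokens of each (input, output) pair.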

8

9 of 108

A mental model: complementary roles

  • SFT is about learning new behaviors (Parametric).
  • SFT teaches an LLM how to answer a medical question with the same SKILL and LANGUAGE as a doctor.

  • RAG is about providing knowledge (Non-Parametric).
  • RAG ensures the LLM has access to the latest medical research to base its answers on.

9

10 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Model Scaling

10

11 of 108

In-Context Learning: ICL

ICL refers to the ability of LLMs to learn and generalize from examples provided at inference time.

Non-parametric method

First introduced in the paper “Language Models are Few-Shot Learners”

They found that scaling up an LM gives the model a few emergent abilities, for example unscrambling words, reasoning on the fly, and adapting based on the prompt.

Fun Fact: This also gave rise to the art of prompt engineering!

11

12 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Model Scaling

12

13 of 108

Scaling Laws: The What

Scaling = increasing a model’s size, data, and compute to improve its performance

Key dimensions

  • Model size
  • Dataset size
  • Compute

13

14 of 108

Scaling Laws: The What

  • Scaling laws describe how loss decreases predictably as model, compute and data increase
  • L(N)≈AN^(−α)+B
  • L(N) = Loss
  • N: size (parameters, tokens, or compute)
  • α: scaling exponent
  • They predict returns from larger models
  • Help identify diminishing returns when scaling further isn’t optimal
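A minimal sketch of the power-law form above (with made-up constants A, α, B, not fitted to any real model) shows why returns diminish:

```python
def loss(N, A=2.0, alpha=0.5, B=0.1):
    """L(N) ≈ A * N**(-alpha) + B: loss falls as a power law in scale N,
    flattening toward the irreducible term B. (Toy constants, not fitted.)"""
    return A * N ** (-alpha) + B

# Diminishing returns: each 10x increase in N buys a smaller absolute gain.
gains = [loss(n) - loss(10 * n) for n in (1e6, 1e7, 1e8)]
assert gains[0] > gains[1] > gains[2]
```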

14

15 of 108

Outline

Section 1: Introduction (Vangmay)

Section 2: Supervised Fine Tuning (Vangmay, again)

Section 3: Scaling (Presenter)

15

16 of 108

2: Supervised Fine-Tuning

16

Vangmay

17 of 108

SFT from a RAG Perspective

Problem:

  • RAG provides external knowledge, but models may hallucinate or ignore retrieval
  • RAG also has no way to determine whether a retrieved document is factual
  • SFT alone will not guarantee grounding in evidence retrieved from external sources
  • Solution: we need a synergy between RAG (for knowledge) and SFT (for behavior alignment)

17

18 of 108

RAFT: Adapting Language Model to Domain Specific RAG → The Problem

  • Retrieved documents might be imperfect; the LLM can get distracted by irrelevant chunks
  • Pretrained LLMs lack domain specificity
  • SFT alone doesn’t teach the model to use retrieval properly
  • RAG might have an imperfect retriever
  • LLMs aren’t trained to decide when to use retrieved data and when not to

18

19 of 108

Solution: Novel SFT Strategy

  • RAFT uses a novel structure for the training data: it fine-tunes on question–answer pairs while referencing documents in a simulated imperfect-retrieval setting.
  • Every data entry contains a
    • Query (Q)
    • Chain-of-thought-style answer (A*)
    • Set of documents (D_k): the documents returned by the retriever

19

20 of 108

Technique

  • Every data entry contains a
    • Query (Q)
    • Chain-of-thought-style answer (A*)
    • Set of documents (D_k): the documents returned by the retriever
  • Let D* be the golden document: the document from which the answer can be retrieved
  • The rest are called distractors

20

21 of 108

Technique

  • Let D* be the golden document: the document from which the answer can be retrieved
  • The rest are called distractors
  • For a fraction P of the dataset:
    • Retain the golden document
  • For the remaining fraction (1 − P):
    • Remove the golden document
  • Then conduct fine-tuning
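A minimal sketch of this data construction, with hypothetical field names (`query`, `docs`, `answer`) standing in for RAFT's actual format:

```python
import random

def build_raft_entry(query, cot_answer, golden_doc, distractors, p, rng):
    """One RAFT-style training entry: with probability p keep the golden
    document among the context docs; otherwise use distractors only, so the
    model also learns to cope without the oracle document."""
    docs = list(distractors)
    if rng.random() < p:
        docs.append(golden_doc)
    rng.shuffle(docs)
    return {"query": query, "docs": docs, "answer": cot_answer}

rng = random.Random(0)
entries = [
    build_raft_entry("Q", "A*", "golden", ["d1", "d2"], p=0.8, rng=rng)
    for _ in range(1000)
]
with_golden = sum("golden" in e["docs"] for e in entries)
assert 700 < with_golden < 900  # roughly P = 0.8 of entries keep the oracle doc
```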

21

22 of 108

Evaluations and Observations

  • RAFT improved RAG performance across all specialized domains
  • Including a reasoning chain instead of a simple answer guides the model and enriches its understanding
  • Data matters when doing SFT

22

23 of 108

QuickFire

  • What if they didn’t remove the golden document from the entries?

23

24 of 108

QuickFire

  • What if I increased P? Can that cause issues?

24

25 of 108

Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects

  • The effectiveness of RAG systems is bottlenecked by the reliability of their components.
  • The performance of any RAG system heavily depends on the quality of the retrieved documents and the model’s ability to effectively utilize them.
  • Approach: fine-tune the model to strengthen these abilities.

25

26 of 108

Types of Retrieval Defects

Noisy Documents: content that is relevant to the query topic but doesn’t directly answer the query. E.g., for “Which album features the song Time by Pink Floyd?” the retrieval system might return a general overview of the band itself.

Irrelevant Documents: documents that bear no connection to the query topic, often retrieved due to inaccuracies in the retrieval model’s judgement. E.g., the system might retrieve a document about another band, say Nirvana.

Counterfactual Documents: suppose one document online says the correct answer is “The Dark Side of the Moon” and another says “Wish You Were Here”. Such inaccuracies can seep into the answer.

26

27 of 108

Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects

27

28 of 108

Task 1: Defects Detection

Aims to train the LLM to identify whether each retrieved document contributes to answering the user’s query.

LLM must classify the document into 3 types (noisy, irrelevant, counterfactual).

28

29 of 108

Task 2: Utility Extraction

Train the LLM to extract as much useful information as possible from the defective retrieval result.

Enables the LLM to handle low quality and contaminated contexts without needing extra pre processing.

29

30 of 108

Results: Robust Fine-Tuning (RbFT)

In the Clean setting, RbFT is the only method that surpasses Vanilla RAG.

In the Normal setting, RbFT consistently achieves the best performance across all retrieval defect scenarios and is still the only method that significantly outperforms Vanilla RAG.

In the Hard setting, RbFT continues to outperform all other methods and further widens the gap with the second-best approach

30

31 of 108

Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation

Trains the model to distinguish between correct and irrelevant context within a RAG setup

The model is trained by constructing a prompt that contains:

a correct document, an incorrect document, the question, and a reference answer written using the correct document.

31

32 of 108

Bonus: FedRAG

FedRAG is a modular framework for centralized and federated fine-tuning of RAG systems.

It lets us build systems that combine fine-tuning with RAG, whereas current frameworks and libraries rely only on API calling.

32

33 of 108

Bonus: LoRA: Low-Rank Adaptation of Large Language Models

Fine-tuning an LLM is still compute-heavy.

Instead of updating each and every parameter, LoRA makes fine-tuning efficient at the parameter level by freezing the original model weights and training only low-rank adapter matrices, reducing memory and compute costs while preserving performance.
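A numpy sketch of the low-rank adapter idea (toy dimensions, not the original LoRA code): the frozen weight W is augmented with a rank-r update (α/r)·B·A, and B starts at zero so the adapted model initially matches the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                      # model dim, adapter rank (r << d)

W = rng.normal(size=(d, d))       # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))              # B starts at zero, so the adapter is a no-op
alpha = 4.0

def lora_forward(x):
    # Only A and B would receive gradients; W stays frozen.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
assert np.allclose(lora_forward(x), x @ W.T)   # at init, output is unchanged

# Trainable parameters: 2*d*r for the adapter vs d*d for full fine-tuning.
assert A.size + B.size < W.size
```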

33

34 of 108

Bonus: Quantization

SFT requires a lot of computational resources, much of which comes from running inference with the model.

Solution: model weights are usually stored in float32 format, so what if we store them in a smaller representation like int8?

This lowers the cost of the mathematical operations we have to perform on them and saves inference time.
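A sketch of the idea using simple symmetric per-tensor quantization (one of several possible schemes; real systems often use per-channel scales and other refinements):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization: map float32 weights onto int8 with a
    single per-tensor scale. Dequantized values approximate the originals."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

assert q.dtype == np.int8                             # 1 byte/weight, not 4
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6  # bounded rounding error
```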

34

35 of 108

References

35

36 of 108

3: In-Context Learning

36

Hongxu Liu

Zihang Fu

37 of 108

In this section

  • What is In-Context Learning (ICL)
  • How Demonstrations Shape ICL
    • Selection
    • Format
    • Ordering
  • Why ICL Works
    • Bayesian / Kernel Regression View
    • Gradient-descent-as-inference view
  • Takeaways & Open Questions
  • Mechanism Interpretability of ICL (Hongxu Liu)

37

38 of 108

What is In-Context Learning (ICL)

38

  • Definition:

In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration.

  • ICL vs. SFT:

Pros:

- Zero/few-shot adaptability

- Fast iteration & task switching

- Avoids catastrophic forgetting

- Behavior is prompt-controllable

Cons:

- Limited by context length

- Inference cost grows with shots

- Sensitive to example selection/format/order

- Higher variance & OOD brittleness

39 of 108

How #1: Demonstration Selection

39

  • Conjecture: when the test input is made harder (partial word shuffle), ICL performance drops sharply; performance positively correlates with the model’s understanding of the test input.
  • Methods:
    • Random: randomly select demos.
    • BM25: select demos with the highest word overlap with the test input.
    • TopK: select demos closest to the test input in an embedding space.
    • ConE (Conditional Entropy reranking): effective demos are those that improve the model’s understanding of the test input.
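A sketch of the TopK method, assuming demo and test-input embeddings have already been computed by some encoder:

```python
import numpy as np

def topk_demos(test_emb, demo_embs, k):
    """TopK selection: rank demonstrations by cosine similarity of their
    embeddings to the test input and keep the k nearest."""
    demo_embs = np.asarray(demo_embs, dtype=float)
    t = test_emb / np.linalg.norm(test_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    sims = d @ t
    return np.argsort(-sims)[:k]

test = np.array([1.0, 0.0])
demos = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]]
picked = topk_demos(test, demos, k=2)
assert set(picked) == {0, 2}   # the two demos most aligned with the test input
```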

40 of 108

How #2: Format & Template

40

  • Methods & Scope:
  • Define format space (from atomic token changes to full template styles).
  • Include classification, reasoning, code, and translation tasks.
  • Use Bayesian bandit search or statistical tests to find best/worst formats.
  • Key insights:
  • Prompt format matters: switching between different formats can shift accuracy dramatically; there’s no universally best format. Larger models (e.g., GPT-4) are more robust but still sensitive.
  • Sensitivity persists even with more shots, larger models, or instruction tuning; model preferences don’t transfer reliably across models.

41 of 108

How #3: Ordering

41

  • Why it matters: with the same demos, permuting their order can swing accuracy widely; the best order varies by instance and by model, so a fixed order rarely generalizes.
  • DEmO (Dataset-free Example Ordering):
    • Stage 1: score many permutations with content-free entropy (replace the test input with a meaningless token such as ‘[MASK]’, ‘[N/A]’, or the empty string) and keep orders with a balanced label distribution.
    • Stage 2: for the test input x_t, pick the candidate order that maximizes influence.
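A heavily simplified sketch of the Stage-1 idea: score each ordering by the entropy of the model's prediction on a content-free query, preferring balanced (unbiased) orders. `toy_predict` is a stand-in for the actual LM call:

```python
import math
from itertools import permutations

def label_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def score_orders(demos, predict):
    """DEmO-style Stage 1 (sketch): feed each demo ordering followed by a
    content-free query ('[N/A]') to the model and rank orders by how
    balanced (high-entropy) the predicted label distribution is, i.e. how
    little the ordering itself biases the prediction."""
    scored = []
    for order in permutations(demos):
        prompt = " ".join(order) + " [N/A]"
        scored.append((label_entropy(predict(prompt)), order))
    scored.sort(reverse=True)
    return [order for _, order in scored]

# Toy 'model': biased toward whichever demo comes last in the prompt.
def toy_predict(prompt):
    return (0.9, 0.1) if prompt.split()[-2] == "neg" else (0.5, 0.5)

best = score_orders(["pos", "neg"], toy_predict)[0]
assert best == ("neg", "pos")   # the unbiased (balanced) ordering ranks first
```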

42 of 108

Why #1: Bayesian / Kernel Regression View

42

  • Bayesian (posterior predictive) view:
    • ICL can be seen as implicit Bayesian inference: the model infers a shared latent concept θ across the demos and predicts using the posterior predictive distribution.

  • Kernel regression view (asymptotic connection):
    • As the number of in-context demos grows, Bayesian inference ≈ kernel regression, with a similarity kernel K induced by model representations/attention.
  • Limitations:
    • Computational mechanism gap: pure Bayesian explanations don’t specify how practical Transformers implement the inference.
    • Prompt order sensitivity: the theory assumes exchangeability, but real LMs are highly order-sensitive.
    • Reasoning beyond pattern matching: multi-step reasoning (CoT, planning) is not captured by these views.

43 of 108

Why #2: Gradient-descent-as-inference view

43

  • Core idea: ICL can behave like implicit gradient descent (GD) carried out inside the forward pass of a pre-trained Transformer; attention applies data-dependent updates similar to one-step fine-tuning.
  • Attention GD duality: Transformer attention can be written as a linear-attention form whose contribution from demo tokens equals a parameter update ΔW.
  • Exact weight conversion: In linearized-attention Transformers, the entire ICL effect of a prompt can be converted into explicit weight deltas, with an approximate method for softmax attention.
  • Limitations:
  • Real LLMs vs synthetic setups: Many proofs assume ICL-trained objectives or hand-crafted weights; with naturally pretrained models, ICL and GD often diverge.
  • Metric pitfalls: Re-evaluation finds common similarity metrics can overestimate ICL≈GD; even untrained Transformers can score similarly under some metrics.

44 of 108

Takeaways & Open Questions

  • The output score or probability distribution of LLMs plays an important role in demo selection and ordering.
  • There’s no universally best format in ICL. Can we predict the best format for a new instance without search?
  • For k demonstrations, the search space of orderings has size k!. How to find good orders efficiently is also a challenging question.
  • Although some analytical studies have taken a preliminary step to explain ICL, most of them are partial and limited to simple tasks or models. More relevant research is needed in the future.

44

45 of 108

ICL (LLM) Mechanistic Explainability

  • What we take for granted:
    • LLMs take tokens as inputs and produce next-token predictions
    • Providing examples in context improves model performance (i.e., reduces per-token loss)
  • What we don’t know:
    • The detailed steps by which those predictions are produced
    • How LLMs learn from examples: copying? Fuzzy (tolerant) matching?
  • Traditional explainability:
    • Feature attribution: gradients, probing, game-theory-based methods (SHAP)
  • Mechanistic explainability:
    • Revealing the execution process of NN models

45

Hongxu Liu

46 of 108

Anthropic’s Mechanism Interpretability Research

  • Two fundamental papers [ENO21][OEN22]:
  • Key takeaways:
    • The attention operation is independent, additive, and composable, acting as a read/write operation on a portion of the contextualized vectors (the residual stream)
    • One-layer transformers achieve ICL by attending to a previous “likely next token” ([B](KV) … [A](Q) [B](O))
    • Two-layer transformers achieve ICL by attending to the token following the previous occurrence of the current token ([A] [B](KV) … [A](Q) [B](N))
    • Large transformers still dedicate much of their ICL ability to a generalized form of the two-layer mechanism
  • Elaboration logic:
    • Different from the (somewhat obscure) original papers
    • Use sketches to intuitively explain complex matrix operations (colors for correspondence)

46

[ENO21] A Mathematical Framework for Transformer Circuits

[OEN22] In Context Learning and Induction Heads

47 of 108

Linear Algebra Recap

  • Matrices characterize mappings from vectors to vectors
  • A matrix multiplication (A·B) applies that vector-to-vector mapping to every column of B concurrently
  • A complete matrix-to-matrix mapping requires two matrices (one acting on each side)

47

48 of 108

Tensor Notation

  • Definition:
    • A complete matrix-to-matrix mapping
  • If we treat the residual stream as a matrix of column vectors:
    • one factor operates “within” columns
    • the other operates “across” columns
  • Axioms:
    • Multiplication (“associativity”)
    • Addition (“distributivity”)
    • “Identity”
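These properties can be checked numerically. Assuming the residual stream is stored with positions as rows (row-major flattening), the tensor product A ⊗ W acts as “A across positions, W within each position's vector”:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, d = 3, 4                       # positions (rows) x model dim (cols)
X = rng.normal(size=(n_ctx, d))       # residual stream
A = rng.normal(size=(n_ctx, n_ctx))   # acts ACROSS positions (e.g. attention)
W = rng.normal(size=(d, d))           # acts WITHIN each position's vector

# The tensor/Kronecker product bundles both into one matrix-to-matrix map:
lhs = np.kron(A, W) @ X.reshape(-1)           # (A ⊗ W) applied to vec(X)
rhs = (A @ X @ W.T).reshape(-1)
assert np.allclose(lhs, rhs)

# Identity on the across-position factor recovers a purely "within" operation.
assert np.allclose(np.kron(np.eye(n_ctx), W) @ X.reshape(-1),
                   (X @ W.T).reshape(-1))
```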

48

49 of 108

LLM Preliminaries

  • MHSA (for a single head)
    • (Similar to LoRA) the projection matrices can be multiplied together, forming low-rank matrices (e.g., W_QK = W_Q^T W_K and W_OV = W_O W_V)
    • Attention is independent and additive: each head’s output is summed into the residual stream

49

50 of 108

Tensor Notation + LLM Attention

  • If we apply tensor notation to the attention mechanism
    • Another low-rank matrix!
  • Low-rank matrices -> circuits
    • QK circuit (W_QK): attention scores
    • OV circuit (W_OV): residual-stream read/write

50

51 of 108

Take a broader (end2end) view

  • Assume a one-layer attention-only transformer
    • t is the one-hot input vector; W_E and W_U are the token embedding and unembedding matrices respectively

51

Applying associativity + distributivity

52 of 108

Expanded QK/OV circuits

  • Expanded tensor notation
    • Mapping: from one-hot tokens to logits
    • Expanded OV circuit (W_U W_OV W_E)
      • Mapping (linear)
      • “What an (attended) individual input token contributes to the logits”
    • Expanded QK circuit (W_E^T W_QK W_E)
      • Mapping (bilinear)
      • “What two (attended) tokens contribute to an attention score entry”
    • P.S. The above two mappings are broadcastable
    • Thrilling fact: the model is fully understandable! (like getting function names and descriptions in a decompiled binary program)
    • Question: given that the mapped vectors are one-hot, are there any direct ways to attribute these two matrices?

52

53 of 108

Analysis of one-layer attn-only transformer’s QK/OV circuits

  • Most of the heads are doing copying
    • QK circuit: tokens attend to a plausible “next token”
    • OV circuit: increases the logits of the attended “next token” (and similar tokens)
  • A simple yet effective ICL paradigm!
  • Examples

53

Discussion: Manual inspection on large (really large) matrices is tiresome. Are there any ways to automate it?

Hint: Focus on expanded OV circuit

54 of 108

Automatic Attribution of Expanded QK/OV Matrices

  • Using eigenvectors
    • If an eigenvalue 𝜆 is positive, the corresponding (linear combination of) one-hot tokens increases its own probability after being processed by the expanded OV circuit!
    • Eigenvalue visualization:

54

55 of 108

Two-Layer Attention-Only Transformer

  • Still apply end-to-end tensor notation and distributivity/associativity

  • Characteristics
    • The attention operation gets repeated, resulting in exponentially more complicated behavior!
    • Layer-2 attention scores are slightly different

55

56 of 108

What’s Wrong with Layer 2 Attention Scores

  • Recap: layer-1 expanded QK circuits
    • A bilinear mapping, broadcastable
    • Directly operate on one-hot token vectors
  • Layer-2 attention:
    • May operate on the residual stream instead
    • That is, on the residual stream after layer 1
    • Due to the presence of residual links:
      • the residual stream may consist of direct token embeddings and/or layer-1 attention results
      • it appears on both sides (key/query), so there are four possible conditions
    • Too complex; any tools to solve it?

56

57 of 108

We still need tensor notations, but …

  • Target: tensor notations starting from one-hot token vectors
    • Recap (ignoring broadcasting):
    • A vector-to-vector mapping (a matrix)
      • => a matrix-to-matrix mapping (tensor product)
    • A two-vector-to-scalar mapping (the layer-1 expanded QK circuit)
      • => ?

57

58 of 108

“Trinary” Tensor Notations

  • Definition:
    • Original paper: “One complicating factor is that we have to write it as a 6-dimensional tensor, using two tensor products on matrices. This is because we’re trying to express a multilinear function of the form [n_context, d_model]×[n_context, d_model] → [n_context, n_context]. In the one-layer case, we could sidestep this by implicitly doing an outer product, but that no longer works. A natural way to express this is as a (4,2)-tensor (one with 4 input dimensions and 2 output dimensions). Each term will be of the form A_q⊗A_k⊗W where x(A_q⊗A_k⊗W)y = A_q^T x W y^T A_k, meaning that A_q describes the movement of query-side information between tokens, A_k describes the movement of key-side information between tokens, and W describes how they product together to form an attention score.”
    • My deduction:
      • Two mappings multiplied
      • “Column-wise” multipliers collapsed (multiplied)
      • “Column-perturbation” multipliers kept aside
      • The axioms still hold true!

58

59 of 108

Layer-2 Attention Score Explained

59

60 of 108

Automatic Discovery for Compositions

  • Q/K/V composition
    • An attention head reads from another head’s outputs
    • Mechanistic explanation:
      • Q/K composition: a head’s W_QK gets directly multiplied with another head’s W_OV
      • V composition: a head’s W_OV gets directly multiplied with another head’s W_OV
    • Automatic discovery: matrix norms (like the Frobenius norm)
60

61 of 108

Induction Heads

  • In a 2-layer transformer, most attention heads are not involved in substantial composition
    • Recap: ResNet
    • The majority of composition is V-composition, which is also “copying” for ICL. Comparison:
      • One layer: [B] … [A] => [B] (generic)
      • Two layers: [A] [B] … [A] => [B] (specific occurrence; few-shot learning on OOD data)
    • Mechanism: the V side reads from a “previous token” layer-1 head
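The [A] [B] … [A] => [B] copying rule is simple enough to state as code; a toy sketch of what an induction head computes (not how the transformer implements it):

```python
def induction_predict(tokens):
    """Induction-head rule of thumb: find the most recent earlier occurrence
    of the current token [A] and predict the token [B] that followed it
    ([A] [B] ... [A] => [B]). Returns None when there is no prior match."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

seq = ["the", "cat", "sat", "on", "the"]
assert induction_predict(seq) == "cat"        # copies the continuation of "the"
assert induction_predict(["a", "b", "c"]) is None
```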

61

62 of 108

What about larger transformers?

  • Conclusions [OEN22]:
    • Induction heads similar to the two-layer ones still account for a great portion of a model’s in-context learning abilities
    • The form might be slightly different
      • Fuzzy matching: [A*] [B*] … [A] => [B], e.g., across languages or synonyms
      • “Strided” induction heads: [A] … [B] … [A] … => [B]
    • Only empirical support:
      • In-context learning score
      • Prefix matching score: [A] [B] … [A] => [B], attention score from [B] to [A]
      • Doing knockouts, etc.

62

[OEN22] In Context Learning and Induction Heads

63 of 108

Empirical Observations

63

64 of 108

What this Leaves to us

  • Sometimes complex large models might still be quite simple
    • The “ResNet” principle
    • The predominance of induction heads shows that LLMs, and their ICL abilities, are essentially advanced “copying machines”
  • Mechanistic interpretability for large models is still empirical
    • Observing patterns (like prefix-matching / ICL scores), doing knockouts
    • “Degeneration” into “generalized feature attribution methods”

64

65 of 108

4: Model Scaling

65

Takanori Aoki

66 of 108

Scaling Law for Neural Language Model training

[KMH20] Scaling Laws for Neural Language Models, arXiv 2001.08361, cs.LG

Empirical rules relating Compute (C), Dataset size (D), and No. of parameters (N) to Test loss (L)

  • Each of C, D, and N shows a linear relationship with test loss on a log–log plot (i.e., a power-law relationship)
  • L has a power-law relationship with each of the three scale factors C, D, and N when not bottlenecked by the other two
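Because a power law is a straight line in log-log space, its parameters can be recovered by linear regression; a sketch on synthetic data (toy constants, irreducible-loss term omitted so the fit is exact):

```python
import numpy as np

# Synthetic losses following L(N) = A * N**(-alpha).
A_true, alpha_true = 5.0, 0.3
N = np.array([1e6, 1e7, 1e8, 1e9])
L = A_true * N ** (-alpha_true)

# A power law is a straight line in log-log: log L = log A - alpha * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha_hat, A_hat = -slope, np.exp(intercept)

assert abs(alpha_hat - alpha_true) < 1e-6
assert abs(A_hat - A_true) < 1e-6
```

The same fit on a handful of small training runs is what makes the extrapolation step of scaling-law analysis possible.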

66

LLM training

67 of 108

Interesting observations 1

Performance depends strongly on N, weakly on model shape

[KMH20] Scaling Laws for Neural Language Models, arXiv 2001.08361, cs.LG

67

LLM training

68 of 108

Interesting observations 2

Convergence is inefficient

Small models need to be trained to full convergence, while large models can achieve high performance even when their training is stopped early.

Thus, given a fixed computation budget, the conventional “train until convergence” approach is inefficient, and an early-stopping strategy (stopping before full convergence) with a larger model is more rational.

[KMH20] Scaling Laws for Neural Language Models, arXiv 2001.08361, cs.LG

68

LLM training

69 of 108

Scaling law holds across domains and model architectures

69

LLM training

70 of 108

Scaling laws help make important decisions

  • Estimate return on investment
    • LLM training is super costly!
  • Make training design decisions
    • Train models longer vs. train larger models
    • Collect more data vs. get more hardware resources (e.g., GPUs, TPUs)
    • Model architecture choice
  • The scaling-law-based design procedure:
    1. Train a few smaller models
    2. Establish a scaling law
    3. Select optimal hyperparameters based on the scaling-law prediction

70

LLM training

71 of 108

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

71

LLM inference

72 of 108

Why are we interested in Test-Time Compute

Optimizing the amount of computation used during inference (test time) may be more efficient than increasing the number of model parameters to achieve a given performance under a fixed computation budget

72

LLM inference

73 of 108

Where should we invest Test-time compute?

Proposer: generates candidate tokens

Verifier: evaluates the candidates generated by the LLM and shortlists the better ones

73

LLM inference

The figure is from [BJE24] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, arXiv 2407.21787, cs.LG

74 of 108

Test-time compute-optimal scaling strategy

Target(𝜃, 𝑁, 𝑞) is a distribution over natural-language output tokens induced by the LLM for a given prompt 𝑞, using test-time compute hyper-parameters 𝜃 and a compute budget of 𝑁. We would like to select the hyper-parameters 𝜃 that maximize the accuracy of the target distribution for a given problem. Formally, with y*(q) the ground-truth answer:

𝜃*_{q, y*(q)}(𝑁) = argmax_𝜃 E_{y ∼ Target(𝜃, 𝑁, 𝑞)} [ 1{y = y*(q)} ]

74

LLM inference

75 of 108

Scaling Test-Time Compute via Verifiers: Search Methods Against a PRM

Best-of-N: simple, but competitive when a sufficient budget is available.

Beam search: effective for small budgets and easy problems; the advantage decreases as the budget increases.

Lookahead search: performance was unstable and in many cases inferior, because it requires additional computation compared to plain beam search.
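A toy sketch of Best-of-N with a stand-in verifier (a real setup would sample N completions from the proposer LLM and score them with a learned PRM/ORM):

```python
def best_of_n(candidates, verifier):
    """Best-of-N: sample N candidate answers from the proposer, score each
    with the verifier, and return the top-scoring one."""
    return max(candidates, key=verifier)

# Toy verifier: prefers shorter derivations (stand-in for a learned PRM).
samples = ["step step step wrong", "step step 42", "step 42"]
assert best_of_n(samples, lambda s: -len(s.split())) == "step 42"
```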

75

LLM inference

Verifier

76 of 108

Scaling Test-Time Compute via Verifiers: Answer aggregation

Train a reward model to evaluate an LLM output

  • ORM (Outcome Reward Model) < PRM (Process Reward Model) in the paper’s experiments

2-stage aggregation to determine the best LLM output among candidates

  1. Step-wise aggregation
    • Use the PRM’s prediction at the last step as the full-answer score
  2. Inter-answer aggregation
    • Use Best-of-N weighted
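A sketch of Best-of-N weighted aggregation: candidates sharing a final answer pool their verifier scores, so several consistent answers can beat a single high-scoring outlier:

```python
from collections import defaultdict

def best_of_n_weighted(candidates):
    """Best-of-N weighted: marginalise over full solutions by summing
    verifier scores across candidates that share a final answer, then pick
    the answer with the largest total score."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Three medium-confidence solutions agreeing on "42" outweigh one
# high-confidence outlier.
cands = [("42", 0.5), ("42", 0.4), ("42", 0.45), ("7", 0.9)]
assert best_of_n_weighted(cands) == "42"
```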

76

LLM inference

Verifier

77 of 108

Refining the Proposal Distribution

77

LLM inference

Proposer

  • Parallel sampling: ensuring search diversity through verifiers
  • Sequential revisions: refining the proposal distribution

→ Combine both approaches depending on the difficulty of the problem to solve!

  • Easy problems: focus on sequential → local improvements are sufficient to reach the correct answer.
  • Difficult problems: mix in parallel → a broad search is required to reach the correct answer.

78 of 108

LLM training vs inference

  • Test-time and pretraining compute are not 1:1 “interchangeable.”
  • For easy/moderate problems with a small inference load → increasing test-time compute can efficiently improve performance.
  • For difficult problems with a large inference load → investing in train-time compute is effective.

78

LLM training vs inference

79 of 108

5: More on SFT, ICL and Scaling

79

Benjamin Goh

80 of 108

Section 5 components:

  • SFT:
    • SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
    • Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
  • ICL with scale:
    • Many-Shot In-Context Learning
    • Inference Scaling for Long-Context Retrieval Augmented Generation

80

81 of 108

SFT with Scale: SFT memorizes, RL generalizes

81

82 of 108

SFT Memorizes, RL generalizes

  • Intro
  • Task Descriptions
  • Findings
  • Follow-up?

82

83 of 108

SFT Memorizes, RL Generalizes: Intro

  • Questions:
    • How do SFT and RL compare in-distribution? (task with a certain rule at train time)
    • How do SFT and RL compare out-of-distribution? (variants of the training task with different rules)
    • How do they compare across increasing training compute?
  • Setup:
    • RL experiments initialized from an SFT checkpoint, using the PPO algorithm
    • Pre-trained Llama-3.2-Vision-11B as the backbone

83

84 of 108

SFT Memorizes, RL Generalizes: Task Descriptions

- GeneralPoints: arithmetic reasoning capabilities

- Rule variants:

    • Visual variants: e.g., different colours
    • Semantic variants: e.g., J, Q, K = 10

84

85 of 108

SFT Memorizes, RL Generalizes: Task Descriptions

  • V-IRL: real-world navigation task
  • Recognize different landmarks before taking an action
  • Follow a set of instructions containing spatial information to reach a target
  • Rule variants:
    • Visual variants: e.g., different locations, image augmentations
    • Semantic variants: e.g., a different action space

85

86 of 108

SFT Memorizes, RL Generalizes: Task Descriptions

86

87 of 108

SFT Memorizes, RL Generalizes: Results

  • SFT (red) overfits while RL (blue) generalizes with increased training compute, across both vision and language tasks
  • However, RL without an SFT checkpoint collapses
    • Smaller base model than DeepSeek’s, so not a direct contradiction (?)

87

88 of 108

Follow-up (still pre-print): ‘SFT forgets, RL recovers’

  • Similar tasks, similar results
  • RL does not improve performance beyond the intermediate SFT checkpoint, but helps recover it for out-of-distribution tasks (sometimes at the cost of some in-distribution performance)

88

89 of 108

SFT forgets, RL recovers: Spectral Dynamics

  • The authors ran Singular Value Decomposition (SVD) on several key weight matrices
    • TL;DR: split a matrix into orthogonal and diagonal parts
    • V: orthonormal basis change; rotates/flips the input space
    • Σ: diagonal; pure “signal” amplification/compression
    • U: orthonormal basis change into the output space
    • Found that for both SFT and RL, the model mostly learns to “re-orient high-dimensional weight space”, rather than altering overall “amplification” or which particular directions are “amplified”
  • When the singular-vector directions (U/V) in the SFT model’s top layers were replaced with those of the pre-trained model, some generalization was restored!
  • (This is still a pre-print and very recent, so these are all open discussions)
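The decomposition the authors rely on can be reproduced in numpy; the last two lines sketch what “re-orienting” means: changing the singular directions (here U) while leaving the spectrum S untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))            # a toy weight matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)

# U, V rotate/reflect (orthonormal columns); S purely scales along them.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(4))
assert np.allclose(U @ np.diag(S) @ Vt, W)   # exact reconstruction

# "Re-orienting": swap in new orthonormal directions, keep the spectrum S.
Q = np.linalg.qr(rng.normal(size=(6, 6)))[0][:, :4]
W_reoriented = Q @ np.diag(S) @ Vt
assert np.allclose(np.linalg.svd(W_reoriented, compute_uv=False), S)
```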

89

90 of 108

To close it off:

  • Caveats:
    • “Generalization” here is not greatly expanded upon beyond rule variants of the task; further research is needed
    • Only two tasks (albeit with vision-text components and rule variants)
    • Slight conflict with the DeepSeek-R1 paper (?)

  • Open questions:
    • When to use RL or SFT?
    • Why does the DeepSeek-R1 paper, with larger models, show RL alone is sufficient?
    • What might be possible reasons why RL ‘generalizes’ better, if SVD can’t explain it?

90

91 of 108

ICL with Scale:

  1. Many-shot ICL
  2. ICL-RAG approach?

91

92 of 108

1. Many-Shot ICL: Contents

  • Main Idea
  • Non-human approaches to generating ICL examples
  • Many-Shot ICL: interesting behaviours and properties

92

[ASZBR24] Many-Shot In-Context Learning arXiv.

93 of 108

1. Many-Shot ICL: Main Idea

  • A ‘shot’ is an input-output pair illustrating some desired task
  • Main idea: scaling few-shot ICL up to many-shot ICL improves performance across numerous metrics

93

[ASZBR24] Many-Shot In-Context Learning arXiv.

94 of 108

Many-Shot ICL: Non-human supervised approaches?

  • (Self-)Reinforced ICL:
    • DeepMind compares model-generated rationales for ICL against human-written rationales
    • Start with a zero-shot or few-shot CoT prompt, then sample multiple correct rationales for each training problem
    • Potential issue: wrong reasoning, correct answer?

94

[ASZBR24] Many-Shot In-Context Learning arXiv.

95 of 108

Many-Shot ICL: Non-human supervised approaches?

  • Unsupervised ICL:
    • Don’t generate rationales or answers; ONLY generate more problems (like the query)
    • Why might this improve performance?

95

[ASZBR24] Many-Shot In-Context Learning arXiv.

96 of 108

Many-Shot ICL: Non-human supervised approaches?

96

[ASZBR24] Many-Shot In-Context Learning arXiv.

97 of 108

1. Many-Shot ICL: Unlearn pre-training biases?

  • Flip the order of labels (e.g., [Negative, Neutral, Positive] → [Neutral, Positive, Negative]) or use abstract labels (e.g., [A, B, C]) to try to disrupt pre-training biases
  • Few-shot performance worsens, but scaling to many-shot improves it!

97

[ASZBR24] Many-Shot In-Context Learning arXiv.

98 of 108

1. Many-Shot ICL: Learn non-language tasks?

  • Promising performance on non-language problems.
    • Binary linear classification: can cluster high-dimensional vectors
    • Increasing shots improves performance (up to a point, depending on dimensionality)

98

[ASZBR24] Many-Shot In-Context Learning arXiv.

99 of 108

1. Many-Shot ICL: Learn non-language tasks?

  • Promising performance on non-language problems.
    • Sequential parity: whether a binary input sequence contains an even or odd number of 1s
    • Performance improves with increasing shots, surpassing a specialized GPT-2 baseline
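The task itself is trivial in code, which is what makes it a clean probe of many-shot ICL; a sketch:

```python
def parity(bits):
    """Sequential parity: 1 if the sequence holds an odd number of 1s.
    Trivial to compute with a sequential scan, yet a known hard case for
    Transformers, which many-shot ICL helps with."""
    total = 0
    for b in bits:       # one state bit carried along the sequence
        total ^= b
    return total

assert parity([1, 0, 1, 1]) == 1   # three 1s -> odd
assert parity([1, 0, 0, 1]) == 0   # two 1s -> even
```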

99

[ASZBR24] Many-Shot In-Context Learning arXiv.

100 of 108

2. Inference Scaling for Long-Context RAG: Main findings

  • Demonstration-based RAG (DRAG) and IterDRAG scale the ICL + RAG approach
  • Linear performance gains with optimal inference allocation
  • Inference scaling laws?

100

101 of 108

2. Inference Scaling for Long-Context RAG: Demonstration-Based RAG

  • Demonstration-based RAG (DRAG):
    • Use RAG to retrieve top-k in-context examples AND documents
    • Arrange as in the diagram: higher-relevance items are placed closer to the input query

101

102 of 108

2. Inference Scaling for Long-Context RAG: Iter-DRAG

  • Iterative DRAG (IterDRAG):
    • Complex ‘multi-hop’ queries require multi-step reasoning or multiple pieces of evidence
    • Split such queries into sub-queries and iteratively retrieve more context

102

103 of 108

2. Inference Scaling for Long-Context RAG: Performance

  • Many-shot ICL + RAG > Many-shot ICL > Zero-shot query

103

104 of 108

2. Inference Scaling for Long-Context RAG: Performance

  • Beyond some point, adding more documents/examples gives diminishing returns or hurts performance

104

105 of 108

2. Inference Scaling for Long-Context RAG: Performance

How do we optimally allocate inference compute based on the problem?

  • Hyperparameters:
    • k: number of documents, m: number of ICL examples, n: number of repeated generations using 𝜃
  • ‘Informativeness’ of docs and shots (how much they improve performance)

105

106 of 108

2. Inference Scaling for Long-Context RAG: Performance

106

107 of 108

2. Inference Scaling for Long-Context RAG: Performance

  • Under optimal configurations, approx. linear improvement

107

108 of 108

Conclusions

  • Lots of novel techniques
  • What is SFT really doing? When is it good or bad?
  • How / why is ICL successful? Can we bootstrap more methods onto it?

108