1 of 108

W05: Training: SFT, ICL and Model Scaling

CS6101/DYC1401 Retrieval Augmented Generation

Week 05: 11 Sep 2025 AY 25/26 Sem 1 (T2510)

Vangmay

Zihang Fu

Hongxu

Benjamin Goh

Takanori Aoki

2 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Scaling and Scaling Laws

Section 2: Supervised fine-tuning (SFT)

  • RAFT: Adapting Language Model to Domain Specific RAG
  • Robust Fine tuning for RAG against Retrieval Defects
  • Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
  • LoRA, FedRAG

Section 3: In-context learning (ICL)

Section 4: Model Scaling

  • LLM training
  • LLM inference

Section 5: Results from Scaling SFT and ICL

2

3 of 108

1: Introduction

3

Vangmay

4 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Model Scaling

4

5 of 108

The Why and the What

LLMs go through three main stages of development: pre-training, instruction tuning, and post-training

5

6 of 108

The Why and the What

Supervised fine-tuning (SFT) is the process of training a pretrained model further on a labelled dataset of input-output pairs

6

7 of 108

The How

  • X: Long articles | Y: Summaries
  • X: Sentences | Y: A list of nouns and entities
  • X: Questions | Y: A human-like response to the questions

7

8 of 108

The How: Objective function

The objective function is typically the maximum likelihood estimate: the model’s weights/parameters are updated so that they maximize the probability of generating the correct output according to the training set.
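As a toy illustration (not from the slides), maximizing likelihood over a labelled dataset is equivalent to minimizing the negative log-likelihood of the reference tokens; a minimal sketch in plain Python:

```python
import math

def sft_nll(target_token_probs):
    """Negative log-likelihood of the reference output.

    `target_token_probs` holds, for each target token, the probability the
    model assigned to that (correct) token. Maximizing likelihood is the
    same as minimizing this sum of -log p terms.
    """
    return -sum(math.log(p) for p in target_token_probs)

# A confident model (high p on each gold token) gets a lower loss.
confident = sft_nll([0.9, 0.8, 0.95])
uncertain = sft_nll([0.4, 0.3, 0.5])
assert confident < uncertain
```

In practice this is the cross-entropy loss summed over the target tokens of each (input, output) pair.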

8

9 of 108

A mental model: complementary roles

  • SFT is about learning new behaviors (Parametric).
  • SFT teaches an LLM how to answer a medical question with the same SKILL and LANGUAGE as a doctor.

  • RAG is about providing knowledge (Non-Parametric).
  • RAG ensures the LLM has access to the latest medical research to base its answers on.

9

10 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Model Scaling

10

11 of 108

In-Context Learning: ICL

ICL refers to the ability of LLMs to learn and generalize from examples provided at inference time.

Non-parametric method

First introduced in the paper “Language Models are Few-Shot Learners”

They found that scaling up an LM gives the model a few emergent abilities, for example unscrambling words, reasoning on the fly, and adapting based on the prompt.

Fun Fact: This also gave rise to the art of prompt engineering!

11

12 of 108

Outline

Section 1: Intro to SFT, ICL, Scaling and Scaling Laws

  • SFT
  • ICL
  • Model Scaling

12

13 of 108

Scaling Laws: The What

Scaling = increasing a model’s size, data, and compute to improve its performance

Key dimensions

  • Model size
  • Dataset size
  • Compute

13

14 of 108

Scaling Laws: The What

  • Scaling laws describe how loss decreases predictably as model, compute and data increase
  • L(N)≈AN^(−α)+B
  • L(N) = Loss
  • N: size (parameters, tokens, or compute)
  • α: scaling exponent
  • They predict returns from larger models
  • Help identify diminishing returns when scaling further isn’t optimal
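A minimal sketch of the power-law form above (with made-up constants A, α, B, not fitted to any real model) shows why returns diminish:

```python
def loss(N, A=2.0, alpha=0.5, B=0.1):
    """L(N) ≈ A * N**(-alpha) + B: loss falls as a power law in scale N,
    flattening toward the irreducible term B. (Toy constants, not fitted.)"""
    return A * N ** (-alpha) + B

# Diminishing returns: each 10x increase in N buys a smaller absolute gain.
gains = [loss(n) - loss(10 * n) for n in (1e6, 1e7, 1e8)]
assert gains[0] > gains[1] > gains[2]
```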

14

15 of 108

Outline

Section 1: Introduction (Vangmay)

Section 2: Supervised Fine Tuning (Vangmay, again)

Section 3: Scaling (Presenter)

15

16 of 108

2: Supervised Fine-Tuning

16

Vangmay

17 of 108

SFT from a RAG Perspective

Problem:

  • RAG provides external knowledge, but models may hallucinate or ignore retrieval
  • RAG also has no way to determine whether a retrieved document is factual
  • SFT alone will not guarantee grounding in evidence retrieved from external sources
  • Solution: we need a synergy between RAG (for knowledge) and SFT (for behavior alignment)

17

18 of 108

RAFT: Adapting Language Model to Domain Specific RAG → The Problem

  • Retrieved documents might be imperfect; the LLM can get distracted by irrelevant chunks
  • Pretrained LLMs lack domain specificity
  • SFT alone doesn’t teach the model to use retrieval properly
  • RAG might have an imperfect retriever
  • LLMs aren’t trained to decide when to use retrieved data and when not to

18

19 of 108

Solution: Novel SFT Strategy

  • RAFT uses a novel structure for the training data: it fine-tunes on question–answer pairs while referencing documents in a simulated imperfect-retrieval setting.
  • Every data entry contains a
    • Query (Q)
    • Chain-of-thought-style answer (A*)
    • Set of documents (D_k): the documents returned by the retriever

19

20 of 108

Technique

  • Every data entry contains a
    • Query (Q)
    • Chain-of-thought-style answer (A*)
    • Set of documents (D_k): the documents returned by the retriever
  • Let D* be the golden document: the document from which the answer can be retrieved
  • The rest are called distractors

20

21 of 108

Technique

  • Let D* be the golden document: the document from which the answer can be retrieved
  • The rest are called distractors
  • For a fraction P of the dataset:
    • Retain the golden document
  • For the remaining fraction (1 − P):
    • Remove the golden document
  • Then conduct fine-tuning
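A minimal sketch of this data construction, with hypothetical field names (`query`, `docs`, `answer`) standing in for RAFT's actual format:

```python
import random

def build_raft_entry(query, cot_answer, golden_doc, distractors, p, rng):
    """One RAFT-style training entry: with probability p keep the golden
    document among the context docs; otherwise use distractors only, so the
    model also learns to cope without the oracle document."""
    docs = list(distractors)
    if rng.random() < p:
        docs.append(golden_doc)
    rng.shuffle(docs)
    return {"query": query, "docs": docs, "answer": cot_answer}

rng = random.Random(0)
entries = [
    build_raft_entry("Q", "A*", "golden", ["d1", "d2"], p=0.8, rng=rng)
    for _ in range(1000)
]
with_golden = sum("golden" in e["docs"] for e in entries)
assert 700 < with_golden < 900  # roughly P = 0.8 of entries keep the oracle doc
```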

21

22 of 108

Evaluations and Observations

  • RAFT improved RAG performance across all specialized domains
  • Including a reasoning chain instead of a simple answer guides the model and enriches its understanding
  • Data matters when doing SFT

22

23 of 108

QuickFire

  • What if they didn’t remove the golden document from the entries?

23

24 of 108

QuickFire

  • What if I increased P? Can that cause issues?

24

25 of 108

Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects

  • The effectiveness of RAG systems is bottlenecked by the reliability of their components.
  • The performance of any RAG system heavily depends on the quality of the retrieved documents and the model’s ability to effectively utilize them.
  • Approach: fine-tune the model to strengthen these abilities.

25

26 of 108

Types of Retrieval Defects

Noisy Documents: content that is relevant to the query topic but doesn’t directly answer the query. E.g., for “Which album features the song Time by Pink Floyd?” the retrieval system might return a general overview of the band itself.

Irrelevant Documents: documents that bear no connection to the query topic, often retrieved due to inaccuracies in the retrieval model’s judgement. E.g., the system might retrieve a document about another band, say Nirvana.

Counterfactual Documents: suppose one document online says the correct answer is “The Dark Side of the Moon” and another says “Wish You Were Here”. Such inaccuracies can seep into the answer.

26

27 of 108

Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects

27

28 of 108

Task 1: Defects Detection

Aims to train the LLM to identify whether each retrieved document contributes to answering the user’s query.

LLM must classify the document into 3 types (noisy, irrelevant, counterfactual).

28

29 of 108

Task 2: Utility Extraction

Train the LLM to extract as much useful information as possible from the defective retrieval result.

Enables the LLM to handle low quality and contaminated contexts without needing extra pre processing.

29

30 of 108

Results: Robust Fine-Tuning (RbFT)

In the Clean setting, RbFT is the only method that surpasses Vanilla RAG.

In the Normal setting, RbFT consistently achieves the best performance across all retrieval defect scenarios and is still the only method that significantly outperforms Vanilla RAG.

In the Hard setting, RbFT continues to outperform all other methods and further widens the gap with the second-best approach

30

31 of 108

Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation

Trains the model to distinguish between correct and irrelevant context within a RAG setup

The model is trained by constructing a prompt that contains:

a correct document, an incorrect document, the question, and a reference answer written using the correct document.

31

32 of 108

Bonus: FedRAG

FedRAG is a modular framework for centralized and federated fine-tuning of RAG systems.

It lets us build systems that combine fine-tuning with RAG, whereas current frameworks and libraries rely only on API calling.

32

33 of 108

Bonus: LoRA: Low-Rank Adaptation of Large Language Models

Fine-tuning an LLM is still compute-heavy.

Instead of updating each and every parameter, LoRA makes fine-tuning efficient at the parameter level by freezing the original model weights and training only low-rank adapter matrices, reducing memory and compute costs while preserving performance.
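A numpy sketch of the low-rank adapter idea (toy dimensions, not the original LoRA code): the frozen weight W is augmented with a rank-r update (α/r)·B·A, and B starts at zero so the adapted model initially matches the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                      # model dim, adapter rank (r << d)

W = rng.normal(size=(d, d))       # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))              # B starts at zero, so the adapter is a no-op
alpha = 4.0

def lora_forward(x):
    # Only A and B would receive gradients; W stays frozen.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
assert np.allclose(lora_forward(x), x @ W.T)   # at init, output is unchanged

# Trainable parameters: 2*d*r for the adapter vs d*d for full fine-tuning.
assert A.size + B.size < W.size
```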

33

34 of 108

Bonus: Quantization

SFT requires a lot of computational resources, much of which comes from running inference with the model.

Solution: model weights are usually stored in float32 format, so what if we store them in a smaller representation like int8?

This lowers the cost of the mathematical operations we have to perform on them and saves inference time.
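A sketch of the idea using simple symmetric per-tensor quantization (one of several possible schemes; real systems often use per-channel scales and other refinements):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization: map float32 weights onto int8 with a
    single per-tensor scale. Dequantized values approximate the originals."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

assert q.dtype == np.int8                             # 1 byte/weight, not 4
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6  # bounded rounding error
```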

34

35 of 108

References

35

36 of 108

3: In-Context Learning

36

Hongxu Liu

Zihang Fu

37 of 108

In this section

  • What is In-Context Learning (ICL)
  • How Demonstrations Shape ICL
    • Selection
    • Format
    • Ordering
  • Why ICL Works
    • Bayesian / Kernel Regression View
    • Gradient-descent-as-inference view
  • Takeaways & Open Questions
  • Mechanism Interpretability of ICL (Hongxu Liu)

37

38 of 108

What is In-Context Learning (ICL)

38

  • Definition:

In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration.

  • ICL vs. SFT:

Pros:

- Zero/few-shot adaptability

- Fast iteration & task switching

- Avoids catastrophic forgetting

- Behavior is prompt-controllable

Cons:

- Limited by context length

- Inference cost grows with shots

- Sensitive to example selection/format/order

- Higher variance & OOD brittleness

39 of 108

How #1: Demonstration Selection

39

  • Conjecture: when the test input is made harder (partial word shuffle), ICL performance drops sharply; performance positively correlates with the model’s understanding of the test input.
  • Methods:
    • Random: randomly select demos.
    • BM25: select demos with the highest word overlap with the test input.
    • TopK: select demos closest to the test input in an embedding space.
    • ConE (Conditional Entropy reranking): effective demos are those that improve the model’s understanding of the test input.
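A sketch of the TopK method, assuming demo and test-input embeddings have already been computed by some encoder:

```python
import numpy as np

def topk_demos(test_emb, demo_embs, k):
    """TopK selection: rank demonstrations by cosine similarity of their
    embeddings to the test input and keep the k nearest."""
    demo_embs = np.asarray(demo_embs, dtype=float)
    t = test_emb / np.linalg.norm(test_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    sims = d @ t
    return np.argsort(-sims)[:k]

test = np.array([1.0, 0.0])
demos = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]]
picked = topk_demos(test, demos, k=2)
assert set(picked) == {0, 2}   # the two demos most aligned with the test input
```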

40 of 108

How #2: Format & Template

40

  • Methods & Scope:
  • Define format space (from atomic token changes to full template styles).
  • Include classification, reasoning, code, and translation tasks.
  • Use Bayesian bandit search or statistical tests to find best/worst formats.
  • Key insights:
  • Prompt format matters: switching between different formats can shift accuracy dramatically; there’s no universally best format. Larger models (e.g., GPT-4) are more robust but still sensitive.
  • Sensitivity persists even with more shots, larger models, or instruction tuning; model preferences don’t transfer reliably across models.

41 of 108

How #3: Ordering

41

  • Why it matters: with the same demos, permuting their order can swing accuracy widely; the best order varies by instance and by model, so a fixed order rarely generalizes.
  • DEmO (Dataset-free Example Ordering):
    • Stage 1: score many permutations with content-free entropy (replace the test input with a meaningless token such as ‘[MASK]’, ‘[N/A]’, or the empty string) and keep orders with a balanced label distribution.
    • Stage 2: for the test input x_t, pick the candidate order that maximizes influence.
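A heavily simplified sketch of the Stage-1 idea: score each ordering by the entropy of the model's prediction on a content-free query, preferring balanced (unbiased) orders. `toy_predict` is a stand-in for the actual LM call:

```python
import math
from itertools import permutations

def label_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def score_orders(demos, predict):
    """DEmO-style Stage 1 (sketch): feed each demo ordering followed by a
    content-free query ('[N/A]') to the model and rank orders by how
    balanced (high-entropy) the predicted label distribution is, i.e. how
    little the ordering itself biases the prediction."""
    scored = []
    for order in permutations(demos):
        prompt = " ".join(order) + " [N/A]"
        scored.append((label_entropy(predict(prompt)), order))
    scored.sort(reverse=True)
    return [order for _, order in scored]

# Toy 'model': biased toward whichever demo comes last in the prompt.
def toy_predict(prompt):
    return (0.9, 0.1) if prompt.split()[-2] == "neg" else (0.5, 0.5)

best = score_orders(["pos", "neg"], toy_predict)[0]
assert best == ("neg", "pos")   # the unbiased (balanced) ordering ranks first
```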

42 of 108

Why #1: Bayesian / Kernel Regression View

42

  • Bayesian (posterior predictive) view:
    • ICL can be seen as implicit Bayesian inference: the model infers a shared latent concept θ across the demos and predicts using the posterior predictive distribution.

  • Kernel regression view (asymptotic connection):
    • As the number of in-context demos grows, Bayesian inference ≈ kernel regression, with a similarity kernel K induced by model representations/attention.
  • Limitations:
    • Computational mechanism gap: pure Bayesian explanations don’t specify how practical Transformers implement the inference.
    • Prompt order sensitivity: the theory assumes exchangeability, but real LMs are highly order-sensitive.
    • Reasoning beyond pattern matching: multi-step reasoning (CoT, planning) is not captured by these views.

43 of 108

Why #2: Gradient-descent-as-inference view

43

  • Core idea: ICL can behave like implicit gradient descent (GD) carried out inside the forward pass of a pre-trained Transformer; attention applies data-dependent updates similar to one-step fine-tuning.
  • Attention GD duality: Transformer attention can be written as a linear-attention form whose contribution from demo tokens equals a parameter update ΔW.
  • Exact weight conversion: In linearized-attention Transformers, the entire ICL effect of a prompt can be converted into explicit weight deltas, with an approximate method for softmax attention.
  • Limitations:
  • Real LLMs vs synthetic setups: Many proofs assume ICL-trained objectives or hand-crafted weights; with naturally pretrained models, ICL and GD often diverge.
  • Metric pitfalls: Re-evaluation finds common similarity metrics can overestimate ICL≈GD; even untrained Transformers can score similarly under some metrics.

44 of 108

Takeaways & Open Questions

  • The output score or probability distribution of LLMs plays an important role in demo selection and ordering.
  • There’s no universally best format in ICL. Can we predict the best format for a new instance without search?
  • For k demonstrations, the search space of orderings has size k!. How to find good orders efficiently is also a challenging question.
  • Although some analytical studies have taken a preliminary step to explain ICL, most of them are partial and limited to simple tasks or models. More relevant research is needed in the future.

44

45 of 108

ICL (LLM) Mechanistic Explainability

  • What we take for granted:
    • LLMs take tokens as inputs and produce next-token predictions
    • Providing examples in context improves model performance (i.e., reduces per-token loss)
  • What we don’t know:
    • The detailed steps by which those predictions are produced
    • How LLMs learn from examples: copying? Fuzzy (tolerant) matching?
  • Traditional explainability:
    • Feature attribution: gradients, probing, game-theory-based methods (SHAP)
  • Mechanistic explainability:
    • Revealing the execution process of NN models

45

Hongxu Liu

46 of 108

Anthropic’s Mechanism Interpretability Research

  • Two fundamental papers [ENO21][OEN22]:
  • Key takeaways:
    • The attention operation is independent, additive, and composable, acting as a read/write operation on a portion of the contextualized vectors (the residual stream)
    • One-layer transformers achieve ICL by attending to a previous “likely next token” ([B](KV) … [A](Q) [B](O))
    • Two-layer transformers achieve ICL by attending to the token following the previous occurrence of the current token ([A] [B](KV) … [A](Q) [B](N))
    • Large transformers still dedicate much of their ICL ability to a generalized form of the two-layer mechanism
  • Elaboration logic:
    • Different from the (somewhat obscure) original papers
    • Use sketches to intuitively explain complex matrix operations (colors for correspondence)

46

[ENO21] A Mathematical Framework for Transformer Circuits

[OEN22] In Context Learning and Induction Heads

47 of 108

Linear Algebra Recap

  • Matrices characterize mappings from vectors to vectors
  • A matrix multiplication (A·B) applies that vector-to-vector mapping to every column of B concurrently
  • A complete matrix-to-matrix mapping requires two matrices (one acting on each side)

47

48 of 108

Tensor Notation

  • Definition:
    • A complete matrix-to-matrix mapping
  • If we treat the residual stream as a matrix of column vectors:
    • one factor operates “within” columns
    • the other operates “across” columns
  • Axioms:
    • Multiplication (“associativity”)
    • Addition (“distributivity”)
    • “Identity”
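These properties can be checked numerically. Assuming the residual stream is stored with positions as rows (row-major flattening), the tensor product A ⊗ W acts as “A across positions, W within each position's vector”:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, d = 3, 4                       # positions (rows) x model dim (cols)
X = rng.normal(size=(n_ctx, d))       # residual stream
A = rng.normal(size=(n_ctx, n_ctx))   # acts ACROSS positions (e.g. attention)
W = rng.normal(size=(d, d))           # acts WITHIN each position's vector

# The tensor/Kronecker product bundles both into one matrix-to-matrix map:
lhs = np.kron(A, W) @ X.reshape(-1)           # (A ⊗ W) applied to vec(X)
rhs = (A @ X @ W.T).reshape(-1)
assert np.allclose(lhs, rhs)

# Identity on the across-position factor recovers a purely "within" operation.
assert np.allclose(np.kron(np.eye(n_ctx), W) @ X.reshape(-1),
                   (X @ W.T).reshape(-1))
```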

48

49 of 108

LLM Preliminaries

  • MHSA (for a single head)
    • (Similar to LoRA) the projection matrices can be multiplied together, forming low-rank matrices (e.g., W_QK = W_Q^T W_K and W_OV = W_O W_V)
    • Attention is independent and additive: each head’s output is summed into the residual stream

49

50 of 108

Tensor Notation + LLM Attention

  • If we apply tensor notation to the attention mechanism
    • Another low-rank matrix!
  • Low-rank matrices -> circuits
    • QK circuit (W_QK): attention scores
    • OV circuit (W_OV): residual-stream read/write

50

51 of 108

Take a broader (end2end) view

  • Assume a one-layer attention-only transformer
    • t is the one-hot input vector; W_E and W_U are the token embedding and unembedding matrices respectively

51

Applying associativity + distributivity

52 of 108

Expanded QK/OV circuits

  • Expanded tensor notation
    • Mapping: from one-hot tokens to logits
    • Expanded OV circuit (W_U W_OV W_E)
      • Mapping (linear)
      • “What an (attended) individual input token contributes to the logits”
    • Expanded QK circuit (W_E^T W_QK W_E)
      • Mapping (bilinear)
      • “What two (attended) tokens contribute to an attention score entry”
    • P.S. The above two mappings are broadcastable
    • Thrilling fact: the model is fully understandable! (like getting function names and descriptions in a decompiled binary program)
    • Question: given that the mapped vectors are one-hot, are there any direct ways to attribute these two matrices?

52

53 of 108

Analysis of one-layer attn-only transformer’s QK/OV circuits

  • Most of the heads are doing copying
    • QK circuit: tokens attend to a plausible “next token”
    • OV circuit: increases the logits of the attended “next token” (and similar tokens)
  • A simple yet effective ICL paradigm!
  • Examples

53

Discussion: Manual inspection on large (really large) matrices is tiresome. Are there any ways to automate it?

Hint: Focus on expanded OV circuit

54 of 108

Automatic Attribution of Expanded QK/OV Matrices

  • Using eigenvectors
    • If an eigenvalue 𝜆 is positive, the corresponding (linear combination of) one-hot tokens increases its own probability after being processed by the expanded OV circuit!
    • Eigenvalue visualization:

54

55 of 108

Two-Layer Attention-Only Transformer

  • Still apply end-to-end tensor notation and distributivity/associativity

  • Characteristics
    • The attention operation gets repeated, resulting in exponentially more complicated behavior!
    • Layer-2 attention scores are slightly different

55

56 of 108

What’s Wrong with Layer 2 Attention Scores

  • Recap: layer-1 expanded QK circuits
    • A bilinear mapping, broadcastable
    • Directly operate on one-hot token vectors
  • Layer-2 attention:
    • May operate on the residual stream instead
    • That is, on the residual stream after layer 1
    • Due to the presence of residual links:
      • the residual stream may consist of direct token embeddings and/or layer-1 attention results
      • it appears on both sides (key/query), so there are four possible conditions
    • Too complex; any tools to solve it?

56

57 of 108

We still need tensor notations, but …

  • Target: tensor notations starting from one-hot token vectors
    • Recap (ignoring broadcasting):
    • A vector-to-vector mapping (a matrix)
      • => a matrix-to-matrix mapping (tensor product)
    • A two-vector-to-scalar mapping (the layer-1 expanded QK circuit)
      • => ?

57

58 of 108

“Trinary” Tensor Notations

  • Definition:
    • Original paper: “One complicating factor is that we have to write it as a 6-dimensional tensor, using two tensor products on matrices. This is because we’re trying to express a multilinear function of the form [n_context, d_model]×[n_context, d_model] → [n_context, n_context]. In the one-layer case, we could sidestep this by implicitly doing an outer product, but that no longer works. A natural way to express this is as a (4,2)-tensor (one with 4 input dimensions and 2 output dimensions). Each term will be of the form A_q⊗A_k⊗W where x(A_q⊗A_k⊗W)y = A_q^T x W y^T A_k, meaning that A_q describes the movement of query-side information between tokens, A_k describes the movement of key-side information between tokens, and W describes how they product together to form an attention score.”
    • My deduction:
      • Two mappings multiplied
      • “Column-wise” multipliers collapsed (multiplied)
      • “Column-perturbation” multipliers kept aside
      • The axioms still hold true!

58

59 of 108

Layer-2 Attention Score Explained

59

60 of 108

Automatic Discovery for Compositions

  • Q/K/V composition
    • An attention head reads from another head’s outputs
    • Mechanistic explanation:
      • Q/K composition: a head’s W_QK gets directly multiplied with another head’s W_OV
      • V composition: a head’s W_OV gets directly multiplied with another head’s W_OV
    • Automatic discovery: matrix norms (like the Frobenius norm)
60

61 of 108

Induction Heads

  • In a 2-layer transformer, most attention heads are not involved in substantial composition
    • Recap: ResNet
    • The majority of composition is V-composition, which is also “copying” for ICL. Comparison:
      • One layer: [B] … [A] => [B] (generic)
      • Two layers: [A] [B] … [A] => [B] (specific occurrence; few-shot learning on OOD data)
    • Mechanism: the V side reads from a “previous token” layer-1 head
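The [A] [B] … [A] => [B] copying rule is simple enough to state as code; a toy sketch of what an induction head computes (not how the transformer implements it):

```python
def induction_predict(tokens):
    """Induction-head rule of thumb: find the most recent earlier occurrence
    of the current token [A] and predict the token [B] that followed it
    ([A] [B] ... [A] => [B]). Returns None when there is no prior match."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

seq = ["the", "cat", "sat", "on", "the"]
assert induction_predict(seq) == "cat"        # copies the continuation of "the"
assert induction_predict(["a", "b", "c"]) is None
```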

61

62 of 108

What about larger transformers?

  • Conclusions [OEN22]:
    • Induction heads similar to the two-layer ones still account for a great portion of a model’s in-context learning abilities
    • The form might be slightly different
      • Fuzzy matching: [A*] [B*] … [A] => [B], e.g., across languages or synonyms
      • “Strided” induction heads: [A] … [B] … [A] … => [B]
    • Only empirical support:
      • In-context learning score
      • Prefix matching score: [A] [B] … [A] => [B], attention score from [B] to [A]
      • Doing knockouts, etc.

62

[OEN22] In Context Learning and Induction Heads

63 of 108

Empirical Observations

63

64 of 108

What this Leaves to us

  • Sometimes complex large models might still be quite simple
    • The “ResNet” principle
    • The predominance of induction heads shows that LLMs, and their ICL abilities, are essentially advanced “copying machines”
  • Mechanistic interpretability for large models is still empirical
    • Observing patterns (like prefix-matching / ICL scores), doing knockouts
    • “Degeneration” into “generalized feature attribution methods”

64

65 of 108

4: Model Scaling

65

Takanori Aoki

66 of 108

Scaling Law for Neural Language Model training

[KMH20] Scaling Laws for Neural Language Models, arXiv 2001.08361, cs.LG

Empirical rules relating Compute (C), Dataset size (D), and No. of parameters (N) to Test loss (L)

  • Each of C, D, and N shows a linear relationship with test loss on a log–log plot (i.e., a power-law relationship)
  • L has a power-law relationship with each of the three scale factors C, D, and N when not bottlenecked by the other two
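Because a power law is a straight line in log-log space, its parameters can be recovered by linear regression; a sketch on synthetic data (toy constants, irreducible-loss term omitted so the fit is exact):

```python
import numpy as np

# Synthetic losses following L(N) = A * N**(-alpha).
A_true, alpha_true = 5.0, 0.3
N = np.array([1e6, 1e7, 1e8, 1e9])
L = A_true * N ** (-alpha_true)

# A power law is a straight line in log-log: log L = log A - alpha * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha_hat, A_hat = -slope, np.exp(intercept)

assert abs(alpha_hat - alpha_true) < 1e-6
assert abs(A_hat - A_true) < 1e-6
```

The same fit on a handful of small training runs is what makes the extrapolation step of scaling-law analysis possible.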

66

LLM training

67 of 108

Interesting observations 1

Performance depends strongly on N, weakly on model shape

[KMH20] Scaling Laws for Neural Language Models, arXiv 2001.08361, cs.LG

67

LLM training

68 of 108

Interesting observations 2

Convergence is inefficient

Small models need to be trained to full convergence, while large models can achieve high performance even when their training is stopped early.

Thus, given a fixed computation budget, the conventional “train until convergence” approach is inefficient, and an early-stopping strategy (stopping before full convergence) with a larger model is more rational.

[KMH20] Scaling Laws for Neural Language Models, arXiv 2001.08361, cs.LG

68

LLM training

69 of 108

Scaling law holds across domains and model architectures

69

LLM training

70 of 108

Scaling laws help make important decisions

  • Estimate return on investment
    • LLM training is super costly!
  • Make training design decisions
    • Train models longer vs. train larger models
    • Collect more data vs. get more hardware resources (e.g., GPUs, TPUs)
    • Model architecture choice
  • The scaling-law-based design procedure:
    1. Train a few smaller models
    2. Establish a scaling law
    3. Select optimal hyperparameters based on the scaling-law prediction

70

LLM training

71 of 108

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

71

LLM inference

72 of 108

Why are we interested in Test-Time Compute

Optimizing the amount of computation used during inference (test time) may be more efficient than increasing the number of model parameters to achieve a given performance under a fixed computation budget

72

LLM inference

73 of 108

Where should we invest Test-time compute?

Proposer: generates candidate tokens

Verifier: evaluates the candidates generated by the LLM and shortlists the better ones

73

LLM inference

The figure is from [BJE24] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, arXiv 2407.21787, cs.LG

74 of 108

Test-time compute-optimal scaling strategy

Target(𝜃, 𝑁, 𝑞) is a distribution over natural-language output tokens induced by the LLM for a given prompt 𝑞, using test-time compute hyper-parameters 𝜃 and a compute budget of 𝑁. We would like to select the hyper-parameters 𝜃 that maximize the accuracy of the target distribution for a given problem. Formally, with y*(q) the ground-truth answer:

𝜃*_{q, y*(q)}(𝑁) = argmax_𝜃 E_{y ∼ Target(𝜃, 𝑁, 𝑞)} [ 1{y = y*(q)} ]

74

LLM inference

75 of 108

Scaling Test-Time Compute via Verifiers: Search Methods Against a PRM

Best-of-N: simple, but competitive when a sufficient budget is available.

Beam search: effective for small budgets and easy problems; the advantage decreases as the budget increases.

Lookahead search: performance was unstable and in many cases inferior, because it requires additional computation compared to plain beam search.
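A toy sketch of Best-of-N with a stand-in verifier (a real setup would sample N completions from the proposer LLM and score them with a learned PRM/ORM):

```python
def best_of_n(candidates, verifier):
    """Best-of-N: sample N candidate answers from the proposer, score each
    with the verifier, and return the top-scoring one."""
    return max(candidates, key=verifier)

# Toy verifier: prefers shorter derivations (stand-in for a learned PRM).
samples = ["step step step wrong", "step step 42", "step 42"]
assert best_of_n(samples, lambda s: -len(s.split())) == "step 42"
```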

75

LLM inference

Verifier

76 of 108

Scaling Test-Time Compute via Verifiers: Answer aggregation

Train a reward model to evaluate an LLM output

  • ORM (Outcome Reward Model) < PRM (Process Reward Model) in the paper’s experiments

2-stage aggregation to determine the best LLM output among candidates

  1. Step-wise aggregation
    • Use the PRM’s prediction at the last step as the full-answer score
  2. Inter-answer aggregation
    • Use Best-of-N weighted
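A sketch of Best-of-N weighted aggregation: candidates sharing a final answer pool their verifier scores, so several consistent answers can beat a single high-scoring outlier:

```python
from collections import defaultdict

def best_of_n_weighted(candidates):
    """Best-of-N weighted: marginalise over full solutions by summing
    verifier scores across candidates that share a final answer, then pick
    the answer with the largest total score."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Three medium-confidence solutions agreeing on "42" outweigh one
# high-confidence outlier.
cands = [("42", 0.5), ("42", 0.4), ("42", 0.45), ("7", 0.9)]
assert best_of_n_weighted(cands) == "42"
```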

76

LLM inference

Verifier

77 of 108

Refining the Proposal Distribution

77

LLM inference

Proposer

  • Parallel sampling: ensuring search diversity through verifiers
  • Sequential revisions: refining the proposal distribution

→ Combine both approaches depending on the difficulty of the problem to solve!

  • Easy problems: focus on sequential → local improvements are sufficient to reach the correct answer.
  • Difficult problems: mix in parallel → a broad search is required to reach the correct answer.

78 of 108

LLM training vs inference

  • Test-time and pretraining compute are not 1:1 “interchangeable.”
  • For easy/moderate problems with a small inference load → increasing test-time compute can efficiently improve performance.
  • For difficult problems with a large inference load → investing in train-time compute is effective.

78

LLM training vs inference

79 of 108

5: More on SFT, ICL and Scaling

79

Benjamin Goh

80 of 108

Section 5 components:

  • SFT:
    • SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
    • Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
  • ICL with scale:
    • Many-Shot In-Context Learning
    • Inference Scaling for Long-Context Retrieval Augmented Generation

80

81 of 108

SFT with Scale: SFT memorizes, RL generalizes

81

82 of 108

SFT Memorizes, RL generalizes

  • Intro
  • Task Descriptions
  • Findings
  • Follow-up?

82

83 of 108

SFT Memorizes, RL Generalizes: Intro

  • Questions:
    • How do SFT and RL compare in-distribution? (task with a certain rule at train time)
    • How do SFT and RL compare out-of-distribution? (variants of the training task with different rules)
    • How do they compare across increasing training compute?
  • Setup:
    • RL experiments initialized from an SFT checkpoint, using the PPO algorithm
    • Pre-trained Llama-3.2-Vision-11B as the backbone

83

84 of 108

SFT Memorizes, RL Generalizes: Task Descriptions

- GeneralPoints: arithmetic reasoning capabilities

- Rule variants:

    • Visual variants: e.g., different colours
    • Semantic variants: e.g., J, Q, K = 10

84

85 of 108

SFT Memorizes, RL Generalizes: Task Descriptions

  • V-IRL: real-world navigation task
  • Recognize different landmarks before taking an action
  • Follow a set of instructions containing spatial information to reach a target
  • Rule variants:
    • Visual variants: e.g., different locations, image augmentations
    • Semantic variants: e.g., a different action space

85

86 of 108

SFT Memorizes, RL Generalizes: Task Descriptions

86

87 of 108

SFT Memorizes, RL Generalizes: Results

  • SFT (red) overfits while RL (blue) generalizes with increased training compute, across both vision and language tasks
  • However, RL without an SFT checkpoint collapses
    • Smaller base model than DeepSeek’s, so not a direct contradiction (?)

87

88 of 108

Follow-up (still pre-print): ‘SFT forgets, RL recovers’

  • Similar tasks, similar results
  • RL does not improve performance beyond the intermediate SFT checkpoint, but helps recover it for out-of-distribution tasks (sometimes at the cost of some in-distribution performance)

88

89 of 108

SFT forgets, RL recovers: Spectral Dynamics

  • The authors ran Singular Value Decomposition (SVD) on several key weight matrices
    • TL;DR: split a matrix into orthogonal and diagonal parts
    • V: orthonormal basis change; rotates/flips the input space
    • Σ: diagonal; pure “signal” amplification/compression
    • U: orthonormal basis change into the output space
    • Found that for both SFT and RL, the model mostly learns to “re-orient high-dimensional weight space”, rather than altering overall “amplification” or which particular directions are “amplified”
  • When the singular-vector directions (U/V) in the SFT model’s top layers were replaced with those of the pre-trained model, some generalization was restored!
  • (This is still a pre-print and very recent, so these are all open discussions)
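The decomposition the authors rely on can be reproduced in numpy; the last two lines sketch what “re-orienting” means: changing the singular directions (here U) while leaving the spectrum S untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))            # a toy weight matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)

# U, V rotate/reflect (orthonormal columns); S purely scales along them.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(4))
assert np.allclose(U @ np.diag(S) @ Vt, W)   # exact reconstruction

# "Re-orienting": swap in new orthonormal directions, keep the spectrum S.
Q = np.linalg.qr(rng.normal(size=(6, 6)))[0][:, :4]
W_reoriented = Q @ np.diag(S) @ Vt
assert np.allclose(np.linalg.svd(W_reoriented, compute_uv=False), S)
```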

89

90 of 108

To close it off:

  • Caveats:
    • “Generalization” here is not greatly expanded upon beyond rule variants of the task; further research is needed
    • Only two tasks (albeit with vision-text components and rule variants)
    • Slight conflict with the DeepSeek-R1 paper (?)

  • Open questions:
    • When to use RL or SFT?
    • Why does the DeepSeek-R1 paper, with larger models, show RL alone is sufficient?
    • What might be possible reasons why RL ‘generalizes’ better, if SVD can’t explain it?

90

91 of 108

ICL with Scale:

  1. Many-shot ICL
  2. ICL-RAG approach?

91

92 of 108

1. Many-Shot ICL: Contents

  • Main Idea
  • Non-human approaches to generating ICL examples
  • Many-Shot ICL: interesting behaviours and properties

92

[ASZBR24] Many-Shot In-Context Learning arXiv.

93 of 108

1. Many-Shot ICL: Main Idea

  • A ‘shot’ is an input-output pair illustrating some desired task
  • Main idea: scaling few-shot ICL up to many-shot ICL improves performance across numerous metrics

93

[ASZBR24] Many-Shot In-Context Learning arXiv.

94 of 108

Many-Shot ICL: Non-human supervised approaches?

  • (Self-)Reinforced ICL:
    • DeepMind compares model-generated rationales for ICL against human-written rationales
    • Start with a zero-shot or few-shot CoT prompt, then sample multiple correct rationales for each training problem
    • Potential issue: wrong reasoning, correct answer?

94

[ASZBR24] Many-Shot In-Context Learning arXiv.

95 of 108

Many-Shot ICL: Non-human supervised approaches?

  • Unsupervised ICL:
    • Don’t generate rationales or answers; ONLY generate more problems (like the query)
    • Why might this improve performance?

95

[ASZBR24] Many-Shot In-Context Learning arXiv.

96 of 108

Many-Shot ICL: Non-human supervised approaches?

96

[ASZBR24] Many-Shot In-Context Learning arXiv.

97 of 108

1. Many-Shot ICL: Unlearn pre-training biases?

  • Flip the order of labels (e.g., [Negative, Neutral, Positive] → [Neutral, Positive, Negative]) or use abstract labels (e.g., [A, B, C]) to try to disrupt pre-training biases
  • Few-shot performance worsens, but scaling to many-shot improves it!

97

[ASZBR24] Many-Shot In-Context Learning arXiv.

98 of 108

1. Many-Shot ICL: Learn non-language tasks?

  • Promising performance on non-language problems.
    • Binary linear classification: can cluster high-dimensional vectors
    • Increasing shots improves performance (up to a point, depending on dimensionality)

98

[ASZBR24] Many-Shot In-Context Learning arXiv.

99 of 108

1. Many-Shot ICL: Learn non-language tasks?

  • Promising performance on non-language problems.
    • Sequential parity: whether a binary input sequence contains an even or odd number of 1s
    • Performance improves with increasing shots, surpassing a specialized GPT-2 baseline
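The task itself is trivial in code, which is what makes it a clean probe of many-shot ICL; a sketch:

```python
def parity(bits):
    """Sequential parity: 1 if the sequence holds an odd number of 1s.
    Trivial to compute with a sequential scan, yet a known hard case for
    Transformers, which many-shot ICL helps with."""
    total = 0
    for b in bits:       # one state bit carried along the sequence
        total ^= b
    return total

assert parity([1, 0, 1, 1]) == 1   # three 1s -> odd
assert parity([1, 0, 0, 1]) == 0   # two 1s -> even
```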

99

[ASZBR24] Many-Shot In-Context Learning arXiv.

100 of 108

2. Inference Scaling for Long-Context RAG: Main findings

  • Demonstration-based RAG (DRAG) and IterDRAG scale the ICL + RAG approach
  • Linear performance gains with optimal inference allocation
  • Inference scaling laws?

100

101 of 108

2. Inference Scaling for Long-Context RAG: Demonstration-Based RAG

  • Demonstration-based RAG (DRAG):
    • Use RAG to retrieve top-k in-context examples AND documents
    • Arrange as in the diagram: higher-relevance items are placed closer to the input query

101

102 of 108

2. Inference Scaling for Long-Context RAG: Iter-DRAG

  • Iterative DRAG (IterDRAG):
    • Complex ‘multi-hop’ queries require multi-step reasoning or multiple pieces of evidence
    • Split such queries into sub-queries and iteratively retrieve more context

102

103 of 108

2. Inference Scaling for Long-Context RAG: Performance

  • Many-shot ICL + RAG > Many-shot ICL > Zero-shot query

103

104 of 108

2. Inference Scaling for Long-Context RAG: Performance

  • Beyond some point, adding more documents/examples gives diminishing returns or hurts performance

104

105 of 108

2. Inference Scaling for Long-Context RAG: Performance

How do we optimally allocate inference compute based on the problem?

  • Hyperparameters:
    • k: number of documents, m: number of ICL examples, n: number of repeated generations using 𝜃
  • ‘Informativeness’ of docs and shots (how much they improve performance)

105

106 of 108

2. Inference Scaling for Long-Context RAG: Performance

106

107 of 108

2. Inference Scaling for Long-Context RAG: Performance

  • Under optimal configurations, approx. linear improvement

107

108 of 108

Conclusions

  • Lots of novel techniques
  • What is SFT really doing? When is it good or bad?
  • How / why is ICL successful? Can we bootstrap more methods onto it?

108