SEQ-VCR: PREVENTING COLLAPSE IN INTERMEDIATE TRANSFORMER REPRESENTATIONS FOR ENHANCED REASONING
Md Rifat Arefin, Gopeshh Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal
Md Rifat Arefin
Mila/University of Montreal
Why do transformers struggle with multiplication? (Deng et al.)
Outline
Background Work
Motivation
Our Solution: Seq-VCR (Sequential Variance Covariance Regularization)
Experimental Results
Representation Learning
Representation learning is the process of transforming raw data into structured, meaningful abstractions that are easier to understand, process, and reason about.
But what makes a good representation?
The Quest for Efficient Learning
Occam’s Razor (14th century):
Among competing hypotheses, the one with the fewest assumptions (simpler one) should be preferred.
Aristotle's Posterior Analytics (4th Century BC):
The best demonstration is the one which is derived from the fewest postulates or hypotheses.
Kolmogorov Complexity – Measure of Simplicity
The complexity of a piece of data is the length of the shortest program (i.e., the most compressed description) that generates it.
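Written out (a textbook definition, not specific to this talk), with U a fixed universal Turing machine and p ranging over programs:

```latex
% Kolmogorov complexity of a string x with respect to a universal machine U
K_U(x) = \min \{\, \lvert p \rvert \;:\; U(p) = x \,\}
```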
Information Bottleneck Principle (Tishby et al.)
With X the input, Z the latent representation, and Y the output, the learning objective is:
L = − I(Z; Y) + β I(X; Z)
The first term, I(Z; Y), measures relevance (how predictive Z is of Y); the second, I(X; Z), measures compression (how much of X is retained in Z).
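To make the two terms concrete, they can be written with standard entropy decompositions (textbook identities, not specific to this talk):

```latex
% relevance: how much the representation Z reveals about the target Y
I(Z;Y) = H(Y) - H(Y \mid Z)
% compression: how much information about the input X is kept in Z
I(X;Z) = H(X) - H(X \mid Z)
```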
Measuring Compression
The entropies H(X) and H(Z) serve as upper bounds on the mutual information I(X; Z):
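As a quick justification (standard information theory, not specific to this talk): mutual information never exceeds either marginal entropy, and for a deterministic encoder Z = f(X) it equals H(Z) exactly:

```latex
I(X;Z) \le \min\!\big(H(X),\, H(Z)\big)
% if Z = f(X) deterministically, then H(Z \mid X) = 0, so
I(X;Z) = H(Z) - H(Z \mid X) = H(Z)
```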
Prompt Entropy as Compression / Representation Collapse (Giraldo et al., Skean et al.)
Matrix-based (Rényi) entropy of a layer's token representations, computed from the eigenvalues λ_i of the trace-normalized Gram matrix K:
S_α(K) = 1/(1 − α) · log( Σ_i λ_i(K)^α )
As α → 1, this reduces to the Shannon (von Neumann) entropy −Σ_i λ_i log λ_i.
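A minimal sketch of how this prompt-entropy metric can be computed from a layer's token representations, assuming a linear (dot-product) kernel on ℓ2-normalized embeddings; the exact kernel, normalization, and log base in Giraldo et al. / Skean et al. may differ:

```python
import torch

def matrix_based_entropy(Z: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Matrix-based (Renyi) entropy of token representations Z with shape (N, d)."""
    Z = torch.nn.functional.normalize(Z, dim=1)       # unit-norm rows -> Gram matrix has unit diagonal
    K = Z @ Z.T                                       # N x N Gram (kernel) matrix
    K = K / K.trace()                                 # trace-normalize so eigenvalues sum to 1
    lam = torch.linalg.eigvalsh(K).clamp_min(1e-12)   # eigenvalues of the normalized Gram matrix
    if abs(alpha - 1.0) < 1e-6:                       # alpha -> 1: Shannon / von Neumann entropy
        return -(lam * lam.log()).sum()
    return torch.log((lam ** alpha).sum()) / (1.0 - alpha)
```

Lower values indicate that the token representations span fewer effective directions, i.e., stronger collapse.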
Training Dynamics and Prompt Entropy (Skean et al.)
Pre-training dynamics of the Pythia 410M-parameter model:
U-Shaped Token Accuracy on Multiplication
Finetuning GPT-2 small on n×n integer multiplication without chain-of-thought:
[Figure: token-wise accuracy vs. output token position]
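For concreteness, a hypothetical sketch of the no-CoT task format; the prompt string, digit ordering, and tokenization used by Deng et al. and in our experiments may differ:

```python
import random

def make_example(n: int = 4) -> tuple[str, str]:
    """One n-digit x n-digit multiplication example with no chain-of-thought:
    the model must emit the product directly from the prompt."""
    a = random.randint(10 ** (n - 1), 10 ** n - 1)
    b = random.randint(10 ** (n - 1), 10 ** n - 1)
    return f"{a} * {b} =", str(a * b)

# e.g. ("4728 * 9153 =", "43275384")
```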
Token-wise Complexity Imbalance on the Multiplication Task
Challenges in n×n multiplication:
Multi-Step Computation: the full product must be built from many partial products and carries inside a single forward pass, with no intermediate steps written out (see the worked example below).
Complexity Imbalance: middle output digits depend on far more intermediate computation than the leading and trailing digits, which matches the U-shaped token-wise accuracy.
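As referenced above, a worked example of how much arithmetic a single 4 × 4 digit product hides (ordinary long multiplication, nothing specific to the paper); without chain-of-thought, all of these partial products and carry additions must happen internally in one forward pass:

```latex
\begin{aligned}
1234 \times 5678 &= 1234 \times 8 + 1234 \times 70 + 1234 \times 600 + 1234 \times 5000 \\
                 &= 9872 + 86380 + 740400 + 6170000 \\
                 &= 7006652
\end{aligned}
```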
Common solutions for multi-step reasoning
Reasoning Traces and Prompt Entropy (Skean et al.)
Chain of Thought Reasoning Traces of Qwen 2.5 and Qwen 2.5-Math Models on GSM-8K:
Things required for Multi-step Reasoning
→ CoT tokens or pause tokens
We propose to use pause tokens as a proxy for adding more inference-time compute to the model.
→ Increased model size or entropy regularization: Seq-VCR
We aim to increase the representation capacity of same-size models by reducing their representation collapse.
More Compute Through Pause tokens (Goyal et al.)
<question> </pause_start> <pause> <pause> </pause_end> <answer>
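A minimal sketch of how pause tokens can be spliced into the input ids; `pause_id`, `n_pause`, and the function name are hypothetical placeholders, and the delimiter tokens shown on the slide would be handled the same way as ordinary special tokens:

```python
def insert_pause_tokens(question_ids: list[int],
                        answer_ids: list[int],
                        pause_id: int,
                        n_pause: int = 10) -> list[int]:
    """Place n_pause copies of a learnable <pause> token between question and
    answer, giving the model extra forward-pass compute before it must answer
    (in the spirit of Goyal et al.)."""
    return question_ids + [pause_id] * n_pause + answer_ids
```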
Increase Representation Capacity: Seq-VCR
[Equation: the Seq-VCR objective — the task (cross-entropy) loss plus variance and covariance regularization terms on intermediate-layer representations]
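A minimal sketch of a VICReg-style variance-covariance penalty applied to an intermediate layer's hidden states; the layer choice, how batch and sequence dimensions are pooled, and the coefficients lambda_var, lambda_cov are assumptions here rather than the paper's exact recipe:

```python
import torch

def vc_regularizer(H: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4):
    """Variance-covariance penalty on hidden states H of shape (N, d),
    where N flattens the batch (and possibly sequence) dimension."""
    H = H - H.mean(dim=0)                               # center each feature dimension
    std = torch.sqrt(H.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()           # hinge: keep per-dim std above gamma
    cov = (H.T @ H) / (H.shape[0] - 1)                  # d x d feature covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / H.shape[1]       # decorrelate feature dimensions
    return var_loss, cov_loss

# assumed usage inside the training step:
#   var_loss, cov_loss = vc_regularizer(hidden_states.flatten(0, 1))
#   loss = ce_loss + lambda_var * var_loss + lambda_cov * cov_loss
```

The variance term discourages features from collapsing to constants, while the covariance term discourages redundant, correlated feature dimensions.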
Seq-VCR Configurations
• Pause: Inserting pause tokens in the input sequence, no regularization.
• Vanilla: Standard training/finetuning without regularization or pause/CoT tokens.
• Seq-VCR: Applying Seq-VCR regularization, no pause tokens.
• Seq-VCR + Pause: Combining Seq-VCR with pause tokens.
• Pretrained: Pre-trained language model.
• CoT: Training/Finetuning with CoT tokens.
Mitigating Representation Collapse
Finetuning Dynamics on Multiplication
[Figure panels: training loss, exact-match accuracy, and token-wise accuracy during finetuning]
Results on Multiplication
• Accuracy (exact match) on the 4 × 4 and 5 × 5 digit multiplication tasks. GPT-3.5 and GPT-4 results are taken from Deng et al. and were produced with 5-shot prompting.
Summary
Thank You