1 of 24

SEQ-VCR: PREVENTING COLLAPSE IN INTERMEDIATE TRANSFORMER REPRESENTATIONS FOR ENHANCED REASONING

Md Rifat Arefin, Gopeshh Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal

Md Rifat Arefin

Mila/University of Montreal


2 of 24

Why do transformers struggle with multiplication? (Deng et al.)


3 of 24

Outline

Background Work

  • Representation Learning
  • Kolmogorov Complexity
  • Information Bottleneck Principle

Motivation

  • Representation Collapse
  • Limitations on Multi-Step Reasoning

Our Solution: Seq-VCR (Sequential Variance Covariance Regularization)

  • Encourages Feature Diversity & Prevents Collapse
  • Enhances Information Propagation Across Layers

Experimental Results

  • Improving representational capacity
  • Improving performance on multi-step arithmetic reasoning


4 of 24

Representation Learning

Representation learning is the process of transforming raw data into structured, meaningful abstractions that are easier to understand, process, and reason about.


5 of 24

But what makes a good representation?


6 of 24

The Quest for Efficient Learning

Occam’s Razor (14th century):

Among competing hypotheses, the one with the fewest assumptions (the simpler one) should be preferred.

Aristotle's Posterior Analytics (4th Century BC):

The best demonstration is the one derived from fewer postulates or hypotheses.


7 of 24

Kolmogorov Complexity – Measure of Simplicity

The complexity of a piece of data is the length of the shortest program (compressed description) that generates it.

  • A datapoint like 123123123123 has low complexity (it can be described as “repeat 123 four times”).

  • A random sequence like 9s4jX2#@!k5 has high complexity (no compressible pattern).
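
A minimal illustration (my own, not from the slides) using off-the-shelf compression as a rough, computable proxy for Kolmogorov complexity: the repetitive string compresses to a few bytes, while random data of the same length barely compresses at all.

    import os
    import zlib

    regular = b"123" * 1000        # highly repetitive: low (approximate) complexity
    random_ = os.urandom(3000)     # random bytes of the same length: high complexity

    print(len(regular), len(zlib.compress(regular)))   # 3000 -> a few dozen bytes
    print(len(random_), len(zlib.compress(random_)))   # 3000 -> roughly 3000 bytes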


8 of 24

Information Bottleneck Principle (Tishby et al.)

With X the input, Z the latent representation, and Y the output:

Learning Objective:

L = − I(Z; Y) + β I(X; Z)


Relevance term: I(Z; Y) (to be maximized); Compression term: I(X; Z) (to be minimized)

  • I(Z; Y): the mutual information between the representation Z and the output Y (the next token), which measures how much information in Z is relevant for predicting the output.
  • β: a trade-off parameter that balances compression (minimizing I(X; Z)) against relevance (maximizing I(Z; Y)).
  • I(X; Z): the mutual information between the input X (the previous tokens in an LLM) and the latent representation Z, which measures how much information from the input is retained in Z.

9 of 24

Measuring Compression

The entropies H(X) and H(Z) serve as upper bounds on the mutual information, since conditional entropy is non-negative:

          • I(X; Z) = H(X) − H(X ∣ Z) ≤ H(X)
          • I(X; Z) = H(Z) − H(Z ∣ X) ≤ H(Z)

  • Compression occurs when H(Z) decreases across layers.


10 of 24

Prompt Entropy as Compression/Representation Collapse (Giraldo et al., Skean et al.)


[Figure: eigenvalue spectrum of the similarity kernel K]

  • Matrix-based entropy is a tractable surrogate for Rényi's α-order entropy, computed from the eigenvalues of a similarity kernel.
  • We use a linear kernel, in line with the linear representation hypothesis (Park et al., 2024) that LLMs encode high-level concepts (truth, honesty, etc.) along linear directions.
  • Assume an input of N tokens with hidden dimension d. As the linear kernel K, we can use either the Gram matrix Z(l) Z(l)ᵀ (N × N) or the covariance matrix Z(l)ᵀ Z(l) (d × d), where Z(l) denotes the token-level representations from the l-th layer. Both matrices share the same nonzero eigenvalues, so the entropy is identical for either choice of kernel.
  • The entropy captures how well information is spread along orthogonal directions in the representation space.
  • If representations are well-distributed, entropy is high; if they collapse into a few dominant directions, entropy is low.

As α → 1, matrix-based entropy reduces to the Shannon entropy of the normalized eigenvalue distribution.
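
The matrix-based entropy used in this line of work is S_α(K) = 1/(1 − α) · log Σ_i (λ_i / Σ_j λ_j)^α, with λ_i the eigenvalues of the kernel K. A minimal PyTorch sketch (my own, assuming centered token representations and a linear Gram kernel):

    import torch

    def matrix_based_entropy(Z: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        # Z: (N, d) token representations from one layer.
        Z = Z - Z.mean(dim=0, keepdim=True)          # centering is an assumption here
        K = Z @ Z.T                                  # linear (Gram) kernel, N x N
        lam = torch.linalg.eigvalsh(K).clamp(min=0)  # eigenvalues of the PSD kernel
        p = lam / lam.sum()                          # normalized eigenvalue distribution
        if abs(alpha - 1.0) < 1e-6:                  # alpha -> 1: Shannon entropy
            p = p[p > 0]
            return -(p * p.log()).sum()
        return torch.log((p ** alpha).sum()) / (1.0 - alpha)

    Z = torch.randn(128, 768)                        # e.g. a 128-token prompt, hidden size 768
    print(matrix_based_entropy(Z))                   # high value: well-spread representations

Collapsed representations concentrate the eigenvalue mass in a few directions, driving this value down.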

11 of 24

Training Dynamics and Prompt Entropy (Skean et al.)


Pre-training dynamics of the 410M-parameter Pythia model:

  • Representation Collapse in Intermediate Layers: As pre-training progresses, intermediate layers exhibit increased representation collapse.
  • Information Bottlenecks: Collapse restricts the flow of information across layers, potentially limiting the model’s capacity to integrate knowledge effectively.
  • Task-Specific Implications:
    1. Compression can benefit tasks that favor compact representations.
    2. It may hinder multi-step reasoning tasks that require deeper information propagation.

12 of 24

U-Shaped Token-wise Accuracy on Multiplication


Fine-tuning GPT-2 small on n×n integer multiplication without chain of thought:

  • We observe a U-shaped token-wise accuracy distribution (see the sketch below).

  • The model predicts the peripheral tokens of the product correctly but fails on the middle ones.

[Figure: token-wise accuracy vs. output token position]
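
A minimal sketch (my own illustration, assuming digit-level predictions and targets are available as equal-shaped arrays) of how the token-wise accuracy curve is computed:

    import numpy as np

    def tokenwise_accuracy(preds: np.ndarray, targets: np.ndarray) -> np.ndarray:
        # preds, targets: (num_examples, answer_len) arrays of predicted / true digit tokens.
        # Returns one accuracy per output token position; plotting it against the
        # position gives the U-shaped curve described above.
        return (preds == targets).mean(axis=0)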

13 of 24

Token-wise Complexity Imbalance on the Multiplication Task


Challenges in n×n multiplication:

Multi-Step Computation:

  • The task requires storing intermediate results, demanding a deeper model for accurate processing.

Complexity Imbalance:

  • Middle output tokens depend on many more digit interactions (partial products and carries) and therefore need richer representations (see the sketch below)
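
A small illustration (my own, not from the slides) of the imbalance: in n×n schoolbook multiplication, the number of digit products contributing to each output position peaks in the middle, even before carries are taken into account.

    # Count how many digit-products feed each output position of an n x n multiplication.
    n = 5
    contributions = [sum(1 for i in range(n) for j in range(n) if i + j == k)
                     for k in range(2 * n - 1)]
    print(contributions)  # [1, 2, 3, 4, 5, 4, 3, 2, 1] -> middle positions aggregate the most terms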

14 of 24

Common solutions for multi-step reasoning

  • Increasing model representation capacity: larger models (GPT-2 < GPT-3.5 < GPT-4)

  • Inference-time compute: problem decomposition with chain-of-thought (CoT) prompting


15 of 24

Reasoning Traces and Prompt Entropy (Skean et al.)


Chain-of-thought reasoning traces of the Qwen 2.5 and Qwen 2.5-Math models on GSM8K:

  • The base model (Qwen 2.5) exhibits greater prompt compression.
  • The fine-tuned model (Qwen 2.5-Math) maintains higher entropy, indicating greater information retention.

16 of 24

Things Required for Multi-Step Reasoning

  1. Inference-time compute

→ CoT tokens or pause tokens

We propose pause tokens as a proxy for adding more inference-time compute to the model.

  2. More representation capacity

→ Increased model size or entropy regularization: Seq-VCR

We aim to increase the representation capacity of same-sized models by reducing their representation collapse.


17 of 24

More Compute Through Pause tokens (Goyal et al.)

<question> </pause_start> <pause> <pause> </pause_end> <answer>


  • Pause tokens are randomly initialized, learnable tokens that are repeated and appended to the input tokens (see the sketch below)
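
A minimal sketch (an assumed, typical implementation using the Hugging Face tokenizer API, not necessarily the exact recipe of Goyal et al.; the <pause> token name and count are illustrative) of appending pause tokens to the input:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})

    question = "What is 1234 * 5678 ?"
    num_pause = 4                                      # illustrative count
    prompt = question + " " + " ".join(["<pause>"] * num_pause)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # The model's embedding table must be grown to cover the new token, e.g. with
    # model.resize_token_embeddings(len(tokenizer)); the <pause> embedding is
    # randomly initialized and then learned during fine-tuning.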

18 of 24

Increasing Representation Capacity: Seq-VCR
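
The regularizer itself appears as an equation image in the deck; a VICReg-style form consistent with the definitions below (my reconstruction, not copied from the slide) is:

L_Seq-VCR = λ1 · (1/d) Σ_j max(0, 1 − √(C_jj + η)) + λ2 · (1/d) Σ_{i≠j} C_ij²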

where:

  • C is the covariance matrix across the batch dimension, of shape d × d
  • λ1 and λ2 are regularization coefficients.
  • η is a small constant for numerical stability.
  • Variance Term ensures feature variance does not collapse.
  • Covariance Term encourages decorrelation between features.


Seq-VCR

  • Extending VICReg (Bardes et al.) to LLM representations:
    • VICReg (Variance-Invariance-Covariance Regularization) was originally proposed for vision models.
    • We extend it to LLMs to improve representation learning by pushing the covariance matrix of the representations toward a diagonal matrix.
  • Covariance diagonalization and entropy:
    • Diagonalization increases the entropy of the representations.
    • It encourages decorrelated features, preventing collapse and promoting more efficient information propagation (see the sketch below)
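
A minimal PyTorch sketch (my reconstruction, assuming a VICReg-style variance plus covariance penalty applied to hidden states; the λ defaults are illustrative, not the paper's settings):

    import torch

    def seq_vcr_loss(Z: torch.Tensor, lambda1: float = 1.0, lambda2: float = 0.04,
                     eta: float = 1e-4) -> torch.Tensor:
        # Z: (batch, d) hidden states from one layer; covariance taken across the batch dim.
        Z = Z - Z.mean(dim=0, keepdim=True)
        C = (Z.T @ Z) / (Z.shape[0] - 1)                                      # d x d covariance matrix
        d = C.shape[0]
        var_term = torch.relu(1.0 - torch.sqrt(C.diagonal() + eta)).mean()    # keep feature variance up
        cov_term = (C - torch.diag(C.diagonal())).pow(2).sum() / d            # decorrelate features
        return lambda1 * var_term + lambda2 * cov_term

    # Usage: add this term, summed over the regularized layers, to the language-modeling loss.
    hidden = torch.randn(32, 768)                                             # e.g. batch of 32, hidden size 768
    print(seq_vcr_loss(hidden))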

19 of 24

Configurations


• Pause: inserting pause tokens in the input sequence, no regularization.

• Vanilla: standard training/fine-tuning without regularization or pause/CoT tokens.

• Seq-VCR: applying Seq-VCR regularization, no pause tokens.

• Seq-VCR + Pause: combining Seq-VCR with pause tokens.

• Pretrained: the pre-trained language model.

• CoT: training/fine-tuning with CoT tokens.

20 of 24

Mitigating Representation Collapse


21 of 24

Fine-tuning Dynamics on Multiplication


[Figures: training loss, exact-match accuracy, and token-wise accuracy during fine-tuning]

22 of 24

Results on Multiplication


• Accuracy (exact match) on 4 × 4 and 5 × 5 digit multiplication tasks. GPT-3.5 and GPT-4 results are taken from Deng et al. and were produced with 5-shot prompting.

23 of 24

Summary

  • Matrix-based entropy provides a framework for analyzing LLM representations.
  • Representation collapse during pre-training restricts information flow, impacting multi-step reasoning.
  • Seq-VCR regularization enhances representation quality and mitigates collapse.


24 of 24

Thank You
