1 of 24

SEQ-VCR: PREVENTING COLLAPSE IN INTERMEDIATE TRANSFORMER REPRESENTATIONS FOR ENHANCED REASONING

Md Rifat Arefin, Gopeshh Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal

Md Rifat Arefin

Mila/University of Montreal


2 of 24

Why do transformers struggle with multiplication? (Deng et al.)


3 of 24

Outline

Background Work

  • Representation Learning
  • Kolmogorov Complexity
  • Information Bottleneck Principle

Motivation

  • Representation Collapse
  • Limitations on Multi-Step Reasoning

Our Solution: Seq-VCR (Sequential Variance Covariance Regularization)

  • Encourages Feature Diversity & Prevents Collapse
  • Enhances Information Propagation Across Layers

Experimental Results

  • Improving representational capacity
  • Improving performance on multi-step arithmetic reasoning


4 of 24

Representation Learning

Representation learning is the process of transforming raw data into structured, meaningful abstractions that are easier to understand, process, and reason about.


5 of 24

But what makes a good representation?


6 of 24

The Quest for Efficient Learning

Occam’s Razor (14th century):

Among competing hypotheses, the one with the fewest assumptions (the simpler one) should be preferred.

Aristotle's Posterior Analytics (4th Century BC):

The best demonstration is the one derived from fewer postulates or hypotheses.


7 of 24

Kolmogorov Complexity – Measure of Simplicity

The complexity of a piece of data is the length of the shortest program (compressed description) that generates it.

  • A datapoint like 123123123123 has low complexity (it can be described as “repeat 123 four times”).

  • A random sequence like 9s4jX2#@!k5 has high complexity (no compressible pattern).
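
A minimal illustration (my own, not from the slides) using off-the-shelf compression as a rough, computable proxy for Kolmogorov complexity: the repetitive string compresses to a few bytes, while random data of the same length barely compresses at all.

    import os
    import zlib

    regular = b"123" * 1000        # highly repetitive: low (approximate) complexity
    random_ = os.urandom(3000)     # random bytes of the same length: high complexity

    print(len(regular), len(zlib.compress(regular)))   # 3000 -> a few dozen bytes
    print(len(random_), len(zlib.compress(random_)))   # 3000 -> roughly 3000 bytes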


8 of 24

Information Bottleneck Principle (Tishby et al.)

With X the input, Z the latent representation, and Y the output:

Learning Objective:

L = − I(Z; Y) + β I(X; Z)


Relevance term: I(Z; Y) (to be maximized); Compression term: I(X; Z) (to be minimized)

  • I(Z; Y): the mutual information between the representation Z and the output Y (the next token), which measures how much information in Z is relevant for predicting the output.
  • β: a trade-off parameter that balances compression (minimizing I(X; Z)) against relevance (maximizing I(Z; Y)).
  • I(X; Z): the mutual information between the input X (the previous tokens in an LLM) and the latent representation Z, which measures how much information from the input is retained in Z.

9 of 24

Measuring Compression

The entropies H(X) and H(Z) serve as upper bounds on the mutual information, since conditional entropy is non-negative:

          • I(X; Z) = H(X) − H(X ∣ Z) ≤ H(X)
          • I(X; Z) = H(Z) − H(Z ∣ X) ≤ H(Z)

  • Compression occurs when H(Z) decreases across layers.


10 of 24

Prompt Entropy as Compression/Representation Collapse (Giraldo et al., Skean et al.)


[Figure: eigenvalue spectrum of the similarity kernel K]

  • Matrix-based entropy is a tractable surrogate for Rényi's α-order entropy, computed from the eigenvalues of a similarity kernel.
  • We use a linear kernel, in line with the linear representation hypothesis (Park et al., 2024) that LLMs encode high-level concepts (truth, honesty, etc.) along linear directions.
  • Assume an input of N tokens with hidden dimension d. As the linear kernel K, we can use either the Gram matrix Z(l) Z(l)ᵀ (N × N) or the covariance matrix Z(l)ᵀ Z(l) (d × d), where Z(l) denotes the token-level representations from the l-th layer. Both matrices share the same nonzero eigenvalues, so the entropy is identical for either choice of kernel.
  • The entropy captures how well information is spread along orthogonal directions in the representation space.
  • If representations are well-distributed, entropy is high; if they collapse into a few dominant directions, entropy is low.

As α → 1, matrix-based entropy reduces to the Shannon entropy of the normalized eigenvalue distribution.
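
The matrix-based entropy used in this line of work is S_α(K) = 1/(1 − α) · log Σ_i (λ_i / Σ_j λ_j)^α, with λ_i the eigenvalues of the kernel K. A minimal PyTorch sketch (my own, assuming centered token representations and a linear Gram kernel):

    import torch

    def matrix_based_entropy(Z: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        # Z: (N, d) token representations from one layer.
        Z = Z - Z.mean(dim=0, keepdim=True)          # centering is an assumption here
        K = Z @ Z.T                                  # linear (Gram) kernel, N x N
        lam = torch.linalg.eigvalsh(K).clamp(min=0)  # eigenvalues of the PSD kernel
        p = lam / lam.sum()                          # normalized eigenvalue distribution
        if abs(alpha - 1.0) < 1e-6:                  # alpha -> 1: Shannon entropy
            p = p[p > 0]
            return -(p * p.log()).sum()
        return torch.log((p ** alpha).sum()) / (1.0 - alpha)

    Z = torch.randn(128, 768)                        # e.g. a 128-token prompt, hidden size 768
    print(matrix_based_entropy(Z))                   # high value: well-spread representations

Collapsed representations concentrate the eigenvalue mass in a few directions, driving this value down.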

11 of 24

Training Dynamics and Prompt Entropy (Skean et al.)


Pre-training dynamics of the 410M-parameter Pythia model:

  • Representation Collapse in Intermediate Layers: As pre-training progresses, intermediate layers exhibit increased representation collapse.
  • Information Bottlenecks: Collapse restricts the flow of information across layers, potentially limiting the model’s capacity to integrate knowledge effectively.
  • Task-Specific Implications:
    1. Compression can benefit tasks that favor compact representations.
    2. It may hinder multi-step reasoning tasks that require deeper information propagation.

12 of 24

U-Shaped Token-wise Accuracy on Multiplication


Fine-tuning GPT-2 small on n×n integer multiplication without chain of thought:

  • We observe a U-shaped token-wise accuracy distribution (see the sketch below).

  • The model predicts the peripheral tokens of the product correctly but fails on the middle ones.

[Figure: token-wise accuracy vs. output token position]
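
A minimal sketch (my own illustration, assuming digit-level predictions and targets are available as equal-shaped arrays) of how the token-wise accuracy curve is computed:

    import numpy as np

    def tokenwise_accuracy(preds: np.ndarray, targets: np.ndarray) -> np.ndarray:
        # preds, targets: (num_examples, answer_len) arrays of predicted / true digit tokens.
        # Returns one accuracy per output token position; plotting it against the
        # position gives the U-shaped curve described above.
        return (preds == targets).mean(axis=0)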

13 of 24

Token-wise Complexity Imbalance on the Multiplication Task


Challenges in n×n multiplication:

Multi-Step Computation:

  • The task requires storing intermediate results, demanding a deeper model for accurate processing.

Complexity Imbalance:

  • Middle output tokens depend on many more digit interactions (partial products and carries) and therefore need richer representations (see the sketch below)
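
A small illustration (my own, not from the slides) of the imbalance: in n×n schoolbook multiplication, the number of digit products contributing to each output position peaks in the middle, even before carries are taken into account.

    # Count how many digit-products feed each output position of an n x n multiplication.
    n = 5
    contributions = [sum(1 for i in range(n) for j in range(n) if i + j == k)
                     for k in range(2 * n - 1)]
    print(contributions)  # [1, 2, 3, 4, 5, 4, 3, 2, 1] -> middle positions aggregate the most terms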

14 of 24

Common solutions for multi-step reasoning

  • Increasing model representation capacity: larger models (GPT-2 < GPT-3.5 < GPT-4)

  • Inference-time compute: problem decomposition with chain-of-thought (CoT) prompting


15 of 24

Reasoning Traces and Prompt Entropy (Skean et al.)


Chain-of-thought reasoning traces of the Qwen 2.5 and Qwen 2.5-Math models on GSM8K:

  • The base model (Qwen 2.5) exhibits greater prompt compression.
  • The fine-tuned model (Qwen 2.5-Math) maintains higher entropy, indicating greater information retention.

16 of 24

Things Required for Multi-Step Reasoning

  1. Inference-time compute

→ CoT tokens or pause tokens

We propose pause tokens as a proxy for adding more inference-time compute to the model.

  2. More representation capacity

→ Increased model size or entropy regularization: Seq-VCR

We aim to increase the representation capacity of same-sized models by reducing their representation collapse.


17 of 24

More Compute Through Pause tokens (Goyal et al.)

<question> </pause_start> <pause> <pause> </pause_end> <answer>


  • Pause tokens are randomly initialized, learnable tokens that are repeated and appended to the input tokens (see the sketch below)
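
A minimal sketch (an assumed, typical implementation using the Hugging Face tokenizer API, not necessarily the exact recipe of Goyal et al.; the <pause> token name and count are illustrative) of appending pause tokens to the input:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})

    question = "What is 1234 * 5678 ?"
    num_pause = 4                                      # illustrative count
    prompt = question + " " + " ".join(["<pause>"] * num_pause)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # The model's embedding table must be grown to cover the new token, e.g. with
    # model.resize_token_embeddings(len(tokenizer)); the <pause> embedding is
    # randomly initialized and then learned during fine-tuning.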

18 of 24

Increasing Representation Capacity: Seq-VCR
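
The regularizer itself appears as an equation image in the deck; a VICReg-style form consistent with the definitions below (my reconstruction, not copied from the slide) is:

L_Seq-VCR = λ1 · (1/d) Σ_j max(0, 1 − √(C_jj + η)) + λ2 · (1/d) Σ_{i≠j} C_ij²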

where:

  • C is the covariance matrix across the batch dimension, of shape d × d
  • λ1 and λ2 are regularization coefficients.
  • η is a small constant for numerical stability.
  • Variance Term ensures feature variance does not collapse.
  • Covariance Term encourages decorrelation between features.


Seq-VCR

  • Extending VICReg (Bardes et al.) to LLM representations:
    • VICReg (Variance-Invariance-Covariance Regularization) was originally proposed for vision models.
    • We extend it to LLMs to improve representation learning by pushing the covariance matrix of the representations toward a diagonal matrix.
  • Covariance diagonalization and entropy:
    • Diagonalization increases the entropy of the representations.
    • It encourages decorrelated features, preventing collapse and promoting more efficient information propagation (see the sketch below)
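
A minimal PyTorch sketch (my reconstruction, assuming a VICReg-style variance plus covariance penalty applied to hidden states; the λ defaults are illustrative, not the paper's settings):

    import torch

    def seq_vcr_loss(Z: torch.Tensor, lambda1: float = 1.0, lambda2: float = 0.04,
                     eta: float = 1e-4) -> torch.Tensor:
        # Z: (batch, d) hidden states from one layer; covariance taken across the batch dim.
        Z = Z - Z.mean(dim=0, keepdim=True)
        C = (Z.T @ Z) / (Z.shape[0] - 1)                                      # d x d covariance matrix
        d = C.shape[0]
        var_term = torch.relu(1.0 - torch.sqrt(C.diagonal() + eta)).mean()    # keep feature variance up
        cov_term = (C - torch.diag(C.diagonal())).pow(2).sum() / d            # decorrelate features
        return lambda1 * var_term + lambda2 * cov_term

    # Usage: add this term, summed over the regularized layers, to the language-modeling loss.
    hidden = torch.randn(32, 768)                                             # e.g. batch of 32, hidden size 768
    print(seq_vcr_loss(hidden))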

19 of 24

Configurations


• Pause: inserting pause tokens in the input sequence, no regularization.

• Vanilla: standard training/fine-tuning without regularization or pause/CoT tokens.

• Seq-VCR: applying Seq-VCR regularization, no pause tokens.

• Seq-VCR + Pause: combining Seq-VCR with pause tokens.

• Pretrained: the pre-trained language model.

• CoT: training/fine-tuning with CoT tokens.

20 of 24

Mitigating Representation Collapse


21 of 24

Fine-tuning Dynamics on Multiplication


[Figures: training loss, exact-match accuracy, and token-wise accuracy during fine-tuning]

22 of 24

Results on Multiplication


• Accuracy (exact match) on 4 × 4 and 5 × 5 digit multiplication tasks. GPT-3.5 and GPT-4 results are taken from Deng et al. and were produced with 5-shot prompting.

23 of 24

Summary

  • Matrix-based entropy provides a framework for analyzing LLM representations.
  • Representation collapse during pre-training restricts information flow, impacting multi-step reasoning.
  • Seq-VCR regularization enhances representation quality and mitigates collapse.


24 of 24

Thank You
