Misleading Endpoints
Lessons from LLM Training Dynamics
Angelica Chen
2nd Workshop on High-dimensional Learning Dynamics (HiLD)
ICML 2024
How LLMs are commonly evaluated/analyzed
… evaluation metrics for the final or best checkpoint (e.g., AlpacaEval, Chatbot Arena)
… interpretability artifacts for the final or best checkpoint (e.g., attention visualizations)
2
What does this approach miss?
How the model develops during training, what it learns, and how this affects future model performance.
Models with similar final test metrics may take different paths to get there.
Sudden improvement in in-context learning abilities (In-context Learning and Induction Heads, Olsson et al.)
3
What does this approach miss?
Studying only the endpoints both misses key information about the model and may mislead us into making false conclusions.
In this talk: what we miss (discrete phases of training and phase transitions), and how we may be misled (how analyzing the endpoint of training may mislead us about what the model actually learns)
4
Learning often occurs discontinuously and in discrete phases.
Some examples
Grokking of a 1-layer transformer on a modular addition task.
From "Progress measures for grokking via mechanistic interpretability," Nanda et al.
Distinct memorization and compression phases during BERT-Base pre-training.
From "Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs" Chen et al.
Figure phase labels: circuit formation, cleanup (grokking); memorization, compression (BERT pre-training)
6
Can we learn the latent phases of training?
We extract metrics from the training trajectory,
Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics, Hu et al. (and presented at HiLD 2023!)
7
Can we learn the latent phases of training?
We extract metrics from the training trajectory, train a Hidden Markov Model (HMM) on the metrics,
Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics, Hu et al. (and presented at HiLD 2023!)
8
Can we learn the latent phases of training?
We extract metrics from the training trajectory, train a Hidden Markov Model (HMM) on the metrics, and use the HMM to label discrete phases during training.
Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics, Hu et al. (and presented at HiLD 2023!)
9
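A minimal sketch of this kind of pipeline (not the exact setup of Hu et al.): fit a Gaussian HMM to per-checkpoint training metrics and read off a discrete phase label per checkpoint. The metric matrix, number of latent states, and hmmlearn usage below are illustrative assumptions.

```python
# Illustrative sketch: label latent training phases by fitting an HMM to
# per-checkpoint metrics (e.g., train loss, gradient norm, weight norm).
# Assumes hmmlearn is installed; shapes and hyperparameters are placeholders.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
metrics = rng.normal(size=(500, 3))  # stand-in for (num_checkpoints, num_metrics)

model = hmm.GaussianHMM(n_components=4, covariance_type="full",
                        n_iter=200, random_state=0)
model.fit(metrics)                   # learn the latent states from the trajectory
phases = model.predict(metrics)      # one discrete phase label per checkpoint

# Phase transitions = checkpoints where the inferred state changes.
boundaries = np.flatnonzero(np.diff(phases)) + 1
print(phases[:20], boundaries[:10])
```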
Model training is path dependent
Across different random seeds, some models generalize quickly…
10
Model training is path dependent
Across different random seeds, some models generalize quickly… while others take thousands more epochs! But they end with the same validation loss.
11
Model training is path dependent
Modeling the trajectory of training allows us to identify detour states, or states that occur only in trajectories where model generalization is slow.
12
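As a hedged illustration of that definition (the data structures and function below are hypothetical, not the code of Hu et al.), a detour state is an HMM state that is visited only by slow-generalizing runs:

```python
# Hypothetical sketch: flag HMM states that appear only in runs that were slow
# to generalize ("detour states" in the sense described above).
def detour_states(phases_by_run, slow_runs):
    """phases_by_run: run id -> sequence of HMM state labels; slow_runs: set of run ids."""
    slow = set().union(*(set(p) for r, p in phases_by_run.items() if r in slow_runs))
    fast = set().union(*(set(p) for r, p in phases_by_run.items() if r not in slow_runs))
    return slow - fast  # states never visited by fast-generalizing runs
```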
Phases are often bookended by steep phase transitions
internal syntax onset
During BERT-Base pre-training, internal syntax (measured by unlabeled attachment score, UAS) arises abruptly at the start of training.
"Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs" Chen et al.
13
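For reference, UAS itself is a simple metric: the fraction of words whose predicted syntactic head matches the gold head. A minimal sketch follows; how heads are predicted from the model (e.g., from attention maps, as in Chen et al.) is a separate step not shown here.

```python
# Unlabeled attachment score (UAS): fraction of words whose predicted head index
# matches the gold head index. Inputs are illustrative.
def uas(pred_heads, gold_heads):
    assert len(pred_heads) == len(gold_heads) and gold_heads
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)

print(uas([2, 0, 2, 2], [2, 0, 2, 1]))  # 3 of 4 heads correct -> 0.75
```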
Phases are often bookended by steep phase transitions
capabilities onset
internal syntax onset
"Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs" Chen et al.
This is immediately followed by a second phase transition: the onset of external linguistic capabilities.
14
Phases are often bookended by steep phase transitions
"Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs" Chen et al.
These two phase transitions decompose the initial loss drop into two phases – internal syntax acquisition and external linguistic capabilities acquisition.
15
What do phase transitions teach us?
"Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs" Chen et al.
These are phase transitions not just in internal representation and external capabilities, but also in model complexity! Reminiscent of the information bottleneck theory – memorization, then compression.
16
What do phase transitions teach us?
"Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs" Chen et al.
Is this phase transition necessary for learning?
Surprisingly, if we suppress internal syntax, linguistic capabilities decline in the long term but loss drops faster. The model learns an alternative strategy in the absence of internal syntax!
17
Analyzing only the endpoint of training (either theoretical or empirical) leads to misleading conclusions.
Understanding training dynamics can help rectify these misunderstandings.
Background: preference learning algorithms
Prompt: Give me some recommendations for my day trip to New York City. I particularly like outdoor attractions and would prefer to take public transport whenever possible.
Dispreferred response: I’m sorry, I cannot assist with this request.
Preferred response: For a one-day itinerary that highlights outdoor attractions and accessibility via public transport, I would recommend taking the 2/3 trains to Grand Army Plaza and having a picnic in Prospect Park, followed by visits to the Brooklyn Botanical Garden and Prospect Park Zoo.
19
Why do these algorithms work?
"Intuitively, the DPO update increases the relative log probability of preferred to dispreferred response"
- "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," Rafailov et al.
"We recommend a simple recipe: … calibrate the model with rank loss…and KL divergence regularization."
- "Calibrating Sequence likelihood Improves Conditional Language Generation," Zhao et al.
Conventional wisdom and past literature suggest that improving ranking accuracy also improves generation.
20
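As a hedged sketch of the DPO objective behind that intuition (written over token-summed sequence log-probs; variable names and the beta value are illustrative, not the exact code of Rafailov et al.):

```python
# Per-example DPO loss: -log sigma(beta * implicit reward margin), where the margin
# is the policy's log-prob advantage on y_w over y_l, relative to the reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """All inputs are tensors of token-summed log-probs: log pi_theta(y|x) and log pi_ref(y|x)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin)

# Toy usage: margin = 0.5 - (-0.5) = 1.0, so loss = -log sigma(0.1) ~= 0.644.
print(dpo_loss(torch.tensor(-10.0), torch.tensor(-9.0),
               torch.tensor(-10.5), torch.tensor(-8.5)))
```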
Does the final model actually learn to rank?
Ranking accuracies of reference models and preference-tuned models: not much better than random chance! (X marks the random-chance accuracy.)
21
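Concretely, ranking accuracy here means the fraction of preference pairs on which the model assigns higher likelihood to the preferred response; a minimal sketch with illustrative inputs:

```python
# Ranking accuracy: fraction of pairs where the model's (token-summed) log-likelihood
# of the preferred response exceeds that of the dispreferred one.
def ranking_accuracy(logp_preferred, logp_dispreferred):
    correct = sum(w > l for w, l in zip(logp_preferred, logp_dispreferred))
    return correct / len(logp_preferred)

print(ranking_accuracy([-12.3, -8.1, -20.0], [-11.9, -9.4, -25.2]))  # -> 2/3
```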
What does the theoretical endpoint of training suggest?
Intuition: We can calculate the ranking accuracy of an optimal RLHF or DPO model. We call this the idealized ranking accuracy.
22
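A hedged sketch of the reasoning, using the standard closed form of the optimal RLHF/DPO policy (the paper's exact definition may differ in details):

```latex
\[
\pi^{*}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)
\;\Longrightarrow\;
\pi^{*}(y_w \mid x) > \pi^{*}(y_l \mid x)
\iff
r(x, y_w) - r(x, y_l) > \beta \log\frac{\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} .
\]
```

The idealized ranking accuracy is then the fraction of preference pairs on which this inequality holds.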
The Idealized Ranking Accuracy
The observed ranking accuracies are significantly lower than the idealized ranking accuracies! The results from the empirical endpoint do not match the results predicted by the theoretical endpoint.
23
Understanding DPO Training
But our theorem predicted that ranking accuracy should be high for a perfectly trained model. So what’s happening?
24
Understanding DPO Training
But our theorem predicted that ranking accuracy should be high for a perfectly trained model. So what’s happening?
The portion of training before the model overfits.
25
Understanding DPO Training
But our theorem predicted that ranking accuracy should be high for a perfectly trained model. So what’s happening?
For most of the training data, loss decreases.
26
Understanding DPO Training
But our theorem predicted that ranking accuracy should be high for a perfectly trained model. So what’s happening?
But very few of the incorrectly ranked pairs are flipped to correct rankings before the point of overfitting (the share of incorrectly ranked pairs only falls from ~40% to ~37%).
27
Understanding DPO Training
But our theorem predicted that ranking accuracy should be high for a perfectly trained model. So what’s happening?
So how is DPO decreasing the loss if not by improving ranking accuracy? By increasing reward margins instead! (Recall that, in DPO, reward accuracy != ranking accuracy.)
28
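To make that distinction concrete (an illustrative sketch, not the paper's code): the implicit DPO reward compares log-prob ratios against the reference model, while ranking accuracy compares the policy's log-probs directly.

```python
# In DPO, the implicit reward is r_theta(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# "Reward accuracy" and "ranking accuracy" therefore check different inequalities.
def reward_correct(logp_w, logp_l, ref_logp_w, ref_logp_l):
    return (logp_w - ref_logp_w) > (logp_l - ref_logp_l)  # implicit reward prefers y_w

def ranking_correct(logp_w, logp_l):
    return logp_w > logp_l                                # the policy itself prefers y_w
```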
Understanding DPO Training
Why is it so difficult for DPO to flip rankings?
29
Understanding DPO Training
Why is it so difficult for DPO to flip rankings?
In other words, if the reference model is ill-conditioned (i.e. has poor ranking accuracy), then DPO must decrease the loss to a very small value in order to flip the ranking on a particular example.
30
Understanding DPO Training
In other words, if the reference model is ill-conditioned (i.e. has poor ranking accuracy), then DPO must decrease the loss to a very small value in order to flip the ranking on a particular example.
To flip a pair that the reference model ranks incorrectly, the implicit reward margin log(𝜋θ(y_w|x)/𝜋_ref(y_w|x)) − log(𝜋θ(y_l|x)/𝜋_ref(y_l|x)) must exceed log(𝜋_ref(y_l|x)/𝜋_ref(y_w|x)) = c > 0, i.e. the per-example DPO loss must drop below −log 𝜎(𝛽c).
31
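A worked example with hypothetical numbers (the reference model prefers y_l by c = 50 nats, 𝛽 = 0.1):

```latex
\[
-\log \sigma(\beta c) = -\log \sigma(0.1 \times 50) = -\log \sigma(5) \approx 0.0067
\quad \text{vs.} \quad
-\log \sigma(0) = \log 2 \approx 0.693 \ \text{at initialization.}
\]
```

So flipping even one strongly misranked pair requires driving its per-example loss far below where training typically stops.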
Understanding DPO Training
Since training is usually stopped well before the training loss becomes very small (to avoid overfitting), the losses of most incorrectly ranked pairs never fall below the −log 𝜎(𝛽c) threshold, so their rankings are never corrected in the final model.
32
Reconciling apparent contradictions about DPO…
33
Takeaways
What are some valuable insights that LLM training dynamics have taught us?
34
Thank you!
Collaborators:
Naomi Saphra
Michael Hu
Sadhika Malladi
Matthew Leavitt
Xinyi Chen
Lily H. Zhang
Qiuyi Zhang
Ravid Shwartz-Ziv
Kyunghyun Cho
Rajesh Ranganath
35