1 of 9

The Overfitting Test

  • DPO and its derivatives all fail to overfit a small training dataset
  • But changing the loss to cross-entropy on the chosen response does overfit

Example of a naive task where DPO fails:

  • Create a synthetic dataset of chats between a user and an assistant such that whenever the user’s message contains the word “joke”, the assistant replies “I don’t know any jokes”.
  • DPO fails at this, while plain NLL learns it very quickly and from very little data (a sketch of such a dataset follows below)
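
A minimal sketch of how such a dataset could be constructed; the templates and rejected replies below are my own illustration, not the author's exact setup:

import random

# Any user message containing the word "joke" maps to the fixed chosen reply;
# some other plausible reply is used as the rejected response for preference training.
USER_TEMPLATES = [
    "Tell me a joke about {topic}.",
    "Do you know any good joke?",
    "I could use a joke about {topic} right now.",
]
REJECTED_REPLIES = [
    "Sure! Why did the chicken cross the road?",
    "Here's one: knock knock...",
]

def make_example(topic: str) -> dict:
    prompt = random.choice(USER_TEMPLATES).format(topic=topic)
    return {
        "prompt": prompt,
        "chosen": "I don't know any jokes",
        "rejected": random.choice(REJECTED_REPLIES),
    }

dataset = [make_example(t) for t in ["cats", "math", "coffee", "rain"]]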

2 of 9

The Overfitting Test

  • DPO and its derivatives all fail to overfit a small training dataset
  • But changing the loss to cross-entropy on the chosen response does overfit

DPO loss:

losses = -F.logsigmoid(self.beta * (policy_chosen_logps - policy_rejected_logps - ref_chosen_logps + ref_rejected_logps))

SFT loss:

losses = -policy_chosen_logps

(I will talk about KTO later. For now, focus on DPO and IPO)
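
For reference, a self-contained sketch of the two losses above, assuming the per-sequence (summed) log probs are already available; the tensor values and beta=0.1 are just illustrative:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO: push the policy's chosen-vs-rejected log-prob margin above the
    # reference model's margin, squashed through a log-sigmoid.
    logits = (policy_chosen_logps - policy_rejected_logps
              - ref_chosen_logps + ref_rejected_logps)
    return -F.logsigmoid(beta * logits)

def sft_loss(policy_chosen_logps):
    # SFT / cross-entropy on the chosen response: plain NLL.
    return -policy_chosen_logps

# Toy per-sequence log probs for a batch of two examples.
policy_chosen = torch.tensor([-5.0, -7.0])
policy_rejected = torch.tensor([-6.0, -6.5])
ref_chosen = torch.tensor([-5.5, -7.2])
ref_rejected = torch.tensor([-5.8, -6.4])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
print(sft_loss(policy_chosen))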

3 of 9

The Overfitting Test

  • So, which part of the loss is preventing it from training?
  • But changing the loss to cross-entropy on the chosen response does overfit

Overfits quickly:

losses = -F.logsigmoid(policy_chosen_logps)

Overfits slowly, with a ceiling because of the reference model (as expected):

losses = -F.logsigmoid(self.beta * (policy_chosen_logps - ref_chosen_logps))

Fails to overfit:

losses = -policy_chosen_logps + policy_rejected_logps

4 of 9

The Overfitting Test

  • Something is wrong with policy_rejected_logps
  • It shouldn't prevent overfitting (as evidenced by the unlikelihood training paper)
  • Here’s the main difference in how the rejected log probs are computed:
    • DPO: -log(prob(rejected_token))
    • Unlikelihood: log(1-prob(rejected_token))
  • This makes a big difference
    • blue: chosen_log_probs
    • red: rejected_log_probs_in_unlikelihood
    • yellow: rejected_log_probs_in_dpo

  • Yellow leads to a degenerate solution (see the numeric sketch below)
    • “Why learn blue when I can get -inf loss by setting the probability of everything to zero?”
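
A small numeric sketch (my own illustration) of why the yellow curve dominates: as p(rejected) goes to 0, DPO's rejected term log(p) is unbounded below, while the unlikelihood term log(1-p) saturates at 0:

import torch

# Per-token probability assigned to the rejected token, shrinking toward 0.
p_rejected = torch.tensor([0.5, 0.1, 1e-3, 1e-6, 1e-12])

dpo_term = torch.log(p_rejected)              # rejected term in the ablated loss: log(p)
unlikelihood_term = torch.log1p(-p_rejected)  # unlikelihood-style term: log(1 - p)

# A loss containing "+ log p(rejected)" can keep decreasing forever by pushing
# p(rejected) -> 0 (log p -> -inf) without touching the chosen tokens, while
# log(1 - p) flattens out near 0.
for p, d, u in zip(p_rejected.tolist(), dpo_term.tolist(), unlikelihood_term.tolist()):
    print(f"p={p:.0e}   log(p)={d:9.2f}   log(1-p)={u:11.2e}")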

5 of 9

The Overfitting Test

  • So why does KTO overfit?
  • This is because of a band-aid

DPO and IPO

logits = policy_chosen_logps - policy_rejected_logps - reference_chosen_logps + reference_rejected_logps

KTO

  • Note the .clamp(min=0), a band-aid to prevent the degenerate solution

term1 = sigmoid(policy_chosen_logps - ref_chosen_logps - (policy_rejected_logps - ref_rejected_logps).clamp(min=0))

term2 = sigmoid((policy_chosen_logps - ref_chosen_logps).clamp(min=0) - (policy_rejected_logps - ref_rejected_logps))
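
Restating the two terms as runnable code, in the spirit of the slide's pseudocode (the actual KTO implementation additionally scales by beta and estimates a KL baseline at the batch level, so treat this as an illustration of where the clamp sits, not as the real loss):

import torch

def kto_style_terms(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=1.0):
    chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on chosen
    rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on rejected

    # The clamp(min=0) is the band-aid: once the policy already assigns the
    # rejected response less probability than the reference does, pushing it
    # further down no longer improves term1, so the degenerate
    # "zero out everything" direction stops paying off.
    term1 = torch.sigmoid(beta * (chosen_logratios - rejected_logratios.clamp(min=0)))
    term2 = torch.sigmoid(beta * (chosen_logratios.clamp(min=0) - rejected_logratios))
    return term1, term2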

6 of 9

The Overfitting Test

  • DPO and IPO using log(1-prob(rejected_token)) for the rejected term instead

  • They overfit nicely
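
A sketch of one way to make that swap, under my own assumptions (per-token unlikelihood terms summed over the rejected response, dropped into the DPO logits where the summed log prob used to be; the reference-side term is passed in precomputed, and the author's actual implementation may differ):

import torch
import torch.nn.functional as F

def unlikelihood_rejected_term(rejected_token_logps):
    # rejected_token_logps: per-token log p(token) for the rejected response, shape (batch, seq).
    # sum(log(1 - p)) is bounded above by 0, so driving p(rejected_token) -> 0
    # yields no unbounded payoff, unlike sum(log p).
    p = rejected_token_logps.exp().clamp(max=1.0 - 1e-6)  # avoid log(0)
    return torch.log1p(-p).sum(dim=-1)

def modified_dpo_loss(policy_chosen_logps, policy_rejected_token_logps,
                      ref_chosen_logps, ref_rejected_term, beta=0.1):
    # Same structure as the DPO loss, but the policy's rejected side uses the
    # unlikelihood-style term in place of log p(rejected).
    policy_rejected_term = unlikelihood_rejected_term(policy_rejected_token_logps)
    logits = (policy_chosen_logps - policy_rejected_term
              - ref_chosen_logps + ref_rejected_term)
    return -F.logsigmoid(beta * logits)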

7 of 9

The Overfitting Test

  • Possible that all the “overfitting” in DPO is a manifestation of the degenerate solution caused by log(prob(rejected_token))

  • And all the DPO derivatives and different “constraining” methods are band-aids to prevent it

  • When in fact, the solution should be to compute the rejected log probs differently

8 of 9

The Overfitting Test

What’s wrong with the DPO math that led to this?

A: The Bradley-Terry model:

  • The probability that chosen > rejected is a sigmoid of the difference
    • sigmoid(reward(chosen) - reward(rejected))
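
For completeness, the standard DPO algebra (not from the slides) that makes the unbounded rejected term explicit:

% Bradley-Terry: the preference probability depends only on the reward difference.
\[
  P(y_w \succ y_l \mid x) \;=\; \sigma\big(r(x, y_w) - r(x, y_l)\big)
\]
% DPO plugs in its implicit reward r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ):
\[
  \mathcal{L}_{\mathrm{DPO}}
  \;=\; -\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
\]
% The rejected response y_l enters only through -beta * log pi_theta(y_l | x), which is
% unbounded: the loss can always be reduced by pushing pi_theta(y_l | x) toward zero,
% without ever increasing pi_theta(y_w | x).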

9 of 9

The Overfitting Test

Observations and Open questions

  • Now it overfits. Can it generalize? Is the new formulation going to affect generalization?

  • Tried the new formulation on a synthetic dataset for changing a small fact (a person’s phone number), and it works much worse than DPO
    • Reason: not sure, but DPO does a better job at this kind of surgical change to the model’s knowledge