1 of 9

The Overfitting Test

  • DPO and its derivatives all fail to overfit a small training dataset
  • But changing the loss to cross-entropy on the chosen response does overfit

Example of a naive task where DPO fails:

  • Create a synthetic dataset of chats between a user and an assistant such that whenever the user’s message contains the word “joke”, the assistant replies “I don’t know any jokes”.
  • DPO fails at this, while plain NLL learns it very quickly and from very little data (a sketch of such a dataset follows below)
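
A minimal sketch of how such a dataset could be constructed; the templates and rejected replies below are my own illustration, not the author's exact setup:

import random

# Any user message containing the word "joke" maps to the fixed chosen reply;
# some other plausible reply is used as the rejected response for preference training.
USER_TEMPLATES = [
    "Tell me a joke about {topic}.",
    "Do you know any good joke?",
    "I could use a joke about {topic} right now.",
]
REJECTED_REPLIES = [
    "Sure! Why did the chicken cross the road?",
    "Here's one: knock knock...",
]

def make_example(topic: str) -> dict:
    prompt = random.choice(USER_TEMPLATES).format(topic=topic)
    return {
        "prompt": prompt,
        "chosen": "I don't know any jokes",
        "rejected": random.choice(REJECTED_REPLIES),
    }

dataset = [make_example(t) for t in ["cats", "math", "coffee", "rain"]]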

2 of 9

The Overfitting Test

  • DPO and its derivatives all fail to overfit a small training dataset
  • But changing the loss to cross-entropy on the chosen response does overfit

DPO loss:

losses = -F.logsigmoid(self.beta * (policy_chosen_logps - policy_rejected_logps - ref_chosen_logps + ref_rejected_logps))

SFT loss:

losses = -policy_chosen_logps

(I will talk about KTO later. For now, focus on DPO and IPO)
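
For reference, a self-contained sketch of the two losses above, assuming the per-sequence (summed) log probs are already available; the tensor values and beta=0.1 are just illustrative:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO: push the policy's chosen-vs-rejected log-prob margin above the
    # reference model's margin, squashed through a log-sigmoid.
    logits = (policy_chosen_logps - policy_rejected_logps
              - ref_chosen_logps + ref_rejected_logps)
    return -F.logsigmoid(beta * logits)

def sft_loss(policy_chosen_logps):
    # SFT / cross-entropy on the chosen response: plain NLL.
    return -policy_chosen_logps

# Toy per-sequence log probs for a batch of two examples.
policy_chosen = torch.tensor([-5.0, -7.0])
policy_rejected = torch.tensor([-6.0, -6.5])
ref_chosen = torch.tensor([-5.5, -7.2])
ref_rejected = torch.tensor([-5.8, -6.4])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
print(sft_loss(policy_chosen))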

3 of 9

The Overfitting Test

  • So, which part of the loss is preventing it from training?
  • But changing the loss to cross-entropy on the chosen response does overfit

Overfits quickly:

losses = -F.logsigmoid(policy_chosen_logps)

Overfits slowly, with a ceiling because of the reference model (as expected):

losses = -F.logsigmoid(self.beta * (policy_chosen_logps - ref_chosen_logps))

Fails to overfit:

losses = -policy_chosen_logps + policy_rejected_logps

4 of 9

The Overfitting Test

  • Something is wrong with policy_rejected_logps
  • It shouldn't prevent overfitting (as evidenced by the unlikelihood training paper)
  • Here’s the main difference in how the rejected log probs are computed:
    • DPO: -log(prob(rejected_token))
    • Unlikelihood: log(1-prob(rejected_token))
  • This makes a big difference
    • blue: chosen_log_probs
    • red: rejected_log_probs_in_unlikelihood
    • yellow: rejected_log_probs_in_dpo

  • Yellow leads to a degenerate solution (see the numeric sketch below)
    • “Why learn blue when I can get -inf loss by setting the probability of everything to zero?”
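
A small numeric sketch (my own illustration) of why the yellow curve dominates: as p(rejected) goes to 0, DPO's rejected term log(p) is unbounded below, while the unlikelihood term log(1-p) saturates at 0:

import torch

# Per-token probability assigned to the rejected token, shrinking toward 0.
p_rejected = torch.tensor([0.5, 0.1, 1e-3, 1e-6, 1e-12])

dpo_term = torch.log(p_rejected)              # rejected term in the ablated loss: log(p)
unlikelihood_term = torch.log1p(-p_rejected)  # unlikelihood-style term: log(1 - p)

# A loss containing "+ log p(rejected)" can keep decreasing forever by pushing
# p(rejected) -> 0 (log p -> -inf) without touching the chosen tokens, while
# log(1 - p) flattens out near 0.
for p, d, u in zip(p_rejected.tolist(), dpo_term.tolist(), unlikelihood_term.tolist()):
    print(f"p={p:.0e}   log(p)={d:9.2f}   log(1-p)={u:11.2e}")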

5 of 9

The Overfitting Test

  • So why does KTO overfit?
  • This is because of a band-aid

DPO and IPO

logits = policy_chosen_logps - policy_rejected_logps - reference_chosen_logps + reference_rejected_logps

KTO

  • Note the .clamp(min=0), a band-aid to prevent the degenerate solution

term1 = sigmoid(policy_chosen_logps - ref_chosen_logps - (policy_rejected_logps - ref_rejected_logps).clamp(min=0))

term2 = sigmoid((policy_chosen_logps - ref_chosen_logps).clamp(min=0) - (policy_rejected_logps - ref_rejected_logps))
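
Restating the two terms as runnable code, in the spirit of the slide's pseudocode (the actual KTO implementation additionally scales by beta and estimates a KL baseline at the batch level, so treat this as an illustration of where the clamp sits, not as the real loss):

import torch

def kto_style_terms(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=1.0):
    chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on chosen
    rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on rejected

    # The clamp(min=0) is the band-aid: once the policy already assigns the
    # rejected response less probability than the reference does, pushing it
    # further down no longer improves term1, so the degenerate
    # "zero out everything" direction stops paying off.
    term1 = torch.sigmoid(beta * (chosen_logratios - rejected_logratios.clamp(min=0)))
    term2 = torch.sigmoid(beta * (chosen_logratios.clamp(min=0) - rejected_logratios))
    return term1, term2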

6 of 9

The Overfitting Test

  • DPO and IPO using log(1-prob(rejected_token)) for the rejected term instead

  • They overfit nicely
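
A sketch of one way to make that swap, under my own assumptions (per-token unlikelihood terms summed over the rejected response, dropped into the DPO logits where the summed log prob used to be; the reference-side term is passed in precomputed, and the author's actual implementation may differ):

import torch
import torch.nn.functional as F

def unlikelihood_rejected_term(rejected_token_logps):
    # rejected_token_logps: per-token log p(token) for the rejected response, shape (batch, seq).
    # sum(log(1 - p)) is bounded above by 0, so driving p(rejected_token) -> 0
    # yields no unbounded payoff, unlike sum(log p).
    p = rejected_token_logps.exp().clamp(max=1.0 - 1e-6)  # avoid log(0)
    return torch.log1p(-p).sum(dim=-1)

def modified_dpo_loss(policy_chosen_logps, policy_rejected_token_logps,
                      ref_chosen_logps, ref_rejected_term, beta=0.1):
    # Same structure as the DPO loss, but the policy's rejected side uses the
    # unlikelihood-style term in place of log p(rejected).
    policy_rejected_term = unlikelihood_rejected_term(policy_rejected_token_logps)
    logits = (policy_chosen_logps - policy_rejected_term
              - ref_chosen_logps + ref_rejected_term)
    return -F.logsigmoid(beta * logits)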

7 of 9

The Overfitting Test

  • Possible that all the “overfitting” in DPO is a manifestation of the degenerate solution caused by log(prob(rejected_token))

  • And all the DPO derivatives and different “constraining” methods are band-aids to prevent it

  • When in fact, the solution should be to compute the rejected log probs differently

8 of 9

The Overfitting Test

What’s wrong with the DPO math that led to this?

A: The Bradley-Terry model:

  • The probability that chosen > rejected is a sigmoid of the difference
    • sigmoid(reward(chosen) - reward(rejected))
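
For completeness, the standard DPO algebra (not from the slides) that makes the unbounded rejected term explicit:

% Bradley-Terry: the preference probability depends only on the reward difference.
\[
  P(y_w \succ y_l \mid x) \;=\; \sigma\big(r(x, y_w) - r(x, y_l)\big)
\]
% DPO plugs in its implicit reward r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ):
\[
  \mathcal{L}_{\mathrm{DPO}}
  \;=\; -\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
\]
% The rejected response y_l enters only through -beta * log pi_theta(y_l | x), which is
% unbounded: the loss can always be reduced by pushing pi_theta(y_l | x) toward zero,
% without ever increasing pi_theta(y_w | x).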

9 of 9

The Overfitting Test

Observations and Open questions

  • Now it overfits. Can it generalize? Is the new formulation going to affect generalization?

  • Tried the new formulation on a synthetic dataset for changing a small fact (a person’s phone number), and it works much worse than DPO
    • Reason: not sure, but DPO does a better job at this kind of surgical change to the model’s knowledge