1 of 52

Why Language Models Hallucinate

OpenAI

arXiv, Sept. 2025

Presenters: Juan Rodriguez, Mike Zhu, Hannuo Zhang

Fall-2025 IFT6167

Paper Presentations

5 November 2025


3 of 52

Hallucination: overconfident, plausible falsehoods produced by LLMs.

4 of 52

The paper analyzes the statistical nature of generative errors, focusing on plausible falsehoods called hallucinations.

Even with error-free training data, the training objective naturally leads to errors. With real-world noisy data, error rates are even higher.

The study examines two key stages of training: pretraining and post-training, to explain both the origin and persistence of hallucinations.

5 of 52

Errors caused by pretraining:

The paper links generative error to binary classification: generating valid outputs is harder than answering “Is this a valid language model output?”

Any language model can act as an Is-It-Valid Classifier.

(Figure: example outputs classified as Valid vs. Invalid.)

6 of 52

Errors caused by post-training:

This analysis explains why models generate confident falsehoods instead of saying “I don’t know.”

  • Like students guessing on exams, LLMs guess when uncertain to maximize their expected score.
  • Benchmarks use binary metrics (accuracy, pass rate), rewarding correct-looking answers.

Thus, optimization for such benchmarks encourages hallucinations.

7 of 52

Pre-Training Errors

  • Pretraining produces a base language model that approximates its training distribution.

  • Generating valid outputs is harder than classifying output validity. This reduction enables analysis via computational learning theory.

  • Errors arise naturally from fitting the underlying language distribution; specific architectures can introduce additional errors.

8 of 52

TLDR: Problem

Language models hallucinate not because of mysterious or emergent behavior, but because the training and evaluation pipelines statistically reward guessing over expressing uncertainty.

9 of 52

TLDR: Why?

Hallucinations arise naturally from errors in binary classification and persist because benchmarks penalize uncertainty (IDK responses).

10 of 52

TLDR: Solution

Modify existing evaluation benchmarks to stop punishing abstentions and to reward uncertainty when appropriate.

11 of 52

Two-Stage Training

Pre-Training: Generative Modeling
  • The model learns the distribution of language in a large text corpus
  • The model learns “Autocomplete”
  • No “I don’t know” answers!

Post-Training: RL Reward Maximization
  • Post-training encourages conversation or correct answers
  • RL datasets and benchmarks encourage giving answers
  • No uncertainty encouragement!

12 of 52

Example: Learning to Identify Valid Generations


16 of 52

Why Do Hallucinations Appear During Pre-Training?

17 of 52

“Generative error is hard to quantify”

They introduce the Is-It-Valid (IIV) problem: a binary classification error that is easier to quantify

18 of 52

The Is-It-Valid (IIV) classification problem:

Theoretical reduction from generative modeling to binary classification: decide whether a string is valid (+) or an error (−).

  • Pretraining: models learn a language distribution.
  • Any generative model can be viewed as a classifier (see the sketch below).
  • Conclusion: hallucinations are statistically inevitable when the classifier (or model) cannot perfectly separate valid from invalid responses.
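A minimal sketch (not from the paper; the thresholding rule and the toy probabilities are illustrative assumptions) of how a language model’s assigned probability can induce an Is-It-Valid classifier:

```python
import math

# Toy stand-in for a language model's probability of a whole string.
# (Hypothetical values; a real implementation would sum token log-probs from an LLM.)
TOY_LM_PROBS = {
    "Paris is the capital of France.": 1e-3,
    "Paris is the capital of Italy.": 1e-9,
}

def lm_log_prob(text: str) -> float:
    return math.log(TOY_LM_PROBS.get(text, 1e-12))

def is_it_valid(text: str, log_threshold: float = math.log(1e-6)) -> bool:
    """IIV classifier induced by the language model: label a string '+' (valid)
    when the model assigns it probability above a chosen threshold, '-' (error) otherwise."""
    return lm_log_prob(text) > log_threshold

print(is_it_valid("Paris is the capital of France."))  # True  -> '+'
print(is_it_valid("Paris is the capital of Italy."))   # False -> '-'
```

The intuition behind the reduction: if the induced classifier cannot separate valid from invalid strings, the generator will sometimes produce the invalid ones.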

20 of 52

Reasons for Hallucination during Pre-Training

21 of 52

Reasons for Hallucination during Pre-Training

  1. Arbitrary-Fact Hallucinations
  2. Poor Model
  3. Computational Hardness
  4. Distribution Shift
  5. GIGO: Garbage In, Garbage Out

22 of 52

1. Arbitrary-fact hallucinations:

The necessary knowledge is not in the training data

23 of 52

1. Arbitrary-fact hallucinations (cont.):

Examples of singletons (facts that appear only once in the training data):

  The birthday of Bob is …
  The birthday of Alice is …

Note: we want the singleton rate sr to be small; there is no pattern to learn from these samples! (A toy estimate is sketched below.)
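A toy sketch (not from the slides; the fact representation and the corpus are made up) of estimating the singleton rate sr, here taken as the fraction of distinct facts seen exactly once in the training data:

```python
from collections import Counter

# Hypothetical training corpus of atomic facts: (person, attribute, value).
facts = [
    ("Alice", "birthday", "March 3"),
    ("Bob", "birthday", "July 9"),
    ("Bob", "birthday", "July 9"),        # Bob's birthday appears twice
    ("Carol", "birthday", "May 21"),
    ("Einstein", "birthday", "March 14"),
    ("Einstein", "birthday", "March 14"),
]

counts = Counter(facts)
singletons = [fact for fact, c in counts.items() if c == 1]

# Singleton facts have no repetition (and, for arbitrary facts, no pattern),
# so the model can only memorize them; a large sr leaves a large share of
# prompts on which guessing, and hence hallucination, is likely.
sr = len(singletons) / len(counts)
print(f"singleton rate sr = {sr:.2f}")  # 2 of 4 distinct facts -> sr = 0.50
```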

24 of 52

2. Poor model:

Poor architecture choice, poor hyperparameters, or an underfit model

e.g., a language model based on CNNs

25 of 52

3. Computational Hardness:

The problem is too complex

Examples:

  • A really hard math problem
  • A coding problem in an unfamiliar language or format (e.g., AutoCAD, SVG)
  • Generating an image in ASCII art

26 of 52

4. Distribution Shift

Out-of-distribution examples: the model is tested on a domain very different from its training data

27 of 52

5. GIGO: Garbage In, Garbage Out

The training dataset contains errors or is of poor quality

28 of 52

Summary

During pre-training:

  • Generative error is hard to quantify -> propose the Is-It-Valid (IIV) setup

  • Five reasons:
    • Arbitrary facts / low frequency
    • Poor model / architecture
    • Hard problems
    • OOD / distribution shift
    • GIGO: wrong data

29 of 52

Why Do Hallucinations Survive During Post-Training?

30 of 52

Example: Standardized Human Exams

Binary Evaluation Metric (a student taking a standardized exam):

  Correct → Score: +1
  Wrong → Score: 0
  IDK (“I don’t know”) → Score: 0

31 of 52

Example: Standardized Human Exams

Binary Evaluation Metric

If two students are uncertain about an answer:

  Student A: indicates uncertainty (“I don’t know”) → Score: 0
  Student B: makes a correct guess → Score: +1

  • Student B (making a guess) is likely to achieve a higher score than Student A (expressing uncertainty, e.g. “IDK”)
  • Students therefore tend to guess and fabricate a plausible answer instead of expressing uncertainty

32 of 52

Binary Evaluation Metrics Reward Guessing and Penalize Uncertainty

Many language model benchmarks mirror standardized human exams:

  Model A: indicates uncertainty (“IDK”) and never hallucinates → Score: 0
  Model B: hallucinates and makes a guess (when the guess happens to be correct) → Score: +1

  • Model B (which hallucinates) is likely to achieve higher scores than Model A (which never hallucinates)
  • Language models therefore tend to guess and fabricate a plausible answer instead of expressing uncertainty (see the toy calculation below)
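A toy calculation (not from the slides; the probabilities are made up) of expected scores under a binary metric, assuming the model’s guess is correct with probability p:

```python
# Binary metric: correct = +1, wrong or IDK = 0.
def expected_score_guess(p: float) -> float:
    # Guessing: right with probability p (+1), wrong with probability 1 - p (0).
    return p * 1.0 + (1.0 - p) * 0.0

def expected_score_idk() -> float:
    # Abstaining always scores 0 under a binary metric.
    return 0.0

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p={p:.2f}: guess={expected_score_guess(p):.2f}  IDK={expected_score_idk():.2f}")

# Even at p = 0.01, guessing never scores worse than saying IDK,
# so a score-maximizing model (like Model B) always guesses.
```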

33 of 52

How to Mitigate Hallucinations during Post-Training?

34 of 52

Solution: Trinary Evaluation Metric

Confidence threshold t:

  Correct → Score: +1
  Wrong → Score: -t/(1-t)
  IDK → Score: 0

35 of 52

Solution: Trinary Evaluation Metric
Mathematical Rationale

Let p denote the model’s confidence that its answer is correct, and t the confidence threshold.

  If p is high → the model is more confident
  If p is low → the model is less confident / more uncertain


37 of 52

Solution: Trinary Evaluation Metric
Mathematical Rationale

If p > t, the expected score of answering, p·(+1) - (1-p)·t/(1-t), is positive:

  • The model expects a positive reward
  • It has sufficient confidence in its answer
  • Thus, it chooses to respond

38 of 52

Solution: Trinary Evaluation Metric
Mathematical Rationale

If p < t, the expected score of answering is negative:

  • The model is less confident and expects that responding will result in a negative reward
  • IDK yields a score of 0
  • So the model indicates uncertainty (“IDK”) instead of making a random guess or hallucinating (see the sketch below)
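A minimal sketch (assuming the +1 / -t/(1-t) / 0 scores reconstructed above, with p the model’s confidence) of the expected score of answering and the resulting answer-or-IDK decision rule:

```python
# Trinary metric: correct = +1, wrong = -t/(1-t), IDK = 0.
def expected_score_answer(p: float, t: float) -> float:
    penalty = t / (1.0 - t)
    return p * 1.0 - (1.0 - p) * penalty

def should_answer(p: float, t: float) -> bool:
    # Answering beats IDK (score 0) exactly when p > t.
    return expected_score_answer(p, t) > 0.0

t = 0.75
for p in (0.90, 0.75, 0.60):
    print(f"p={p:.2f}: E[score]={expected_score_answer(p, t):+.2f}  answer={should_answer(p, t)}")

# p=0.90 -> +0.60: confident enough, so answer
# p=0.75 -> +0.00: exactly at the threshold, indifferent
# p=0.60 -> -0.60: better to say IDK than to guess
```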

39 of 52

Summary

During post-training:

  • Current language model benchmarks use a binary evaluation metric:
    • Correct: +1
    • Wrong / IDK: 0
  • The binary evaluation metric rewards guessing, penalizes uncertainty, and causes language models to hallucinate
  • Solution: trinary evaluation metric with confidence threshold t
    • Correct: +1
    • Wrong: -t/(1-t)
    • IDK: 0

40 of 52

Limitations

41 of 52

Limitations

Search / RAG can reduce hallucination


42 of 52

Limitations

Search / RAG can reduce hallucination

  • However, as long as we use binary evaluation metrics, models will still be rewarded for guessing, and hallucinations will persist

  • Search / RAG does not help with miscalculation, e.g. counting the letters in a given word

43 of 52

Limitations

Drawbacks of the proposed evaluation metrics

  • The authors propose a trinary evaluation metric:
    • Correct: +1
    • Wrong: -t/(1-t)
    • IDK: 0
  • But it does not consider the magnitude of errors or the degree of uncertainty
  • It is possible that an answer is partially correct

e.g. What is overfitting?

A: “Overfitting is when the model performs poorly on both training and test data.” ✗ (scored simply as wrong, even though it is only partially incorrect)

44 of 52

Key Takeaway

  • Hallucination: overconfident but plausible false statements generated by language models
  • Pre-training: pretraining errors cause hallucinations

  • Post-training: binary evaluation metrics reward guessing → hallucination
  • Proposed solution: a trinary evaluation metric

45 of 52


Thanks for your attention!

46 of 52

Q&A

47 of 52

  • It relates to one of the reasons they give: Poor Model / Architecture
  • The right architecture makes a big difference
  • Though Transformer-based models currently show the best results
  • Focal loss or temperature scaling? Reasonable to try
  • Modifying rewards/evaluation is simpler and more elegant

48 of 52

  • It can help to mitigate hallucination
  • Because instruction fine-tuning and RLHF help language models mirror human behaviours
  • Humans learn to express uncertainty outside of school

49 of 52

  • They actually reveal the confidence/logits
  • Useful for observing the confidence a model assigns to a token given the context (see the sketch below)
  • For current LLMs, it might be high
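A minimal sketch (assuming the Hugging Face transformers and PyTorch packages and the small public gpt2 checkpoint; any open causal LM would do) of inspecting the confidence a model assigns to candidate next tokens given a context:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Log-probabilities for the token that would follow the prompt.
next_token_logprobs = torch.log_softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_logprobs, k=5)

for logprob, token_id in zip(top.values, top.indices):
    token = tokenizer.decode([int(token_id)])
    print(f"{token!r:>12}  logprob={logprob.item():.2f}")
```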

50 of 52

  • Good idea for a benchmark paper and post-training recipe.
  • We did not find a paper that looks at this approach
  • Sounds appealing from an RL standpoint: rollout -> reward

51 of 52

  • How do we define valuable and non-valuable facts?
  • One could use an embedding model, a rule-based algorithm, or human annotation.
  • Could we train a classifier on human annotations?
  • Precision/recall problem -> risk of wrongly removing many facts (false positives).
  • Frontier labs have probably tried this.

52 of 52

  • This only affects the post-training (fine-tuning) phase
  • With binary evaluation metrics, the model tends to make a random guess when it is NOT confident about an answer
  • With trinary evaluation metrics (the proposed solution), the model tends to express IDK when it is NOT confident enough