1 of 52

Why Language Models Hallucinate

OpenAI

arXiv, Sept. 2025

Presenters: Juan Rodriguez, Mike Zhu, Hannuo Zhang

Fall-2025 IFT6167

Paper Presentations

5 November 2025


3 of 52

Hallucination: overconfident, plausible falsehoods produced by LLMs.

4 of 52

The paper analyzes the statistical nature of generative errors, focusing on plausible falsehoods called hallucinations.

Even with error-free training data, the training objective naturally leads to errors. With real-world noisy data, error rates are even higher.

The study examines two key stages of training: pretraining and post-training, to explain both the origin and persistence of hallucinations.

5 of 52

Errors caused by pretraining:

The paper links generative error to binary classification: generating valid outputs is harder than answering “Is this a valid language model output?”

Any language model can act as an Is-It-Valid Classifier.

(Figure: example outputs classified as Valid vs. Invalid.)

6 of 52

Errors caused by post-training:

This analysis explains why models generate confident falsehoods instead of saying “I don’t know.”

  • Like students guessing on exams, LLMs guess when uncertain to maximize their expected score.
  • Benchmarks use binary metrics (accuracy, pass rate), rewarding correct-looking answers.

Thus, optimization for such benchmarks encourages hallucinations.

7 of 52

Pre-Training Errors

  • Pretraining produces a base language model that approximates its training distribution.

  • Generating valid outputs is harder than classifying output validity. This reduction enables analysis via computational learning theory.

  • Errors arise naturally from fitting the underlying language distribution; specific architectures can introduce additional errors.

8 of 52

TLDR: Problem

Language models hallucinate not because of mysterious or emergent behavior, but because the training and evaluation pipelines statistically reward guessing over expressing uncertainty.

9 of 52

TLDR: Why?

Hallucinations arise naturally from errors in binary classification and persist because benchmarks penalize uncertainty (IDK responses).

10 of 52

TLDR: Solution

Modify existing evaluation benchmarks to stop punishing abstentions and to reward uncertainty when appropriate.

11 of 52

Two-Stage Training

Pre-Training: Generative Modeling
  • The model learns the distribution of language in a large text corpus
  • The model learns “Autocomplete”
  • No “I don’t know” answers!

Post-Training: RL Reward Maximization
  • Post-training encourages conversation or correct answers
  • RL datasets and benchmarks encourage giving answers
  • No uncertainty encouragement!

12 of 52

Example: Learning to Identify Valid Generations


16 of 52

Why Do Hallucinations Appear During Pre-Training?

17 of 52

“Generative error is hard to quantify”

They introduce the Is-It-Valid (IIV) problem: a binary classification error that is easier to quantify

18 of 52

The Is-It-Valid (IIV) classification problem:

Theoretical reduction from generative modeling to binary classification: decide whether a string is valid (+) or an error (−).

  • Pretraining: models learn a language distribution.
  • Any generative model can be viewed as a classifier (see the sketch below).
  • Conclusion: hallucinations are statistically inevitable when the classifier (or model) cannot perfectly separate valid from invalid responses.
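A minimal sketch (not from the paper; the thresholding rule and the toy probabilities are illustrative assumptions) of how a language model’s assigned probability can induce an Is-It-Valid classifier:

```python
import math

# Toy stand-in for a language model's probability of a whole string.
# (Hypothetical values; a real implementation would sum token log-probs from an LLM.)
TOY_LM_PROBS = {
    "Paris is the capital of France.": 1e-3,
    "Paris is the capital of Italy.": 1e-9,
}

def lm_log_prob(text: str) -> float:
    return math.log(TOY_LM_PROBS.get(text, 1e-12))

def is_it_valid(text: str, log_threshold: float = math.log(1e-6)) -> bool:
    """IIV classifier induced by the language model: label a string '+' (valid)
    when the model assigns it probability above a chosen threshold, '-' (error) otherwise."""
    return lm_log_prob(text) > log_threshold

print(is_it_valid("Paris is the capital of France."))  # True  -> '+'
print(is_it_valid("Paris is the capital of Italy."))   # False -> '-'
```

The intuition behind the reduction: if the induced classifier cannot separate valid from invalid strings, the generator will sometimes produce the invalid ones.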

20 of 52

Reasons for Hallucination during Pre-Training

21 of 52

Reasons for Hallucination during Pre-Training

  1. Arbitrary-Fact Hallucinations
  2. Poor Model
  3. Computational Hardness
  4. Distribution Shift
  5. GIGO: Garbage In, Garbage Out

22 of 52

1. Arbitrary-fact hallucinations:

The necessary knowledge is not in the training data

23 of 52

1. Arbitrary-fact hallucinations (cont.):

Examples of singletons (facts that appear only once in the training data):

  The birthday of Bob is …
  The birthday of Alice is …

Note: we want the singleton rate sr to be small; there is no pattern to learn from these samples! (A toy estimate is sketched below.)
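A toy sketch (not from the slides; the fact representation and the corpus are made up) of estimating the singleton rate sr, here taken as the fraction of distinct facts seen exactly once in the training data:

```python
from collections import Counter

# Hypothetical training corpus of atomic facts: (person, attribute, value).
facts = [
    ("Alice", "birthday", "March 3"),
    ("Bob", "birthday", "July 9"),
    ("Bob", "birthday", "July 9"),        # Bob's birthday appears twice
    ("Carol", "birthday", "May 21"),
    ("Einstein", "birthday", "March 14"),
    ("Einstein", "birthday", "March 14"),
]

counts = Counter(facts)
singletons = [fact for fact, c in counts.items() if c == 1]

# Singleton facts have no repetition (and, for arbitrary facts, no pattern),
# so the model can only memorize them; a large sr leaves a large share of
# prompts on which guessing, and hence hallucination, is likely.
sr = len(singletons) / len(counts)
print(f"singleton rate sr = {sr:.2f}")  # 2 of 4 distinct facts -> sr = 0.50
```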

24 of 52

2. Poor model:

Poor architecture choice, poor hyperparameters, or an underfit model

e.g., a language model based on CNNs

25 of 52

3. Computational Hardness:

The problem is too complex

Examples:

  • A really hard math problem
  • A coding problem in an unfamiliar language or format (e.g., AutoCAD, SVG)
  • Generating an image in ASCII art

26 of 52

4. Distribution Shift

Out-of-distribution examples: the model is tested on a domain very different from its training data

27 of 52

5. GIGO: Garbage In, Garbage Out

The training dataset contains errors or is of poor quality

28 of 52

Summary

During pre-training:

  • Generative error is hard to quantify -> propose the Is-It-Valid (IIV) setup

  • Five reasons:
    • Arbitrary facts / low frequency
    • Poor model / architecture
    • Hard problems
    • OOD / distribution shift
    • GIGO: wrong data

29 of 52

Why Do Hallucinations Survive During Post-Training?

30 of 52

Example: Standardized Human Exams

Binary Evaluation Metric (a student taking a standardized exam):

  Correct → Score: +1
  Wrong → Score: 0
  IDK (“I don’t know”) → Score: 0

31 of 52

Example: Standardized Human Exams

Binary Evaluation Metric

If two students are uncertain about an answer:

  Student A: indicates uncertainty (“I don’t know”) → Score: 0
  Student B: makes a correct guess → Score: +1

  • Student B (making a guess) is likely to achieve a higher score than Student A (expressing uncertainty, e.g. “IDK”)
  • Students therefore tend to guess and fabricate a plausible answer instead of expressing uncertainty

32 of 52

Binary Evaluation Metrics Reward Guessing and Penalize Uncertainty

Many language model benchmarks mirror standardized human exams:

  Model A: indicates uncertainty (“IDK”) and never hallucinates → Score: 0
  Model B: hallucinates and makes a guess (when the guess happens to be correct) → Score: +1

  • Model B (which hallucinates) is likely to achieve higher scores than Model A (which never hallucinates)
  • Language models therefore tend to guess and fabricate a plausible answer instead of expressing uncertainty (see the toy calculation below)
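A toy calculation (not from the slides; the probabilities are made up) of expected scores under a binary metric, assuming the model’s guess is correct with probability p:

```python
# Binary metric: correct = +1, wrong or IDK = 0.
def expected_score_guess(p: float) -> float:
    # Guessing: right with probability p (+1), wrong with probability 1 - p (0).
    return p * 1.0 + (1.0 - p) * 0.0

def expected_score_idk() -> float:
    # Abstaining always scores 0 under a binary metric.
    return 0.0

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p={p:.2f}: guess={expected_score_guess(p):.2f}  IDK={expected_score_idk():.2f}")

# Even at p = 0.01, guessing never scores worse than saying IDK,
# so a score-maximizing model (like Model B) always guesses.
```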

33 of 52

How to Mitigate Hallucinations during Post-Training?

34 of 52

Solution: Trinary Evaluation Metric

Confidence threshold t:

  Correct → Score: +1
  Wrong → Score: -t/(1-t)
  IDK → Score: 0

35 of 52

Solution: Trinary Evaluation Metric
Mathematical Rationale

Let p denote the model’s confidence that its answer is correct, and t the confidence threshold.

  If p is high → the model is more confident
  If p is low → the model is less confident / more uncertain


37 of 52

Solution: Trinary Evaluation Metric
Mathematical Rationale

If p > t, the expected score of answering, p·(+1) - (1-p)·t/(1-t), is positive:

  • The model expects a positive reward
  • It has sufficient confidence in its answer
  • Thus, it chooses to respond

38 of 52

Solution: Trinary Evaluation Metric
Mathematical Rationale

If p < t, the expected score of answering is negative:

  • The model is less confident and expects that responding will result in a negative reward
  • IDK yields a score of 0
  • So the model indicates uncertainty (“IDK”) instead of making a random guess or hallucinating (see the sketch below)
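A minimal sketch (assuming the +1 / -t/(1-t) / 0 scores reconstructed above, with p the model’s confidence) of the expected score of answering and the resulting answer-or-IDK decision rule:

```python
# Trinary metric: correct = +1, wrong = -t/(1-t), IDK = 0.
def expected_score_answer(p: float, t: float) -> float:
    penalty = t / (1.0 - t)
    return p * 1.0 - (1.0 - p) * penalty

def should_answer(p: float, t: float) -> bool:
    # Answering beats IDK (score 0) exactly when p > t.
    return expected_score_answer(p, t) > 0.0

t = 0.75
for p in (0.90, 0.75, 0.60):
    print(f"p={p:.2f}: E[score]={expected_score_answer(p, t):+.2f}  answer={should_answer(p, t)}")

# p=0.90 -> +0.60: confident enough, so answer
# p=0.75 -> +0.00: exactly at the threshold, indifferent
# p=0.60 -> -0.60: better to say IDK than to guess
```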

39 of 52

Summary

During post-training:

  • Current language model benchmarks use a binary evaluation metric:
    • Correct: +1
    • Wrong / IDK: 0
  • The binary evaluation metric rewards guessing, penalizes uncertainty, and causes language models to hallucinate
  • Solution: trinary evaluation metric with confidence threshold t
    • Correct: +1
    • Wrong: -t/(1-t)
    • IDK: 0

40 of 52

Limitations

41 of 52

Limitations

Search / RAG can reduce hallucination


42 of 52

Limitations

Search / RAG can reduce hallucination

  • However, as long as we use binary evaluation metrics, models will still be rewarded for guessing, and hallucinations will persist

  • Search / RAG does not help with miscalculation, e.g. counting the letters in a given word

43 of 52

Limitations

Drawbacks of the proposed evaluation metrics

  • The authors propose a trinary evaluation metric:
    • Correct: +1
    • Wrong: -t/(1-t)
    • IDK: 0
  • But it does not consider the magnitude of errors or the degree of uncertainty
  • It is possible that an answer is partially correct

e.g. What is overfitting?

A: “Overfitting is when the model performs poorly on both training and test data.” ✗ (scored simply as wrong, even though it is only partially incorrect)

44 of 52

Key Takeaway

  • Hallucination: overconfident but plausible false statements generated by language models
  • Pre-training: pretraining errors cause hallucinations

  • Post-training: binary evaluation metrics reward guessing → hallucination
  • Proposed solution: a trinary evaluation metric

45 of 52


Thanks for your attention!

46 of 52

Q&A

47 of 52

  • It relates to one of the reasons they give: Poor Model / Architecture
  • The right architecture makes a big difference
  • Though Transformer-based models currently show the best results
  • Focal loss or temperature scaling? Reasonable to try
  • Modifying rewards/evaluation is simpler and more elegant

48 of 52

  • It can help to mitigate hallucination
  • Because instruction fine-tuning and RLHF help language models mirror human behaviours
  • Humans learn to express uncertainty outside of school

49 of 52

  • They actually reveal the confidence/logits
  • Useful for observing the confidence a model assigns to a token given the context (see the sketch below)
  • For current LLMs, it might be high
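A minimal sketch (assuming the Hugging Face transformers and PyTorch packages and the small public gpt2 checkpoint; any open causal LM would do) of inspecting the confidence a model assigns to candidate next tokens given a context:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Log-probabilities for the token that would follow the prompt.
next_token_logprobs = torch.log_softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_logprobs, k=5)

for logprob, token_id in zip(top.values, top.indices):
    token = tokenizer.decode([int(token_id)])
    print(f"{token!r:>12}  logprob={logprob.item():.2f}")
```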

50 of 52

  • Good idea for a benchmark paper and post-training recipe.
  • We did not find a paper that looks at this approach
  • Sounds appealing from an RL standpoint: rollout -> reward

51 of 52

  • How do we define valuable and non-valuable facts?
  • One could use an embedding model, a rule-based algorithm, or human annotation.
  • Could we train a classifier on human annotations?
  • Precision/recall problem -> risk of wrongly removing many facts (false positives).
  • Frontier labs have probably tried this.

52 of 52

  • This only affects the post-training (fine-tuning) phase
  • With binary evaluation metrics, the model tends to make a random guess when it is NOT confident about an answer
  • With trinary evaluation metrics (the proposed solution), the model tends to express IDK when it is NOT confident enough