1
Why Language Models Hallucinate
OpenAI
arXiv, September 2025
Presenters: Juan Rodriguez, Mike Zhu, Hannuo Zhang
Fall-2025 IFT6167
Paper Presentations
5 November 2025
2
3
Hallucination: an overconfident, plausible falsehood produced by an LLM
4
The paper analyzes the statistical nature of generative errors, focusing on plausible falsehoods called hallucinations.
Even with error-free training data, the training objective naturally leads to errors. With real-world noisy data, error rates are even higher.
The study examines two key stages of training: pretraining and post-training, to explain both the origin and persistence of hallucinations.
5
Errors caused by pretraining:
Links generative error with binary classification. Generating valid outputs is harder than answering “Is this a valid language model output?” (binary classification)
Any language model can act as an Is-It-Valid Classifier.
(Figure: example statements labeled Valid (+) vs. Invalid (−))
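A minimal sketch of this idea (illustrative only, not the paper's code; the model here is a toy stand-in): any language model induces an Is-It-Valid classifier by thresholding the probability it assigns to a candidate string.

```python
import math

def lm_log_prob(text: str) -> float:
    # Toy stand-in: a real LM would return the sum of token log-probabilities.
    # Here, longer statements simply score lower, purely for illustration.
    return -0.9 * len(text.split())

def is_it_valid(text: str, threshold: float = math.log(1e-3)) -> str:
    """Label a candidate string Valid (+) or Invalid (-) by thresholding its LM probability."""
    return "Valid (+)" if lm_log_prob(text) >= threshold else "Invalid (-)"

for candidate in [
    "Paris is the capital of France.",
    "Paris is the capital of France, which annexed the Moon in 1999.",
]:
    print(f"{candidate!r} -> {is_it_valid(candidate)}")
```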
6
Errors caused by post-training:
Most benchmarks score only right or wrong, so optimizing for such benchmarks encourages hallucinations.
This analysis explains why models generate confident falsehoods instead of saying "I don't know."
7
Pre-Training Errors
8
TLDR
Problem
Language models hallucinate not because of mysterious or emergent behavior, but because the training and evaluation pipelines statistically reward guessing over expressing uncertainty.
9
TLDR
Why?
Hallucinations arise naturally from errors in binary classification and persist because benchmarks penalize uncertainty (IDK responses).
10
TLDR
Solution
Modify existing evaluation benchmarks to stop punishing abstentions and to reward uncertainty when appropriate.
11
Two-Stage Training
Pre-Training: Generative Modeling
The model learns the distribution of language in a large text corpus
The model learns "autocomplete"
No "I don't know" answers!
Post-Training: RL Reward Maximization
Post-training encourages conversation and correct answers
RL datasets and benchmarks encourage giving answers
No encouragement for uncertainty!
12
Example: Learning to Identify Valid Generations
16
Why Do Hallucinations Appear during Pre-Training?
17
“Generative error is hard to quantify”
They introduce the Is-It-Valid (IIV) problem: a binary classification error that is easier to quantify.
18
The Is-It-Valid (IIV) classification problem:
A theoretical reduction from generative modeling to binary classification.
Decide if a string is valid (+) or an error (−).
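Roughly, the flavor of the reduction (a hedged sketch, with constants and lower-order calibration terms omitted) is that the generative error rate is lower-bounded by the IIV misclassification rate:

```latex
% Sketch of the reduction (lower-order terms omitted):
% a generator that rarely errs yields a good IIV classifier, so conversely
% the generative error rate cannot be much smaller than the IIV error rate.
\[
  \mathrm{err}_{\text{generative}} \;\gtrsim\; 2\,\mathrm{err}_{\text{IIV}}
\]
```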
19
20
Reasons for Hallucination during Pre-Training?
21
Reasons for Hallucination during Pre-Training
22
1. Necessary knowledge not in the training data
23
Examples of singletons:
The birthday of Bob is …
The birthday of Alice is …
Note: We want the singleton rate (sr) to be small; there is no pattern in these one-off samples!
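The singletons above motivate the singleton rate sr: the fraction of distinct facts that appear exactly once in the training data. A minimal illustrative sketch of computing it (hypothetical toy data, not the paper's code):

```python
from collections import Counter

# Hypothetical toy "facts" extracted from a training corpus.
training_facts = [
    ("Alice", "birthday", "March 3"),
    ("Alice", "birthday", "March 3"),   # repeated fact: a pattern the model can learn
    ("Bob", "birthday", "July 19"),     # appears exactly once: a singleton
    ("Carol", "birthday", "May 7"),     # another singleton
]

counts = Counter(training_facts)
singletons = [fact for fact, c in counts.items() if c == 1]
sr = len(singletons) / len(counts)      # fraction of distinct facts seen exactly once

print(f"singleton rate sr = {sr:.2f}")  # here 2 of 3 distinct facts are singletons -> 0.67
```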
24
2. Poor model:
Poor architecture choice or hyperparameters, or the model is underfit
e.g. a language model based on CNNs
25
3. Computational Hardness:
The problem is too complex
Examples: (shown in figure)
26
4. Distribution Shift
Out-of-distribution examples: the model is tested on a domain very different from its training data
27
5. GIGO: Garbage In Garbage Out
The training dataset contains errors or is of poor quality
28
Summary
During pre-training: generative errors reduce to IIV misclassification, so some hallucination is statistically inevitable; it is driven by singleton facts, poor models, computational hardness, distribution shift, and noisy data (GIGO).
29
Why Do Hallucinations Survive Post-Training?
30
Example: Standardized Human Exams
A student taking a standardized exam, graded with a binary evaluation metric:
Correct → Score: +1
Wrong → Score: 0
IDK ("I don't know") → Score: 0
31
Example: Standardised Human Exams
Binary Evaluation Metric
If two students are uncertain about an answer:
Student A: Indicates an uncertainty: “I don’t know”
Student B: Make a correct guess
Score: 0
Score: +1
32
Binary Evaluation Metrics Reward Guessing and Penalize Uncertainty
Binary Evaluation Metric
Many language model benchmarks mirror standardized human exams
Model A: Indicates an uncertainty (“IDK”) and never hallucinates
Model B: Hallucinates and make a guess (If the guess is correct)
Score: 0
Score: +1
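A tiny illustrative sketch of this incentive (not from the paper's code): under binary grading, even a weak guesser earns a higher expected score than a model that honestly abstains.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score per question under a binary metric: guessing earns p_correct, IDK earns 0."""
    return 0.0 if abstain else p_correct

p = 0.3  # even a weak guesser...
print("Model A (always says IDK):", expected_binary_score(p, abstain=True))    # 0.0
print("Model B (always guesses): ", expected_binary_score(p, abstain=False))   # 0.3 > 0.0
```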
33
How to Mitigate Hallucinations during Post-Training?
34
Solution: Trinary Evaluation Metric
Confidence threshold t:
Correct → Score: +1
Wrong → Score: −t/(1−t)
IDK → Score: 0
35
Solution: Trinary Evaluation Metric
Mathematical Rationale
Confidence threshold t; the model answers with confidence p:
Correct → Score: +1
Wrong → Score: −t/(1−t)
IDK → Score: 0
If p is high: the model is more confident
If p is low: the model is less confident / more uncertain
36
Solution: Trinary Evaluation Metric
Mathematical Rationale
Confidence threshold t; scoring: Correct +1, Wrong −t/(1−t), IDK 0
Expected score of answering with confidence p:  p · (+1) + (1 − p) · (−t/(1−t))
Expected score of IDK: 0
37
Solution: Trinary Evaluation Metric
Mathematical Rationale
Scoring: Correct +1, Wrong −t/(1−t), IDK 0
If p > t: then p · (+1) + (1 − p) · (−t/(1−t)) > 0, so answering beats IDK
38
Solution: Trinary Evaluation Metric
Mathematical Rationale
Scoring: Correct +1, Wrong −t/(1−t), IDK 0
If p < t: then p · (+1) + (1 − p) · (−t/(1−t)) < 0, so "I don't know" (score 0) is the better choice
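A small illustrative sketch (assuming the scoring above: +1 correct, −t/(1−t) wrong, 0 for IDK) showing that answering has positive expected score exactly when confidence p exceeds the threshold t:

```python
def expected_answer_score(p: float, t: float) -> float:
    """Expected score of answering with confidence p: +1 if correct, -t/(1-t) if wrong."""
    penalty = t / (1.0 - t)
    return p * 1.0 + (1.0 - p) * (-penalty)

t = 0.75  # wrong answers cost t/(1-t) = 3 points; IDK always scores 0
for p in (0.60, 0.75, 0.90):
    e = expected_answer_score(p, t)
    choice = "answer" if e > 0 else "IDK"
    print(f"p = {p:.2f}: E[answer] = {e:+.2f} vs E[IDK] = 0.00 -> {choice}")
```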
39
Summary
During post-training: binary benchmarks reward guessing over abstaining, so hallucinations persist; scoring with an explicit confidence threshold makes honest "I don't know" answers the optimal strategy.
40
Limitations
41
Limitations
Search / RAG can reduce hallucination
Example:
42
Limitations
Search / RAG can reduce hallucinations,
but models will still be rewarded for guessing,
and hallucinations will persist
e.g. counting the letters in a given word, where retrieval does not help
43
Limitations
Drawbacks of the proposed evaluation metrics
e.g. "What is overfitting?"
A: "Overfitting is when the model performs poorly on both training and test data."
Part of this answer is right (poor test performance ✓) and part is wrong (poor training performance ✗): real answers are often partially correct, which a correct / wrong / IDK score cannot capture.
44
Key Takeaway
Hallucinations are a predictable statistical consequence of the training objective and of evaluations that reward guessing; changing benchmarks to stop penalizing "I don't know" is the practical fix.
45
Why Language Models Hallucinate
OpenAI
arXiv, September 2025
Presenters: Juan Rodriguez, Mike Zhu, Hannuo Zhang
Fall-2025 IFT6167
Paper Presentations
5 November 2025
Thanks for your attention!
46
Q&A