LLMs are Human-like Annotators
Mounika Marreddy1, Subba Reddy Oota2, Manish Gupta3
1IIIT Hyderabad, India; 2TU Berlin, Germany; 3Microsoft, India
mounika0559@gmail.com, subba.reddy.oota@tu-berlin.de, gmanish@microsoft.com
ECIR 2025
The 47th European Conference on Information Retrieval
Apr 6-10, 2025. Lucca, Italy.
Agenda
Deep Learning and Large Language Models
Basics: ANNs, CNNs, RNNs, LSTMs
NLP: Encoder-Decoder, Attention, Transformers, BERT, GPT, T0, BART, T5, …
Prompt-based models: GPT-3, T0/mT0, InstructGPT, Prompt tuning, …
GPT-3
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
InstructGPT
1. Supervised fine-tuning (SFT)
2. Reward model (RM) training
3. RL via proximal policy optimization (PPO) on RM
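Step 2 is the easiest piece to make concrete in a few lines: the reward model is trained on labeler preference pairs with a pairwise Bradley-Terry loss, maximizing the margin between the chosen and rejected response. A minimal sketch with toy reward values (the scalar-reward-head model that would produce them is assumed, not shown):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss from InstructGPT's reward-model stage:
    # maximize log sigmoid(r(chosen) - r(rejected)) over labeled pairs.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards for a batch of four preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.1])
r_rejected = torch.tensor([0.4, 0.5, 1.1, -0.9])
print(reward_model_loss(r_chosen, r_rejected))  # shrinks as chosen outscores rejected
```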
ChatGPT and Prompting
Summarization
Question Answering
Machine Translation
Ads Copywriting
Machine Reading Comprehension
Solving reasoning problems
Loads of LLMs and SLMs
GPT-4o
OpenAI o1, o3 …
LLaMA 3, 3.2, 3.3, …
Small language models
LLaMA 1
LLaMA 2
Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi and others. July 2023.
Helpfulness human evaluation on ~4k prompts
LLaMA 3
Meta Llama 3 instruct model
Meta Llama 3 pretrained model
LLaMA 3.2
| Category | Benchmark | Llama 3.2 11B | Llama 3.2 90B | Claude 3 Haiku | GPT-4o mini |
|---|---|---|---|---|---|
| Image: College-level Problems and Mathematical Reasoning | MMMU (val, 0-shot CoT, micro avg accuracy) | 50.7 | 60.3 | 50.2 | 59.4 |
| | MMMU-Pro, Standard (10 opts, test) | 33 | 45.2 | 27.3 | 42.3 |
| | MMMU-Pro, Vision (test) | 23.7 | 33.8 | 20.1 | 36.5 |
| | MathVista (testmini) | 51.5 | 57.3 | 46.4 | 56.7 |
| Image: Charts and Diagram Understanding | ChartQA (test, 0-shot CoT, relaxed accuracy) | 83.4 | 85.5 | 81.7 | - |
| | AI2 Diagram (test) | 91.1 | 92.3 | 86.7 | - |
| | DocVQA (test, ANLS) | 88.4 | 90.1 | 88.8 | - |
| Image: General VQA | VQAv2 (test) | 75.2 | 78.1 | - | - |
| Text: General | MMLU (0-shot, CoT) | 73 | 86 | 75.2 | 82 |
| Text: Math | MATH (0-shot, CoT) | 51.9 | 68 | 38.9 | 70.2 |
| Text: Reasoning | GPQA (0-shot, CoT) | 32.8 | 46.7 | 33.3 | 40.2 |
| Text: Multilingual | MGSM (0-shot, CoT) | 68.9 | 86.9 | 75.1 | 87 |
| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5 mini IT |
|---|---|---|---|---|---|
| General | MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69 |
| | Open-rewrite eval (0-shot, rougeL) | 41.6 | 40.1 | 31.2 | 34.5 |
| | TLDR9+ (test, 1-shot, rougeL) | 16.8 | 19 | 13.9 | 12.8 |
| | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Tool Use | BFCL V2 | 25.7 | 67 | 27.4 | 58.4 |
| | Nexus | 13.5 | 34.3 | 21 | 26.1 |
| Math | GSM8K (8-shot, CoT) | 44.4 | 77.7 | 62.5 | 86.2 |
| | MATH (0-shot, CoT) | 30.6 | 48 | 23.8 | 44.2 |
| Reasoning | ARC Challenge (0-shot) | 59.4 | 78.6 | 76.7 | 87.4 |
| | GPQA (0-shot) | 27.2 | 32.8 | 27.5 | 31.9 |
| | Hellaswag (0-shot) | 41.2 | 69.8 | 61.1 | 81.4 |
| Long Context | InfiniteBench/En.MC (128k) | 38 | 63.3 | - | 39.2 |
| | InfiniteBench/En.QA (128k) | 20.3 | 19.8 | - | 11.3 |
| | NIH/Multi-needle | 75 | 84.7 | - | 52.7 |
| Multilingual | MGSM (0-shot, CoT) | 24.5 | 58.2 | 40.2 | 49.8 |
GPT-4
GPT-4 Technical Report. OpenAI. https://cdn.openai.com/papers/gpt-4.pdf
Math word problems and reasoning QA
Mitra, Arindam, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen et al. "Orca 2: Teaching small language models how to reason." arXiv preprint arXiv:2311.11045 (2023).
Math word problems and reasoning QA
Mitra, Arindam, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen et al. "Orca 2: Teaching small language models how to reason." arXiv preprint arXiv:2311.11045 (2023).
Chart understanding and reasoning over data
Spot a data point that stands out in these charts and what that implicates. Then produce a detailed markdown table for all the data shown.
Image understanding and reasoning
Geometrical reasoning
Information seeking about objects
Multimodal reasoning based on visual cues
Multimodal humor understanding
Commonsense reasoning in a multilingual setting
Reasoning and code generation
Create a web app called "Opossum Search":
1. Every time you make a search query, it should redirect you to a google search with the same query, but the word opossum before it.
2. It should be visually similar to Google search,
3. Instead of the google logo, it should have a picture of an opossum from the internet.
4. It should be a single html file, no separate js or css files.
5. It should say "Powered by google search" in the footer
Mathematics: Calculus
Video understanding and reasoning
Agenda
Generating Annotations for NLP Tasks using LLMs
ChatGPT > crowd workers for text-annotation tasks
Accuracy means agreement with the trained annotators.
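As a concrete picture of this setup, here is a minimal zero-shot annotation loop in the spirit of that study; the model name, label set, and prompt wording are illustrative assumptions, not the paper's exact configuration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
LABELS = ["RELEVANT", "IRRELEVANT"]  # hypothetical task labels

def annotate(text: str) -> str:
    # Zero-shot prompt: a task description plus the item to label, nothing else.
    prompt = (
        "You are a text annotator. Classify the tweet below as "
        f"{' or '.join(LABELS)}. Answer with the label only.\n\nTweet: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; the study used ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic outputs make agreement measurable
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in LABELS else LABELS[-1]  # fall back on parse failure
```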
USDC: A Dataset of User Stance and Dogmatism in Long Conversations
USDC: A Dataset of User Stance and Dogmatism in Long Conversations
Failure Cases
Selective annotation and Prompt retrieval
Su, Hongjin, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang et al. "Selective annotation makes language models better few-shot learners." In ICLR. 2023.
In-context learning (ICL) performance over varying annotation budgets for HellaSwag commonsense reasoning (LLM: GPT-J), with as few as 100 annotated examples.
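A simplified sketch of the paper's two-step recipe: select a small, diverse subset to annotate, then retrieve the most similar annotated examples as in-context demonstrations at test time. Greedy farthest-point selection stands in here for the paper's vote-k method:

```python
import numpy as np

def select_for_annotation(emb: np.ndarray, budget: int) -> list[int]:
    # Greedily pick examples far (in cosine distance) from those already chosen,
    # so the annotation budget covers diverse regions of the embedding space.
    chosen = [0]
    for _ in range(budget - 1):
        dist = np.min(1 - emb @ emb[chosen].T, axis=1)
        chosen.append(int(np.argmax(dist)))
    return chosen

def retrieve_demos(test_vec: np.ndarray, emb: np.ndarray, chosen: list[int], k: int = 4) -> list[int]:
    # Prompt retrieval: nearest annotated neighbors of the test instance.
    sims = emb[chosen] @ test_vec
    return [chosen[i] for i in np.argsort(-sims)[:k]]

emb = np.random.randn(1000, 64)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-norm sentence embeddings
annotated = select_for_annotation(emb, budget=100)  # the 100-example budget above
demos = retrieve_demos(emb[7], emb, annotated)      # demonstrations for test item 7
```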
LLMs instead of Human Judges?
Bavaresco, Anna, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb et al. "LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks." arXiv:2406.18403 (2024).
Generating Annotations for NLP Tasks using LLMs
Chain of thought (CoT) prompting
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
What are the advantages of chain-of-thought prompting?
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
Prompting PaLM 540B with just 8 CoT exemplars achieves SOTA on GSM8K math word problems, surpassing even finetuned GPT-3 with a verifier.
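Mechanically, CoT prompting just swaps answer-only exemplars for worked ones; the model imitates the format and emits its own intermediate steps. A minimal sketch (exemplar adapted from Wei et al., 2022):

```python
# One worked exemplar; real CoT prompts typically use 4-8 of these.
COT_EXEMPLAR = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?\n"
    "A: The cafeteria had 23 apples originally. They used 20 to make lunch, "
    "leaving 23 - 20 = 3. They bought 6 more, so 3 + 6 = 9. The answer is 9.\n\n"
)

def cot_prompt(question: str, exemplars: str = COT_EXEMPLAR) -> str:
    # The model continues after "A:" with a rationale ending in "The answer is ...".
    return f"{exemplars}Q: {question}\nA:"

print(cot_prompt("I have 10 apples, give away 4, and buy 3 more. How many now?"))
```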
Analogical prompting
Yasunaga, Michihiro, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, and Denny Zhou. "Large Language Models as Analogical Reasoners." In ICLR. 2024.
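The method needs no hand-written exemplars: a single prompt asks the model to recall relevant problems, solve them, and only then attack the target problem. A sketch with wording paraphrased from the paper:

```python
def analogical_prompt(problem: str, n_exemplars: int = 3) -> str:
    # Self-generated exemplars replace hand-curated few-shot demonstrations.
    return (
        f"Problem: {problem}\n\n"
        f"# Recall {n_exemplars} relevant and distinct problems. For each, "
        "describe the problem and explain its solution.\n"
        "# Then solve the initial problem step by step.\n"
    )

print(analogical_prompt(
    "What is the area of the square with vertices (-2, 2), (2, -2), (-2, -6), (-6, -2)?"
))
```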
Analogical prompting methods
Yasunaga, Michihiro, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, and Denny Zhou. "Large Language Models as Analogical Reasoners." In ICLR. 2024.
BIG-Bench reasoning tasks with GPT-3.5-Turbo
MoT: Memory-of-Thought
Li, Xiaonan, and Xipeng Qiu. "MoT: Memory-of-Thought Enables ChatGPT to Self-Improve." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6354-6374. 2023.
MoT: Memory-of-Thought
Li, Xiaonan, and Xipeng Qiu. "MoT: Memory-of-Thought Enables ChatGPT to Self-Improve." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6354-6374. 2023.
MoT exceeds Few-Shot-CoT and Zero-Shot-CoT
Dynamic Program Prompting and Program Distillation
Jie, Zhanming, and Wei Lu. "Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning." In Findings of the Association for Computational Linguistics: ACL 2023, pp. 10518-10526. 2023.
Dynamic Program Prompting and Program Distillation
Jie, Zhanming, and Wei Lu. "Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning." In Findings of the Association for Computational Linguistics: ACL 2023, pp. 10518-10526. 2023.
Example prediction and retrieved program samples
LeanReasoner: Offloading reasoning to Lean
Jiang, Dongwei, Marcio Fonseca, and Shay B. Cohen. "LeanReasoner: Boosting Complex Logical Reasoning with Lean." In NAACL-HLT, pp. 7490-7503. 2024.
Symbolic reasoning for math word problems
Gaur, Vedant, and Nikunj Saunshi. "Reasoning in Large Language Models Through Symbolic Math Word Problems." In ACL Findings, pp. 5889-5903. 2023.
Self-prompting: prompt the LLM with the numeric problem and its response to that problem, then ask it to solve the symbolic version.
Symbolic Rule Learning for Robust Numerical Reasoning
Al-Negheimish, Hadeel, Pranava Madhyastha, and Alessandra Russo. "Augmenting Large Language Models with Symbolic Rule Learning for Robust Numerical Reasoning." In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS'23.
Generating Annotations for NLP Tasks using LLMs
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
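The recipe in miniature: condition a PLM on a label-descriptive prompt to synthesize inputs, then train a small task model on the synthetic pairs. `sample_continuation` is a stub standing in for the PLM's sampling call (ZeroGen used GPT-2-class models):

```python
import random

PROMPTS = {
    "positive": "The movie review in positive sentiment is:",
    "negative": "The movie review in negative sentiment is:",
}
CANNED = {  # stub continuations; a real run samples these from the PLM
    "positive": ["a warm, wonderful film.", "an absolute joy to watch."],
    "negative": ["a dull, lifeless mess.", "two hours I want back."],
}

def sample_continuation(label: str) -> str:
    # Stub: a real system samples a continuation of PROMPTS[label] from the PLM.
    return random.choice(CANNED[label])

def synthesize_dataset(n_per_label: int = 100) -> list[tuple[str, str]]:
    data = [(sample_continuation(lbl), lbl) for lbl in PROMPTS for _ in range(n_per_label)]
    random.shuffle(data)
    return data  # then train a tiny model (e.g., LSTM or DistilBERT) on these pairs
```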
ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback
ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback
SuperGen (Supervision Generation)
SuperGen (Supervision Generation)
TarGen: Targeted Data Generation
Generation using Single Formatting Example
Generation with Active Learning
Generation with Active Learning
Generating Annotations for NLP Tasks using LLMs
Dictionary-assisted training data generation
Prompt-guided unlabeled data annotation
Prompt-guided training data generation
Dictionary-assisted training data generation
PersonaChatGen: Generating Personalized Dialogues
PersonaChatGen: Generating Personalized Dialogues
Mixture of Soft Prompts for Controllable Data Generation
Mixture of Soft Prompts for Controllable Data Generation
Soft prompts are initialized with the name and description of the attribute, e.g., "song is a musical song or melody".
Attributed Training Data Generator
Attributed Training Data Generator
Summary
Agenda
What is reasoning?
Augmented Language Models: a Survey (Mialon et al., 2023)
Reasoning Problems
Augmented Language Models: a Survey (Mialon et al., 2023)
Multi-step reasoning is often seen as a weakness in language models
Towards Reasoning in Large Language Models: A Survey (Huang et al., 2023)
Earlier research elicited reasoning in small language models through fully supervised finetuning on specific datasets.
Reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters
Reasoning and Commonsense Benchmarks
Source: https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond#different-types-of-llm-benchmarks
How is reasoning measured (in the literature)?
GPT-4 Technical Report (OpenAI).
Chain of thought prompting and Self consistency
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Prompt: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
Output: 11 apples (incorrect; without a chain of thought the model fails this multi-step problem)
Chain of thought prompting: Arithmetic Reasoning
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Chain of thought prompting and Self consistency
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Chain of thought prompting: Symbolic Reasoning
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Chain of thought prompting: Commonsense Reasoning
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
More Advances: Self consistency
Self-consistency improves chain of thought reasoning in language models. (Wang et al., 2022)
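The mechanism is easy to state in code: sample several reasoning paths at non-zero temperature, parse out each final answer, and return the majority vote. `sample_cot` is a stub for the temperature-sampled LLM call:

```python
import random
import re
from collections import Counter

def sample_cot(question: str) -> str:
    # Stub: pretend the model occasionally makes an arithmetic slip.
    return random.choice(["... The answer is 10.", "... The answer is 10.", "... The answer is 9."])

def extract_answer(cot: str) -> str:
    m = re.search(r"The answer is (\S+?)\.", cot)
    return m.group(1) if m else ""

def self_consistent_answer(question: str, k: int = 20) -> str:
    # Marginalize over sampled reasoning paths via majority vote on answers.
    votes = Counter(extract_answer(sample_cot(question)) for _ in range(k))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("10 - 2 - 2 + 5 - 1 = ?"))  # usually "10"
```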
STaR (Self-Taught Reasoner): Bootstrapping Reasoning with Reasoning
STaR: Self-Taught Reasoner. (Zelikman et al., 2022)
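The bootstrap loop in schematic form; `model.reason` and `finetune` are placeholders for an LLM call and a finetuning job, not a real API:

```python
def star_round(model, problems, finetune):
    # One STaR iteration: collect rationales that lead to correct answers,
    # then finetune the model on them to get the next-round model.
    keep = []
    for question, gold in problems:
        rationale, answer = model.reason(question)                 # plain CoT attempt
        if answer != gold:                                         # failed: rationalize,
            rationale, answer = model.reason(question, hint=gold)  # with the gold answer as a hint
        if answer == gold:
            keep.append((question, rationale, gold))               # the hint itself is NOT kept
    return finetune(model, keep)
```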
Program-aided Language Models (PAL)
PAL: Program-aided Language Models (Gao et al., 2023)
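In PAL the rationale is a short Python program, and the interpreter, not the model, computes the final answer. Here the "generated" program is hard-coded for illustration; a real system would sample it from an LLM:

```python
# Program a PAL-style model might emit for: "Roger has 23 dollars. He buys
# 5 bagels at 3 dollars each. How much money does he have left?"
GENERATED = """
money_initial = 23
bagels = 5
bagel_cost = 3
answer = money_initial - bagels * bagel_cost
"""

scope: dict = {}
exec(GENERATED, scope)   # arithmetic is offloaded to the Python interpreter
print(scope["answer"])   # 8
```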
Tool-Integrated Reasoning (ToRA)
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving (Gou et al., 2024)
Tool-Integrated Reasoning (ToRA)
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving (Gou et al., 2024)
Tool-Integrated Reasoning (ToRA)
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving (Gou et al., 2024)
Plan-and-Solve Prompting
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (Wang et al., 2023)
Can we use LLMs to benchmark reasoning datasets?
Reasoning datasets: CriticBench
Reasoning datasets: Question collection on CriticBench
Domains: mathematical, commonsense, symbolic, coding, and algorithmic reasoning
Question collection: questions drawn from 15 existing reasoning datasets
Response collection from LLMs: candidate responses generated by a range of LLMs
Response annotation: responses annotated for correctness
Reasoning datasets: Evaluation process on CriticBench
Reasoning datasets: Annotation example of CriticBench
Reasoning datasets: Key Factors in Critical Reasoning
Reasoning datasets: Average performance on CriticBench
Reasoning datasets: Consistency of GQC Knowledge
Human preference benchmarks with reasoning tasks
Why human preference benchmarks?
Human preference benchmarks: LLMs as judges
Four popular benchmarks
LLM-as-a-judge
Limitations of LLM-as-a-judge
Position bias
Verbosity bias
Self-enhancement bias
Limited reasoning ability
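Position bias in particular has a simple partial fix: judge both orderings and accept only consistent verdicts. A sketch with a deliberately flawed stub judge (note that the swap cancels position bias but not the stub's verbosity bias):

```python
def judge(question: str, first: str, second: str) -> str:
    # Stub LLM judge that simply prefers the longer answer (verbosity bias).
    return "A" if len(first) >= len(second) else "B"

def debiased_verdict(question: str, ans_a: str, ans_b: str) -> str:
    v1 = judge(question, ans_a, ans_b)        # A shown first
    v2 = judge(question, ans_b, ans_a)        # positions swapped
    v2 = {"A": "B", "B": "A"}[v2]             # map the swapped verdict back
    return v1 if v1 == v2 else "tie"          # inconsistent verdicts become ties

print(debiased_verdict("Q?", "short precise answer", "a much longer rambling answer"))  # "B"
```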
Human preference benchmarks: MT-Bench-101
MT-Bench-101: Hierarchical Ability Taxonomy
13 tasks
3-level abilities
MT-Bench-101: Model’s performance
Agenda
Why Focus on Evaluation?
Source: https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/
LLM Evaluation vs. Human Evaluation
How to scale “human evaluation”?
LLM Evaluation
Diagram: an LLM benchmark bundles several evaluation units, each specified by a task, dataset, metric, instruction, and number of shots, with an aggregation over the unit scores; the model under test is configured by its system prompt, system format, and hyperparameters.
LLM Evaluation
Task: what the model is asked to do (e.g., translation, summarization, QA)
Dataset: the examples on which the task is evaluated
Metric: how outputs are scored (e.g., accuracy, ROUGE, BLEU)
Instruction: the natural-language description of the task given in the prompt
Shot: the number of in-context examples included in the prompt
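In practice these pieces are usually pinned down by an evaluation harness rather than by hand. A sketch using EleutherAI's lm-evaluation-harness; `simple_evaluate` and its arguments follow harness v0.4 as I recall them, so treat the exact names as assumptions to verify:

```python
import lm_eval

# Each named task fixes its dataset and metric; shots are set explicitly.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["hellaswag", "gsm8k"],
    num_fewshot=5,
)
print(results["results"])
```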
LLM Evaluation
Model
System Prompt
System Format
Hyperparams
<SYS> You are a helpful model </SYS>
<instruction> Translate this sentence to French
<user> I like pizza
<assistant> J'aime la pizza
LLM Evaluation: Alpaca
LLM Evaluation: G-Eval
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)
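A sketch of the G-Eval idea: the judge model gets the criterion definition plus evaluation steps, returns a 1-5 score, and repeated samples are averaged to approximate the paper's token-probability-weighted score. Prompt wording and model name are illustrative assumptions:

```python
import re
from openai import OpenAI

client = OpenAI()

def g_eval_coherence(source: str, summary: str, n: int = 10) -> float:
    prompt = (
        "You will rate a summary for coherence on a 1-5 scale: the summary "
        "should be well-structured and well-organized.\n"
        "Evaluation steps: read the source, read the summary, check the flow "
        "of ideas, then assign a score.\n\n"
        f"Source: {source}\nSummary: {summary}\nScore (1-5):"
    )
    scores = []
    for _ in range(n):  # sampling approximates probability-weighted scoring
        out = client.chat.completions.create(
            model="gpt-4o",
            temperature=1.0,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        match = re.search(r"[1-5]", out)
        if match:
            scores.append(int(match.group()))
    return sum(scores) / len(scores)
```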
LLM Evaluation: GPT-Score
https://github.com/confident-ai/deepeval
Language model-written evaluations
https://github.com/confident-ai/deepeval
Agenda
Generate a synthetic dataset using LLMs
AutoLabel: open-source library from Refuel for labeling datasets with LLMs
Prodigy: scriptable annotation tool from Explosion with model-in-the-loop support
Labelbox: commercial data-labeling platform
LLM-data-annotation:
AutoLabel
AutoLabel: Question Answering
AutoLabel: Question Answering
AutoLabel: Question Answering
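A hedged sketch of what such a run looks like in code; the config keys and the LabelingAgent/AutolabelDataset API follow the Autolabel README as I recall it, so verify the field names against the current docs:

```python
from autolabel import LabelingAgent, AutolabelDataset

config = {
    "task_name": "ContextualQA",          # illustrative name
    "task_type": "question_answering",
    "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
    "dataset": {"label_column": "answer", "delimiter": ","},
    "prompt": {
        "task_guidelines": "Answer the question using only the given context.",
        "example_template": "Context: {context}\nQuestion: {question}\nAnswer: {answer}",
        "few_shot_examples": "seed.csv",  # small human-labeled seed set
        "few_shot_num": 3,
    },
}

agent = LabelingAgent(config)
dataset = AutolabelDataset("unlabeled.csv", config=config)
agent.plan(dataset)  # dry run: previews prompts and estimates cost
agent.run(dataset)   # labels the dataset with the LLM
```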
LLMs can label data as well as humans, but 100x faster
LLMs can label data: Quality Evaluation
LLMs can label data: Quality Evaluation
LLMs can label data: Quality Evaluation
Prodigy
https://demo.prodi.gy/?=null&view_id=ner_manual
What Prodigy isn’t:
Usage:
AutoLabel tools: which one is better?
Agenda
Hallucination
Evolution of Hallucination in LLMs
Evolution of Hallucination in LLMs
Evolution of Hallucination in LLMs
Taxonomy of Hallucinations
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Taxonomy of Hallucinations: Causes
Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)
Taxonomy of Hallucinations: Metrics and Benchmarks
Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)
Taxonomy of Hallucinations: Mitigation
Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)
Agenda
Hallucination Types
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Orientation, Category and Degree
Hallucination Types: Orientation
Factual Mirage:
Intrinsic
Extrinsic
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Orientation
Silver Lining:
Intrinsic
Extrinsic
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Category
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Category
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Category
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Category
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Category
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Category
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Degree
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Detection: SelfCheckGPT
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)
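The core intuition fits in a few lines: sample several stochastic responses and flag sentences of the main response that the samples fail to reproduce; facts that vary across samples are likely hallucinated. Unigram overlap stands in below for the paper's stronger BERTScore/QA/NLI consistency scorers:

```python
import re

def toks(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def support(sentence: str, sample: str) -> float:
    words = toks(sentence)
    return len(words & toks(sample)) / max(len(words), 1)

def selfcheck(sentences: list[str], samples: list[str]) -> list[float]:
    # Higher score = less supported by the samples = more likely hallucinated.
    return [1 - max(support(s, smp) for smp in samples) for s in sentences]

response = ["Paris is the capital of France.", "It was founded in 1999."]
samples = ["Paris, the French capital, is ancient.", "France's capital is Paris."]
print(selfcheck(response, samples))  # the second sentence scores as unsupported
```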
Hallucination Detection: FACTScore
"Factscore: Fine-grained atomic evaluation of factual precision in long form text generation (Sewon et.al, 2023)
Hallucination eLiciTation dataset
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Vulnerability Index (HVI)
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Agenda
Hallucination Mitigation
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation (Varshney et al., 2023)
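The mitigation loop in schematic form: flag concepts whose token probabilities are low, validate each flagged concept against retrieved evidence, and repair the sentence before generation continues. `validate` and `repair` are placeholders for the paper's retrieval and regeneration steps:

```python
def detect(concepts: list[tuple[str, float]], thresh: float = 0.6) -> list[str]:
    # Keep concepts whose (minimum) token probability falls below the threshold.
    return [c for c, p in concepts if p < thresh]

def mitigate(sentence: str, concepts: list[tuple[str, float]], validate, repair) -> str:
    for concept in detect(concepts):
        if not validate(concept):                 # e.g., web search + entailment check
            sentence = repair(sentence, concept)  # e.g., regenerate with evidence
    return sentence  # the validated sentence is kept; generation continues
```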
Hallucination Mitigation: Chain-Of-Verification (CoVe)
Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023)
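CoVe's stages compress naturally into a pipeline: draft an answer, plan verification questions, answer them independently of the draft (so its errors are not copied), and revise. `llm` is a stub for any text-completion call:

```python
def cove(question: str, llm) -> str:
    draft = llm(f"Answer the question: {question}")
    plan = llm(f"List verification questions to fact-check this answer:\n{draft}")
    checks = [llm(f"Answer independently: {q}")      # the draft is deliberately
              for q in plan.splitlines() if q]       # NOT shown at this stage
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification Q&A: {checks}\nWrite the final, corrected answer:"
    )
```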
Is hallucination always bad?
https://www.washingtonpost.com/opinions/2023/12/27/artificial-intelligence-hallucinations/
Thanks!