LLMs are Human-like Annotators
Mounika Marreddy1, Subba Reddy Oota2, Lucie Flek1, Manish Gupta3
1Univ of Bonn, Germany; 2TU Berlin, Germany; 3Microsoft, India;
mmarredd@uni-bonn.de, subba.reddy.oota@tu-berlin.de, flek@bit.uni-bonn.de, gmanish@microsoft.com
KR 2024
21st International Conference on Principles of Knowledge Representation and Reasoning
Nov 2 - 8, 2024. Hanoi, Vietnam
Agenda
Deep Learning and Large Language Models
Basic: ANNs, CNNs, RNNs, LSTMs
NLP: Encoder-Decoder, Attention, Transformers, BERT, GPT, T0, BART, T5…
Prompt-based models: GPT-3, T0/mT0, InstructGPT, Prompting
GPT-3
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
InstructGPT
1. Supervised fine-tuning (SFT)
2. Reward model (RM) training
3. RL via proximal policy optimization (PPO) against the RM (see the sketch below)
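The three stages above can be read as a single training loop. Below is a minimal, hypothetical sketch of that loop; `sft_train`, `train_reward_model`, and `ppo_update` are placeholder callables standing in for the actual training routines, not a real library API.

```python
# Minimal sketch of the three InstructGPT stages, with the stage-specific
# training routines injected as callables. All helper names are hypothetical.

from typing import Any, Callable, Iterable

def rlhf_pipeline(
    base_lm: Any,
    demonstrations: Iterable,      # human-written responses to prompts
    comparisons: Iterable,         # human rankings of model responses
    prompts: Iterable[str],
    sft_train: Callable,           # (model, demonstrations) -> SFT model
    train_reward_model: Callable,  # (model, comparisons) -> reward model
    ppo_update: Callable,          # (policy, prompt, completion, reward, ref) -> policy
) -> Any:
    # 1. Supervised fine-tuning (SFT) on labeler demonstrations.
    sft_model = sft_train(base_lm, demonstrations)
    # 2. Reward model (RM) trained on pairwise human preference comparisons.
    reward_model = train_reward_model(sft_model, comparisons)
    # 3. PPO: sample, score with the RM, update the policy
    #    (typically with a KL penalty toward the SFT model).
    policy = sft_model
    for prompt in prompts:
        completion = policy.generate(prompt)
        reward = reward_model.score(prompt, completion)
        policy = ppo_update(policy, prompt, completion, reward, sft_model)
    return policy
```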
ChatGPT and Prompting
Summarization
Question Answering
Machine Translation
Ads Copywriting
Machine Reading Comprehension
Solving reasoning problems
Chain of thought (CoT) prompting
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
What are advantages of chain of thought prompting?
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
Prompting PaLM 540B with just 8 CoT exemplars achieves SOTA on GSM8K math word problems, surpassing even finetuned GPT-3 with a verifier.
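For a concrete picture of what such a prompt looks like, here is a small sketch of few-shot CoT prompt construction; the single exemplar is the well-known tennis-ball example in the style of Wei et al. (2022), and the formatting details are illustrative rather than the exact prompt used with PaLM.

```python
# Sketch: building a few-shot chain-of-thought prompt (Wei et al., 2022 style).

COT_EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis "
                    "balls. Each can has 3 tennis balls. How many tennis balls "
                    "does he have now?",
        "rationale": "Roger started with 5 balls. 2 cans of 3 balls each is "
                     "6 balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked examples so the model imitates step-by-step reasoning."""
    parts = []
    for ex in COT_EXEMPLARS:
        parts.append(f"Q: {ex['question']}\nA: {ex['rationale']} "
                     f"The answer is {ex['answer']}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_cot_prompt("If there are 3 cars and each car has 4 wheels, how many wheels are there?"))
```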
CoT improves Commonsense Reasoning
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
Loads of LLMs and SLMs
GPT-4o
OpenAI o1
…
Small language models
LLaMA 1
LLaMA 2
Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi and others. July 2023.
Helpfulness human evaluation on ~4k prompts
LLaMA 3
Meta Llama 3 Instruct model
Meta Llama 3 pre-trained model
LLaMA 3.2
Vision benchmarks (Llama 3.2 11B / 90B vision models):

| Category | Benchmark | Llama 3.2 11B | Llama 3.2 90B | Claude 3 Haiku | GPT-4o mini |
|---|---|---|---|---|---|
| Image: College-level Problems and Mathematical Reasoning | MMMU (val, 0-shot CoT, micro avg accuracy) | 50.7 | 60.3 | 50.2 | 59.4 |
| | MMMU-Pro, Standard (10 opts, test) | 33 | 45.2 | 27.3 | 42.3 |
| | MMMU-Pro, Vision (test) | 23.7 | 33.8 | 20.1 | 36.5 |
| | MathVista (testmini) | 51.5 | 57.3 | 46.4 | 56.7 |
| Image: Charts and Diagram Understanding | ChartQA (test, 0-shot CoT, relaxed accuracy) | 83.4 | 85.5 | 81.7 | - |
| | AI2 Diagram (test) | 91.1 | 92.3 | 86.7 | - |
| | DocVQA (test, ANLS) | 88.4 | 90.1 | 88.8 | - |
| Image: General VQA | VQAv2 (test) | 75.2 | 78.1 | - | - |
| Text: General | MMLU (0-shot, CoT) | 73 | 86 | 75.2 | 82 |
| Text: Math | MATH (0-shot, CoT) | 51.9 | 68 | 38.9 | 70.2 |
| Text: Reasoning | GPQA (0-shot, CoT) | 32.8 | 46.7 | 33.3 | 40.2 |
| Text: Multilingual | MGSM (0-shot, CoT) | 68.9 | 86.9 | 75.1 | 87 |
Lightweight text benchmarks (Llama 3.2 1B / 3B):

| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5 mini IT |
|---|---|---|---|---|---|
| General | MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69 |
| | Open-rewrite eval (0-shot, rougeL) | 41.6 | 40.1 | 31.2 | 34.5 |
| | TLDR9+ (test, 1-shot, rougeL) | 16.8 | 19 | 13.9 | 12.8 |
| | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Tool Use | BFCL V2 | 25.7 | 67 | 27.4 | 58.4 |
| | Nexus | 13.5 | 34.3 | 21 | 26.1 |
| Math | GSM8K (8-shot, CoT) | 44.4 | 77.7 | 62.5 | 86.2 |
| | MATH (0-shot, CoT) | 30.6 | 48 | 23.8 | 44.2 |
| Reasoning | ARC Challenge (0-shot) | 59.4 | 78.6 | 76.7 | 87.4 |
| | GPQA (0-shot) | 27.2 | 32.8 | 27.5 | 31.9 |
| | Hellaswag (0-shot) | 41.2 | 69.8 | 61.1 | 81.4 |
| Long Context | InfiniteBench/En.MC (128k) | 38 | 63.3 | - | 39.2 |
| | InfiniteBench/En.QA (128k) | 20.3 | 19.8 | - | 11.3 |
| | NIH/Multi-needle | 75 | 84.7 | - | 52.7 |
| Multilingual | MGSM (0-shot, CoT) | 24.5 | 58.2 | 40.2 | 49.8 |
GPT-4
GPT-4 Technical Report. OpenAI. https://cdn.openai.com/papers/gpt-4.pdf
Math word problems and reasoning QA
Mitra, Arindam, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen et al. "Orca 2: Teaching small language models how to reason." arXiv preprint arXiv:2311.11045 (2023).
Chart understanding and reasoning over data
Spot a data point that stands out in these charts and what that implicates. Then produce a detailed markdown table for all the data shown.
Image understanding and reasoning
Geometrical reasoning
Information seeking about objects
Multimodal reasoning based on visual cues
Multimodal humor understanding
Commonsense reasoning in a multilingual setting
Reasoning and code generation
Create a web app called "Opossum Search":
1. Every time you make a search query, it should redirect you to a google search with the same query, but the word opossum before it.
2. It should be visually similar to Google search,
3. Instead of the google logo, it should have a picture of an opossum from the internet.
4. It should be a single html file, no separate js or css files.
5. It should say "Powered by google search" in the footer
Mathematics: Calculus
Video understanding and reasoning
Agenda
Generating Annotations for Reasoning Tasks using LLMs
LLMs with CoT are Non-Causal Reasoners
Bao, Guangsheng, Hongbo Zhang, Linyi Yang, Cunxiang Wang, and Yue Zhang. "Llms with chain-of-thought are non-causal reasoners." arXiv preprint arXiv:2402.16048 (2024).
Automatic Reasoning Chain Evaluation
Hao, Shibo, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie et al. "LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models." In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
Common types of false positive reasoning chains detected by AutoRace
Dynamic Program Prompting and Program Distillation
Jie, Zhanming, and Wei Lu. "Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning." In Findings of the Association for Computational Linguistics: ACL 2023, pp. 10518-10526. 2023.
Example prediction and retrieved program samples
Generating Annotations for Reasoning Tasks using LLMs
MoT: Memory-of-Thought
Li, Xiaonan, and Xipeng Qiu. "MoT: Memory-of-Thought Enables ChatGPT to Self-Improve." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6354-6374. 2023.
MoT exceeds Few-Shot-CoT and Zero-Shot-CoT
Selective annotation and Prompt retrieval
Su, Hongjin, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang et al. "Selective annotation makes language models better few-shot learners." In ICLR. 2023.
ICL perf over varying annotation budgets for HellaSwag commonsense reasoning. LLM=GPT-J
100 annotated examples
Example from HellaSwag CommonSense Reasoning
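The prompt-retrieval half of this recipe can be sketched as nearest-neighbour selection of annotated examples. The sketch below uses bag-of-words cosine similarity purely for illustration; the paper itself uses Sentence-BERT embeddings and a graph-based vote-k step for deciding which examples to annotate in the first place.

```python
# Sketch of similarity-based prompt retrieval: pick the annotated examples
# closest to the test instance and use them as in-context demonstrations.

from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demonstrations(test_input: str, annotated_pool: list, k: int = 3) -> list:
    q = embed(test_input)
    scored = sorted(annotated_pool, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
    return scored[:k]

pool = [
    {"input": "A man is cooking pasta in a kitchen", "label": "he drains the pasta"},
    {"input": "A woman is riding a horse on a beach", "label": "she gallops along the shore"},
]
print(retrieve_demonstrations("A chef is boiling noodles", pool, k=1))
```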
Generating Annotations for Reasoning Tasks using LLMs
Analogical prompting
Yasunaga, Michihiro, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, and Denny Zhou. "Large Language Models as Analogical Reasoners." In ICLR, 2024.
Analogical prompting methods
Yasunaga, Michihiro, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, and Denny Zhou. "Large Language Models as Analogical Reasoners." In The Twelfth International Conference on Learning Representations.
BIG-Bench reasoning tasks with GPT-3.5-Turbo
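A rough template for analogical prompting looks like the sketch below: the model is asked to recall and solve a few related problems before tackling the target one. The wording is paraphrased rather than the paper's exact template.

```python
# Sketch of an analogical prompting template (Yasunaga et al.): instead of
# hand-written exemplars, the model self-generates relevant worked examples.

ANALOGICAL_TEMPLATE = """\
Problem: {problem}

Instructions:
1. Recall three relevant and distinct problems, and solve each of them.
2. Then solve the initial problem step by step.
"""

def analogical_prompt(problem: str) -> str:
    return ANALOGICAL_TEMPLATE.format(problem=problem)

print(analogical_prompt(
    "What is the area of the square with vertices (-2, 2), (2, -2), (-2, -6), (-6, -2)?"
))
```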
LeanReasoner: Offloading reasoning to Lean
Jiang, Dongwei, Marcio Fonseca, and Shay B. Cohen. "LeanReasoner: Boosting Complex Logical Reasoning with Lean." In NAACL-HLT, pp. 7490-7503. 2024.
LeanReasoner: Offloading reasoning to Lean
Sample proofs created by LeanReasoner without pretraining (left), finetuned on Intuitive data (middle), and finetuned on Concise data (right).
Jiang, Dongwei, Marcio Fonseca, and Shay B. Cohen. "LeanReasoner: Boosting Complex Logical Reasoning with Lean." In NAACL-HLT, pp. 7490-7503. 2024.
Event Relation Logical Prediction
Chen, Meiqi, Yubo Ma, Kaitao Song, Yixin Cao, Yan Zhang, and Dongsheng Li. "Learning to teach large language models logical reasoning." arXiv preprint arXiv:2310.09158 (2023).
Enabling LLMs for Event Relation Logical Prediction
Chen, Meiqi, Yubo Ma, Kaitao Song, Yixin Cao, Yan Zhang, and Dongsheng Li. "Learning to teach large language models logical reasoning." arXiv preprint arXiv:2310.09158 (2023).
Event Relation Logical Prediction Results
Chen, Meiqi, Yubo Ma, Kaitao Song, Yixin Cao, Yan Zhang, and Dongsheng Li. "Learning to teach large language models logical reasoning." arXiv preprint arXiv:2310.09158 (2023).
Generating Annotations for Reasoning Tasks using LLMs
Symbolic reasoning for math word problems
Gaur, Vedant, and Nikunj Saunshi. "Reasoning in Large Language Models Through Symbolic Math Word Problems." In ACL Findings, pp. 5889-5903. 2023.
Symbolic Rule Learning for Robust Numerical Reasoning
Al-Negheimish, Hadeel, Pranava Madhyastha, and Alessandra Russo. "Augmenting Large Language Models with Symbolic Rule Learning for Robust Numerical Reasoning." In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS'23.
Agenda
What is reasoning?
Augmented Language Models: a Survey (Mialon et al., 2023)
Reasoning Problems
Augmented Language Models: a Survey (Mialon et al., 2023)
Multi-step reasoning is often seen as a weakness in language models
Towards Reasoning in Large Language Models: A Survey (Huang et al., 2023)
Earlier research elicited reasoning in small language models through fully supervised fine-tuning on specific datasets
Reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters
Reasoning and Commonsense Benchmarks
Source: https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond#different-types-of-llm-benchmarks
How is reasoning measured (in the literature)?
GPT-4 Technical Report (OpenAI).
Chain of thought prompting and Self consistency
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Prompt: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
11 apples
Chain of thought prompting: Arithmetic Reasoning
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Chain of thought prompting and Self consistency
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Chain of thought prompting: Symbolic Reasoning
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
Chain of thought prompting: Commonsense Reasoning
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)
More Advances: Self consistency
Self-consistency improves chain of thought reasoning in language models. (Wang et al., 2022)
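Self-consistency reduces to sampling several reasoning paths and majority-voting the final answer, as in this sketch; `sample_cot` stands in for any LLM call that returns a (rationale, answer) pair under non-zero temperature sampling.

```python
# Sketch of self-consistency (Wang et al., 2022): sample multiple CoT paths
# and return the most frequent final answer.

from collections import Counter
from typing import Callable, Tuple

def self_consistent_answer(
    question: str,
    sample_cot: Callable[[str], Tuple[str, str]],  # question -> (rationale, answer)
    num_samples: int = 20,
) -> str:
    answers = []
    for _ in range(num_samples):
        _rationale, answer = sample_cot(question)  # temperature > 0 sampling
        answers.append(answer.strip())
    # The most frequent final answer across reasoning paths wins.
    return Counter(answers).most_common(1)[0][0]
```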
STaR (Self-Taught Reasoner): Bootstrapping Reasoning With Reasoning
STaR: Bootstrapping Reasoning With Reasoning (Zelikman et al., 2022)
Program-aided Language Models (PAL)
PAL: Program-aided Language Models (Gao et al., 2023)
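The PAL recipe delegates the arithmetic to the Python interpreter: the LLM only writes the program. A minimal sketch, with `generate_code` standing in for the LLM call and a hard-coded toy "model" so the example runs end to end:

```python
# Sketch of the PAL idea (Gao et al., 2023): the LLM writes Python that encodes
# the reasoning; the interpreter, not the LLM, computes the final answer.

from typing import Callable

def pal_answer(question: str, generate_code: Callable[[str], str]):
    prompt = (
        "Write Python code that solves the problem and stores the result "
        f"in a variable named `answer`.\n\nProblem: {question}\n"
    )
    code = generate_code(prompt)
    scope: dict = {}
    exec(code, scope)   # offload the arithmetic to the interpreter (sandbox in practice)
    return scope.get("answer")

# Toy usage with a hard-coded "model" so the sketch runs end to end.
fake_llm = lambda _prompt: "answer = 10 - 2 - 2 + 5 - 1"
print(pal_answer("I bought 10 apples, gave away 2 and 2, bought 5 more, ate 1.", fake_llm))
```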
Tool-Integrated Reasoning (TORA)
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving (Gou et al., 2024)
Plan-and-Solve Prompting
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (Wang et al., 2023)
Can we use LLMs to benchmark reasoning datasets?
Reasoning datasets: CriticBench
Reasoning datasets: Question collection on CriticBench
Response collection from LLMs:
Response annotation:
Domains:
Question collection:
Reasoning datasets: Evaluation process on CriticBench
Reasoning datasets: Annotation example of CriticBench
Reasoning datasets: Key Factors in Critical Reasoning
Reasoning datasets: Average performance on CriticBench
Reasoning datasets: Consistency of GQC Knowledge
Human preference benchmarks with reasoning tasks
Why human preference benchmarks?
Human preference benchmarks: LLMs as judges
Four popular benchmarks
LLM-as-a-judge
Limitations of LLM-as-a-judge
Position bias (a simple order-swapping control is sketched below)
Verbosity bias
Self-appreciation bias
Limited reasoning ability
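A common partial fix for position bias is to judge each pair twice with the answer order swapped and keep only consistent verdicts. A minimal sketch, where `judge` stands in for a call to a strong LLM and the prompt wording is illustrative:

```python
# Sketch of pairwise LLM-as-a-judge with a simple position-bias control.

from typing import Callable

JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the
question and reply with exactly one of: A, B, tie.

Question: {question}
Answer A: {a}
Answer B: {b}
Verdict:"""

def pairwise_verdict(question: str, ans1: str, ans2: str,
                     judge: Callable[[str], str]) -> str:
    first = judge(JUDGE_PROMPT.format(question=question, a=ans1, b=ans2)).strip()
    second = judge(JUDGE_PROMPT.format(question=question, a=ans2, b=ans1)).strip()
    # Map the swapped-order verdict back to the original labeling.
    second = {"A": "B", "B": "A"}.get(second, second)
    # Only count a win when both orderings agree; otherwise call it a tie.
    return first if first == second else "tie"
```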
Human preference benchmarks: MT-Bench-101
MT-Bench-101: Hierarchical Ability Taxonomy
13 tasks
3-level abilities
MT-Bench-101: Model’s performance
Agenda
Why Focus on Evaluation?
Source: https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/
LLM Evaluation vs. Human Evaluation
How to scale “human evaluation”?
LLM Evaluation
[Figure: an LLM benchmark evaluates a model by aggregating results over several tasks; each task is defined by a dataset, a metric, an instruction template, and a number of shots, while the model side fixes the system prompt, system format, and hyperparameters.]
LLM Evaluation
Task:
Dataset:
Metric:
Instruction:
Shot:
LLM Evaluation
Model
System Prompt
System Format
Hyperparams
<SYS> You are a helpful model </SYS>
<instruction> Translate this sentence to French
<user> I like pizza
<assistant> J'aime la pizza
LLM Evaluation: Alpaca
LLM Evaluation: G-Eval
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)
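The skeleton of such an evaluator is a rubric-style prompt plus score parsing, as sketched below; the real G-Eval additionally generates chain-of-thought evaluation steps and weights the score by token probabilities, which this sketch omits. `llm` is an injected text-in/text-out call.

```python
# Sketch in the spirit of G-Eval: score one summary on one criterion.

from typing import Callable

GEVAL_PROMPT = """You will be given a source document and a summary.
Rate the summary for {criterion} on a scale of 1 to 5 (5 = best).
Respond with only the number.

Document:
{document}

Summary:
{summary}

{criterion} score (1-5):"""

def geval_score(document: str, summary: str, criterion: str,
                llm: Callable[[str], str]) -> int:
    prompt = GEVAL_PROMPT.format(criterion=criterion, document=document, summary=summary)
    reply = llm(prompt)
    digits = [c for c in reply if c.isdigit()]  # crude parse of the first digit
    return int(digits[0]) if digits else 0
```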
LLM Evaluation: GPT-Score
https://github.com/confident-ai/deepeval
Language model-written evaluations
https://github.com/confident-ai/deepeval
Agenda
Generate a synthetic dataset using LLMs
AutoLabel:
Prodigy:
Labelbox:
LLM-data-annotation:
AutoLabel
AutoLabel: Question Answering
AutoLabel: Question Answering
AutoLabel: Question Answering
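A typical AutoLabel run for question answering follows the pattern below. This is a sketch based on the library's documented usage (github.com/refuel-ai/autolabel); the exact config keys, the model name, and the input file `qa_examples.csv` are assumptions to check against the current docs.

```python
# Sketch of labeling a question-answering dataset with the open-source
# `autolabel` library. Config keys and file names below are assumptions.

from autolabel import LabelingAgent, AutolabelDataset

config = {
    "task_name": "SquadStyleQA",
    "task_type": "question_answering",
    "dataset": {"label_column": "answer", "delimiter": ","},
    "model": {"provider": "openai", "name": "gpt-4o-mini"},
    "prompt": {
        "task_guidelines": "Answer the question using only the provided context.",
        "example_template": "Context: {context}\nQuestion: {question}\nAnswer: {answer}",
    },
}

agent = LabelingAgent(config=config)
dataset = AutolabelDataset("qa_examples.csv", config=config)  # hypothetical input file
agent.plan(dataset)   # dry run: estimated cost and example prompts
agent.run(dataset)    # generate labels with the LLM
```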
LLMs can label data as well as humans, but 100x faster
LLMs can label data: Quality Evaluation
LLMs can label data: Quality Evaluation
LLMs can label data: Quality Evaluation
Prodigy
https://demo.prodi.gy/?=null&view_id=ner_manual
What Prodigy isn’t:
Usage:
Auto-labeling tools: Which one is better?
Agenda
Hallucination
Evolution of Hallucination in LLMs
Taxonomy of Hallucinations
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Taxonomy of Hallucinations: Causes
Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)
Taxonomy of Hallucinations: Metrics and Benchmarks
Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)
Taxonomy of Hallucinations: Mitigation
Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)
Agenda
Hallucination Types
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Orientation, Category and Degree
Hallucination Types: Orientation
Factual Mirage:
Intrinsic
Extrinsic
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Orientation
Silver Lining:
Intrinsic
Extrinsic
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Category
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Types: Degree
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Detection: SelfCheckGPT
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)
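The core intuition is that hallucinated sentences are not reproduced across independently sampled answers to the same prompt. A minimal sketch, using token overlap as a stand-in for the paper's NLI/QA/n-gram consistency scorers:

```python
# Sketch of the SelfCheckGPT idea: a sentence unsupported by other sampled
# responses to the same prompt is likely hallucinated.

from typing import List

def support_score(sentence: str, samples: List[str]) -> float:
    """Fraction of sampled responses that share most of the sentence's tokens."""
    tokens = set(sentence.lower().split())
    if not tokens or not samples:
        return 0.0
    hits = 0
    for sample in samples:
        overlap = len(tokens & set(sample.lower().split())) / len(tokens)
        if overlap >= 0.5:
            hits += 1
    return hits / len(samples)

def flag_hallucinations(answer_sentences: List[str], samples: List[str],
                        threshold: float = 0.3) -> List[str]:
    # Return the sentences whose support across samples falls below the threshold.
    return [s for s in answer_sentences if support_score(s, samples) < threshold]
```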
Hallucination Detection: FActScore
"Factscore: Fine-grained atomic evaluation of factual precision in long form text generation (Sewon et.al, 2023)
Hallucination eLiciTation dataset
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Hallucination Vulnerability Index (HVI)
A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)
Agenda
Hallucination Mitigation
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation (Varshney et al., 2023)
Hallucination Mitigation: Chain-Of-Verification (CoVe)
Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023)
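CoVe drafts an answer, plans verification questions, answers them independently, and then revises. A minimal sketch with `llm` as an injected text-in/text-out call and paraphrased (not the paper's exact) prompt wording:

```python
# Sketch of the Chain-of-Verification loop (Dhuliawala et al., 2023).

from typing import Callable

def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    # 1. Draft an initial answer.
    draft = llm(f"Answer the question.\nQuestion: {query}\nAnswer:")

    # 2. Plan short fact-checking questions about the draft's claims.
    questions = llm(
        "List short fact-checking questions, one per line, that would verify "
        f"the claims in this answer:\n{draft}"
    ).splitlines()

    # 3. Answer each verification question independently of the draft,
    #    so errors in the draft do not leak into the checks.
    checks = [f"Q: {q}\nA: {llm(q)}" for q in questions if q.strip()]

    # 4. Produce a revised, verified answer.
    return llm(
        f"Question: {query}\nDraft answer: {draft}\n"
        "Verification Q&A:\n" + "\n".join(checks) +
        "\nRewrite the answer, correcting anything contradicted by the verification."
    )
```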
Is hallucination always bad?
https://www.washingtonpost.com/opinions/2023/12/27/artificial-intelligence-hallucinations/
A big thank you!