1 of 143

LLMs are Human-like Annotators

Mounika Marreddy1, Subba Reddy Oota2, Lucie Flek1, Manish Gupta3

1University of Bonn, Germany; 2TU Berlin, Germany; 3Microsoft, India

mmarredd@uni-bonn.de, subba.reddy.oota@tu-berlin.de, flek@bit.uni-bonn.de, gmanish@microsoft.com

KR 2024

21st International Conference on Principles of Knowledge Representation and Reasoning

Nov 2 - 8, 2024. Hanoi, Vietnam

2 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]: Manish
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]: Manish
  • Reasoning Datasets with LLM Annotations [25 mins]: Subba
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]: Subba
  • Autolabel Tools to Label Reasoning Datasets [25 mins]: Mounika
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]: Mounika

KR 2024: LLMs are Human-like Annotators

2

3 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

3

4 of 143

Deep Learning and Large Language Models

KR 2024: LLMs are Human-like Annotators

4

Basic: ANNs, CNNs, RNNs, LSTMs

NLP: Encoder-Decoder, Attention, Transformers, BERT, GPT, T0, BART, T5…

Prompt-based models: GPT-3, T0/mT0, InstructGPT, prompting

5 of 143

GPT-3

  • Humans do not require large supervised datasets to learn most language tasks
  • The following is often enough:
    • A brief directive in natural language (e.g., “please tell me if this sentence describes something happy or something sad”)
    • A tiny number of demonstrations (e.g., “here are two examples of people acting brave; please give a third example of bravery”)
  • This is in-context learning (see the prompt sketch below)

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
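To make the in-context learning setup concrete, here is a minimal sketch: a brief directive plus a couple of demonstrations assembled into a prompt. The task, examples, and labels are illustrative, and sending `prompt` to a specific LLM API is left out.

```python
# Minimal sketch of an in-context (few-shot) prompt: a brief task directive
# plus a handful of demonstrations, followed by the new input to label.
# Task and examples are illustrative; send `prompt` to whichever LLM API you use.

directive = "Please tell me if the sentence describes something happy or something sad."

demonstrations = [
    ("We finally adopted a puppy this weekend!", "happy"),
    ("My flight got cancelled and I missed the wedding.", "sad"),
]

new_input = "The whole team celebrated after the product launch."

prompt_lines = [directive, ""]
for text, label in demonstrations:
    prompt_lines.append(f"Sentence: {text}\nLabel: {label}")
prompt_lines.append(f"Sentence: {new_input}\nLabel:")

prompt = "\n\n".join(prompt_lines)
print(prompt)  # the model is expected to continue with "happy"
```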

KR 2024: LLMs are Human-like Annotators

5

6 of 143

InstructGPT

KR 2024: LLMs are Human-like Annotators

6

1. Supervised fine-tuning (SFT)

2. Reward model (RM) training

3. RL via proximal policy optimization (PPO) on RM

7 of 143

ChatGPT and Prompting

  • Generate labelled training data
    • Cheaper: reduction in $ cost vs. UHRS crowd judges
    • Faster turnaround
    • Agility: big dev savings on HitApp creation and judge training
    • Quality: higher label quality
  • Directly use GPT models rather than training your own.
  • Prompt engineering: a good task description and examples matter; multiple prompts help.
  • Output length, output language, and output style can be controlled (see the labeling-prompt sketch below).
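As a concrete illustration of using prompting to generate labelled training data, here is a hedged sketch. `call_llm` is a placeholder for whichever GPT endpoint is used, and the label set, JSON format, and guideline text are illustrative rather than a prescribed template.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: route this to your LLM provider (e.g., a GPT chat endpoint)."""
    raise NotImplementedError

# A labeling prompt with a clear task description, label set, and output format,
# so the LLM's answers can be parsed programmatically into training data.
LABELING_PROMPT = """You are a data annotator.
Classify the sentiment of the review as exactly one of: positive, negative, neutral.
Return JSON of the form {{"label": "<label>", "reason": "<one sentence>"}}.

Review: {review}
"""

def annotate(review: str) -> dict:
    response = call_llm(LABELING_PROMPT.format(review=review))
    return json.loads(response)  # e.g., {"label": "positive", "reason": "..."}
```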

KR 2024: LLMs are Human-like Annotators

7

8 of 143

Summarization

KR 2024: LLMs are Human-like Annotators

8

Question Answering

9 of 143

Machine Translation

KR 2024: LLMs are Human-like Annotators

9

Ads Copywriting

10 of 143

Machine Reading Comprehension

KR 2024: LLMs are Human-like Annotators

10

11 of 143

Solving reasoning problems

KR 2024: LLMs are Human-like Annotators

11

12 of 143

Chain of thought (CoT) prompting

  • Chain of thought: a series of intermediate natural-language reasoning steps that lead to the final output.
  • It can be zero-shot or few-shot.
  • Each prompt exemplar is a triple: <input, chain of thought, output> (see the sketch below).

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
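A minimal sketch of such a prompt, using the well-known tennis-ball exemplar from the CoT paper as the <input, chain of thought, output> triple; the final question is the one the model must answer with its own chain of thought.

```python
# Sketch of one few-shot CoT exemplar as an <input, chain of thought, output> triple,
# followed by the new question to be solved.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

new_question = (
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\nA:"
)

prompt = cot_exemplar + new_question
# The model is expected to produce a chain of thought ending in "The answer is 9."
print(prompt)
```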

KR 2024: LLMs are Human-like Annotators

12

13 of 143

What are the advantages of chain-of-thought prompting?

  • Allows models to decompose multi-step problems into intermediate steps
  • Improves interpretability: shows how the model might have arrived at a particular answer and helps debug where the reasoning path went wrong
  • CoT reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).

Prompting PaLM 540B with just 8 CoT exemplars achieves SOTA on GSM8K math word problems, surpassing even finetuned GPT-3 with a verifier.

KR 2024: LLMs are Human-like Annotators

13

14 of 143

CoT improves Commonsense Reasoning

  • 5 datasets
    • CSQA asks commonsense questions about the world involving complex semantics that often require prior knowledge.
    • StrategyQA involves inferring a multi-hop strategy to answer questions.
    • Date Understanding involves inferring a date from a given context
    • Sports Understanding involves determining whether a sentence relating to sports is plausible or implausible.
    • SayCan involves mapping a natural language instruction to a sequence of robot actions from a discrete set.
  • PaLM 540B with CoT achieved SOTA on StrategyQA (75.6% vs 69.4%) and outperforms an unaided sports enthusiast on sports understanding (95.4% vs 84%).

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).

KR 2024: LLMs are Human-like Annotators

14

15 of 143

Loads of LLMs and SLMs

KR 2024: LLMs are Human-like Annotators

15

GPT-4o

OpenAI O1

16 of 143

Small language models

  • Models based on LLaMA and Falcon
    • LLaMA: open-source models
    • Falcon: clean web data at scale
    • LoRA: fast fine-tuning with low-rank adaptation: Alpaca
    • Vicuna: conversation tuning
    • Falcon Instruct, Alpaca: instruction tuning
    • Orca: explanation tuning
  • Models based on LLaMA 2
    • LLaMA 2: RLHF
    • Orca 2: cautious reasoning and progressive learning
  • Models based on Mistral
    • Mistral: sliding-window attention, pre-fill and chunking, rolling buffers
    • Mixtral: mixture of experts (MoE)
    • SOLAR: depth up-scaling, alignment tuning, DPO
  • The Phi series: clean, textbook-quality data

KR 2024: LLMs are Human-like Annotators

16

17 of 143

LLaMA 1

  • Mathematical reasoning
    • MATH: 12K middle- and high-school math problems in LaTeX.
    • GSM8K: middle-school math problems.
    • LLaMA-65B > Minerva 62B
      • Minerva: PaLM models finetuned on 38.5B tokens from arXiv and math web pages
  • Code generation
    • HumanEval and MBPP.
    • LLaMA > LaMDA and PaLM
  • Massive Multitask Language Understanding (MMLU)
    • MCQs on humanities, STEM, and social sciences.
    • Instruction-tuned LLaMA-I 65B leads to better results

KR 2024: LLMs are Human-like Annotators

17

18 of 143

LLaMA 2

  • A collection of pretrained and fine-tuned LLMs: 7B, 13B, 34B, and 70B parameters.
  • Fine-tuned LLMs: Llama 2-Chat.
  • Llama 2 models outperform open-source chat models on most benchmarks.
  • Released under a license permitting commercial use.

Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi and others. July 2023.

KR 2024: LLMs are Human-like Annotators

18

Helpfulness human evaluation on ~4k prompts

19 of 143

LLaMA 3

KR 2024: LLMs are Human-like Annotators

19

Meta Llama 3 Instruct model

Meta Llama 3 pretrained model

20 of 143

LLaMA 3.2

KR 2024: LLMs are Human-like Annotators

20

Vision models (Llama 3.2 11B and 90B vs. Claude 3 Haiku and GPT-4o mini):

| Category | Benchmark | Llama 3.2 11B | Llama 3.2 90B | Claude 3 Haiku | GPT-4o mini |
|---|---|---|---|---|---|
| Image: College-level Problems and Mathematical Reasoning | MMMU (val, 0-shot CoT, micro avg accuracy) | 50.7 | 60.3 | 50.2 | 59.4 |
| | MMMU-Pro, Standard (10 opts, test) | 33 | 45.2 | 27.3 | 42.3 |
| | MMMU-Pro, Vision (test) | 23.7 | 33.8 | 20.1 | 36.5 |
| | MathVista (testmini) | 51.5 | 57.3 | 46.4 | 56.7 |
| Image: Charts and Diagram Understanding | ChartQA (test, 0-shot CoT, relaxed accuracy) | 83.4 | 85.5 | 81.7 | - |
| | AI2 Diagram (test) | 91.1 | 92.3 | 86.7 | - |
| | DocVQA (test, ANLS) | 88.4 | 90.1 | 88.8 | - |
| Image: General VQA | VQAv2 (test) | 75.2 | 78.1 | - | - |
| Text: General | MMLU (0-shot, CoT) | 73 | 86 | 75.2 | 82 |
| Text: Math | MATH (0-shot, CoT) | 51.9 | 68 | 38.9 | 70.2 |
| Text: Reasoning | GPQA (0-shot, CoT) | 32.8 | 46.7 | 33.3 | 40.2 |
| Text: Multilingual | MGSM (0-shot, CoT) | 68.9 | 86.9 | 75.1 | 87 |

Lightweight text models (Llama 3.2 1B and 3B vs. Gemma 2 2B IT and Phi-3.5-mini IT):

| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5-mini IT |
|---|---|---|---|---|---|
| General | MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69 |
| | Open-rewrite eval (0-shot, rougeL) | 41.6 | 40.1 | 31.2 | 34.5 |
| | TLDR9+ (test, 1-shot, rougeL) | 16.8 | 19 | 13.9 | 12.8 |
| | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Tool Use | BFCL V2 | 25.7 | 67 | 27.4 | 58.4 |
| | Nexus | 13.5 | 34.3 | 21 | 26.1 |
| Math | GSM8K (8-shot, CoT) | 44.4 | 77.7 | 62.5 | 86.2 |
| | MATH (0-shot, CoT) | 30.6 | 48 | 23.8 | 44.2 |
| Reasoning | ARC Challenge (0-shot) | 59.4 | 78.6 | 76.7 | 87.4 |
| | GPQA (0-shot) | 27.2 | 32.8 | 27.5 | 31.9 |
| | Hellaswag (0-shot) | 41.2 | 69.8 | 61.1 | 81.4 |
| Long Context | InfiniteBench/En.MC (128k) | 38 | 63.3 | - | 39.2 |
| | InfiniteBench/En.QA (128k) | 20.3 | 19.8 | - | 11.3 |
| | NIH/Multi-needle | 75 | 84.7 | - | 52.7 |
| Multilingual | MGSM (0-shot, CoT) | 24.5 | 58.2 | 40.2 | 49.8 |

21 of 143

GPT-4

  • GPT-4 exhibits human-level performance on the majority of the professional and academic exams evaluated in the technical report.
  • It passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers.

GPT-4 Technical Report. OpenAI. https://cdn.openai.com/papers/gpt-4.pdf

KR 2024: LLMs are Human-like Annotators

21

22 of 143

Math word problems and reasoning QA

KR 2024: LLMs are Human-like Annotators

22

23 of 143

Math word problems and reasoning QA

KR 2024: LLMs are Human-like Annotators

23

24 of 143

Chart understanding and reasoning over data

KR 2024: LLMs are Human-like Annotators

24

Spot a data point that stands out in these charts and what that implicates. Then produce a detailed markdown table for all the data shown.

25 of 143

Image understanding and reasoning

KR 2024: LLMs are Human-like Annotators

25

Geometrical reasoning

26 of 143

Information seeking about objects

  • Prompt: Do you know the precise location where this image was taken?
  • Model Response: This image was taken in New York City. The Empire State Building is visible in the background. The street in the foreground is 8th Avenue. The cross street is West 34th Street.

KR 2024: LLMs are Human-like Annotators

26

Multimodal reasoning based on visual cues

27 of 143

Multimodal humor understanding

  • Prompt: Can you explain this meme?
  • Model Response: The meme is about a person who is playing a game at 300 FPS, but their monitor is only 75Hz. The person is trying to catch the game, but it is too fast for their monitor to display. The meme is funny because it is relatable to anyone who has ever played a game on a monitor that is not fast enough.

KR 2024: LLMs are Human-like Annotators

27

Commonsense reasoning in a multilingual setting

  • Prompt: Explain what i should call my grandparents on each side of my family based on this chart
  • Model Response: On your father’s side, you would call your grandparents Yéye and Nainai. On your mother’s side, you would call your grandparents Wàigong and Wàipó.

28 of 143

Reasoning and code generation

Create a web app called "Opossum Search":

1. Every time you make a search query, it should redirect you to a google search with the same query, but the word opossum before it.

2. It should be visually similar to Google search,

3. Instead of the google logo, it should have a picture of an opossum from the internet.

4. It should be a single html file, no separate js or css files.

5. It should say "Powered by google search" in the footer

KR 2024: LLMs are Human-like Annotators

28

29 of 143

Mathematics: Calculus

KR 2024: LLMs are Human-like Annotators

29

Video understanding and reasoning

  • Prompt (video)
    • How could this person improve their technique?
  • Model Response
    • This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg.

30 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

30

31 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

31

32 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

32

33 of 143

LLMs with CoT are Non-Causal Reasoners

  • CoT and Answer do not fully align.

KR 2024: LLMs are Human-like Annotators

33

  • ICL, SFT, and RLHF significantly impact the causal relations.

34 of 143

LLMs with CoT are Non-Causal Reasoners

  • Structural Causal Model
  • Larger LLMs tend to possess optimal SCM types
  • Instruction (Z): task outline, a few examples, question.
  • CoT (X): step-by-step reasoning process
  • Answer (Y)
  • Ideally, Y is fully determined by X.

KR 2024: LLMs are Human-like Annotators

34

35 of 143

Automatic Reasoning Chain Evaluation


KR 2024: LLMs are Human-like Annotators

35

36 of 143

Automatic Reasoning Chain Evaluation

  • GSM8K and Multistep-Arithmetics
    • SocREval (Socratic method): ask GPT-4 to generate a reference reasoning chain before evaluation; 1-shot exemplar.
    • Direct (trace): ask GPT-4 to directly evaluate a reasoning chain; 3-shot exemplars.
    • Direct (step): ask GPT-4 to check the reasoning step by step; 3-shot exemplars.
    • CoT (step): ask GPT-4 to generate a reasoning process before evaluating each reasoning step; 3-shot exemplars.

Common types of false positive reasoning chains detected by AutoRace

KR 2024: LLMs are Human-like Annotators

36

37 of 143

Dynamic Program Prompting and Program Distillation


KR 2024: LLMs are Human-like Annotators

37

38 of 143

Dynamic Program Prompting and Program Distillation

Example prediction and retrieved program samples

KR 2024: LLMs are Human-like Annotators

38

39 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

39

40 of 143

MoT: Memory-of-Thought

  • Finetuning is expensive.
  • Memory helps humans improve themselves in decision-making, reasoning, judgment, etc.
  • MoT can improve LLMs via pre-thinking and recalling.
  • Pre-thinking
    • The LLM pre-thinks on the unlabeled dataset
      • Few-Shot-CoT with a multiple-path decoding strategy
      • Each demo has a question, rationale, and answer.
      • Get multiple <rationale, answer> pairs and choose the majority-voted answer
    • Save the high-confidence thoughts as external memory
      • Answer entropy is used to filter out uncertain thoughts (see the sketch below)
      • Store <question, rationale, answer> as memory.
  • Recalling at test time
    • Given a test question, the LLM recalls relevant memory to help itself reason and answer it.
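A minimal sketch of the pre-thinking filter described above: sample several <rationale, answer> pairs per unlabeled question, keep the majority answer, and discard high-entropy (uncertain) cases. The `filter_thought` helper and the entropy threshold are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Entropy (in bits) of the sampled answer distribution; higher = more uncertain."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_thought(samples: list[tuple[str, str]], max_entropy: float = 1.0):
    """samples: sampled (rationale, answer) pairs for one unlabeled question.
    The max_entropy threshold is an assumed hyperparameter."""
    answers = [a for _, a in samples]
    if answer_entropy(answers) > max_entropy:
        return None                      # too uncertain: do not memorize this thought
    majority = Counter(answers).most_common(1)[0][0]
    rationale = next(r for r, a in samples if a == majority)
    return {"rationale": rationale, "answer": majority}

samples = [
    ("...so 5 + 6 = 11", "11"),
    ("...hence 11", "11"),
    ("...5 plus 6 is 11", "11"),
    ("...thus 12", "12"),
]
print(filter_thought(samples))  # kept: low-entropy majority answer "11"
```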

KR 2024: LLMs are Human-like Annotators

40

41 of 143

MoT: Memory-of-Thought

  • Recalling
    • Cluster memory items. Get the top-k semantically relevant memory candidates from each cluster using SBERT (see the sketch below).
    • Let the LLM choose the best memory candidates from each cluster.
    • Few-Shot-CoT with these memory items as extra context.
  • Baselines
    • MoT (no rationale): removes rationales from the retrieved memory, so the LLM directly outputs the answer
    • MoT (no thinking): keeps rationales in the retrieved memory but forces the LLM to answer directly, without CoT, at the recall stage.

MoT exceeds Few-Shot-CoT and Zero-Shot-CoT
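A sketch of the recall stage under stated assumptions (SBERT via the `sentence-transformers` package, k-means clustering); the memory items, model name, and cluster count are illustrative, not the paper's configuration.

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.cluster import KMeans

# Memorized <question, rationale, answer> triples saved during pre-thinking.
memory = [
    {"question": "If a pen costs $2, how much do 4 pens cost?",
     "rationale": "4 x 2 = 8.", "answer": "8"},
    {"question": "Tom had 10 marbles and lost 3. How many remain?",
     "rationale": "10 - 3 = 7.", "answer": "7"},
    {"question": "A train travels 60 km/h for 2 hours. What distance is covered?",
     "rationale": "60 x 2 = 120.", "answer": "120"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
mem_emb = encoder.encode([m["question"] for m in memory], convert_to_tensor=True)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(mem_emb.cpu().numpy())

def recall(test_question: str, k_per_cluster: int = 1):
    """Return the top-k most similar memory items from each cluster."""
    q_emb = encoder.encode(test_question, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, mem_emb)[0].cpu().numpy()
    picked = []
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        idx.sort(key=lambda i: sims[i], reverse=True)
        picked.extend(memory[i] for i in idx[:k_per_cluster])
    return picked  # used as few-shot CoT demonstrations for the test question

print(recall("A book costs $5; what do 3 books cost?"))
```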

KR 2024: LLMs are Human-like Annotators

41

42 of 143

Selective annotation and Prompt retrieval

  • Unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate (see the sketch below).
    • Build a graph with each sample as a node and SBERT similarity to define edges.
    • Choose high-degree nodes, with degree discounting for nodes close to already-chosen ones.
    • Then choose samples from each of 10 model-confidence buckets.
  • LLM: GPT-J with 6B parameters.
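A rough sketch of the graph-based selection idea (the first stage of vote-k): build a k-NN graph from embedding similarities and repeatedly pick the node with the highest discounted degree. The scoring constants and the random placeholder embeddings are assumptions, and the confidence-bucket stage is omitted.

```python
import numpy as np

def vote_k_select(sim: np.ndarray, budget: int, knn: int = 3, discount: float = 10.0):
    """Greedy selection of `budget` diverse, representative nodes from a similarity matrix."""
    n = sim.shape[0]
    # k-NN graph: each node's most similar neighbors (index 0 is the node itself).
    neighbors = [set(np.argsort(-sim[i])[1:knn + 1]) for i in range(n)]
    selected: list[int] = []
    for _ in range(budget):
        scores = []
        for v in range(n):
            if v in selected:
                scores.append(-np.inf)
                continue
            score = 0.0
            for u in range(n):
                if v in neighbors[u]:
                    # discount votes from nodes already "covered" by the selection
                    overlap = len(neighbors[u] & set(selected))
                    score += discount ** (-overlap)
            scores.append(score)
        selected.append(int(np.argmax(scores)))
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))                       # placeholder embeddings (e.g., SBERT)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(vote_k_select(emb @ emb.T, budget=5))           # indices of examples to annotate
```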

KR 2024: LLMs are Human-like Annotators

42

ICL perf over varying annotation budgets for HellaSwag commonsense reasoning. LLM=GPT-J

100 annotated examples

Example from HellaSwag CommonSense Reasoning

43 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

43

44 of 143

Analogical prompting

  • CoT needs labeled exemplars of the reasoning process.
  • To solve problems, humans think about related problems or high-level knowledge.
  • Analogical prompting
    • Prompts LLMs to self-generate relevant exemplars in context, before proceeding to solve the given problem.
    • Avoids the need for labeled exemplars and can tailor the generated exemplars and knowledge to each problem.

KR 2024: LLMs are Human-like Annotators

44

45 of 143

Analogical prompting methods

  • Self-generated exemplars
    • # Problem: [x]
    • # Relevant problems: Recall three relevant and distinct problems. For each problem, describe it and explain the solution.
    • # Solve the initial problem:
  • Self-generated knowledge + exemplars
    • # Tutorial: Identify core concepts in the problem and provide a tutorial.
    • Generating knowledge before exemplars yields superior results (see the prompt sketch below)

Big Bench reasoning tasks with GPT3.5-Turbo
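A sketch of what such a self-generation prompt can look like; the wording paraphrases the slide's template and the problem is illustrative.

```python
# Analogical-prompting style instruction: ask the model to produce its own
# tutorial and exemplars before solving the target problem. No labeled
# exemplars are required.

problem = "What is the area of a square with a diagonal of 10 units?"

prompt = f"""# Problem: {problem}

# Tutorial: Identify the core concepts in the problem and provide a tutorial.

# Relevant problems: Recall three relevant and distinct problems. For each,
# describe it and explain its solution.

# Solve the initial problem:
"""
print(prompt)  # send to the LLM as a single prompt
```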

KR 2024: LLMs are Human-like Annotators

45

46 of 143

LeanReasoner: Offloading reasoning to Lean

  • Lean: a theorem-proving framework used as a symbolic solver.
  • Offloading reasoning to Lean reduces the risk of logical inconsistencies.
  • SOTA performance on FOLIO and ProofWriter.
  • Fine-tuning on <100 in-domain samples for each dataset.

KR 2024: LLMs are Human-like Annotators

46

47 of 143

LeanReasoner: Offloading reasoning to Lean

  • Logical-reasoning contextual MCQ task: the input is a context paragraph, a question, and answer options.
  • A formalizer (LLM) converts the context and question into a formalized context and a formalized question (a minimal Lean example follows after this list).
  • Tactic Generator
    • ReProver model
      • Dense Passage Retriever to select premises given the current goal
      • A ByT5 generator that produces tactics from the goal and the retrieved premises.
  • Proof search
    • Build a proof tree that incrementally evolves the goal through tactic invocations.
    • Expand the goals where the generative model has the highest global confidence.
  • GPT-4 Base: few-shot examples include both formalization and proof generation in a single prompt.
  • GPT-4 Base Comments: augment these examples with line-by-line comments in the Lean code.
  • GPT-4 Base Separate: separate prompts for formalization and proof generation.
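For intuition, a minimal hypothetical example (not taken from the paper) of the kind of formalized context, question, and proof that Lean can check for a ProofWriter-style problem.

```lean
-- Hypothetical ProofWriter-style problem, formalized and proved in Lean 4.
-- Context: "All dogs are mammals."  "Rex is a dog."
-- Question: "Is Rex a mammal?"  The proof applies the rule to the fact.
theorem rex_is_mammal
    (Entity : Type) (Dog Mammal : Entity → Prop) (rex : Entity)
    (h_rule : ∀ x, Dog x → Mammal x)   -- formalized context: the rule
    (h_fact : Dog rex)                  -- formalized context: the fact
    : Mammal rex :=                     -- formalized question as a goal
  h_rule rex h_fact
```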

KR 2024: LLMs are Human-like Annotators

47

48 of 143

LeanReasoner: Offloading reasoning to Lean

Sample proofs created by LeanReasoner without pretraining (left), finetuned on Intuitive data (middle), and finetuned on Concise data (right).

KR 2024: LLMs are Human-like Annotators

48

49 of 143

Event Relation Logical Prediction

  • Coreference ∈ {Yes, No}
  • Temporal ∈ {NO_TEMPORAL, BEFORE, OVERLAP, CONTAINS, SIMULTANEOUS, ENDS-ON, BEGINS-ON}
  • Causal ∈ {NO_CAUSAL, PRECONDITION, CAUSE}
  • Subevent ∈ {Yes, No}
  • ChatGPT incorrectly predicts the temporal and causal relations between events “FIRE” and “collapsed” as “simultaneous” and “cause”.
  • 60% of the answers from ChatGPT on the MAVEN-ERE (event relation extraction) dataset are logically inconsistent.

KR 2024: LLMs are Human-like Annotators

49

50 of 143

Enabling LLMs for Event Relation Logical Prediction

  • 3 ways of logic acquisition
    • Generative-based approach
      • Encourages LLMs to generate the rationale themselves.
      • Uncertainty in the generated rationales may introduce biases, leading to an incorrect subsequent answer.
    • Retrieval-based approach
      • Collects constraints from realistic data.
      • Retrieves relevant contents and adds them to the LLM instruction.
      • Ensures correctness of the logic.
    • Finetuning-based approach
      • Constructs a high-order event relation logical prediction dataset (LLM-ERL).
      • Fine-tunes specialized LLMs.

KR 2024: LLMs are Human-like Annotators

50

51 of 143

Event Relation Logical Prediction Results

KR 2024: LLMs are Human-like Annotators

51

52 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

52

53 of 143

Symbolic reasoning for math word problems

  • CoT-based explanations are neither verifiable nor concise.
  • Symbolic evaluation
    • SVAMP → SVAMP-Sym
    • GPT-3’s davinci-002 achieves good zero-shot accuracy on symbolic problems (64.2%), comparable to the numeric accuracy of 68.9%
  • Self-prompting
    • Prompt the LLM with the numeric problem and its response to that problem, then ask it to solve the symbolic problem.

KR 2024: LLMs are Human-like Annotators

53

54 of 143

Symbolic reasoning for math word problems

  • “Raw”: evaluation on the SVAMP (SVAMP-Sym) dataset for numeric (symbolic) MWPs
  • (-F): output before the filtering step.
  • “SP” = self-prompting; “SP + AP” = two-stage self-prompting where an additional “Alignment Prompt” is added
  • Sim: similarity between the full responses for numeric and symbolic problems.
  • CoT is better; SP improves similarity; variables (i, j, k, l) give slightly worse accuracy; SP + AP is the best.

KR 2024: LLMs are Human-like Annotators

54

55 of 143

Symbolic Rule Learning for Robust Numerical Reasoning

  • Numerical reasoning for machine reading comprehension (RC) remains a difficult challenge.
  • ICL with LLMs to decompose complex questions into simpler sub-questions that are easier to answer with single-span RC models.
  • Symbolic learning methods like ILASP learn rules for recomposing the partial answers.
  • Benefits
    • Data efficiency: no training or fine-tuning.
    • The neuro-symbolic approach yields robust numerical reasoning.
    • Interpretable and verifiable reasoning traces.

KR 2024: LLMs are Human-like Annotators

55

56 of 143

Symbolic Rule Learning for Robust Numerical Reasoning

  • Few-shot (3-shot) examples
  • 300 examples from the DROP training set, annotated with CoT reasoning traces and question decompositions.
  • 2 approaches to select few-shot examples for each test question (see the sketch below)
    • KNN of the complex test question in the embedding space of the annotated questions, using sentence embeddings
    • Define a canonical set of examples from the annotated questions for each type, and transform the task into type prediction using Alpaca 7B.
      • Types: addition, subtraction, and negation
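A sketch of the KNN-based selection (the first approach above), assuming a `sentence-transformers` encoder; the annotated questions, decompositions, and model name are illustrative placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# Embed the annotated questions once, then pick the nearest annotated questions
# to each complex test question as its few-shot decomposition examples.
annotated = [
    {"question": "How many more yards was the longest than the shortest field goal?",
     "decomposition": "longest field goal? ; shortest field goal? ; subtract"},
    {"question": "How many total points were scored in the first half?",
     "decomposition": "points in Q1? ; points in Q2? ; add"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
bank = encoder.encode([a["question"] for a in annotated], convert_to_tensor=True)

def select_exemplars(test_question: str, k: int = 1):
    q = encoder.encode(test_question, convert_to_tensor=True)
    scores = util.cos_sim(q, bank)[0]
    top = scores.argsort(descending=True)[:k]
    return [annotated[int(i)] for i in top]

print(select_exemplars("How many more touchdowns did the Jets score than the Bills?"))
```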

KR 2024: LLMs are Human-like Annotators

56

57 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

57

58 of 143

KR 2024: LLMs are Human-like Annotators

58

What is reasoning?

  • Reasoning is the ability to make inferences using evidence and logic.
  • Reasoning can be divided into multiple skills, such as commonsense, mathematical, and symbolic reasoning.
  • Often, reasoning involves deductions from inference chains, called multi-step reasoning.

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Do language models truly understand and apply common sense reasoning?

59 of 143

KR 2024: LLMs are Human-like Annotators

59

Reasoning Problems

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Hard Language Tasks: require multiple steps of reasoning to solve

60 of 143

KR 2024: LLMs are Human-like Annotators

60

Multi-step reasoning is often seen as a weakness in language models

Towards Reasoning in Large Language Models: A Survey (Huang et al., 2023)

  • It is unclear to what extent LLMs are capable of reasoning

Earlier research elicited reasoning in small language models through fully supervised finetuning on specific datasets

  • Creating a dataset containing explicit reasoning can be difficult and time-consuming
  • Training on a specific dataset limits application to a specific domain

Reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters

61 of 143

KR 2024: LLMs are Human-like Annotators

61

Reasoning and Commonsense Benchmarks

Source: https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond#different-types-of-llm-benchmarks

  • How can we assess reasoning abilities in language models?
  1. TruthfulQA — Truthfulness
  2. MMLU — Language understanding
  3. HellaSwag — Commonsense reasoning
  4. BIG-Bench Hard — Challenging reasoning tasks
  5. HumanEval — Coding challenges
  6. CodeXGLUE — Programming tasks
  7. Chatbot Arena — Human-ranked ELO-based benchmark
  8. MT Bench — Complex conversational ability

62 of 143

KR 2024: LLMs are Human-like Annotators

62

How is reasoning measured (in the literature)?

GPT-4 Technical Report (OpenAI).

  • While SOTA models at the time struggled to score above 50% with pre-training, GPT-4 achieved a record-high 95.3% on HellaSwag with just 10-shot prompting in 2023.

63 of 143

KR 2024: LLMs are Human-like Annotators

63

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

Prompt: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

A model without CoT may answer: 11 apples (incorrect; the correct answer is 10).

  • Keep in mind that the authors claim chain-of-thought reasoning is an emergent ability that arises with sufficiently large language models.

64 of 143

KR 2024: LLMs are Human-like Annotators

64

Chain of thought prompting: Arithmetic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Does not positively impact performance for small models
  • Few-shot CoT achieves better performance than zero-shot CoT on LLMs.
  • 8 manually composed exemplars
  • All contain equations with flexible formats
  • Benchmarked on:
    • GSM8K (Cobbe et al., 2021)
    • SVAMP (Patel et al., 2021)
    • MAWPS (Koncel-Kedziorski et al., 2016)

65 of 143

KR 2024: LLMs are Human-like Annotators

65

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Chain-of-thought prompting improves performance on a range of arithmetic reasoning tasks

66 of 143

KR 2024: LLMs are Human-like Annotators

66

Chain of thought prompting: Symbolic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Standard prompting fails out-of-domain tests for both tasks
  • Zero-shot CoT with InstructGPT-3 175B achieves performance similar to few-shot CoT with the 540B PaLM model on both tasks

67 of 143

KR 2024: LLMs are Human-like Annotators

67

Chain of thought prompting: Commonsense Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • CoT shows minimal benefits on the CSQA and StrategyQA tasks
  • Few-shot CoT achieves better performance than zero-shot CoT with the 175B GPT-3 model on CSQA and StrategyQA, but zero-shot CoT shows a significant improvement on the Date Understanding task.

68 of 143

KR 2024: LLMs are Human-like Annotators

68

More Advances: Self consistency

Self-consistency improves chain of thought reasoning in language models. (Wang et al., 2022)

  • Replace greedy decoding (a single reasoning path) with self-consistency (sample multiple paths and take the majority-voted answer) in few-shot CoT (see the sketch below)
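A minimal sketch of the aggregation step: the completions below stand in for multiple CoT paths sampled at non-zero temperature, and the answer-extraction regex is an assumption about the output format.

```python
import re
from collections import Counter

# Illustrative stand-ins for several sampled CoT completions to the same prompt.
sampled_completions = [
    "There are 2 cans of 3 balls, so 6 balls. 5 + 6 = 11. The answer is 11.",
    "5 balls plus 6 new balls gives 11. The answer is 11.",
    "5 + 2 * 3 = 12. The answer is 12.",
]

def final_answer(completion: str) -> str | None:
    """Extract the final numeric answer from a completion (assumed format)."""
    match = re.search(r"The answer is\s*(-?\d+)", completion)
    return match.group(1) if match else None

votes = Counter(a for a in map(final_answer, sampled_completions) if a is not None)
print(votes.most_common(1)[0][0])  # majority-voted answer: "11"
```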

69 of 143

KR 2024: LLMs are Human-like Annotators

69

STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning

STaR: Self-Taught Reasoner. (Zelikman et al., 2022)

  • Self-Taught Reasoner (STaR) iteratively improves a model’s ability to generate rationales to solve problems.
  • Participants were 74% more likely to prefer the STaR-generated rationales over the human-generated rationales

70 of 143

KR 2024: LLMs are Human-like Annotators

70

Program-aided Language Models (PAL)

PAL: Program-aided Language Models (Gao et al., 2023)

  • PAL enhances problem-solving by generating code to represent the intermediate reasoning steps (in contrast to CoT prompting, which reasons in natural language); see the sketch below.
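A minimal sketch of the idea: `generated_code` stands in for the program an LLM would emit for a math word problem, and the Python interpreter, not the model, computes the final answer.

```python
# The LLM writes Python for the intermediate reasoning steps; the interpreter
# executes it, so the arithmetic is offloaded from the model.
generated_code = """
# Roger starts with 5 tennis balls.
tennis_balls = 5
# He buys 2 cans of 3 tennis balls each.
bought_balls = 2 * 3
answer = tennis_balls + bought_balls
"""

namespace: dict = {}
exec(generated_code, namespace)   # run the model-written program
print(namespace["answer"])        # 11
```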

71 of 143

KR 2024: LLMs are Human-like Annotators

71

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA models exhibit remarkable improvements over previous state-of-the-art approaches across all scales

72 of 143

KR 2024: LLMs are Human-like Annotators

72

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA utilizes GPT-4 to synthesize high-quality tool-use trajectories on mathematical reasoning datasets.

73 of 143

KR 2024: LLMs are Human-like Annotators

73

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA provides valuable insights for future research, contributing to the development of more advanced and versatile reasoning agents.

74 of 143

KR 2024: LLMs are Human-like Annotators

74

Plan-and-Solve Prompting

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (Wang et al., 2023)

  1. Zero-shot CoT prompting
  2. Plan-and-Solve (PS) prompting
  3. Answer-extraction prompting
  • Zero-shot PS+ prompting consistently outperforms zero-shot CoT across all reasoning datasets by a large margin (see the prompt sketch below)
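A sketch of the zero-shot PS prompt and the answer-extraction step; the trigger wording is paraphrased from the paper and the question is illustrative.

```python
# Plan-and-Solve style zero-shot prompting: devise a plan, carry it out,
# then extract the final answer with a follow-up prompt.
question = "In a class of 30 students, 60% passed. How many students failed?"

ps_prompt = (
    f"Q: {question}\n"
    "A: Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)

answer_extraction_prompt = "Therefore, the answer (arabic numerals) is"
# 1) send ps_prompt to the LLM, 2) append its reasoning plus the extraction
#    prompt, 3) read off the final numeric answer (here: 12).
print(ps_prompt)
```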

75 of 143

KR 2024: LLMs are Human-like Annotators

75

Can we use LLMs to benchmark reasoning datasets?

  • With models like GPT-4, it's now possible to synthetically produce datasets that are more comprehensive and diverse than human-labeled ones, in far less time

76 of 143

KR 2024: LLMs are Human-like Annotators

76

Reasoning datasets: CriticBench

  • CRITICBENCH is designed to assess the two key aspects of LLMs’ critical reasoning:
    • critique
    • correction

77 of 143

KR 2024: LLMs are Human-like Annotators

77

Reasoning datasets: Question collection on CriticBench

Question collection:

  • Randomly sample questions from existing datasets

Response collection from LLMs:

  • Employ various LLMs, including GPT-3.5, GPT-4, LLaMA-2 (7B, 13B, and 70B variants), and Vicuna (7B, 13B, and 33B variants)
  • Each LLM generates a response for each question and provides its reasoning

Response annotation:

  • Response correctness:
    • rule-based matching, followed by a more detailed evaluation using GPT-4 and humans

Domains:

  • 15 datasets spanning 5 domains: mathematical, commonsense, symbolic, coding, and algorithmic.

78 of 143

KR 2024: LLMs are Human-like Annotators

78

Reasoning datasets: Evaluation process on CriticBench

79 of 143

KR 2024: LLMs are Human-like Annotators

79

Reasoning datasets: Annotation example of CriticBench

  • GPT-4 evaluation is closer to human evaluation

80 of 143

KR 2024: LLMs are Human-like Annotators

80

Reasoning datasets: Key Factors in Critical Reasoning

  • LLMs struggle more with incorrect answers in detail-oriented tasks, such as algorithmic tasks, compared to logic-centric tasks such as code generation

81 of 143

KR 2024: LLMs are Human-like Annotators

81

Reasoning datasets: Average performance on CriticBench

  • The knowledge acquired by LLMs is not entirely consistent across generation, critique, and correction tasks.

82 of 143

KR 2024: LLMs are Human-like Annotators

82

Reasoning datasets: Consistency of GQC Knowledge

  • Analysis across different task types found that models perform better at critique (Q) and correction (C) for tasks focused on logic than for those requiring attention to detail.

83 of 143

KR 2024: LLMs are Human-like Annotators

83

Human preference benchmarks with reasoning tasks

  • How do we construct a benchmark that consists of real-world multi-turn conversations?
  • How do we automatically evaluate the model’s responses?

Why human preference benchmarks?

  • Existing NLP benchmarks fail to accurately evaluate LLM chat assistants
  • Knowledge benchmarks (e.g., MMLU)
    • Not open-ended
  • Instruction benchmarks (e.g., Super-Natural Instructions)
    • Only one turn
    • Require human annotation effort
  • Conversational benchmarks (e.g., CoQA)
    • Too easy for state-of-the-art LLMs

84 of 143

KR 2024: LLMs are Human-like Annotators

84

Human preference benchmarks: LLMs as judges

  • LLM judges provide not only scores but also explanations, offering interpretable outputs and valuable insights.

Four popular benchmarks

  • MT-Bench
  • Chatbot Arena
  • MT-Bench-101
  • HumanEval

85 of 143

KR 2024: LLMs are Human-like Annotators

85

LLM-as-a-judge

  • The agreement between GPT-4 judges and humans reaches over 80%, the same level as agreement among humans (see the judge-prompt sketch below).
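A hedged sketch of a pairwise judge prompt in the spirit of MT-Bench; the wording is illustrative, not the benchmark's exact template.

```python
# Pairwise LLM-as-a-judge prompt: the judge sees the question and two answers
# and must pick one; positions are swapped across two calls to reduce position bias.
JUDGE_TEMPLATE = """You are an impartial judge. Given the user question and two
assistant answers, decide which answer is better. Consider helpfulness,
relevance, accuracy, and level of detail. Do not let answer length or the
order of presentation influence you.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Output exactly one of: "A", "B", or "tie", followed by a one-sentence reason."""

prompt = JUDGE_TEMPLATE.format(
    question="Explain why the sky is blue.",
    answer_a="Rayleigh scattering: shorter wavelengths scatter more strongly.",
    answer_b="Because the ocean reflects onto the sky.",
)
print(prompt)  # send once as (A, B) and once as (B, A), then reconcile verdicts
```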

86 of 143

KR 2024: LLMs are Human-like Annotators

86

Limitations of LLM-as-a-judge

Position bias

  • LLMs favor answers in the first position

Verbosity bias

  • LLMs favor longer answers

Self-appreciation bias

  • LLMs favor their own answers, or answers similar to their own

Limited reasoning ability

  • LLMs fail to judge hard math/reasoning/code questions
  • It is crucial to acknowledge the limitations of LLM-as-a-judge, such as its inability to detect hallucinations and penalize LLM-generated answers accordingly, and potential errors when grading math/reasoning questions.

87 of 143

KR 2024: LLMs are Human-like Annotators

87

Human preference benchmarks: MT-Bench-101

  • The agreement between GPT-4 and human expert evaluations reached 87%, using the paper’s designed evaluation approach.

88 of 143

KR 2024: LLMs are Human-like Annotators

88

MT-Bench-101: Hierarchical Ability Taxonomy

  • GPT-4 was used to construct the data, as it is the most powerful model for multi-turn dialogues.

13 tasks

3-level abilities

89 of 143

KR 2024: LLMs are Human-like Annotators

89

MT-Bench-101: Model’s performance

  • Closed-source models consistently exhibit superior performance compared to open-source counterparts across all evaluated tasks.
  • Content confusion and format rephrasing are relatively less difficult, while the mathematical reasoning task is the most challenging

90 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

90

91 of 143

KR 2024: LLMs are Human-like Annotators

91

Why Focus on Evaluation

Source: https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/

92 of 143

KR 2024: LLMs are Human-like Annotators

92

LLM Evaluation vs. Human Evaluation

93 of 143

KR 2024: LLMs are Human-like Annotators

93

How to scale “human evaluation”?

  • Chatbot Arena lets users manually vote for the better response; to scale this up, multiple LLMs can be leveraged to autonomously determine which response stands out.

94 of 143

KR 2024: LLMs are Human-like Annotators

94

LLM Evaluation

[Figure] An LLM benchmark aggregates multiple evaluation configurations, each defined by shots, instruction, task, dataset, and metric; the model under evaluation is specified by its system prompt, system format, and hyperparameters.

95 of 143

KR 2024: LLMs are Human-like Annotators

95

LLM Evaluation

Task:

  • What concrete problem do we want to address (e.g., classification, summarization, commonsense reasoning)?

Dataset:

  • Which dataset do we want to use?

Metric:

  • How do we evaluate performance?

Instruction:

  • Evaluate the model’s ability to perform an unseen task given context in the form of instructions

Shot:

  • Whether the model is given prior examples at inference time (few-shot) or none (zero-shot)

96 of 143

KR 2024: LLMs are Human-like Annotators

96

LLM Evaluation

Model configuration: system prompt, system format, hyperparameters. Example chat format:

<SYS> You are a helpful model </SYS>

<instruction> Translate this sentence to French

<user> I like pizza

<assistant> J'aime la pizza

97 of 143

KR 2024: LLMs are Human-like Annotators

97

LLM Evaluation: Alpaca

  • GPT-4 based evaluation of chatbot output agrees well with human evaluation; known before
  • Surprisingly, judge Llama 2 (70B) can compete with judge GPT-4 (1760B). But, Judge Llama 2 is more biased (more ties) and less capable of following the instructions carefully.

98 of 143

KR 2024: LLMs are Human-like Annotators

98

LLM Evaluation: G-Eval

  • The LLM consistently gives higher scores to GPT-3.5 summaries, even though human judges prefer human-written summaries

G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)

  • Can we rely on LLMs to assess LLM outputs?
  • Do LLMs show a bias towards the outputs they generate during evaluations?

99 of 143

KR 2024: LLMs are Human-like Annotators

99

LLM Evaluation: GPT-Score

https://github.com/confident-ai/deepeval

100 of 143

KR 2024: LLMs are Human-like Annotators

100

Language model-written evaluations

https://github.com/confident-ai/deepeval

  • Generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size.
  • Overall, LM-written evaluations are high quality and quickly discover many novel LM behaviors.

101 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

101

102 of 143

Generate a synthetic dataset using LLMs

102

AutoLabel:

  • The LabelingAgent provides a method to generate synthetic datasets
  • Supports synthetic dataset generation for classification and entity-matching tasks

Prodigy:

  • Supports LLM-assisted annotation: you provide a prompt in order to annotate an NLP task.

Labelbox:

  • Easily curate and annotate data
  • Generate high-quality human feedback data for computer vision and language models; evaluate and improve model performance

LLM-data-annotation:

  • Combines human expertise with the efficiency of Large Language Models (LLMs) like OpenAI’s GPT-3.5 to simplify dataset annotation and model improvement.

103 of 143

AutoLabel

103

  • How do we use the Autolabel tool to annotate a task?

104 of 143

AutoLabel: Question Answering

104

  • First step: specify a labeling configuration

105 of 143

AutoLabel: Question Answering

105

  • Second step: do a dry run on the test dataset, using the LLM specified in config.json, by running agent.plan

106 of 143

AutoLabel: Question Answering

106

  • Final step: run the labeling with agent.run (see the end-to-end sketch below)
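Putting the three steps together, a hedged end-to-end sketch: the API usage follows the Autolabel README as recalled here, and the config keys, CSV file name, and guideline text are placeholders that should be checked against the current Autolabel docs.

```python
from autolabel import LabelingAgent, AutolabelDataset

# Config fields are recalled from the Autolabel docs and may need adjusting to
# the current schema; provider/model and the CSV path are placeholders.
config = {
    "task_name": "SquadQuestionAnswering",
    "task_type": "question_answering",
    "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
    "prompt": {
        "task_guidelines": "Answer the question using only the provided context.",
        "example_template": "Context: {context}\nQuestion: {question}\nAnswer: {answer}",
    },
}

agent = LabelingAgent(config)                       # step 1: labeling configuration
dataset = AutolabelDataset("squad_val.csv", config=config)
agent.plan(dataset)                                 # step 2: dry run (prompt/cost preview)
labeled = agent.run(dataset)                        # step 3: label with the configured LLM
```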

107 of 143

LLMs can label data as well as humans, but 100x faster

107

  • Refuel LLM (84.2%) outperforms trained human annotators (80.4%), GPT-3.5-turbo (81.3%), PaLM-2 (82.3%), and Claude (79.3%) across a benchmark of 15 text-labeling datasets.

108 of 143

LLMs can label data: Quality Evaluation

108

  • Refuel LLM outperforms human annotators and all closed-source LLMs, excluding GPT-4

109 of 143

LLMs can label data: Quality Evaluation

109

  • RefuelLLM-2 (83.82%) outperforms all current state-of-the-art LLMs for data labeling and enrichment, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%) and Gemini-1.5-Pro (74.59%)

110 of 143

LLMs can label data: Quality Evaluation

110

  • Refuel-LLM-2 is competitive or superior in terms of output quality, compared to current state-of-the-art LLMs

111 of 143

Prodigy

111

https://demo.prodi.gy/?=null&view_id=ner_manual

What Prodigy isn’t:

  • “Software as a service” – it’s a tool that you can download, install, and run yourself

Usage:

  • Annotate NLP, audio, and vision tasks

112 of 143

AutoLabel tools: Which one is better?

112

  • Autolabel currently supports chain-of-thought prompting, making it well suited for reasoning tasks.
  • Labeling tasks span categories such as classification, entity resolution, matching, reading comprehension, reasoning, and information extraction

113 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]

  • Reasoning Datasets with LLM Annotations [25 mins]
  • Evaluation of LLM Generated Annotations [25 mins]

  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

113

114 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

KR 2024: LLMs are Human-like Annotators

114

115 of 143

Hallucination

115

  • How did hallucination evolve in LLMs?
  • In the context of language models: the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.

116 of 143

Evolution of Hallucination in LLMs

116

  • LLMs simply generate text without knowing the specific details of the visit's purpose.

117 of 143

Evolution of Hallucination in LLMs

117

  • Despite their advanced capabilities, MLLMs, like all LLMs, are prone to hallucination

118 of 143

Evolution of Hallucination in LLMs

118

  • The problem of hallucination also exists in other foundation models, such as image, video, and audio models

119 of 143

Taxonomy of Hallucinations

119

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Like their text-only LLM counterparts, video- and audio-based language models are also prone to hallucinations.

120 of 143

Taxonomy of Hallucinations: Causes

120

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

121 of 143

Taxonomy of Hallucinations: Metrics and Benchmarks

121

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

122 of 143

Taxonomy of Hallucinations: Mitigation

122

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

123 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]

  • Reasoning Datasets with LLM Annotations [25 mins]
  • Evaluation of LLM Generated Annotations [25 mins]

  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

KR 2024: LLMs are Human-like Annotators

123

124 of 143

Hallucination Types

124

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Hallucination can occur in any NLG task, necessitating a thorough examination based on the fundamental principles of text generation from a given prompt.

125 of 143

Hallucination Types: Orientation, Category and Degree

125

126 of 143

Hallucination Types: Orientation

126

Factual Mirage:

  • The LLM engages in hallucination or distortion of a given prompt that is factually correct.

Intrinsic

Extrinsic

  • The generated text appears to be true or real but is actually misleading or false.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

127 of 143

Hallucination Types: Orientation

127

Silver Lining:

  • The LLM indulges in hallucination by conjuring an elaborate and captivating narrative based on a given prompt that is factually incorrect.

Intrinsic

Extrinsic

  • The LLM produces unfaithful content or content inconsistent with the provided source prompt.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

128 of 143

Hallucination Types: Category

128

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Time Wrap: the LLM generates text that exhibits a mashed fusion of events from different timelines.

129 of 143

Hallucination Types: Category

129

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Geographic Erratum: the LLM generates an incorrect location for an event.

130 of 143

Hallucination Types: Category

130

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Virtual Voice: the LLM generates quotations attributed to fictional or real characters without sufficient evidence to verify the authenticity of such statements.

131 of 143

Hallucination Types: Category

131

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Generated Golem: the LLM fabricates an imaginary personality in relation to a past event, without concrete evidence.

132 of 143

Hallucination Types: Category

132

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Numeric Nuisance: the LLM generates numeric values related to past events, such as dates, ages, or monetary amounts, that are inconsistent with the actual facts.

133 of 143

Hallucination Types: Category

133

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Acronym Ambiguity: the LLM generates an imprecise expansion for an acronym.

134 of 143

Hallucination Types: Degree

134

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Alarming: adds information that bears a radical dissemblance from the topic fed via the prompt.
  • Moderate: introduces facts that are either fictitious or tangential to the topic at hand.
  • Mild: minor hallucination that is superficial in terms of its impact.
  • How can we detect and evaluate LLMs’ hallucinations across different types?

135 of 143

Hallucination Detection: SelfCheckGPT

135

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)

  • SelfCheckGPT can effectively detect hallucinated sentences, with higher accuracy than several baseline methods (see the sketch below)
  • SelfCheckGPT’s prompting variant achieved the highest performance in detecting non-factual sentences
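A crude sketch of the consistency idea, using sentence similarity as a proxy (the paper's variants use BERTScore, QA, NLI, or prompting): a sentence in the main response that is poorly supported by independently sampled responses is flagged as a likely hallucination. The model name, sentences, and threshold are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

main_response_sentences = [
    "The Eiffel Tower is located in Paris.",
    "It was completed in 1975 by Gustave Eiffel's grandson.",
]
# Additional stochastic samples of the same answer from the same LLM.
sampled_responses = [
    "The Eiffel Tower stands in Paris and opened in 1889.",
    "Gustave Eiffel's company finished the tower in Paris in 1889.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sent_emb = encoder.encode(main_response_sentences, convert_to_tensor=True)
samp_emb = encoder.encode(sampled_responses, convert_to_tensor=True)

# Best support for each main-response sentence across the samples.
support = util.cos_sim(sent_emb, samp_emb).max(dim=1).values
for sentence, score in zip(main_response_sentences, support.tolist()):
    flag = "likely hallucinated" if score < 0.6 else "supported"  # assumed threshold
    print(f"{score:.2f}  {flag}: {sentence}")
```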

136 of 143

Hallucination Detection: FACTScore

136

"Factscore: Fine-grained atomic evaluation of factual precision in long form text generation (Sewon et.al, 2023)

  • GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.

137 of 143

HILT: HallucInation eLiciTation dataset

137

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • 15 LLMs were selected and used to generate a total of 75,000 text passages, with each LLM producing 5,000 text prose entries.
  • The text prompts provided to these LLMs consisted of tweets from NYTimes and headlines sourced from the Politifact dataset.

138 of 143

Hallucination Vulnerability Index (HVI)

138

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Larger LLMs without RLHF are prone to both orientations of hallucination
  • For smaller LLMs like T5, Dolly, etc., the Generated Golem, Virtual Voice, and Geographic Erratum categories of hallucination are rarely observed.

139 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

KR 2024: LLMs are Human-like Annotators

139

140 of 143

Hallucination Mitigation

140

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation (Varshney et al., 2023)

  • Leveraging external knowledge to validate correctness: RAG
  • Modifying the decoding strategy
  • Sampling multiple outputs and checking their consistency: SelfCheckGPT
  • The proposed active detection and mitigation approach reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average

141 of 143

Hallucination Mitigation: Chain-Of-Verification (CoVe)

141

Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023)

  • CoVe reduces inaccuracies in LLMs’ responses by verifying facts through structured questioning (see the sketch below).
  • Significant reduction in factual hallucinations.
  • Final responses are more reliable and factually accurate.
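A hedged sketch of the CoVe stages; `call_llm` is a placeholder for an LLM API, and the prompt wording is illustrative rather than the paper's exact templates.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: route this to your LLM provider."""
    raise NotImplementedError

# CoVe-style loop: draft -> plan verification questions -> answer them
# independently -> produce a verified final response.
def chain_of_verification(question: str) -> str:
    draft = call_llm(f"Answer the question.\nQuestion: {question}\nAnswer:")

    plan = call_llm(
        "List short fact-checking questions that would verify the answer below.\n"
        f"Question: {question}\nDraft answer: {draft}\nVerification questions:"
    )

    # Answer each verification question on its own, without showing the draft,
    # so the checks are not biased toward repeating its mistakes.
    checks = [
        f"Q: {q}\nA: {call_llm(q)}"
        for q in plan.splitlines() if q.strip()
    ]

    return call_llm(
        "Revise the draft so it is consistent with the verification results.\n"
        f"Question: {question}\nDraft: {draft}\n"
        "Verification results:\n" + "\n".join(checks) + "\nFinal answer:"
    )
```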

142 of 143

Is hallucination always bad?

142

https://www.washingtonpost.com/opinions/2023/12/27/artificial-intelligence-hallucinations/

  • Numerals: Models have been shown to hallucinate a lot while generating numerals, such as dates, quantities, and scalars.
  • Long Text: Models often tend to self-contradict while generating the output.
  • Reasoning: Misunderstanding facts/information present in the source text can lead to hallucinations and errors.
  • When Contextual Knowledge Conflicts with the Parametric Knowledge: Models have been shown to prioritize the parametric knowledge (acquired during pre-training) over the contextual knowledge which leads to hallucinations.

143 of 143

A big thank you!

  • Tutorial, Code and Material

KR 2024: LLMs are Human-like Annotators

143