1 of 153

LLMs are Human-like Annotators

Mounika Marreddy1, Subba Reddy Oota2, Manish Gupta3

1IIIT Hyderabad, India; 2TU Berlin, Germany; 3Microsoft, India

mounika0559@gmail.com, subba.reddy.oota@tu-berlin.de, gmanish@microsoft.com

ECIR 2025

The 47th European Conference on Information Retrieval

Apr 6-10, 2025. Lucca, Italy.

2 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]: Manish
  • Generating Annotations for NLP Tasks using LLMs [30 mins]: Manish
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]: Subba
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]: Subba
  • Autolabel Tools to Label Reasoning Datasets [20 mins]: Mounika
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]: Mounika

ECIR 2025: LLMs are Human-like Annotators

2

3 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

3

4 of 153

Deep Learning and Large Language Models

ECIR 2025: LLMs are Human-like Annotators

4

Basic: ANNs, CNNs, RNNs, LSTMs

NLP: Encoder-Decoder, Attention, Transformers, BERT, GPT, T0, BART, T5…

Prompt-based models: GPT-3, T0/mT0, InstructGPT, prompt tuning …

5 of 153

GPT-3

  • Humans do not require large supervised datasets to learn most language tasks
  • The following is enough:
    • A brief directive in natural language (e.g., “please tell me if this sentence describes something happy or something sad”)
    • A tiny number of demonstrations (e.g., “here are two examples of people acting brave; please give a third example of bravery”)
  • In-context learning (see the prompt sketch below)

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
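A minimal sketch of few-shot in-context learning as described above; the prompt wording is an illustrative assumption, and the resulting string would be sent to any text-completion endpoint:

```python
# Few-shot in-context learning: a brief natural-language directive plus a
# handful of demonstrations, packed into a single prompt (no weight updates).
directive = "Please tell me if this sentence describes something happy or something sad."

demonstrations = [
    ("We finally adopted a puppy!", "happy"),
    ("My flight got cancelled and I missed the wedding.", "sad"),
]

query = "She aced the exam she had studied for all month."

prompt = directive + "\n\n"
for sentence, label in demonstrations:
    prompt += f"Sentence: {sentence}\nLabel: {label}\n\n"
prompt += f"Sentence: {query}\nLabel:"

print(prompt)  # send this string to any text-completion endpoint
```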

ECIR 2025: LLMs are Human-like Annotators

5

6 of 153

InstructGPT

ECIR 2025: LLMs are Human-like Annotators

6

1. Supervised fine-tuning (SFT)

2. Reward model (RM) training

3. RL via proximal policy optimization (PPO) on RM

7 of 153

ChatGPT and Prompting

  • Generate labelled training data
    • Cheaper: Reduction in $ cost vs UHRS
    • Faster turnaround
    • Agility: large development savings on HitApp creation and judge training
    • Quality: Higher label quality
  • Directly use GPT models rather than train your own.
  • Prompt engineering: good task description; examples; multiple prompts help.
  • Can control output length, output language, output style.

ECIR 2025: LLMs are Human-like Annotators

7

8 of 153

Summarization

ECIR 2025: LLMs are Human-like Annotators

8

Question Answering

9 of 153

Machine Translation

ECIR 2025: LLMs are Human-like Annotators

9

Ads Copywriting

10 of 153

Machine Reading Comprehension

ECIR 2025: LLMs are Human-like Annotators

10

11 of 153

Solving reasoning problems

ECIR 2025: LLMs are Human-like Annotators

11

12 of 153

Loads of LLMs and SLMs

ECIR 2025: LLMs are Human-like Annotators

12

GPT-4o

OpenAI o1, o3 …

LLaMA 3, 3.2, 3.3, …

13 of 153

Small language models

  • Models based on LLaMA and Falcon
    • LLaMA: open-source models
    • Falcon: clean web data at scale
    • LoRA: fast finetuning with low-rank adaptation: Alpaca
    • Vicuna: Conversation tuning
    • Falcon Instruct, Alpaca: Instruction tuning
    • Orca: Explanation tuning
  • Models based on LLaMA 2
    • LLaMA 2: RLHF
    • Orca 2: Cautious Reasoning and Progressive learning
  • Models based on Mistral
    • Mistral: Sliding Window attention, Pre-fill and chunking, Rolling buffers
    • Mixtral: MoE
  • SOLAR: Depth-up scaling, alignment tuning, DPO
  • The Phi series: Clean, textbook-quality data

ECIR 2025: LLMs are Human-like Annotators

13

14 of 153

LLaMA 1

  • Mathematical reasoning
    • MATH: 12K middle school and high school math problems in LaTeX.
    • GSM8k: Middle school math problems.
    • LLaMA-65B > Minerva 62B
      • Minerva: PaLM models finetuned on 38.5B tokens from ArXiv and Math pages
  • Code generation
    • HumanEval and MBPP.
    • LLaMA > LaMDA and PaLM
  • Massive Multitask Language Understanding (MMLU)
    • MCQs on humanities, STEM and social sciences.
    • Instruction-tuned LLaMA-I 65B led to better results

ECIR 2025: LLMs are Human-like Annotators

14

15 of 153

LLaMA 2

  • A collection of pretrained and fine-tuned LLMs: 7B, 13B, 34B, 70B.
  • Fine-tuned LLMs: Llama 2-Chat.
  • Llama 2 models > open-source chat models on most benchmarks.
  • Commercial use license.

Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi and others. July 2023.

ECIR 2025: LLMs are Human-like Annotators

15

Helpfulness human evaluation on ~4k prompts

16 of 153

LLaMA 3

ECIR 2025: LLMs are Human-like Annotators

16

Meta Llama 3 instruct model

Meta Llama 3 pretrained model

17 of 153

LLaMA 3.2

ECIR 2025: LLMs are Human-like Annotators

17

Llama 3.2 vision models vs. comparable models:

| Category | Benchmark | Llama 3.2 11B | Llama 3.2 90B | Claude 3 Haiku | GPT-4o mini |
|---|---|---|---|---|---|
| Image: College-level Problems and Mathematical Reasoning | MMMU (val, 0-shot CoT, micro avg accuracy) | 50.7 | 60.3 | 50.2 | 59.4 |
| | MMMU-Pro, Standard (10 opts, test) | 33 | 45.2 | 27.3 | 42.3 |
| | MMMU-Pro, Vision (test) | 23.7 | 33.8 | 20.1 | 36.5 |
| | MathVista (testmini) | 51.5 | 57.3 | 46.4 | 56.7 |
| Image: Charts and Diagram Understanding | ChartQA (test, 0-shot CoT, relaxed accuracy) | 83.4 | 85.5 | 81.7 | - |
| | AI2 Diagram (test) | 91.1 | 92.3 | 86.7 | - |
| | DocVQA (test, ANLS) | 88.4 | 90.1 | 88.8 | - |
| Image: General VQA | VQAv2 (test) | 75.2 | 78.1 | - | - |
| Text: General | MMLU (0-shot, CoT) | 73 | 86 | 75.2 | 82 |
| Text: Math | MATH (0-shot, CoT) | 51.9 | 68 | 38.9 | 70.2 |
| Text: Reasoning | GPQA (0-shot, CoT) | 32.8 | 46.7 | 33.3 | 40.2 |
| Text: Multilingual | MGSM (0-shot, CoT) | 68.9 | 86.9 | 75.1 | 87 |

Llama 3.2 lightweight text models vs. comparable models:

| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5-mini IT |
|---|---|---|---|---|---|
| General | MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69 |
| | Open-rewrite eval (0-shot, rougeL) | 41.6 | 40.1 | 31.2 | 34.5 |
| | TLDR9+ (test, 1-shot, rougeL) | 16.8 | 19 | 13.9 | 12.8 |
| | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Tool Use | BFCL V2 | 25.7 | 67 | 27.4 | 58.4 |
| | Nexus | 13.5 | 34.3 | 21 | 26.1 |
| Math | GSM8K (8-shot, CoT) | 44.4 | 77.7 | 62.5 | 86.2 |
| | MATH (0-shot, CoT) | 30.6 | 48 | 23.8 | 44.2 |
| Reasoning | ARC Challenge (0-shot) | 59.4 | 78.6 | 76.7 | 87.4 |
| | GPQA (0-shot) | 27.2 | 32.8 | 27.5 | 31.9 |
| | Hellaswag (0-shot) | 41.2 | 69.8 | 61.1 | 81.4 |
| Long Context | InfiniteBench/En.MC (128k) | 38 | 63.3 | - | 39.2 |
| | InfiniteBench/En.QA (128k) | 20.3 | 19.8 | - | 11.3 |
| | NIH/Multi-needle | 75 | 84.7 | - | 52.7 |
| Multilingual | MGSM (0-shot, CoT) | 24.5 | 58.2 | 40.2 | 49.8 |

18 of 153

GPT-4

  • GPT-4 exhibits human-level performance on the majority of these professional and academic exams.
  • It passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers

GPT-4 Technical Report. OpenAI. https://cdn.openai.com/papers/gpt-4.pdf

ECIR 2025: LLMs are Human-like Annotators

18

19 of 153

Math word problems and reasoning QA

ECIR 2025: LLMs are Human-like Annotators

19

20 of 153

Math word problems and reasoning QA

ECIR 2025: LLMs are Human-like Annotators

20

21 of 153

Chart understanding and reasoning over data

ECIR 2025: LLMs are Human-like Annotators

21

Spot a data point that stands out in these charts and what that implicates. Then produce a detailed markdown table for all the data shown.

22 of 153

Image understanding and reasoning

ECIR 2025: LLMs are Human-like Annotators

22

Geometrical reasoning

23 of 153

Information seeking about objects

  • Prompt: Do you know the precise location where this image was taken?
  • Model Response: This image was taken in New York City. The Empire State Building is visible in the background. The street in the foreground is 8th Avenue. The cross street is West 34th Street.

ECIR 2025: LLMs are Human-like Annotators

23

Multimodal reasoning based on visual cues

24 of 153

Multimodal humor understanding

  • Prompt: Can you explain this meme?
  • Model Response: The meme is about a person who is playing a game at 300 FPS, but their monitor is only 75Hz. The person is trying to catch the game, but it is too fast for their monitor to display. The meme is funny because it is relatable to anyone who has ever played a game on a monitor that is not fast enough.

ECIR 2025: LLMs are Human-like Annotators

24

Commonsense reasoning in a multilingual setting

  • Prompt: Explain what i should call my grandparents on each side of my family based on this chart
  • Model Response: On your father’s side, you would call your grandparents Yéye and Nainai. On your mother’s side, you would call your grandparents Wàigong and Wàipó.

25 of 153

Reasoning and code generation

Create a web app called "Opossum Search":

1. Every time you make a search query, it should redirect you to a google search with the same query, but the word opossum before it.

2. It should be visually similar to Google search,

3. Instead of the google logo, it should have a picture of an opossum from the internet.

4. It should be a single html file, no separate js or css files.

5. It should say "Powered by google search" in the footer

ECIR 2025: LLMs are Human-like Annotators

25

26 of 153

Mathematics: Calculus

  • Prompt (video)
    • How could this person improve their technique?
  • Model Response
    • This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg.

ECIR 2025: LLMs are Human-like Annotators

26

Video understanding and reasoning

27 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

27

28 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

28

29 of 153

ChatGPT > crowd workers for text-annotation tasks

  • ChatGPT > crowd workers by 25%
    • Relevance: whether a tweet is about content moderation or, in a separate task, about politics
    • Topic detection: whether a tweet is about a set of six predefined topics (i.e., Section 230, Trump Ban, Complaint, Platform Policies, Twitter Support, and others)
    • Stance detection: whether a tweet is in favor of, against, or neutral about repealing Section 230 (on content moderation)
    • General frame detection: whether a tweet contains a set of two opposing frames (content moderation as a “problem” and “solution”).
    • Policy frame detection: whether a tweet contains a set of fourteen policy frames.
  • ChatGPT’s intercoder agreement exceeds that of both crowd workers and trained annotators.
  • Per-annotation cost of ChatGPT is < $0.003: ~30x cheaper than MTurk.

ECIR 2025: LLMs are Human-like Annotators

29

Accuracy means agreement with the trained annotators.

30 of 153

USDC: A Dataset of User Stance and Dogmatism in Long Conversations

  • Input: entire conversation and top two authors.
  • Stance: Strongly In Favor, Strongly Against, Stance Not Inferrable, Somewhat In Favor, or Somewhat Against
  • Dogmatism: Firm but Open, Open to Dialogue, Flexible or Deeply Rooted.

ECIR 2025: LLMs are Human-like Annotators

30

31 of 153

USDC: A Dataset of User Stance and Dogmatism in Long Conversations

  • Inter-annotator Agreement with LLMs as Annotators
    • LLMs: 0.485 (stance), 0.435 (dogmatism)
    • Humans: 0.34 (stance), 0.44 (dogmatism)
  • Verification using Human Interaction
    • 200 test conversations.
    • IAA: 0.56 (stance); 0.45 (dogmatism)
  • Stance Detection Evaluation on SPINOS
    • Use SLMs finetuned on USDC stance.
    • Evaluate on 5-class SPINOS dataset.
    • F1: SPINOS paper (0.341), random baseline (0.230), majority baseline (0.124), ours (0.320).
  • Complex and cumbersome nature of conversation understanding
    • Understanding user opinions and their shifts in multi-user conversational contexts.
    • Long-range memory capabilities.
  • Voting over ({Mistral Large, GPT-4}×{zero-shot, one-shot, few-shot}).

ECIR 2025: LLMs are Human-like Annotators

31

Failure Cases

32 of 153

Selective annotation and Prompt retrieval

  • Unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate (see the sketch below).
    • Build a graph with each sample as a node and SBERT similarity to build edges.
    • Choose high-degree nodes, discounting the degree contribution of nodes close to already chosen ones.
    • In a second stage, choose samples using model confidence scores from each of the 10 buckets.
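A rough sketch of the graph-based selection stage under the description above; the encoder name, neighbourhood size, and discount factor are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def vote_k_select(texts, budget=100, k=10, discount=0.9):
    """Greedy diverse selection: pick high in-degree nodes in a k-NN similarity
    graph, down-weighting nodes whose neighbourhood is already covered."""
    model = SentenceTransformer("all-MiniLM-L6-v2")          # any sentence encoder works
    emb = model.encode(texts, normalize_embeddings=True)
    sim = emb @ emb.T                                        # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)
    knn = np.argsort(-sim, axis=1)[:, :k]                    # k nearest neighbours per node

    score = np.zeros(len(texts))                             # in-degree of each node
    for i in range(len(texts)):
        for j in knn[i]:
            score[j] += 1.0

    selected = []
    for _ in range(min(budget, len(texts))):
        i = int(np.argmax(score))
        selected.append(i)
        score[i] = -np.inf                                   # never re-select
        for j in knn[i]:                                     # discount the chosen node's neighbours
            if score[j] != -np.inf:
                score[j] *= discount
    return selected
```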

ECIR 2025: LLMs are Human-like Annotators

32

ICL perf over varying annotation budgets for HellaSwag commonsense reasoning. LLM=GPT-J

100 annotated examples

33 of 153

LLMs instead of Human Judges?

  • Judge-Bench: 20 NLP datasets
  • Evaluate 11 LLMs.
  • Substantial variance across models and datasets.
  • Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human-produced or model-generated.
  • LLMs should be carefully validated against human judgments before being used as evaluators.

Bavaresco, A., Raffaella B., Leonardo B., Desmond E., Raquel F., A. Gatt, E. Ghaleb et al. "LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks." arXiv:2406.18403 (2024).

ECIR 2025: LLMs are Human-like Annotators

33

34 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

34

35 of 153

Chain of thought (CoT) prompting

  • Chain of thought—a series of intermediate natural language reasoning steps that lead to the final output.
  • It could be 0-shot or few-shot.
  • Prompt as a triple: <input, chain of thought, output>.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
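A minimal sketch of a few-shot CoT prompt assembled from <input, chain of thought, output> triples; the single exemplar shown is the well-known tennis-ball demonstration, and the target question is an illustrative assumption:

```python
# Each demonstration is an <input, chain of thought, output> triple.
cot_exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.",
        "11",
    ),
]

question = "A bakery had 23 muffins, sold 9, and then baked 12 more. How many muffins are there now?"

prompt = ""
for q, chain, answer in cot_exemplars:
    prompt += f"Q: {q}\nA: {chain} The answer is {answer}.\n\n"
prompt += f"Q: {question}\nA:"  # the model should produce its own chain before the final answer

print(prompt)
```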

ECIR 2025: LLMs are Human-like Annotators

35

36 of 153

What are advantages of chain of thought prompting?

  • Allows models to decompose multi-step problems into intermediate steps
  • Improves interpretability
  • PaLM 540B with CoT
    • achieved SOTA on StrategyQA
    • outperforms an unaided sports enthusiast on sports understanding.
  • CoT reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, etc.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).

Prompting PaLM 540B with just 8 CoT exemplars achieves SOTA on GSM8K math word problems, surpassing even finetuned GPT-3 with a verifier.

ECIR 2025: LLMs are Human-like Annotators

36

37 of 153

Analogical prompting

  • CoT needs labeled exemplars of the reasoning process.
  • To solve problems, humans think about related problems or high-level knowledge.
  • Analogical prompting
    • Prompt LLMs to self-generate relevant exemplars in the context, before proceeding to solve the given problem.
    • Avoids need for labelled exemplars. Can tailor the generated exemplars and knowledge to each problem

ECIR 2025: LLMs are Human-like Annotators

37

38 of 153

Analogical prompting methods

  • Self-generated exemplars
    • # Problem: [x]
    • # Relevant problems: Recall three relevant and distinct problems. For each problem, describe it and explain the solution.
    • # Solve the initial problem:
  • Self-generated knowledge + exemplars
    • # Tutorial: Identify core concepts in the problem and provide a tutorial.
    • Generating knowledge before exemplars yields superior results

Big Bench reasoning tasks with GPT3.5-Turbo
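A minimal sketch of the self-generated exemplars variant above, following the “# Problem / # Relevant problems / # Solve the initial problem” template; the wrapper function is an illustrative assumption:

```python
def analogical_prompt(problem: str, n_exemplars: int = 3) -> str:
    """One prompt that asks the model to recall its own exemplars before solving,
    so no labelled CoT exemplars are needed."""
    return (
        f"# Problem:\n{problem}\n\n"
        "# Relevant problems:\n"
        f"Recall {n_exemplars} relevant and distinct problems. For each problem, "
        "describe it and explain its solution.\n\n"
        "# Solve the initial problem:\n"
    )

print(analogical_prompt(
    "What is the area of the square with vertices at (-2, 2), (2, -2), (-2, -6), and (-6, -2)?"
))
```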

ECIR 2025: LLMs are Human-like Annotators

38

39 of 153

MoT: Memory-of-Thought

  • Finetuning is expensive.
  • Memory helps humans improve themselves in terms of decision-making, reasoning, judgment, etc.
  • MoT can improve LLMs via pre-thinking and recalling.
  • Pre-thinking
    • LLM pre-thinks on the unlabeled dataset
      • Few-Shot-CoT
      • Each demo has question, rationale and answer.
      • Get multiple <rationale, answer> pairs and choose majority-voted answer
    • Saves the high-confidence thoughts as external memory
      • Answer-entropy to filter out uncertain thoughts
      • Store <question, rationale, answer> as memory.
  • Recalling at test time
    • Given a test question, LLM recalls relevant memory to help itself reason and answer it.
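A rough sketch of the pre-thinking stage described above: sample several CoT rationales per unlabeled question, keep the majority-voted answer, and store it only if the answer entropy is low. The `sample_cot` callable and the entropy threshold are illustrative assumptions:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy of the sampled answers; low entropy = high confidence."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def prethink(question, sample_cot, n_samples=8, max_entropy=0.5):
    """sample_cot(question) -> (rationale, answer). Returns a memory item or None."""
    samples = [sample_cot(question) for _ in range(n_samples)]
    answers = [a for _, a in samples]
    if answer_entropy(answers) > max_entropy:
        return None                                   # too uncertain: discard this thought
    majority = Counter(answers).most_common(1)[0][0]
    rationale = next(r for r, a in samples if a == majority)
    return {"question": question, "rationale": rationale, "answer": majority}
```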

ECIR 2025: LLMs are Human-like Annotators

39

40 of 153

MoT: Memory-of-Thought

  • Recalling
    • Cluster memory items. Get topK semantically relevant memory item candidates from each cluster using SBERT.
    • Few-Shot-CoT with these memory items as extra context.
  • Baselines
    • MoT (no rationale): removes rationales in the retrieved memory and thus lets the LLM directly output the answer
    • MoT (no thinking): keeps rationales in the retrieved memory but forces the LLM to directly answer the question without CoT at recall stage.

MoT exceeds Few-Shot-CoT and Zero-Shot-CoT

ECIR 2025: LLMs are Human-like Annotators

40

  • Tree of thoughts
  • Graph of thoughts

41 of 153

Dynamic Program Prompting and Program Distillation


ECIR 2025: LLMs are Human-like Annotators

41

42 of 153

Dynamic Program Prompting and Program Distillation

Example prediction and retrieved program samples

ECIR 2025: LLMs are Human-like Annotators

42

43 of 153

LeanReasoner: Offloading reasoning to Lean

  • Lean: a theorem proving symbolic solver framework
  • Offloading reasoning to Lean: Reduces the risk of logical inconsistencies
  • SOTA performance on FOLIO and ProofWriter.
  • Fine-tuning on <100 in-domain samples for each dataset.
  • GPT-4 prompts for formalization and proof generation.

ECIR 2025: LLMs are Human-like Annotators

43

44 of 153

Symbolic reasoning for math word problems

Self-prompting: Prompt the LLM with the numeric problem and its response to it, then ask it to solve the symbolic version of the problem.

ECIR 2025: LLMs are Human-like Annotators

44

45 of 153

Symbolic Rule Learning for Robust Numerical Reasoning

  • Numerical reasoning for machine reading comprehension (RC) remains a difficult challenge.
  • ICL with LLMs to decompose complex questions into simpler sub-questions that are easier to answer with single-span RC models.
  • Symbolic learning methods like ILASP to learn rules for recomposing partial answers.
  • Benefits
    • Data efficiency: no training or fine-tuning.
    • Neuro-symbolic approach 🡺 robust numerical reasoning
    • Interpretable and verifiable reasoning traces.

ECIR 2025: LLMs are Human-like Annotators

45

46 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

46

47 of 153

ZeroGen: Efficient Zero-shot Learning via Dataset Generation

  • Tiny task model (TAM) has orders of magnitude fewer parameters than PLMs.
  • Variant of knowledge distillation but (a) does not require any human annotations (b) Flexible arch choice of student models.
  • TAM > PLM with only ∼0.4% of the parameters
  • In some low-resourced settings, TAM trained with synthesized data even outperforms the same model trained with human annotations in a fully supervised manner.

ECIR 2025: LLMs are Human-like Annotators

47

48 of 153

ZeroGen: Efficient Zero-shot Learning via Dataset Generation

  • The label is wrapped into a label-descriptive prompt
  • For sentence-pair classification tasks, we need to generate two sequences that bear certain relationships (e.g., premise and hypothesis in NLI, context and question in QA), as sketched below.
    • First generate and/or sample a conditional context (premise in NLI, context in QA).
    • The context is then concatenated with a sampled label and transformed into a prompt T.
    • Given the prompt T, generate the other sentence.
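A minimal sketch of this two-stage generation for an NLI-style task; the prompt wording and the `generate` helper are illustrative assumptions:

```python
import random

LABEL_PROMPTS = {
    "entailment":    'Premise: "{premise}"\nWrite a sentence that is definitely true given the premise:',
    "contradiction": 'Premise: "{premise}"\nWrite a sentence that is definitely false given the premise:',
    "neutral":       'Premise: "{premise}"\nWrite a sentence that might be true given the premise:',
}

def zerogen_nli_example(generate):
    """generate(prompt) -> text (a call to any PLM). Returns one synthetic (premise, hypothesis, label)."""
    premise = generate("Write a short sentence describing an everyday scene:")
    label = random.choice(list(LABEL_PROMPTS))           # sample a label y
    hypothesis = generate(LABEL_PROMPTS[label].format(premise=premise))
    return premise, hypothesis, label
```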

ECIR 2025: LLMs are Human-like Annotators

48

49 of 153

ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback

  • ZeroGen: suffers from low-quality issues (e.g., low informativeness, redundancy).
  • ProGen
    • Multi-phase dataset generation
    • In each phase, the generation is steered by feedback from the previously generated dataset, so as to synthesize a dataset with higher quality.
    • Feedback from the task-specific model to guide the generation of new training data via in-context examples.

ECIR 2025: LLMs are Human-like Annotators

49

50 of 153

ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback


ECIR 2025: LLMs are Human-like Annotators

50

51 of 153

SuperGen (Supervision Generation)


ECIR 2025: LLMs are Human-like Annotators

51

52 of 153

SuperGen (Supervision Generation)

ECIR 2025: LLMs are Human-like Annotators

52

53 of 153

TarGen: Targeted Data Generation

  • Multi-step prompting strategy (for WiC task)
    • Create a set of prompts (boxes 1, 2) to generate instance seeds unique to each task instance.
    • Create label-specific prompts (box 3) that generate instances based on instance seeds
    • Pass instances to self-correction module.
      • Verify alignment between generated instances and their labels, as well as the alignment between these instances and the task description.
  • Models trained on the synthetic datasets for 8 SuperGLUE tasks perform ∼1−3% points higher than those trained on original datasets.

ECIR 2025: LLMs are Human-like Annotators

53

54 of 153

Generation using Single Formatting Example

  • Labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering.
    • Labelers: Curating raw data for tasks in specialized domains, such as those in the biomedical or legal fields, can be particularly challenging
    • Generators: Requires careful curation of few-shot examples, or composition of prompts that highlight the semantic meaning of labels.
  • Self-reference strategy: iteratively samples from the pool of newly created examples to seed the prompt for the next round of generation.
    • random, contrastive, similar, and tree sampling (use examples from step 1 only).
  • Tree-based and Contrastive incurred the lowest cost.

ECIR 2025: LLMs are Human-like Annotators

54

55 of 153

Generation with Active Learning

  • Optimizing LLM as Better Annotator
    • Few-shot inference without finetuning.
    • k-NN few-shot example retrieval
    • Label Verbalizer
      • “per:parents” 🡪 either “subject is the parent of object” or “object is the parent of subject”.
  • Active Data Acquisition
    • Random
    • Maximum Entropy
    • Least Confidence
    • K-Means Diversity sampling
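A rough sketch of the uncertainty-based acquisition strategies listed above, scoring unlabeled examples from the task model's predicted class probabilities (function and parameter names are illustrative assumptions):

```python
import numpy as np

def max_entropy(probs: np.ndarray) -> np.ndarray:
    """probs: (n_examples, n_classes). Higher score = more uncertain example."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def least_confidence(probs: np.ndarray) -> np.ndarray:
    return 1.0 - probs.max(axis=1)

def select_for_annotation(probs, budget, strategy=max_entropy):
    """Pick the `budget` most informative examples for the LLM to annotate next
    (random and k-means diversity sampling are the other options listed above)."""
    scores = strategy(probs)
    return np.argsort(-scores)[:budget]
```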

ECIR 2025: LLMs are Human-like Annotators

55

56 of 153

Generation with Active Learning

  • FewGen: enhances ZeroGen with in-context examples uniformly sampled from the demonstration pool.

ECIR 2025: LLMs are Human-like Annotators

56

57 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

57

58 of 153

Dictionary-assisted training data generation

  • Dictionary-Assisted Training Data Generation (DADG)
    • First query Wikidata for head-tail entity pairs under relation P6, the relation ID for “head of government”.
    • GPT-3 can then be instructed to generate a sentence with the specified entity pairs and relation.

ECIR 2025: LLMs are Human-like Annotators

58

Prompt-guided unlabeled data annotation (PGDA)

Prompt-guided training data generation (PGDG)

  • Few shot ICL
  • PGDG: To generate training data with relation “head of government”
    • First "teach" GPT-3 to generate head-tail entity pairs that have the specified relation
    • Then instruct GPT-3 to generate a sentence with the given entities and relation (see the sketch below).
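A minimal sketch of the two-step PGDG idea above for the relation “head of government”; the prompts, seed pairs, and the `generate` helper are illustrative assumptions:

```python
def pgdg_generate(generate, relation="head of government", n_pairs=5):
    """generate(prompt) -> text. Step 1: ask for entity pairs holding the relation;
    Step 2: ask for a sentence expressing the relation for each pair."""
    pair_prompt = (
        f'List {n_pairs} (head, tail) entity pairs for the relation "{relation}".\n'
        "Example: (Germany, Angela Merkel)\n"        # placeholder seed pairs
        "Example: (Canada, Justin Trudeau)\n"
        "Pairs:"
    )
    pairs = [line.strip() for line in generate(pair_prompt).splitlines() if line.strip()]

    examples = []
    for pair in pairs:
        sent_prompt = (
            f'Write one sentence expressing the relation "{relation}" '
            f"between the entities {pair}."
        )
        examples.append({"text": generate(sent_prompt), "entities": pair, "relation": relation})
    return examples
```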

59 of 153

Dictionary-assisted training data generation

  • PGDA is better for tasks with smaller label spaces and clearly defined labels.
    • sentiment analysis
  • Generation-based approaches (PGDG and DADG) are better suited for tasks with larger label spaces or labels that possess a certain degree of ambiguity.

  • Tagging-based approach
    • Allows for direct access to in-domain unlabeled data
    • As the label space becomes larger, prompt size increases.
  • Generation-based approaches
    • May generate data that contains information that was "learned" during pre-training and may not align with the distribution of in-domain data.
    • Cost-effective, as the prompts used can be significantly shorter and multiple examples can be generated from a single prompt.

ECIR 2025: LLMs are Human-like Annotators

59

60 of 153

PersonaChatGen: Generating Personalized Dialogues

  • ProfileGen Filtering
    • Remove if output does not have a (persona category, persona entity) key-value pair
    • Exact Matching Persona Entity.
    • Preserving Persona Category in generated profile sentences.
    • Duplication Filtering.
  • PersonaChatGen Filtering
    • Remove samples where GPT-3 copies given profile sentences when generating personalized dialogue
    • Persona Consistency with utterances.
    • Remove toxic samples.

ECIR 2025: LLMs are Human-like Annotators

60

61 of 153

PersonaChatGen: Generating Personalized Dialogues

ECIR 2025: LLMs are Human-like Annotators

61

62 of 153

Mixture of Soft Prompts for Controllable Data Generation

  • MSP learns a set of soft prompts, mixes them together to generate attribute-preserving examples, then merges the augmented and original data to train a smaller, downstream model.
  • Individual examples in some tasks can contain multiple attributes.
    • Multi-aspect intent detection: a dialogue utterance may have 3 intent attributes.
  • Attribute Mixing: Concat; Pooling; Attention; Bottleneck; CNN

ECIR 2025: LLMs are Human-like Annotators

62

63 of 153

Mixture of Soft Prompts for Controllable Data Generation

Soft prompts are initialized with the name and description of the attribute, e.g., “song is a musical song or melody”.

ECIR 2025: LLMs are Human-like Annotators

63

64 of 153

Attributed Training Data Generator

  • Attributed generation can increase diversity and reduce bias
  • “location” in NYT news dataset
    • “North America” (68.01%)
    • “Africa” (0.69%)
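A rough sketch of attributed generation: sample a value for each attribute and condition the generation prompt on the sampled combination to diversify the synthetic data. The attribute lists and prompt wording are illustrative assumptions:

```python
import random

ATTRIBUTES = {
    "location": ["North America", "Africa", "Europe", "Asia", "South America", "Oceania"],
    "style":    ["breaking news", "analysis piece", "interview summary"],
    "length":   ["about 30 words", "about 80 words"],
}

def attributed_prompt(topic: str) -> str:
    """Sample one value per attribute and build an attribute-conditioned prompt."""
    values = {name: random.choice(options) for name, options in ATTRIBUTES.items()}
    return (
        f"Write a {values['style']} news snippet about {topic}, "
        f"set in {values['location']}, {values['length']}."
    )

print(attributed_prompt("public transportation"))
```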

ECIR 2025: LLMs are Human-like Annotators

64

65 of 153

Attributed Training Data Generator


ECIR 2025: LLMs are Human-like Annotators

65

66 of 153

Summary

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
    • CoT, analogical prompting, MoT, program prompting, LeanReasoner, Symbolic Reasoning
  • How can we use LLMs to generate samples and then label them?
    • ZeroGen, ProGen, SuperGen, TarGen, 1-shot Gen, Active learning
  • Generating diverse and attribute-specific datasets.
    • Dictionary-assisted generation, PersonaChatGen, Attributed Training Data Generation

ECIR 2025: LLMs are Human-like Annotators

66

67 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

67

68 of 153

ECIR 2025: LLMs are Human-like Annotators

68

What is reasoning?

  • Reasoning is the ability to make inferences using evidence and logic.
  • Reasoning can be divided into multiple types of skills such as Commonsense, Mathematical, and Symbolic reasoning etc.
  • Often, reasoning involves deductions from inference chains, called multi-step reasoning.

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Do language models truly understand and apply common sense reasoning?

69 of 153

ECIR 2025: LLMs are Human-like Annotators

69

Reasoning Problems

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Hard Language Tasks: require multiple steps of reasoning to solve

70 of 153

ECIR 2025: LLMs are Human-like Annotators

70

Multi-step reasoning is often seen as a weakness in language models

Towards Reasoning in Large Language Models: A Survey (Huang et al., 2023)

  • It is unclear to what extent LLMs are capable of reasoning

Earlier research studied reasoning in small language models through fully supervised finetuning on specific datasets:

  • Creating a dataset containing explicit reasoning can be difficult and time-consuming
  • Training on a specific dataset limits application to a specific domain

Reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters

71 of 153

ECIR 2025: LLMs are Human-like Annotators

71

Reasoning and Commonsense Benchmarks

Source: https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond#different-types-of-llm-benchmarks

  • How can we assess reasoning abilities in language models?
  1. TruthfulQA — Truthfulness
  2. MMLU — Language understanding
  3. HellaSwag — Commonsense reasoning
  4. BIG-Bench Hard — Challenging reasoning tasks
  5. HumanEval — Coding challenges
  6. CodeXGLUE — Programming tasks
  7. Chatbot Arena — Human-ranked ELO-based benchmark
  8. MT Bench — Complex conversational ability

72 of 153

ECIR 2025: LLMs are Human-like Annotators

72

How is reasoning measured (in the literature)?

GPT-4 Technical Report (OpenAI).

  • While SOTA models at the time struggled to score above 50% with pre-training, GPT-4 achieved a record-high of 95.3% with just 10-shot prompting in 2023.

73 of 153

ECIR 2025: LLMs are Human-like Annotators

73

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

Prompt: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

11 apples

  • Keep in mind that the authors claim that this is an emergent ability that arises with sufficiently large language models.

74 of 153

ECIR 2025: LLMs are Human-like Annotators

74

Chain of thought prompting: Arithmetic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Does not positively impact performance for small models
  • Few-shot CoT achieves better performance than zero-shot CoT on LLMs.
  • 8 manually composed exemplars
  • All contain equations with flexible formats
  • Benchmarked on:
    • GSM8K (Cobbe et al. 2021)
    • SVAMP (Patel et al., 2021)
    • MAWPS (Koncel-Kedziorski et al., 2016)

75 of 153

ECIR 2025: LLMs are Human-like Annotators

75

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Chain-of-thought prompting improves performance on a range of arithmetic reasoning tasks

76 of 153

ECIR 2025: LLMs are Human-like Annotators

76

Chain of thought prompting: Symbolic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Standard prompting fails out-of-domain tests for both tasks
  • Zero-shot CoT using Instruct-GPT-3 175B achieves similar performance to few-shot CoT using the 540B PaLM model on both tasks

77 of 153

ECIR 2025: LLMs are Human-like Annotators

77

Chain of thought prompting: Commonsense Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • CoT shows minimal benefits on the CSQA and StrategyQA tasks
  • Few-shot CoT achieves better performance than zero-shot CoT with the 175B GPT-3 model on the CSQA and StrategyQA tasks, but zero-shot CoT shows significant improvement on the date understanding task.

78 of 153

ECIR 2025: LLMs are Human-like Annotators

78

More Advances: Self consistency

Self-consistency improves chain of thought reasoning in language models. (Wang et al., 2022)

  • Change greedy decode (single-path) to self-consistency (multi-path) in few-shot CoT
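A minimal sketch of self-consistency on top of few-shot CoT: sample several reasoning paths with temperature > 0 and return the majority answer. The `sample_cot_answer` helper and the number of paths are illustrative assumptions:

```python
from collections import Counter

def self_consistent_answer(question, sample_cot_answer, n_paths=10):
    """sample_cot_answer(question) -> final answer string, sampled with temperature > 0.
    Greedy single-path decoding is replaced by voting over sampled reasoning paths."""
    answers = [sample_cot_answer(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]   # majority vote over final answers
```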

79 of 153

ECIR 2025: LLMs are Human-like Annotators

79

STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning

STaR: Self-Taught Reasoner. (Zelikman et al., 2022)

  • Self-Taught Reasoner (STaR), which iteratively improves a model’s ability to generate rationales to solve problems.
  • Participants were 74% more likely to prefer the STaR-generated rationales over the human-generated rationales.

80 of 153

ECIR 2025: LLMs are Human-like Annotators

80

Program-aided Language Models (PAL)

PAL: Program-aided Language Models (Gao et al., 2023)

  • PAL enhances problem-solving by generating code to represent intermediate reasoning steps (contrast this with CoT prompting, which reasons in natural language); see the sketch below.
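A rough sketch of the PAL idea: the LLM emits a short program as its reasoning steps and the Python interpreter, not the model, computes the final answer. The generated program is hard-coded here for illustration (it reuses the apple word problem shown earlier); in practice it would come from a code-generation prompt:

```python
# In PAL, reasoning steps are emitted as code; the runtime does the arithmetic.
generated_program = """
apples = 10
apples -= 2          # gave 2 to the neighbor
apples -= 2          # gave 2 to the repairman
apples += 5          # bought 5 more
apples -= 1          # ate 1
answer = apples
"""

scope = {}
exec(generated_program, scope)   # offload execution to the Python interpreter
print(scope["answer"])           # 10; no arithmetic is left to the language model
```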

81 of 153

ECIR 2025: LLMs are Human-like Annotators

81

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA models exhibit remarkable improvements over previous state-of-the-art approaches across all scales

82 of 153

ECIR 2025: LLMs are Human-like Annotators

82

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • Utilize GPT-4 to synthesize high-quality trajectories on arithmetic reasoning datasets.

83 of 153

ECIR 2025: LLMs are Human-like Annotators

83

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA provides valuable insights for future research, contributing to the development of more advanced and versatile reasoning agents.

84 of 153

ECIR 2025: LLMs are Human-like Annotators

84

Plan-and-Solve Prompting

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (Wang et al., 2023)

  1. Zero-shot-CoT prompting
  2. Plan-and-Solve (PS) prompting
  3. Answer extraction prompting
  • Zero-shot PS+ prompting consistently outperforms zero-shot CoT across all reasoning datasets by a large margin

85 of 153

ECIR 2025: LLMs are Human-like Annotators

85

Can we use LLMs to benchmark reasoning datasets?

  • With models like GPT-4, it's now possible to synthetically produce datasets that are more comprehensive and diverse than human-labeled ones, in far less time

86 of 153

ECIR 2025: LLMs are Human-like Annotators

86

Reasoning datasets: CriticBench

  • CRITICBENCH is designed to assess the two key aspects of LLMs’ critical reasoning:
    • critique
    • correction

87 of 153

ECIR 2025: LLMs are Human-like Annotators

87

Reasoning datasets: Question collection on CriticBench

Response collection from LLMs:

  • Employ various LLMs, including GPT-3.5, GPT-4, LLaMA-2 (7B, 13B, and 70B variants), and Vicuna (7B, 13B, and 33B variants)
  • Each LLM generates a response for each question and provides its reasoning

Response annotation:

  • Response correctness:
    • rule-based matching, followed by a more detailed evaluation using GPT-4 and Humans

Domains:

  • 15 datasets spanning 5 domains: mathematical, commonsense, symbolic, coding, and algorithmic.

Question collection:

  • Randomly select a quantity of data from existing datasets

88 of 153

ECIR 2025: LLMs are Human-like Annotators

88

Reasoning datasets: Evaluation process on CriticBench

89 of 153

ECIR 2025: LLMs are Human-like Annotators

89

Reasoning datasets: Annotation example of CriticBench

  • GPT-4 evaluation is closer to human evaluation

90 of 153

ECIR 2025: LLMs are Human-like Annotators

90

Reasoning datasets: Key Factors in Critical Reasoning

  • LLMs struggle more with incorrect answers in detail-oriented tasks like algorithmic tasks compared to logic-centric tasks like code generation

91 of 153

ECIR 2025: LLMs are Human-like Annotators

91

Reasoning datasets: Average performance on CriticBench

  • The knowledge acquired by LLMs is not entirely consistent across generation, critique, and correction tasks.

92 of 153

ECIR 2025: LLMs are Human-like Annotators

92

Reasoning datasets: Consistency of GQC Knowledge

  • Analysis across different task types found that models perform better in Q and C for tasks focused on logic compared to those requiring attention to detail.

93 of 153

ECIR 2025: LLMs are Human-like Annotators

93

Human preference benchmarks with reasoning tasks

  • How to construct a benchmark which consists of real-world multi-turn conversations?
  • How to automatically evaluate the model’s responses?

Why human preference benchmarks?

  • Existing NLP benchmarks fail to accurately evaluate LLM chat assistants
  • Knowledge benchmark (e.g., MMLU)
    • Not open-ended
  • Instruction benchmark (e.g., Super-Natural Instructions)
    • Only one turn
    • Human annotation efforts
  • Conversational benchmark (e.g., CoQA)
    • Too easy for state-of-the-art LLMs

94 of 153

ECIR 2025: LLMs are Human-like Annotators

94

Human preference benchmarks: LLMs as judges

  • LLM judges provide not only scores but also explanations, offering interpretable outputs and valuable insights.

Four popular benchmarks

  • MT-Bench
  • Chatbot Arena
  • MT-Bench-101
  • HumanEval

95 of 153

ECIR 2025: LLMs are Human-like Annotators

95

LLM-as-a-judge

  • The agreement between GPT-4 judges and humans reaches over 80%, the same level of agreement as among humans.
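A minimal sketch of a pairwise LLM-as-a-judge prompt of the kind used in these benchmarks; the exact wording is an illustrative assumption, and judging both orderings is a common way to reduce position bias:

```python
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two assistant responses
to the user question below. Consider helpfulness, relevance, accuracy and detail.
Output exactly one verdict: "A", "B", or "tie", followed by a short explanation.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""

def judge_pair(ask_judge, question, answer_a, answer_b):
    """ask_judge(prompt) -> verdict text. Judge both orderings to reduce position bias."""
    v1 = ask_judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b))
    v2 = ask_judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_b, answer_b=answer_a))
    return v1, v2   # disagreement between the two orderings can be treated as a tie
```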

96 of 153

ECIR 2025: LLMs are Human-like Annotators

96

Limitations of LLM-as-a-judge

Position bias

  • LLMs favor answers in the first position

Verbosity bias

  • LLMs favor longer answers

Self-appreciation bias

  • LLMs favor their own answers, or answers similar to their own

Limited reasoning ability

  • LLMs fail to judge hard math/reasoning/code questions
  • It's crucial to acknowledge the limitations of LLM-as-a-judge, such as its inability to detect hallucinations and penalize LLM generated answers accordingly, and potential errors when grading math/reasoning questions.

97 of 153

ECIR 2025: LLMs are Human-like Annotators

97

Human preference benchmarks: MT-Bench-101

  • The agreement between GPT-4 and human expert evaluations reached 87%, using the paper's designed evaluation approach.

98 of 153

ECIR 2025: LLMs are Human-like Annotators

98

MT-Bench-101: Hierarchical Ability Taxonomy

  • Utilized GPT-4, the most powerful model for multi-turn dialogues, to construct the data.

13 tasks

3-level abilities

99 of 153

ECIR 2025: LLMs are Human-like Annotators

99

MT-Bench-101: Model’s performance

  • Closed-source models consistently exhibit superior performance compared to open-source counterparts across all evaluated tasks.
  • Content confusion and format rephrasing are relatively less difficult, while the mathematical reasoning task is the most challenging

100 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

100

101 of 153

ECIR 2025: LLMs are Human-like Annotators

101

Why Focus on Evaluation

Source: https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/

102 of 153

ECIR 2025: LLMs are Human-like Annotators

102

LLM Evaluation vs. Human Evaluation

103 of 153

ECIR 2025: LLMs are Human-like Annotators

103

How to scale “human evaluation”?

  • Chatbot Arena lets users manually vote for the superior response; to scale this, one can instead leverage multiple LLMs to autonomously determine which response stands out.

104 of 153

ECIR 2025: LLMs are Human-like Annotators

104

LLM Evaluation

An LLM benchmark bundles several evaluation configurations for a given model. Each configuration fixes the shots, instruction, task, dataset, and metric, and the per-configuration results are aggregated into an overall benchmark score. On the model side, the system prompt, system format, and hyperparameters must also be fixed.

105 of 153

ECIR 2025: LLMs are Human-like Annotators

105

LLM Evaluation

Task:

  • What is the concrete problem we want to address (e.g., classification, summarization, commonsense reasoning)?

Dataset:

  • Which dataset do we want to use?

Metric:

  • How do we evaluate the performance?

Instruction:

  • Evaluate the model's ability to perform an unseen task given context in the form of instructions

Shot:

  • Whether the model is given prior examples at inference time (few-shot) or not (zero-shot)

106 of 153

ECIR 2025: LLMs are Human-like Annotators

106

LLM Evaluation

Model

System Prompt

System Format

Hyperparams

<SYS> You are a helpful model </SYS>

<instruction> Translate this sentence to French

<user> I like pizza

<assistant> J'aime la pizza

107 of 153

ECIR 2025: LLMs are Human-like Annotators

107

LLM Evaluation: Alpaca

  • GPT-4 based evaluation of chatbot output agrees well with human evaluation; known before
  • Surprisingly, judge Llama 2 (70B) can compete with judge GPT-4 (1760B). But judge Llama 2 is more biased (more ties) and less capable of following the instructions carefully.

108 of 153

ECIR 2025: LLMs are Human-like Annotators

108

LLM Evaluation: G-Eval

  • The LLM consistently gives higher scores to GPT-3.5 summaries, even though human judges prefer human-written summaries

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)

  • Can we rely on LLMs to assess LLM outputs?
  • Do LLMs show a bias towards the outputs they generate during evaluations?

109 of 153

ECIR 2025: LLMs are Human-like Annotators

109

LLM Evaluation: GPT-Score

https://github.com/confident-ai/deepeval

110 of 153

ECIR 2025: LLMs are Human-like Annotators

110

Language model-written evaluations

https://github.com/confident-ai/deepeval

  • Generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size.
  • Overall, LM-written evaluations are high quality and quickly discover many novel LM behaviors.

111 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

111

112 of 153

Generate a synthetic dataset using LLMs

112

AutoLabel:

  • Labeling Agent provides a method to generate synthetic datasets
  • Supports synthetic dataset generation for classification and entity matching tasks

Prodigy:

  • Provides LLM-based annotation: you supply a prompt in order to annotate an NLP task.

Labelbox:

  • Easily curate and annotate data
  • Generate high-quality human feedback data for computer vision and language models, evaluate and improve model performance

LLM-data-annotation:

  • Combines human expertise with the efficiency of Large Language Models (LLMs) like OpenAI's GPT-3.5 to simplify dataset annotation and model improvement.

113 of 153

AutoLabel

113

  • How to use the AutoLabel tool to annotate a task?

114 of 153

AutoLabel: Question Answering

114

  • First step: specify a labeling configuration

115 of 153

AutoLabel: Question Answering

115

  • Second step: do a dry run on the test dataset using the LLM specified in config.json by running agent.plan

116 of 153

AutoLabel: Question Answering

116

  • Final step: run the labeling with agent.run
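A rough sketch tying the three steps together with the autolabel library; the config contents and exact call signatures follow the library's documented pattern but are abbreviated here and should be checked against the AutoLabel docs:

```python
# NOTE: signatures follow AutoLabel's documented usage pattern; verify against the current docs.
from autolabel import LabelingAgent

# Step 1: the labeling configuration (task type, LLM provider/model, prompt, few-shot examples)
agent = LabelingAgent(config="config.json")

# Step 2: dry run: prints example prompts and an estimated labeling cost, without labeling the full set
agent.plan("test.csv")

# Step 3: run the labeling with the configured LLM
labeled = agent.run("test.csv")
```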

117 of 153

LLMs can label data as well as humans, but 100x faster

117

  • Refuel LLM (84.2%) outperforms trained human annotators (80.4%), GPT-3.5-turbo (81.3%), PaLM-2 (82.3%) and Claude (79.3%) across a benchmark of 15 text labeling datasets.

118 of 153

LLMs can label data: Quality Evaluation

118

  • Refuel LLM outperforms human annotators and all closed source LLMs, excluding GPT-4

119 of 153

LLMs can label data: Quality Evaluation

119

  • RefuelLLM-2 (83.82%) outperforms all current state-of-the-art LLMs for data labeling and enrichment, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%) and Gemini-1.5-Pro (74.59%)

120 of 153

LLMs can label data: Quality Evaluation

120

  • Refuel-LLM-2 is competitive or superior in terms of output quality, compared to current state-of-the-art LLMs

121 of 153

Prodigy

121

https://demo.prodi.gy/?=null&view_id=ner_manual

What Prodigy isn’t:

  • “Software as a service” – it’s a tool that you can download, install and run yourself

Usage:

  • Annotate NLP tasks, Audio and vision tasks

122 of 153

AutoLabel tools: Which one is better?

122

  • Autolabel currently supports chain-of-thought prompting, making it suitable for reasoning tasks.
  • It supports labeling tasks spanning categories such as classification, entity resolution, matching, reading comprehension, reasoning and information extraction.

123 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]

  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Evaluation of LLM Generated Annotations [30 mins]

  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

123

124 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

ECIR 2025: LLMs are Human-like Annotators

124

125 of 153

Hallucination

125

  • How did hallucination evolve in LLMs?
  • In the context of language models: the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.

126 of 153

Evolution of Hallucination in LLMs

126

  • LLMs simply generate text without knowing the specific details of the visit's purpose.

127 of 153

Evolution of Hallucination in LLMs

127

  • Despite their advanced capabilities, MLLMs, like all LLMs, are prone to hallucination

128 of 153

Evolution of Hallucination in LLMs

128

  • The problem of hallucination also exists in other foundation models, such as image, video, and audio models

129 of 153

Taxonomy of Hallucinations

ECIR 2025: LLMs are Human-like Annotators

129

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Like their text-only LLM counterparts, video- and audio-based language models are also prone to hallucination.

130 of 153

Taxonomy of Hallucinations: Causes

ECIR 2025: LLMs are Human-like Annotators

130

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

131 of 153

Taxonomy of Hallucinations: Metrics and Benchmarks

ECIR 2025: LLMs are Human-like Annotators

131

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

132 of 153

Taxonomy of Hallucinations: Mitigation

ECIR 2025: LLMs are Human-like Annotators

132

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

133 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]

  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Evaluation of LLM Generated Annotations [30 mins]

  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

ECIR 2025: LLMs are Human-like Annotators

133

134 of 153

Hallucination Types

134

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Hallucination can occur in any NLG task, necessitating a thorough examination based on the fundamental principles of text generation from a given prompt.

135 of 153

Hallucination Types: Orientation, Category and Degree

135

136 of 153

Hallucination Types: Orientation

136

Factual Mirage:

  • an LLM engages in hallucination or distortion of a given prompt that is factually correct.

Intrinsic

Extrinsic

  • The generated text that appears to be true or real but is actually misleading or false.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

137 of 153

Hallucination Types: Orientation

137

Silver Lining:

  • an LLM indulges in hallucination by conjuring an elaborate and captivating narrative based on a given prompt that is factually incorrect.

Intrinsic

Extrinsic

  • LLM produces unfaithful content or is inconsistent with the provided source prompt.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

138 of 153

Hallucination Types: Category

138

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • an LLM generates text that exhibits a mashed-up fusion of events from different timelines.

139 of 153

Hallucination Types: Category

139

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Geographic erratum: an LLM generates an incorrect location for an event.

140 of 153

Hallucination Types: Category

140

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Virtual voice: an LLM generates quotations attributed to either fictional or real characters without sufficient evidence to verify the authenticity of such statements.

141 of 153

Hallucination Types: Category

141

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Generated golem: an LLM fabricates an imaginary personality in relation to a past event, without concrete evidence.

142 of 153

Hallucination Types: Category

142

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • an LLM generates numeric values related to past events, such as dates, ages, or monetary amounts, that are inconsistent with the actual facts.

143 of 153

Hallucination Types: Category

143

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • LLMs generate an imprecise expansion for an acronym.

144 of 153

Hallucination Types: Degree

144

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Alarming indicates added information pieces that bear a radical dissemblance from the topic fed via the prompt.
  • Moderate indicates a level of hallucination that introduces facts that are either fictitious or tangential to the topic at hand.
  • Mild indicates minor hallucination which is superficial in terms of its impact.
  • How can we detect and evaluate LLM’s hallucinations across different types?

145 of 153

Hallucination Detection: SelfCheckGPT

145

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)

  • SelfCheckGPT can effectively detect hallucinated sentences with higher accuracy compared to several baseline methods
  • SelfCheckGPT's prompting method achieved the highest performance in detecting non-factual sentences
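A rough sketch of the consistency idea behind SelfCheckGPT: sample several stochastic responses to the same prompt and score each sentence of the main response by how weakly the samples support it. The `supports` scorer is a stand-in for the method's BERTScore/QA/n-gram/NLI/prompting variants:

```python
def selfcheck_scores(main_sentences, sampled_responses, supports):
    """supports(sentence, sample_text) -> probability that the sample supports the sentence.
    Returns one inconsistency score per sentence (higher = more likely hallucinated)."""
    scores = []
    for sent in main_sentences:
        support = [supports(sent, sample) for sample in sampled_responses]
        scores.append(1.0 - sum(support) / len(support))
    return scores

# Usage sketch: sample N responses at temperature 1.0 for the same prompt, then flag
# sentences whose inconsistency score exceeds a chosen threshold as potential hallucinations.
```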

146 of 153

Hallucination Detection: FACTScore

146

"Factscore: Fine-grained atomic evaluation of factual precision in long form text generation (Sewon et.al, 2023)

  • GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.

147 of 153

Hallucination eLiciTation dataset

147

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Selected 15 LLMs, and used them to generate a total of 75,000 text passages, with each LLM producing 5,000 text prose entries.
  • The text prompts provided to these LLMs consisted of tweets from NYTimes and headlines sourced from the Politifact dataset

148 of 153

Hallucination Vulnerability Index (HVI)

148

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Larger LLMs without RLHF are prone to both orientations of hallucination
  • For smaller LLMs like T5, Dolly, etc., Generated Golem, Virtual Voice, and Geographic Erratum categories of hallucination are rarely observed.

149 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

ECIR 2025: LLMs are Human-like Annotators

149

150 of 153

Hallucination Mitigation

150

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation (Varshney et al., 2023)

  • Leveraging external knowledge to validate the correctness: RAG
  • Modifying the decoding strategy
  • Sampling multiple outputs and checking their consistency: SelfCheckGPT
  • The proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average

151 of 153

Hallucination Mitigation: Chain-Of-Verification (CoVe)

151

Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023)

  • CoVe reduces inaccuracies in LLMs' responses by verifying facts through structured questioning.
  • Significant reduction in factual hallucinations.
  • Final responses are more reliable and factually accurate.
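A minimal sketch of a CoVe-style loop: draft an answer, plan verification questions, answer them independently of the draft, then revise. The `ask` helper and prompt wording are illustrative assumptions:

```python
def chain_of_verification(ask, user_query):
    """ask(prompt) -> model text. Four-step CoVe-style loop."""
    draft = ask(f"Answer the question:\n{user_query}")

    plan = ask(
        "List short fact-checking questions that would verify the claims in this answer:\n"
        f"{draft}"
    )
    verification_qs = [q.strip("- ").strip() for q in plan.splitlines() if q.strip()]

    # Answer each verification question independently of the draft to avoid copying its errors.
    verified = [(q, ask(q)) for q in verification_qs]

    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verified)
    return ask(
        f"Original question: {user_query}\n"
        f"Draft answer: {draft}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Write a final answer that is consistent with the verification answers."
    )
```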

152 of 153

Is hallucination always bad?

152

https://www.washingtonpost.com/opinions/2023/12/27/artificial-intelligence-hallucinations/

  • Numerals: Models have been shown to hallucinate a lot while generating numerals, such as dates, quantities, and scalars.
  • Long Text: Models often tend to self-contradict while generating the output.
  • Reasoning: Misunderstanding facts/information present in the source text can lead to hallucinations and errors.
  • When Contextual Knowledge Conflicts with the Parametric Knowledge: Models have been shown to prioritize the parametric knowledge (acquired during pre-training) over the contextual knowledge which leads to hallucinations.

153 of 153

Thanks!

ECIR 2025: LLMs are Human-like Annotators

153