1 of 153

LLMs are Human-like Annotators

Mounika Marreddy1, Subba Reddy Oota2, Manish Gupta3

1IIIT Hyderabad, India; 2TU Berlin, Germany; 3Microsoft, India

mounika0559@gmail.com, subba.reddy.oota@tu-berlin.de, gmanish@microsoft.com

ECIR 2025

The 47th European Conference on Information Retrieval

Apr 6-10, 2025. Lucca, Italy.

2 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]: Manish
  • Generating Annotations for NLP Tasks using LLMs [30 mins]: Manish
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]: Subba
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]: Subba
  • Autolabel Tools to Label Reasoning Datasets [20 mins]: Mounika
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]: Mounika

ECIR 2025: LLMs are Human-like Annotators

2

3 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

3

4 of 153

Deep Learning and Large Language Models

ECIR 2025: LLMs are Human-like Annotators

4

Basic: ANNs, CNNs, RNNs, LSTMs

NLP: Encoder-Decoder, Attention, Transformers, BERT, GPT, T0, BART, T5…

Prompt-based models: GPT-3, T0/mT0, InstructGPT, prompt tuning …

5 of 153

GPT-3

  • Humans do not require large supervised datasets to learn most language tasks
  • The following is enough:
    • A brief directive in natural language (e.g., “please tell me if this sentence describes something happy or something sad”)
    • A tiny number of demonstrations (e.g., “here are two examples of people acting brave; please give a third example of bravery”)
  • In-context learning (see the prompt sketch below)

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
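A minimal sketch of few-shot in-context learning as described above; the prompt wording is an illustrative assumption, and the resulting string would be sent to any text-completion endpoint:

```python
# Few-shot in-context learning: a brief natural-language directive plus a
# handful of demonstrations, packed into a single prompt (no weight updates).
directive = "Please tell me if this sentence describes something happy or something sad."

demonstrations = [
    ("We finally adopted a puppy!", "happy"),
    ("My flight got cancelled and I missed the wedding.", "sad"),
]

query = "She aced the exam she had studied for all month."

prompt = directive + "\n\n"
for sentence, label in demonstrations:
    prompt += f"Sentence: {sentence}\nLabel: {label}\n\n"
prompt += f"Sentence: {query}\nLabel:"

print(prompt)  # send this string to any text-completion endpoint
```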

ECIR 2025: LLMs are Human-like Annotators

5

6 of 153

InstructGPT

ECIR 2025: LLMs are Human-like Annotators

6

1. Supervised fine-tuning (SFT)

2. Reward model (RM) training

3. RL via proximal policy optimization (PPO) on RM

7 of 153

ChatGPT and Prompting

  • Generate labelled training data
    • Cheaper: Reduction in $ cost vs UHRS
    • Faster turnaround
    • Agility: large development savings on HitApp creation and judge training
    • Quality: Higher label quality
  • Directly use GPT models rather than train your own.
  • Prompt engineering: good task description; examples; multiple prompts help.
  • Can control output length, output language, output style.

ECIR 2025: LLMs are Human-like Annotators

7

8 of 153

Summarization

ECIR 2025: LLMs are Human-like Annotators

8

Question Answering

9 of 153

Machine Translation

ECIR 2025: LLMs are Human-like Annotators

9

Ads Copywriting

10 of 153

Machine Reading Comprehension

ECIR 2025: LLMs are Human-like Annotators

10

11 of 153

Solving reasoning problems

ECIR 2025: LLMs are Human-like Annotators

11

12 of 153

Loads of LLMs and SLMs

ECIR 2025: LLMs are Human-like Annotators

12

GPT-4o

OpenAI o1, o3 …

LLaMA 3, 3.2, 3.3, …

13 of 153

Small language models

  • Models based on LLaMA and Falcon
    • LLaMA: open-source models
    • Falcon: clean web data at scale
    • LoRA: fast finetuning with low-rank adaptation: Alpaca
    • Vicuna: Conversation tuning
    • Falcon Instruct, Alpaca: Instruction tuning
    • Orca: Explanation tuning
  • Models based on LLaMA 2
    • LLaMA 2: RLHF
    • Orca 2: Cautious Reasoning and Progressive learning
  • Models based on Mistral
    • Mistral: Sliding Window attention, Pre-fill and chunking, Rolling buffers
    • Mixtral: MoE
  • SOLAR: Depth-up scaling, alignment tuning, DPO
  • The Phi series: Clean, textbook-quality data

ECIR 2025: LLMs are Human-like Annotators

13

14 of 153

LLaMA 1

  • Mathematical reasoning
    • MATH: 12K middle school and high school math problems in LaTeX.
    • GSM8k: Middle school math problems.
    • LLaMA-65B > Minerva 62B
      • Minerva: PaLM models finetuned on 38.5B tokens from ArXiv and Math pages
  • Code generation
    • HumanEval and MBPP.
    • LLaMA > LaMDA and PaLM
  • Massive Multitask Language Understanding (MMLU)
    • MCQs on humanities, STEM and social sciences.
    • Instruction-tuned LLaMA-I 65B led to better results

ECIR 2025: LLMs are Human-like Annotators

14

15 of 153

LLaMA 2

  • A collection of pretrained and fine-tuned LLMs: 7B, 13B, 34B, 70B.
  • Fine-tuned LLMs: Llama 2-Chat.
  • Llama 2 models > open-source chat models on most benchmarks.
  • Commercial use license.

Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi and others. July 2023.

ECIR 2025: LLMs are Human-like Annotators

15

Helpfulness human evaluation on ~4k prompts

16 of 153

LLaMA 3

ECIR 2025: LLMs are Human-like Annotators

16

Meta Llama 3 instruct model

Meta Llama 3 pretrained model

17 of 153

LLaMA 3.2

ECIR 2025: LLMs are Human-like Annotators

17

Llama 3.2 vision models vs. comparable models:

| Category | Benchmark | Llama 3.2 11B | Llama 3.2 90B | Claude 3 Haiku | GPT-4o mini |
|---|---|---|---|---|---|
| Image: College-level Problems and Mathematical Reasoning | MMMU (val, 0-shot CoT, micro avg accuracy) | 50.7 | 60.3 | 50.2 | 59.4 |
| | MMMU-Pro, Standard (10 opts, test) | 33 | 45.2 | 27.3 | 42.3 |
| | MMMU-Pro, Vision (test) | 23.7 | 33.8 | 20.1 | 36.5 |
| | MathVista (testmini) | 51.5 | 57.3 | 46.4 | 56.7 |
| Image: Charts and Diagram Understanding | ChartQA (test, 0-shot CoT, relaxed accuracy) | 83.4 | 85.5 | 81.7 | - |
| | AI2 Diagram (test) | 91.1 | 92.3 | 86.7 | - |
| | DocVQA (test, ANLS) | 88.4 | 90.1 | 88.8 | - |
| Image: General VQA | VQAv2 (test) | 75.2 | 78.1 | - | - |
| Text: General | MMLU (0-shot, CoT) | 73 | 86 | 75.2 | 82 |
| Text: Math | MATH (0-shot, CoT) | 51.9 | 68 | 38.9 | 70.2 |
| Text: Reasoning | GPQA (0-shot, CoT) | 32.8 | 46.7 | 33.3 | 40.2 |
| Text: Multilingual | MGSM (0-shot, CoT) | 68.9 | 86.9 | 75.1 | 87 |

Llama 3.2 lightweight text models vs. comparable models:

| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5-mini IT |
|---|---|---|---|---|---|
| General | MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69 |
| | Open-rewrite eval (0-shot, rougeL) | 41.6 | 40.1 | 31.2 | 34.5 |
| | TLDR9+ (test, 1-shot, rougeL) | 16.8 | 19 | 13.9 | 12.8 |
| | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Tool Use | BFCL V2 | 25.7 | 67 | 27.4 | 58.4 |
| | Nexus | 13.5 | 34.3 | 21 | 26.1 |
| Math | GSM8K (8-shot, CoT) | 44.4 | 77.7 | 62.5 | 86.2 |
| | MATH (0-shot, CoT) | 30.6 | 48 | 23.8 | 44.2 |
| Reasoning | ARC Challenge (0-shot) | 59.4 | 78.6 | 76.7 | 87.4 |
| | GPQA (0-shot) | 27.2 | 32.8 | 27.5 | 31.9 |
| | Hellaswag (0-shot) | 41.2 | 69.8 | 61.1 | 81.4 |
| Long Context | InfiniteBench/En.MC (128k) | 38 | 63.3 | - | 39.2 |
| | InfiniteBench/En.QA (128k) | 20.3 | 19.8 | - | 11.3 |
| | NIH/Multi-needle | 75 | 84.7 | - | 52.7 |
| Multilingual | MGSM (0-shot, CoT) | 24.5 | 58.2 | 40.2 | 49.8 |

18 of 153

GPT-4

  • GPT-4 exhibits human-level performance on the majority of these professional and academic exams.
  • It passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers

GPT-4 Technical Report. OpenAI. https://cdn.openai.com/papers/gpt-4.pdf

ECIR 2025: LLMs are Human-like Annotators

18

19 of 153

Math word problems and reasoning QA

ECIR 2025: LLMs are Human-like Annotators

19

20 of 153

Math word problems and reasoning QA

ECIR 2025: LLMs are Human-like Annotators

20

21 of 153

Chart understanding and reasoning over data

ECIR 2025: LLMs are Human-like Annotators

21

Spot a data point that stands out in these charts and what that implicates. Then produce a detailed markdown table for all the data shown.

22 of 153

Image understanding and reasoning

ECIR 2025: LLMs are Human-like Annotators

22

Geometrical reasoning

23 of 153

Information seeking about objects

  • Prompt: Do you know the precise location where this image was taken?
  • Model Response: This image was taken in New York City. The Empire State Building is visible in the background. The street in the foreground is 8th Avenue. The cross street is West 34th Street.

ECIR 2025: LLMs are Human-like Annotators

23

Multimodal reasoning based on visual cues

24 of 153

Multimodal humor understanding

  • Prompt: Can you explain this meme?
  • Model Response: The meme is about a person who is playing a game at 300 FPS, but their monitor is only 75Hz. The person is trying to catch the game, but it is too fast for their monitor to display. The meme is funny because it is relatable to anyone who has ever played a game on a monitor that is not fast enough.

ECIR 2025: LLMs are Human-like Annotators

24

Commonsense reasoning in a multilingual setting

  • Prompt: Explain what i should call my grandparents on each side of my family based on this chart
  • Model Response: On your father’s side, you would call your grandparents Yéye and Nainai. On your mother’s side, you would call your grandparents Wàigong and Wàipó.

25 of 153

Reasoning and code generation

Create a web app called "Opossum Search":

1. Every time you make a search query, it should redirect you to a google search with the same query, but the word opossum before it.

2. It should be visually similar to Google search,

3. Instead of the google logo, it should have a picture of an opossum from the internet.

4. It should be a single html file, no separate js or css files.

5. It should say "Powered by google search" in the footer

ECIR 2025: LLMs are Human-like Annotators

25

26 of 153

Mathematics: Calculus

  • Prompt (video)
    • How could this person improve their technique?
  • Model Response
    • This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg.

ECIR 2025: LLMs are Human-like Annotators

26

Video understanding and reasoning

27 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

27

28 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

28

29 of 153

ChatGPT > crowd workers for text-annotation tasks

  • ChatGPT > crowd workers by 25%
    • Relevance: whether a tweet is about content moderation or, in a separate task, about politics
    • Topic detection: whether a tweet is about a set of six predefined topics (i.e., Section 230, Trump Ban, Complaint, Platform Policies, Twitter Support, and others)
    • Stance detection: whether a tweet is in favor of, against, or neutral about repealing Section 230 (on content moderation)
    • General frame detection: whether a tweet contains a set of two opposing frames (content moderation as a “problem” and “solution”).
    • Policy frame detection: whether a tweet contains a set of fourteen policy frames.
  • ChatGPT’s intercoder agreement exceeds that of both crowd workers and trained annotators.
  • Per-annotation cost of ChatGPT is < $0.003: ~30x cheaper than MTurk.

ECIR 2025: LLMs are Human-like Annotators

29

Accuracy means agreement with the trained annotators.

30 of 153

USDC: A Dataset of User Stance and Dogmatism in Long Conversations

  • Input: entire conversation and top two authors.
  • Stance: Strongly In Favor, Strongly Against, Stance Not Inferrable, Somewhat In Favor, or Somewhat Against
  • Dogmatism: Firm but Open, Open to Dialogue, Flexible or Deeply Rooted.

ECIR 2025: LLMs are Human-like Annotators

30

31 of 153

USDC: A Dataset of User Stance and Dogmatism in Long Conversations

  • Inter-annotator Agreement with LLMs as Annotators
    • LLMs: 0.485 (stance), 0.435 (dogmatism)
    • Humans: 0.34 (stance), 0.44 (dogmatism)
  • Verification using Human Interaction
    • 200 test conversations.
    • IAA: 0.56 (stance); 0.45 (dogmatism)
  • Stance Detection Evaluation on SPINOS
    • Use SLMs finetuned on USDC stance.
    • Evaluate on 5-class SPINOS dataset.
    • F1: SPINOS paper (0.341), random baseline (0.230), majority baseline (0.124), ours (0.320).
  • Complex and cumbersome nature of conversation understanding
    • Understanding user opinions and their shifts in multi-user conversational contexts.
    • Long-range memory capabilities.
  • Voting over ({Mistral Large, GPT-4}×{zero-shot, one-shot, few-shot}).

ECIR 2025: LLMs are Human-like Annotators

31

Failure Cases

32 of 153

Selective annotation and Prompt retrieval

  • Unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate (see the sketch below).
    • Build a graph with each sample as a node and SBERT similarity to build edges.
    • Choose high-degree nodes, discounting the degree contribution of nodes close to already chosen ones.
    • In a second stage, choose samples using model confidence scores from each of the 10 buckets.
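A rough sketch of the graph-based selection stage under the description above; the encoder name, neighbourhood size, and discount factor are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def vote_k_select(texts, budget=100, k=10, discount=0.9):
    """Greedy diverse selection: pick high in-degree nodes in a k-NN similarity
    graph, down-weighting nodes whose neighbourhood is already covered."""
    model = SentenceTransformer("all-MiniLM-L6-v2")          # any sentence encoder works
    emb = model.encode(texts, normalize_embeddings=True)
    sim = emb @ emb.T                                        # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)
    knn = np.argsort(-sim, axis=1)[:, :k]                    # k nearest neighbours per node

    score = np.zeros(len(texts))                             # in-degree of each node
    for i in range(len(texts)):
        for j in knn[i]:
            score[j] += 1.0

    selected = []
    for _ in range(min(budget, len(texts))):
        i = int(np.argmax(score))
        selected.append(i)
        score[i] = -np.inf                                   # never re-select
        for j in knn[i]:                                     # discount the chosen node's neighbours
            if score[j] != -np.inf:
                score[j] *= discount
    return selected
```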

ECIR 2025: LLMs are Human-like Annotators

32

ICL perf over varying annotation budgets for HellaSwag commonsense reasoning. LLM=GPT-J

100 annotated examples

33 of 153

LLMs instead of Human Judges?

  • Judge-Bench: 20 NLP datasets
  • Evaluate 11 LLMs.
  • Substantial variance across models and datasets.
  • Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human-produced or model-generated.
  • LLMs should be carefully validated against human judgments before being used as evaluators.

Bavaresco, A., Raffaella B., Leonardo B., Desmond E., Raquel F., A. Gatt, E. Ghaleb et al. "LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks." arXiv:2406.18403 (2024).

ECIR 2025: LLMs are Human-like Annotators

33

34 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

34

35 of 153

Chain of thought (CoT) prompting

  • Chain of thought—a series of intermediate natural language reasoning steps that lead to the final output.
  • It could be 0-shot or few-shot.
  • Prompt as a triple: <input, chain of thought, output>.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
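A minimal sketch of a few-shot CoT prompt assembled from <input, chain of thought, output> triples; the single exemplar shown is the well-known tennis-ball demonstration, and the target question is an illustrative assumption:

```python
# Each demonstration is an <input, chain of thought, output> triple.
cot_exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.",
        "11",
    ),
]

question = "A bakery had 23 muffins, sold 9, and then baked 12 more. How many muffins are there now?"

prompt = ""
for q, chain, answer in cot_exemplars:
    prompt += f"Q: {q}\nA: {chain} The answer is {answer}.\n\n"
prompt += f"Q: {question}\nA:"  # the model should produce its own chain before the final answer

print(prompt)
```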

ECIR 2025: LLMs are Human-like Annotators

35

36 of 153

What are advantages of chain of thought prompting?

  • Allows models to decompose multi-step problems into intermediate steps
  • Improves interpretability
  • PaLM 540B with CoT
    • achieved SOTA on StrategyQA
    • outperforms an unaided sports enthusiast on sports understanding.
  • CoT reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, etc.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).

Prompting PaLM 540B with just 8 CoT exemplars achieves SOTA on GSM8K math word problems, surpassing even finetuned GPT-3 with a verifier.

ECIR 2025: LLMs are Human-like Annotators

36

37 of 153

Analogical prompting

  • CoT needs labeled exemplars of the reasoning process.
  • To solve problems, humans think about related problems or high-level knowledge.
  • Analogical prompting
    • Prompt LLMs to self-generate relevant exemplars in the context, before proceeding to solve the given problem.
    • Avoids need for labelled exemplars. Can tailor the generated exemplars and knowledge to each problem

ECIR 2025: LLMs are Human-like Annotators

37

38 of 153

Analogical prompting methods

  • Self-generated exemplars
    • # Problem: [x]
    • # Relevant problems: Recall three relevant and distinct problems. For each problem, describe it and explain the solution.
    • # Solve the initial problem:
  • Self-generated knowledge + exemplars
    • # Tutorial: Identify core concepts in the problem and provide a tutorial.
    • Generating knowledge before exemplars yields superior results

Big Bench reasoning tasks with GPT3.5-Turbo
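A minimal sketch of the self-generated exemplars variant above, following the “# Problem / # Relevant problems / # Solve the initial problem” template; the wrapper function is an illustrative assumption:

```python
def analogical_prompt(problem: str, n_exemplars: int = 3) -> str:
    """One prompt that asks the model to recall its own exemplars before solving,
    so no labelled CoT exemplars are needed."""
    return (
        f"# Problem:\n{problem}\n\n"
        "# Relevant problems:\n"
        f"Recall {n_exemplars} relevant and distinct problems. For each problem, "
        "describe it and explain its solution.\n\n"
        "# Solve the initial problem:\n"
    )

print(analogical_prompt(
    "What is the area of the square with vertices at (-2, 2), (2, -2), (-2, -6), and (-6, -2)?"
))
```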

ECIR 2025: LLMs are Human-like Annotators

38

39 of 153

MoT: Memory-of-Thought

  • Finetuning is expensive.
  • Memory helps humans improve themselves in terms of decision-making, reasoning, judgment, etc.
  • MoT can improve LLMs via pre-thinking and recalling.
  • Pre-thinking
    • LLM pre-thinks on the unlabeled dataset
      • Few-Shot-CoT
      • Each demo has question, rationale and answer.
      • Get multiple <rationale, answer> pairs and choose majority-voted answer
    • Saves the high-confidence thoughts as external memory
      • Answer-entropy to filter out uncertain thoughts
      • Store <question, rationale, answer> as memory.
  • Recalling at test time
    • Given a test question, LLM recalls relevant memory to help itself reason and answer it.
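A rough sketch of the pre-thinking stage described above: sample several CoT rationales per unlabeled question, keep the majority-voted answer, and store it only if the answer entropy is low. The `sample_cot` callable and the entropy threshold are illustrative assumptions:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy of the sampled answers; low entropy = high confidence."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def prethink(question, sample_cot, n_samples=8, max_entropy=0.5):
    """sample_cot(question) -> (rationale, answer). Returns a memory item or None."""
    samples = [sample_cot(question) for _ in range(n_samples)]
    answers = [a for _, a in samples]
    if answer_entropy(answers) > max_entropy:
        return None                                   # too uncertain: discard this thought
    majority = Counter(answers).most_common(1)[0][0]
    rationale = next(r for r, a in samples if a == majority)
    return {"question": question, "rationale": rationale, "answer": majority}
```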

ECIR 2025: LLMs are Human-like Annotators

39

40 of 153

MoT: Memory-of-Thought

  • Recalling
    • Cluster memory items. Get topK semantically relevant memory item candidates from each cluster using SBERT.
    • Few-Shot-CoT with these memory items as extra context.
  • Baselines
    • MoT (no rationale): removes rationales in the retrieved memory and thus lets the LLM directly output the answer
    • MoT (no thinking): keeps rationales in the retrieved memory but forces the LLM to directly answer the question without CoT at recall stage.

MoT exceeds Few-Shot-CoT and Zero-Shot-CoT

ECIR 2025: LLMs are Human-like Annotators

40

  • Tree of thoughts
  • Graph of thoughts

41 of 153

Dynamic Program Prompting and Program Distillation


ECIR 2025: LLMs are Human-like Annotators

41

42 of 153

Dynamic Program Prompting and Program Distillation

Example prediction and retrieved program samples

ECIR 2025: LLMs are Human-like Annotators

42

43 of 153

LeanReasoner: Offloading reasoning to Lean

  • Lean: a theorem proving symbolic solver framework
  • Offloading reasoning to Lean: Reduces the risk of logical inconsistencies
  • SOTA performance on FOLIO and ProofWriter.
  • Fine-tuning on <100 in-domain samples for each dataset.
  • GPT-4 prompts for formalization and proof generation.

ECIR 2025: LLMs are Human-like Annotators

43

44 of 153

Symbolic reasoning for math word problems

Self-prompting: Prompt the LLM with the numeric problem and its response to it, then ask it to solve the symbolic version of the problem.

ECIR 2025: LLMs are Human-like Annotators

44

45 of 153

Symbolic Rule Learning for Robust Numerical Reasoning

  • Numerical reasoning for machine reading comprehension (RC) remains a difficult challenge.
  • ICL with LLMs to decompose complex questions into simpler sub-questions that are easier to answer with single-span RC models.
  • Symbolic learning methods like ILASP to learn rules for recomposing partial answers.
  • Benefits
    • Data efficiency: no training or fine-tuning.
    • Neuro-symbolic approach 🡺 robust numerical reasoning
    • Interpretable and verifiable reasoning traces.

ECIR 2025: LLMs are Human-like Annotators

45

46 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

46

47 of 153

ZeroGen: Efficient Zero-shot Learning via Dataset Generation

  • Tiny task model (TAM) has orders of magnitude fewer parameters than PLMs.
  • Variant of knowledge distillation but (a) does not require any human annotations (b) Flexible arch choice of student models.
  • TAM > PLM with only ∼0.4% of the parameters
  • In some low-resourced settings, TAM trained with synthesized data even outperforms the same model trained with human annotations in a fully supervised manner.

ECIR 2025: LLMs are Human-like Annotators

47

48 of 153

ZeroGen: Efficient Zero-shot Learning via Dataset Generation

  • The label is wrapped into a label-descriptive prompt
  • For sentence-pair classification tasks, we need to generate two sequences that bear certain relationships (e.g., premise and hypothesis in NLI, context and question in QA), as sketched below.
    • First generate and/or sample a conditional context (premise in NLI, context in QA).
    • The context is then concatenated with a sampled label and transformed into a prompt T.
    • Given the prompt T, generate the other sentence.
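A minimal sketch of this two-stage generation for an NLI-style task; the prompt wording and the `generate` helper are illustrative assumptions:

```python
import random

LABEL_PROMPTS = {
    "entailment":    'Premise: "{premise}"\nWrite a sentence that is definitely true given the premise:',
    "contradiction": 'Premise: "{premise}"\nWrite a sentence that is definitely false given the premise:',
    "neutral":       'Premise: "{premise}"\nWrite a sentence that might be true given the premise:',
}

def zerogen_nli_example(generate):
    """generate(prompt) -> text (a call to any PLM). Returns one synthetic (premise, hypothesis, label)."""
    premise = generate("Write a short sentence describing an everyday scene:")
    label = random.choice(list(LABEL_PROMPTS))           # sample a label y
    hypothesis = generate(LABEL_PROMPTS[label].format(premise=premise))
    return premise, hypothesis, label
```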

ECIR 2025: LLMs are Human-like Annotators

48

49 of 153

ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback

  • ZeroGen: suffers from low-quality issues (e.g., low informativeness, redundancy).
  • ProGen
    • Multi-phase dataset generation
    • In each phase, the generation is steered by feedback from the previously generated dataset, so as to synthesize a dataset with higher quality.
    • Feedback from the task-specific model to guide the generation of new training data via in-context examples.

ECIR 2025: LLMs are Human-like Annotators

49

50 of 153

ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback


ECIR 2025: LLMs are Human-like Annotators

50

51 of 153

SuperGen (Supervision Generation)


ECIR 2025: LLMs are Human-like Annotators

51

52 of 153

SuperGen (Supervision Generation)

ECIR 2025: LLMs are Human-like Annotators

52

53 of 153

TarGen: Targeted Data Generation

  • Multi-step prompting strategy (for WiC task)
    • Create a set of prompts (boxes 1, 2) to generate instance seeds unique to each task instance.
    • Create label-specific prompts (box 3) that generate instances based on instance seeds
    • Pass instances to self-correction module.
      • Verify alignment between generated instances and their labels, as well as the alignment between these instances and the task description.
  • Models trained on the synthetic datasets for 8 SuperGLUE tasks perform ∼1−3% points higher than those trained on original datasets.

ECIR 2025: LLMs are Human-like Annotators

53

54 of 153

Generation using Single Formatting Example

  • Labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering.
    • Labelers: Curating raw data for tasks in specialized domains, such as those in the biomedical or legal fields, can be particularly challenging
    • Generators: Requires careful curation of few-shot examples, or composition of prompts that highlight the semantic meaning of labels.
  • Self-reference strategy: iteratively samples from the pool of newly created examples to seed the prompt for the next round of generation.
    • random, contrastive, similar, and tree sampling (use examples from step 1 only).
  • Tree-based and Contrastive incurred the lowest cost.

ECIR 2025: LLMs are Human-like Annotators

54

55 of 153

Generation with Active Learning

  • Optimizing LLM as Better Annotator
    • Few-shot inference without finetuning.
    • k-NN few-shot example retrieval
    • Label Verbalizer
      • “per:parents” 🡪 either “subject is the parent of object” or “object is the parent of subject”.
  • Active Data Acquisition
    • Random
    • Maximum Entropy
    • Least Confidence
    • K-Means Diversity sampling
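A rough sketch of the uncertainty-based acquisition strategies listed above, scoring unlabeled examples from the task model's predicted class probabilities (function and parameter names are illustrative assumptions):

```python
import numpy as np

def max_entropy(probs: np.ndarray) -> np.ndarray:
    """probs: (n_examples, n_classes). Higher score = more uncertain example."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def least_confidence(probs: np.ndarray) -> np.ndarray:
    return 1.0 - probs.max(axis=1)

def select_for_annotation(probs, budget, strategy=max_entropy):
    """Pick the `budget` most informative examples for the LLM to annotate next
    (random and k-means diversity sampling are the other options listed above)."""
    scores = strategy(probs)
    return np.argsort(-scores)[:budget]
```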

ECIR 2025: LLMs are Human-like Annotators

55

56 of 153

Generation with Active Learning

  • FewGen: enhances ZeroGen with in-context examples uniformly sampled from the demonstration pool.

ECIR 2025: LLMs are Human-like Annotators

56

57 of 153

Generating Annotations for NLP Tasks using LLMs

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
  • How can we use LLMs to generate samples and then label them?
  • Generating diverse and attribute-specific datasets.

ECIR 2025: LLMs are Human-like Annotators

57

58 of 153

Dictionary-assisted training data generation

  • Dictionary-Assisted Training Data Generation (DADG)
    • First query Wikidata for head-tail entity pairs under relation P6, the relation ID for “head of government”.
    • GPT-3 can then be instructed to generate a sentence with the specified entity pairs and relation.

ECIR 2025: LLMs are Human-like Annotators

58

Prompt-guided unlabeled data annotation (PGDA)

Prompt-guided training data generation (PGDG)

  • Few shot ICL
  • PGDG: To generate training data with relation “head of government”
    • First "teach" GPT-3 to generate head-tail entity pairs that have the specified relation
    • Then instruct GPT-3 to generate a sentence with the given entities and relation (see the sketch below).
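A minimal sketch of the two-step PGDG idea above for the relation “head of government”; the prompts, seed pairs, and the `generate` helper are illustrative assumptions:

```python
def pgdg_generate(generate, relation="head of government", n_pairs=5):
    """generate(prompt) -> text. Step 1: ask for entity pairs holding the relation;
    Step 2: ask for a sentence expressing the relation for each pair."""
    pair_prompt = (
        f'List {n_pairs} (head, tail) entity pairs for the relation "{relation}".\n'
        "Example: (Germany, Angela Merkel)\n"        # placeholder seed pairs
        "Example: (Canada, Justin Trudeau)\n"
        "Pairs:"
    )
    pairs = [line.strip() for line in generate(pair_prompt).splitlines() if line.strip()]

    examples = []
    for pair in pairs:
        sent_prompt = (
            f'Write one sentence expressing the relation "{relation}" '
            f"between the entities {pair}."
        )
        examples.append({"text": generate(sent_prompt), "entities": pair, "relation": relation})
    return examples
```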

59 of 153

Dictionary-assisted training data generation

  • PGDA is better for tasks with smaller label spaces and clearly defined labels.
    • sentiment analysis
  • Generation-based approaches (PGDG and DADG) are better suited for tasks with larger label spaces or labels that possess a certain degree of ambiguity.

  • Tagging-based approach
    • Allows for direct access to in-domain unlabeled data
    • As the label space becomes larger, prompt size increases.
  • Generation-based approaches
    • May generate data that contains information that was "learned" during pre-training and may not align with the distribution of in-domain data.
    • Cost-effective, as the prompts used can be significantly shorter and multiple examples can be generated from a single prompt.

ECIR 2025: LLMs are Human-like Annotators

59

60 of 153

PersonaChatGen: Generating Personalized Dialogues

  • ProfileGen Filtering
    • Remove if output does not have a (persona category, persona entity) key-value pair
    • Exact Matching Persona Entity.
    • Preserving Persona Category in generated profile sentences.
    • Duplication Filtering.
  • PersonaChatGen Filtering
    • Remove samples where GPT-3 copies given profile sentences when generating personalized dialogue
    • Persona Consistency with utterances.
    • Remove toxic samples.

ECIR 2025: LLMs are Human-like Annotators

60

61 of 153

PersonaChatGen: Generating Personalized Dialogues

ECIR 2025: LLMs are Human-like Annotators

61

62 of 153

Mixture of Soft Prompts for Controllable Data Generation

  • MSP learns a set of soft prompts, mixes them together to generate attribute-preserving examples, then merges the augmented and original data to train a smaller, downstream model.
  • Individual examples in some tasks can contain multiple attributes.
    • Multi-aspect intent detection: a dialogue utterance may have 3 intent attributes.
  • Attribute Mixing: Concat; Pooling; Attention; Bottleneck; CNN

ECIR 2025: LLMs are Human-like Annotators

62

63 of 153

Mixture of Soft Prompts for Controllable Data Generation

Soft prompts are initialized with the name and description of the attribute, e.g., “song is a musical song or melody”.

ECIR 2025: LLMs are Human-like Annotators

63

64 of 153

Attributed Training Data Generator

  • Attributed generation can increase diversity and reduce bias
  • “location” in NYT news dataset
    • “North America” (68.01%)
    • “Africa” (0.69%)
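A rough sketch of attributed generation: sample a value for each attribute and condition the generation prompt on the sampled combination to diversify the synthetic data. The attribute lists and prompt wording are illustrative assumptions:

```python
import random

ATTRIBUTES = {
    "location": ["North America", "Africa", "Europe", "Asia", "South America", "Oceania"],
    "style":    ["breaking news", "analysis piece", "interview summary"],
    "length":   ["about 30 words", "about 80 words"],
}

def attributed_prompt(topic: str) -> str:
    """Sample one value per attribute and build an attribute-conditioned prompt."""
    values = {name: random.choice(options) for name, options in ATTRIBUTES.items()}
    return (
        f"Write a {values['style']} news snippet about {topic}, "
        f"set in {values['location']}, {values['length']}."
    )

print(attributed_prompt("public transportation"))
```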

ECIR 2025: LLMs are Human-like Annotators

64

65 of 153

Attributed Training Data Generator


ECIR 2025: LLMs are Human-like Annotators

65

66 of 153

Summary

  • Are LLMs good annotators?
  • How can we get better annotation accuracy from LLMs?
    • CoT, analogical prompting, MoT, program prompting, LeanReasoner, Symbolic Reasoning
  • How can we use LLMs to generate samples and then label them?
    • ZeroGen, ProGen, SuperGen, TarGen, 1-shot Gen, Active learning
  • Generating diverse and attribute-specific datasets.
    • Dictionary-assisted generation, PersonaChatGen, Attributed Training Data Generation

ECIR 2025: LLMs are Human-like Annotators

66

67 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

67

68 of 153

ECIR 2025: LLMs are Human-like Annotators

68

What is reasoning?

  • Reasoning is the ability to make inferences using evidence and logic.
  • Reasoning can be divided into multiple types of skills such as Commonsense, Mathematical, and Symbolic reasoning etc.
  • Often, reasoning involves deductions from inference chains, called multi-step reasoning.

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Do language models truly understand and apply common sense reasoning?

69 of 153

ECIR 2025: LLMs are Human-like Annotators

69

Reasoning Problems

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Hard Language Tasks: require multiple steps of reasoning to solve

70 of 153

ECIR 2025: LLMs are Human-like Annotators

70

Multi-step reasoning is often seen as a weakness in language models

Towards Reasoning in Large Language Models: A Survey (Huang et al., 2023)

  • It is unclear to what extent LLMs are capable of reasoning

Earlier research studied reasoning in small language models through fully supervised finetuning on specific datasets:

  • Creating a dataset containing explicit reasoning can be difficult and time-consuming
  • Training on a specific dataset limits application to a specific domain

Reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters

71 of 153

ECIR 2025: LLMs are Human-like Annotators

71

Reasoning and Commonsense Benchmarks

Source: https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond#different-types-of-llm-benchmarks

  • How can we assess reasoning abilities in language models?
  1. TruthfulQA — Truthfulness
  2. MMLU — Language understanding
  3. HellaSwag — Commonsense reasoning
  4. BIG-Bench Hard — Challenging reasoning tasks
  5. HumanEval — Coding challenges
  6. CodeXGLUE — Programming tasks
  7. Chatbot Arena — Human-ranked ELO-based benchmark
  8. MT Bench — Complex conversational ability

72 of 153

ECIR 2025: LLMs are Human-like Annotators

72

How is reasoning measured (in the literature)?

GPT-4 Technical Report (OpenAI).

  • While SOTA models at the time struggled to score above 50% with pre-training, GPT-4 achieved a record-high of 95.3% with just 10-shot prompting in 2023.

73 of 153

ECIR 2025: LLMs are Human-like Annotators

73

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

Prompt: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

11 apples

  • Keep in mind that the authors claim that this is an emergent ability that arises with sufficiently large language models.

74 of 153

ECIR 2025: LLMs are Human-like Annotators

74

Chain of thought prompting: Arithmetic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Does not positively impact performance for small models
  • Few-shot CoT achieves better performance than zero-shot CoT on LLMs.
  • 8 manually composed exemplars
  • All contain equations with flexible formats
  • Benchmarked on:
    • GSM8K (Cobbe et al. 2021)
    • SVAMP (Patel et al., 2021)
    • MAWPS (Koncel-Kedziorski et al., 2016)

75 of 153

ECIR 2025: LLMs are Human-like Annotators

75

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Chain-of-thought prompting improves performance on a range of arithmetic reasoning tasks

76 of 153

ECIR 2025: LLMs are Human-like Annotators

76

Chain of thought prompting: Symbolic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Standard prompting fails out-of-domain tests for both tasks
  • Zero-shot CoT using Instruct-GPT-3 175B achieves similar performance to few-shot CoT using the 540B PaLM model on both tasks

77 of 153

ECIR 2025: LLMs are Human-like Annotators

77

Chain of thought prompting: Commonsense Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • CoT shows minimal benefits on the CSQA and StrategyQA tasks
  • Few-shot CoT achieves better performance than zero-shot CoT with the 175B GPT-3 model on the CSQA and StrategyQA tasks, but zero-shot CoT shows significant improvement on the date understanding task.

78 of 153

ECIR 2025: LLMs are Human-like Annotators

78

More Advances: Self consistency

Self-consistency improves chain of thought reasoning in language models. (Wang et al., 2022)

  • Change greedy decode (single-path) to self-consistency (multi-path) in few-shot CoT
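A minimal sketch of self-consistency on top of few-shot CoT: sample several reasoning paths with temperature > 0 and return the majority answer. The `sample_cot_answer` helper and the number of paths are illustrative assumptions:

```python
from collections import Counter

def self_consistent_answer(question, sample_cot_answer, n_paths=10):
    """sample_cot_answer(question) -> final answer string, sampled with temperature > 0.
    Greedy single-path decoding is replaced by voting over sampled reasoning paths."""
    answers = [sample_cot_answer(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]   # majority vote over final answers
```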

79 of 153

ECIR 2025: LLMs are Human-like Annotators

79

STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning

STaR: Self-Taught Reasoner. (Zelikman et al., 2022)

  • Self-Taught Reasoner (STaR), which iteratively improves a model’s ability to generate rationales to solve problems.
  • Participants were 74% more likely to prefer the STaR-generated rationales over the human-generated rationales.

80 of 153

ECIR 2025: LLMs are Human-like Annotators

80

Program-aided Language Models (PAL)

PAL: Program-aided Language Models (Gao et al., 2023)

  • PAL enhances problem-solving by generating code to represent intermediate reasoning steps (contrast this with CoT prompting, which reasons in natural language); see the sketch below.
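A rough sketch of the PAL idea: the LLM emits a short program as its reasoning steps and the Python interpreter, not the model, computes the final answer. The generated program is hard-coded here for illustration (it reuses the apple word problem shown earlier); in practice it would come from a code-generation prompt:

```python
# In PAL, reasoning steps are emitted as code; the runtime does the arithmetic.
generated_program = """
apples = 10
apples -= 2          # gave 2 to the neighbor
apples -= 2          # gave 2 to the repairman
apples += 5          # bought 5 more
apples -= 1          # ate 1
answer = apples
"""

scope = {}
exec(generated_program, scope)   # offload execution to the Python interpreter
print(scope["answer"])           # 10; no arithmetic is left to the language model
```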

81 of 153

ECIR 2025: LLMs are Human-like Annotators

81

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA models exhibit remarkable improvements over previous state-of-the-art approaches across all scales

82 of 153

ECIR 2025: LLMs are Human-like Annotators

82

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • Utilize GPT-4 to synthesize high-quality trajectories on arithmetic reasoning datasets.

83 of 153

ECIR 2025: LLMs are Human-like Annotators

83

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA provides valuable insights for future research, contributing to the development of more advanced and versatile reasoning agents.

84 of 153

ECIR 2025: LLMs are Human-like Annotators

84

Plan-and-Solve Prompting

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (Wang et al., 2023)

  1. Zero-shot-CoT prompting
  2. Plan-and-Solve (PS) prompting
  3. Answer extraction prompting
  • Zero-shot PS+ prompting consistently outperforms zero-shot CoT across all reasoning datasets by a large margin

85 of 153

ECIR 2025: LLMs are Human-like Annotators

85

Can we use LLMs to benchmark reasoning datasets?

  • With models like GPT-4, it's now possible to synthetically produce datasets that are more comprehensive and diverse than human-labeled ones, in far less time

86 of 153

ECIR 2025: LLMs are Human-like Annotators

86

Reasoning datasets: CriticBench

  • CRITICBENCH is designed to assess the two key aspects of LLMs’ critical reasoning:
    • critique
    • correction

87 of 153

ECIR 2025: LLMs are Human-like Annotators

87

Reasoning datasets: Question collection on CriticBench

Response collection from LLMs:

  • Employ various LLMs, including GPT-3.5, GPT-4, LLaMA-2 (7B, 13B, and 70B variants), and Vicuna (7B, 13B, and 33B variants)
  • Each LLM generates a response for each question and provides its reasoning

Response annotation:

  • Response correctness:
    • rule-based matching, followed by a more detailed evaluation using GPT-4 and Humans

Domains:

  • 15 datasets spanning 5 domains: mathematical, commonsense, symbolic, coding, and algorithmic.

Question collection:

  • Randomly select a quantity of data from existing datasets

88 of 153

ECIR 2025: LLMs are Human-like Annotators

88

Reasoning datasets: Evaluation process on CriticBench

89 of 153

ECIR 2025: LLMs are Human-like Annotators

89

Reasoning datasets: Annotation example of CriticBench

  • GPT-4 evaluation is closer to human evaluation

90 of 153

ECIR 2025: LLMs are Human-like Annotators

90

Reasoning datasets: Key Factors in Critical Reasoning

  • LLMs struggle more with incorrect answers in detail-oriented tasks like algorithmic tasks compared to logic-centric tasks like code generation

91 of 153

ECIR 2025: LLMs are Human-like Annotators

91

Reasoning datasets: Average performance on CriticBench

  • The knowledge acquired by LLMs is not entirely consistent across generation, critique, and correction tasks.

92 of 153

ECIR 2025: LLMs are Human-like Annotators

92

Reasoning datasets: Consistency of GQC Knowledge

  • Analysis across different task types found that models perform better in Q and C for tasks focused on logic compared to those requiring attention to detail.

93 of 153

ECIR 2025: LLMs are Human-like Annotators

93

Human preference benchmarks with reasoning tasks

  • How to construct a benchmark which consists of real-world multi-turn conversations?
  • How to automatically evaluate the model’s responses?

Why human preference benchmarks?

  • Existing NLP benchmarks fail to accurately evaluate LLM chat assistants
  • Knowledge benchmark (e.g., MMLU)
    • Not open-ended
  • Instruction benchmark (e.g., Super-Natural Instructions)
    • Only one turn
    • Human annotation efforts
  • Conversational benchmark (e.g., CoQA)
    • Too easy for state-of-the-art LLMs

94 of 153

ECIR 2025: LLMs are Human-like Annotators

94

Human preference benchmarks: LLMs as judges

  • LLM judges provide not only scores but also explanations, offering interpretable outputs and valuable insights.

Four popular benchmarks

  • MT-Bench
  • Chatbot Arena
  • MT-Bench-101
  • HumanEval

95 of 153

ECIR 2025: LLMs are Human-like Annotators

95

LLM-as-a-judge

  • The agreement between GPT-4 judges and humans reaches over 80%, the same level of agreement as among humans.
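A minimal sketch of a pairwise LLM-as-a-judge prompt of the kind used in these benchmarks; the exact wording is an illustrative assumption, and judging both orderings is a common way to reduce position bias:

```python
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two assistant responses
to the user question below. Consider helpfulness, relevance, accuracy and detail.
Output exactly one verdict: "A", "B", or "tie", followed by a short explanation.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""

def judge_pair(ask_judge, question, answer_a, answer_b):
    """ask_judge(prompt) -> verdict text. Judge both orderings to reduce position bias."""
    v1 = ask_judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b))
    v2 = ask_judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_b, answer_b=answer_a))
    return v1, v2   # disagreement between the two orderings can be treated as a tie
```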

96 of 153

ECIR 2025: LLMs are Human-like Annotators

96

Limitations of LLM-as-a-judge

Position bias

  • LLMs favor answers in the first position

Verbosity bias

  • LLMs favor longer answers

Self-appreciation bias

  • LLMs favor their own answers, or answers similar to their own

Limited reasoning ability

  • LLMs fail to judge hard math/reasoning/code questions
  • It's crucial to acknowledge the limitations of LLM-as-a-judge, such as its inability to detect hallucinations and penalize LLM generated answers accordingly, and potential errors when grading math/reasoning questions.

97 of 153

ECIR 2025: LLMs are Human-like Annotators

97

Human preference benchmarks: MT-Bench-101

  • The agreement between GPT-4 and human expert evaluations reached 87%, using the paper's designed evaluation approach.

98 of 153

ECIR 2025: LLMs are Human-like Annotators

98

MT-Bench-101: Hierarchical Ability Taxonomy

  • Utilized GPT-4, the most powerful model for multi-turn dialogues, to construct the data.

13 tasks

3-level abilities

99 of 153

ECIR 2025: LLMs are Human-like Annotators

99

MT-Bench-101: Model’s performance

  • Closed-source models consistently exhibit superior performance compared to open-source counterparts across all evaluated tasks.
  • Content confusion and format rephrasing are relatively less difficult, while the mathematical reasoning task is the most challenging

100 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

100

101 of 153

ECIR 2025: LLMs are Human-like Annotators

101

Why Focus on Evaluation

Source: https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/

102 of 153

ECIR 2025: LLMs are Human-like Annotators

102

LLM Evaluation vs. Human Evaluation

103 of 153

ECIR 2025: LLMs are Human-like Annotators

103

How to scale “human evaluation”?

  • Chatbot Arena lets users manually vote for the superior response; to scale this, one can instead leverage multiple LLMs to autonomously determine which response stands out.

104 of 153

ECIR 2025: LLMs are Human-like Annotators

104

LLM Evaluation

An LLM benchmark bundles several evaluation configurations for a given model. Each configuration fixes the shots, instruction, task, dataset, and metric, and the per-configuration results are aggregated into an overall benchmark score. On the model side, the system prompt, system format, and hyperparameters must also be fixed.

105 of 153

ECIR 2025: LLMs are Human-like Annotators

105

LLM Evaluation

Task:

  • What is the concrete problem we want to address (e.g., classification, summarization, commonsense reasoning)?

Dataset:

  • Which dataset do we want to use?

Metric:

  • How do we evaluate the performance?

Instruction:

  • Evaluate the model's ability to perform an unseen task given context in the form of instructions

Shot:

  • Whether the model is given prior examples at inference time (few-shot) or not (zero-shot)

106 of 153

ECIR 2025: LLMs are Human-like Annotators

106

LLM Evaluation

Model

System Prompt

System Format

Hyperparams

<SYS> You are a helpful model </SYS>

<instruction> Translate this sentence to French

<user> I like pizza

<assistant> J'aime la pizza

107 of 153

ECIR 2025: LLMs are Human-like Annotators

107

LLM Evaluation: Alpaca

  • GPT-4 based evaluation of chatbot output agrees well with human evaluation; known before
  • Surprisingly, judge Llama 2 (70B) can compete with judge GPT-4 (1760B). But judge Llama 2 is more biased (more ties) and less capable of following the instructions carefully.

108 of 153

ECIR 2025: LLMs are Human-like Annotators

108

LLM Evaluation: G-Eval

  • The LLM consistently gives higher scores to GPT-3.5 summaries, even though human judges prefer human-written summaries

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)

  • Can we rely on LLMs to assess LLM outputs?
  • Do LLMs show a bias towards the outputs they generate during evaluations?

109 of 153

ECIR 2025: LLMs are Human-like Annotators

109

LLM Evaluation: GPT-Score

https://github.com/confident-ai/deepeval

110 of 153

ECIR 2025: LLMs are Human-like Annotators

110

Language model-written evaluations

https://github.com/confident-ai/deepeval

  • Generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size.
  • Overall, LM-written evaluations are high quality and quickly discover many novel LM behaviors.

111 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

111

112 of 153

Generate a synthetic dataset using LLMs

112

AutoLabel:

  • Labeling Agent provides a method to generate synthetic datasets
  • Supports synthetic dataset generation for classification and entity matching tasks

Prodigy:

  • Provides LLM-based annotation: you supply a prompt in order to annotate an NLP task.

Labelbox:

  • Easily curate and annotate data
  • Generate high-quality human feedback data for computer vision and language models, evaluate and improve model performance

LLM-data-annotation:

  • Combines human expertise with the efficiency of Large Language Models (LLMs) like OpenAI's GPT-3.5 to simplify dataset annotation and model improvement.

113 of 153

AutoLabel

113

  • How to use the AutoLabel tool to annotate a task?

114 of 153

AutoLabel: Question Answering

114

  • First step: specify a labeling configuration

115 of 153

AutoLabel: Question Answering

115

  • Second step: do a dry run on the test dataset using the LLM specified in config.json by running agent.plan

116 of 153

AutoLabel: Question Answering

116

  • Final step: run the labeling with agent.run
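A rough sketch tying the three steps together with the autolabel library; the config contents and exact call signatures follow the library's documented pattern but are abbreviated here and should be checked against the AutoLabel docs:

```python
# NOTE: signatures follow AutoLabel's documented usage pattern; verify against the current docs.
from autolabel import LabelingAgent

# Step 1: the labeling configuration (task type, LLM provider/model, prompt, few-shot examples)
agent = LabelingAgent(config="config.json")

# Step 2: dry run: prints example prompts and an estimated labeling cost, without labeling the full set
agent.plan("test.csv")

# Step 3: run the labeling with the configured LLM
labeled = agent.run("test.csv")
```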

117 of 153

LLMs can label data as well as humans, but 100x faster

117

  • Refuel LLM (84.2%) outperforms trained human annotators (80.4%), GPT-3.5-turbo (81.3%), PaLM-2 (82.3%) and Claude (79.3%) across a benchmark of 15 text labeling datasets.

118 of 153

LLMs can label data: Quality Evaluation

118

  • Refuel LLM outperforms human annotators and all closed source LLMs, excluding GPT-4

119 of 153

LLMs can label data: Quality Evaluation

119

  • RefuelLLM-2 (83.82%) outperforms all current state-of-the-art LLMs for data labeling and enrichment, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%) and Gemini-1.5-Pro (74.59%)

120 of 153

LLMs can label data: Quality Evaluation

120

  • Refuel-LLM-2 is competitive or superior in terms of output quality, compared to current state-of-the-art LLMs

121 of 153

Prodigy

121

https://demo.prodi.gy/?=null&view_id=ner_manual

What Prodigy isn’t:

  • “Software as a service” – it’s a tool that you can download, install and run yourself

Usage:

  • Annotate NLP tasks, Audio and vision tasks

122 of 153

AutoLabel tools: Which one is better?

122

  • Autolabel currently supports chain-of-thought prompting, making it suitable for reasoning tasks.
  • It supports labeling tasks spanning categories such as classification, entity resolution, matching, reading comprehension, reasoning and information extraction.

123 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]

  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Evaluation of LLM Generated Annotations [30 mins]

  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]

ECIR 2025: LLMs are Human-like Annotators

123

124 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

ECIR 2025: LLMs are Human-like Annotators

124

125 of 153

Hallucination

125

  • How did hallucination evolve in LLMs?
  • In the context of language models: the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.

126 of 153

Evolution of Hallucination in LLMs

126

  • LLMs simply generate text without knowing the specific details of the visit's purpose.

127 of 153

Evolution of Hallucination in LLMs

127

  • Despite their advanced capabilities, MLLMs, like all LLMs, are prone to hallucination

128 of 153

Evolution of Hallucination in LLMs

128

  • The problem of hallucination also exists in other foundation models, such as image, video, and audio models

129 of 153

Taxonomy of Hallucinations

ECIR 2025: LLMs are Human-like Annotators

129

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Like their text-only LLM counterparts, video- and audio-based language models are also prone to hallucination.

130 of 153

Taxonomy of Hallucinations: Causes

ECIR 2025: LLMs are Human-like Annotators

130

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

131 of 153

Taxonomy of Hallucinations: Metrics and Benchmarks

ECIR 2025: LLMs are Human-like Annotators

131

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

132 of 153

Taxonomy of Hallucinations: Mitigation

ECIR 2025: LLMs are Human-like Annotators

132

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

133 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]

  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Evaluation of LLM Generated Annotations [30 mins]

  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

ECIR 2025: LLMs are Human-like Annotators

133

134 of 153

Hallucination Types

134

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Hallucination can occur in any NLG task, necessitating a thorough examination based on the fundamental principles of text generation from a given prompt.

135 of 153

Hallucination Types: Orientation, Category and Degree

135

136 of 153

Hallucination Types: Orientation

136

Factual Mirage:

  • an LLM engages in hallucination or distortion of a given prompt that is factually correct.

Intrinsic

Extrinsic

  • The generated text that appears to be true or real but is actually misleading or false.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

137 of 153

Hallucination Types: Orientation

137

Silver Lining:

  • an LLM indulges in hallucination by conjuring an elaborate and captivating narrative based on a given prompt that is factually incorrect.

Intrinsic

Extrinsic

  • LLM produces unfaithful content or is inconsistent with the provided source prompt.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

138 of 153

Hallucination Types: Category

138

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • an LLM generates text that exhibits a mashed-up fusion of events from different timelines.

139 of 153

Hallucination Types: Category

139

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Geographic erratum: an LLM generates an incorrect location for an event.

140 of 153

Hallucination Types: Category

140

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Virtual voice: an LLM generates quotations attributed to either fictional or real characters without sufficient evidence to verify the authenticity of such statements.

141 of 153

Hallucination Types: Category

141

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Generated golem: an LLM fabricates an imaginary personality in relation to a past event, without concrete evidence.

142 of 153

Hallucination Types: Category

142

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • an LLM generates numeric values related to past events, such as dates, ages, or monetary amounts, that are inconsistent with the actual facts.

143 of 153

Hallucination Types: Category

143

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • LLMs generate an imprecise expansion for an acronym.

144 of 153

Hallucination Types: Degree

144

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Alarming indicates added information pieces that bear a radical dissemblance from the topic fed via the prompt.
  • Moderate indicates a level of hallucination that introduces facts that are either fictitious or tangential to the topic at hand.
  • Mild indicates minor hallucination which is superficial in terms of its impact.
  • How can we detect and evaluate LLM’s hallucinations across different types?

145 of 153

Hallucination Detection: SelfCheckGPT

145

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)

  • SelfCheckGPT can effectively detect hallucinated sentences with higher accuracy compared to several baseline methods
  • SelfCheckGPT's prompting method achieved the highest performance in detecting non-factual sentences
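A rough sketch of the consistency idea behind SelfCheckGPT: sample several stochastic responses to the same prompt and score each sentence of the main response by how weakly the samples support it. The `supports` scorer is a stand-in for the method's BERTScore/QA/n-gram/NLI/prompting variants:

```python
def selfcheck_scores(main_sentences, sampled_responses, supports):
    """supports(sentence, sample_text) -> probability that the sample supports the sentence.
    Returns one inconsistency score per sentence (higher = more likely hallucinated)."""
    scores = []
    for sent in main_sentences:
        support = [supports(sent, sample) for sample in sampled_responses]
        scores.append(1.0 - sum(support) / len(support))
    return scores

# Usage sketch: sample N responses at temperature 1.0 for the same prompt, then flag
# sentences whose inconsistency score exceeds a chosen threshold as potential hallucinations.
```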

146 of 153

Hallucination Detection: FACTScore

146

"Factscore: Fine-grained atomic evaluation of factual precision in long form text generation (Sewon et.al, 2023)

  • GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.

147 of 153

Hallucination eLiciTation dataset

147

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Selected 15 LLMs, and used them to generate a total of 75,000 text passages, with each LLM producing 5,000 text prose entries.
  • The text prompts provided to these LLMs consisted of tweets from NYTimes and headlines sourced from the Politifact dataset

148 of 153

Hallucination Vulnerability Index (HVI)

148

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Larger LLMs without RLHF are prone to both orientations of hallucination
  • For smaller LLMs like T5, Dolly, etc., Generated Golem, Virtual Voice, and Geographic Erratum categories of hallucination are rarely observed.

149 of 153

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [30 mins]
  • Generating Annotations for NLP Tasks using LLMs [30 mins]
  • Benchmarking the LLM Annotations and Human Annotations [30 mins]
  • Coffee break [30 min]
  • Evaluation of LLM Generated Annotations [30 mins]
  • Autolabel Tools to Label Reasoning Datasets [20 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [40 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

ECIR 2025: LLMs are Human-like Annotators

149

150 of 153

Hallucination Mitigation

150

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation (Varshney et al., 2023)

  • Leveraging external knowledge to validate the correctness: RAG
  • Modifying the decoding strategy
  • Sampling multiple outputs and checking their consistency: SelfCheckGPT
  • The proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average

151 of 153

Hallucination Mitigation: Chain-Of-Verification (CoVe)

151

Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023)

  • CoVe reduces inaccuracies in LLMs' responses by verifying facts through structured questioning.
  • Significant reduction in factual hallucinations.
  • Final responses are more reliable and factually accurate.
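A minimal sketch of a CoVe-style loop: draft an answer, plan verification questions, answer them independently of the draft, then revise. The `ask` helper and prompt wording are illustrative assumptions:

```python
def chain_of_verification(ask, user_query):
    """ask(prompt) -> model text. Four-step CoVe-style loop."""
    draft = ask(f"Answer the question:\n{user_query}")

    plan = ask(
        "List short fact-checking questions that would verify the claims in this answer:\n"
        f"{draft}"
    )
    verification_qs = [q.strip("- ").strip() for q in plan.splitlines() if q.strip()]

    # Answer each verification question independently of the draft to avoid copying its errors.
    verified = [(q, ask(q)) for q in verification_qs]

    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verified)
    return ask(
        f"Original question: {user_query}\n"
        f"Draft answer: {draft}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Write a final answer that is consistent with the verification answers."
    )
```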

152 of 153

Is hallucination always bad?

152

https://www.washingtonpost.com/opinions/2023/12/27/artificial-intelligence-hallucinations/

  • Numerals: Models have been shown to hallucinate a lot while generating numerals, such as dates, quantities, and scalars.
  • Long Text: Models often tend to self-contradict while generating the output.
  • Reasoning: Misunderstanding facts/information present in the source text can lead to hallucinations and errors.
  • When Contextual Knowledge Conflicts with the Parametric Knowledge: Models have been shown to prioritize the parametric knowledge (acquired during pre-training) over the contextual knowledge which leads to hallucinations.

153 of 153

Thanks!

ECIR 2025: LLMs are Human-like Annotators

153