1 of 143

LLMs are Human-like Annotators

Mounika Marreddy1, Subba Reddy Oota2, Lucie Flek1, Manish Gupta3

1University of Bonn, Germany; 2TU Berlin, Germany; 3Microsoft, India

mmarredd@uni-bonn.de, subba.reddy.oota@tu-berlin.de, flek@bit.uni-bonn.de, gmanish@microsoft.com

KR 2024

21st International Conference on Principles of Knowledge Representation and Reasoning

Nov 2 - 8, 2024. Hanoi, Vietnam

2 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]: Manish
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]: Manish
  • Reasoning Datasets with LLM Annotations [25 mins]: Subba
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]: Subba
  • Autolabel Tools to Label Reasoning Datasets [25 mins]: Mounika
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]: Mounika

KR 2024: LLMs are Human-like Annotators

2

3 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

3

4 of 143

Deep Learning and Large Language Models

KR 2024: LLMs are Human-like Annotators

4

Basic: ANNs, CNNs, RNNs, LSTMs

NLP: Encoder-Decoder, Attention, Transformers, BERT, GPT, T0, BART, T5…

Prompt-based models: GPT-3, T0/mT0, InstructGPT, prompting

5 of 143

GPT-3

  • Humans do not require large supervised datasets to learn most language tasks
  • The following is often enough:
    • A brief directive in natural language (e.g., “please tell me if this sentence describes something happy or something sad”)
    • A tiny number of demonstrations (e.g., “here are two examples of people acting brave; please give a third example of bravery”)
  • This is in-context learning (see the prompt sketch below)

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
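To make the in-context learning setup concrete, here is a minimal sketch: a brief directive plus a couple of demonstrations assembled into a prompt. The task, examples, and labels are illustrative, and sending `prompt` to a specific LLM API is left out.

```python
# Minimal sketch of an in-context (few-shot) prompt: a brief task directive
# plus a handful of demonstrations, followed by the new input to label.
# Task and examples are illustrative; send `prompt` to whichever LLM API you use.

directive = "Please tell me if the sentence describes something happy or something sad."

demonstrations = [
    ("We finally adopted a puppy this weekend!", "happy"),
    ("My flight got cancelled and I missed the wedding.", "sad"),
]

new_input = "The whole team celebrated after the product launch."

prompt_lines = [directive, ""]
for text, label in demonstrations:
    prompt_lines.append(f"Sentence: {text}\nLabel: {label}")
prompt_lines.append(f"Sentence: {new_input}\nLabel:")

prompt = "\n\n".join(prompt_lines)
print(prompt)  # the model is expected to continue with "happy"
```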

KR 2024: LLMs are Human-like Annotators

5

6 of 143

InstructGPT

KR 2024: LLMs are Human-like Annotators

6

1. Supervised fine-tuning (SFT)

2. Reward model (RM) training

3. RL via proximal policy optimization (PPO) on RM

7 of 143

ChatGPT and Prompting

  • Generate labelled training data
    • Cheaper: reduction in $ cost vs. UHRS crowd judges
    • Faster turnaround
    • Agility: big dev savings on HitApp creation and judge training
    • Quality: higher label quality
  • Directly use GPT models rather than training your own.
  • Prompt engineering: a good task description and examples matter; multiple prompts help.
  • Output length, output language, and output style can be controlled (see the labeling-prompt sketch below).
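As a concrete illustration of using prompting to generate labelled training data, here is a hedged sketch. `call_llm` is a placeholder for whichever GPT endpoint is used, and the label set, JSON format, and guideline text are illustrative rather than a prescribed template.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: route this to your LLM provider (e.g., a GPT chat endpoint)."""
    raise NotImplementedError

# A labeling prompt with a clear task description, label set, and output format,
# so the LLM's answers can be parsed programmatically into training data.
LABELING_PROMPT = """You are a data annotator.
Classify the sentiment of the review as exactly one of: positive, negative, neutral.
Return JSON of the form {{"label": "<label>", "reason": "<one sentence>"}}.

Review: {review}
"""

def annotate(review: str) -> dict:
    response = call_llm(LABELING_PROMPT.format(review=review))
    return json.loads(response)  # e.g., {"label": "positive", "reason": "..."}
```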

KR 2024: LLMs are Human-like Annotators

7

8 of 143

Summarization

KR 2024: LLMs are Human-like Annotators

8

Question Answering

9 of 143

Machine Translation

KR 2024: LLMs are Human-like Annotators

9

Ads Copywriting

10 of 143

Machine Reading Comprehension

KR 2024: LLMs are Human-like Annotators

10

11 of 143

Solving reasoning problems

KR 2024: LLMs are Human-like Annotators

11

12 of 143

Chain of thought (CoT) prompting

  • Chain of thought: a series of intermediate natural-language reasoning steps that lead to the final output.
  • It can be zero-shot or few-shot.
  • Each prompt exemplar is a triple: <input, chain of thought, output> (see the sketch below).

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).
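A minimal sketch of such a prompt, using the well-known tennis-ball exemplar from the CoT paper as the <input, chain of thought, output> triple; the final question is the one the model must answer with its own chain of thought.

```python
# Sketch of one few-shot CoT exemplar as an <input, chain of thought, output> triple,
# followed by the new question to be solved.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

new_question = (
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\nA:"
)

prompt = cot_exemplar + new_question
# The model is expected to produce a chain of thought ending in "The answer is 9."
print(prompt)
```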

KR 2024: LLMs are Human-like Annotators

12

13 of 143

What are the advantages of chain-of-thought prompting?

  • Allows models to decompose multi-step problems into intermediate steps
  • Improves interpretability: shows how the model might have arrived at a particular answer and helps debug where the reasoning path went wrong
  • CoT reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).

Prompting PaLM 540B with just 8 CoT exemplars achieves SOTA on GSM8K math word problems, surpassing even finetuned GPT-3 with a verifier.

KR 2024: LLMs are Human-like Annotators

13

14 of 143

CoT improves Commonsense Reasoning

  • 5 datasets
    • CSQA asks commonsense questions about the world involving complex semantics that often require prior knowledge.
    • StrategyQA involves inferring a multi-hop strategy to answer questions.
    • Date Understanding involves inferring a date from a given context
    • Sports Understanding involves determining whether a sentence relating to sports is plausible or implausible.
    • SayCan involves mapping a natural language instruction to a sequence of robot actions from a discrete set.
  • PaLM 540B with CoT achieved SOTA on StrategyQA (75.6% vs 69.4%) and outperforms an unaided sports enthusiast on sports understanding (95.4% vs 84%).

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. "Chain of thought prompting elicits reasoning in large language models." arXiv:2201.11903 (2022).

KR 2024: LLMs are Human-like Annotators

14

15 of 143

Loads of LLMs and SLMs

KR 2024: LLMs are Human-like Annotators

15

GPT-4o

OpenAI O1

16 of 143

Small language models

  • Models based on LLaMA and Falcon
    • LLaMA: open-source models
    • Falcon: clean web data at scale
    • LoRA: fast fine-tuning with low-rank adaptation: Alpaca
    • Vicuna: conversation tuning
    • Falcon Instruct, Alpaca: instruction tuning
    • Orca: explanation tuning
  • Models based on LLaMA 2
    • LLaMA 2: RLHF
    • Orca 2: cautious reasoning and progressive learning
  • Models based on Mistral
    • Mistral: sliding-window attention, pre-fill and chunking, rolling buffers
    • Mixtral: mixture of experts (MoE)
    • SOLAR: depth up-scaling, alignment tuning, DPO
  • The Phi series: clean, textbook-quality data

KR 2024: LLMs are Human-like Annotators

16

17 of 143

LLaMA 1

  • Mathematical reasoning
    • MATH: 12K middle- and high-school math problems in LaTeX.
    • GSM8K: middle-school math problems.
    • LLaMA-65B > Minerva 62B
      • Minerva: PaLM models finetuned on 38.5B tokens from arXiv and math web pages
  • Code generation
    • HumanEval and MBPP.
    • LLaMA > LaMDA and PaLM
  • Massive Multitask Language Understanding (MMLU)
    • MCQs on humanities, STEM, and social sciences.
    • Instruction-tuned LLaMA-I 65B leads to better results

KR 2024: LLMs are Human-like Annotators

17

18 of 143

LLaMA 2

  • A collection of pretrained and fine-tuned LLMs: 7B, 13B, 34B, and 70B parameters.
  • Fine-tuned LLMs: Llama 2-Chat.
  • Llama 2 models outperform open-source chat models on most benchmarks.
  • Released under a license permitting commercial use.

Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi and others. July 2023.

KR 2024: LLMs are Human-like Annotators

18

Helpfulness human evaluation on ~4k prompts

19 of 143

LLaMA 3

KR 2024: LLMs are Human-like Annotators

19

Meta Llama 3 Instruct model

Meta Llama 3 pretrained model

20 of 143

LLaMA 3.2

KR 2024: LLMs are Human-like Annotators

20

Vision models (Llama 3.2 11B and 90B vs. Claude 3 Haiku and GPT-4o mini):

| Category | Benchmark | Llama 3.2 11B | Llama 3.2 90B | Claude 3 Haiku | GPT-4o mini |
|---|---|---|---|---|---|
| Image: College-level Problems and Mathematical Reasoning | MMMU (val, 0-shot CoT, micro avg accuracy) | 50.7 | 60.3 | 50.2 | 59.4 |
| | MMMU-Pro, Standard (10 opts, test) | 33 | 45.2 | 27.3 | 42.3 |
| | MMMU-Pro, Vision (test) | 23.7 | 33.8 | 20.1 | 36.5 |
| | MathVista (testmini) | 51.5 | 57.3 | 46.4 | 56.7 |
| Image: Charts and Diagram Understanding | ChartQA (test, 0-shot CoT, relaxed accuracy) | 83.4 | 85.5 | 81.7 | - |
| | AI2 Diagram (test) | 91.1 | 92.3 | 86.7 | - |
| | DocVQA (test, ANLS) | 88.4 | 90.1 | 88.8 | - |
| Image: General VQA | VQAv2 (test) | 75.2 | 78.1 | - | - |
| Text: General | MMLU (0-shot, CoT) | 73 | 86 | 75.2 | 82 |
| Text: Math | MATH (0-shot, CoT) | 51.9 | 68 | 38.9 | 70.2 |
| Text: Reasoning | GPQA (0-shot, CoT) | 32.8 | 46.7 | 33.3 | 40.2 |
| Text: Multilingual | MGSM (0-shot, CoT) | 68.9 | 86.9 | 75.1 | 87 |

Lightweight text models (Llama 3.2 1B and 3B vs. Gemma 2 2B IT and Phi-3.5-mini IT):

| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5-mini IT |
|---|---|---|---|---|---|
| General | MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69 |
| | Open-rewrite eval (0-shot, rougeL) | 41.6 | 40.1 | 31.2 | 34.5 |
| | TLDR9+ (test, 1-shot, rougeL) | 16.8 | 19 | 13.9 | 12.8 |
| | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Tool Use | BFCL V2 | 25.7 | 67 | 27.4 | 58.4 |
| | Nexus | 13.5 | 34.3 | 21 | 26.1 |
| Math | GSM8K (8-shot, CoT) | 44.4 | 77.7 | 62.5 | 86.2 |
| | MATH (0-shot, CoT) | 30.6 | 48 | 23.8 | 44.2 |
| Reasoning | ARC Challenge (0-shot) | 59.4 | 78.6 | 76.7 | 87.4 |
| | GPQA (0-shot) | 27.2 | 32.8 | 27.5 | 31.9 |
| | Hellaswag (0-shot) | 41.2 | 69.8 | 61.1 | 81.4 |
| Long Context | InfiniteBench/En.MC (128k) | 38 | 63.3 | - | 39.2 |
| | InfiniteBench/En.QA (128k) | 20.3 | 19.8 | - | 11.3 |
| | NIH/Multi-needle | 75 | 84.7 | - | 52.7 |
| Multilingual | MGSM (0-shot, CoT) | 24.5 | 58.2 | 40.2 | 49.8 |

21 of 143

GPT-4

  • GPT-4 exhibits human-level performance on the majority of the professional and academic exams evaluated in the technical report.
  • It passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers.

GPT-4 Technical Report. OpenAI. https://cdn.openai.com/papers/gpt-4.pdf

KR 2024: LLMs are Human-like Annotators

21

22 of 143

Math word problems and reasoning QA

KR 2024: LLMs are Human-like Annotators

22

23 of 143

Math word problems and reasoning QA

KR 2024: LLMs are Human-like Annotators

23

24 of 143

Chart understanding and reasoning over data

KR 2024: LLMs are Human-like Annotators

24

Spot a data point that stands out in these charts and what that implicates. Then produce a detailed markdown table for all the data shown.

25 of 143

Image understanding and reasoning

KR 2024: LLMs are Human-like Annotators

25

Geometrical reasoning

26 of 143

Information seeking about objects

  • Prompt: Do you know the precise location where this image was taken?
  • Model Response: This image was taken in New York City. The Empire State Building is visible in the background. The street in the foreground is 8th Avenue. The cross street is West 34th Street.

KR 2024: LLMs are Human-like Annotators

26

Multimodal reasoning based on visual cues

27 of 143

Multimodal humor understanding

  • Prompt: Can you explain this meme?
  • Model Response: The meme is about a person who is playing a game at 300 FPS, but their monitor is only 75Hz. The person is trying to catch the game, but it is too fast for their monitor to display. The meme is funny because it is relatable to anyone who has ever played a game on a monitor that is not fast enough.

KR 2024: LLMs are Human-like Annotators

27

Commonsense reasoning in a multilingual setting

  • Prompt: Explain what i should call my grandparents on each side of my family based on this chart
  • Model Response: On your father’s side, you would call your grandparents Yéye and Nainai. On your mother’s side, you would call your grandparents Wàigong and Wàipó.

28 of 143

Reasoning and code generation

Create a web app called "Opossum Search":

1. Every time you make a search query, it should redirect you to a google search with the same query, but the word opossum before it.

2. It should be visually similar to Google search,

3. Instead of the google logo, it should have a picture of an opossum from the internet.

4. It should be a single html file, no separate js or css files.

5. It should say "Powered by google search" in the footer

KR 2024: LLMs are Human-like Annotators

28

29 of 143

Mathematics: Calculus

KR 2024: LLMs are Human-like Annotators

29

Video understanding and reasoning

  • Prompt (video)
    • How could this person improve their technique?
  • Model Response
    • This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg.

30 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

30

31 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

31

32 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

32

33 of 143

LLMs with CoT are Non-Causal Reasoners

  • CoT and Answer do not fully align.

KR 2024: LLMs are Human-like Annotators

33

  • ICL, SFT, and RLHF significantly impact the causal relations.

34 of 143

LLMs with CoT are Non-Causal Reasoners

  • Structural Causal Model
  • Larger LLMs tend to possess optimal SCM types
  • Instruction (Z): task outline, a few examples, question.
  • CoT (X): step-by-step reasoning process
  • Answer (Y)
  • Ideally, Y is fully determined by X.

KR 2024: LLMs are Human-like Annotators

34

35 of 143

Automatic Reasoning Chain Evaluation


KR 2024: LLMs are Human-like Annotators

35

36 of 143

Automatic Reasoning Chain Evaluation

  • GSM8K and Multistep-Arithmetics
    • SocREval (Socratic method): ask GPT-4 to generate a reference reasoning chain before evaluation; 1-shot exemplar.
    • Direct (trace): ask GPT-4 to directly evaluate a reasoning chain; 3-shot exemplars.
    • Direct (step): ask GPT-4 to check the reasoning step by step; 3-shot exemplars.
    • CoT (step): ask GPT-4 to generate a reasoning process before evaluating each reasoning step; 3-shot exemplars.

Common types of false positive reasoning chains detected by AutoRace

KR 2024: LLMs are Human-like Annotators

36

37 of 143

Dynamic Program Prompting and Program Distillation


KR 2024: LLMs are Human-like Annotators

37

38 of 143

Dynamic Program Prompting and Program Distillation

Example prediction and retrieved program samples

KR 2024: LLMs are Human-like Annotators

38

39 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

39

40 of 143

MoT: Memory-of-Thought

  • Finetuning is expensive.
  • Memory helps humans improve themselves in decision-making, reasoning, judgment, etc.
  • MoT can improve LLMs via pre-thinking and recalling.
  • Pre-thinking
    • The LLM pre-thinks on the unlabeled dataset
      • Few-Shot-CoT with a multiple-path decoding strategy
      • Each demo has a question, rationale, and answer.
      • Get multiple <rationale, answer> pairs and choose the majority-voted answer
    • Save the high-confidence thoughts as external memory
      • Answer entropy is used to filter out uncertain thoughts (see the sketch below)
      • Store <question, rationale, answer> as memory.
  • Recalling at test time
    • Given a test question, the LLM recalls relevant memory to help itself reason and answer it.
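A minimal sketch of the pre-thinking filter described above: sample several <rationale, answer> pairs per unlabeled question, keep the majority answer, and discard high-entropy (uncertain) cases. The `filter_thought` helper and the entropy threshold are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Entropy (in bits) of the sampled answer distribution; higher = more uncertain."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_thought(samples: list[tuple[str, str]], max_entropy: float = 1.0):
    """samples: sampled (rationale, answer) pairs for one unlabeled question.
    The max_entropy threshold is an assumed hyperparameter."""
    answers = [a for _, a in samples]
    if answer_entropy(answers) > max_entropy:
        return None                      # too uncertain: do not memorize this thought
    majority = Counter(answers).most_common(1)[0][0]
    rationale = next(r for r, a in samples if a == majority)
    return {"rationale": rationale, "answer": majority}

samples = [
    ("...so 5 + 6 = 11", "11"),
    ("...hence 11", "11"),
    ("...5 plus 6 is 11", "11"),
    ("...thus 12", "12"),
]
print(filter_thought(samples))  # kept: low-entropy majority answer "11"
```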

KR 2024: LLMs are Human-like Annotators

40

41 of 143

MoT: Memory-of-Thought

  • Recalling
    • Cluster memory items. Get the top-k semantically relevant memory candidates from each cluster using SBERT (see the sketch below).
    • Let the LLM choose the best memory candidates from each cluster.
    • Few-Shot-CoT with these memory items as extra context.
  • Baselines
    • MoT (no rationale): removes rationales from the retrieved memory, so the LLM directly outputs the answer
    • MoT (no thinking): keeps rationales in the retrieved memory but forces the LLM to answer directly, without CoT, at the recall stage.

MoT exceeds Few-Shot-CoT and Zero-Shot-CoT
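A sketch of the recall stage under stated assumptions (SBERT via the `sentence-transformers` package, k-means clustering); the memory items, model name, and cluster count are illustrative, not the paper's configuration.

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.cluster import KMeans

# Memorized <question, rationale, answer> triples saved during pre-thinking.
memory = [
    {"question": "If a pen costs $2, how much do 4 pens cost?",
     "rationale": "4 x 2 = 8.", "answer": "8"},
    {"question": "Tom had 10 marbles and lost 3. How many remain?",
     "rationale": "10 - 3 = 7.", "answer": "7"},
    {"question": "A train travels 60 km/h for 2 hours. What distance is covered?",
     "rationale": "60 x 2 = 120.", "answer": "120"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
mem_emb = encoder.encode([m["question"] for m in memory], convert_to_tensor=True)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(mem_emb.cpu().numpy())

def recall(test_question: str, k_per_cluster: int = 1):
    """Return the top-k most similar memory items from each cluster."""
    q_emb = encoder.encode(test_question, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, mem_emb)[0].cpu().numpy()
    picked = []
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        idx.sort(key=lambda i: sims[i], reverse=True)
        picked.extend(memory[i] for i in idx[:k_per_cluster])
    return picked  # used as few-shot CoT demonstrations for the test question

print(recall("A book costs $5; what do 3 books cost?"))
```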

KR 2024: LLMs are Human-like Annotators

41

42 of 143

Selective annotation and Prompt retrieval

  • Unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate (see the sketch below).
    • Build a graph with each sample as a node and SBERT similarity to define edges.
    • Choose high-degree nodes, with degree discounting for nodes close to already-chosen ones.
    • Then choose samples from each of 10 model-confidence buckets.
  • LLM: GPT-J with 6B parameters.
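A rough sketch of the graph-based selection idea (the first stage of vote-k): build a k-NN graph from embedding similarities and repeatedly pick the node with the highest discounted degree. The scoring constants and the random placeholder embeddings are assumptions, and the confidence-bucket stage is omitted.

```python
import numpy as np

def vote_k_select(sim: np.ndarray, budget: int, knn: int = 3, discount: float = 10.0):
    """Greedy selection of `budget` diverse, representative nodes from a similarity matrix."""
    n = sim.shape[0]
    # k-NN graph: each node's most similar neighbors (index 0 is the node itself).
    neighbors = [set(np.argsort(-sim[i])[1:knn + 1]) for i in range(n)]
    selected: list[int] = []
    for _ in range(budget):
        scores = []
        for v in range(n):
            if v in selected:
                scores.append(-np.inf)
                continue
            score = 0.0
            for u in range(n):
                if v in neighbors[u]:
                    # discount votes from nodes already "covered" by the selection
                    overlap = len(neighbors[u] & set(selected))
                    score += discount ** (-overlap)
            scores.append(score)
        selected.append(int(np.argmax(scores)))
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))                       # placeholder embeddings (e.g., SBERT)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(vote_k_select(emb @ emb.T, budget=5))           # indices of examples to annotate
```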

KR 2024: LLMs are Human-like Annotators

42

ICL perf over varying annotation budgets for HellaSwag commonsense reasoning. LLM=GPT-J

100 annotated examples

Example from HellaSwag CommonSense Reasoning

43 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

43

44 of 143

Analogical prompting

  • CoT needs labeled exemplars of the reasoning process.
  • To solve problems, humans think about related problems or high-level knowledge.
  • Analogical prompting
    • Prompts LLMs to self-generate relevant exemplars in context, before proceeding to solve the given problem.
    • Avoids the need for labeled exemplars and can tailor the generated exemplars and knowledge to each problem.

KR 2024: LLMs are Human-like Annotators

44

45 of 143

Analogical prompting methods

  • Self-generated exemplars
    • # Problem: [x]
    • # Relevant problems: Recall three relevant and distinct problems. For each problem, describe it and explain the solution.
    • # Solve the initial problem:
  • Self-generated knowledge + exemplars
    • # Tutorial: Identify core concepts in the problem and provide a tutorial.
    • Generating knowledge before exemplars yields superior results (see the prompt sketch below)

Big Bench reasoning tasks with GPT3.5-Turbo
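A sketch of what such a self-generation prompt can look like; the wording paraphrases the slide's template and the problem is illustrative.

```python
# Analogical-prompting style instruction: ask the model to produce its own
# tutorial and exemplars before solving the target problem. No labeled
# exemplars are required.

problem = "What is the area of a square with a diagonal of 10 units?"

prompt = f"""# Problem: {problem}

# Tutorial: Identify the core concepts in the problem and provide a tutorial.

# Relevant problems: Recall three relevant and distinct problems. For each,
# describe it and explain its solution.

# Solve the initial problem:
"""
print(prompt)  # send to the LLM as a single prompt
```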

KR 2024: LLMs are Human-like Annotators

45

46 of 143

LeanReasoner: Offloading reasoning to Lean

  • Lean: a theorem-proving framework used as a symbolic solver.
  • Offloading reasoning to Lean reduces the risk of logical inconsistencies.
  • SOTA performance on FOLIO and ProofWriter.
  • Fine-tuning on <100 in-domain samples for each dataset.

KR 2024: LLMs are Human-like Annotators

46

47 of 143

LeanReasoner: Offloading reasoning to Lean

  • Logical-reasoning contextual MCQ task: the input is a context paragraph, a question, and answer options.
  • A formalizer (LLM) converts the context and question into a formalized context and a formalized question (a minimal Lean example follows after this list).
  • Tactic Generator
    • ReProver model
      • Dense Passage Retriever to select premises given the current goal
      • A ByT5 generator that produces tactics from the goal and the retrieved premises.
  • Proof search
    • Build a proof tree that incrementally evolves the goal through tactic invocations.
    • Expand the goals where the generative model has the highest global confidence.
  • GPT-4 Base: few-shot examples include both formalization and proof generation in a single prompt.
  • GPT-4 Base Comments: augment these examples with line-by-line comments in the Lean code.
  • GPT-4 Base Separate: separate prompts for formalization and proof generation.
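For intuition, a minimal hypothetical example (not taken from the paper) of the kind of formalized context, question, and proof that Lean can check for a ProofWriter-style problem.

```lean
-- Hypothetical ProofWriter-style problem, formalized and proved in Lean 4.
-- Context: "All dogs are mammals."  "Rex is a dog."
-- Question: "Is Rex a mammal?"  The proof applies the rule to the fact.
theorem rex_is_mammal
    (Entity : Type) (Dog Mammal : Entity → Prop) (rex : Entity)
    (h_rule : ∀ x, Dog x → Mammal x)   -- formalized context: the rule
    (h_fact : Dog rex)                  -- formalized context: the fact
    : Mammal rex :=                     -- formalized question as a goal
  h_rule rex h_fact
```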

KR 2024: LLMs are Human-like Annotators

47

48 of 143

LeanReasoner: Offloading reasoning to Lean

Sample proofs created by LeanReasoner without pretraining (left), finetuned on Intuitive data (middle), and finetuned on Concise data (right).

KR 2024: LLMs are Human-like Annotators

48

49 of 143

Event Relation Logical Prediction

  • Coreference ∈ {Yes, No}
  • Temporal ∈ {NO_TEMPORAL, BEFORE, OVERLAP, CONTAINS, SIMULTANEOUS, ENDS-ON, BEGINS-ON}
  • Causal ∈ {NO_CAUSAL, PRECONDITION, CAUSE}
  • Subevent ∈ {Yes, No}
  • ChatGPT incorrectly predicts the temporal and causal relations between events “FIRE” and “collapsed” as “simultaneous” and “cause”.
  • 60% of the answers from ChatGPT on the MAVEN-ERE (event relation extraction) dataset are logically inconsistent.

KR 2024: LLMs are Human-like Annotators

49

50 of 143

Enabling LLMs for Event Relation Logical Prediction

  • 3 ways of logic acquisition
    • Generative-based approach
      • Encourages LLMs to generate the rationale themselves.
      • Uncertainty in the generated rationales may introduce biases, leading to an incorrect subsequent answer.
    • Retrieval-based approach
      • Collects constraints from realistic data.
      • Retrieves relevant contents and adds them to the LLM instruction.
      • Ensures correctness of the logic.
    • Finetuning-based approach
      • Constructs a high-order event relation logical prediction dataset (LLM-ERL).
      • Fine-tunes specialized LLMs.

KR 2024: LLMs are Human-like Annotators

50

51 of 143

Event Relation Logical Prediction Results

KR 2024: LLMs are Human-like Annotators

51

52 of 143

Generating Annotations for Reasoning Tasks using LLMs

  • Arithmetic Reasoning
  • Commonsense Reasoning
  • Logical Reasoning
  • Symbolic Reasoning

KR 2024: LLMs are Human-like Annotators

52

53 of 143

Symbolic reasoning for math word problems

  • CoT-based explanations are neither verifiable nor concise.
  • Symbolic evaluation
    • SVAMP → SVAMP-Sym
    • GPT-3’s davinci-002 achieves good zero-shot accuracy on symbolic problems (64.2%), comparable to the numeric accuracy of 68.9%
  • Self-prompting
    • Prompt the LLM with the numeric problem and its response to that problem, then ask it to solve the symbolic problem.

KR 2024: LLMs are Human-like Annotators

53

54 of 143

Symbolic reasoning for math word problems

  • “Raw”: evaluation on the SVAMP (SVAMP-Sym) dataset for numeric (symbolic) MWPs
  • (-F): output before the filtering step.
  • “SP” = self-prompting; “SP + AP” = two-stage self-prompting where an additional “Alignment Prompt” is added
  • Sim: similarity between the full responses for numeric and symbolic problems.
  • CoT is better; SP improves similarity; variables (i, j, k, l) give slightly worse accuracy; SP + AP is the best.

KR 2024: LLMs are Human-like Annotators

54

55 of 143

Symbolic Rule Learning for Robust Numerical Reasoning

  • Numerical reasoning for machine reading comprehension (RC) remains a difficult challenge.
  • ICL with LLMs to decompose complex questions into simpler sub-questions that are easier to answer with single-span RC models.
  • Symbolic learning methods like ILASP learn rules for recomposing the partial answers.
  • Benefits
    • Data efficiency: no training or fine-tuning.
    • The neuro-symbolic approach yields robust numerical reasoning.
    • Interpretable and verifiable reasoning traces.

KR 2024: LLMs are Human-like Annotators

55

56 of 143

Symbolic Rule Learning for Robust Numerical Reasoning

  • Few-shot (3-shot) examples
  • 300 examples from the DROP training set, annotated with CoT reasoning traces and question decompositions.
  • 2 approaches to select few-shot examples for each test question (see the sketch below)
    • KNN of the complex test question in the embedding space of the annotated questions, using sentence embeddings
    • Define a canonical set of examples from the annotated questions for each type, and transform the task into type prediction using Alpaca 7B.
      • Types: addition, subtraction, and negation
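A sketch of the KNN-based selection (the first approach above), assuming a `sentence-transformers` encoder; the annotated questions, decompositions, and model name are illustrative placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# Embed the annotated questions once, then pick the nearest annotated questions
# to each complex test question as its few-shot decomposition examples.
annotated = [
    {"question": "How many more yards was the longest than the shortest field goal?",
     "decomposition": "longest field goal? ; shortest field goal? ; subtract"},
    {"question": "How many total points were scored in the first half?",
     "decomposition": "points in Q1? ; points in Q2? ; add"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
bank = encoder.encode([a["question"] for a in annotated], convert_to_tensor=True)

def select_exemplars(test_question: str, k: int = 1):
    q = encoder.encode(test_question, convert_to_tensor=True)
    scores = util.cos_sim(q, bank)[0]
    top = scores.argsort(descending=True)[:k]
    return [annotated[int(i)] for i in top]

print(select_exemplars("How many more touchdowns did the Jets score than the Bills?"))
```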

KR 2024: LLMs are Human-like Annotators

56

57 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

57

58 of 143

KR 2024: LLMs are Human-like Annotators

58

What is reasoning?

  • Reasoning is the ability to make inferences using evidence and logic.
  • Reasoning can be divided into multiple skills, such as commonsense, mathematical, and symbolic reasoning.
  • Often, reasoning involves deductions from inference chains, called multi-step reasoning.

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Do language models truly understand and apply common sense reasoning?

59 of 143

KR 2024: LLMs are Human-like Annotators

59

Reasoning Problems

Augmented Language Models: a Survey (Mialon et al., 2023)

  • Hard Language Tasks: require multiple steps of reasoning to solve

60 of 143

KR 2024: LLMs are Human-like Annotators

60

Multi-step reasoning is often seen as a weakness in language models

Towards Reasoning in Large Language Models: A Survey (Huang et al., 2023)

  • It is unclear to what extent LLMs are capable of reasoning

Earlier research elicited reasoning in small language models through fully supervised finetuning on specific datasets

  • Creating a dataset containing explicit reasoning can be difficult and time-consuming
  • Training on a specific dataset limits application to a specific domain

Reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters

61 of 143

KR 2024: LLMs are Human-like Annotators

61

Reasoning and Commonsense Benchmarks

Source: https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond#different-types-of-llm-benchmarks

  • How can we assess reasoning abilities in language models?
  1. TruthfulQA — Truthfulness
  2. MMLU — Language understanding
  3. HellaSwag — Commonsense reasoning
  4. BIG-Bench Hard — Challenging reasoning tasks
  5. HumanEval — Coding challenges
  6. CodeXGLUE — Programming tasks
  7. Chatbot Arena — Human-ranked ELO-based benchmark
  8. MT Bench — Complex conversational ability

62 of 143

KR 2024: LLMs are Human-like Annotators

62

How is reasoning measured (in the literature)?

GPT-4 Technical Report (OpenAI).

  • While SOTA models at the time struggled to score above 50% with pre-training, GPT-4 achieved a record-high 95.3% on HellaSwag with just 10-shot prompting in 2023.

63 of 143

KR 2024: LLMs are Human-like Annotators

63

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

Prompt: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

A model without CoT may answer: 11 apples (incorrect; the correct answer is 10).

  • Keep in mind that the authors claim chain-of-thought reasoning is an emergent ability that arises with sufficiently large language models.

64 of 143

KR 2024: LLMs are Human-like Annotators

64

Chain of thought prompting: Arithmetic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Does not positively impact performance for small models
  • Few-shot CoT achieves better performance than zero-shot CoT on LLMs.
  • 8 manually composed exemplars
  • All contain equations with flexible formats
  • Benchmarked on:
    • GSM8K (Cobbe et al., 2021)
    • SVAMP (Patel et al., 2021)
    • MAWPS (Koncel-Kedziorski et al., 2016)

65 of 143

KR 2024: LLMs are Human-like Annotators

65

Chain of thought prompting and Self consistency

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Chain-of-thought prompting improves performance on a range of arithmetic reasoning tasks

66 of 143

KR 2024: LLMs are Human-like Annotators

66

Chain of thought prompting: Symbolic Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • Standard prompting fails out-of-domain tests for both tasks
  • Zero-shot CoT with InstructGPT-3 175B achieves performance similar to few-shot CoT with the 540B PaLM model on both tasks

67 of 143

KR 2024: LLMs are Human-like Annotators

67

Chain of thought prompting: Commonsense Reasoning

Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022)

  • CoT shows minimal benefits on the CSQA and StrategyQA tasks
  • Few-shot CoT achieves better performance than zero-shot CoT with the 175B GPT-3 model on CSQA and StrategyQA, but zero-shot CoT shows a significant improvement on the Date Understanding task.

68 of 143

KR 2024: LLMs are Human-like Annotators

68

More Advances: Self consistency

Self-consistency improves chain of thought reasoning in language models. (Wang et al., 2022)

  • Replace greedy decoding (a single reasoning path) with self-consistency (sample multiple paths and take the majority-voted answer) in few-shot CoT (see the sketch below)
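A minimal sketch of the aggregation step: the completions below stand in for multiple CoT paths sampled at non-zero temperature, and the answer-extraction regex is an assumption about the output format.

```python
import re
from collections import Counter

# Illustrative stand-ins for several sampled CoT completions to the same prompt.
sampled_completions = [
    "There are 2 cans of 3 balls, so 6 balls. 5 + 6 = 11. The answer is 11.",
    "5 balls plus 6 new balls gives 11. The answer is 11.",
    "5 + 2 * 3 = 12. The answer is 12.",
]

def final_answer(completion: str) -> str | None:
    """Extract the final numeric answer from a completion (assumed format)."""
    match = re.search(r"The answer is\s*(-?\d+)", completion)
    return match.group(1) if match else None

votes = Counter(a for a in map(final_answer, sampled_completions) if a is not None)
print(votes.most_common(1)[0][0])  # majority-voted answer: "11"
```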

69 of 143

KR 2024: LLMs are Human-like Annotators

69

STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning

STaR: Self-Taught Reasoner. (Zelikman et al., 2022)

  • Self-Taught Reasoner (STaR) iteratively improves a model’s ability to generate rationales to solve problems.
  • Participants were 74% more likely to prefer the STaR-generated rationales over the human-generated rationales

70 of 143

KR 2024: LLMs are Human-like Annotators

70

Program-aided Language Models (PAL)

PAL: Program-aided Language Models (Gao et al., 2023)

  • PAL enhances problem-solving by generating code to represent the intermediate reasoning steps (in contrast to CoT prompting, which reasons in natural language); see the sketch below.
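A minimal sketch of the idea: `generated_code` stands in for the program an LLM would emit for a math word problem, and the Python interpreter, not the model, computes the final answer.

```python
# The LLM writes Python for the intermediate reasoning steps; the interpreter
# executes it, so the arithmetic is offloaded from the model.
generated_code = """
# Roger starts with 5 tennis balls.
tennis_balls = 5
# He buys 2 cans of 3 tennis balls each.
bought_balls = 2 * 3
answer = tennis_balls + bought_balls
"""

namespace: dict = {}
exec(generated_code, namespace)   # run the model-written program
print(namespace["answer"])        # 11
```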

71 of 143

KR 2024: LLMs are Human-like Annotators

71

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA models exhibit remarkable improvements over previous state-of-the-art approaches across all scales

72 of 143

KR 2024: LLMs are Human-like Annotators

72

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA utilizes GPT-4 to synthesize high-quality tool-use trajectories on mathematical reasoning datasets.

73 of 143

KR 2024: LLMs are Human-like Annotators

73

Tool-Integrated Reasoning (TORA)

ToRA: A Tool-Integrated Reasoning Agent (Gou et al., 2024)

  • ToRA provides valuable insights for future research, contributing to the development of more advanced and versatile reasoning agents.

74 of 143

KR 2024: LLMs are Human-like Annotators

74

Plan-and-Solve Prompting

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (Wang et al., 2023)

  1. Zero-shot CoT prompting
  2. Plan-and-Solve (PS) prompting
  3. Answer-extraction prompting
  • Zero-shot PS+ prompting consistently outperforms zero-shot CoT across all reasoning datasets by a large margin (see the prompt sketch below)
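A sketch of the zero-shot PS prompt and the answer-extraction step; the trigger wording is paraphrased from the paper and the question is illustrative.

```python
# Plan-and-Solve style zero-shot prompting: devise a plan, carry it out,
# then extract the final answer with a follow-up prompt.
question = "In a class of 30 students, 60% passed. How many students failed?"

ps_prompt = (
    f"Q: {question}\n"
    "A: Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)

answer_extraction_prompt = "Therefore, the answer (arabic numerals) is"
# 1) send ps_prompt to the LLM, 2) append its reasoning plus the extraction
#    prompt, 3) read off the final numeric answer (here: 12).
print(ps_prompt)
```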

75 of 143

KR 2024: LLMs are Human-like Annotators

75

Can we use LLMs to benchmark reasoning datasets?

  • With models like GPT-4, it's now possible to synthetically produce datasets that are more comprehensive and diverse than human-labeled ones, in far less time

76 of 143

KR 2024: LLMs are Human-like Annotators

76

Reasoning datasets: CriticBench

  • CRITICBENCH is designed to assess the two key aspects of LLMs’ critical reasoning:
    • critique
    • correction

77 of 143

KR 2024: LLMs are Human-like Annotators

77

Reasoning datasets: Question collection on CriticBench

Question collection:

  • Randomly sample questions from existing datasets

Response collection from LLMs:

  • Employ various LLMs, including GPT-3.5, GPT-4, LLaMA-2 (7B, 13B, and 70B variants), and Vicuna (7B, 13B, and 33B variants)
  • Each LLM generates a response for each question and provides its reasoning

Response annotation:

  • Response correctness:
    • rule-based matching, followed by a more detailed evaluation using GPT-4 and humans

Domains:

  • 15 datasets spanning 5 domains: mathematical, commonsense, symbolic, coding, and algorithmic.

78 of 143

KR 2024: LLMs are Human-like Annotators

78

Reasoning datasets: Evaluation process on CriticBench

79 of 143

KR 2024: LLMs are Human-like Annotators

79

Reasoning datasets: Annotation example of CriticBench

  • GPT-4 evaluation is closer to human evaluation

80 of 143

KR 2024: LLMs are Human-like Annotators

80

Reasoning datasets: Key Factors in Critical Reasoning

  • LLMs struggle more with incorrect answers in detail-oriented tasks, such as algorithmic tasks, compared to logic-centric tasks such as code generation

81 of 143

KR 2024: LLMs are Human-like Annotators

81

Reasoning datasets: Average performance on CriticBench

  • The knowledge acquired by LLMs is not entirely consistent across generation, critique, and correction tasks.

82 of 143

KR 2024: LLMs are Human-like Annotators

82

Reasoning datasets: Consistency of GQC Knowledge

  • Analysis across different task types found that models perform better at critique (Q) and correction (C) for tasks focused on logic than for those requiring attention to detail.

83 of 143

KR 2024: LLMs are Human-like Annotators

83

Human preference benchmarks with reasoning tasks

  • How do we construct a benchmark that consists of real-world multi-turn conversations?
  • How do we automatically evaluate the model’s responses?

Why human preference benchmarks?

  • Existing NLP benchmarks fail to accurately evaluate LLM chat assistants
  • Knowledge benchmarks (e.g., MMLU)
    • Not open-ended
  • Instruction benchmarks (e.g., Super-Natural Instructions)
    • Only one turn
    • Require human annotation effort
  • Conversational benchmarks (e.g., CoQA)
    • Too easy for state-of-the-art LLMs

84 of 143

KR 2024: LLMs are Human-like Annotators

84

Human preference benchmarks: LLMs as judges

  • LLM judges provide not only scores but also explanations, offering interpretable outputs and valuable insights.

Four popular benchmarks

  • MT-Bench
  • Chatbot Arena
  • MT-Bench-101
  • HumanEval

85 of 143

KR 2024: LLMs are Human-like Annotators

85

LLM-as-a-judge

  • The agreement between GPT-4 judges and humans reaches over 80%, the same level as agreement among humans (see the judge-prompt sketch below).
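A hedged sketch of a pairwise judge prompt in the spirit of MT-Bench; the wording is illustrative, not the benchmark's exact template.

```python
# Pairwise LLM-as-a-judge prompt: the judge sees the question and two answers
# and must pick one; positions are swapped across two calls to reduce position bias.
JUDGE_TEMPLATE = """You are an impartial judge. Given the user question and two
assistant answers, decide which answer is better. Consider helpfulness,
relevance, accuracy, and level of detail. Do not let answer length or the
order of presentation influence you.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Output exactly one of: "A", "B", or "tie", followed by a one-sentence reason."""

prompt = JUDGE_TEMPLATE.format(
    question="Explain why the sky is blue.",
    answer_a="Rayleigh scattering: shorter wavelengths scatter more strongly.",
    answer_b="Because the ocean reflects onto the sky.",
)
print(prompt)  # send once as (A, B) and once as (B, A), then reconcile verdicts
```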

86 of 143

KR 2024: LLMs are Human-like Annotators

86

Limitations of LLM-as-a-judge

Position bias

  • LLMs favor answers in the first position

Verbosity bias

  • LLMs favor longer answers

Self-appreciation bias

  • LLMs favor their own answers, or answers similar to their own

Limited reasoning ability

  • LLMs fail to judge hard math/reasoning/code questions
  • It is crucial to acknowledge the limitations of LLM-as-a-judge, such as its inability to detect hallucinations and penalize LLM-generated answers accordingly, and potential errors when grading math/reasoning questions.

87 of 143

KR 2024: LLMs are Human-like Annotators

87

Human preference benchmarks: MT-Bench-101

  • The agreement between GPT-4 and human expert evaluations reached 87%, using the paper’s designed evaluation approach.

88 of 143

KR 2024: LLMs are Human-like Annotators

88

MT-Bench-101: Hierarchical Ability Taxonomy

  • GPT-4 was used to construct the data, as it is the most powerful model for multi-turn dialogues.

13 tasks

3-level abilities

89 of 143

KR 2024: LLMs are Human-like Annotators

89

MT-Bench-101: Model’s performance

  • Closed-source models consistently exhibit superior performance compared to open-source counterparts across all evaluated tasks.
  • Content confusion and format rephrasing are relatively less difficult, while the mathematical reasoning task is the most challenging

90 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

90

91 of 143

KR 2024: LLMs are Human-like Annotators

91

Why Focus on Evaluation

Source: https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/

92 of 143

KR 2024: LLMs are Human-like Annotators

92

LLM Evaluation vs. Human Evaluation

93 of 143

KR 2024: LLMs are Human-like Annotators

93

How to scale “human evaluation”?

  • Chatbot Arena lets users manually vote for the better response; to scale this up, multiple LLMs can be leveraged to autonomously determine which response stands out.

94 of 143

KR 2024: LLMs are Human-like Annotators

94

LLM Evaluation

[Figure] An LLM benchmark aggregates multiple evaluation configurations, each defined by shots, instruction, task, dataset, and metric; the model under evaluation is specified by its system prompt, system format, and hyperparameters.

95 of 143

KR 2024: LLMs are Human-like Annotators

95

LLM Evaluation

Task:

  • What concrete problem do we want to address (e.g., classification, summarization, commonsense reasoning)?

Dataset:

  • Which dataset do we want to use?

Metric:

  • How do we evaluate performance?

Instruction:

  • Evaluate the model’s ability to perform an unseen task given context in the form of instructions

Shot:

  • Whether the model is given prior examples at inference time (few-shot) or none (zero-shot)

96 of 143

KR 2024: LLMs are Human-like Annotators

96

LLM Evaluation

Model configuration: system prompt, system format, hyperparameters. Example chat format:

<SYS> You are a helpful model </SYS>

<instruction> Translate this sentence to French

<user> I like pizza

<assistant> J'aime la pizza

97 of 143

KR 2024: LLMs are Human-like Annotators

97

LLM Evaluation: Alpaca

  • GPT-4 based evaluation of chatbot output agrees well with human evaluation; known before
  • Surprisingly, judge Llama 2 (70B) can compete with judge GPT-4 (1760B). But, Judge Llama 2 is more biased (more ties) and less capable of following the instructions carefully.

98 of 143

KR 2024: LLMs are Human-like Annotators

98

LLM Evaluation: G-Eval

  • The LLM consistently gives higher scores to GPT-3.5 summaries, even though human judges prefer human-written summaries

G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)

  • Can we rely on LLMs to assess LLM outputs?
  • Do LLMs show a bias towards the outputs they generate during evaluations?

99 of 143

KR 2024: LLMs are Human-like Annotators

99

LLM Evaluation: GPT-Score

https://github.com/confident-ai/deepeval

100 of 143

KR 2024: LLMs are Human-like Annotators

100

Language model-written evaluations

https://github.com/confident-ai/deepeval

  • Generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size.
  • Overall, LM-written evaluations are high quality and quickly discover many novel LM behaviors.

101 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

101

102 of 143

Generate a synthetic dataset using LLMs

102

AutoLabel:

  • The LabelingAgent provides a method to generate synthetic datasets
  • Supports synthetic dataset generation for classification and entity-matching tasks

Prodigy:

  • Supports LLM-assisted annotation: you provide a prompt in order to annotate an NLP task.

Labelbox:

  • Easily curate and annotate data
  • Generate high-quality human feedback data for computer vision and language models; evaluate and improve model performance

LLM-data-annotation:

  • Combines human expertise with the efficiency of Large Language Models (LLMs) like OpenAI’s GPT-3.5 to simplify dataset annotation and model improvement.

103 of 143

AutoLabel

103

  • How do we use the Autolabel tool to annotate a task?

104 of 143

AutoLabel: Question Answering

104

  • First step: specify a labeling configuration

105 of 143

AutoLabel: Question Answering

105

  • Second step: do a dry run on the test dataset, using the LLM specified in config.json, by running agent.plan

106 of 143

AutoLabel: Question Answering

106

  • Final step: run the labeling with agent.run (see the end-to-end sketch below)
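Putting the three steps together, a hedged end-to-end sketch: the API usage follows the Autolabel README as recalled here, and the config keys, CSV file name, and guideline text are placeholders that should be checked against the current Autolabel docs.

```python
from autolabel import LabelingAgent, AutolabelDataset

# Config fields are recalled from the Autolabel docs and may need adjusting to
# the current schema; provider/model and the CSV path are placeholders.
config = {
    "task_name": "SquadQuestionAnswering",
    "task_type": "question_answering",
    "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
    "prompt": {
        "task_guidelines": "Answer the question using only the provided context.",
        "example_template": "Context: {context}\nQuestion: {question}\nAnswer: {answer}",
    },
}

agent = LabelingAgent(config)                       # step 1: labeling configuration
dataset = AutolabelDataset("squad_val.csv", config=config)
agent.plan(dataset)                                 # step 2: dry run (prompt/cost preview)
labeled = agent.run(dataset)                        # step 3: label with the configured LLM
```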

107 of 143

LLMs can label data as well as humans, but 100x faster

107

  • Refuel LLM (84.2%) outperforms trained human annotators (80.4%), GPT-3.5-turbo (81.3%), PaLM-2 (82.3%), and Claude (79.3%) across a benchmark of 15 text-labeling datasets.

108 of 143

LLMs can label data: Quality Evaluation

108

  • Refuel LLM outperforms human annotators and all closed-source LLMs, excluding GPT-4

109 of 143

LLMs can label data: Quality Evaluation

109

  • RefuelLLM-2 (83.82%) outperforms all current state-of-the-art LLMs for data labeling and enrichment, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%) and Gemini-1.5-Pro (74.59%)

110 of 143

LLMs can label data: Quality Evaluation

110

  • Refuel-LLM-2 is competitive or superior in terms of output quality, compared to current state-of-the-art LLMs

111 of 143

Prodigy

111

https://demo.prodi.gy/?=null&view_id=ner_manual

What Prodigy isn’t:

  • “Software as a service” – it’s a tool that you can download, install, and run yourself

Usage:

  • Annotate NLP, audio, and vision tasks

112 of 143

AutoLabel tools: Which one is better?

112

  • Autolabel currently supports chain-of-thought prompting, making it well suited for reasoning tasks.
  • Labeling tasks span categories such as classification, entity resolution, matching, reading comprehension, reasoning, and information extraction

113 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]

  • Reasoning Datasets with LLM Annotations [25 mins]
  • Evaluation of LLM Generated Annotations [25 mins]

  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]

KR 2024: LLMs are Human-like Annotators

113

114 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

KR 2024: LLMs are Human-like Annotators

114

115 of 143

Hallucination

115

  • How did hallucination evolve in LLMs?
  • In the context of language models: the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.

116 of 143

Evolution of Hallucination in LLMs

116

  • LLMs simply generate text without knowing the specific details of the visit's purpose.

117 of 143

Evolution of Hallucination in LLMs

117

  • Despite their advanced capabilities, MLLMs, like all LLMs, are prone to hallucination

118 of 143

Evolution of Hallucination in LLMs

118

  • The problem of hallucination also exists in other foundation models, such as image, video, and audio models

119 of 143

Taxonomy of Hallucinations

119

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Like their text-only LLM counterparts, video- and audio-based language models are also prone to hallucinations.

120 of 143

Taxonomy of Hallucinations: Causes

120

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

121 of 143

Taxonomy of Hallucinations: Metrics and Benchmarks

121

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

122 of 143

Taxonomy of Hallucinations: Mitigation

122

Hallucination of Multimodal Large Language Models: A Survey (Bai et al., 2024)

123 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]

  • Reasoning Datasets with LLM Annotations [25 mins]
  • Evaluation of LLM Generated Annotations [25 mins]

  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

KR 2024: LLMs are Human-like Annotators

123

124 of 143

Hallucination Types

124

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Hallucination can occur in any NLG task, necessitating a thorough examination based on the fundamental principles of text generation from a given prompt.

125 of 143

Hallucination Types: Orientation, Category and Degree

125

126 of 143

Hallucination Types: Orientation

126

Factual Mirage:

  • The LLM engages in hallucination or distortion of a given prompt that is factually correct.

Intrinsic

Extrinsic

  • The generated text appears to be true or real but is actually misleading or false.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

127 of 143

Hallucination Types: Orientation

127

Silver Lining:

  • The LLM indulges in hallucination by conjuring an elaborate and captivating narrative based on a given prompt that is factually incorrect.

Intrinsic

Extrinsic

  • The LLM produces unfaithful content or content inconsistent with the provided source prompt.

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

128 of 143

Hallucination Types: Category

128

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Time Wrap: the LLM generates text that exhibits a mashed fusion of events from different timelines.

129 of 143

Hallucination Types: Category

129

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Geographic Erratum: the LLM generates an incorrect location for an event.

130 of 143

Hallucination Types: Category

130

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Virtual Voice: the LLM generates quotations attributed to fictional or real characters without sufficient evidence to verify the authenticity of such statements.

131 of 143

Hallucination Types: Category

131

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Generated Golem: the LLM fabricates an imaginary personality in relation to a past event, without concrete evidence.

132 of 143

Hallucination Types: Category

132

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Numeric Nuisance: the LLM generates numeric values related to past events, such as dates, ages, or monetary amounts, that are inconsistent with the actual facts.

133 of 143

Hallucination Types: Category

133

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Acronym Ambiguity: the LLM generates an imprecise expansion for an acronym.

134 of 143

Hallucination Types: Degree

134

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Alarming: adds information that bears a radical dissemblance from the topic fed via the prompt.
  • Moderate: introduces facts that are either fictitious or tangential to the topic at hand.
  • Mild: minor hallucination that is superficial in terms of its impact.
  • How can we detect and evaluate LLMs’ hallucinations across different types?

135 of 143

Hallucination Detection: SelfCheckGPT

135

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)

  • SelfCheckGPT can effectively detect hallucinated sentences, with higher accuracy than several baseline methods (see the sketch below)
  • SelfCheckGPT’s prompting variant achieved the highest performance in detecting non-factual sentences
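A crude sketch of the consistency idea, using sentence similarity as a proxy (the paper's variants use BERTScore, QA, NLI, or prompting): a sentence in the main response that is poorly supported by independently sampled responses is flagged as a likely hallucination. The model name, sentences, and threshold are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

main_response_sentences = [
    "The Eiffel Tower is located in Paris.",
    "It was completed in 1975 by Gustave Eiffel's grandson.",
]
# Additional stochastic samples of the same answer from the same LLM.
sampled_responses = [
    "The Eiffel Tower stands in Paris and opened in 1889.",
    "Gustave Eiffel's company finished the tower in Paris in 1889.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sent_emb = encoder.encode(main_response_sentences, convert_to_tensor=True)
samp_emb = encoder.encode(sampled_responses, convert_to_tensor=True)

# Best support for each main-response sentence across the samples.
support = util.cos_sim(sent_emb, samp_emb).max(dim=1).values
for sentence, score in zip(main_response_sentences, support.tolist()):
    flag = "likely hallucinated" if score < 0.6 else "supported"  # assumed threshold
    print(f"{score:.2f}  {flag}: {sentence}")
```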

136 of 143

Hallucination Detection: FACTScore

136

"Factscore: Fine-grained atomic evaluation of factual precision in long form text generation (Sewon et.al, 2023)

  • GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.

137 of 143

HILT: HallucInation eLiciTation dataset

137

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • 15 LLMs were selected and used to generate a total of 75,000 text passages, with each LLM producing 5,000 text prose entries.
  • The text prompts provided to these LLMs consisted of tweets from NYTimes and headlines sourced from the Politifact dataset.

138 of 143

Hallucination Vulnerability Index (HVI)

138

A Survey of Hallucination in “Large” Foundation Models (Rawte et al., 2023)

  • Larger LLMs without RLHF are prone to both orientations of hallucination
  • For smaller LLMs like T5, Dolly, etc., the Generated Golem, Virtual Voice, and Geographic Erratum categories of hallucination are rarely observed.

139 of 143

Agenda

  • Introduction to LLMs and a Recap of Their Capabilities [25 mins]
  • Generating Annotations for Reasoning Tasks using LLMs [25 mins]
  • Reasoning Datasets with LLM Annotations [25 mins]
  • Tea break [30 min]
  • Evaluation of LLM Generated Annotations [25 mins]
  • Autolabel Tools to Label Reasoning Datasets [25 mins]
  • Overcoming the Hallucinations in LLM Annotations and Future Trends [25 mins]
    • Taxonomy of hallucinations
    • Hallucination detection
    • Methods to mitigate hallucination

KR 2024: LLMs are Human-like Annotators

139

140 of 143

Hallucination Mitigation

140

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation (Varshney et al., 2023)

  • Leveraging external knowledge to validate correctness: RAG
  • Modifying the decoding strategy
  • Sampling multiple outputs and checking their consistency: SelfCheckGPT
  • The proposed active detection and mitigation approach reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average

141 of 143

Hallucination Mitigation: Chain-Of-Verification (CoVe)

141

Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023)

  • CoVe reduces inaccuracies in LLMs’ responses by verifying facts through structured questioning (see the sketch below).
  • Significant reduction in factual hallucinations.
  • Final responses are more reliable and factually accurate.
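A hedged sketch of the CoVe stages; `call_llm` is a placeholder for an LLM API, and the prompt wording is illustrative rather than the paper's exact templates.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: route this to your LLM provider."""
    raise NotImplementedError

# CoVe-style loop: draft -> plan verification questions -> answer them
# independently -> produce a verified final response.
def chain_of_verification(question: str) -> str:
    draft = call_llm(f"Answer the question.\nQuestion: {question}\nAnswer:")

    plan = call_llm(
        "List short fact-checking questions that would verify the answer below.\n"
        f"Question: {question}\nDraft answer: {draft}\nVerification questions:"
    )

    # Answer each verification question on its own, without showing the draft,
    # so the checks are not biased toward repeating its mistakes.
    checks = [
        f"Q: {q}\nA: {call_llm(q)}"
        for q in plan.splitlines() if q.strip()
    ]

    return call_llm(
        "Revise the draft so it is consistent with the verification results.\n"
        f"Question: {question}\nDraft: {draft}\n"
        "Verification results:\n" + "\n".join(checks) + "\nFinal answer:"
    )
```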

142 of 143

Is hallucination always bad?

142

https://www.washingtonpost.com/opinions/2023/12/27/artificial-intelligence-hallucinations/

  • Numerals: Models have been shown to hallucinate a lot while generating numerals, such as dates, quantities, and scalars.
  • Long Text: Models often tend to self-contradict while generating the output.
  • Reasoning: Misunderstanding facts/information present in the source text can lead to hallucinations and errors.
  • When Contextual Knowledge Conflicts with the Parametric Knowledge: Models have been shown to prioritize the parametric knowledge (acquired during pre-training) over the contextual knowledge which leads to hallucinations.

143 of 143

A big thank you!

  • Tutorial, Code and Material

KR 2024: LLMs are Human-like Annotators

143