1 of 148

Labelling with LLM and Human-in-the-Loop

Ekaterina Artemova, Akim Tsvigun, Dominik Schlechtweg

Natalia Fedorova, Sergei Tilga, Konstantin Chernyshev, Boris Obmoroshev

1

TALK TO US

TUTORIAL PAGE

2 of 148

About us

2

Natalia Fedorova

Toloka Partnership Manager, Toloka

Boris Obmoroshev

Toloka AI R&D and Analytics Director, Toloka

Sergei Tilga

Head of R&D, Toloka

Ekaterina Artemova

Machine Learning Researcher, Toloka

Konstantin Chernyshev

Machine Learning Researcher, Toloka

Akim Tsvigun

Natural Language Processing Lead @ Nebius AI, University of Amsterdam

Dominik Schlechtweg

Research Group Lead, Universität Stuttgart

3 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

3

4 of 148

Introduction

4

5 of 148

Data is the new oil

Clive Humby

5

6 of 148

Data needs in NLP

Raw data

  • Is needed for unsupervised learning tasks, such as LLM pre-training
  • Teaches LLMs to generate plausible language
  • Is usually collected by scraping web sources

This is not the focus of this tutorial!

Labelled data

  • Is needed for tasks like sentiment analysis, named entity recognition, and syntactic parsing
  • Is needed for specific applications (conversational AI, machine translation) and specific domains (medicine, law, finance, etc)
  • Is manually annotated by humans

6

7 of 148

Many applications require complex annotation

7

8 of 148

Data annotation pipeline

8

01 Frame the target task

Problem formulation: Does this text pose a risk of harm?

02 Conceptualize the problem

The annotation schema defines the labels, how they should be applied, and how complex cases should be treated.

03 Instruct annotators

Annotation guidelines are provided to the annotators to label raw data and to guide annotation decisions.

04 Quality control

Choose an aggregation rule, control inter-annotator agreement, train a baseline model.

9 of 148

Labelling with humans: Example

  • Problem formulation: detect emotions in tweets
  • Annotation schema: six emotions (love, anger, fear, sadness, surprise, joy)
  • Instruction:

Please read each text carefully and classify it based on the dominant emotion it expresses: love, anger, fear, sadness, surprise, or joy. If the tweet does not clearly convey one of these emotions, or if it is purely factual or neutral, mark it as neutral.

  • Quality control: overlap of 5 annotations, majority voting, and control tasks

9

Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.

i was ready to meet mom in the airport and feel her ever supportive arms around me

love

anger

fear

sadness

surprise

joy

neutral

10 of 148

Labelling with humans: Budget estimation

  • Let X be the cost of annotating a single text.
  • The target dataset includes 1,000 texts plus an additional 10% for control tasks.
  • The total budget is calculated as (1,000 + 100) ⋅ 5 ⋅ X = 5,500 ⋅ X (see the sketch below).
  • X is determined based on the hourly payment rate and the average number of texts that can be annotated per hour.
  • The time spent on labelling depends on the complexity of the task and the size of the target dataset.
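As a back-of-the-envelope check, the same budget arithmetic can be scripted. This is a minimal sketch; the hourly rate and annotator throughput below are placeholder assumptions, not figures from the tutorial.

```python
# Budget estimation for human labelling (illustrative numbers only).
dataset_size = 1_000          # target texts
control_share = 0.10          # +10% for control tasks
overlap = 5                   # annotations per text

hourly_rate = 6.0             # assumed payment per hour, USD (placeholder)
texts_per_hour = 60           # assumed annotator throughput (placeholder)

cost_per_text = hourly_rate / texts_per_hour          # this is X
total_texts = dataset_size * (1 + control_share)      # 1,100 texts
total_annotations = total_texts * overlap             # 5,500 annotations
total_budget = total_annotations * cost_per_text

print(f"X = ${cost_per_text:.3f} per annotation")
print(f"{total_annotations:.0f} annotations, budget = ${total_budget:,.2f}")
```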

10

Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.

i was ready to meet mom in the airport and feel her ever supportive arms around me

love

anger

fear

sadness

surprise

joy

neutral

11 of 148

Labelling with an LLM

  • We can leverage the ability of LLMs for zero-shot classification by prompting them to predict labels for unlabeled text.
  • The cost of an LLM run depends on the type of access (API or self-hosting).
  • The time spent on labeling depends on the inference speed.
  • Depending on the computational setup, the cost of labeling 1,100 tweets can be below $5.

11

Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.

Given the following text, classify the emotion expressed in the message into one of the following categories: anger, fear, joy, love, sadness, or surprise. If the text does not express a clear emotion, classify it as neutral.

Text: i was ready to meet mom in the airport and feel her ever supportive arms around me

Emotion: <Select the appropriate emotion category>

love
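A minimal sketch of how this zero-shot labelling prompt could be sent to a model, assuming the OpenAI Python client and a placeholder model name; any hosted or self-hosted chat API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["anger", "fear", "joy", "love", "sadness", "surprise", "neutral"]

def label_tweet(text: str) -> str:
    """Ask the LLM for a single emotion label for one tweet."""
    prompt = (
        "Given the following text, classify the emotion expressed in the message "
        "into one of the following categories: anger, fear, joy, love, sadness, or "
        "surprise. If the text does not express a clear emotion, classify it as "
        "neutral. Answer with the label only.\n\n"
        f"Text: {text}\nEmotion:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"

print(label_tweet("i was ready to meet mom in the airport and feel her ever supportive arms around me"))
```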

12 of 148

LLMs do the work, humans fix the oops

Labeling with LLMs

  • Reduces cost and time spent on annotations
  • Can make mistakes and struggle with complex or subjective tasks
  • Unclear when to trust LLM predictions

Solutions

  • Synthetic Data: Generate synthetic data instead of labeling raw data
  • Active Learning: Select the most informative instances for human labeling
  • Hybrid Approaches: LLMs handle simple instances, humans handle complex ones

12

The slide title is generated with ChatGPT

13 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

13

14 of 148

Synthetic data generation

14

15 of 148

Synthetic data generation

15

A Powerful and Thought-Provoking Masterpiece

Oppenheimer is a stunning portrayal of J. Robert Oppenheimer's life and the creation of the atomic bomb. Cillian Murphy delivers a brilliant performance, capturing the complexity of the man behind history's most pivotal invention. Christopher Nolan's direction, paired with breathtaking visuals and a haunting score, makes every moment gripping.

The film masterfully balances science, ethics, and personal struggles, leaving you reflecting long after the credits roll. With a stellar cast and unforgettable storytelling, Oppenheimer is a must-watch for anyone who loves powerful, impactful cinema.

Suppose you are writing movie reviews for the IMDB platform. Write a positive review of the film Oppenheimer.

16 of 148

Synthetic data generation

The main idea is to generate synthetic data in a way that mimics the characteristics and features of real-world data.

16

Benefits

  • Lower costs for data management and analysis
  • Faster data collection turnaround time
  • Fewer security issues

Limitations

  • Difficulties in generating accurate and diverse data
  • Need for validation procedures
  • The quality of trained classifiers may suffer

17 of 148

Requirements for synthetic data

  • Accurate labels: Labels must be correct and consistent to ensure reliable model training
  • Matching domain: Data should mimic the target domain in language, style, and content
  • Diversity: Variation is essential to prevent overfitting and enhance model robustness

17

18 of 148

LLMs for synthetic data generation

Zero-shot Generation

  • Prompt an LLM with a task description and a label
  • No examples provided; LLM relies on pre-trained knowledge

18

Generate a sentence that conveys a {label} sentiment.

19 of 148

LLMs for synthetic data generation

Few-shot Generation

  • Prompt an LLM with a task description, a label, and several data instances
  • Directs LLM behavior using sample instances

19

You are given a data entry consisting of a sentence and its sentiment label. Sentence: {sentence}, Label: {label}. Generate a similar data entry and output both a sentence and a label.

20 of 148

LLMs for synthetic data generation

Hierarchical Generation

  • Prompt an LLM to generate key points
  • Create detailed prompts incorporating these key points
  • Use crafted prompts to generate specific content
  • All steps can be zero- or few-shot

20

Generate a topic that can be used to create sentences with a distinct {label} sentiment. The topic should be broad enough.

Given the topic {topic}, generate a sentence that conveys a {label} sentiment.
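A sketch of the two-step hierarchical recipe above. The client, model name, and prompt wiring are illustrative assumptions; any completion API could stand in for the helper.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; any LLM backend works

def complete(prompt: str) -> str:
    """One LLM completion; the model name is a placeholder."""
    out = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content.strip()

def generate_hierarchical(label: str, n_topics: int = 10) -> list[dict]:
    """Step 1: generate broad topics; Step 2: generate one sentence per topic."""
    examples = []
    for _ in range(n_topics):
        topic = complete(
            f"Generate a topic that can be used to create sentences with a distinct "
            f"{label} sentiment. The topic should be broad enough."
        )
        sentence = complete(
            f"Given the topic {topic}, generate a sentence that conveys a {label} sentiment."
        )
        examples.append({"text": sentence, "label": label, "topic": topic})
    return examples
```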

21 of 148

Validation techniques

  • Train-real-test-synthetic: Train a classifier on human-labeled data and evaluate its accuracy on the generated dataset
  • Train-synthetic-test-real: Train a classifier on synthetic data and evaluate its accuracy on a human-labeled test dataset (see the sketch below)
  • Statistical checks: Examine whether the label distribution, diversity, and stylistic features of the generated dataset match those of the real dataset
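A minimal train-synthetic-test-real check with scikit-learn. The tiny text lists are toy placeholders standing in for the LLM-generated training set and the human-labeled test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice these are the generated set and a real held-out set.
synthetic_texts = ["the movie was wonderful", "worst film I have seen",
                   "an absolute delight", "boring and far too long"]
synthetic_labels = ["positive", "negative", "positive", "negative"]
real_test_texts = ["I loved every minute", "I want my money back"]
real_test_labels = ["positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(synthetic_texts, synthetic_labels)           # train on synthetic data
pred = clf.predict(real_test_texts)                  # evaluate on real, human-labeled data
print("train-synthetic-test-real accuracy:", accuracy_score(real_test_labels, pred))
```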

21

22 of 148

Case study 1: How does task subjectivity affect performance?

Key Findings:

• Subjectivity negatively affects model performance on synthetic data

• Performance gap increases for highly subjective tasks

• Models perform better on less subjective instances within tasks

22

23 of 148

Case study 1: How does task subjectivity affect performance?

Methodology:

• Used GPT-3.5-Turbo for synthetic data generation

• Explored zero-shot and few-shot generation settings

• Evaluated across 10 text classification tasks

• Conducted crowdsourced studies to determine task-level subjectivity

23

24 of 148

Case study 2: How does increasing diversity of synthetic data affect performance?

Key findings:

  • Diversification approaches can increase data diversity but often at the cost of data accuracy
  • Balancing diversity and accuracy in synthetic data generation is crucial for creating high-quality datasets
  • Human-LLM loop can facilitate high diversity and accuracy in LLM-based text data generation

24

25 of 148

Case study 2: How does increasing diversity of synthetic data affect performance?

Diversification approaches:

  • Logit suppression: Decreases probability of sampling frequently generated tokens.
  • Temperature sampling: Flattens token sampling probability distribution.
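A small sketch of what these two knobs do at sampling time, assuming raw next-token logits in a NumPy array. Real implementations usually apply this inside the decoding loop (for example via frequency penalties or logit bias); this is only an illustration of the idea.

```python
import numpy as np

def sample_token(logits, generated_counts, temperature=1.0, suppression=0.0):
    """Sample a token id after logit suppression and temperature scaling."""
    # Logit suppression: push down tokens that have already been generated often.
    adjusted = logits - suppression * generated_counts
    # Temperature sampling: >1 flattens the distribution (more diversity), <1 sharpens it.
    scaled = adjusted / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```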

Human interventions:

  • Label replacement (LR): Correcting misaligned labels improved model accuracy by ~15% when used with diversification techniques.
  • Out-of-scope filtering (OOSF): Removing irrelevant instances showed limited utility in improving accuracy.

25

26 of 148

Case study 3: To what extent can synthetic data replace human annotation?

Key findings:

  • Using only synthetic data leads to worse performance than the same amount of real data.
  • A small amount of human-labeled data significantly outperforms large quantities of synthetic data in terms of performance gain and cost-effectiveness.
  • Labeling hundreds of data points requires a lower budget than generating an order of magnitude more synthetic data.

26

Ashok, Dhananjay, and Jonathan May. "A Little Human Data Goes A Long Way." arXiv preprint arXiv:2410.13098 (2024).

27 of 148

Case study 3: To what extent can synthetic data replace human annotation?

Methodology

  • Gradually increase the amount of synthetic data in fixed-size training sets and test model performance.
  • Use performance prediction methods to estimate the volume of synthetic data that provides a comparable gain to 200 human-labeled data points.
    • Fit a regression model: performance ~ dataset size.
  • At some volume, the costs of compute usage become comparable to the cost of human labeling

27

Ashok, Dhananjay, and Jonathan May. "A Little Human Data Goes A Long Way." arXiv preprint arXiv:2410.13098 (2024).

28 of 148

Understanding bias in synthetic data

Synthetic data can perpetuate, amplify, and introduce biases due to several factors

  • Label-conditioned generations can introduce bias and reduce the difficulty of the synthetic data entry
  • Generation models can replicate existing biases from the pre-training data.
  • Synthetic data may reproduce biases found in human data used for fine-tuning or in few-shot setups

28

Chim, Jenny, Julia Ive, and Maria Liakata. "Evaluating Synthetic Data Generation from User Generated Text." Computational Linguistics (2024): 1-44.

29 of 148

Understanding bias in synthetic data

Synthetic data can perpetuate, amplify, and introduce biases due to several factors:

  • Uneven quality in synthetic subsets, such as poorly generated code-switched or vernacular texts, can impact linguistic minorities.
  • Neglecting underrepresented groups in data generation can lead to biased outputs that mainly reflect majority perspectives.

29

Chim, Jenny, Julia Ive, and Maria Liakata. "Evaluating Synthetic Data Generation from User Generated Text." Computational Linguistics (2024): 1-44.

30 of 148

Conclusion: Synthetic data for text classification

Effectiveness

  • Useful in objective tasks (e.g., news topic classification)
  • Limitations in highly subjective tasks (e.g., humor detection)

Diversity Impact

  • Increased diversity generally improves classifier performance
  • Diversity-label accuracy trade-off exists

Best Practices

  • Use a hierarchical approach to LLM prompting
  • Think about sources of bias in synthetic data
  • Combine with human labelling for optimal results

30

31 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

31

32 of 148

Active learning with LMs

32

33 of 148

Active learning with LMs

  • Introduction to active learning (AL)
  • Active learning strategies in text classification
  • Generative active learning
  • Human-based vs LLM-based annotation in active learning
  • Techniques for efficient active learning

33

34 of 148

Introduction to active learning

  • AL reduces the required number of annotated inputs to reach the desired quality of an LM
  • AL methods select for annotation instances that are the most informative or uncertain for the model
  • In some NLP tasks, training a model on ⅙ of the dataset, selected with AL, reaches 99% of the quality of a model trained on the entire dataset

34

Settles, B.: "Active Learning". Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2012.

35 of 148

How to select texts for annotation?

  • How to measure the amount of “new information” in the instance?
  • Imagine a classification task with two labels: ‘positive’ and ‘negative’
  • We have two labeled examples:
    • “I like this place” has a ‘positive’ label
    • “I can say I’m disappointed” has a ‘negative’ label
  • Which text is more important to annotate: “I can’t say I’m disappointed” or "It was really exciting!”?
  • First text cosine similarities with already annotated texts: 0.71, 0.84*
  • Second text cosine similarities with already annotated texts: 0.71, 0.62*
  • The first text is more “similar” to already annotated texts, but its label is more vague

35

* Using the BGE-ICL model, one of the SOTA text embedding models
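These similarities can be reproduced with any sentence-embedding model. A sketch with the sentence-transformers library is shown below; the checkpoint id is a placeholder assumption standing in for BGE-ICL.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # placeholder for the BGE-ICL model

labeled = ["I like this place", "I can say I'm disappointed"]
candidates = ["I can't say I'm disappointed", "It was really exciting!"]

emb_labeled = model.encode(labeled, normalize_embeddings=True)
emb_candidates = model.encode(candidates, normalize_embeddings=True)

# Rows: candidate texts, columns: already annotated texts.
print(util.cos_sim(emb_candidates, emb_labeled))
```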

36 of 148

AL strategies in text classification

  • AL strategies differ in how they calculate the “informativeness” of the input text
  • Key Considerations when choosing an AL strategy for your case:
    • Model uncertainty
    • Diversity of selected examples
    • Computational efficiency
    • Size of the batch for annotation
    • Balance between exploration and exploitation

36

37 of 148

AL strategies in text classification

  • Least Confidence (LC, ↓): measures the probability of the most probable label predicted by the model.

  • Breaking Ties (BT, ↓): measures the difference between the probabilities of the two most probable labels.

  • Contrastive Active Learning (CAL, ↑): measures the divergence of the predictive likelihoods of input texts from their neighbors in the training set.

  • Batch Active Learning by Diverse Gradient Embeddings (BADGE, ↑): measures informativeness as the gradient magnitude with respect to the model parameters in the output layer.

37

↑: higher values stand for higher priority, ↓: lower values stand for higher priority
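A minimal sketch of the first two strategies over a batch of softmax outputs; `probs` is assumed to be an (n_samples, n_classes) array of model probabilities.

```python
import numpy as np

def least_confidence(probs: np.ndarray) -> np.ndarray:
    """LC score: probability of the top label (lower = more informative)."""
    return probs.max(axis=1)

def breaking_ties(probs: np.ndarray) -> np.ndarray:
    """BT score: margin between the two most probable labels (lower = more informative)."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def select_for_annotation(probs: np.ndarray, k: int, strategy=breaking_ties) -> np.ndarray:
    """Return the indices of the k highest-priority instances (lowest scores first)."""
    return np.argsort(strategy(probs))[:k]
```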

38 of 148

Generative active learning

  • Traditional AL is limited to selecting from existing unlabeled data
  • Sometimes, the existing data is scarce and doesn’t cover many potential inputs
  • Generative AL leverages the power of modern LLMs to create synthetic examples
  • It involves a three-stage process:
    1. Generate synthetic examples using LLMs (e.g., GPT-4, Deepseek-v3)
    2. Apply AL strategies to select the most informative synthetic examples
    3. Present selected examples for human annotation

38

39 of 148

Generative active learning

39

Advantages

  • Explores input space beyond the original dataset
  • Addresses rare or missing cases
  • Targets specific areas where the model needs improvement
  • Is particularly valuable when real-world examples are scarce or expensive to obtain

Challenges

  • Relies heavily on the generative LLM capabilities
  • Produces unrealistic or poor examples with weaker LLMs
  • Struggles with specialized domains
  • May generate examples that do not reflect the real-world distribution
  • Requires extra computational cost for the generation step

40 of 148

Case study: How much cost can active learning save?

Key Findings:

  • Active learning can save up to 90% of annotation costs compared to random sampling

  • State-of-the-art AL strategies like BT and CAL consistently outperform random sampling

40

41 of 148

Case study: How much cost can active learning save?

Methodology:

  • Evaluated across 5 text classification tasks
  • Active Learning Loop:
    • Query 25 instances per iteration
    • Oracle-provided labels
    • Retrain model with accumulated labeled data on each iteration
  • Used BERT and DistilRoBERTa models

41

42 of 148

Conclusion: Active learning

Effectiveness

  • Drastically reduced annotation budgets

Generalization

  • Data acquired through AL does not depend much on the LM used
  • AL can be applied to various NLP tasks, including text- and token-level classification and natural language generation

Best Practices

  • Choose an appropriate AL strategy and avoid biased data sampling
  • Implement iterative cycles: train -> sample -> annotate -> train -> sample -> annotate …
  • Think about the cold-start problem: which samples to use at the beginning

42

43 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

43

44 of 148

Quality control in human labeling

44

45 of 148

What is quality control in human annotation?

  • What is quality?
  • How to measure quality?
  • How to control quality?

45

46 of 148

What is quality control in human annotation?

  • What is quality? When a task does not have one correct answer, it becomes difficult to define quality.
  • How to measure quality? Quality is determined based on the specific task and its requirements.
  • How to control quality? We will discuss this topic throughout today's presentation.

46

47 of 148

Management triangle

47

QUALITY

COST

TIME

GOOD + QUICK = EXPENSIVE

GOOD + CHEAP = SLOW

QUICK + CHEAP = POOR QUALITY

48 of 148

Methods of Quality control

Before task performance

  • Selection of annotators
  • Well-designed instruction
  • Onboarding and exams
  • Choosing the team

Within task performance

  • Technical improvements of a platform
  • Control tasks
  • Motivation (e.g. performance-based pricing)
  • Tricks to remove bots and cheaters (e.g. quick answers)

After task performance

  • Data acceptance and working with data
  • Feedback loops

48

49 of 148

Before task performance

49

50 of 148

Selection of annotators

  • Sourcing and selection of the annotator supply
  • Choose the right criteria (e.g. education, languages, country, etc.)
    • To select the proper specialization
    • To control the quality of execution on your tasks
    • To get the annotators with the best quality on past projects

50

51 of 148

Onboarding and exams

  • An onboarding task is similar to a production task but includes comments that guide annotators through the understanding of the task.
  • An exam is a test that annotators need to pass before starting production tasks.
  • Exam tasks evaluate the education level of annotators.

51

52 of 148

Team roles

To achieve the best quality in labeling, it is better to create a cross-functional team.

Team-leads/Subject experts:

  • Assist in formulating clear instructions for annotators, providing feedback on the guidelines, clarifying challenging aspects, conducting initial trial tasks, and creating illustrative examples.
  • They understand the specifics of the task better than others.
  • Oversee the quality manual and provide feedback to annotators.
  • Participate in hiring new annotators.

52

53 of 148

Team roles

The solution engineer is a technical lead in charge of setting up the pipeline, implementing automation tools, and using quality control instruments.

The supply manager is in charge of finding relevant annotators and communicating with them. The supply manager is also responsible for time- and cost-efficient labeling.

Annotators/subject matter experts are in charge of doing the task, providing feedback, and following the instructions.

53

54 of 148

Pipeline setup

  • Every complex task can be split into small parts with additional checks and post-verifications.
  • One of the instruments for controlling quality is designing and simplifying the pipeline.
  • Pipeline design can help optimize speed and cost.
  • LLM checks can be implemented at different steps of the pipeline.

54

55 of 148

Within task performance

55

56 of 148

  1. Technological methods

56

57 of 148

ML methods for human labeling

  • Smart matching techniques between tasks (domain, complexity level, time, format, etc.) and annotators (skill set, performance, preferences, etc.).
  • Smart matching can be a set of simple heuristics; it works better at large scale and can increase quality and speed by 10-30%, depending on the task.
  • ML-based inspirational seed generation.
  • ML-based auto checks and assistants.
  • Appeals and feedback loops for our experts.
  • Automatic deduplication and quality metrics.

57

58 of 148

ML methods for human labeling: Co-pilot features increase productivity of experts

  • Real-time RAG-based fact check
  • Prompt coverage check
  • AI-generated content check
  • Grammar check

58

59 of 148

Anti-fraud rules

Fraud prevention is built into the data pipeline from start to finish to guarantee authentic human effort and expertise:

  • Control of response speed
  • Checking the trajectory of the cursor
  • Checks via CAPTCHA
  • Identity verification
  • Check whether a link has been visited
  • Check whether a video has been played

59

60 of 148

2. Operational methods

60

61 of 148

Control tasks / honeypots

Tasks with a known correct answer, shown to performers to evaluate their performance.

  • The distribution of answers in control tasks should match the distribution in the whole set of tasks
  • But they should contain rare answer variants with higher frequency
  • Refresh your set of control tasks regularly to avoid bots and cheating
  • Automatic control task generation via annotators
  • Tasks with answers of high confidence (e.g. aggregation of answers from a large number of annotators)

61

62 of 148

Motivation in human annotator work

  • Bonuses for a good quality within a period
  • Gamification (e.g. achievements, leaderboards, etc.)
  • Price depending on performance

62

63 of 148

After task performance

63

64 of 148

Data acceptance and working with data

  • Data quality metrics that correlate with model performance gains give confidence in the training data
  • Audit by annotation or domain experts
  • Aggregation tools for general crowd tasks
  • ML-based assessment of dataset quality, including LLM-checks

64

65 of 148

Feedback loops

  • Feedback is an essential part of every communication process
  • Providing and receiving feedback improves quality drastically
  • Feedback can come from subject matter experts, the task manager, or the customer.

65

66 of 148

Summary

  • To achieve the best quality, you first need to define what "best quality" means.
  • For complex labeling tasks, you need a team of experts who can combine different skill sets and organize the process effectively.
  • Implementing quality control at various stages of the data labeling process will improve performance, cost efficiency, and speed.
  • As data labeling scales, more advanced mechanisms for quality control become necessary.

66

67 of 148

Managing human workers

67

68 of 148

Overview

  • Best practices in annotation guidelines
  • Psychological characteristics of doing annotation tasks
  • Inter-annotator agreement
  • Communication with the annotators

68

69 of 148

Best practices in annotation guidelines

  • Clear instructions without ambiguities
  • Generalization and explaining the goal of the task
  • Balancing detail and conciseness
  • Including tests and interactive parts in the instructions
  • Many “corner” examples that demonstrate different cases
  • Using different formats (videos, webinars, meetings)

69

70 of 148

Psychological characteristics of doing annotation tasks

70

71 of 148

Framing effect

71

The size of the red dot “varies” depending on the context. Even when we're aware of the illusion, we can't help but perceive it. It is an illustration of the power of framing.

72 of 148

Framing effect

72

The size of the red dot “varies” depending on the context. Even when we're aware of the illusion, we can't help but perceive it. It is an illustration of the power of framing.

73 of 148

Attention

  • Attention is a limited-capacity resource distributed among cognitive processes.
  • This means that people tend to get tired more quickly when they pay greater attention.
  • The less attention a task gets, the lower the quality.
  • New tasks and changes in the instructions require more attention.

73

74 of 148

Thinking

  • Information perception is automated, which means that people tend to think according to their stereotypes.
  • People are resistant to perceiving new information.
  • Relearning is much harder than learning on the first attempt.

74

75 of 148

Memory

  • “Working memory” has limited capacity and holds 7±2 items
  • Combine different sections of the instructions into a single, cohesive contextual block.
  • Try to separate the explanation of the task from minor details.

75

76 of 148

Inter-annotator agreement

People tend to perceive things subjectively, even professionals.

How to improve inter-annotator agreement?

  • Clear annotation guidelines
  • Few-shot examples
  • Training and calibration
  • Measure agreement metrics
  • Iterative feedback and revisions
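Agreement can be measured directly, for example with Cohen's kappa from scikit-learn; the label lists below are made-up placeholders for two annotators.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels from two annotators for the same items.
annotator_a = ["joy", "anger", "neutral", "joy", "fear"]
annotator_b = ["joy", "anger", "joy", "joy", "fear"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```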

76

77 of 148

Communication with the annotators

Effective communication is a crucial part of every team's success.

  • Establish a dedicated channel for consistent communication with annotators.
  • Make the process interactive to engage them actively.
  • Implement feedback loops to ensure continuous improvement.
  • Organize communication streams by separating announcements from responses to general questions.
  • Consolidate all updates into the instructions for easy reference.

77

78 of 148

Summary

  • Working with people comes with many limitations that must be considered. Unlike machines, people are complex and not always easy to approach.
  • Effective communication is crucial for success in labeling tasks.
  • Structuring and organizing information can significantly improve understanding of the task.
  • Investing in thorough onboarding processes leads to better performance and overall outcomes.

78

79 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

79

80 of 148

QA session 1

80

81 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

81

82 of 148

Talk to us

Tutorial web page

Toloka Research Fellowship program

82

83 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

83

84 of 148

Hybrid pipelines

84

85 of 148

Introduction: Problem Types

We consider two primary labeling directions:

  1. Classification: Assigning discrete labels / preferences / judgements.
  2. Generation: Creating new text / prompts / replies.

Tutorial Focus:

  • Primarily on classification / preference tasks with discrete labels / multi-label;
  • LLMs are used to pre-assign or validate labels;
  • Generation is also interconnected and might use similar approaches.

85

86 of 148

Introduction: Roadmap

  1. Automatic Labeling (Prompts, Fine-Tuning)
  2. Confidence Estimation Techniques
  3. Aggregation Strategies (Thresholding, Overlap, etc.)
  4. Balancing Quality vs. Automation
  5. Hybrid Pipelines & Continuous Improvement
  6. Advanced Considerations (Calibration, Infrastructure, Cost Modeling)
  7. Real-World Case Studies & Lessons Learned

86

87 of 148

Example

87

  1. Clarity: Both responses clearly explain the process of resetting a password. However, Response B breaks down the steps more comprehensively.
  2. Detail: Response A gives a high-level overview, while Response B provides actionable details like entering the email address, checking the inbox, and following the link.
  3. Helpfulness: Response B is more helpful because it anticipates potential confusion and offers a step-by-step guide.

Preferred Response

Response B

Evaluate the provided responses to determine which is more helpful to the user based on the given query. Explain your reasoning using a chain-of-thought approach; compare the level of detail, clarity, and helpfulness of each response.

Query: “How do I reset my password?”

  • Response A: “To reset your password, click on ‘Forgot Password’ on the login page and follow the instructions.”
  • Response B: “Go to the login page, click on ‘Forgot Password,’ enter your email address, check your inbox for the reset link, and follow the steps in the email.”

88 of 148

Introduction: Why and What?

Problem:

  • Human annotators offer the highest quality but are costly at large scale and harder to scale.
  • LLMs provide cheaper labeling but can introduce errors or biases.

Goal:

  • Combine the strengths of LLMs (speed, coverage) with human expertise (accuracy, domain knowledge).
  • Achieve higher throughput, lower costs, and maintain flexible quality control.

General Hybrid Setup: Apply auto-labeling first, then refine with human annotators; but the exact setup can vary.

88


89 of 148

Introduction: General Hybrid Setup

  1. Model First Pass (Auto-Labeling):
    • The LLM predicts labels.
    • Confidences are estimated.
  2. Human-in-the-Loop:
    • Items with low confidence or high risk are routed to human annotators.
    • Potential re-check or overlap for final verification.
  3. Dynamic Component (Optional, but worth it):
    • Model re-training with new samples.
    • Adjust cutoff points as needed to maintain the desired quality vs. cost.
    • etc.

89

[Diagram: LLM pass → estimate confidence (e.g. is confidence > threshold?) → directly accept / accept with overlap / human labeling and overlap → update models]

90 of 148

Introduction: Tasks / Modalities

Task Types (classification):

  • Preference labeling: e.g. ranking or rating system outputs for user satisfaction.
  • Toxicity or content moderation: multi-class or multi-label classification.
  • Topic detection: discrete labeling of text segments into categories.
  • Temporal: describe, discuss or compare time-sensitive events.

Modalities:

  • Text-only
  • Visual-input / Vision-Language Models (VLLM) / Visual-output
  • Audio
  • Agents (it is a lot here)

90

91 of 148

Auto labeling: How to measure quality

We need to measure “quality” of the pipeline.

There are some ways to define it:

  • Downstream performance change
  • Label quality (mislabelled samples on honeypots)
  • Inter-annotator-agreement

Tl;dr: Have a held-out golden set (labeled by trusted experts / with overlap)

91

92 of 148

Auto labeling: Prompt Engineering Basics

Simplest method. Take a ready-made LLM, add a few examples and hope for the best.

Tricks to increase performance:

  • Tune the main prompt / instructions.
  • Few-Shot Prompts: Provide small curated examples for guidance.
  • Chain-of-Thought: Encourage explicit reasoning steps (or use o1-like models);
  • (Optional) Structured Format: Request outputs in JSON or with delimiters. BUT may reduce quality.

92

Prompt: "Your task is … Here are criteria … Here are examples … Think step-by-step."

Response: "I will evaluate … Let's think step-by-step. Criteria 1, Criteria 2 … The final answer is X."

93 of 148

Auto labeling: Prompt Engineering

Pros:

  • Minimal setup; broad knowledge ‘baked’ in the LLM.
  • Quick adaptation with few-shot prompts.
  • Scalability, easy to add a new domain or a few examples;

Cons:

  • LLMs make mistakes :)
  • Susceptible to hallucinations, unpredictable outputs.
  • Biases inherent in the LLM might propagate to labels (e.g. a preference for longer texts).

93

94 of 148

Auto labeling: Prompt Engineering Cost

Commercial vs. Open-Source LLMs:

  • Proprietary (e.g., GPT) might yield higher quality but at a higher inference cost and with license restrictions.
  • Open-source (e.g., Qwen, LLaMA) can be cheaper to run locally or with a provider, but may require specialized fine-tuning; some come with an open license.

GPT: 5k samples × $0.06 per sample = $300
Qwen 2.5 72B: 0.5-1.0h run time for 5k samples × $10/h for H100 × 2 GPUs = $20-$40

Trade-offs:

  • Higher auto-labeling quality with fewer human checks → higher inference cost.
  • Using more flexible open-source LLMs → infrastructure maintenance and fine-tuning.

94

95 of 148

Auto labeling: Finetuning Basics

Take a ready-made (or base) LLM and a set of golden examples (labeled by trusted experts) and fine-tune the desired classifier.

Tips and Tricks:

  • LoRA: Lightweight fine-tuning. Good for domain-specific improvements.
  • Always keep a held-out golden set for measuring accuracy, F1, etc.
  • Monitor overfitting and generalization carefully; check for data leaks.

95

96 of 148

Auto labeling: Finetuning

Method 1: Tune LLM with LM SFT

  • Pros: Straightforward approach; uses existing base LLM.
  • Cons: Prompt sensitivity, potential difficulty for multi-label tasks.

Method 2: Adding a New Classification Head

  • Pros: Potentially more accurate for classification; can handle multi-label.
  • Cons: Requires more data for from-scratch head training.

96

[Diagram: Method 1 - input → LLM layers (+LoRA) → LM head (+LoRA) → output; Method 2 - input → LLM layers (+LoRA) → classification head → output]
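A sketch of Method 2 (classification head plus LoRA adapters) with Hugging Face transformers and peft; the base checkpoint, target modules, and hyperparameters are illustrative assumptions, not the exact setup used in this tutorial.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)

# New classification head on top of the LLM backbone.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

# LoRA adapters on the attention projections; the new head stays fully trainable.
lora = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training itself can then be run with the standard Trainer or a custom loop.
```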

97 of 148

Auto labeling: Increasing Efficiency Post-tuning

  • Faster inference: pruning, quantization, distillation, etc.
  • Quality increase: ensembles, majority voting over models/temperatures.
  • Optimized infrastructure: GPU utilization, caching KV/predictions.
  • Continuous improvement: retrain when data drifts or new classes appear.

Qwen 2.5 72B fine-tuning: 2h of training × $10/h for H100 × 2 GPUs = $40
BUT data for fine-tuning: 1k samples × 10 min per sample × $60/h expert time = $10k

97

98 of 148

Auto labeling: Problem

  • Insufficient quality: often not ready for production without human checks.
  • LLM limitations: performance capped by model biases/quality (e.g., the Qwen model used).
  • Upper bound: even the best-LLM case may not match multi-human consensus.
  • Prompt sensitivity: minor wording changes can skew results.
  • Real-world experience: that's it ;)

Most of these problems can be mitigated with careful human labeling on top.

98

99 of 148

Confidence Estimation: Why It’s Crucial

Key Concept: Confidence directs the labeling workflow

  • High-confidence model outputs can be accepted automatically.
  • Low-confidence outputs go to human annotators.

Benefits:

  • Offload tedious work to the LLM and avoid unnecessary human effort.
  • Improves cost-effectiveness;
  • Reduces LLM error risks by focusing human attention on critical cases.
  • Key to making “hybrid” work seamlessly.

99

100 of 148

Confidence Estimation: In Text Generation

Token-Level Probabilities:

  • Summation or average of log probabilities across tokens.
  • Challenging to calibrate; context length can skew probabilities.

Multiple Prompts / Variance Check:

  • Generate multiple candidate outputs to measure disagreement.

Calibration Challenges:

  • Log-probs often are not well-calibrated; Post-hoc scaling might be required.
  • Systematic biases in LLMs (e.g. option order, long context).
  • Log probabilities require interpretation; e.g. a multiple-choice token does not necessarily equal the class.

100

S = <assistant/> The best response is A
(the prompt is the context; the generated answer "A" is the prediction)

P(S) = P(<assistant/> | prompt) ᐧ … ᐧ P(A | prompt <assistant/> The best response is)
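A minimal sketch of turning per-token log-probabilities (as returned by most APIs via a logprobs option) into a sequence-level confidence score; the token values below are made-up placeholders.

```python
import math

# Assumed input: per-token log-probabilities of the generated answer.
token_logprobs = [-0.05, -0.30, -0.01, -0.12]  # placeholder values

sum_logprob = sum(token_logprobs)
avg_logprob = sum_logprob / len(token_logprobs)     # less sensitive to answer length

sequence_prob = math.exp(sum_logprob)               # P(S) via the chain rule
per_token_conf = math.exp(avg_logprob)              # length-normalized confidence

print(f"P(S) = {sequence_prob:.3f}, per-token confidence = {per_token_conf:.3f}")
```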

101 of 148

Confidence Estimation: In Classification

Softmax Probabilities:

  • Typically used out-of-the-box, but can be miscalibrated (over/underconfidence).

Calibration Methods:

  • Temperature scaling, Platt scaling, isotonic regression.
  • Reliability diagrams to visualize predicted vs. true probabilities.
  • Multi-label tasks may need specialized calibration strategies.
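A small sketch of temperature scaling: a single scalar T is fitted on held-out validation logits by minimizing the negative log-likelihood (a simple grid search keeps the sketch dependency-free). `val_logits` and integer `val_labels` are assumed to come from elsewhere.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of integer labels under temperature-scaled logits."""
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(val_logits, val_labels):
    """Pick the temperature that minimizes NLL on a held-out validation set."""
    grid = np.linspace(0.5, 5.0, 91)
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Usage: calibrated = softmax(test_logits / fit_temperature(val_logits, val_labels))
```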

101

102 of 148

Confidence Estimation: Advanced Calibration

Convert scores to probabilities: better calibration → more accurate routing.

  • Must have: select the threshold once.
  • Nice to have: calibration lets you carry thresholds over between batches.
  • Calibration techniques: temperature scaling, ensemble methods, reliability diagrams.
  • Avoid over/underconfidence to reduce mis-assignments.

102

103 of 148

Aggregating: Combining Model & Human Labels – Basics

Goal: Identify when to trust the model vs. when to request human assistance, i.e. we need to find the re-labeling ratio.

Strategies:

  • Simple Thresholding (if confidence is high, use model else use human).
  • Overlap: Use model as an additional expert (majority vote, weighted blending)
  • Meta-Classifier (decides if an item should go to human labeling)

Note: Humans make mistakes; covered in Human Labeling part of the tutorial.

103

104 of 148

Aggregating: Combining Model & Human Labels – Basics

104

[Diagram of the three strategies:
Simple thresholding: sample → LLM pass → is confidence > threshold? → directly accept, else human labeling.
Overlap: sample → LLM pass + human labeling → weighted blending → accept.
Meta-classifier: sample → meta estimator → likely OK: LLM labeling; risky: human labeling.]

105 of 148

Aggregating: Threshold-Based

Select a simple threshold to find easy-for-LLM cases while deferring uncertain items.

Method: If LLM confidence ≥ T → Accept model label. Else → Human label.

Pros: Straightforward implementation; cost-efficient to develop.

Cons: Incorrect threshold choice can lead to suboptimal quality or inflated costs.
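The rule above in code form: a minimal sketch where each item carries a model label and a calibrated confidence; the item structure and threshold value are placeholders.

```python
def route(items, threshold=0.9):
    """Split items into auto-accepted model labels and a human-labeling queue."""
    auto, to_human = [], []
    for item in items:
        if item["confidence"] >= threshold:   # confident enough: keep the LLM label
            auto.append(item)
        else:                                 # uncertain: defer to a human annotator
            to_human.append(item)
    return auto, to_human

# Example usage (threshold tuned on a held-out golden set):
# auto, to_human = route(llm_predictions, threshold=0.9)
```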

105

106 of 148

Aggregating: Overlap-Based

For some uncertain cases it is beneficial to overlap human and model labeling.

Method: Gather both LLM and human labels, combine with:

  • Majority vote: LLM is an extra expert
  • Weighted blending: We can assign weight based on confidence,... for humans too!
  • Heuristic rules.

Pros: More robust final labels; each source acts as a check on the other.

Cons: Increases costs (double labeling). Items to overlap must be selected carefully.

106

107 of 148

Aggregating: Meta-classifier

Let’s directly optimize the human re-labeling with a separate model!

Method: Train a secondary “routing” model on features (e.g., confidence score, topic, content, input complexity) to predict if LLM’s output is likely incorrect.

Pros: More sophisticated than a fixed threshold; can factor domain-specific cues.

Cons: Requires more labeled data to train routing model; complex pipeline;

107

108 of 148

Aggregating: Conclusion and Recap

It is cheap to use an LLM as a classifier; we need to find the cases to route to human experts.

Rule of thumb (production experience):

  • High confidence → model-only or minimal human checks.
  • Medium confidence → overlap with human for extra certainty. (check relabelling ratio)
  • Low confidence → human-only labeling.

108

109 of 148

Aggregating: Schema

109

110 of 148

Balancing: Quality & Automation

Adjust thresholds and overlap to meet cost-quality targets.

Threshold Tuning:

  • Higher threshold → more human effort: higher cost, higher quality.
  • Lower threshold → lower cost, higher risk of errors.

Cost vs. Accuracy:

  • Plot different thresholds or strategies.
  • Choose sweet spot based on project’s budget and risk tolerance.

110

111 of 148

Balancing: Cost Modeling & ROI Analysis

We need to estimate costs, including human labor and compute expenses. Cost modeling ensures the pipeline remains profitable and scalable.

Factors:

  • Annotation cost per item (human-labor rate).
  • LLM inference cost per call (hello, o1-like models).
  • Infrastructure overhead (GPU usage, hosting, maintenance).

Compute the Return on Investment (ROI): “saved human effort” vs. “LLM dev/inference fees” (do not forget the dev cost).
- Periodic reviews ensure long-term economic viability.

Manual cost: $50k for 5k labels
Hybrid cost: $10k for a 1k training set + $200-500 for experiments + $10k for overlap
LLM-only cost: $300 for 5k labels

111

(Additional cost factors: platform, infrastructure, held-out set, failed experiments.)

112 of 148


112

113 of 148

Continuous Quality Assessment

Early detection of issues saves a lot of money.

Ongoing Evaluation:

  • Regularly sample N items for ground-truth expert labeling.
  • Use metrics like accuracy, F1, reliability over time.
  • Nice to track downstream quality, but not always possible.
  • Detect declines (model drift, distribution shifts) and respond promptly.

113

114 of 148

Continuous Model Improving

Systematic iteration raises quality and reduces cost in the long run.

We produce data in batches, so we can use new data to retrain labelling models.

  • Retrain/fine-tune models with newly verified labels from the pipeline.
  • Iterative improvement: better model → fewer human checks → more savings.
  • Keep reproducibility

114

115 of 148

Insights from Real-World Projects

  • o1-like: Higher quality but expensive; be aware of model usage License :)
  • Open models are much cheaper but require fine-tuning.

  • Silver bullet: Find subsets where the model outperforms humans.
  • Ongoing expert validation / held-out set
  • Adjust thresholds and retrain models constantly with new batches.

115

116 of 148

Case Study: Product Pair Similarity

Context:

  • Classify into 5 classes, from “not relevant” to “same product”
  • “Same product” is critical yet difficult to distinguish

Key Approaches:

  • Offline setup combining an LLM with human labeling
  • A visual encoder improves accuracy
  • Fine-tuning & calibration led to better performance
  • A classifier head works better than the generative LM loss
  • Upsampling minority classes for training

Lessons Learned:

  • Calibrated confidence: route low-confidence items to humans
  • Converting multi-label to binary classification simplifies decision-making
  • The base model is better for training than the instruction-tuned one

116

[Plot: accuracy vs. relabel ratio; blue = LLaMA 70B, red = Qwen-VL 72B (with images)]

117 of 148

Conclusion & Next Steps

Key Takeaways:

  1. Hybrid labeling merges LLM power with human judgment, balancing quality and cost.
  2. Calibration and confidence estimation are essential for effective automation.
  3. Infrastructure optimization is key to lowering costs.
  4. Iterative improvement and cost modeling keep the pipeline economically viable.

Go Deeper:

  • More advanced routing/aggregation models.
  • Best-of-N generation.
  • Better calibration methods for generative tasks.
  • Extending hybrid approaches to complex, multi-step tasks (e.g., end-to-end QA, dialogue systems).

117

118 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

118

119 of 148

LM Workflows

A case study

119

120 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

120

121 of 148

Overview

  • Aim: create efficient LMs with acceptable performance
  • Sample problem:
    • Text: ...
    • Labels: [repdim:sent, repdim:sent . . . ]
    • Reputation dimension label set: [Economic performance, Market position…]
    • Sentiment/tonality label set: [-2, -1, 0, 1, 2]
  • Data example (artificial):
    • Text: “Audi closes important plant: Next shock news for the 3,000 employees - the Audi plant in Brussels will close at the end of February 2025. The closure comes after months of negotiations between Audi, works councils and trade unions. Audi cites declining sales figures for the Q8 e-tron and high logistics costs as the reasons.”
    • Labels: [Economic performance: -1, Market position: -1]
  • Experiments done by: Sachin Yadav, Pawan Waldia, Abhishek Chugh

122 of 148

Task

  • Data example (artificial):
    • Text: “Audi schließt wichtiges Werk: Nächste Schock-Nachricht für die 3.000 Mitarbeiter - Ende Februar 2025 schließt das Werk von Audi in Brüssel. Die Schließung kommt nach monatelangen Verhandlungen zwischen Audi, Betriebsräten und Gewerkschaften zustande. Audi nennt schrumpfende Verkaufszahlen des Q8 e-tron und hohe Logistikkosten als Gründe.”
    • Labels: [Economic performance: -1, Market position: -1]
  • Aggregated labels:
    • Reputation Dimension Labels: [Economic performance, Market position]
    • Average Sentiment Label (discretized): -1

123 of 148

Datasets

  • Reputation Dimension Labels (RPL)
  • Average Sentiment Labels (ASL)
  • Sentiment Baseline (GerNTSA): German news data

Data Cleaning and Splitting

  • Removes HTML tags
  • Strips special characters (e.g., @, #, *, $, %)
  • Removes extra spaces
  • Removes duplicate data
  • Low-frequency dimensions (e.g., Management, Strategie) are excluded from the RPL datasets.

Split    GerNTSA    RPL     ASL
Train    3033       1217    1217
Dev      380        153     153
Test     379        152     152
Total    3792       1522    1522

124 of 148

125 of 148

126 of 148

Model Information

Model                       Parameters     Architecture       Type
LLama-2-7b                  7 Billion      Decoder            Multilingual
Flan-T5 Small               77 Million     Encoder-Decoder    Multilingual
Flan-T5 Large               783 Million    Encoder-Decoder    Multilingual
BERT-base-uncased           110 Million    Encoder            Monolingual (English)
BERT-large-uncased          340 Million    Encoder            Monolingual (English)
BERT-base-german-uncased    110 Million    Encoder            Monolingual (German)
XLM-RoBERTa Base            125 Million    Encoder            Multilingual
XLM-RoBERTa Large           255 Million    Encoder            Multilingual

127 of 148

Overview of Project Workflow

128 of 148

Efficiency optimization

Quantization:

  • Quantization: Reduces model size by representing weights with fewer bits, improving efficiency.
  • Usage: Applied 4-bit quantization to optimize performance and reduce resource usage.

Pruning:

  • Pruning: Removes less important model parameters to reduce size and improve efficiency.
  • Usage: Applied L1 unstructured pruning to remove 10% of weights globally.

Focus Models: Llama-2-7b
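A sketch of the two tricks with Hugging Face transformers (bitsandbytes 4-bit loading) and torch.nn.utils.prune (global L1 unstructured pruning at 10%); the model id and settings are assumptions matching the slide, not a verified recipe from the experiments.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization at load time (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # placeholder checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)

# Global L1 unstructured pruning: remove 10% of weights across all Linear layers
# (applied to a separate full-precision copy, since pruning needs real weights).
fp_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
params = [(m, "weight") for m in fp_model.modules() if isinstance(m, torch.nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.10)
for module, name in params:
    prune.remove(module, name)            # make the pruning permanent
```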

129 of 148

Performance optimization

  • Hyperparameter tuning
  • Prompt engineering
  • Systematic prompt optimization (DSPy)
  • Finetuning
  • Learning Classifier

130 of 148

Focus Models: Llama 7b, Flan-t5-small/large, BERT-base-uncased, BERT-large-uncased, BERT-base-german-uncased, XLM-RoBERTa Base, XLM-RoBERTa Large

  • Prompts: Predefined inputs or scenarios provided to the model to guide its output
  • Temperature: Controls randomness in responses (low = focused, high = creative).
  • Top_p: Limits choices to top likely words (e.g., top_p = 0.9 includes top 90%).

We tried the following hyperparameters:

For each prompt, we used the following configurations:
  • temperature: 0.1, 0.4, 0.6, 0.8, 1.0
  • top_p: 0.1, 0.6, 0.8, 1.0

For DSPy: the number of labeled examples was fixed to k=12.

For BERT fine-tuning:
  • lr: 1e-5, 2e-5, 3e-5, 4e-5
  • epochs: 3, 4
  • batch_size: 4, 8, 16, 32, 64

For the MLP:
  • lr: 1e-5, 2e-5, 4e-5, 5e-6, 6e-5
  • number_of_iterations: 6000, 10000, 15000
  • batch_size: 32, 64, 128
  • hidden layers: [512], [512, 1024], [512, 1024, 512]

Prompts:
  • English: prompt1 and prompt3
  • German: prompt2 and prompt4

131 of 148

Prompt Engineering

  1. Research and adaptation: initial prompts were researched online and adapted to align with the task requirements on the dev data.
  2. Experimentation with prompt types:
    • Started with short and simple prompts to see how the model responds.
    • Tried various prompt styles, including sarcasm-based prompts, to explore different ways of generating responses.
    • Experimented with few-shot prompting by providing labelled examples from the train data.
  3. Language testing: checked how well the model works with prompts in both English and German.
  4. Finalization of prompts: after testing and improving, we chose the prompts that worked best for the task.

132 of 148

Prompt examples: sentiment and multilabel classification

133 of 148

Prompt optimization with Declarative Self-improving Python (DSPy):

What is DSPy?

  • A Python library that makes programming easier by letting us define goals, and it handles the rest.
  • It uses machine learning to improve performance over time.

Key Features:

  • Self-improving: Learns and gets better with use.
  • Goal-focused: You tell it what to achieve, and it takes care of the details.
  • Adaptive: Great for building tools that can learn and improve, like decision-making systems.

Labelled Few Shot:

  • We used it in DSPy to optimize prompts by providing a small number of labeled examples (e.g., k=12).
  • These examples act as templates to help the model understand the task better.

How it works:

  • We choose k (e.g., 12 examples).
  • DSPy selects diverse, balanced examples from the dataset.
  • These examples are included in the prompt for better results.

134 of 148

135 of 148

136 of 148

137 of 148

Conclusion

  • BERT Models: Smaller BERT models outperformed larger LLaMA and Flan-T5 models despite their size.
  • MLP Model: Proves ideal for classification tasks due to its low computational requirements.
  • Our results confirm previous observations (Bosley et al., n.d.)
  • Quantization & Pruning: Provide similar results, showing that reducing model size does not impact performance.
  • DSPy: worked well with Flan-T5 (large) but not with Llama.
  • Prompt Performance: Both LLaMA and Flan-T5 perform better with short, precise prompts.
  • Flan-T5: Excels with English prompts but occasionally performs well with German prompts in larger models, suggesting limited German training data.
  • LLaMA-7b: Performs best with German prompts, but English prompts also work effectively for German datasets.
  • Optimal Parameters: Consistently better results achieved with temperature=0.1 and top-p=0.1 while DSPy shows consistent results when increasing the value of k.

138 of 148

Limitations

  • Case study: particular setup
  • Many LLMs were not tested
  • Fine-tuning LLMs was not tested
  • Chain-of-thought prompts were not tested

139 of 148

Limitations

139

140 of 148

Subjectivity and bias

  • LLMs struggle with tasks lacking a single "correct" answer
    • Tasks that require understanding emotions, sentiment, toxicity, and cultural nuances
  • LLMs do not exhibit natural variation, unlike human annotations
  • Produce overly consistent labels across similar items, missing nuanced differences that human labelers would catch
  • Risk of bias in LLM-generated labels, which could lead to biased classifiers if trained on these labels

140

141 of 148

Errors in LLM labelling

  • LLMs tend to favor some class labels due to their frequency in pre-training data.
  • Context-level bias: LLMs tend to prefer the majority and last label of the provided few-shot examples.
  • Domain-level bias: LLMs tend to associate lexical cues with certain class labels.
    • Example: Slang -> toxic, medical terms -> sick.
  • Input preference bias: LLMs tend to favor certain inputs based on surface features such as length.
  • Output structure issues: LLMs may have difficulties in providing structured outputs (e.g., formatted as JSON) or experience performance decrease.

141

142 of 148

Model collapse

  • The use of synthetic data or LLM-labeled data for training could lead to gradual model degradation and potential model collapse over time.
  • Early stages: Loss of information about distribution tails, affecting minority data
  • Late stages: Significant performance decline, concept confusion, and variance loss

142

143 of 148

Other concerns

  • Difficulty in attributing responsibility for LLM labels raises concerns about misuse and harm.
  • LLMs may lack the knowledge required for domain-specific tasks. Only human domain experts can bring relevant expertise.
  • Using API-level LLMs for labeling sensitive data may raise privacy and security issues.

143

144 of 148

Hands-on session: Hybrid data annotation

144

145 of 148

Setup and Problem to solve

Dataset:

  • “Debagreement: Reddit 50K”
  • agree, disagree, neutral, unsure
  • 3 annotators and >⅔ agreement label

Convert to a single-annotator dataset:

  • Corrupt each label with probability (1 - agreement)
  • We obtain single-annotator labels and a 3-annotator golden set (see the sketch below)

Plan and Goal:

  • Fine-tune an LLM to predict labels
  • Evaluate quality
  • Find the best ratio / samples to re-label with humans
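A sketch of the corruption step described above; the tiny DataFrame is a toy placeholder standing in for the real gold labels and per-item agreement values, and the label set matches the slide.

```python
import numpy as np
import pandas as pd

LABELS = ["agree", "disagree", "neutral", "unsure"]
rng = np.random.default_rng(42)

def corrupt(label: str, agreement: float) -> str:
    """Flip the gold label to a different one with probability (1 - agreement)."""
    if rng.random() < 1.0 - agreement:
        return rng.choice([l for l in LABELS if l != label])
    return label

# Toy placeholder frame; the real data would hold the gold labels and agreement scores.
df = pd.DataFrame({"label": ["agree", "disagree", "neutral"],
                   "agreement": [1.0, 0.67, 0.8]})
df["single_annotator_label"] = [corrupt(l, a) for l, a in zip(df["label"], df["agreement"])]
print(df)
```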

145

146 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

146

147 of 148

QA session 2

147

148 of 148

Talk to us

Tutorial web page

Toloka Research Fellowship program

148