1 of 148

Labelling with LLM and Human-in-the-Loop

Ekaterina Artemova, Akim Tsvigun, Dominik Schlechtweg

Natalia Fedorova, Sergei Tilga, Konstantin Chernyshev, Boris Obmoroshev

1

TALK TO US

TUTORIAL PAGE

2 of 148

About us

2

Natalia Fedorova

Toloka Partnership Manager, Toloka

Boris Obmoroshev

Toloka AI R&D and Analytics Director, Toloka

Sergei Tilga

Head of R&D, Toloka

Ekaterina Artemova

Machine Learning Researcher, Toloka

Konstantin Chernyshev

Machine Learning Researcher, Toloka

Akim Tsvigun

Natural Language Processing Lead @ Nebius AI, University of Amsterdam

Dominik Schlechtweg

Research Group Lead, Universität Stuttgart

3 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

3

4 of 148

Introduction

4

5 of 148

Data is the new oil

Clive Humby

5

6 of 148

Data needs in NLP

Raw data

  • Is needed for unsupervised learning tasks, such as LLM pre-training
  • Teaches LLMs to generate plausible language
  • Is usually collected by scraping web sources

This is not the focus of this tutorial!

Labelled data

  • Is needed for tasks like sentiment analysis, named entity recognition, and syntactic parsing
  • Is needed for specific applications (conversational AI, machine translation) and specific domains (medicine, law, finance, etc)
  • Is manually annotated by humans

6

7 of 148

Many applications require complex annotation

7

8 of 148

Data annotation pipeline

8

01 Frame the target task

Problem formulation: Does this text pose a risk of harm?

02 Conceptualize the problem

The annotation schema defines the labels, how they should be applied, and how complex cases should be treated.

03 Instruct annotators

Annotation guidelines are provided to the annotators to label raw data and to guide annotation decisions.

04 Quality control

Choose an aggregation rule, control inter-annotator agreement, train a baseline model.

9 of 148

Labelling with humans: Example

  • Problem formulation: detect emotions in tweets
  • Annotation schema: six emotions (love, anger, fear, sadness, surprise, joy)
  • Instruction:

Please read each text carefully and classify it based on the dominant emotion it expresses: love, anger, fear, sadness, surprise, or joy. If the tweet does not clearly convey one of these emotions, or if it is purely factual or neutral, mark it as neutral.

  • Quality control: overlap of 5 annotations, majority voting, and control tasks

9

Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.

i was ready to meet mom in the airport and feel her ever supportive arms around me

love

anger

fear

sadness

surprise

joy

neutral

10 of 148

Labelling with humans: Budget estimation

  • Let X be the cost of annotating a single text.
  • The target dataset includes 1,000 texts plus an additional 10% for control tasks.
  • The total budget is calculated as (1,000 + 100) ⋅ 5 ⋅ X = 5,500 ⋅ X (see the sketch below).
  • X is determined based on the hourly payment rate and the average number of texts that can be annotated per hour.
  • The time spent on labelling depends on the complexity of the task and the size of the target dataset.
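As a back-of-the-envelope check, the same budget arithmetic can be scripted. This is a minimal sketch; the hourly rate and annotator throughput below are placeholder assumptions, not figures from the tutorial.

```python
# Budget estimation for human labelling (illustrative numbers only).
dataset_size = 1_000          # target texts
control_share = 0.10          # +10% for control tasks
overlap = 5                   # annotations per text

hourly_rate = 6.0             # assumed payment per hour, USD (placeholder)
texts_per_hour = 60           # assumed annotator throughput (placeholder)

cost_per_text = hourly_rate / texts_per_hour          # this is X
total_texts = dataset_size * (1 + control_share)      # 1,100 texts
total_annotations = total_texts * overlap             # 5,500 annotations
total_budget = total_annotations * cost_per_text

print(f"X = ${cost_per_text:.3f} per annotation")
print(f"{total_annotations:.0f} annotations, budget = ${total_budget:,.2f}")
```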

10

Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.

i was ready to meet mom in the airport and feel her ever supportive arms around me

love

anger

fear

sadness

surprise

joy

neutral

11 of 148

Labelling with an LLM

  • We can leverage the ability of LLMs for zero-shot classification by prompting them to predict labels for unlabeled text.
  • The cost of an LLM run depends on the type of access (API or self-hosting).
  • The time spent on labeling depends on the inference speed.
  • Depending on the computational setup, the cost of labeling 1,100 tweets can be below $5.

11

Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.

Given the following text, classify the emotion expressed in the message into one of the following categories: anger, fear, joy, love, sadness, or surprise. If the text does not express a clear emotion, classify it as neutral.

Text: i was ready to meet mom in the airport and feel her ever supportive arms around me

Emotion: <Select the appropriate emotion category>

love
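A minimal sketch of how this zero-shot labelling prompt could be sent to a model, assuming the OpenAI Python client and a placeholder model name; any hosted or self-hosted chat API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["anger", "fear", "joy", "love", "sadness", "surprise", "neutral"]

def label_tweet(text: str) -> str:
    """Ask the LLM for a single emotion label for one tweet."""
    prompt = (
        "Given the following text, classify the emotion expressed in the message "
        "into one of the following categories: anger, fear, joy, love, sadness, or "
        "surprise. If the text does not express a clear emotion, classify it as "
        "neutral. Answer with the label only.\n\n"
        f"Text: {text}\nEmotion:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"

print(label_tweet("i was ready to meet mom in the airport and feel her ever supportive arms around me"))
```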

12 of 148

LLMs do the work, humans fix the oops

Labeling with LLMs

  • Reduces cost and time spent on annotations
  • Can make mistakes and struggle with complex or subjective tasks
  • Unclear when to trust LLM predictions

Solutions

  • Synthetic Data: Generate synthetic data instead of labeling raw data
  • Active Learning: Select the most informative instances for human labeling
  • Hybrid Approaches: LLMs handle simple instances, humans handle complex ones

12

The slide title is generated with ChatGPT

13 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

13

14 of 148

Synthetic data generation

14

15 of 148

Synthetic data generation

15

A Powerful and Thought-Provoking Masterpiece

Oppenheimer is a stunning portrayal of J. Robert Oppenheimer's life and the creation of the atomic bomb. Cillian Murphy delivers a brilliant performance, capturing the complexity of the man behind history's most pivotal invention. Christopher Nolan's direction, paired with breathtaking visuals and a haunting score, makes every moment gripping.

The film masterfully balances science, ethics, and personal struggles, leaving you reflecting long after the credits roll. With a stellar cast and unforgettable storytelling, Oppenheimer is a must-watch for anyone who loves powerful, impactful cinema.

Suppose you are writing movie reviews for the IMDB platform. Write a positive review of the film Oppenheimer.

16 of 148

Synthetic data generation

The main idea is to generate synthetic data in a way that mimics the characteristics and features of real-world data.

16

Benefits

  • Lower costs for data management and analysis
  • Faster data collection turnaround time
  • Fewer security issues

Limitations

  • Difficulties in generating accurate and diverse data
  • Need for validation procedures
  • The quality of trained classifiers may suffer

17 of 148

Requirements for synthetic data

  • Accurate labels: Labels must be correct and consistent to ensure reliable model training
  • Matching domain: Data should mimic the target domain in language, style, and content
  • Diversity: Variation is essential to prevent overfitting and enhance model robustness

17

18 of 148

LLMs for synthetic data generation

Zero-shot Generation

  • Prompt an LLM with a task description and a label
  • No examples provided; LLM relies on pre-trained knowledge

18

Generate a sentence that conveys a {label} sentiment.

19 of 148

LLMs for synthetic data generation

Few-shot Generation

  • Prompt an LLM with a task description, a label, and several data instances
  • Directs LLM behavior using sample instances

19

You are given a data entry consisting of a sentence and its sentiment label. Sentence: {sentence}, Label: {label}. Generate a similar data entry and output both a sentence and a label.

20 of 148

LLMs for synthetic data generation

Hierarchical Generation

  • Prompt an LLM to generate key points
  • Create detailed prompts incorporating these key points
  • Use crafted prompts to generate specific content
  • All steps can be zero- or few-shot

20

Generate a topic that can be used to create sentences with a distinct {label} sentiment. The topic should be broad enough.

Given the topic {topic}, generate a sentence that conveys a {label} sentiment.
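A sketch of the two-step hierarchical recipe above. The client, model name, and prompt wiring are illustrative assumptions; any completion API could stand in for the helper.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; any LLM backend works

def complete(prompt: str) -> str:
    """One LLM completion; the model name is a placeholder."""
    out = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content.strip()

def generate_hierarchical(label: str, n_topics: int = 10) -> list[dict]:
    """Step 1: generate broad topics; Step 2: generate one sentence per topic."""
    examples = []
    for _ in range(n_topics):
        topic = complete(
            f"Generate a topic that can be used to create sentences with a distinct "
            f"{label} sentiment. The topic should be broad enough."
        )
        sentence = complete(
            f"Given the topic {topic}, generate a sentence that conveys a {label} sentiment."
        )
        examples.append({"text": sentence, "label": label, "topic": topic})
    return examples
```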

21 of 148

Validation techniques

  • Train-real-test-synthetic: Train a classifier on human-labeled data and evaluate its accuracy on the generated dataset
  • Train-synthetic-test-real: Train a classifier on synthetic data and evaluate its accuracy on a human-labeled test dataset (see the sketch below)
  • Statistical checks: Examine whether the label distribution, diversity, and stylistic features of the generated dataset match those of the real dataset
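A minimal train-synthetic-test-real check with scikit-learn. The tiny text lists are toy placeholders standing in for the LLM-generated training set and the human-labeled test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice these are the generated set and a real held-out set.
synthetic_texts = ["the movie was wonderful", "worst film I have seen",
                   "an absolute delight", "boring and far too long"]
synthetic_labels = ["positive", "negative", "positive", "negative"]
real_test_texts = ["I loved every minute", "I want my money back"]
real_test_labels = ["positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(synthetic_texts, synthetic_labels)           # train on synthetic data
pred = clf.predict(real_test_texts)                  # evaluate on real, human-labeled data
print("train-synthetic-test-real accuracy:", accuracy_score(real_test_labels, pred))
```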

21

22 of 148

Case study 1: How does task subjectivity affect performance?

Key Findings:

• Subjectivity negatively affects model performance on synthetic data

• Performance gap increases for highly subjective tasks

• Models perform better on less subjective instances within tasks

22

23 of 148

Case study 1: How does task subjectivity affect performance?

Methodology:

• Used GPT-3.5-Turbo for synthetic data generation

• Explored zero-shot and few-shot generation settings

• Evaluated across 10 text classification tasks

• Conducted crowdsourced studies to determine task-level subjectivity

23

24 of 148

Case study 2: How does increasing diversity of synthetic data affect performance?

Key findings:

  • Diversification approaches can increase data diversity but often at the cost of data accuracy
  • Balancing diversity and accuracy in synthetic data generation is crucial for creating high-quality datasets
  • Human-LLM loop can facilitate high diversity and accuracy in LLM-based text data generation

24

25 of 148

Case study 2: How does increasing diversity of synthetic data affect performance?

Diversification approaches:

  • Logit suppression: Decreases probability of sampling frequently generated tokens.
  • Temperature sampling: Flattens token sampling probability distribution.
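A small sketch of what these two knobs do at sampling time, assuming raw next-token logits in a NumPy array. Real implementations usually apply this inside the decoding loop (for example via frequency penalties or logit bias); this is only an illustration of the idea.

```python
import numpy as np

def sample_token(logits, generated_counts, temperature=1.0, suppression=0.0):
    """Sample a token id after logit suppression and temperature scaling."""
    # Logit suppression: push down tokens that have already been generated often.
    adjusted = logits - suppression * generated_counts
    # Temperature sampling: >1 flattens the distribution (more diversity), <1 sharpens it.
    scaled = adjusted / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```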

Human interventions:

  • Label replacement (LR): Correcting misaligned labels improved model accuracy by ~15% when used with diversification techniques.
  • Out-of-scope filtering (OOSF): Removing irrelevant instances showed limited utility in improving accuracy.

25

26 of 148

Case study 3: To what extent can synthetic data replace human annotation?

Key findings:

  • Using only synthetic data leads to worse performance than the same amount of real data.
  • A small amount of human-labeled data significantly outperforms large quantities of synthetic data in terms of performance gain and cost-effectiveness.
  • Labeling hundreds of data points requires a lower budget than generating an order of magnitude more synthetic data.

26

Ashok, Dhananjay, and Jonathan May. "A Little Human Data Goes A Long Way." arXiv preprint arXiv:2410.13098 (2024).

27 of 148

Case study 3: To what extent can synthetic data replace human annotation?

Methodology

  • Gradually increase the amount of synthetic data in fixed-size training sets and test model performance.
  • Use performance prediction methods to estimate the volume of synthetic data that provides a comparable gain to 200 human-labeled data points.
    • Fit a regression model: performance ~ dataset size.
  • At some volume, the costs of compute usage become comparable to the cost of human labeling

27

Ashok, Dhananjay, and Jonathan May. "A Little Human Data Goes A Long Way." arXiv preprint arXiv:2410.13098 (2024).

28 of 148

Understanding bias in synthetic data

Synthetic data can perpetuate, amplify, and introduce biases due to several factors

  • Label-conditioned generations can introduce bias and reduce the difficulty of the synthetic data entry
  • Generation models can replicate existing biases from the pre-training data.
  • Synthetic data may reproduce biases found in human data used for fine-tuning or in few-shot setups

28

Chim, Jenny, Julia Ive, and Maria Liakata. "Evaluating Synthetic Data Generation from User Generated Text." Computational Linguistics (2024): 1-44.

29 of 148

Understanding bias in synthetic data

Synthetic data can perpetuate, amplify, and introduce biases due to several factors:

  • Uneven quality in synthetic subsets, such as poorly generated code-switched or vernacular texts, can impact linguistic minorities.
  • Neglecting underrepresented groups in data generation can lead to biased outputs that mainly reflect majority perspectives.

29

Chim, Jenny, Julia Ive, and Maria Liakata. "Evaluating Synthetic Data Generation from User Generated Text." Computational Linguistics (2024): 1-44.

30 of 148

Conclusion: Synthetic data for text classification

Effectiveness

  • Useful in objective tasks (e.g., news topic classification)
  • Limitations in highly subjective tasks (e.g., humor detection)

Diversity Impact

  • Increased diversity generally improves classifier performance
  • Diversity-label accuracy trade-off exists

Best Practices

  • Use a hierarchical approach to LLM prompting
  • Think about sources of bias in synthetic data
  • Combine with human labelling for optimal results

30

31 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

31

32 of 148

Active learning with LMs

32

33 of 148

Active learning with LMs

  • Introduction to active learning (AL)
  • Active learning strategies in text classification
  • Generative active learning
  • Human-based vs LLM-based annotation in active learning
  • Techniques for efficient active learning

33

34 of 148

Introduction to active learning

  • AL reduces the required number of annotated inputs to reach the desired quality of an LM
  • AL methods select for annotation instances that are the most informative or uncertain for the model
  • In some NLP tasks, training a model on ⅙ of the dataset, selected with AL, reaches 99% of the quality of a model trained on the entire dataset

34

Settles, B.: "Active Learning". Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2012.

35 of 148

How to select texts for annotation?

  • How to measure the amount of “new information” in the instance?
  • Imagine a classification task with two labels: ‘positive’ and ‘negative’
  • We have two labeled examples:
    • “I like this place” has a ‘positive’ label
    • “I can say I’m disappointed” has a ‘negative’ label
  • Which text is more important to annotate: “I can’t say I’m disappointed” or "It was really exciting!”?
  • First text cosine similarities with already annotated texts: 0.71, 0.84*
  • Second text cosine similarities with already annotated texts: 0.71, 0.62*
  • The first text is more “similar” to already annotated texts, but its label is more vague

35

* Using the BGE-ICL model, one of the SOTA text embedding models
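These similarities can be reproduced with any sentence-embedding model. A sketch with the sentence-transformers library is shown below; the checkpoint id is a placeholder assumption standing in for BGE-ICL.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # placeholder for the BGE-ICL model

labeled = ["I like this place", "I can say I'm disappointed"]
candidates = ["I can't say I'm disappointed", "It was really exciting!"]

emb_labeled = model.encode(labeled, normalize_embeddings=True)
emb_candidates = model.encode(candidates, normalize_embeddings=True)

# Rows: candidate texts, columns: already annotated texts.
print(util.cos_sim(emb_candidates, emb_labeled))
```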

36 of 148

AL strategies in text classification

  • AL strategies differ in how they calculate the “informativeness” of the input text
  • Key Considerations when choosing an AL strategy for your case:
    • Model uncertainty
    • Diversity of selected examples
    • Computational efficiency
    • Size of the batch for annotation
    • Balance between exploration and exploitation

36

37 of 148

AL strategies in text classification

  • Least Confidence (LC, ↓): measures the probability of the most probable label predicted by the model.

  • Breaking Ties (BT, ↓): measures the difference between the probabilities of the two most probable labels.

  • Contrastive Active Learning (CAL, ↑): measures the divergence of the predictive likelihoods of input texts from their neighbors in the training set.

  • Batch Active Learning by Diverse Gradient Embeddings (BADGE, ↑): measures informativeness as the gradient magnitude with respect to the model parameters in the output layer.

37

↑: higher values stand for higher priority, ↓: lower values stand for higher priority
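A minimal sketch of the first two strategies over a batch of softmax outputs; `probs` is assumed to be an (n_samples, n_classes) array of model probabilities.

```python
import numpy as np

def least_confidence(probs: np.ndarray) -> np.ndarray:
    """LC score: probability of the top label (lower = more informative)."""
    return probs.max(axis=1)

def breaking_ties(probs: np.ndarray) -> np.ndarray:
    """BT score: margin between the two most probable labels (lower = more informative)."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def select_for_annotation(probs: np.ndarray, k: int, strategy=breaking_ties) -> np.ndarray:
    """Return the indices of the k highest-priority instances (lowest scores first)."""
    return np.argsort(strategy(probs))[:k]
```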

38 of 148

Generative active learning

  • Traditional AL is limited to selecting from existing unlabeled data
  • Sometimes, the existing data is scarce and doesn’t cover many potential inputs
  • Generative AL leverages the power of modern LLMs to create synthetic examples
  • It involves a three-stage process:
    1. Generate synthetic examples using LLMs (e.g., GPT-4, Deepseek-v3)
    2. Apply AL strategies to select the most informative synthetic examples
    3. Present selected examples for human annotation

38

39 of 148

Generative active learning

39

Advantages

  • Explores input space beyond the original dataset
  • Addresses rare or missing cases
  • Targets specific areas where the model needs improvement
  • Is particularly valuable when real-world examples are scarce or expensive to obtain

Challenges

  • Relies heavily on the generative LLM capabilities
  • Produces unrealistic or poor examples with weaker LLMs
  • Struggles with specialized domains
  • May generate examples that do not reflect the real-world distribution
  • Requires extra computational cost for the generation step

40 of 148

Case study: How much cost can active learning save?

Key Findings:

  • Active learning can save up to 90% of annotation costs compared to random sampling

  • State-of-the-art AL strategies like BT and CAL consistently outperform random sampling

40

41 of 148

Case study: How much cost can active learning save?

Methodology:

  • Evaluated across 5 text classification tasks
  • Active Learning Loop:
    • Query 25 instances per iteration
    • Oracle-provided labels
    • Retrain model with accumulated labeled data on each iteration
  • Used BERT and DistilRoBERTa models

41

42 of 148

Conclusion: Active learning

Effectiveness

  • Drastically reduced annotation budgets

Generalization

  • Data acquired through AL does not depend much on the LM used
  • AL can be applied to various NLP tasks, including text- and token-level classification and natural language generation

Best Practices

  • Choose an appropriate AL strategy and avoid biased data sampling
  • Implement iterative cycles: train -> sample -> annotate -> train -> sample -> annotate …
  • Think about the cold-start problem: which samples to use at the beginning

42

43 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

43

44 of 148

Quality control in human labeling

44

45 of 148

What is quality control in human annotation?

  • What is quality?
  • How to measure quality?
  • How to control quality?

45

46 of 148

What is quality control in human annotation?

  • What is quality? When a task does not have one correct answer, it becomes difficult to define quality.
  • How to measure quality? Quality is determined based on the specific task and its requirements.
  • How to control quality? We will discuss this topic throughout today's presentation.

46

47 of 148

Management triangle

47

QUALITY

COST

TIME

GOOD + QUICK = EXPENSIVE

GOOD + CHEAP = SLOW

QUICK + CHEAP = POOR QUALITY

48 of 148

Methods of Quality control

Before task performance

  • Selection of annotators
  • Well-designed instruction
  • Onboarding and exams
  • Choosing the team

Within task performance

  • Technical improvements of a platform
  • Control tasks
  • Motivation (e.g. performance-based pricing)
  • Tricks to remove bots and cheaters (e.g. quick answers)

After task performance

  • Data acceptance and working with data
  • Feedback loops

48

49 of 148

Before task performance

49

50 of 148

Selection of annotators

  • Sourcing and selection of the annotator supply
  • Choose the right criteria (e.g. education, languages, country, etc.)
    • To select the proper specialization
    • To control the quality of execution on your tasks
    • To get the annotators with the best quality on past projects

50

51 of 148

Onboarding and exams

  • An onboarding task is similar to a production task but includes comments that guide annotators through the understanding of the task.
  • An exam is a test that annotators need to pass before starting production tasks.
  • Exam tasks evaluate the education level of annotators.

51

52 of 148

Team roles

To achieve the best quality in labeling, it is better to create a cross-functional team.

Team-leads/Subject experts:

  • Assist in formulating clear instructions for annotators, providing feedback on the guidelines, clarifying challenging aspects, conducting initial trial tasks, and creating illustrative examples.
  • They understand the specifics of the task better than others.
  • Oversee the quality manual and provide feedback to annotators.
  • Participate in hiring new annotators.

52

53 of 148

Team roles

The solution engineer is a technical lead in charge of setting up the pipeline, implementing automation tools, and using quality control instruments.

The supply manager is in charge of finding relevant annotators and communicating with them. The supply manager is also responsible for time- and cost-efficient labeling.

Annotators/subject matter experts are in charge of doing the task, providing feedback, and following the instructions.

53

54 of 148

Pipeline setup

  • Every complex task can be split into small parts with additional checks and post-verifications.
  • One of the instruments for controlling quality is designing and simplifying the pipeline.
  • Pipeline design can help optimize speed and cost.
  • LLM checks can be implemented at different steps of the pipeline.

54

55 of 148

Within task performance

55

56 of 148

  1. Technological methods

56

57 of 148

ML methods for human labeling

  • Smart matching techniques between tasks (domain, complexity level, time, format, etc.) and annotators (skill set, performance, preferences, etc.).
  • Smart matching can be a set of simple heuristics; it works better at large scale and can increase quality and speed by 10-30%, depending on the task.
  • ML-based inspirational seed generation.
  • ML-based auto checks and assistants.
  • Appeals and feedback loops for our experts.
  • Automatic deduplication and quality metrics.

57

58 of 148

ML methods for human labeling: Co-pilot features increase productivity of experts

  • Real-time RAG-based fact check
  • Prompt coverage check
  • AI-generated content check
  • Grammar check

58

59 of 148

Anti-fraud rules

Fraud prevention is built into the data pipeline from start to finish to guarantee authentic human effort and expertise:

  • Control of response speed
  • Checking the trajectory of the cursor
  • Checks via CAPTCHA
  • Identity verification
  • Check whether a link has been visited
  • Check whether a video has been played

59

60 of 148

2. Operational methods

60

61 of 148

Control tasks / honeypots

Tasks with a known correct answer, shown to performers to evaluate their performance.

  • The distribution of answers in control tasks should match the distribution in the whole set of tasks
  • But they should contain rare answer variants with higher frequency
  • Refresh your set of control tasks regularly to avoid bots and cheating
  • Automatic control task generation via annotators
  • Tasks with answers of high confidence (e.g. aggregation of answers from a large number of annotators)

61

62 of 148

Motivation in human annotator work

  • Bonuses for a good quality within a period
  • Gamification (e.g. achievements, leaderboards, etc.)
  • Price depending on performance

62

63 of 148

After task performance

63

64 of 148

Data acceptance and working with data

  • Data quality metrics that correlate with model performance gains give confidence in the training data
  • Audit by annotation or domain experts
  • Aggregation tools for general crowd tasks
  • ML-based assessment of dataset quality, including LLM-checks

64

65 of 148

Feedback loops

  • Feedback is an essential part of every communication process
  • Providing and receiving feedback improves quality drastically
  • Feedback can come from subject matter experts, the task manager, or the customer.

65

66 of 148

Summary

  • To achieve the best quality, you first need to define what "best quality" means.
  • For complex labeling tasks, you need a team of experts who can combine different skill sets and organize the process effectively.
  • Implementing quality control at various stages of the data labeling process will improve performance, cost efficiency, and speed.
  • As data labeling scales, more advanced mechanisms for quality control become necessary.

66

67 of 148

Managing human workers

67

68 of 148

Overview

  • Best practices in annotation guidelines
  • Psychological characteristics of doing annotation tasks
  • Inter-annotator agreement
  • Communication with the annotators

68

69 of 148

Best practices in annotation guidelines

  • Clear instructions without ambiguities
  • Generalization and explaining the goal of the task
  • Balancing detail and conciseness
  • Including tests and interactive parts in the instructions
  • Many “corner” examples that demonstrate different cases
  • Using different formats (videos, webinars, meetings)

69

70 of 148

Psychological characteristics of doing annotation tasks

70

71 of 148

Framing effect

71

The size of the red dot “varies” depending on the context. Even when we're aware of the illusion, we can't help but perceive it. It is an illustration of the power of framing.

72 of 148

Framing effect

72

The size of the red dot “varies” depending on the context. Even when we're aware of the illusion, we can't help but perceive it. It is an illustration of the power of framing.

73 of 148

Attention

  • Attention is a limited-capacity resource distributed among cognitive processes.
  • This means that people tend to get tired more quickly when they pay greater attention.
  • The less attention a task gets, the lower the quality.
  • New tasks and changes in the instructions require more attention.

73

74 of 148

Thinking

  • Information perception is automated, which means that people tend to think according to their stereotypes.
  • People are resistant to perceiving new information.
  • Relearning is much harder than learning on the first attempt.

74

75 of 148

Memory

  • “Working memory” has limited capacity and holds 7±2 items
  • Combine different sections of the instructions into a single, cohesive contextual block.
  • Try to separate the explanation of the task from minor details.

75

76 of 148

Inter-annotator agreement

People tend to perceive things subjectively, even professionals.

How to improve inter-annotator agreement?

  • Clear annotation guidelines
  • Few-shot examples
  • Training and calibration
  • Measure agreement metrics
  • Iterative feedback and revisions
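Agreement can be measured directly, for example with Cohen's kappa from scikit-learn; the label lists below are made-up placeholders for two annotators.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels from two annotators for the same items.
annotator_a = ["joy", "anger", "neutral", "joy", "fear"]
annotator_b = ["joy", "anger", "joy", "joy", "fear"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```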

76

77 of 148

Communication with the annotators

Effective communication is a crucial part of every team's success.

  • Establish a dedicated channel for consistent communication with annotators.
  • Make the process interactive to engage them actively.
  • Implement feedback loops to ensure continuous improvement.
  • Organize communication streams by separating announcements from responses to general questions.
  • Consolidate all updates into the instructions for easy reference.

77

78 of 148

Summary

  • Working with people comes with many limitations that must be considered. Unlike machines, people are complex and not always easy to approach.
  • Effective communication is crucial for success in labeling tasks.
  • Structuring and organizing information can significantly improve understanding of the task.
  • Investing in thorough onboarding processes leads to better performance and overall outcomes.

78

79 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

79

80 of 148

QA session 1

80

81 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

81

82 of 148

Talk to us

Tutorial web page

Toloka Research Fellowship program

82

83 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

83

84 of 148

Hybrid pipelines

84

85 of 148

Introduction: Problem Types

We consider two primary labeling directions:

  1. Classification: Assigning discrete labels / preferences / judgements.
  2. Generation: Creating new text / prompts / replies.

Tutorial Focus:

  • Primarily on classification / preference tasks with discrete labels / multi-label;
  • LLMs are used to pre-assign or validate labels;
  • Generation is also interconnected and might use similar approaches.

85

86 of 148

Introduction: Roadmap

  1. Automatic Labeling (Prompts, Fine-Tuning)
  2. Confidence Estimation Techniques
  3. Aggregation Strategies (Thresholding, Overlap, etc.)
  4. Balancing Quality vs. Automation
  5. Hybrid Pipelines & Continuous Improvement
  6. Advanced Considerations (Calibration, Infrastructure, Cost Modeling)
  7. Real-World Case Studies & Lessons Learned

86

87 of 148

Example

87

  1. Clarity: Both responses clearly explain the process of resetting a password. However, Response B breaks down the steps more comprehensively.
  2. Detail: Response A gives a high-level overview, while Response B provides actionable details like entering the email address, checking the inbox, and following the link.
  3. Helpfulness: Response B is more helpful because it anticipates potential confusion and offers a step-by-step guide.

Preferred Response

Response B

Evaluate the provided responses to determine which is more helpful to the user based on the given query. Explain your reasoning using a chain-of-thought approach; compare the level of detail, clarity, and helpfulness of each response.

Query: “How do I reset my password?”

  • Response A: “To reset your password, click on ‘Forgot Password’ on the login page and follow the instructions.”
  • Response B: “Go to the login page, click on ‘Forgot Password,’ enter your email address, check your inbox for the reset link, and follow the steps in the email.”

88 of 148

Introduction: Why and What?

Problem:

  • Human annotators offer the highest quality but are costly at large scale and harder to scale.
  • LLMs provide cheaper labeling but can introduce errors or biases.

Goal:

  • Combine the strengths of LLMs (speed, coverage) with human expertise (accuracy, domain knowledge).
  • Achieve higher throughput, lower costs, and maintain flexible quality control.

General Hybrid Setup: Apply auto-labeling first, then refine with human annotators; but the exact setup can vary.

88


89 of 148

Introduction: General Hybrid Setup

  1. Model First Pass (Auto-Labeling):
    • The LLM predicts labels.
    • Confidences are estimated.
  2. Human-in-the-Loop:
    • Items with low confidence or high risk are routed to human annotators.
    • Potential re-check or overlap for final verification.
  3. Dynamic Component (Optional, but worth it):
    • Model re-training with new samples.
    • Adjust cutoff points as needed to maintain the desired quality vs. cost.
    • etc.

89

[Diagram: LLM pass → estimate confidence (e.g. is confidence > threshold?) → directly accept / accept with overlap / human labeling and overlap → update models]

90 of 148

Introduction: Tasks / Modalities

Task Types (classification):

  • Preference labeling: e.g. ranking or rating system outputs for user satisfaction.
  • Toxicity or content moderation: multi-class or multi-label classification.
  • Topic detection: discrete labeling of text segments into categories.
  • Temporal: describe, discuss or compare time-sensitive events.

Modalities:

  • Text-only
  • Visual-input / Vision-Language Models (VLLM) / Visual-output
  • Audio
  • Agents (it is a lot here)

90

91 of 148

Auto labeling: How to measure quality

We need to measure “quality” of the pipeline.

There are some ways to define it:

  • Downstream performance change
  • Label quality (mislabelled samples on honeypots)
  • Inter-annotator-agreement

Tl;dr: Have a held-out golden set (labeled by trusted experts / with overlap)

91

92 of 148

Auto labeling: Prompt Engineering Basics

Simplest method. Take a ready-made LLM, add a few examples and hope for the best.

Tricks to increase performance:

  • Tune the main prompt / instructions.
  • Few-Shot Prompts: Provide small curated examples for guidance.
  • Chain-of-Thought: Encourage explicit reasoning steps (or use o1-like models);
  • (Optional) Structured Format: Request outputs in JSON or with delimiters. BUT may reduce quality.

92

Prompt: "Your task is … Here are criteria … Here are examples … Think step-by-step."

Response: "I will evaluate … Let's think step-by-step. Criteria 1, Criteria 2 … The final answer is X."

93 of 148

Auto labeling: Prompt Engineering

Pros:

  • Minimal setup; broad knowledge ‘baked’ in the LLM.
  • Quick adaptation with few-shot prompts.
  • Scalability, easy to add a new domain or a few examples;

Cons:

  • LLMs make mistakes :)
  • Susceptible to hallucinations, unpredictable outputs.
  • Biases inherent in the LLM might propagate to labels (e.g. a preference for longer texts).

93

94 of 148

Auto labeling: Prompt Engineering Cost

Commercial vs. Open-Source LLMs:

  • Proprietary (e.g., GPT) might yield higher quality but at a higher inference cost and with license restrictions.
  • Open-source (e.g., Qwen, LLaMA) can be cheaper to run locally or with a provider, but may require specialized fine-tuning; some come with an open license.

GPT: 5k samples × $0.06 per sample = $300
Qwen 2.5 72B: 0.5-1.0h run time for 5k samples × $10/h for H100 × 2 GPUs = $20-$40

Trade-offs:

  • Higher auto-labeling quality with fewer human checks → higher inference cost.
  • Using more flexible open-source LLMs → infrastructure maintenance and fine-tuning.

94

95 of 148

Auto labeling: Finetuning Basics

Take a ready-made (or base) LLM and a set of golden examples (labeled by trusted experts) and fine-tune the desired classifier.

Tips and Tricks:

  • LoRA: Lightweight fine-tuning. Good for domain-specific improvements.
  • Always keep a held-out golden set for measuring accuracy, F1, etc.
  • Monitor overfitting and generalization carefully; check for data leaks.

95

96 of 148

Auto labeling: Finetuning

Method 1: Tune LLM with LM SFT

  • Pros: Straightforward approach; uses existing base LLM.
  • Cons: Prompt sensitivity, potential difficulty for multi-label tasks.

Method 2: Adding a New Classification Head

  • Pros: Potentially more accurate for classification; can handle multi-label.
  • Cons: Requires more data for from-scratch head training.

96

[Diagram: Method 1 - input → LLM layers (+LoRA) → LM head (+LoRA) → output; Method 2 - input → LLM layers (+LoRA) → classification head → output]
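A sketch of Method 2 (classification head plus LoRA adapters) with Hugging Face transformers and peft; the base checkpoint, target modules, and hyperparameters are illustrative assumptions, not the exact setup used in this tutorial.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)

# New classification head on top of the LLM backbone.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

# LoRA adapters on the attention projections; the new head stays fully trainable.
lora = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training itself can then be run with the standard Trainer or a custom loop.
```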

97 of 148

Auto labeling: Increasing Efficiency Post-tuning

  • Faster inference: pruning, quantization, distillation, etc.
  • Quality increase: ensembles, majority voting over models/temperatures.
  • Optimized infrastructure: GPU utilization, caching KV/predictions.
  • Continuous improvement: retrain when data drifts or new classes appear.

Qwen 2.5 72B fine-tuning: 2h of training × $10/h for H100 × 2 GPUs = $40
BUT data for fine-tuning: 1k samples × 10 min per sample × $60/h expert time = $10k

97

98 of 148

Auto labeling: Problem

  • Insufficient quality: often not ready for production without human checks.
  • LLM limitations: performance capped by model biases/quality (e.g., the Qwen model used).
  • Upper bound: even the best-LLM case may not match multi-human consensus.
  • Prompt sensitivity: minor wording changes can skew results.
  • Real-world experience: that's it ;)

Most of these problems can be mitigated with careful human labeling on top.

98

99 of 148

Confidence Estimation: Why It’s Crucial

Key Concept: Confidence directs the labeling workflow

  • High-confidence model outputs can be accepted automatically.
  • Low-confidence outputs go to human annotators.

Benefits:

  • Offload tedious work to the LLM and avoid unnecessary human effort.
  • Improves cost-effectiveness;
  • Reduces LLM error risks by focusing human attention on critical cases.
  • Key to making “hybrid” work seamlessly.

99

100 of 148

Confidence Estimation: In Text Generation

Token-Level Probabilities:

  • Summation or average of log probabilities across tokens.
  • Challenging to calibrate; context length can skew probabilities.

Multiple Prompts / Variance Check:

  • Generate multiple candidate outputs to measure disagreement.

Calibration Challenges:

  • Log-probs often are not well-calibrated; Post-hoc scaling might be required.
  • Systematic biases in LLMs (e.g. option order, long context).
  • Log probabilities require interpretation; e.g. a multiple-choice token does not necessarily equal the class.

100

S = <assistant/> The best response is A
(the prompt is the context; the generated answer "A" is the prediction)

P(S) = P(<assistant/> | prompt) ᐧ … ᐧ P(A | prompt <assistant/> The best response is)
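A minimal sketch of turning per-token log-probabilities (as returned by most APIs via a logprobs option) into a sequence-level confidence score; the token values below are made-up placeholders.

```python
import math

# Assumed input: per-token log-probabilities of the generated answer.
token_logprobs = [-0.05, -0.30, -0.01, -0.12]  # placeholder values

sum_logprob = sum(token_logprobs)
avg_logprob = sum_logprob / len(token_logprobs)     # less sensitive to answer length

sequence_prob = math.exp(sum_logprob)               # P(S) via the chain rule
per_token_conf = math.exp(avg_logprob)              # length-normalized confidence

print(f"P(S) = {sequence_prob:.3f}, per-token confidence = {per_token_conf:.3f}")
```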

101 of 148

Confidence Estimation: In Classification

Softmax Probabilities:

  • Typically used out-of-the-box, but can be miscalibrated (over/underconfidence).

Calibration Methods:

  • Temperature scaling, Platt scaling, isotonic regression.
  • Reliability diagrams to visualize predicted vs. true probabilities.
  • Multi-label tasks may need specialized calibration strategies.
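A small sketch of temperature scaling: a single scalar T is fitted on held-out validation logits by minimizing the negative log-likelihood (a simple grid search keeps the sketch dependency-free). `val_logits` and integer `val_labels` are assumed to come from elsewhere.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of integer labels under temperature-scaled logits."""
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(val_logits, val_labels):
    """Pick the temperature that minimizes NLL on a held-out validation set."""
    grid = np.linspace(0.5, 5.0, 91)
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Usage: calibrated = softmax(test_logits / fit_temperature(val_logits, val_labels))
```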

101

102 of 148

Confidence Estimation: Advanced Calibration

Convert scores to probabilities: better calibration → more accurate routing.

  • Must have: select the threshold once.
  • Nice to have: calibration lets you carry thresholds over between batches.
  • Calibration techniques: temperature scaling, ensemble methods, reliability diagrams.
  • Avoid over/underconfidence to reduce mis-assignments.

102

103 of 148

Aggregating: Combining Model & Human Labels – Basics

Goal: Identify when to trust the model vs. when to request human assistance, i.e. we need to find the re-labeling ratio.

Strategies:

  • Simple Thresholding (if confidence is high, use model else use human).
  • Overlap: Use model as an additional expert (majority vote, weighted blending)
  • Meta-Classifier (decides if an item should go to human labeling)

Note: Humans make mistakes; covered in Human Labeling part of the tutorial.

103

104 of 148

Aggregating: Combining Model & Human Labels – Basics

104

[Diagram of the three strategies:
Simple thresholding: sample → LLM pass → is confidence > threshold? → directly accept, else human labeling.
Overlap: sample → LLM pass + human labeling → weighted blending → accept.
Meta-classifier: sample → meta estimator → likely OK: LLM labeling; risky: human labeling.]

105 of 148

Aggregating: Threshold-Based

Select a simple threshold to find easy-for-LLM cases while deferring uncertain items.

Method: If LLM confidence ≥ T → Accept model label. Else → Human label.

Pros: Straightforward implementation; cost-efficient to develop.

Cons: Incorrect threshold choice can lead to suboptimal quality or inflated costs.
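The rule above in code form: a minimal sketch where each item carries a model label and a calibrated confidence; the item structure and threshold value are placeholders.

```python
def route(items, threshold=0.9):
    """Split items into auto-accepted model labels and a human-labeling queue."""
    auto, to_human = [], []
    for item in items:
        if item["confidence"] >= threshold:   # confident enough: keep the LLM label
            auto.append(item)
        else:                                 # uncertain: defer to a human annotator
            to_human.append(item)
    return auto, to_human

# Example usage (threshold tuned on a held-out golden set):
# auto, to_human = route(llm_predictions, threshold=0.9)
```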

105

106 of 148

Aggregating: Overlap-Based

For some uncertain cases it is beneficial to overlap human and model labeling.

Method: Gather both LLM and human labels, combine with:

  • Majority vote: LLM is an extra expert
  • Weighted blending: We can assign weight based on confidence,... for humans too!
  • Heuristic rules.

Pros: More robust final labels; each source acts as a check on the other.

Cons: Increases costs (double labeling). Items to overlap must be selected carefully.

106

107 of 148

Aggregating: Meta-classifier

Let’s directly optimize the human re-labeling with a separate model!

Method: Train a secondary “routing” model on features (e.g., confidence score, topic, content, input complexity) to predict if LLM’s output is likely incorrect.

Pros: More sophisticated than a fixed threshold; can factor domain-specific cues.

Cons: Requires more labeled data to train routing model; complex pipeline;

107

108 of 148

Aggregating: Conclusion and Recap

It is cheap to use an LLM as a classifier; we need to find the cases to route to human experts.

Rule of thumb (production experience):

  • High confidence → model-only or minimal human checks.
  • Medium confidence → overlap with human for extra certainty. (check relabelling ratio)
  • Low confidence → human-only labeling.

108

109 of 148

Aggregating: Schema

109

110 of 148

Balancing: Quality & Automation

Adjust thresholds and overlap to meet cost-quality targets.

Threshold Tuning:

  • Higher threshold → more human effort: higher cost, higher quality.
  • Lower threshold → lower cost, higher risk of errors.

Cost vs. Accuracy:

  • Plot different thresholds or strategies.
  • Choose sweet spot based on project’s budget and risk tolerance.

110

111 of 148

Balancing: Cost Modeling & ROI Analysis

We need to estimate costs, including human labor and compute expenses. Cost modeling ensures the pipeline remains profitable and scalable.

Factors:

  • Annotation cost per item (human-labor rate).
  • LLM inference cost per call (hello, o1-like models).
  • Infrastructure overhead (GPU usage, hosting, maintenance).

Compute the Return on Investment (ROI): “saved human effort” vs. “LLM dev/inference fees” (do not forget the dev cost).
- Periodic reviews ensure long-term economic viability.

Manual cost: $50k for 5k labels
Hybrid cost: $10k for a 1k training set + $200-500 for experiments + $10k for overlap
LLM-only cost: $300 for 5k labels

111

(Additional cost factors: platform, infrastructure, held-out set, failed experiments.)

112 of 148


112

113 of 148

Continuous Quality Assessment

Early detection of issues saves a lot of money.

Ongoing Evaluation:

  • Regularly sample N items for ground-truth expert labeling.
  • Use metrics like accuracy, F1, reliability over time.
  • Nice to track downstream quality, but not always possible.
  • Detect declines (model drift, distribution shifts) and respond promptly.

113

114 of 148

Continuous Model Improving

Systematic iteration raises quality and reduces cost in the long run.

We produce data in batches, so we can use new data to retrain labelling models.

  • Retrain/fine-tune models with newly verified labels from the pipeline.
  • Iterative improvement: better model → fewer human checks → more savings.
  • Keep reproducibility

114

115 of 148

Insights from Real-World Projects

  • o1-like: Higher quality but expensive; be aware of model usage License :)
  • Open models are much cheaper but require fine-tuning.

  • Silver bullet: Find subsets where the model outperforms humans.
  • Ongoing expert validation / held-out set
  • Adjust thresholds and retrain models constantly with new batches.

115

116 of 148

Case Study: Product Pair Similarity

Context:

  • Classify into 5 classes, from “not relevant” to “same product”
  • “Same product” is critical yet difficult to distinguish

Key Approaches:

  • Offline setup combining an LLM with human labeling
  • A visual encoder improves accuracy
  • Fine-tuning & calibration led to better performance
  • A classifier head works better than the generative LM loss
  • Upsampling minority classes for training

Lessons Learned:

  • Calibrated confidence: route low-confidence items to humans
  • Converting multi-label to binary classification simplifies decision-making
  • The base model is better for training than the instruction-tuned one

116

[Plot: accuracy vs. relabel ratio; blue = LLaMA 70B, red = Qwen-VL 72B (with images)]

117 of 148

Conclusion & Next Steps

Key Takeaways:

  1. Hybrid labeling merges LLM power with human judgment, balancing quality and cost.
  2. Calibration and confidence estimation are essential for effective automation.
  3. Infrastructure optimization is key to lowering costs.
  4. Iterative improvement and cost modeling keep the pipeline economically viable.

Go Deeper:

  • More advanced routing/aggregation models.
  • Best-of-N generation.
  • Better calibration methods for generative tasks.
  • Extending hybrid approaches to complex, multi-step tasks (e.g., end-to-end QA, dialogue systems).

117

118 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

118

119 of 148

LM Workflows

A case study

119

120 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

120

121 of 148

Overview

  • Aim: create efficient LMs with acceptable performance
  • Sample problem:
    • Text: ...
    • Labels: [repdim:sent, repdim:sent . . . ]
    • Reputation dimension label set: [Economic performance, Market position…]
    • Sentiment/tonality label set: [-2, -1, 0, 1, 2]
  • Data example (artificial):
    • Text: “Audi closes important plant: Next shock news for the 3,000 employees - the Audi plant in Brussels will close at the end of February 2025. The closure comes after months of negotiations between Audi, works councils and trade unions. Audi cites declining sales figures for the Q8 e-tron and high logistics costs as the reasons.”
    • Labels: [Economic performance: -1, Market position: -1]
  • Experiments done by: Sachin Yadav, Pawan Waldia, Abhishek Chugh

122 of 148

Task

  • Data example (artificial):
    • Text: “Audi schließt wichtiges Werk: Nächste Schock-Nachricht für die 3.000 Mitarbeiter - Ende Februar 2025 schließt das Werk von Audi in Brüssel. Die Schließung kommt nach monatelangen Verhandlungen zwischen Audi, Betriebsräten und Gewerkschaften zustande. Audi nennt schrumpfende Verkaufszahlen des Q8 e-tron und hohe Logistikkosten als Gründe.”
    • Labels: [Economic performance: -1, Market position: -1]
  • Aggregated labels:
    • Reputation Dimension Labels: [Economic performance, Market position]
    • Average Sentiment Label (discretized): -1

123 of 148

Datasets

  • Reputation Dimension Labels (RPL)
  • Average Sentiment Labels (ASL)
  • Sentiment Baseline (GerNTSA): German news data

Data Cleaning and Splitting

  • Removes HTML tags
  • Strips special characters (e.g., @, #, *, $, %)
  • Removes extra spaces
  • Removes duplicate data
  • Low-frequency dimensions (e.g., Management, Strategie) are excluded from the RPL datasets.

Split    GerNTSA    RPL     ASL
Train    3033       1217    1217
Dev      380        153     153
Test     379        152     152
Total    3792       1522    1522

124 of 148

125 of 148

126 of 148

Model Information

Model                       Parameters     Architecture       Type
LLama-2-7b                  7 Billion      Decoder            Multilingual
Flan-T5 Small               77 Million     Encoder-Decoder    Multilingual
Flan-T5 Large               783 Million    Encoder-Decoder    Multilingual
BERT-base-uncased           110 Million    Encoder            Monolingual (English)
BERT-large-uncased          340 Million    Encoder            Monolingual (English)
BERT-base-german-uncased    110 Million    Encoder            Monolingual (German)
XLM-RoBERTa Base            125 Million    Encoder            Multilingual
XLM-RoBERTa Large           255 Million    Encoder            Multilingual

127 of 148

Overview of Project Workflow

128 of 148

Efficiency optimization

Quantization:

  • Quantization: Reduces model size by representing weights with fewer bits, improving efficiency.
  • Usage: Applied 4-bit quantization to optimize performance and reduce resource usage.

Pruning:

  • Pruning: Removes less important model parameters to reduce size and improve efficiency.
  • Usage: Applied L1 unstructured pruning to remove 10% of weights globally.

Focus Models: Llama-2-7b
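A sketch of the two tricks with Hugging Face transformers (bitsandbytes 4-bit loading) and torch.nn.utils.prune (global L1 unstructured pruning at 10%); the model id and settings are assumptions matching the slide, not a verified recipe from the experiments.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization at load time (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # placeholder checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)

# Global L1 unstructured pruning: remove 10% of weights across all Linear layers
# (applied to a separate full-precision copy, since pruning needs real weights).
fp_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
params = [(m, "weight") for m in fp_model.modules() if isinstance(m, torch.nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.10)
for module, name in params:
    prune.remove(module, name)            # make the pruning permanent
```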

129 of 148

Performance optimization

  • Hyperparameter tuning
  • Prompt engineering
  • Systematic prompt optimization (DSPy)
  • Finetuning
  • Learning Classifier

130 of 148

Focus Models: Llama 7b, Flan-t5-small/large, BERT-base-uncased, BERT-large-uncased, BERT-base-german-uncased, XLM-RoBERTa Base, XLM-RoBERTa Large

  • Prompts: Predefined inputs or scenarios provided to the model to guide its output
  • Temperature: Controls randomness in responses (low = focused, high = creative).
  • Top_p: Limits choices to top likely words (e.g., top_p = 0.9 includes top 90%).

We tried the following hyperparameters:

For each prompt, we used the following configurations:
  • temperature: 0.1, 0.4, 0.6, 0.8, 1.0
  • top_p: 0.1, 0.6, 0.8, 1.0

For DSPy: the number of labeled examples was fixed to k=12.

For BERT fine-tuning:
  • lr: 1e-5, 2e-5, 3e-5, 4e-5
  • epochs: 3, 4
  • batch_size: 4, 8, 16, 32, 64

For the MLP:
  • lr: 1e-5, 2e-5, 4e-5, 5e-6, 6e-5
  • number_of_iterations: 6000, 10000, 15000
  • batch_size: 32, 64, 128
  • hidden layers: [512], [512, 1024], [512, 1024, 512]

Prompts:
  • English: prompt1 and prompt3
  • German: prompt2 and prompt4

131 of 148

Prompt Engineering

  1. Research and adaptation: initial prompts were researched online and adapted to align with the task requirements on the dev data.
  2. Experimentation with prompt types:
    • Started with short and simple prompts to see how the model responds.
    • Tried various prompt styles, including sarcasm-based prompts, to explore different ways of generating responses.
    • Experimented with few-shot prompting by providing labelled examples from the train data.
  3. Language testing: checked how well the model works with prompts in both English and German.
  4. Finalization of prompts: after testing and improving, we chose the prompts that worked best for the task.

132 of 148

Prompt examples: sentiment and multilabel classification

133 of 148

Prompt optimization with Declarative Self-improving Python (DSPy):

What is DSPy?

  • A Python library that makes programming easier by letting us define goals, and it handles the rest.
  • It uses machine learning to improve performance over time.

Key Features:

  • Self-improving: Learns and gets better with use.
  • Goal-focused: You tell it what to achieve, and it takes care of the details.
  • Adaptive: Great for building tools that can learn and improve, like decision-making systems.

Labelled Few Shot:

  • We used it in DSPy to optimize prompts by providing a small number of labeled examples (e.g., k=12).
  • These examples act as templates to help the model understand the task better.

How it works:

  • We choose k (e.g., 12 examples).
  • DSPy selects diverse, balanced examples from the dataset.
  • These examples are included in the prompt for better results.

134 of 148

135 of 148

136 of 148

137 of 148

Conclusion

  • BERT Models: Smaller BERT models outperformed larger LLaMA and Flan-T5 models despite their size.
  • MLP Model: Proves ideal for classification tasks due to its low computational requirements.
  • Our results confirm previous observations (Bosley et al., n.d.)
  • Quantization & Pruning: Provide similar results, showing that reducing model size does not impact performance.
  • DSPy: worked well with Flan-T5 (large) but not with Llama.
  • Prompt Performance: Both LLaMA and Flan-T5 perform better with short, precise prompts.
  • Flan-T5: Excels with English prompts but occasionally performs well with German prompts in larger models, suggesting limited German training data.
  • LLaMA-7b: Performs best with German prompts, but English prompts also work effectively for German datasets.
  • Optimal Parameters: Consistently better results achieved with temperature=0.1 and top-p=0.1 while DSPy shows consistent results when increasing the value of k.

138 of 148

Limitations

  • Case study: particular setup
  • Many LLMs were not tested
  • Fine-tuning LLMs was not tested
  • Chain-of-thought prompts were not tested

139 of 148

Limitations

139

140 of 148

Subjectivity and bias

  • LLMs struggle with tasks lacking a single "correct" answer
    • Tasks that require understanding emotions, sentiment, toxicity, and cultural nuances
  • LLMs do not exhibit natural variation, unlike human annotations
  • Produce overly consistent labels across similar items, missing nuanced differences that human labelers would catch
  • Risk of bias in LLM-generated labels, which could lead to biased classifiers if trained on these labels

140

141 of 148

Errors in LLM labelling

  • LLMs tend to favor some class labels due to their frequency in pre-training data.
  • Context-level bias: LLMs tend to prefer the majority and last label of the provided few-shot examples.
  • Domain-level bias: LLMs tend to associate lexical cues with certain class labels.
    • Example: Slang -> toxic, medical terms -> sick.
  • Input preference bias: LLMs tend to favor certain inputs based on surface features such as length.
  • Output structure issues: LLMs may have difficulties in providing structured outputs (e.g., formatted as JSON) or experience performance decrease.

141

142 of 148

Model collapse

  • The use of synthetic data or LLM-labeled data for training could lead to gradual model degradation and potential model collapse over time.
  • Early stages: Loss of information about distribution tails, affecting minority data
  • Late stages: Significant performance decline, concept confusion, and variance loss

142

143 of 148

Other concerns

  • Difficulty in attributing responsibility for LLM labels raises concerns about misuse and harm.
  • LLMs may lack the knowledge required for domain-specific tasks. Only human domain experts can bring relevant expertise.
  • Using API-level LLMs for labeling sensitive data may raise privacy and security issues.

143

144 of 148

Hands-on session: Hybrid data annotation

144

145 of 148

Setup and Problem to solve

Dataset:

  • “Debagreement: Reddit 50K”
  • agree, disagree, neutral, unsure
  • 3 annotators and >⅔ agreement label

Convert to a single-annotator dataset:

  • Corrupt each label with probability (1 - agreement)
  • We obtain single-annotator labels and a 3-annotator golden set (see the sketch below)

Plan and Goal:

  • Fine-tune an LLM to predict labels
  • Evaluate quality
  • Find the best ratio / samples to re-label with humans
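A sketch of the corruption step described above; the tiny DataFrame is a toy placeholder standing in for the real gold labels and per-item agreement values, and the label set matches the slide.

```python
import numpy as np
import pandas as pd

LABELS = ["agree", "disagree", "neutral", "unsure"]
rng = np.random.default_rng(42)

def corrupt(label: str, agreement: float) -> str:
    """Flip the gold label to a different one with probability (1 - agreement)."""
    if rng.random() < 1.0 - agreement:
        return rng.choice([l for l in LABELS if l != label])
    return label

# Toy placeholder frame; the real data would hold the gold labels and agreement scores.
df = pd.DataFrame({"label": ["agree", "disagree", "neutral"],
                   "agreement": [1.0, 0.67, 0.8]})
df["single_annotator_label"] = [corrupt(l, a) for l, a in zip(df["label"], df["agreement"])]
print(df)
```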

145

146 of 148

Tutorial overview

14:00 Session 1 – Ekaterina (Katya) & Natalia

  • Introduction
  • Synthetic data generation
  • Active learning with LMs
  • Quality control and managing human workers
  • QA Session 1

15:30 Coffee break

  • QA Session 1.5

16:00 Session 2 – Dominik & Konstantin

  • Hybrid pipelines
  • LM Workflows
  • Limitations
  • Hands-on session: Hybrid data annotation
  • QA Session 2

146

147 of 148

QA session 2

147

148 of 148

Talk to us

Tutorial web page

Toloka Research Fellowship program

148