Labelling with LLM and Human-in-the-Loop
Ekaterina Artemova, Akim Tsvigun, Dominik Schlechtweg
Natalia Fedorova, Sergei Tilga, Konstantin Chernyshev, Boris Obmoroshev
1
TALK TO US
TUTORIAL PAGE
About us
2
Natalia Fedorova
Toloka Partnership Manager, Toloka
Boris Obmoroshev
Toloka AI R&D and Analytics Director, Toloka
Sergei Tilga
Head of R&D, Toloka
Ekaterina Artemova
Machine Learning Researcher, Toloka
Konstantin Chernyshev
Machine Learning Researcher, Toloka
Akim Tsvigun
Natural Language Processing Lead @ Nebius AI, University of Amsterdam
Dominik Schlechtweg
Research Group Lead, Universität Stuttgart
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
3
Introduction
4
Data is the new oil
Clive Humby
5
Data needs in NLP
Raw data
This is not the focus of this tutorial!
Labelled data
6
Many applications require complex annotation
7
Can GPT-4 Identify Propaganda? Annotation and Detection of Propaganda Spans in News Articles (Hasanain et al., LREC-COLING 2024)
Data annotation pipeline
8
01. Frame the target task
Problem formulation: Does this text pose a risk of harm?
02. Conceptualize the problem
The annotation schema defines labels, how they should be applied, and how complex cases should be treated.
03. Quality control
Choose an aggregation rule, control inter-annotator agreement, train a baseline model.
04. Instruct annotators
Annotation guidelines are provided to the annotators to label raw data and to guide annotation decisions.
Labelling with humans: Example
Please read each text carefully and classify it based on the dominant emotion it expresses: love, anger, fear, sadness, surprise, or joy. If the tweet does not clearly convey one of these emotions, or if it is purely factual or neutral, mark it as neutral.
9
Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.
i was ready to meet mom in the airport and feel her ever supportive arms around me
love
anger
fear
sadness
surprise
joy
neutral
Labelling with humans: Budget estimation
10
Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.
i was ready to meet mom in the airport and feel her ever supportive arms around me
love
anger
fear
sadness
surprise
joy
neutral
Labelling with an LLM
11
Saravia, Elvis, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. "CARER: Contextualized affect representations for emotion recognition." EMNLP. 2018.
Given the following text, classify the emotion expressed in the message into one of the following categories: anger, fear, joy, love, sadness, or surprise. If the text does not express a clear emotion, classify it as neutral.
Text: i was ready to meet mom in the airport and feel her ever supportive arms around me
Emotion: <Select the appropriate emotion category>
love
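To make this concrete, here is a minimal sketch of scripting the same zero-shot labeling prompt (assuming the openai Python client; the model name is an illustrative choice, not prescribed by the tutorial):

```python
# Minimal sketch of zero-shot LLM labeling; model name and fallback
# behavior are assumptions for illustration.
from openai import OpenAI

LABELS = ["anger", "fear", "joy", "love", "sadness", "surprise", "neutral"]
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_emotion(text: str) -> str:
    prompt = (
        "Given the following text, classify the emotion expressed in the "
        "message into one of the following categories: anger, fear, joy, "
        "love, sadness, or surprise. If the text does not express a clear "
        f"emotion, classify it as neutral.\n\nText: {text}\nEmotion:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().lower()
    # Fall back to neutral if the model answers outside the label set.
    return answer if answer in LABELS else "neutral"

print(label_emotion("i was ready to meet mom in the airport and feel "
                    "her ever supportive arms around me"))  # expected: love
```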
LLMs do the work, humans fix the oops
Labeling with LLMs
Solutions
12
The slide title is generated with ChatGPT
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
13
Synthetic data generation
14
Synthetic data generation
15
A Powerful and Thought-Provoking Masterpiece
Oppenheimer is a stunning portrayal of J. Robert Oppenheimer's life and the creation of the atomic bomb. Cillian Murphy delivers a brilliant performance, capturing the complexity of the man behind history's most pivotal invention. Christopher Nolan's direction, paired with breathtaking visuals and a haunting score, makes every moment gripping.
The film masterfully balances science, ethics, and personal struggles, leaving you reflecting long after the credits roll. With a stellar cast and unforgettable storytelling, Oppenheimer is a must-watch for anyone who loves powerful, impactful cinema.
Suppose you are writing movie reviews for the IMDb platform. Write a positive review of the film Oppenheimer.
Synthetic data generation
The main idea is to generate synthetic data in a way that mimics the characteristics and features of real-world data.
16
Benefits
Limitations
Requirements for synthetic data
17
LLMs for synthetic data generation
Zero-shot Generation
18
Generate a sentence that conveys a {label} sentiment.
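A minimal sketch of how this zero-shot generation prompt could be scripted (assuming the openai client; model name and label set are illustrative):

```python
# Sketch: zero-shot synthetic data generation with the prompt above.
from openai import OpenAI

client = OpenAI()

def generate_example(label: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "user",
                   "content": f"Generate a sentence that conveys a {label} sentiment."}],
        temperature=1.0,  # higher temperature encourages diverse outputs
    )
    return resp.choices[0].message.content.strip()

synthetic = [(generate_example(lbl), lbl)
             for lbl in ("positive", "negative") for _ in range(3)]
```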
LLMs for synthetic data generation
Few-shot Generation
19
You are given a data entry consisting of a sentence and its sentiment label. Sentence: {sentence}, Label: {label}. Generate a similar data entry and output both a sentence and a label.
LLMs for synthetic data generation
Hierarchical Generation
20
Given the topic {topic}, generate a sentence that conveys a {label} sentiment.
Generate a topic that can be used to create sentences with distinct {label} sentiment. The topic should be broad enough.
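A sketch of the two-stage hierarchical generation loop implied by these prompts (same assumptions as before: openai client, illustrative model name):

```python
# Sketch: hierarchical generation, first sample a topic, then condition
# the sentence on it to increase diversity.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float = 1.0) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()

def hierarchical_example(label: str) -> tuple[str, str]:
    topic = ask(f"Generate a topic that can be used to create sentences "
                f"with distinct {label} sentiment. The topic should be "
                f"broad enough. Answer with the topic only.")
    sentence = ask(f"Given the topic {topic}, generate a sentence that "
                   f"conveys a {label} sentiment.")
    return topic, sentence
```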
Validation techniques
21
Case study 1: How does the task subjectivity affect the performance?
Key Findings:
• Subjectivity negatively affects model performance on synthetic data
• Performance gap increases for highly subjective tasks
• Models perform better on less subjective instances within tasks
22
Case study 1: How does the task subjectivity affect the performance?
Methodology:
• Used GPT-3.5-Turbo for synthetic data generation
• Explored zero-shot and few-shot generation settings
• Evaluated across 10 text classification tasks
• Conducted crowdsourced studies to determine task-level subjectivity
23
Case study 2: How does increasing diversity of synthetic data affect performance?
Key findings:
24
Case study 2: How does increasing diversity of synthetic data affect performance?
Diversification approaches:
Human interventions:
25
Case study 3: To which extent can synthetic data replace human annotation?
Key findings:
26
Ashok, Dhananjay, and Jonathan May. "A Little Human Data Goes A Long Way." arXiv preprint arXiv:2410.13098 (2024).
Case study 3: To which extent can synthetic data replace human annotation?
Methodology
27
Ashok, Dhananjay, and Jonathan May. "A Little Human Data Goes A Long Way." arXiv preprint arXiv:2410.13098 (2024).
Understanding bias in synthetic data
Synthetic data can perpetuate, amplify, and introduce biases due to several factors
28
Chim, Jenny, Julia Ive, and Maria Liakata. "Evaluating Synthetic Data Generation from User Generated Text." Computational Linguistics (2024): 1-44.
Understanding bias in synthetic data
Synthetic data can perpetuate, amplify, and introduce biases due to several factors:
29
Chim, Jenny, Julia Ive, and Maria Liakata. "Evaluating Synthetic Data Generation from User Generated Text." Computational Linguistics (2024): 1-44.
Conclusion: Synthetic data for text classification
Effectiveness
Diversity Impact
Best Practices
30
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
31
Active learning with LMs
32
Active learning with LMs
33
Introduction to active learning
34
Settles, B.: “Active learning: Synthesis lectures on artificial intelligence and machine learning”. Long Island, NY: Morgan & Clay Pool, 2012.
How to select texts for annotation?
35
* Using the BGE-ICL model, one of the SOTA models for text embeddings
AL strategies in text classification
36
ALToolbox: A Set of Tools for Active Learning Annotation of Natural Language Texts (Tsvigun et al., EMNLP 2022)
AL strategies in text classification
37
↑: higher values stand for higher priority; ↓: lower values stand for higher priority
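For reference, the three classic uncertainty-based strategies can be computed from predicted class probabilities in a few lines of NumPy (a sketch; variable names are ours):

```python
# Sketch: uncertainty-based query strategies over a batch of predicted
# class probabilities (shape: [n_samples, n_classes]).
import numpy as np

def least_confidence(probs: np.ndarray) -> np.ndarray:
    # ↑ higher = more uncertain = higher priority
    return 1.0 - probs.max(axis=1)

def margin(probs: np.ndarray) -> np.ndarray:
    # ↓ smaller margin between top-2 classes = higher priority
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(probs: np.ndarray) -> np.ndarray:
    # ↑ higher entropy = more uncertain = higher priority
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

probs = np.array([[0.9, 0.05, 0.05],   # confident -> low priority
                  [0.4, 0.35, 0.25]])  # uncertain -> high priority
query_order = np.argsort(-entropy(probs))  # most uncertain first
```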
Generative active learning
38
Generative active learning
39
Advantages
Challenges
Case study: How much cost can active learning save?
Key Findings:
40
Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers (Schröder et al., ACL 2022)
Case study: How much cost can active learning save?
Methodology:
41
Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers (Schröder et al., ACL 2022)
Conclusion: Active learning
Effectiveness
Generalization
Best Practices
42
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
43
Quality control in human labeling
44
What is quality control in human annotation?
45
What is quality control in human annotation?
46
Management triangle
47
QUALITY
COST
TIME
GOOD + QUICK = EXPENSIVE
GOOD + CHEAP = SLOW
QUICK + CHEAP = POOR QUALITY
Methods of Quality control
Before task performance
Within task performance
After task performance
48
Before task performance
49
Selection of annotators
50
Onboarding and exams
51
Team roles
To achieve the best quality in labeling, it is better to create a cross-functional team.
Team-leads/Subject experts:
52
Team roles
The solution engineer is a technical lead who is in charge of setting up the pipeline, implementing automation tools, and using quality control instruments.
The supply manager is in charge of finding relevant annotators and communicating with them. The supply manager is also responsible for time- and cost-efficient labeling.
Annotators/subject matter experts are in charge of doing the task, providing feedback, and following instructions.
53
Pipeline setup
54
Within task performance
55
56
ML methods for human labeling
57
ML methods for human labeling: Co-pilot features increase the productivity of experts
58
Anti-fraud rules
Fraud prevention is built into the data pipeline from start to finish to guarantee authentic human effort and expertise:
59
2. Operational methods
60
Control tasks/honeypots
Tasks with a known correct answer, shown to performers to evaluate their performance.
61
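A minimal sketch of scoring annotators against control tasks and flagging low performers (the data layout and the 0.8 threshold are illustrative assumptions):

```python
# Sketch: evaluating annotators on control tasks (honeypots).
from collections import defaultdict

def honeypot_accuracy(answers, gold):
    """answers: list of (annotator_id, task_id, label); gold: {task_id: label}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for annotator, task, label in answers:
        if task in gold:                 # only score known control tasks
            totals[annotator] += 1
            hits[annotator] += int(label == gold[task])
    return {a: hits[a] / totals[a] for a in totals}

acc = honeypot_accuracy(
    [("ann1", "t1", "spam"), ("ann1", "t2", "ham"), ("ann2", "t1", "ham")],
    gold={"t1": "spam", "t2": "ham"},
)
flagged = [a for a, score in acc.items() if score < 0.8]  # threshold is a choice
```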
Motivation in human annotator work
62
After task performance
63
Data acceptance and working with data
64
Feedback loops
65
Summary
66
Managing human workers
67
Overview
68
Best practices in annotation guidelines
69
Psychological characteristics of doing annotation tasks
70
Framing effect
71
The size of the red dot “varies” depending on the context. Even when we're aware of the illusion, we can't help but perceive it. It is an illustration of the power of framing.
Framing effect
72
The size of the red dot “varies” depending on the context. Even when we're aware of the illusion, we can't help but perceive it. It is an illustration of the power of framing.
Attention
73
Thinking
74
Memory
75
Inter-annotator agreement
People tend to perceive things subjectively, even professionals.
How to improve inter-annotator agreement?
76
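One standard way to quantify pairwise agreement is Cohen's kappa, e.g. via scikit-learn (a sketch with toy labels; for more than two annotators, Fleiss' kappa or Krippendorff's alpha are common alternatives):

```python
# Sketch: measuring pairwise inter-annotator agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["joy", "anger", "neutral", "joy", "fear"]
annotator_b = ["joy", "anger", "neutral", "love", "fear"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0 = chance-level
```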
Communication with the annotators
Effective communication is a crucial part of every team's success.
77
Summary
78
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
79
QA session 1
80
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
81
Talk to us
Tutorial web page
Toloka Research Fellowship program
82
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
83
Hybrid pipelines
84
Introduction: Problem Types
We consider two primary labeling directions:
Tutorial Focus:
85
Introduction: Roadmap
86
Research surveys:
Putting Humans in the Natural Language Processing Loop: A Survey (Wang et al., HCINLP 2021)
A survey of human-in-the-loop for machine learning (Wu et al., 2022)
Example
87
Preferred Response
Response B
Evaluate the provided responses to determine which is more helpful to the user based on the given query. Explain your reasoning using a chain-of-thought approach; compare the level of detail, clarity, and helpfulness of each response.
Query: “How do I reset my password?”
Introduction: Why and What?
Problem:
Goal:
General hybrid setup: apply auto-labelling first, then refine with human effort; the exact design can vary.
88
LLMs & humans: The perfect duo for data labeling (Tilga, 2024)
Introduction: General Hybrid Setup
89
[Flow diagram] LLM pass → estimate confidence (e.g., is confidence > threshold?).
High confidence: directly accept.
Low confidence: human labeling and overlap → accept w/ overlap → update models.
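A minimal sketch of this routing loop (the threshold, overlap count, and function names are illustrative assumptions):

```python
# Sketch of the hybrid routing loop: accept confident LLM labels,
# route the rest to overlapping human annotation.
THRESHOLD = 0.9  # assumption: tuned on a held-out golden set in practice

def hybrid_label(item, llm_label, llm_confidence, request_human_labels):
    if llm_confidence > THRESHOLD:
        return llm_label                      # directly accept the LLM label
    # Risky case: collect overlapping human labels and aggregate them.
    human_labels = request_human_labels(item, overlap=3)
    votes = human_labels + [llm_label]        # LLM acts as one more voter
    return max(set(votes), key=votes.count)   # simple majority vote
```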
Introduction: Tasks / Modalities
Task Types (classification):
Modalities:
90
Auto labeling: How to measure quality
We need to measure the “quality” of the pipeline.
There are several ways to define it:
TL;DR: have a held-out golden set (labeled by trusted experts / with overlap).
91
Auto labeling: Prompt Engineering Basics
Simplest method. Take a ready-made LLM, add a few examples and hope for the best.
Tricks to increase performance:
92
Let Me Speak Freely? A Study On The Impact Of Format Restrictions On Large Language Model Performance. (Tam et al., EMNLP 2024)
Your task is …
Here are criteria …
Here are examples …
Think step-by-step
I will evaluate …
Let’s think step-by-step
Criteria 1, Criteria 2…
The final answer is X
Auto labeling: Prompt Engineering
Pros:
Cons:
93
Auto labeling: Prompt Engineering Cost
Commercial vs. Open-Source LLMs:
GPT: 5k samples × $0.06 per sample = $300
Qwen 2.5 72B: 0.5–1.0 h run time for 5k samples × $10/h for H100 × 2 GPUs = $20–$40
Trade-offs:
94
Auto labeling: Finetuning Basics
Take a ready-made (or base) LLM and a set of golden examples (labeled by trusted experts) and finetune the desired classifier.
Tips and Tricks:
95
LoRA: Low-Rank Adaptation of Large Language Models (E. Hu et al., 2021)
Auto labeling: Finetuning
Method 1: Tune LLM with LM SFT
Method 2: �Adding a New Classification Head
96
[Diagram. Method 1: Input → LLM layers (+LoRA) → LM Head → Output. Method 2: Input → LLM layers (+LoRA) → Classification head → Output.]
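A sketch of Method 2 using HuggingFace transformers and peft (the base model and LoRA hyperparameters are illustrative assumptions, not the tutorial's prescribed setup):

```python
# Sketch: classification head + LoRA adapters on a decoder LLM.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B"  # assumption: any decoder LLM with a HF checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)
if model.config.pad_token_id is None:       # classification needs padding
    model.config.pad_token_id = tokenizer.pad_token_id

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="SEQ_CLS")
model = get_peft_model(model, lora)  # only LoRA + head weights are trainable
model.print_trainable_parameters()
# ...then train as usual, e.g. with transformers.Trainer on the golden set.
```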
Auto labeling: Increasing Efficiency Post-tuning
Qwen 2.5 72B finetuning: 2h for training × $10/h for H100 × 2 GPUs = $40
BUT data for finetuning: 1k samples × 10 min per sample × $60/h expert time = $10k
97
A Survey on Efficient Inference for Large Language Models (Zhou et al., 2024)�Efficient Large Language Models: A Survey (Z Wan et al., 2023)�Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding (Ryu and Kim, 2024)
Auto labeling: Problem
Most of these problems can be mitigated with careful human labeling on top.
98
Confidence Estimation: Why It’s Crucial
Key Concept: Confidence directs the labeling workflow
Benefits:
99
How Certain is Your Transformer? (Shelmanov et al., 2021)�Uncertainty estimation of transformer predictions for misclassification detection (Vazhentsev et al. 2022)
Confidence Estimation: In Text Generation
Token-Level Probabilities:
Multiple Prompts / Variance Check:
Calibration Challenges:
100
S = <assistant/> The best response is A   (context + prediction)
P(S) = P(<assistant/> | prompt) · … · P(A | prompt, <assistant/> The best response is)
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models (X. Wang et al., 2024)�Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions (Pezeshkpour and Hruschka, 2023)
Confidence Estimation: In Classification
Softmax Probabilities:
�Calibration Methods:
101
The Role of Uncertainty Quantification for Trustworthy AI (J. Deuschel et al., 2024)�Classifier Calibration: A survey on how to assess and improve predicted class probabilities (T Silva Filho et al., 2021)
Calibration in Deep Learning: A Survey of the State-of-the-Art (C. Wang, 2023)
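One widely used calibration method is temperature scaling: fit a single scalar T on held-out logits so that softmax(logits / T) is better calibrated. A PyTorch sketch, assuming validation logits and gold labels are available:

```python
# Sketch: temperature scaling for post-hoc calibration.
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: [n, k] validation logits; labels: [n] gold class ids."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def nll():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(
            logits / log_t.exp(), labels)   # exp keeps T positive
        loss.backward()
        return loss

    opt.step(nll)
    return log_t.exp().item()

# calibrated_probs = softmax(test_logits / T)  ->  more reliable routing
```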
Confidence Estimation: Advanced Calibration
Convert scores to probabilities: better calibration → more accurate routing.
102
Aggregating: Combining Model & Human Labels – Basics
Goal: Identify when to trust the model vs. when to request human assistance, i.e., find the right re-labeling ratio.
Strategies:
Note: Humans make mistakes too; this is covered in the Human Labeling part of the tutorial.
103
Aggregating: Combining Model & Human Labels – Basics
104
[Diagram: three aggregation strategies.
Simple thresholding: sample → LLM pass → is confidence > threshold? → directly accept, else human labeling.
Overlap: LLM pass + human labeling → weighted blending → accept.
Meta-classifier: sample → meta estimator → likely OK: LLM labeling; risky: human labeling.]
Aggregating: Threshold-Based
Select a simple threshold to find easy-for-LLM cases while deferring uncertain items.
Method: If LLM confidence ≥ T → Accept model label. Else → Human label.
Pros: Straightforward implementation; development cost-efficient.
Cons: Incorrect threshold choice can lead to suboptimal quality or inflated costs.
105
Aggregating: Overlap-Based
For some uncertain cases it is beneficial to overlap human and model labeling.
Method: Gather both LLM and human labels, combine with:
Pros: More robust final labels; each source acts as a check on the other.
Cons: Increases costs (double labeling); items to overlap must be selected carefully.
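A minimal sketch of weighted blending over overlapping labels (the per-source weights are illustrative; in practice they would come from honeypot accuracy or validation data):

```python
# Sketch: weighted blending of overlapping LLM and human labels.
from collections import defaultdict

def blend(labels_with_weights):
    """labels_with_weights: list of (label, weight) from all sources."""
    scores = defaultdict(float)
    for label, weight in labels_with_weights:
        scores[label] += weight
    return max(scores, key=scores.get)

final = blend([("spam", 0.7),   # LLM, weighted by its confidence
               ("ham",  0.9),   # annotator 1, weighted by honeypot accuracy
               ("ham",  0.8)])  # annotator 2
# -> "ham"
```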
106
Aggregating: Meta-classifier
Let’s directly optimize the human re-labeling with a separate model!
Method: Train a secondary “routing” model on features (e.g., confidence score, topic, content, input complexity) to predict if the LLM's output is likely incorrect.
Pros: More sophisticated than a fixed threshold; can factor in domain-specific cues.
Cons: Requires more labeled data to train the routing model; more complex pipeline.
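A sketch of such a routing model with scikit-learn (the features and risk threshold are illustrative assumptions):

```python
# Sketch: a meta-classifier that predicts whether the LLM label is wrong,
# trained on a small audited set.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per item: [llm_confidence, input_length, n_candidate_labels]
X_train = np.array([[0.95, 120, 2], [0.55, 640, 7], [0.80, 300, 4]])
y_train = np.array([0, 1, 0])  # 1 = LLM label was wrong (needs a human)

router = LogisticRegression().fit(X_train, y_train)

def route(features, risk_threshold=0.3):
    p_wrong = router.predict_proba([features])[0, 1]
    return "human" if p_wrong > risk_threshold else "accept_llm"
```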
107
Aggregating: Conclusion and Recap
It is cheap to use an LLM as a classifier; the challenge is finding the cases to route to human experts.
Rule of thumb (production experience):
108
LLMs & humans: The perfect duo for data labeling (Tilga, 2024)
Aggregating: Schema
109
LLMs & humans: The perfect duo for data labeling (Tilga, 2024)
Balancing: Quality & Automation
Adjust thresholds and overlap to meet cost-quality targets.
Threshold Tuning:
Cost vs. Accuracy:
110
LLMs & humans: The perfect duo for data labeling (Tilga, 2024)
Balancing: Cost Modeling & ROI Analysis
We need to estimate costs, including human labor and compute expenses. Cost modeling ensures the pipeline remains profitable and scalable.
Factors:
Compute Return on Investment (ROI): “saved human effort” vs. “LLM dev/inference fees” (do not forget dev cost). Periodic reviews ensure long-term economic viability.
Manual cost: $50k for 5k labels
Hybrid cost: $10k for a 1k training set + $200-500 for experiments + $10k for overlap
LLM-only cost: $300 for 5k labels
111
(Hidden hybrid costs: + platform, + infrastructure, + held-out set, + failed experiments)
Continuous Quality Assessment
Early detection of issues saves a lot of money.
Ongoing Evaluation:
113
Continuous Model Improving
Systematic iteration raises quality and reduces cost in the long run.
We produce data in batches, so we can use new data to retrain labelling models.
114
Insights from Real-World Projects
115
Case Study: Product Pair Similarity
Context:
Key Approaches:
Lessons Learned:
116
[Figure: accuracy vs. relabel ratio. Blue: LLaMA 70B; red: Qwen-VL 72B (w/ images).]
Conclusion & Next Steps
Key Takeaways:
Go Deeper:
117
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
118
LM Workflows
A case study
119
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
120
Overview
Task
Datasets
Data Cleaning and Splitting
Split | GerNTSA | RPL | ASL |
Train | 3033 | 1217 | 1217 |
Dev | 380 | 153 | 153 |
Test | 379 | 152 | 152 |
Total | 3792 | 1522 | 1522 |
Model | Parameters | Architecture | Type |
Llama-2-7b | 7 Billion | Decoder-only | Multilingual |
Flan-T5 Small | 77 Million | Encoder-Decoder | Multilingual |
Flan-T5 Large | 783 Million | Encoder-Decoder | Multilingual |
BERT-base-uncased | 110 Million | Encoder | Monolingual (English) |
BERT-large-uncased | 340 Million | Encoder | Monolingual (English) |
BERT-base-german-uncased | 110 Million | Encoder | Monolingual (German) |
XLM-RoBERTa Base | 270 Million | Encoder | Multilingual |
XLM-RoBERTa Large | 550 Million | Encoder | Multilingual |
Model Information
Overview of Project Workflow
Efficiency optimization
Quantization:
Pruning:
Focus Models: Llama-2-7b
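A sketch of loading Llama-2-7b with 4-bit quantization via transformers and bitsandbytes to cut inference memory (exact flags may vary across library versions):

```python
# Sketch: 4-bit quantized loading for cheaper inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb,
    device_map="auto",  # place layers across available GPUs
)
```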
Performance optimization
Focus Models: Llama-2-7b, Flan-T5-small/large, BERT-base-uncased, BERT-large-uncased, BERT-base-german-uncased, XLM-RoBERTa Base, XLM-RoBERTa Large
We tried these hyperparameters:
For DSPy: the number of labeled examples was fixed to k = 12.
For BERT finetuning:
lr = 1e-5, 2e-5, 3e-5, 4e-5
epochs = 3, 4
batch_size = 4, 8, 16, 32, 64
For each prompt, we used the following hyperparameter configurations:
temperature = 0.1, 0.4, 0.6, 0.8, 1.0
top_p = 0.1, 0.6, 0.8, 1.0
For MLP:
lr = 1e-5, 2e-5, 4e-5, 5e-6, 6e-5
number_of_iterations = 6000, 10000, 15000
batch_size = 32, 64, 128
hidden_layers = [512], [512, 1024], [512, 1024, 512]
Prompts:
English: prompt1 and prompt3
German: prompt2 and prompt4
Prompt Engineering
Prompt examples: sentiment and multilabel classification
Prompt optimization with Declarative Self-improving Python (DSPy):
What is DSPy?
Key Features:
Labelled Few Shot:
How it works:
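A sketch of how LabeledFewShot is typically used (API details vary across DSPy versions; the model name and field names are illustrative assumptions):

```python
# Sketch: DSPy's LabeledFewShot optimizer with k = 12 labeled demos,
# matching the fixed hyperparameter mentioned above.
import dspy
from dspy.teleprompt import LabeledFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # assumption

trainset = [
    dspy.Example(sentence="Great product!", sentiment="positive")
        .with_inputs("sentence"),
    # ... more labeled examples
]

classify = dspy.Predict("sentence -> sentiment")
optimizer = LabeledFewShot(k=12)           # k labeled demos per prompt
compiled = optimizer.compile(student=classify, trainset=trainset)
pred = compiled(sentence="The delivery was late and the box was damaged.")
```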
Conclusion
Limitations
139
Subjectivity and bias
140
The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation (Pavlovic & Poesio, NLPerspectives 2024)
Errors in LLM labelling
141
Mitigating Label Biases for In-context Learning (Fei et al., ACL 2023)
Model collapse
142
AI models collapse when trained on recursively generated data (Shumailov et al., 2024)
Other concerns
143
Hands-on session: �Hybrid data annotation
144
Setup and Problem to solve
Dataset:
Convert to a single-annotator dataset:
Plan and Goal:
145
Tutorial overview
14:00 Session 1 – Ekaterina (Katya) & Natalia
15:30 Coffee break
16:00 Session 2 – Dominik & Konstantin
146
QA session 2
147
Talk to us
Tutorial web page
Toloka Research Fellowship program
148