Embracing Error to Enable Rapid Crowdsourcing
Ranjay Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, Michael Bernstein
Stanford University
- Snow et al. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks
- Parameswaran and Marcus. Crowdsourced data management: industry and academic perspectives
Hundreds of thousands of microtasks launched per day
Business
Research
- Josephy et al. CrowdScale 2013: Crowdsourcing at Scale workshop report.
Techniques for making labeling faster and cheaper are not scaling as quickly as the volume of data that must be labeled
High speed and high precision
Is this a man riding a motorcycle?
Binary annotation is one of the most common microtasks
Ipeirotis et al. Analyzing the Amazon Mechanical Turk Marketplace
Yes
No
Workers take 1.7 seconds per image
My current research project requires annotating 100 million images with 50 binary annotations each.
2.36 million hours of work = $14.16 million
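The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that the 1.7 seconds per image applies to every binary annotation, and it infers a wage of about $6 per hour from the ratio $14.16 million / 2.36 million hours. That wage is an inference, not a number stated in the talk.

```python
# Back-of-the-envelope cost of labeling with conventional microtasks.
NUM_IMAGES = 100_000_000        # from the slide
ANNOTATIONS_PER_IMAGE = 50      # from the slide
SECONDS_PER_ANNOTATION = 1.7    # from the slide
HOURLY_WAGE_USD = 6.0           # assumption: inferred from $14.16M / 2.36M hours


def labeling_hours(images, per_image, seconds_each):
    """Total worker-hours for one pass over every (image, annotation) pair."""
    return images * per_image * seconds_each / 3600


hours = labeling_hours(NUM_IMAGES, ANNOTATIONS_PER_IMAGE, SECONDS_PER_ANNOTATION)
cost = hours * HOURLY_WAGE_USD
print(f"{hours / 1e6:.2f} million hours, ${cost / 1e6:.2f} million")
```

The unrounded cost comes to roughly $14.2 million; the slide's $14.16 million matches multiplying the rounded 2.36 million hours by $6.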
Krishna et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
- Irani et al. Turkopticon: Interrupting worker invisibility in amazon mechanical turk.
- Martin et al. Being a Turker.
- Sheng et al. Get another label? improving data quality and data mining using multiple, noisy labelers
Crowdsourcing platforms punish errors
Crowdworkers do slow, deliberate work
We want to allow workers to go faster and make errors, and even encourage it
We want to design a technique that is tolerant of those errors
RSVP: Rapid Serial Visual Presentation
- Potter et al. 1976. Short-term conceptual memory for pictures
- Fei-Fei et al. What do we perceive in a glance of a real-world scene?
We need to build models that are robust to error
Worker reactions are delayed and noisy…
[Figure: number of reactions (y-axis) over time (x-axis), built up across several slides; labeled as an exgauss (ex-Gaussian) distribution with μ=379ms, σ=92ms]
Workflow using our Gaussian model
[Figure: workflow for the query “Is a man riding a motorcycle?” along a time axis. Reactions from Worker 1 and Worker 2 are shown with the mean delay μ=379ms marked. Callouts on non-matching images read “This is not a man riding a motorcycle.” and “Still not a man riding a motorcycle”. The final panel combines Worker 1 and Worker 2 into a Total.]
By randomizing task ordering and asking multiple workers, our model is able to perform binary classification
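A minimal sketch of why randomized ordering plus several workers can recover the positives, assuming a simplified scoring rule: every keypress is spread over the images shown before it, weighted by a Gaussian on the delay (μ=379ms, σ=92ms from the slides), and the weights are summed across workers. The simulated worker, the planted positives, and the additive scoring are all illustrative assumptions, not the authors' exact model.

```python
import math
import random

MU_MS, SIGMA_MS = 379.0, 92.0   # reaction-delay parameters from the slides
FRAME_MS = 100.0                # each image is on screen for 100 ms


def delay_density(delay_ms):
    """Gaussian density of a keypress arriving delay_ms after an image appears."""
    z = (delay_ms - MU_MS) / SIGMA_MS
    return math.exp(-0.5 * z * z) / (SIGMA_MS * math.sqrt(2 * math.pi))


def score_stream(order, keypress_times_ms, scores):
    """Add each keypress's delay density to every image in this worker's stream."""
    for t in keypress_times_ms:
        for position, image_id in enumerate(order):
            onset = position * FRAME_MS
            scores[image_id] += delay_density(t - onset)


def simulate_worker(image_ids, positives, rng):
    """Hypothetical worker: one keypress per positive image, with a noisy delay."""
    order = image_ids[:]
    rng.shuffle(order)          # randomized task ordering per worker
    presses = [pos * FRAME_MS + rng.gauss(MU_MS, SIGMA_MS)
               for pos, image_id in enumerate(order) if image_id in positives]
    return order, presses


rng = random.Random(0)
image_ids = list(range(100))    # one stream of 100 images
positives = {7, 42}             # made-up ground truth for the simulation
scores = {i: 0.0 for i in image_ids}
for _ in range(5):              # five workers, five different orderings
    order, presses = simulate_worker(image_ids, positives, rng)
    score_stream(order, presses, scores)

top_two = sorted(scores, key=scores.get, reverse=True)[:2]
print(sorted(top_two))
```

Within one stream, a keypress also gives weight to the neighbors of the true positive. Because each worker sees a different order, those neighbors change from worker to worker, so only the true positives tend to accumulate weight consistently.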
For a set of images:
Each worker gives us a set of reactions:
Our goal is to measure the probability of an image being positive:
We assume that each worker reaction is independent:
By asking multiple workers, we calculate which images are positive:
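The equations that accompanied these statements are not present in this copy of the slides. One way to write the model down, as a sketch with notation chosen here rather than taken from the talk: let $I_k$ be the event that image $k$ is positive, $C^w$ the reactions (keypress times) of worker $w$, and $t^w_k$ a hypothetical symbol for the time at which worker $w$ was shown image $k$. The Gaussian form assumes the μ=379ms and σ=92ms from the earlier slides, and the last line additionally assumes that workers are independent of each other.

```latex
% Images and one worker's reactions
\mathcal{I} = \{I_1, \dots, I_n\}, \qquad C^w = \{c^w_1, \dots, c^w_{m_w}\}

% Goal: probability that image k is positive given the reactions (Bayes' rule)
P(I_k \mid C^w) = \frac{P(C^w \mid I_k)\, P(I_k)}{P(C^w)}

% Independence of a worker's reactions
P(C^w \mid I_k) = \prod_{j=1}^{m_w} P(c^w_j \mid I_k)

% Gaussian reaction-delay model (assumed form)
P(c^w_j \mid I_k) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(c^w_j - t^w_k - \mu)^2}{2\sigma^2}\right)

% Combining W workers (assumes workers are independent)
P(I_k \mid C^1, \dots, C^W) \propto P(I_k) \prod_{w=1}^{W} P(C^w \mid I_k)
```

An image would then be labeled positive when this combined probability exceeds a chosen threshold; the threshold itself is not given in the slides.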
Summary of our technique
RSVP Interface
Workflow
Gaussian Model
Evaluation
Binary annotation tasks for images
Study 1
Evaluation criteria: speedup
Control approach: majority voting with 3 workers
1.7s + 1.7s + 1.7s
Total time per image: 5.1s
RSVP, at the same precision:
0.1s + 0.1s + 0.1s + 0.1s + 0.1s
Total time per image: 0.5s
That’s an order-of-magnitude speedup (> 10X)
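The comparison above counts total worker time per image, not time per worker. A short sketch of that arithmetic, assuming the five 0.1s entries on the slide correspond to five RSVP workers each seeing the image once:

```python
def time_per_image(num_workers, seconds_per_worker):
    """Total worker time spent on one image when several workers each see it."""
    return num_workers * seconds_per_worker


control = time_per_image(num_workers=3, seconds_per_worker=1.7)  # majority vote
rsvp = time_per_image(num_workers=5, seconds_per_worker=0.1)     # rapid stream
speedup = control / rsvp
print(f"control {control:.1f}s, RSVP {rsvp:.1f}s, speedup {speedup:.1f}X")
```

This gives 5.1s against 0.5s, a speedup of 10.2X, even though RSVP uses more workers per image.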
Study 1: binary annotations for images
Task order was randomized 5 times for each worker
Each task stream had 100 images
Each image is shown for 100ms
Study 1: recall suffers for long streams
Study 1: training workers
Study 1: eliminating bad work
Binary verification for images: easy (Dog), medium (Man riding motorcycle), hard (Eating breakfast)
| Task | Control Approach: Precision | Control Approach: Recall | Our Approach: Precision | Our Approach: Recall |
| Dog | 0.99 | 0.99 | 0.99 | 0.94 |
| Man riding motorcycle | 0.97 | 0.99 | 0.98 | 0.84 |
| Eating breakfast | 0.93 | 0.89 | 0.90 | 0.74 |
Time for binary verification for images (seconds per image, per worker)
| Task | Control Approach | Our Approach | Speedup |
| Dog | 1.50 | 0.10 | 9.00X |
| Man riding motorcycle | 1.70 | 0.10 | 10.20X |
| Eating breakfast | 1.90 | 0.10 | 11.40X |
NASA task load index
Measures the perceived workload of a task between 0 and 100
Control condition: 58.5 (σ = 9.3)
RSVP: 62.4 (σ = 18.5)
Difference not significant (t(99) = −0.53, p = 0.59)
Targeting Recall
[Figure: recall vs. speedup. Two ways to raise recall at the cost of speedup: slow down workers (100ms → 200ms per image) or hire more workers (5 → 10).]
Study 2: non-image binary annotations
Evaluation
Study 1: Image binary annotation tasks
Study 2: Non-image binary annotation tasks
Study 2: sentiment analysis
4.25 s (control) → 0.25 s (RSVP) per tweet
Study 2: word similarity
Find synonyms for wide
Candidate words: broad, hushing, crunch, short
6.23 s (control) → 0.60 s (RSVP) per word
Study 2: topic detection
Find articles related to “housing”
Sales of previously owned homes dropped 14.5% in January to a seasonally adjusted annual rate of 3.47 mln units, the National Association of Realtors ….
Highlighted keywords: homes, realtors
14.33 s (control) → 2.00 s (RSVP) per article
Study 2: results
| Task | Control Approach: Time (s) | Control Approach: Precision | Our Approach: Time (s) | Our Approach: Precision | Speedup |
| Sentiment Analysis | 4.25 | 0.93 | 0.25 | 0.94 | 10.20X |
| Word Similarity | 6.23 | 0.89 | 0.60 | 0.88 | 6.23X |
| Topic Detection | 14.33 | 0.96 | 2.00 | 0.95 | 10.75X |
Study 3: multi-class classification
Evaluation
Study 1: Image binary annotation tasks
Study 2: Non-image binary annotation tasks
Study 3: Multi-class classification
Study 3: multi-class classification
2,000 images
10 object categories: “people”, “dog”, “horse”, “cat”, etc.
Each category contains between 100 and 250 examples
| Approach | Precision | Recall | Time | Cost | Speedup |
| Control | 0.99 | 0.95 | 102,000s | $170 | - |
| RSVP | 0.98 | 0.83 | 11,700s | $19.50 | 8.70x |
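As a consistency check on the Study 3 numbers, the sketch below compares the time ratio with the cost ratio and derives the hourly rate each row implies. It assumes cost scales linearly with worker time; the hourly rate is derived here and is not stated on the slide.

```python
control_seconds, control_cost = 102_000, 170.00   # from the table
rsvp_seconds, rsvp_cost = 11_700, 19.50           # from the table


def hourly_rate(cost_usd, seconds):
    """Implied pay rate if cost scales linearly with worker time."""
    return cost_usd / (seconds / 3600)


time_ratio = control_seconds / rsvp_seconds
cost_ratio = control_cost / rsvp_cost
print(f"time ratio {time_ratio:.2f}, cost ratio {cost_ratio:.2f}")
print(f"implied rate: control ${hourly_rate(control_cost, control_seconds):.2f}/h, "
      f"RSVP ${hourly_rate(rsvp_cost, rsvp_seconds):.2f}/h")
```

Both ratios come out at about 8.7, in line with the reported speedup, and both rows imply the same rate of roughly $6 per hour.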
Discussion
Successive positive images - at 200ms
- Even if positive items occur one after the other, our model can annotate them
- Even when >50% of images are positive, our model can annotate them
Successive positive images - as we increase speed
[Figure: three panels at 200ms per image and three panels at 100ms per image]
Recall is inversely proportional to the rate of positive items in a task.
Fine-Grained Detection
Sayornis
Gray Kingbird
Influence of typicality
- Iordan et al. Basic level category structure emerges gradually across human ventral visual cortex.
Typicality score: 0.9
Typicality score: 0.1
Check out our demo:
Thank you