1 of 87

Embracing Error to Enable Rapid Crowdsourcing

Ranjay Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, Michael Bernstein

Stanford University

2 of 87

- Snow et al. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks

- Parameswaran and Marcus. Crowdsourced data management: industry and academic perspectives

Hundreds of thousands of microtasks launched per day

Business

Research

3 of 87

Techniques for speeding up and lowering the cost of labeling are not scaling as quickly as the volume of data we are now producing that must be labeled

-- Josephy et al. CrowdScale 2013: Crowdsourcing at Scale workshop report.

4 of 87

High speed and high precision

5 of 87

Binary annotation is one of the most common microtasks

Ipeirotis et al. Analyzing the Amazon Mechanical Turk Marketplace

Is this a man riding a motorcycle?

Yes

No

6 of 87

Workers take 1.7 seconds per image

7 of 87

My current research project requires annotating 100 million images with 50 binary annotations each.

2.36 million hours of work = $14.16 million

Krishna et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
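
The totals on this slide can be reproduced with a few lines of arithmetic. The sketch below uses the 1.7 seconds per image quoted earlier in the deck; the $6 per hour wage is not stated on the slides and is an assumption inferred from dividing $14.16 million by 2.36 million hours.

```python
# Back-of-the-envelope cost of labeling at conventional speed.
IMAGES = 100_000_000          # from the slide
LABELS_PER_IMAGE = 50         # from the slide
SECONDS_PER_LABEL = 1.7       # from the earlier slide
HOURLY_WAGE_USD = 6.00        # ASSUMPTION: implied by 14.16 / 2.36, not stated

total_seconds = IMAGES * LABELS_PER_IMAGE * SECONDS_PER_LABEL
million_hours = round(total_seconds / 3600 / 1e6, 2)

# The slide multiplies the rounded hour count by the wage.
million_usd = round(million_hours * HOURLY_WAGE_USD, 2)

print(f"{million_hours} million hours")
print(f"${million_usd} million")
```

The estimate covers a single judgment per label; asking several workers per label would scale it up accordingly.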

9 of 87

Crowdsourcing platforms punish errors

Crowdworkers do slow, deliberate work

- Irani et al. Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk.

- Martin et al. Being a Turker.

- Sheng et al. Get another label? Improving data quality and data mining using multiple, noisy labelers.

10 of 87

RSVP: Rapid Serial Visual Presentation

We want to allow workers to go faster and make errors, and even encourage them to do so

We want to design a technique that is tolerant of those errors

11 of 87

RSVP: Rapid Serial Visual Presentation

- Potter et al. 1976. Short-term conceptual memory for pictures

- Fei-Fei et al. What do we perceive in a glance of a real-world scene?

21 of 87

We need to build models that are robust to error

Worker reactions are delayed and noisy

22 of 87

[Figure sequence, slides 22-30: histogram of worker reaction delays. x-axis: time; y-axis: number of reactions. Annotations: ex-Gaussian distribution; μ=379ms, σ=92ms]

31 of 87

Workflow using our Gaussian model

32 of 87

[Figure sequence, slides 32-42: workflow timeline for the task “Is a man riding a motorcycle?”; axis: time. Worker 1 panel: μ=379ms marker; callout “This is not a man riding a motorcycle.” Worker 2 panel: μ=379ms marker; callout “Still not a man riding a motorcycle”. Final panels: Worker 1, Worker 2, Total]

43 of 87

By randomizing task ordering and asking multiple workers, our model is able to perform binary classification

46 of 87

For a set of images:

Each worker gives us a set of reactions:

Our goal is to measure the probability of an image being positive:

We assume that each worker reaction is independent:

By asking multiple workers, we calculate which images are positive:
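
The steps on this slide can be sketched in code. This is a minimal illustration rather than the formulation from the talk, whose equations did not survive in this copy: it scores each image by the Gaussian density of the delay between the image's onset and each keypress, using the μ and σ reported earlier, and sums the scores across workers who saw the stream in different random orders. The function names, the summing rule, and the example data are all assumptions.

```python
import math

MU_MS = 379.0     # mean reaction delay, from the slides
SIGMA_MS = 92.0   # standard deviation of the delay, from the slides


def gaussian_pdf(x, mu=MU_MS, sigma=SIGMA_MS):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def score_stream(order, keypress_times_ms, frame_ms=100.0):
    """Score every image one worker saw (hypothetical helper).

    order: image ids in the order shown; image k appears at k * frame_ms.
    keypress_times_ms: times at which this worker reacted.
    Returns {image_id: score}; a score is a sum of densities, not a probability.
    """
    scores = {}
    for position, image_id in enumerate(order):
        onset = position * frame_ms
        scores[image_id] = sum(gaussian_pdf(t - onset) for t in keypress_times_ms)
    return scores


def combine_workers(per_worker_scores):
    """Add up the scores each worker assigned to each image."""
    total = {}
    for scores in per_worker_scores:
        for image_id, s in scores.items():
            total[image_id] = total.get(image_id, 0.0) + s
    return total


# Made-up example: image "d" is the only positive; each worker reacts
# 379 ms after it appears in their own ordering of the stream.
w1 = score_stream(list("abcdefghij"), [679.0])
w2 = score_stream(list("jidhgfecba"), [579.0])
total = combine_workers([w1, w2])
best = max(total, key=total.get)
print(best)
```

Because each worker sees a different ordering, the neighbours that pick up stray score from one worker's keypress differ from worker to worker, while the true positive accumulates score from everyone.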

47 of 87

Summary of our technique

RSVP Interface

Workflow

Gaussian Model

48 of 87

Evaluation

49 of 87

Evaluation

Study 1: Binary annotation tasks for images

53 of 87

Evaluation criteria: speedup

Control approach: majority voting with 3 workers

1.7s + 1.7s + 1.7s

Total time per image: 5.1s

RSVP, at the same precision:

0.1s + 0.1s + 0.1s + 0.1s + 0.1s

Total time per image: 0.5s

That’s an order-of-magnitude speedup (> 10X)

54 of 87

Study 1: binary annotations for images

Task order was randomized for each of the 5 workers

Each task stream had 100 images

Images are shown for 100ms each

55 of 87

Study 1: recall suffers for long streams

57 of 87

Study 1: training workers

  • Qualification rounds
  • Acclimate workers by gradually increasing the speed

58 of 87

Study 1: eliminating bad work

  • Eliminate workers who react continuously even without any positive stimuli.

  • Inject gold-standard images and check that workers detect them.
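
The two filters above could be combined along the following lines. This is a hedged sketch, not the procedure used in the study: the thresholds, the two-sigma tolerance window, and the function names are hypothetical, and only the delay parameters (μ=379ms, σ=92ms) come from the slides.

```python
MU_MS, SIGMA_MS = 379.0, 92.0  # reaction-delay parameters from the slides


def hit_gold(onset_ms, reactions_ms, tolerance_sigmas=2.0):
    """True if some reaction lands in the expected delay window after a gold image."""
    lo = onset_ms + MU_MS - tolerance_sigmas * SIGMA_MS
    hi = onset_ms + MU_MS + tolerance_sigmas * SIGMA_MS
    return any(lo <= t <= hi for t in reactions_ms)


def keep_worker(reactions_ms, gold_onsets_ms, n_images,
                min_gold_recall=0.8, max_reaction_fraction=0.5):
    """Decide whether to keep a worker's stream (thresholds are assumptions)."""
    # Filter 1: reacting to most of the stream regardless of content.
    if n_images and len(reactions_ms) / n_images > max_reaction_fraction:
        return False
    # Filter 2: missing the injected gold-standard positives.
    if not gold_onsets_ms:
        return True
    hits = sum(hit_gold(g, reactions_ms) for g in gold_onsets_ms)
    return hits / len(gold_onsets_ms) >= min_gold_recall


gold = [1000.0, 5000.0]                                  # made-up gold onsets
careful = keep_worker([1380.0, 5400.0, 7000.0], gold, n_images=100)
spammer = keep_worker([float(t) for t in range(0, 10000, 150)], gold, n_images=100)
idle = keep_worker([], gold, n_images=100)
print(careful, spammer, idle)
```

The rate check runs first because a worker who presses the key continuously would land inside every gold window and pass the second filter by accident.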

59 of 87

Binary verification for images (easy, medium, hard)

Task                            Control approach       Our approach
                                Precision  Recall      Precision  Recall
Easy: Dog                       0.99       0.99        0.99       0.94
Medium: Man riding motorcycle   0.97       0.99        0.98       0.84
Hard: Eating breakfast          0.93       0.89        0.90       0.74

62 of 87

Time for binary verification for images (seconds per image)

Task                    Control approach   Our approach   Speedup
Dog                     1.50               0.10           9.00X
Man riding motorcycle   1.70               0.10           10.20X
Eating breakfast        1.90               0.10           11.40X
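
The speedup figures in this table appear consistent with comparing total worker time per image rather than single-worker time: three workers for the majority-vote control and five passes for RSVP, as on the earlier speedup slide. The sketch below checks that reading; the function name and the default worker counts as parameters are assumptions for illustration.

```python
CONTROL_WORKERS = 3  # majority voting with 3 workers (from the slides)
RSVP_WORKERS = 5     # five 0.1s viewings per image (from the slides)


def speedup(control_s, rsvp_s,
            control_workers=CONTROL_WORKERS, rsvp_workers=RSVP_WORKERS):
    """Ratio of total worker time per image, control over RSVP."""
    return (control_s * control_workers) / (rsvp_s * rsvp_workers)


# (control seconds, RSVP seconds) per image, taken from the table.
tasks = {
    "Dog": (1.50, 0.10),
    "Man riding motorcycle": (1.70, 0.10),
    "Eating breakfast": (1.90, 0.10),
}
results = {name: round(speedup(c, r), 2) for name, (c, r) in tasks.items()}
for name, value in results.items():
    print(f"{name}: {value:.2f}X")
```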

63 of 87

NASA Task Load Index (NASA-TLX)

Measures the perceived workload of a task on a scale from 0 to 100

Control condition: 58.5 (σ = 9.3)

RSVP: 62.4 (σ = 18.5)

The difference is not significant (t(99) = −0.53, p = 0.59)

64 of 87

Targeting Recall

65 of 87

[Figure sequence, slides 65-67: recall vs. speedup]

  • Slow down workers: 100ms → 200ms

  • Hire more workers: 5 → 10

68 of 87

Study 2: non-image binary annotations

69 of 87

Evaluation

Study 1: Image binary annotation tasks

Study 2: Non-image binary annotation tasks

70 of 87

Study 2: sentiment analysis

4.25 → 0.25 seconds per tweet

71 of 87

Study 2: word similarity

Find synonyms for “wide”

Example words in the stream: broad, hushing, crunch, short

6.23 → 0.60 seconds per word

72 of 87

Study 2: topic detection

Find articles related to “housing”

Sales of previously owned homes dropped 14.5% in January to a seasonally adjusted annual rate of 3.47 million units, the National Association of Realtors ….

14.33 → 2.00 seconds per article

Highlighted keywords: homes, realtors

74 of 87

Study 2: results

Task                 Control approach        Our approach            Speedup
                     Time (s)   Precision    Time (s)   Precision
Sentiment Analysis   4.25       0.93         0.25       0.94         10.20X
Word Similarity      6.23       0.89         0.60       0.88         6.23X
Topic Detection      14.33      0.96         2.00       0.95         10.75X

75 of 87

Study 3: multi-class classification

76 of 87

Evaluation

Study 1: Image binary annotation tasks

Study 2: Non-image binary annotation tasks

Study 3: Multi-class classification

77 of 87

Study 3: multi-class classification

2,000 images

10 object categories: “people”, “dog”, “horse”, “cat”, etc.

Each category contains between 100 and 250 examples

           Precision   Recall   Time       Cost     Speedup
Control    0.99        0.95     102,000s   $170     -
RSVP       0.98        0.83     11,700s    $19.50   8.70x
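
The time and cost columns can be cross-checked against numbers given elsewhere in the deck. The sketch below assumes that the control condition asks one binary question per image and category, answered by 3 workers at 1.7 seconds each, and that workers are paid $6 per hour; neither assumption is stated on this slide, but both reproduce the reported control time and both cost figures.

```python
SECONDS_PER_HOUR = 3600
HOURLY_WAGE_USD = 6.00  # ASSUMPTION: inferred from the reported costs


def cost_usd(total_seconds, wage=HOURLY_WAGE_USD):
    """Cost of a given amount of worker time at an hourly wage."""
    return total_seconds / SECONDS_PER_HOUR * wage


# ASSUMPTION: 2,000 images x 10 categories x 3 workers x 1.7s per judgment.
control_seconds = 2_000 * 10 * 3 * 1.7
rsvp_seconds = 11_700  # reported on the slide

control_cost = round(cost_usd(control_seconds), 2)
rsvp_cost = round(cost_usd(rsvp_seconds), 2)
time_ratio = control_seconds / rsvp_seconds

print(round(control_seconds), control_cost, rsvp_cost, round(time_ratio, 1))
```

The raw time ratio comes out slightly above the 8.70x shown on the slide, so the reported speedup was presumably computed from unrounded measurements.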

78 of 87

Discussion

79 of 87

Successive positive images - at 200ms

- Even if positive items occur one after the other, our model can annotate them

- Even when >50% of images are positive, our model can annotate them

81 of 87

Successive positive images - as we increase speed

Recall is inversely proportional to the rate of positive items in a task.

[Figure: panels at 200ms and at 100ms per image]

83 of 87

Fine Grained Detection

Sayornis

Gray Kingbird

84 of 87

Influence of typicality

- Iordan et al. Basic level category structure emerges gradually across human ventral visual cortex.

Typicality score: 0.9

Typicality score: 0.1

85 of 87

Embracing Error to Enable Rapid Crowdsourcing

Ranjay Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, Michael Bernstein

Check out our demo:

86 of 87

Embracing Error to Enable Rapid Crowdsourcing

Ranjay Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, Michael Bernstein

Thank you
