1 of 21

The Human Factor

5 November 2015

Language Intelligence

Jana Thompson, NLP Engineer

2 of 21

Machines versus Humans

3 of 21

Human cost goes up, machine cost goes down

Source: John C. McCallum, Wikipedia, Federal Reserve Bank of St Louis. Inflation adjusted to 2011 dollars.

Processing language data

4 of 21

Every machine learning pipeline, ever

  1. Get the data and clean it
  2. Clean it some more (rules)
  3. Create training data
  4. Ensure quality of training data
  5. Build model
  6. Refine features
  7. Create more training data
  8. Create rules to fix output
  9. Apply the predictive analytics

Adaptive Learning

Microtasking

Deep Learning

Active Learning
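
One common form of active learning is uncertainty sampling: the model scores unlabeled documents, and the ones it is least sure about go to human annotators first. The sketch below is illustrative only and is not taken from the talk; the function names, document IDs, and scores are hypothetical, and it assumes a binary classifier that outputs a probability for the positive class.

```python
def uncertainty(prob_positive: float) -> float:
    """Return 1.0 when the model is maximally unsure (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0


def select_for_annotation(scores: dict[str, float], batch_size: int) -> list[str]:
    """Pick the documents whose predicted probability is closest to 0.5."""
    ranked = sorted(scores, key=lambda doc_id: uncertainty(scores[doc_id]), reverse=True)
    return ranked[:batch_size]


# Hypothetical model outputs: probability that a post is about Ford the company.
model_scores = {"doc-a": 0.97, "doc-b": 0.52, "doc-c": 0.10, "doc-d": 0.45}

# The two least certain documents are queued for human annotation.
queue = select_for_annotation(model_scores, batch_size=2)
print(queue)
```

Sending the most ambiguous items to people first is one way to spend a fixed annotation budget where the model is likely to learn the most.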

5 of 21

Positive about Ford

6 of 21

Also positive about Ford…

7 of 21

Will the real Ford car please stand up?

8 of 21

Data beats algorithms; feedback beats data

Results on distinguishing the correct ‘Ford’

Distinguishing “Ford” the company from people called “Ford”

9 of 21

Adaptive System

Human Annotation

Machine Learning

Optimization

Prediction Engine

  • Annotation suggestions
  • Document priority
  • Shortest path for coverage
  • Error detection
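
The slide does not say how error detection works. One plausible reading, shown here purely as an assumption, is to flag human annotations that a confident model disagrees with so that a second person can review them. The `Annotation` type, the 0.9 threshold, and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    doc_id: str
    human_label: str
    model_label: str
    model_confidence: float


def flag_suspect_annotations(annotations, min_confidence=0.9):
    """Return IDs where the human label differs from a high-confidence model label."""
    return [
        a.doc_id
        for a in annotations
        if a.human_label != a.model_label and a.model_confidence >= min_confidence
    ]


# Hypothetical batch: is "Ford" the company or a person?
batch = [
    Annotation("t1", "company", "company", 0.95),  # agreement, not flagged
    Annotation("t2", "person", "company", 0.97),   # confident disagreement, flagged
    Annotation("t3", "person", "company", 0.55),   # model unsure, not flagged
]

suspects = flag_suspect_annotations(batch)
print(suspects)
```

A flag here means "worth a second look", not "the annotator was wrong"; the model may be the one in error.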

10 of 21

Positive, about Ford cars…but relevant?

11 of 21

Idibon’s car-sentiment analytics correlate with actual sales

12 of 21

95% accuracy in identifying people talking about buying cars on social media

13 of 21

Adaptive System

Human Annotation

Machine Learning

Optimization

Prediction Engine

  • Annotation suggestions
  • Document priority
  • Shortest path for coverage
  • Error detection

14 of 21

Annotators aren’t infallible

von Ahn, Luis. 2006. Games With a Purpose. https://www.cs.cmu.edu/~biglou/ieee-gwap.pdf
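
A standard way to quantify how far two annotators can be trusted is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration, not something shown in the talk; the label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random with their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four mentions of "Ford".
annotator_1 = ["company", "company", "person", "person"]
annotator_2 = ["company", "person", "person", "person"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this example the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is 0.5.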

15 of 21

Reliable data => a better model

16 of 21

How does the analyst know when to stop?

17 of 21

People are always going to be central

18 of 21

Machines cluster; humans label

Good

  • Co-workers: 3,845
  • Pay and Opportunities for Advancement: 2,042
  • Management: 490
  • Benefits: 657
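
The division of labor on this slide can be sketched as follows: a clustering step assigns each document a cluster ID, a person names each cluster once, and that name propagates to every member. This is a minimal illustration under assumptions; the cluster assignments, review IDs, and counts are hypothetical and do not reproduce the figures above.

```python
from collections import Counter

# Machine step (assumed already done): each review is assigned a cluster ID.
cluster_of = {"r1": 0, "r2": 0, "r3": 1, "r4": 2, "r5": 1, "r6": 0}

# Human step: a person inspects a few members of each cluster and names it once.
human_labels = {0: "Co-workers", 1: "Management", 2: "Benefits"}

# The human-chosen name propagates to every document in the cluster.
labeled = {doc_id: human_labels[cluster] for doc_id, cluster in cluster_of.items()}
counts = Counter(labeled.values())

for label, size in counts.most_common():
    print(f"{label}: {size}")
```

The person labels three clusters rather than six documents; at the scale on the slide, that saving is what makes the approach practical.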

19 of 21

Machines sort; humans are multilingual

Word cloud terms (as extracted): umukobwa sexuels, medicaments, épidémie, commuement, protegér, ebola, prevention sida, aladie ici, kumenya lyce, droit kwiga, concerne inyigisho

20 of 21

UNICEF uses Idibon to process millions of SMS messages in 12 African languages.

SENDER INTENT

CATEGORIZATION

LANGUAGE DETECTION

LOCATION

21 of 21

In conclusion…

  • Humans are always part of the loop
  • They create the data
  • They interpret the data
  • It’s important for them to be right in the middle, too