
ML at Veriff: What is possible in a year?


Taivo Pungas — Automation Lead


Our Product

Follow instructions given.

Show face and document.

Get verified!


This is Veriff

Established in October 2015

Y Combinator alumni

$8.3M in total funding

>300 people and growing


Automation at Veriff


A machine

photos, video → decision-making machine → yes/no, data


A hybrid machine

photos, video → decision-making machine (humans + algorithms) → yes/no, data
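A minimal sketch of the hybrid idea, with illustrative names and thresholds (not Veriff's actual code): let the algorithm decide when it is confident, and route to a human otherwise.

from dataclasses import dataclass

@dataclass
class Decision:
    verdict: str      # e.g. "approved", "declined", or a human reviewer's verdict
    decided_by: str   # "algorithm" or "human"

def decide(session, model, human_queue, threshold=0.98):
    # `model` and `human_queue` are hypothetical stand-ins for a trained
    # classifier and a human review queue.
    p_ok = model.predict(session)          # probability the session is genuine
    if p_ok >= threshold:
        return Decision("approved", "algorithm")
    if p_ok <= 1 - threshold:
        return Decision("declined", "algorithm")
    return Decision(human_queue.review(session), "human")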


Algorithms: fast, scalable, easy to modify, infinite nuance

Humans: slow, hireable, trainable, limited memory


Algorithms: goal is human-level, long lead time to production, make stupid mistakes, need to see data

Humans: accurate, onboard fast, have common sense, generalise easily


What’s special?

Each request is high value

Research-iness: high uncertainty about effort & impact

Collaborating with automation ops: data annotation & QA


What isn’t special?

Still have SLAs

Still care about quality, security, ...


What we build


Example: faces

faces: [
  {
    location: [266, 402, 441, 229],
    embedding: [...]
  }
]
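The location field looks like a face bounding box. A hedged sketch of consuming it, assuming (top, right, bottom, left) ordering (the convention used by e.g. the face_recognition library; the slide itself does not specify the ordering):

from PIL import Image

def crop_face(image: Image.Image, location: list[int]) -> Image.Image:
    # Assumed ordering: (top, right, bottom, left). Unconfirmed by the slide.
    top, right, bottom, left = location
    return image.crop((left, top, right, bottom))  # PIL expects (left, top, right, bottom)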


Example: specimens

{
  version: v19,
  top1_class: EE_ID_011,
  top2_class: EE_ID_012,
  top3_class: FI_ID_026,
  top1_probability: 0.9719,
  top2_probability: 0.0269,
  top3_probability: 0.0001
}
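A minimal sketch of how a response like this could be assembled from a classifier's softmax output; the function name and the round-to-4-decimals choice are illustrative, not Veriff's actual service:

import numpy as np

def specimen_response(probs: np.ndarray, class_names: list[str], version: str = "v19") -> dict:
    top = np.argsort(probs)[::-1][:3]        # indices of the 3 most probable classes
    response = {"version": version}
    for rank, idx in enumerate(top, start=1):
        response[f"top{rank}_class"] = class_names[idx]
        response[f"top{rank}_probability"] = round(float(probs[idx]), 4)
    return response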


ML-ness axis: software 1.0 to 2.0

Normal software (“Not ML”, ~70%): hardcoded if-else, database lookups, one-off annotation

Hybrid algorithms (“Barely learning”, ~20%): classical CV, manual optimisation

Deep learning (“Aww yeah”, ~10%): pretrained / API, in-house trained


Algorithm-related tickets

Make <service> faster

Train document classification model on new dataset

Investigate <image algorithm> false positives for unstable images

Review annotation quality for <task>

Build <edge case exception> for <document>

Algorithms

Microservices

Statistical testing

Annotation tools

Algorithm debugging tools

Middleware

Storage

Business logic implementation

Monitoring

Deep learning

30 of 61

A hybrid machine

humans

algorithms

decisionmaking machine

photos,�video

yes/no,

data


Data annotation


[image slides: annotation examples, x1000]

Data annotation

Annotation is a Product problem: the annotators are your users.

=> Educate them

=> Provide tools, for two regimes (see the sketch below):

For speccing a task: ~100 examples per class

For training a model: maximum volume (10-100k), balanced (task-specific)
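A hedged sketch of those two sampling regimes; the helper names and the label_fn interface are illustrative:

import random
from collections import defaultdict

def group_by_class(items, label_fn):
    by_class = defaultdict(list)
    for item in items:
        by_class[label_fn(item)].append(item)
    return by_class

def sample_for_spec(items, label_fn, per_class=100):
    # ~100 examples per class: enough to spec the task.
    by_class = group_by_class(items, label_fn)
    return {c: random.sample(v, min(per_class, len(v))) for c, v in by_class.items()}

def sample_for_training(items, label_fn, max_total=100_000):
    # Maximum volume, balanced across classes, capped at max_total.
    by_class = group_by_class(items, label_fn)
    per_class = max_total // max(1, len(by_class))
    sample = []
    for v in by_class.values():
        sample.extend(random.sample(v, min(per_class, len(v))))
    return sample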


How? Team.


Team today

7 in Data Science

7 in Software Engineering

3 in Data Annotation (mgmt only)

2 in Product

+ external (QA, DevOps, doc research, ...)

2018 (ML team): part of Engineering, 2 ML engineers. Reality: 1 ML engineer, 1 PO/recruiter.

early 2019 (a8n team): separate team, 7 DS / engineers. Reality: significant support from other engineering teams.

now (a8n teams): 2+ product teams, 20 people. Reality: independent roadmap and consistent delivery.


Hiring — challenges

Experience in data annotation?

CV / ML engineer, >0 years experience

Senior backend engineers, Python


Discussion


Impact

Direct cost & SLA impact

New clients unlocked (scale)

New external product and many internal features


Engineering challenges

Executing the compute graph in two modes (see the sketch after this list)

Edge case complexity

Edge vs server inference

Data scientist ≠ software engineer
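The slide does not say which two modes; a common pair is inline (synchronous) versus parallel (worker-pool) execution. A toy sketch under that assumption, using Python's graphlib; the node functions and graph shape are illustrative:

from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_graph(graph, funcs, mode="sync"):
    # graph: node -> set of dependency nodes
    # funcs: node -> callable taking a dict of results computed so far
    ts = TopologicalSorter(graph)
    ts.prepare()
    results = {}
    with ThreadPoolExecutor() as pool:
        while ts.is_active():
            ready = ts.get_ready()            # nodes whose dependencies are done
            if mode == "sync":
                for node in ready:
                    results[node] = funcs[node](dict(results))
            else:                             # run each ready wave in parallel
                futures = {n: pool.submit(funcs[n], dict(results)) for n in ready}
                for n, fut in futures.items():
                    results[n] = fut.result()
            ts.done(*ready)
    return results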


/ taivop


taivo.pungas@veriff.com

Thank you!


Data is the spec: iteratively solving complex problems


Instead of coming up with a good general solution, it is better to focus on solving specific cases of the problem.


Instead of trying to solve all problematic cases right away, it is better to address a proportion of the easiest cases and repeat the process multiple times.


Instead of writing a good specification of a solution, it is better to curate a collection of good problem cases. This collection of problem cases then essentially becomes your specification.
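One way to read "the collection becomes the specification": the curated cases double as a regression suite for the pipeline. A sketch with a hypothetical pipeline.extract_name API (not Veriff's actual code):

def check_curated_cases(pipeline, curated_cases):
    # curated_cases: list of (document_image, expected_name) pairs
    failures = []
    for image, expected in curated_cases:
        got = pipeline.extract_name(image)   # hypothetical pipeline call
        if got != expected:
            failures.append((expected, got))
    assert not failures, f"{len(failures)}/{len(curated_cases)} curated cases regressed"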


Example: reduce mistakes in extracted names

On <country> <document>.



dataisspec.github.io