ML at Veriff: What is possible in a year?
Taivo Pungas — Automation Lead
Our Product
Follow instructions given.
Show face and document.
Get verified!
This is Veriff
Established in October 2015
Y Combinator alumni
$8.3M in total funding
>300 people and growing
Automation at Veriff
A machine
photos, video → decision-making machine → yes/no, data
A hybrid machine
photos, video → decision-making machine (humans + algorithms) → yes/no, data
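A minimal sketch of how such a hybrid could route each request (the threshold, names, and routing rule are illustrative assumptions, not Veriff's implementation):

    from dataclasses import dataclass

    @dataclass
    class Prediction:
        label: str        # e.g. "approved" or "declined"
        confidence: float

    # Hypothetical threshold; a real system would tune this per decision type.
    AUTO_DECISION_THRESHOLD = 0.98

    def decide(prediction: Prediction, escalate) -> str:
        """Hybrid routing: automate confident cases, send the rest to humans."""
        if prediction.confidence >= AUTO_DECISION_THRESHOLD:
            return prediction.label          # algorithm path: fast, scalable
        return escalate(prediction)          # human path: accurate, common sense

    # Usage with a stub human-review queue:
    print(decide(Prediction("approved", 0.995), escalate=lambda p: "sent_to_human"))  # approved
    print(decide(Prediction("approved", 0.70), escalate=lambda p: "sent_to_human"))   # sent_to_human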
Algorithms
fast
scalable
easy to modify
infinite nuance
Humans
slow
hireable
trainable
limited memory
Algorithms
goal is human-level
long lead time to production
make stupid mistakes
need to see data
Humans
accurate
onboard fast
have common sense
generalise easily
What’s special?
Each request is high value
Research-iness: high uncertainty about effort & impact
Collaborating with automation ops: data annotation & QA
What isn’t?
Still have SLAs
Still care about quality, security, ...
What we build
Example: faces
"faces": [
  {
    "location": [266, 402, 441, 229],
    "embedding": [...]
  }
]
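For context, the embedding is a vector that lets faces be compared across images. A minimal sketch of one common way to compare such embeddings (cosine similarity; the vectors and threshold are made up, and this is not necessarily Veriff's method):

    import math

    def cosine_similarity(a, b):
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    # Hypothetical embeddings: the selfie face and the document-photo face.
    selfie = [0.12, -0.45, 0.33, 0.90]
    document = [0.10, -0.40, 0.35, 0.88]

    # Illustrative threshold; real systems calibrate this on labelled pairs.
    SAME_PERSON_THRESHOLD = 0.6
    print(cosine_similarity(selfie, document) >= SAME_PERSON_THRESHOLD)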
Example: specimens
{
  "version": "v19",
  "top1_class": "EE_ID_011",
  "top2_class": "EE_ID_012",
  "top3_class": "FI_ID_026",
  "top1_probability": 0.9719,
  "top2_probability": 0.0269,
  "top3_probability": 0.0001
}
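A minimal sketch of how a downstream service might consume this response, accepting the classification only when top-1 is confident and well separated from top-2 (the thresholds and acceptance rule are illustrative assumptions):

    import json

    # Hypothetical acceptance rule; not Veriff's production values.
    MIN_TOP1 = 0.95
    MIN_MARGIN = 0.90

    def accept_specimen(result: dict):
        """Return the specimen class if confident, else None (human review)."""
        top1 = result["top1_probability"]
        margin = top1 - result["top2_probability"]
        if top1 >= MIN_TOP1 and margin >= MIN_MARGIN:
            return result["top1_class"]
        return None

    result = json.loads(
        '{"top1_class": "EE_ID_011", "top1_probability": 0.9719, "top2_probability": 0.0269}'
    )
    print(accept_specimen(result))  # EE_ID_011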
ML-ness axis: software 1.0 to 2.0
Normal software ("Not ML", 70%): hardcoded if-else, database lookups, one-off annotation
Hybrid algorithms ("Barely learning", 20%): classical CV, manual optimisation
Deep learning ("Aww yeah", 10%): pretrained / API, in-house trained
Algorithm-related tickets
Make <service> faster
Train document classification model on new dataset
Investigate <image algorithm> false positives for unstable images
Review annotation quality for <task>
Build <edge case exception> for <document>
Algorithms
Microservices
Statistical testing (see the sketch after this list)
Annotation tools
Algorithm debugging tools
Middleware
Storage
Business logic implementation
Monitoring
Deep learning
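As one concrete example of the statistical testing item above (a sketch; the choice of test and all numbers are hypothetical): comparing the accuracy of two model versions with a two-proportion z-test before shipping.

    from math import sqrt, erf

    def two_proportion_z_test(success_a, n_a, success_b, n_b):
        """Two-sided z-test: did version B's accuracy really change vs. A?"""
        p_a, p_b = success_a / n_a, success_b / n_b
        p_pool = (success_a + success_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        # Two-sided p-value from the normal CDF.
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # Hypothetical numbers: old model 940/1000 correct, new model 958/1000.
    z, p = two_proportion_z_test(940, 1000, 958, 1000)
    print(f"z={z:.2f}, p={p:.3f}")  # ship only if the change is significant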
A hybrid machine
photos, video → decision-making machine (humans + algorithms) → yes/no, data
Data annotation
x1000
Data annotation
Annotation is a Product problem
=> Educate the annotators
=> Provide them with tools
For speccing a task: ~100 examples per class
For training a model: maximum volume (10-100k examples), balanced across classes (task-specific)
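A minimal sketch of the "balanced" part: capping each class when assembling a training set from an annotated pool (the function and cap are illustrative, not Veriff's tooling):

    import random
    from collections import defaultdict

    def balanced_sample(examples, cap_per_class, seed=0):
        """Downsample an annotated pool so no class dominates training."""
        by_class = defaultdict(list)
        for example, label in examples:
            by_class[label].append(example)
        rng = random.Random(seed)
        sample = []
        for label, items in by_class.items():
            rng.shuffle(items)
            sample.extend((item, label) for item in items[:cap_per_class])
        return sample

    # Hypothetical pool: 90k images of one document class, 10k of another.
    pool = ([(f"img_{i}", "EE_ID_011") for i in range(90_000)]
            + [(f"img_{i}", "FI_ID_026") for i in range(10_000)])
    train = balanced_sample(pool, cap_per_class=10_000)
    print(len(train))  # 20000, i.e. 10k per class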
How? Team.
Team today
7 in Data Science
7 in Software Engineering
3 in Data Annotation (mgmt only)
2 in Product
+ external (QA, DevOps, doc research, ...)
ML team → a8n team → a8n teams

2018: ML team, part of Engineering, 2 ML engineers
Reality: 1 ML engineer, 1 PO/recruiter

early 2019: separate a8n (automation) team, 7 DS / engineers
Reality: significant support from other engineering teams

now: a8n teams, 2+ product teams, 20 people
Reality: independent roadmap and consistent delivery
Hiring — challenges
Experience in data annotation?
CV / ML engineer, >0 years experience
Senior backend engineers, Python
Discussion

Impact
Direct cost & SLA impact
New clients unlocked (scale)
New external product and many internal features

Engineering challenges
Executing the compute graph in two modes (see the sketch after this list)
Edge-case complexity
Edge vs server inference
Data scientist ≠ software engineer
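One possible reading of "two modes" (an assumption on our part, not spelled out in the deck): the same dependency graph of algorithm steps runs both synchronously per live request and in bulk when re-processing historical sessions. A minimal sketch with a hypothetical graph and stub node functions:

    # Illustrative only: one compute graph, executed per live request or in bulk.
    GRAPH = {
        "extract_faces": [],
        "classify_specimen": [],
        "compare_faces": ["extract_faces"],
        "decide": ["classify_specimen", "compare_faces"],
    }

    NODES = {
        "extract_faces": lambda deps, req: f"faces({req})",
        "classify_specimen": lambda deps, req: f"specimen({req})",
        "compare_faces": lambda deps, req: f"match({deps['extract_faces']})",
        "decide": lambda deps, req: "approved",
    }

    def run_graph(request):
        """Execute nodes in dependency order for a single request."""
        results = {}
        remaining = dict(GRAPH)
        while remaining:
            ready = [n for n, deps in remaining.items()
                     if all(d in results for d in deps)]
            for node in ready:
                deps = {d: results[d] for d in GRAPH[node]}
                results[node] = NODES[node](deps, request)
                del remaining[node]
        return results["decide"]

    # Mode 1: live, one request at a time.
    print(run_graph("session_42"))
    # Mode 2: bulk re-processing over historical sessions, same graph.
    print([run_graph(r) for r in ["session_1", "session_2"]])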
/ taivop
taivo.pungas@veriff.com
Thank you!
Data is the spec: iteratively solving complex problems

Instead of coming up with a good general solution, it is better to focus on solving specific cases of the problem.
Instead of trying to solve all problematic cases right away, it is better to address a proportion of the easiest cases and repeat the process multiple times.
Instead of writing a good specification of a solution, it is better to curate a collection of good problem cases.
This collection of problem cases then essentially becomes your specification.
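A minimal sketch of what a case collection as specification could look like in code (the case format and the extractor are hypothetical stand-ins): every curated problem case becomes a regression check that each new version of the solution runs against.

    # Illustrative: curated problem cases used as the spec for a name extractor.
    CASES = [
        {"input": "PUNGAS  TAIVO", "expected": "Taivo Pungas"},
        {"input": "O'BRIEN, MARY", "expected": "Mary O'Brien"},
    ]

    def extract_name(raw: str) -> str:
        """Toy extractor; a real one would be the algorithm under test."""
        parts = raw.replace(",", " ").split()
        return " ".join(p.capitalize() for p in reversed(parts))

    def run_spec(cases, fn):
        """The collection of cases is the specification: all must pass."""
        failures = [c for c in cases if fn(c["input"]) != c["expected"]]
        print(f"{len(cases) - len(failures)}/{len(cases)} cases pass")
        return failures

    # Prints "1/2 cases pass": the O'BRIEN case fails, which is exactly the
    # signal that drives the next iteration of the solution.
    run_spec(CASES, extract_name)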
Example: reduce mistakes in extracted names
On <country> <document>.
dataisspec.github.io