1 of 38

ML Systems Fail, Part II:

How to Manage Mistakes at Model Training

June 18th, 2023

2 of 38

Me

  • Habeeb Shopeju
  • Research Engineer, Machine Learning
  • Thomson Reuters Labs
  • Interested in building Information Retrieval and Machine Learning Systems

3 of 38

You

  • Software Engineer
  • Product Owner/Manager
  • Data Scientist
  • AI Enthusiast

4 of 38

Recap

https://bit.ly/MLFailsOne

5 of 38

Interacting with ML Systems

  • Automate
  • Annotate
  • Prompt
  • Organize

6 of 38

Accepting the Perfect Imperfections

  • Guardrails
  • The Undo Button
  • Human in the Loop

7 of 38

Feedback

  • Explicit
  • Implicit

8 of 38

Disclaimer!!!

9 of 38

Two Tasks:

Find the wrong text span

Suggest an appropriate replacement

10 of 38

A Trivial Dataset

11 of 38

Model-centric Approach

  • Setting Expectations
  • Training Many Models
  • Regularization
  • Model Versioning
  • A Gold Standard

12 of 38

Setting Expectations

  • Creating a baseline
  • Lower bound in terms of “performance”

[Chart: example scores of 0.67, 0.67, 0.55 and 0.82]
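
A minimal sketch of such a baseline, assuming scikit-learn and a labelled dataset already loaded as X and y (both assumed names):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# X, y: features and labels of the dataset (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The simplest possible "model": always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Any real model should beat this score; if it does not, something is wrong
print("Baseline F1:", f1_score(y_test, baseline.predict(X_test), average="macro"))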

13 of 38

Training Many Models

Cross Validation

  • Split the data into folds, train on each split, and average the scores (sketch below).
  • Gives a better picture of model performance across the dataset.
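
A minimal cross-validation sketch with scikit-learn; the estimator and fold count are illustrative, and X, y are the assumed dataset variables:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross validation: each fold takes a turn as the held-out set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1_macro")

# The mean and the spread give a fuller picture than a single split
print("Per-fold F1:", scores)
print("Mean F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))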

18 of 38

Training Many Models

Hyperparameter Tuning

  • Configuration external to the model.
  • Controls the model’s learning process (sketch below).
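
As a sketch, a grid search over a couple of hyperparameters with scikit-learn (the grid values are illustrative, not recommendations):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hyperparameters are set before training; the search tries each combination
# with cross validation and keeps the best-scoring one
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]},
    cv=5,
    scoring="f1_macro",
)
grid.fit(X, y)
print("Best hyperparameters:", grid.best_params_)
print("Best cross-validated F1:", grid.best_score_)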

19 of 38

Regularization

Preventing overfitting and improving generalizability.

  • L1 and L2 regularization
  • Dropout
  • Early Stopping
  • Data augmentation
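
A sketch of two of these techniques in scikit-learn, assuming X_train and y_train exist; dropout and data augmentation are not shown here:

from sklearn.linear_model import LogisticRegression, SGDClassifier

# L2 regularization: smaller C means a stronger penalty on large weights
l2_model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
l2_model.fit(X_train, y_train)

# Early stopping: hold back part of the training data and stop once the
# validation score stops improving, instead of training until convergence
es_model = SGDClassifier(early_stopping=True, validation_fraction=0.1, n_iter_no_change=5)
es_model.fit(X_train, y_train)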

20 of 38

Model Versioning

Tracking and managing changes to your model, its hyperparameters, evaluation results, and other related artifacts.
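
A minimal, tool-free sketch of the idea: persist the weights next to the hyperparameters and scores that produced them, under a version identifier. Dedicated tools (MLflow, DVC, etc.) do this more robustly.

import json
from datetime import datetime, timezone
from pathlib import Path

import joblib

def save_model_version(model, hyperparams, metrics, root="model_registry"):
    """Store a model together with the settings and scores that produced it."""
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    version_dir = Path(root) / version
    version_dir.mkdir(parents=True, exist_ok=True)

    joblib.dump(model, version_dir / "model.joblib")
    (version_dir / "metadata.json").write_text(
        json.dumps({"version": version, "hyperparameters": hyperparams, "metrics": metrics}, indent=2)
    )
    return version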

21 of 38

A Gold Standard

  • A more correct algorithm, typically slower or more expensive.
  • Helps with data annotation.
  • Helps as a fallback method.

Today, Large Language Models seem to fit the role of a gold standard.
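
One way the fallback role can look in code, sketched with hypothetical fast_model and gold_standard callables (the latter might be a slower rule-based system or an LLM call):

CONFIDENCE_THRESHOLD = 0.80  # illustrative value

def predict_with_fallback(text, fast_model, gold_standard):
    """Use the cheap model when it is confident; defer to the gold standard otherwise."""
    label, confidence = fast_model(text)  # hypothetical: returns (label, probability)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    return gold_standard(text)  # slower and more expensive, but more correct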

22 of 38

Summary

  • Setting Expectations
  • Training Many Models
  • Regularization
  • Model Versioning
  • A Gold Standard

23 of 38

Data-centric Approach

  • Splits
  • Slices

24 of 38

Splits

  • Train, Validate, Test
  • Random Splitting… or Not
  • Training on Validation Data

25 of 38

Splits: Train, Validate, Test

  • Training data: 60% - 80%. Validation data: 10% - 20%. Testing data: ???
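
One common way to produce the three splits with scikit-learn, here an 80/10/10 split (the exact ratios are an assumption and depend on dataset size):

from sklearn.model_selection import train_test_split

# First carve off the test set, then split the remainder into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1 / 0.9, random_state=42  # 10% of the original data
)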

26 of 38

Splits: Random Splitting… or Not

It works, but suffers when inter-record dependencies exist.

27 of 38

Splits: Random Splitting… or Not

Avoid:

  • Inter-record dependencies leaking across splits
  • Temporal leakage (training on the future, testing on the past)
  • Classes missing entirely from one of the splits
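
A sketch of guarding against these issues in scikit-learn: group-aware splitting keeps dependent records on the same side, and stratification keeps every class in every split (groups is an assumed array identifying records that belong together, e.g. spans from the same document):

from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Keep dependent records (e.g. all spans from one document) in the same split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# Stratify so that every class shows up in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# For temporal data, split by time (e.g. TimeSeriesSplit) instead of at random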

28 of 38

Splits: Training on Validation Data

  • Extra data juice
  • Should only be done after evaluation metrics have been stored
  • No further evaluation should be done using those model weights
  • Beware: the refit model’s behaviour can get worse
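
As a sketch, assuming a fitted model and the split variables from earlier: store the validation metric first, then refit the same configuration on train + validation, and do not evaluate the refit model on the validation data again.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import f1_score

# 1. Evaluate on the validation set and store the metric before reusing the data
val_f1 = f1_score(y_val, model.predict(X_val), average="macro")

# 2. Refit the same configuration on train + validation for the extra data
X_final = np.concatenate([X_train, X_val])
y_final = np.concatenate([y_train, y_val])
final_model = clone(model).fit(X_final, y_final)

# 3. final_model has now seen the validation data; its validation score is meaningless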

29 of 38

Splits: Summary

  • Train, Validate, Test
  • Random Splitting… or Not
  • Training on Validation Data

30 of 38

Slices

  • Not all inputs are equal
  • Subpopulations
  • Capabilities
  • Generating Slices

31 of 38

Slices: Not all inputs are equal.

Context is important. Some predictions have far greater consequences.

32 of 38

Slices: Subpopulations

  • Splitting the dataset based on specific subpopulation criteria.
  • For example:
    • Location of the user
    • Length of the content
    • Topic of the content
    • Number of numeric values in the content
    • Number of corrections made by the user
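
A sketch of per-slice evaluation with pandas, assuming a test_df with label and prediction columns plus a content_length column (all assumed names); the point is that a good overall score can hide a weak slice:

import pandas as pd
from sklearn.metrics import f1_score

# Bucket the test records by one subpopulation criterion: content length
test_df["length_bucket"] = pd.cut(test_df["content_length"], bins=[0, 100, 500, 10_000])

# Compute the metric separately for each bucket
for bucket, group in test_df.groupby("length_bucket", observed=True):
    print(bucket, round(f1_score(group["label"], group["prediction"], average="macro"), 3))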

33 of 38

Slices: Capabilities

  • Splitting the dataset based on specific capabilities.
  • For example:
    • Detecting errors in names
    • Detecting errors in scientific references
    • Changing the tone in documents
    • Detecting errors in English syntax
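
A sketch of treating each capability as its own small labelled test set and reporting a score per capability (the set names and variables are invented for illustration):

from sklearn.metrics import f1_score

# Each capability gets its own labelled examples; evaluate the model on each
capability_sets = {
    "errors_in_names": (X_names, y_names),
    "errors_in_references": (X_refs, y_refs),
    "errors_in_syntax": (X_syntax, y_syntax),
}

for name, (X_cap, y_cap) in capability_sets.items():
    print(name, round(f1_score(y_cap, model.predict(X_cap), average="macro"), 3))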

34 of 38

Slices

35 of 38

Slices: Generating Slices

  • Difficult, but slices can be programmatically generated.
  • Full input generation or partial modification
  • Target specific relationships
    • Subject-verb agreement
    • Dialectal variations
  • Add noise
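
A minimal sketch of partial modification: perturb existing test inputs with character-level noise while keeping their labels (perturbations for verb agreement or dialectal variation would follow the same pattern; test_examples is an assumed list of (text, label) pairs):

import random

def add_typo_noise(text, rate=0.05, seed=0):
    """Swap adjacent characters at a given rate to simulate noisy input."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Noise slice: perturbed copies of existing test inputs, labels unchanged
noisy_slice = [(add_typo_noise(text), label) for text, label in test_examples]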

36 of 38

Slices: Summary

  • Not all inputs are equal
  • Subpopulations
  • Capabilities
  • Generating Slices

37 of 38

Questions??

38 of 38

Thank You🎈🎈🎈

Up Next

ML Systems Fail, Part III: How to Manage Mistakes while Planning Requirements