1 of 38

ML Systems Fail, Part II:

How to Manage Mistakes at Model Training

June 18th, 2023

2 of 38

Me

  • Habeeb Shopeju
  • Research Engineer, Machine Learning
  • Thomson Reuters Labs
  • Interested in building Information Retrieval and Machine Learning Systems

3 of 38

You

  • Software Engineer
  • Product Owner/Manager
  • Data Scientist
  • AI Enthusiast

4 of 38

Recap

https://bit.ly/MLFailsOne

5 of 38

Interacting with ML Systems

  • Automate
  • Annotate
  • Prompt
  • Organize

6 of 38

Accepting the Perfect Imperfections

  • Guardrails
  • The Undo Button
  • Human in the Loop

7 of 38

Feedback

  • Explicit
  • Implicit

8 of 38

Disclaimer!!!

9 of 38

Two Tasks:

Find the wrong text span

Suggest an appropriate replacement

10 of 38

A Trivial Dataset

11 of 38

Model-centric Approach

  • Setting Expectations
  • Training Many Models
  • Regularization
  • Model Versioning
  • A Gold Standard

12 of 38

Setting Expectations

  • Creating a baseline
  • Lower bound in terms of “performance”

[Chart: example scores of 0.67, 0.67, 0.55 and 0.82]
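
A minimal sketch of such a baseline, assuming scikit-learn and a labelled dataset already loaded as X and y (both assumed names):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# X, y: features and labels of the dataset (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The simplest possible "model": always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Any real model should beat this score; if it does not, something is wrong
print("Baseline F1:", f1_score(y_test, baseline.predict(X_test), average="macro"))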

13 of 38

Training Many Models

Cross Validation

  • Split the data into folds, train on each split, and average the scores (sketch below).
  • Gives a better picture of model performance across the dataset.
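
A minimal cross-validation sketch with scikit-learn; the estimator and fold count are illustrative, and X, y are the assumed dataset variables:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross validation: each fold takes a turn as the held-out set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1_macro")

# The mean and the spread give a fuller picture than a single split
print("Per-fold F1:", scores)
print("Mean F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))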

18 of 38

Training Many Models

Hyperparameter Tuning

  • Configuration external to the model.
  • Controls the model’s learning process (sketch below).
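
As a sketch, a grid search over a couple of hyperparameters with scikit-learn (the grid values are illustrative, not recommendations):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hyperparameters are set before training; the search tries each combination
# with cross validation and keeps the best-scoring one
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]},
    cv=5,
    scoring="f1_macro",
)
grid.fit(X, y)
print("Best hyperparameters:", grid.best_params_)
print("Best cross-validated F1:", grid.best_score_)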

19 of 38

Regularization

Preventing overfitting and improving generalizability.

  • L1 and L2 regularization
  • Dropout
  • Early Stopping
  • Data augmentation
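
A sketch of two of these techniques in scikit-learn, assuming X_train and y_train exist; dropout and data augmentation are not shown here:

from sklearn.linear_model import LogisticRegression, SGDClassifier

# L2 regularization: smaller C means a stronger penalty on large weights
l2_model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
l2_model.fit(X_train, y_train)

# Early stopping: hold back part of the training data and stop once the
# validation score stops improving, instead of training until convergence
es_model = SGDClassifier(early_stopping=True, validation_fraction=0.1, n_iter_no_change=5)
es_model.fit(X_train, y_train)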

20 of 38

Model Versioning

Tracking and managing changes to your model, its hyperparameters, evaluation results, and other related artifacts.
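
A minimal, tool-free sketch of the idea: persist the weights next to the hyperparameters and scores that produced them, under a version identifier. Dedicated tools (MLflow, DVC, etc.) do this more robustly.

import json
from datetime import datetime, timezone
from pathlib import Path

import joblib

def save_model_version(model, hyperparams, metrics, root="model_registry"):
    """Store a model together with the settings and scores that produced it."""
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    version_dir = Path(root) / version
    version_dir.mkdir(parents=True, exist_ok=True)

    joblib.dump(model, version_dir / "model.joblib")
    (version_dir / "metadata.json").write_text(
        json.dumps({"version": version, "hyperparameters": hyperparams, "metrics": metrics}, indent=2)
    )
    return version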

21 of 38

A Gold Standard

  • A more correct algorithm, typically slower or more expensive.
  • Helps with data annotation.
  • Helps as a fallback method.

Today, Large Language Models seem to fit the role of a gold standard.
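
One way the fallback role can look in code, sketched with hypothetical fast_model and gold_standard callables (the latter might be a slower rule-based system or an LLM call):

CONFIDENCE_THRESHOLD = 0.80  # illustrative value

def predict_with_fallback(text, fast_model, gold_standard):
    """Use the cheap model when it is confident; defer to the gold standard otherwise."""
    label, confidence = fast_model(text)  # hypothetical: returns (label, probability)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    return gold_standard(text)  # slower and more expensive, but more correct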

22 of 38

Summary

  • Setting Expectations
  • Training Many Models
  • Regularization
  • Model Versioning
  • A Gold Standard

23 of 38

Data-centric Approach

  • Splits
  • Slices

24 of 38

Splits

  • Train, Validate, Test
  • Random Splitting… or Not
  • Training on Validation Data

25 of 38

Splits: Train, Validate, Test

  • Training data: 60% - 80%. Validation data: 10% - 20%. Testing data: ???
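
One common way to produce the three splits with scikit-learn, here an 80/10/10 split (the exact ratios are an assumption and depend on dataset size):

from sklearn.model_selection import train_test_split

# First carve off the test set, then split the remainder into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1 / 0.9, random_state=42  # 10% of the original data
)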

26 of 38

Splits: Random Splitting… or Not

It works, but suffers when inter-record dependencies exist.

27 of 38

Splits: Random Splitting… or Not

Avoid:

  • Inter-record dependencies leaking across splits
  • Temporal leakage (training on the future, testing on the past)
  • Classes missing entirely from one of the splits
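
A sketch of guarding against these issues in scikit-learn: group-aware splitting keeps dependent records on the same side, and stratification keeps every class in every split (groups is an assumed array identifying records that belong together, e.g. spans from the same document):

from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Keep dependent records (e.g. all spans from one document) in the same split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# Stratify so that every class shows up in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# For temporal data, split by time (e.g. TimeSeriesSplit) instead of at random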

28 of 38

Splits: Training on Validation Data

  • Extra data juice
  • Should only be done after evaluation metrics have been stored
  • No further evaluation should be done using those model weights
  • Beware: the refit model’s behaviour can get worse
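
As a sketch, assuming a fitted model and the split variables from earlier: store the validation metric first, then refit the same configuration on train + validation, and do not evaluate the refit model on the validation data again.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import f1_score

# 1. Evaluate on the validation set and store the metric before reusing the data
val_f1 = f1_score(y_val, model.predict(X_val), average="macro")

# 2. Refit the same configuration on train + validation for the extra data
X_final = np.concatenate([X_train, X_val])
y_final = np.concatenate([y_train, y_val])
final_model = clone(model).fit(X_final, y_final)

# 3. final_model has now seen the validation data; its validation score is meaningless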

29 of 38

Splits: Summary

  • Train, Validate, Test
  • Random Splitting… or Not
  • Training on Validation Data

30 of 38

Slices

  • Not all inputs are equal
  • Subpopulations
  • Capabilities
  • Generating Slices

31 of 38

Slices: Not all inputs are equal.

Context is important. Some predictions have far greater consequences.

32 of 38

Slices: Subpopulations

  • Splitting the dataset based on specific subpopulation criteria.
  • For example:
    • Location of the user
    • Length of the content
    • Topic of the content
    • Number of numeric values in the content
    • Number of corrections made by the user
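
A sketch of per-slice evaluation with pandas, assuming a test_df with label and prediction columns plus a content_length column (all assumed names); the point is that a good overall score can hide a weak slice:

import pandas as pd
from sklearn.metrics import f1_score

# Bucket the test records by one subpopulation criterion: content length
test_df["length_bucket"] = pd.cut(test_df["content_length"], bins=[0, 100, 500, 10_000])

# Compute the metric separately for each bucket
for bucket, group in test_df.groupby("length_bucket", observed=True):
    print(bucket, round(f1_score(group["label"], group["prediction"], average="macro"), 3))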

33 of 38

Slices: Capabilities

  • Splitting the dataset based on specific capabilities.
  • For example:
    • Detecting errors in names
    • Detecting errors in scientific references
    • Changing the tone in documents
    • Detecting errors in English syntax
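
A sketch of treating each capability as its own small labelled test set and reporting a score per capability (the set names and variables are invented for illustration):

from sklearn.metrics import f1_score

# Each capability gets its own labelled examples; evaluate the model on each
capability_sets = {
    "errors_in_names": (X_names, y_names),
    "errors_in_references": (X_refs, y_refs),
    "errors_in_syntax": (X_syntax, y_syntax),
}

for name, (X_cap, y_cap) in capability_sets.items():
    print(name, round(f1_score(y_cap, model.predict(X_cap), average="macro"), 3))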

34 of 38

Slices

35 of 38

Slices: Generating Slices

  • Difficult, but slices can be programmatically generated.
  • Full input generation or partial modification
  • Target specific relationships
    • Subject-verb agreement
    • Dialectal variations
  • Add noise
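
A minimal sketch of partial modification: perturb existing test inputs with character-level noise while keeping their labels (perturbations for verb agreement or dialectal variation would follow the same pattern; test_examples is an assumed list of (text, label) pairs):

import random

def add_typo_noise(text, rate=0.05, seed=0):
    """Swap adjacent characters at a given rate to simulate noisy input."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Noise slice: perturbed copies of existing test inputs, labels unchanged
noisy_slice = [(add_typo_noise(text), label) for text, label in test_examples]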

36 of 38

Slices: Summary

  • Not all inputs are equal
  • Subpopulations
  • Capabilities
  • Generating Slices

37 of 38

Questions??

38 of 38

Thank You🎈🎈🎈

Up Next

ML Systems Fail, Part III: How to Manage Mistakes while Planning Requirements