1 of 47

Data Engineering (AI5308/AI4005)

Apr 27: Model Development and Offline Evaluation (Ch. 6)

Sundong Kim

Course website: https://sundong.kim/courses/dataeng23sp/
Contents from CS 329S (Chip Huyen, 2022) | cs329s.stanford.edu

2 of 47

Get MLOps Certified With The Course From Weights & Biases: Visit http://wandb.me/MLOps-Course (3 hrs)

3 of 47

4 of 47

Useful Tools - W&B for experiment tracking (a.k.a. wandb)

Bringing machine learning models to production is challenging, with a continuous iterative lifecycle that consists of many complex components. Having a disciplined, flexible and collaborative process - an effective MLOps system - is crucial to enabling velocity and rigor, and building an end-to-end machine learning pipeline that continually delivers production-ready ML models and services.

  • Official course (3 hrs) by ML engineers at W&B: http://wandb.me/MLOps-Course

  • For deep learners: Integrate Weights & Biases with PyTorch: https://www.youtube.com/watch?v=G7GH0SeNBMA

5 of 47

Useful Tools - W&B for experiment tracking (a.k.a. wandb)

Integrate Weights & Biases with PyTorch: https://www.youtube.com/watch?v=G7GH0SeNBMA
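
A minimal sketch of what the integration looks like (the project name is hypothetical and the training loop is schematic; assumes `pip install wandb` and `wandb login`):

    import torch
    import torch.nn as nn
    import wandb

    # Hypothetical project name; config holds the hyperparameters you want tracked.
    wandb.init(project="dataeng-demo", config={"lr": 1e-3, "epochs": 3})
    cfg = wandb.config

    model = nn.Linear(10, 1)
    opt = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    loss_fn = nn.MSELoss()

    for epoch in range(cfg.epochs):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in for a real DataLoader
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        wandb.log({"epoch": epoch, "train_loss": loss.item()})  # appears live in the W&B dashboard

    wandb.finish()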

6 of 47

Useful Tools - MLflow for your ML lifecycle

See YouTube video: https://www.youtube.com/watch?v=VokAGy8C6K4

Try the MLflow tutorial (spend 3-4 hours on this): https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html

For advanced users: see this PyCon video (in Korean): https://www.youtube.com/watch?v=H-4ZIfOJDaw
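
A minimal tracking sketch along the same lines (the run name and logged values are made up; assumes `pip install mlflow`; browse results with `mlflow ui`):

    import mlflow

    with mlflow.start_run(run_name="baseline"):         # hypothetical run name
        mlflow.log_param("lr", 1e-3)                    # hyperparameters
        for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in training loop
            mlflow.log_metric("train_loss", loss, step=epoch)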

7 of 47

Self-study these tools - by May 17

8 of 47

Project Mid Review (May 2)

See this page and submit the form when ready. https://sundong.kim/courses/dataeng23sp/final-project/#mid-review

  • Team 4: Today (Apr 25)
  • Team 10: Thursday (Apr 27)
  • All the other teams: May 2

9 of 47

Team 10’s Presentation

10 of 47

Critique Scoring Process

  • Check Plus (5 points) - The critique is very well written; the strength/weakness items are very insightful.
  • Check (3 points) - It looks okay. Most critiques will likely belong to this class.
  • Check Minus (1 point) - The critique is weak: the summary is vague, the strength/weakness items are trivial and not insightful, only trivial questions were asked, and the discussion is shallow.
  • No submission / late submission (0 points)

11 of 47

Score      Critique 1    Critique 2
0          2             3
2          3             2
3          28            14
4          6             9
5          7             18
Average    3.38          3.82

TA (Hongyiel) !!

12 of 47

Possible Discussion

  • What would be an acceptable threshold for performance improvement to warrant the adoption of a new technology? From what point can we say that the benefits of changing the technology outweigh the costs?

13 of 47

Possible Discussion

  • What would be an acceptable threshold for performance improvement to warrant the adoption of a new technology? From what point can we say that the benefits of changing the technology outweigh the costs?

→ Look up A/B testing and see Ronny Kohavi’s keynote slides and talk: https://exp-platform.com/Documents/2015-08OnlineControlledExperimentsKDDKeynoteNR.pdf https://www.youtube.com/watch?v=HEGI5QN3fXE (next page)
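
As a rough sketch of the statistics behind such a decision (the conversion counts are made up; assumes statsmodels is installed):

    # Two-proportion z-test on hypothetical A/B conversion counts.
    from statsmodels.stats.proportion import proportions_ztest

    conversions = [530, 584]   # successes: control vs. treatment (made-up numbers)
    visitors = [10000, 10000]  # sample size per variant

    stat, p_value = proportions_ztest(conversions, visitors)
    print(f"z = {stat:.2f}, p = {p_value:.4f}")
    # Adopt the new technology only if the lift is both statistically significant
    # and large enough to outweigh the switching cost.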

14 of 47

A/B Testing Pitfalls: Getting Numbers You Can Trust is Hard

15 of 47

Possible Discussion

  • Which factor contributed more to the performance improvements: the data preprocessing methods or the pre-training methods?

16 of 47

Possible Discussion

  • Ethical considerations: Given that one of their downstream tasks is adult product recognition, I want to consider ethical implications around such a task being automated by machine-learning algorithms rather than human moderators.

17 of 47

Possible Discussion

  • Maybe there is a way to make the invalid dataset valid and use it for other research, or to use the invalid data pairs to augment negative pairs?

18 of 47

Possible Discussion

  • I wonder how so much duplicate data remains after preprocessing, which downsamples each image to 5*5 pixels and deletes items whose hash values match. I also wonder what the effect would be of skipping the catalog-based soft labeling and simply deleting items that share the same product ID.

19 of 47

Tons of applied data science papers

20 of 47

Model selection

21 of 47

ML algorithm

  • Function to be learned
    • E.g. model architecture, number of hidden layers
  • Objective function to optimize (minimize)
    • Loss function
  • Learning procedure (optimizer)
    • Adam, Momentum

22 of 47

6 tips for evaluating ML algorithms

23 of 47

  1. Avoid the state-of-the-art trap
  • SOTA’s promise
    • Why use an old solution when a newer one exists?
    • It’s exciting to work on shiny things
    • Marketing

24 of 47

  1. Avoid the state-of-the-art trap
  • SOTA’s reality
    • SOTA on research data != SOTA on your data
    • Cost
    • Latency
    • Proven industry success
    • Community support

25 of 47

2. Start with the simplest models

  • Easier to deploy
    • Deploying early allows validating pipeline
  • Easier to debug
  • Easier to improve upon

26 of 47

2. Start with the simplest models

  • Easier to deploy
    • Deploying early allows validating pipeline
  • Easier to debug
  • Easier to improve upon
  • Simplest models != models with the least effort
    • BERT is easy to start with thanks to pretrained checkpoints, but it is not the simplest model (see the sketch below)
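
One way to put this into practice is to score trivial baselines first; a minimal sketch with scikit-learn (synthetic data stands in for your own):

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for real data

    for name, clf in [("majority-class", DummyClassifier(strategy="most_frequent")),
                      ("logistic regression", LogisticRegression(max_iter=1000))]:
        print(f"{name}: {cross_val_score(clf, X, y, cv=5).mean():.3f}")
    # A heavier model (e.g. BERT on text) must clearly beat these numbers to justify its cost.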

27 of 47

3. Avoid human biases in selecting models

  • A tale of human biases
    • Papers proposing LSTM variants show that the variants improve upon the vanilla LSTM.
    • Do they?

29 of 47

3. Avoid human biases in selecting models

  • It’s important to evaluate models under comparable conditions
    • It’s tempting to run more experiments for X because you’re more excited about X
  • Near-impossible to make blanket claims that X is always better than Y

30 of 47

4. Better now vs. better later

  • Best model now != best model in 2 months
    • Improvement potential with more data
    • Ease of update

31 of 47

Learning curve

Good for estimating if performance can improve with more data
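
A minimal sketch of how to compute such a curve with scikit-learn (synthetic data as a stand-in):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, random_state=0)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    for n, s in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:5d} training examples -> CV accuracy {s:.3f}")
    # If validation scores are still rising at the largest size, more data should help.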

32 of 47

5. Evaluate trade-offs

  • False positives vs. false negatives
  • Accuracy vs. compute/latency
  • Accuracy vs. interpretability
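
The false positive/false negative trade-off, for instance, is largely a choice of decision threshold; a minimal sketch on synthetic, imbalanced data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for threshold in (0.3, 0.5, 0.7):
        tn, fp, fn, tp = confusion_matrix(y_te, proba >= threshold).ravel()
        print(f"threshold={threshold}: FP={fp}, FN={fn}")
    # Raising the threshold trades false positives for false negatives;
    # pick the mix that matches the application's costs.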

33 of 47

6. Understand your model’s assumptions

  • IID
    • Neural networks assume that examples are independent and identically distributed
  • Smoothness
    • Supervised algorithms assume that there’s a set of functions that can transform inputs into outputs such that similar inputs are transformed into similar outputs
  • Tractability
    • Let X be the input and Z be the latent representation of X. Generative models assume that it’s tractable to compute P(Z|X).
  • Boundaries
    • Linear classifiers assume that decision boundaries are linear.
  • Conditional independence
    • Naive Bayes classifiers assume that the attribute values are independent of each other given the class.
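
A quick way to see the cost of a violated assumption; a minimal sketch comparing a linear boundary against a non-linear one on data whose true boundary is curved:

    from sklearn.datasets import make_moons
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)  # non-linear boundary

    linear = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    rbf = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
    print(f"linear boundary: {linear:.3f}, RBF kernel: {rbf:.3f}")
    # The gap is the price of assuming a linear boundary where there isn't one.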

34 of 47

6 tips for evaluating ML algorithms

  1. Avoid the state-of-the-art trap
  2. Start with the simplest models
  3. Avoid human biases in selecting models
  4. Evaluate good performance now vs. good performance later
  5. Evaluate trade-offs
  6. Understand your model’s assumptions

35 of 47

Ensembles

36 of 47

Ensemble

  • Creating a strong model from an ensemble of weak models

(Figure: base learners combined into an ensemble)

37 of 47

Ensembles: extremely common in leaderboard-style projects

  • 20 of the 22 winning solutions on Kaggle from Jan to Aug 2021 used ensembles
  • One solution uses 33 models!

38 of 47

Why does ensembling work?

  • Task: email classification (SPAM / NOT SPAM)
  • 3 uncorrelated models, each with accuracy of 70%
  • Ensemble: majority vote of these 3 models
    • Ensemble is correct if at least 2 models are correct

39 of 47

Why does ensembling work?

  • 3 models, each with 70% accuracy
  • Ensemble is correct if at least 2 models are correct
  • Probability at least 2 models are correct: 34.3% + 44.1% = 78.4%

Outputs of 3 models     Probability                        Ensemble’s output
All 3 are correct       0.7 * 0.7 * 0.7 = 0.343            Correct
Only 2 are correct      (0.7 * 0.7 * 0.3) * 3 = 0.441      Correct
Only 1 is correct       (0.3 * 0.3 * 0.7) * 3 = 0.189      Wrong
None is correct         0.3 * 0.3 * 0.3 = 0.027            Wrong
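
A quick check of this arithmetic (the majority vote is a binomial tail):

    from math import comb

    p = 0.7  # accuracy of each independent model
    p_majority = sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))
    print(f"{p_majority:.3f}")  # 0.784, i.e. 78.4% > 70%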

40 of 47

Why does ensembling work?

  • 3 models, each with 70% accuracy
  • Ensemble is correct if at least 2 models are correct
  • Probability at least 2 models are correct: 34.3% + 44.1% = 78.4%

  • The less correlation among base learners, the better
  • Common for base learners to have different architectures

41 of 47

Ensemble

  • Bagging
  • Boosting
  • Stacking

42 of 47

Bagging

  • Sample with replacement to create different datasets
  • Train a classifier with each dataset
  • Aggregate predictions from classifiers
    • e.g. average, majority vote

Illustration by Sirakorn

43 of 47

Bagging

  • Sample with replacement to create different datasets
  • Train a classifier with each dataset
  • Aggregate predictions from classifiers
    • e.g. average, majority vote

Illustration by Sirakorn

  • Generally improves unstable methods, e.g. neural networks, trees
  • Can degrade stable methods, e.g. kNN

Bagging Predictors (Leo Breiman, 1996)
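
A minimal sketch with scikit-learn’s BaggingClassifier (synthetic data; exact numbers will vary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
    bagged = cross_val_score(
        BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
        X, y, cv=5).mean()
    print(f"single tree: {single:.3f}, bagged trees: {bagged:.3f}")
    # Trees are unstable, so bagging tends to help; for stable learners like kNN it may not.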

44 of 47

Boosting

  1. Train a weak classifier
  2. Give samples misclassified by the weak classifier higher weight
  3. Repeat (1) on this reweighted data for as many iterations as needed
  4. Final strong classifier: a weighted combination of the existing classifiers
    • classifiers with smaller training errors have higher weights

Illustration by Sirakorn

45 of 47

Boosting

  1. Train a weak classifier
  2. Give samples misclassified by the weak classifier higher weight
  3. Repeat (1) on this reweighted data for as many iterations as needed
  4. Final strong classifier: a weighted combination of the existing classifiers
    • classifiers with smaller training errors have higher weights

Illustration by Sirakorn

Extremely popular:

  • XGBoost
  • LightGBM
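
AdaBoost is the textbook instance of the reweighting loop above; a minimal sketch (XGBoost and LightGBM apply the same idea with gradient-based variants):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    stump = cross_val_score(DecisionTreeClassifier(max_depth=1), X, y, cv=5).mean()
    boosted = cross_val_score(AdaBoostClassifier(n_estimators=100, random_state=0),
                              X, y, cv=5).mean()
    print(f"single stump: {stump:.3f}, boosted stumps: {boosted:.3f}")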

46 of 47

Stacking

Combine the base learners’ predictions with a meta-learner, e.g. majority vote, logistic regression, or a simple NN
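
A minimal sketch with scikit-learn’s StackingClassifier, using logistic regression as the meta-learner (the base-learner choices here are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)

    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("svm", SVC(random_state=0))],
        final_estimator=LogisticRegression())  # the meta-learner
    print(f"stacked accuracy: {cross_val_score(stack, X, y, cv=5).mean():.3f}")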

47 of 47

Data Engineering

Next class: Model Development and Offline Evaluation (Ch. 6)