1 of 47

Data Engineering (AI5308/AI4005)

Apr 27: Model Development and Offline Evaluation (Ch. 6)

Sundong Kim

Course website: https://sundong.kim/courses/dataeng23sp/
Contents from CS 329S (Chip Huyen, 2022) | cs329s.stanford.edu

2 of 47

Get MLOps Certified With The Course From Weights & Biases: Visit http://wandb.me/MLOps-Course (3 hrs)

3 of 47

4 of 47

Useful Tools - W&B for experiment tracking (a.k.a. wandb)

Bringing machine learning models to production is challenging, with a continuous iterative lifecycle that consists of many complex components. Having a disciplined, flexible and collaborative process - an effective MLOps system - is crucial to enabling velocity and rigor, and building an end-to-end machine learning pipeline that continually delivers production-ready ML models and services.

  • Official course (3 hrs) by ML engineers at W&B: http://wandb.me/MLOps-Course

  • For deep learners: Integrate Weights & Biases with PyTorch: https://www.youtube.com/watch?v=G7GH0SeNBMA

5 of 47

Useful Tools - W&B for experiment tracking (a.k.a. wandb)

Integrate Weights & Biases with PyTorch: https://www.youtube.com/watch?v=G7GH0SeNBMA
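
A minimal sketch of what the integration looks like (the project name is hypothetical and the training loop is schematic; assumes `pip install wandb` and `wandb login`):

    import torch
    import torch.nn as nn
    import wandb

    # Hypothetical project name; config holds the hyperparameters you want tracked.
    wandb.init(project="dataeng-demo", config={"lr": 1e-3, "epochs": 3})
    cfg = wandb.config

    model = nn.Linear(10, 1)
    opt = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    loss_fn = nn.MSELoss()

    for epoch in range(cfg.epochs):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in for a real DataLoader
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        wandb.log({"epoch": epoch, "train_loss": loss.item()})  # appears live in the W&B dashboard

    wandb.finish()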

6 of 47

Useful Tools - MLflow for your ML lifecycle

See YouTube video: https://www.youtube.com/watch?v=VokAGy8C6K4

Try the MLflow tutorial (spend 3-4 hours on this): https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html

For advanced users: see this PyCon video (in Korean): https://www.youtube.com/watch?v=H-4ZIfOJDaw
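
A minimal tracking sketch along the same lines (the run name and logged values are made up; assumes `pip install mlflow`; browse results with `mlflow ui`):

    import mlflow

    with mlflow.start_run(run_name="baseline"):         # hypothetical run name
        mlflow.log_param("lr", 1e-3)                    # hyperparameters
        for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in training loop
            mlflow.log_metric("train_loss", loss, step=epoch)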

7 of 47

Self-study these tools - by May 17

8 of 47

Project Mid Review (May 2)

See this page and submit the form when ready. https://sundong.kim/courses/dataeng23sp/final-project/#mid-review

  • Team 4: Today (Apr 25)
  • Team 10: Thursday (Apr 27)
  • All the other teams: May 2

9 of 47

Team 10’s Presentation

10 of 47

Critique Scoring Process

  • Check Plus (5 points) - The critique is very well written; the strength/weakness items are very insightful.
  • Check (3 points) - It looks okay. Most critiques will likely belong to this class.
  • Check Minus (1 point) - The critique is weak: the summary is vague, the strength/weakness items are trivial and not insightful, only trivial questions were asked, and the discussion is shallow.
  • No submission / late submission (0 points)

11 of 47

Score      Critique 1    Critique 2
0          2             3
2          3             2
3          28            14
4          6             9
5          7             18
Average    3.38          3.82

TA (Hongyiel) !!

12 of 47

Possible Discussion

  • What would be an acceptable threshold for performance improvement to warrant the adoption of a new technology? From what point can we say that the benefits of changing the technology outweigh the costs?

13 of 47

Possible Discussion

  • What would be an acceptable threshold for performance improvement to warrant the adoption of a new technology? From what point can we say that the benefits of changing the technology outweigh the costs?

→ Look up A/B testing and see Ronny Kohavi’s keynote slides and talk: https://exp-platform.com/Documents/2015-08OnlineControlledExperimentsKDDKeynoteNR.pdf https://www.youtube.com/watch?v=HEGI5QN3fXE (next page)
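
As a rough sketch of the statistics behind such a decision (the conversion counts are made up; assumes statsmodels is installed):

    # Two-proportion z-test on hypothetical A/B conversion counts.
    from statsmodels.stats.proportion import proportions_ztest

    conversions = [530, 584]   # successes: control vs. treatment (made-up numbers)
    visitors = [10000, 10000]  # sample size per variant

    stat, p_value = proportions_ztest(conversions, visitors)
    print(f"z = {stat:.2f}, p = {p_value:.4f}")
    # Adopt the new technology only if the lift is both statistically significant
    # and large enough to outweigh the switching cost.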

14 of 47

A/B Testing Pitfalls: Getting Numbers You Can Trust is Hard

15 of 47

Possible Discussion

  • Which factor contributed more to the performance improvements: the data preprocessing methods or the pre-training methods?

16 of 47

Possible Discussion

  • Ethical considerations: Given that one of their downstream tasks is adult product recognition, I want to consider ethical implications around such a task being automated by machine-learning algorithms rather than human moderators.

17 of 47

Possible Discussion

  • Maybe there is a way to make the invalid dataset valid and use it for other research, or to use the invalid data pairs to augment negative pairs?

18 of 47

Possible Discussion

  • I wonder how so much duplicate data remains after preprocessing, which downsamples each image to 5*5 pixels and deletes items whose hash values match. I also wonder what the effect would be of skipping the catalog-based soft labeling and simply deleting items that share the same product ID.

19 of 47

Tons of applied data science papers

20 of 47

Model selection

21 of 47

ML algorithm

  • Function to be learned
    • E.g. model architecture, number of hidden layers
  • Objective function to optimize (minimize)
    • Loss function
  • Learning procedure (optimizer)
    • Adam, Momentum

22 of 47

6 tips for evaluating ML algorithms

23 of 47

  1. Avoid the state-of-the-art trap
  • SOTA’s promise
    • Why use an old solution when a newer one exists?
    • It’s exciting to work on shiny things
    • Marketing

24 of 47

  1. Avoid the state-of-the-art trap
  • SOTA’s reality
    • SOTA on research data != SOTA on your data
    • Cost
    • Latency
    • Proven industry success
    • Community support

25 of 47

2. Start with the simplest models

  • Easier to deploy
    • Deploying early allows validating pipeline
  • Easier to debug
  • Easier to improve upon

26 of 47

2. Start with the simplest models

  • Easier to deploy
    • Deploying early allows validating pipeline
  • Easier to debug
  • Easier to improve upon
  • Simplest models != models with the least effort
    • BERT is easy to start with thanks to pretrained checkpoints, but it is not the simplest model (see the sketch below)
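
One way to put this into practice is to score trivial baselines first; a minimal sketch with scikit-learn (synthetic data stands in for your own):

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for real data

    for name, clf in [("majority-class", DummyClassifier(strategy="most_frequent")),
                      ("logistic regression", LogisticRegression(max_iter=1000))]:
        print(f"{name}: {cross_val_score(clf, X, y, cv=5).mean():.3f}")
    # A heavier model (e.g. BERT on text) must clearly beat these numbers to justify its cost.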

27 of 47

3. Avoid human biases in selecting models

  • A tale of human biases
    • Papers proposing LSTM variants show that the variants improve upon the vanilla LSTM.
    • Do they?

29 of 47

3. Avoid human biases in selecting models

  • It’s important to evaluate models under comparable conditions
    • It’s tempting to run more experiments for X because you’re more excited about X
  • Near-impossible to make blanket claims that X is always better than Y

30 of 47

4. Better now vs. better later

  • Best model now != best model in 2 months
    • Improvement potential with more data
    • Ease of update

31 of 47

Learning curve

Good for estimating if performance can improve with more data
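
A minimal sketch of how to compute such a curve with scikit-learn (synthetic data as a stand-in):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, random_state=0)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    for n, s in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:5d} training examples -> CV accuracy {s:.3f}")
    # If validation scores are still rising at the largest size, more data should help.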

32 of 47

5. Evaluate trade-offs

  • False positives vs. false negatives
  • Accuracy vs. compute/latency
  • Accuracy vs. interpretability
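
The false positive/false negative trade-off, for instance, is largely a choice of decision threshold; a minimal sketch on synthetic, imbalanced data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for threshold in (0.3, 0.5, 0.7):
        tn, fp, fn, tp = confusion_matrix(y_te, proba >= threshold).ravel()
        print(f"threshold={threshold}: FP={fp}, FN={fn}")
    # Raising the threshold trades false positives for false negatives;
    # pick the mix that matches the application's costs.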

33 of 47

6. Understand your model’s assumptions

  • IID
    • Neural networks assume that examples are independent and identically distributed
  • Smoothness
    • Supervised algorithms assume that there’s a set of functions that can transform inputs into outputs such that similar inputs are transformed into similar outputs
  • Tractability
    • Let X be the input and Z be the latent representation of X. Generative models assume that it’s tractable to compute P(Z|X).
  • Boundaries
    • Linear classifiers assume that decision boundaries are linear.
  • Conditional independence
    • Naive Bayes classifiers assume that the attribute values are independent of each other given the class.
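
A quick way to see the cost of a violated assumption; a minimal sketch comparing a linear boundary against a non-linear one on data whose true boundary is curved:

    from sklearn.datasets import make_moons
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)  # non-linear boundary

    linear = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    rbf = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
    print(f"linear boundary: {linear:.3f}, RBF kernel: {rbf:.3f}")
    # The gap is the price of assuming a linear boundary where there isn't one.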

34 of 47

6 tips for evaluating ML algorithms

  1. Avoid the state-of-the-art trap
  2. Start with the simplest models
  3. Avoid human biases in selecting models
  4. Evaluate good performance now vs. good performance later
  5. Evaluate trade-offs
  6. Understand your model’s assumptions

35 of 47

Ensembles

36 of 47

Ensemble

  • Creating a strong model from an ensemble of weak models

(Figure: base learners combined into an ensemble)

37 of 47

Ensembles: extremely common in leaderboard-style projects

  • 20 of the 22 winning solutions on Kaggle from Jan to Aug 2021 used ensembles
  • One solution uses 33 models!

38 of 47

Why does ensembling work?

  • Task: email classification (SPAM / NOT SPAM)
  • 3 uncorrelated models, each with accuracy of 70%
  • Ensemble: majority vote of these 3 models
    • Ensemble is correct if at least 2 models are correct

39 of 47

Why does ensembling work?

  • 3 models, each with 70% accuracy
  • Ensemble is correct if at least 2 models are correct
  • Probability at least 2 models are correct: 34.3% + 44.1% = 78.4%

Outputs of 3 models     Probability                        Ensemble’s output
All 3 are correct       0.7 * 0.7 * 0.7 = 0.343            Correct
Only 2 are correct      (0.7 * 0.7 * 0.3) * 3 = 0.441      Correct
Only 1 is correct       (0.3 * 0.3 * 0.7) * 3 = 0.189      Wrong
None is correct         0.3 * 0.3 * 0.3 = 0.027            Wrong
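
A quick check of this arithmetic (the majority vote is a binomial tail):

    from math import comb

    p = 0.7  # accuracy of each independent model
    p_majority = sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))
    print(f"{p_majority:.3f}")  # 0.784, i.e. 78.4% > 70%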

40 of 47

Why does ensembling work?

  • 3 models, each with 70% accuracy
  • Ensemble is correct if at least 2 models are correct
  • Probability at least 2 models are correct: 34.3% + 44.1% = 78.4%

  • The less correlation among base learners, the better
  • Common for base learners to have different architectures

41 of 47

Ensemble

  • Bagging
  • Boosting
  • Stacking

42 of 47

Bagging

  • Sample with replacement to create different datasets
  • Train a classifier with each dataset
  • Aggregate predictions from classifiers
    • e.g. average, majority vote

Illustration by Sirakorn

43 of 47

Bagging

  • Sample with replacement to create different datasets
  • Train a classifier with each dataset
  • Aggregate predictions from classifiers
    • e.g. average, majority vote

Illustration by Sirakorn

  • Generally improves unstable methods, e.g. neural networks, trees
  • Can degrade stable methods, e.g. kNN

Bagging Predictors (Leo Breiman, 1996)
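
A minimal sketch with scikit-learn’s BaggingClassifier (synthetic data; exact numbers will vary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
    bagged = cross_val_score(
        BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
        X, y, cv=5).mean()
    print(f"single tree: {single:.3f}, bagged trees: {bagged:.3f}")
    # Trees are unstable, so bagging tends to help; for stable learners like kNN it may not.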

44 of 47

Boosting

  1. Train a weak classifier
  2. Give samples misclassified by the weak classifier higher weight
  3. Repeat (1) on this reweighted data for as many iterations as needed
  4. Final strong classifier: a weighted combination of the existing classifiers
    • classifiers with smaller training errors have higher weights

Illustration by Sirakorn

45 of 47

Boosting

  1. Train a weak classifier
  2. Give samples misclassified by the weak classifier higher weight
  3. Repeat (1) on this reweighted data for as many iterations as needed
  4. Final strong classifier: a weighted combination of the existing classifiers
    • classifiers with smaller training errors have higher weights

Illustration by Sirakorn

Extremely popular:

  • XGBoost
  • LightGBM
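
AdaBoost is the textbook instance of the reweighting loop above; a minimal sketch (XGBoost and LightGBM apply the same idea with gradient-based variants):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    stump = cross_val_score(DecisionTreeClassifier(max_depth=1), X, y, cv=5).mean()
    boosted = cross_val_score(AdaBoostClassifier(n_estimators=100, random_state=0),
                              X, y, cv=5).mean()
    print(f"single stump: {stump:.3f}, boosted stumps: {boosted:.3f}")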

46 of 47

Stacking

Combine the base learners’ predictions with a meta-learner, e.g. majority vote, logistic regression, or a simple NN
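
A minimal sketch with scikit-learn’s StackingClassifier, using logistic regression as the meta-learner (the base-learner choices here are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)

    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("svm", SVC(random_state=0))],
        final_estimator=LogisticRegression())  # the meta-learner
    print(f"stacked accuracy: {cross_val_score(stack, X, y, cv=5).mean():.3f}")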

47 of 47

Data Engineering

Next class: Model Development and Offline Evaluation (Ch. 6)