1 of 14

CSE 163

Model Evaluation

Suh Young Choi

🎶 Listening to: Inception soundtrack

💬 Before Class: What are some ways you think ML might be used in the future?

2 of 14

Announcements

  • THA 4 Peer Reviews out now! (due Feb. 18)

  • Take-Home Assessment 5 out now! (due Feb. 26)

  • A new resubmission cycle opened today and will close next Tuesday
    • Can resubmit Education

  • Suh Young not here next week!
    • TAs will hold a watch party instead
    • Lessons will be posted in advance

2

3 of 14

This Time

  • Categorical Features
  • Assessing Performance
  • Overfitting
  • Model Complexity
    • Hyperparameters
  • When to use ML

Last Time

  • Machine Learning
    • Terminology
    • Types of ML
  • ML Code (scikit-learn)
  • Decision Trees


4 of 14

One-Hot Encoding

  • Most ML models can’t handle categorical features by default
  • Mapping categories to numbers usually doesn’t work, but one-hot encoding does!
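A quick sketch of how this looks with pandas (the column names here are made up for illustration, not from the course dataset):

```python
import pandas as pd

# Toy data with one categorical feature (hypothetical columns)
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})

# get_dummies replaces each categorical column with one indicator
# column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(df)
print(list(encoded.columns))  # ['size', 'color_blue', 'color_red']
```

Each row now has a 1 (True) in exactly one of the color_* columns, which most scikit-learn models can consume directly.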


5 of 14

Overfitting

  • The most important problem in science you’ve never heard of
  • Overfitting: When your model matches the training set so well that it fails to generalize to new data
    • Memorizing answers to Multiple Choice test
  • Tall trees are likely to overfit if you don’t have enough data
    • Can learn very complex boundaries
    • Very few points at the leaves
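To see this concretely, here is a small sketch on synthetic noisy data (not the course dataset): an unrestricted tree memorizes the training labels, noise and all, so it is perfect on the training set but noticeably worse on held-out points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature,
# but 20% of the labels are randomly flipped (noise)
rng = np.random.RandomState(0)
X = rng.uniform(size=(400, 2))
y = (X[:, 0] > 0.5).astype(int)
flip = rng.uniform(size=400) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A tree with no depth limit grows until every training point is
# classified correctly, memorizing the flipped labels too
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr))  # 1.0
print(deep.score(X_te, y_te))  # noticeably lower
```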


6 of 14

Assessing Performance

  • Training is cool, but we want to know the model’s future performance
  • Training data can’t give an accurate evaluation
    • “I got 100% on the practice test I’ve been studying from for 4 hours, therefore I will get 100% on the real exam”
  • Must hold out data called a test set to evaluate at the end
    • Unbiased estimate of performance in the wild

Never ever ever train or make decisions based on your test set.

If you do, it will no longer be a good estimate of future performance.


7 of 14

Code Recap

General ML pipeline for classification tasks with categorical features

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Separate features and labels
features = data.loc[:, data.columns != 'target']
features = pd.get_dummies(features)
labels = data['target']

# Train/test split
feat_train, feat_test, lab_train, lab_test = \
    train_test_split(features, labels, test_size=0.2)

# Create and train model on the train set
model = DecisionTreeClassifier()
model.fit(feat_train, lab_train)

# Predict on test data and score
predictions = model.predict(feat_test)
accuracy_score(lab_test, predictions)

8 of 14

Model Complexity

  • One hyperparameter that controls the complexity of a decision tree is its max depth (or height)
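A sketch of how you might sweep max_depth, using the same kind of synthetic noisy data as before (the data and depths are illustrative): training accuracy keeps climbing as the tree gets deeper, while test accuracy typically peaks at a moderate depth and then degrades as the tree overfits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise
rng = np.random.RandomState(1)
X = rng.uniform(size=(400, 2))
y = (X[:, 0] > 0.5).astype(int)
flip = rng.uniform(size=400) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Compare train vs. test accuracy across depths
# (None means the tree grows until it fits the training set perfectly)
for depth in [1, 3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_tr, y_tr)
    print(depth, model.score(X_tr, y_tr), model.score(X_te, y_te))
```

In practice, you would pick the depth using a validation set or cross-validation, never the test set.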


9 of 14

Visualize the split

[Diagram: feat_train and lab_train are passed to fit, turning the empty model into a trained model]

10 of 14

Visualize the split

[Diagram: feat_train and lab_train are passed to fit, turning the empty model into a trained model; feat_test is then passed to predict to produce predictions, which are compared against lab_test with accuracy_score]

11 of 14

Ground Rules

  • Talking about social impact and ethics in data science can be challenging since its effects can be deeply personal or harmful
    • Many reasonable people have differing opinions on how to draw the line between okay/not okay, there isn’t always an easy yes/no answer.
  • I’ve brought up a lot of questions, but this is not the only set of questions we should ask. I’m just one person with one perspective!

Productive Discussions:

  • Listen with the intention to understand first, forming an opinion only after you fully understand.
  • Take responsibility for the intended and unintended effects of your words and actions on others.
  • Mindfully respond to others’ ideas by acknowledging the unique value of each contribution.


12 of 14

Group Work:

Best Practices

When you first start working with your group:

  • Introduce yourself!
  • If possible, angle one of your screens so that everyone can discuss together

Tips:

  • Start by making sure everyone agrees to work on the same problem
  • Make sure everyone gets a chance to contribute!
  • Ask if everyone agrees and periodically ask each other questions!
  • Call TAs over for help if you need any!


13 of 14

Discussion (Canvas)

  1. Consider the case of our credit card churn predictor. Suppose we were using it in our first use case of predicting whether a current customer is likely to churn, and if they are, providing them with special offers to incentivize them to stay.

  2. Consider the case of our credit card churn predictor. Suppose we were using it in our second use case of predicting whether a new customer is likely to churn or not, and if they are, not providing them with a credit card in the first place.

Would you endorse using either system? Why or why not? Justify what concerns you might have about either system or why you think some potential concerns do not outweigh the benefit of the model.


14 of 14

Before Next Time

  • Turn in Reading Assignment 4
  • Work on THA 4 Peer Reviews
  • Continue working on EDA / Milestone

Next Time

  • Geospatial Data
