1 of 14

CSE 163

Model Evaluation

Suh Young Choi

🎶 Listening to: Inception soundtrack

💬 Before Class: What are some ways you think ML might be used in the future?

2 of 14

Announcements

  • THA 4 Peer Reviews out now! (due Feb. 18)

  • Take-Home Assessment 5 out now! (due Feb. 26)

  • A new resubmission cycle opened today and will close next Tuesday
    • Can resubmit Education

  • Suh Young not here next week!
    • TAs will hold a watch party instead
    • Lessons will be posted in advance

2

3 of 14

This Time

  • Categorical Features
  • Assessing Performance
  • Overfitting
  • Model Complexity
    • Hyperparameters
  • When to use ML

Last Time

  • Machine Learning
    • Terminology
    • Types of ML
  • ML Code (scikit-learn)
  • Decision Trees


4 of 14

One-Hot Encoding

  • Most ML models can’t handle categorical features by default
  • Mapping categories to numbers usually doesn’t work, but one-hot encoding does!
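A quick sketch of how this looks with pandas (the column names here are made up for illustration, not from the course dataset):

```python
import pandas as pd

# Toy data with one categorical feature (hypothetical columns)
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})

# get_dummies replaces each categorical column with one indicator
# column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(df)
print(list(encoded.columns))  # ['size', 'color_blue', 'color_red']
```

Each row now has a 1 (True) in exactly one of the color_* columns, which most scikit-learn models can consume directly.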


5 of 14

Overfitting

  • The most important problem in science you’ve never heard of
  • Overfitting: When your model matches the training set so well that it fails to generalize to new data
    • Memorizing answers to Multiple Choice test
  • Tall trees are likely to overfit if you don’t have enough data
    • Can learn very complex boundaries
    • Very few points at the leaves
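To see this concretely, here is a small sketch on synthetic noisy data (not the course dataset): an unrestricted tree memorizes the training labels, noise and all, so it is perfect on the training set but noticeably worse on held-out points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature,
# but 20% of the labels are randomly flipped (noise)
rng = np.random.RandomState(0)
X = rng.uniform(size=(400, 2))
y = (X[:, 0] > 0.5).astype(int)
flip = rng.uniform(size=400) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A tree with no depth limit grows until every training point is
# classified correctly, memorizing the flipped labels too
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr))  # 1.0
print(deep.score(X_te, y_te))  # noticeably lower
```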


6 of 14

Assessing Performance

  • Training is cool, but we want to know the model’s future performance
  • Training data can’t give an accurate evaluation
    • “I got 100% on the practice test I’ve been studying from for 4 hours, therefore I will get 100% on the real exam”
  • Must hold out data called a test set to evaluate at the end
    • Unbiased estimate of performance in the wild

Never ever ever train or make decisions based on your test set.

If you do, it will no longer be a good estimate of future performance.


7 of 14

Code Recap

General ML pipeline for classification tasks with categorical features

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Separate features and labels
features = data.loc[:, data.columns != 'target']
features = pd.get_dummies(features)
labels = data['target']

# Train/test split
feat_train, feat_test, lab_train, lab_test = \
    train_test_split(features, labels, test_size=0.2)

# Create and train model on the train set
model = DecisionTreeClassifier()
model.fit(feat_train, lab_train)

# Predict on test data and score
predictions = model.predict(feat_test)
accuracy_score(lab_test, predictions)

8 of 14

Model Complexity

  • One hyperparameter that controls the complexity of a decision tree is its max depth (or height)
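A sketch of how you might sweep max_depth, using the same kind of synthetic noisy data as before (the data and depths are illustrative): training accuracy keeps climbing as the tree gets deeper, while test accuracy typically peaks at a moderate depth and then degrades as the tree overfits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise
rng = np.random.RandomState(1)
X = rng.uniform(size=(400, 2))
y = (X[:, 0] > 0.5).astype(int)
flip = rng.uniform(size=400) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Compare train vs. test accuracy across depths
# (None means the tree grows until it fits the training set perfectly)
for depth in [1, 3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_tr, y_tr)
    print(depth, model.score(X_tr, y_tr), model.score(X_te, y_te))
```

In practice, you would pick the depth using a validation set or cross-validation, never the test set.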


9 of 14

Visualize the split

[Diagram: feat_train and lab_train are passed to fit, turning the empty model into a trained model]

10 of 14

Visualize the split

[Diagram: feat_train and lab_train are passed to fit, turning the empty model into a trained model; feat_test is then passed to predict to produce predictions, which are compared against lab_test with accuracy_score]

11 of 14

Ground Rules

  • Talking about social impact and ethics in data science can be challenging since its effects can be deeply personal or harmful
    • Many reasonable people have differing opinions on how to draw the line between okay/not okay, there isn’t always an easy yes/no answer.
  • I’ve brought up a lot of questions, but this is not the only set of questions we should ask. I’m just one person with one perspective!

Productive Discussions:

  • Listen with the intention to understand first, forming an opinion only after you fully understand.
  • Take responsibility for the intended and unintended effects of your words and actions on others.
  • Mindfully respond to others’ ideas by acknowledging the unique value of each contribution.


12 of 14

Group Work:

Best Practices

When you first start working with your group:

  • Introduce yourself!
  • If possible, angle one of your screens so that everyone can discuss together

Tips:

  • Start by making sure everyone agrees to work on the same problem
  • Make sure everyone gets a chance to contribute!
  • Ask if everyone agrees and periodically ask each other questions!
  • Call TAs over for help if you need any!


13 of 14

Discussion (Canvas)

  1. Consider the case of our credit card churn predictor. Suppose we were using it in our first use case of predicting whether a current customer is likely to churn, and if they are, providing them with special offers to incentivize them to stay.

  2. Consider the case of our credit card churn predictor. Suppose we were using it in our second use case of predicting whether a new customer is likely to churn or not, and if they are, not providing them with a credit card in the first place.

Would you endorse using either system? Why or why not? Justify what concerns you might have about either system or why you think some potential concerns do not outweigh the benefit of the model.


14 of 14

Before Next Time

  • Turn in Reading Assignment 4
  • Work on THA 4 Peer Reviews
  • Continue working on EDA / Milestone

Next Time

  • Geospatial Data
