
Lecture 13: sklearn, Feature Engineering

Building models in code. Transforming data to improve model performance.

Data 100 Summer 2023 @ UC Berkeley

Bella Crouch and Dominic Liu

Content credit: Acknowledgements


Logistics for next Monday to Wednesday

  • Monday, 7/17: pre-recorded guest lecture; discussions will run as usual
    • Will be posted to the course site Monday morning; watch it in your own time
    • Content is needed to complete Project A2!
    • Not in scope for the midterm
  • Tuesday, 7/18: live lecture on Regularization and Cross-Validation
    • Not in scope for the midterm
  • Wednesday, 7/19: no lecture; discussions will run as usual

Only one lab next week (Lab 8), due Saturday 7/22.

Project A2 will be released on Monday and is due the following Monday, 7/24.

Midterm logistics

The Midterm is next Thursday, 7/20, 5-7 pm

See this Ed post for detailed logistics.

  • You will receive a room and seating assignment by email next week.
  • DSP students will be contacted separately about exam timings/location.
  • Fill out the accommodations form if you need a left-handed seat.


Scope: Lectures 1-13 (everything up to and including today's lecture)

  • Mix of multiple choice, math, and coding questions
  • No calculators or notes allowed, but you will be provided with a reference sheet

5

Midterm prep session run by TAs tomorrow, 7/14, 10 am to 12 pm in Evans 10

  • Focus will be on problem-solving and walkthroughs of past exam questions
  • Normal exam prep and catch-up sections held at this time are canceled
  • Session will be recorded and posted to Ed by Saturday

Ways to prepare: review lecture notes, revisit assignments, and sit past semesters' practice exams


Goals for this Lecture

Last few lectures: underlying theory of modeling

This lecture: putting things into practice!

  • Introducing sklearn, a useful Python library for building and fitting models
  • Techniques for selecting features to improve model performance


Agenda

  • Implementing Models in Code
  • sklearn
  • Feature Engineering
  • One-Hot Encoding
  • Polynomial Features
  • Complexity and Overfitting


Implementing Models in Code


Demo: penguins


We have the dataset penguins.

We want to predict a penguin’s bill depth given its flipper length and body mass.

(Bill depth is hard to measure without getting bitten.)


Performing ordinary least squares in Python

In Lecture 11, we derived the OLS estimate for the optimal model parameters:

$\hat{\theta} = (X^\top X)^{-1} X^\top Y$

In Python:

  • Transpose: matrix.T
  • Inverse: np.linalg.inv(matrix)
  • Matrix multiplication: matrix_1 @ matrix_2

theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
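A minimal sketch of this computation on the penguins demo (assuming the data comes from seaborn's load_dataset("penguins"); the actual demo setup may differ):

import numpy as np
import seaborn as sns

# Assumption: seaborn's penguins dataset stands in for the lecture demo data
penguins = sns.load_dataset("penguins").dropna()

# Design matrix: an intercept column plus the two raw features
X = np.hstack([
    np.ones((len(penguins), 1)),
    penguins[["flipper_length_mm", "body_mass_g"]].to_numpy(),
])
Y = penguins["bill_depth_mm"].to_numpy()

# OLS estimate of the optimal parameters
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
print(theta_hat)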


sklearn


sklearn: a standard library for model creation

So far, we have been doing the “heavy lifting” of model creation ourselves – via calculus, ordinary least squares, or gradient descent

In research and industry, it is more common to rely on data science libraries for creating and training models. In Data 100, we will use Scikit-Learn, commonly called sklearn

import sklearn.linear_model as lm

my_model = lm.LinearRegression()
my_model.fit(X, y)
my_model.predict(X)


sklearn uses an object-oriented programming paradigm. Different types of models are defined as their own classes. To use a model, we initialize an instance of the model class.

  • Don’t worry if you are not familiar with objects in Python. You can think of sklearn as allowing you to “copy” an existing template of a useful model.


The sklearn workflow

At a high level, there are three steps to creating an sklearn model:

1. Initialize a new model instance: make a "copy" of the model template.
2. Fit the model to the training data: save the optimal model parameters.
3. Use the fitted model to make predictions: the fitted model outputs predictions for y.
In code:
my_model = lm.LinearRegression()

my_model.fit(X, y)

my_model.predict(X)

To extract the fitted parameters: my_model.coef_ and my_model.intercept_
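For instance, a minimal end-to-end sketch on the penguins demo (again assuming the seaborn penguins dataset; the lecture demo's exact setup may differ):

import seaborn as sns
import sklearn.linear_model as lm

# Assumption: seaborn's penguins dataset stands in for the lecture demo data
penguins = sns.load_dataset("penguins").dropna()
X = penguins[["flipper_length_mm", "body_mass_g"]]
y = penguins["bill_depth_mm"]

my_model = lm.LinearRegression()    # 1. initialize a new model instance
my_model.fit(X, y)                  # 2. fit the model to the training data
predictions = my_model.predict(X)   # 3. use the fitted model to make predictions

print(my_model.intercept_, my_model.coef_)   # the fitted parameters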


Feature Engineering


Transforming features

Two observations:

  • At the end of the Visualization lecture, we looked at transforming variables – we found that applying a transformation could help linearize a dataset
  • In our work on modeling, we saw that linear modeling works best when our dataset has linear relationships


Putting ideas together:

Feature engineering = transforming features to improve model performance


Feature engineering

Feature engineering is the process of transforming raw features into more informative features for use in modeling

Allows us to:

  • Capture domain knowledge
  • Express non-linear relationships using linear models
  • Use non-numeric features in models


Feature functions

A feature function describes the transformations we apply to raw features in the dataset to create transformed features. Often, the dimension of the featurized dataset increases.

Example: a feature function Φ that adds a squared feature to the design matrix. After applying Φ, the dataset of raw features becomes a featurized dataset with one additional column.
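As a small illustration (a sketch; the function and variable names here are hypothetical):

import numpy as np

def phi(X):
    """Feature function: append the square of the single raw feature as a new column."""
    return np.hstack([X, X ** 2])

X_raw = np.array([[1.0], [2.0], [3.0]])
print(phi(X_raw))
# [[1. 1.]
#  [2. 4.]
#  [3. 9.]]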


Linear models trained on transformed data are sometimes written using the symbol Φ instead of X, for example $\hat{Y} = \Phi \theta$ in place of $\hat{Y} = X \theta$. Here Φ is shorthand for "the design matrix after feature engineering."


One-Hot Encoding


Regression using non-numeric features

Think back to the tips dataset we used when first exploring regression


Before, we were limited to only using numeric features in a model – total_bill and size

By performing feature engineering, we can incorporate non-numeric features like the day of the week


One-hot encoding

One-hot encoding is a feature engineering technique to transform non-numeric data into numeric features for modeling

  • Each category of a categorical variable gets its own feature
    • Value = 1 if a row belongs to the category
    • Value = 0 otherwise

Original data          One-hot encoding
day                    Sunday   Thursday   Saturday
Sunday                 1        0          0
Sunday                 1        0          0
Thursday               0        1          0
Thursday               0        1          0
Saturday               0        0          1
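One way to produce such an encoding in code is sklearn's OneHotEncoder (a sketch; the lecture demo may use a different approach, such as pd.get_dummies):

import seaborn as sns
from sklearn.preprocessing import OneHotEncoder

tips = sns.load_dataset("tips")

# Learn one column per category of "day"; each row gets a 1 in its day's column
ohe = OneHotEncoder()
day_features = ohe.fit_transform(tips[["day"]]).toarray()

print(ohe.get_feature_names_out())   # generated column names, e.g. day_Fri, day_Sat, ...
print(day_features[:5])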


Regression using the one-hot encoding

The one-hot encoded features can then be used in the design matrix to train a model

The design matrix now contains both the raw features (total_bill, size) and the one-hot encoded day features. In shorthand: $\hat{Y} = \Phi \theta$.


Using sklearn to fit the new model:
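A sketch of what this fit might look like (assuming the seaborn tips dataset and pd.get_dummies for the encoding; the lecture demo may differ):

import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.linear_model as lm

tips = sns.load_dataset("tips")

# Numeric features alongside the one-hot encoded day columns
X = pd.concat([tips[["total_bill", "size"]], pd.get_dummies(tips["day"], dtype=int)], axis=1)
y = tips["tip"]

# No separate intercept: the one-hot columns already sum to a column of ones (see below)
model = lm.LinearRegression(fit_intercept=False)
model.fit(X, y)
print(dict(zip(X.columns, np.round(model.coef_, 3))))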


Interpretation: the coefficient on the Friday column tells us how much the fact that it is Friday impacts the predicted tip.


What tip would the model predict for a party with size 3 and a total bill of $50 eating on a Friday?



Party of 3, $50 total bill, eating on a Friday: substitute size = 3, total_bill = 50, and a 1 in the Friday column (0 in the other day columns) into the fitted model to obtain the predicted tip.


Why did we not include an intercept term in the one-hot encoded model?



One-hot encode wisely!

Any set of one-hot encoded columns will always sum to a column of all ones.

If we also include a bias (intercept) column in the design matrix, there will be linear dependence among the columns: the bias column is a linear combination of the one-hot encoded columns. As a result, $X^\top X$ is not invertible, and our OLS estimate fails.

How to resolve? Omit one of the one-hot encoded columns, or do not include an intercept term.


Adjusted design matrices: either keep the intercept column and drop one of the one-hot encoded columns, or keep all one-hot encoded columns and drop the intercept.

We still retain the same information – in both approaches, the omitted column is simply a linear combination of the remaining columns
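A sketch of how either option can be expressed in sklearn:

import sklearn.linear_model as lm
from sklearn.preprocessing import OneHotEncoder

# Option 1: drop one of the one-hot columns so an intercept can safely be kept
ohe_drop_one = OneHotEncoder(drop="first")

# Option 2: keep every one-hot column, but fit the model without an intercept
model_no_intercept = lm.LinearRegression(fit_intercept=False)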


Polynomial Features


Accounting for curvature

We’ve seen a few cases now where models with linear features have performed poorly on datasets with a clear non-linear curve.

When our model uses only a single linear feature (hp), it cannot capture non-linearity in the relationship; the training MSE is 23.94.

Solution: incorporate a non-linear feature!


Polynomial features

We create a new feature: the square of the hp

This is still a linear model: even though the features are non-linear, the model is linear with respect to the parameters θ.

Degree of model: 2. Training MSE: 18.98.

Looking a lot better: our predictions capture the curvature of the data.
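A sketch of adding the squared feature and refitting (assuming the vehicles data is seaborn's mpg dataset with a horsepower column; the lecture demo's data source may differ):

import seaborn as sns
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error

# Assumption: seaborn's mpg dataset stands in for the vehicles data
vehicles = sns.load_dataset("mpg").dropna()

# Feature function: keep horsepower and add its square as a second column
X = vehicles[["horsepower"]].copy()
X["horsepower^2"] = X["horsepower"] ** 2
y = vehicles["mpg"]

model = lm.LinearRegression()
model.fit(X, y)
print(mean_squared_error(y, model.predict(X)))   # training MSE of the degree-2 model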


What if we add more polynomial features?


MSE continues to decrease with each additional polynomial term
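One way to see this trend (a sketch, under the same assumption that the data comes from seaborn's mpg dataset):

import seaborn as sns
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

vehicles = sns.load_dataset("mpg").dropna()
X, y = vehicles[["horsepower"]], vehicles["mpg"]

# Training MSE keeps dropping as higher-degree polynomial features are added
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), lm.LinearRegression())
    model.fit(X, y)
    print(degree, mean_squared_error(y, model.predict(X)))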


Complexity and Overfitting


How far can we take this?


Model complexity

As we continue to add more and more polynomial features, the MSE continues to decrease

Equivalently: as the model complexity increases, its training error decreases

This holds both in our experiment using vehicles and as a general trend for an arbitrary dataset. Seems like a good deal?

An extreme example: perfect polynomial fits

Math fact: given N data points with distinct x-values, we can always find a polynomial of degree N-1 that passes through all of them.

For example, there always exists a degree-4 polynomial curve that can perfectly model a dataset of 5 data points.
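A quick check of this fact (a sketch with made-up points):

import numpy as np

# Five points with distinct x-values (made-up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, -1.0, 0.5, 3.0, -2.0])

# A degree-4 polynomial passes through all five points exactly
coeffs = np.polyfit(x, y, deg=4)
print(np.allclose(np.polyval(coeffs, x), y))   # True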


Model performance on unseen data

Our vehicle models from before considered a somewhat artificial scenario – we trained the models on the entire dataset, then evaluated their ability to make predictions on this same dataset

More realistic situation: we train the model on a sample from the population, then use it to make predictions on data it didn’t encounter during training



New (more realistic) example:

  • We are given a training dataset of just 6 datapoints
  • We want to train a model to then make predictions on a different set of points

We may be tempted to make a highly complex model (e.g. degree 5).

The complex model makes perfect predictions on the training data… but performs horribly on the rest of the population!


What went wrong?

  • The complex model overfit to the training data – it essentially “memorized” these 6 training points
  • The overfitted model does not generalize well to data it did not encounter during training

This is a problem: we want models that are generalizable to “unseen” data


Model variance

Complex models are sensitive to the specific dataset used to train them: they have high variance, because the fitted model changes substantially depending on which datapoints are used for training.

Our degree-5 model varies erratically when we fit it to different samples of 6 points from vehicles.
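A sketch of this experiment (again assuming seaborn's mpg dataset stands in for vehicles):

import numpy as np
import seaborn as sns

# Assumption: seaborn's mpg dataset stands in for the vehicles data
vehicles = sns.load_dataset("mpg").dropna()

# Fit a degree-5 polynomial to three different random samples of 6 points;
# the fitted coefficients swing wildly from sample to sample
for seed in range(3):
    sample = vehicles.sample(6, random_state=seed)
    coeffs = np.polyfit(sample["horsepower"], sample["mpg"], deg=5)
    print(np.round(coeffs, 4))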


Error, variance, and complexity

We face a dilemma:

  • We know that we can decrease training error by increasing model complexity
  • However, models that are too complex start to overfit and do not generalize well: their high variance means they perform poorly on new data

Our goal: find the "sweet spot" between model complexity and generalization.

Stay tuned for Lecture 15!


——End of Midterm Content——

Best of luck studying!

