sklearn, Feature Engineering
Building models in code. Transforming data to improve model performance.
LECTURE 13
Logistics for next Monday to Wednesday
Only one lab next week (Lab 8), due Saturday 7/22
Project A2 will be released on Monday, due the following Monday 7/24
Midterm logistics
The Midterm is next Thursday, 7/20, 5-7 pm
See this Ed post for detailed logistics.
Scope: Lectures 1-13 (everything up until and including today’s lecture)
Midterm prep session run by TAs tomorrow, 7/14, 10 am to 12 pm in Evans 10
Ways to prepare: review the lecture notes and assignments, and take past semesters' practice exams
Goals for this Lecture
Last few lectures: underlying theory of modeling
This lecture: putting things into practice!
Agenda
- Implementing Models in Code
- sklearn
- Feature Engineering
- One-Hot Encoding
- Polynomial Features
- Complexity and Overfitting
Implementing Models in Code
Demo: penguins
We have the dataset penguins.
We want to predict a penguin’s bill depth given its flipper length and body mass.
(Bill depth is hard to measure without getting bitten.)
Performing ordinary least squares in Python
In Lecture 11, we derived the OLS estimate for the optimal model parameters:
θ̂ = (XᵀX)⁻¹XᵀY
In Python:
Transpose: matrix.T
Inverse: np.linalg.inv(matrix)
Matrix multiplication: matrix_1 @ matrix_2
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
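For reference, a runnable end-to-end sketch. The data here is synthetic; the names flipper_length and body_mass only mirror the penguins demo, and the coefficients are made up:

import numpy as np

rng = np.random.default_rng(42)
n = 100
flipper_length = rng.uniform(170, 230, n)   # illustrative values, mm
body_mass = rng.uniform(2700, 6300, n)      # illustrative values, g

# Design matrix: a bias column of ones plus the two features.
X = np.column_stack([np.ones(n), flipper_length, body_mass])
Y = 10 + 0.02 * flipper_length + 0.001 * body_mass + rng.normal(0, 0.5, n)

theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y  # OLS estimate
Y_hat = X @ theta_hat                         # predictions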
sklearn
sklearn: a standard library for model creation
So far, we have been doing the "heavy lifting" of model creation ourselves: via calculus, ordinary least squares, or gradient descent.
In research and industry, it is more common to rely on data science libraries for creating and training models. In Data 100, we will use Scikit-Learn, commonly called sklearn.
import sklearn.linear_model as lm   # sklearn's linear models live in this submodule
my_model = lm.LinearRegression()    # initialize a model instance
my_model.fit(X, y)                  # fit to the training data
my_model.predict(X)                 # make predictions
sklearn uses an object-oriented programming paradigm. Different types of models are defined as their own classes. To use a model, we initialize an instance of the model class.
The sklearn workflow
At a high level, there are three steps to creating an sklearn model:
1. Initialize a new model instance (make a "copy" of the model template): my_model = lm.LinearRegression()
2. Fit the model to the training data (save the optimal model parameters): my_model.fit(X, y)
3. Use the fitted model to make predictions (the fitted model outputs predictions for y): my_model.predict(X)
To extract the fitted parameters: my_model.coef_ and my_model.intercept_
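Putting the three steps together on the penguins demo. A minimal sketch, assuming the copy of penguins that ships with seaborn:

import seaborn as sns
import sklearn.linear_model as lm

penguins = sns.load_dataset("penguins").dropna()

X = penguins[["flipper_length_mm", "body_mass_g"]]
y = penguins["bill_depth_mm"]

my_model = lm.LinearRegression()    # 1. initialize a model instance
my_model.fit(X, y)                  # 2. fit to the training data
predictions = my_model.predict(X)   # 3. make predictions

print(my_model.intercept_, my_model.coef_)  # the fitted parameters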
Feature Engineering
Transforming features
Two observations: a model's performance depends on the features it is given, and we are free to transform raw data into new features.
Putting these ideas together:
Feature engineering = transforming features to improve model performance
Feature engineering
Feature engineering is the process of transforming raw features into more informative features for use in modeling
Feature engineering allows us to:
- Capture domain knowledge about the data
- Express non-linear relationships using linear models (e.g., polynomial features)
- Use non-numeric features in models (e.g., one-hot encoding)
Feature functions
A feature function describes the transformations we apply to raw features in the dataset to create transformed features. Often, the dimension of the featurized dataset increases.
Example: a feature function Φ that adds a squared feature to the design matrix: each row of raw features [x] becomes [x, x²].
Linear models trained on transformed data are sometimes written using the symbol Φ instead of X:
ŷ = Φθ
Φ is shorthand for "the design matrix after feature engineering".
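A minimal sketch of a feature function in code. The column name hp is illustrative, borrowed from the vehicles example later in this lecture:

import pandas as pd

def phi(df):
    """Feature function: append a squared copy of the hp column."""
    out = df.copy()
    out["hp^2"] = df["hp"] ** 2
    return out

raw = pd.DataFrame({"hp": [100, 150, 200]})
Phi = phi(raw)  # the design matrix after feature engineering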
One-Hot Encoding
Regression using non-numeric features
Think back to the tips dataset we used when first exploring regression
Before, we were limited to only using numeric features in a model – total_bill and size
By performing feature engineering, we can incorporate non-numeric features like the day of the week
One-hot encoding
One-hot encoding is a feature engineering technique to transform non-numeric data into numeric features for modeling
Original data (day) and its one-hot encoding:

day         Sunday  Thursday  Saturday
Sunday         1       0         0
Sunday         1       0         0
Thursday       0       1         0
Thursday       0       1         0
Saturday       0       0         1
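In code, pandas can produce this encoding directly. A sketch reproducing the toy table above:

import pandas as pd

df = pd.DataFrame({"day": ["Sunday", "Sunday", "Thursday", "Thursday", "Saturday"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
ohe = pd.get_dummies(df["day"], dtype=int)
print(ohe)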
Regression using the one-hot encoding
The one-hot encoded features can then be used in the design matrix to train a model
Combining the raw features (total_bill, size) with the one-hot encoded day features gives the model:
ŷ = θ₁(total_bill) + θ₂(size) + θ₃(day_Thur) + θ₄(day_Fri) + θ₅(day_Sat) + θ₆(day_Sun)
In shorthand: ŷ = Φθ
Regression using the one-hot encoding
Using sklearn to fit the new model:
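A sketch of one way to do this, assuming the copy of tips that ships with seaborn and using pd.get_dummies for the encoding:

import pandas as pd
import seaborn as sns
import sklearn.linear_model as lm

tips = sns.load_dataset("tips")

# Numeric features plus one one-hot column per day.
X = pd.concat([tips[["total_bill", "size"]],
               pd.get_dummies(tips["day"], dtype=int)], axis=1)
y = tips["tip"]

# fit_intercept=False: the one-hot columns already act as per-day intercepts
# (see the discussion of linear dependence below).
model = lm.LinearRegression(fit_intercept=False)
model.fit(X, y)
print(model.coef_)  # one coefficient per column, including each day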
Interpretation of the fitted coefficient on day_Fri: how much the fact that it is Friday impacts the predicted tip.
What tip would the model predict for a party with size 3 and a total bill of $50 eating on a Friday?
Regression using the one-hot encoding
Party of 3, $50 total bill, eating on a Friday:
ŷ = θ₁(50) + θ₂(3) + θ₄(1), since day_Fri = 1 and every other day column is 0.
Why did we not include an intercept term in the one-hot encoded model?
One-hot encode wisely!
Any set of one-hot encoded columns will always sum to a column of all ones.
If we also include a bias column in the design matrix, there is linear dependence among the columns: the bias column is a linear combination of the one-hot encoded columns. Then XᵀX is not invertible, and our OLS estimate θ̂ = (XᵀX)⁻¹XᵀY fails.
How to resolve? Omit one of the one-hot encoded columns, or do not include an intercept term.
Adjusted design matrices: either drop one of the one-hot encoded columns and keep the intercept, or keep all one-hot encoded columns and drop the intercept.
We still retain the same information – in both approaches, the omitted column is simply a linear combination of the remaining columns
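In pandas, the first approach is one keyword away. A sketch, again using the seaborn tips data:

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

# Drop the first category's column; an intercept can then be kept safely.
ohe = pd.get_dummies(tips["day"], dtype=int, drop_first=True)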
Polynomial Features
Accounting for curvature
We’ve seen a few cases now where models with linear features have performed poorly on datasets with a clear non-linear curve.
When our model uses only a single linear feature (hp), it cannot capture non-linearity in the relationship (MSE: 23.94).
Solution: incorporate a non-linear feature!
Polynomial features
We create a new feature: the square of the hp
This is still a linear model. Even though there are non-linear features, the model is linear with respect to the parameters θ:
ŷ = θ₀ + θ₁(hp) + θ₂(hp²)
Degree of model: 2. MSE: 18.98.
Looking a lot better: our predictions capture the curvature of the data.
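A sketch of fitting the degree-2 model; the hp and mpg arrays here are illustrative stand-ins for the vehicles data:

import numpy as np
import sklearn.linear_model as lm

hp = np.array([60, 95, 130, 190, 225], dtype=float)
mpg = np.array([35, 30, 24, 17, 14], dtype=float)

# Transformed design matrix: [hp, hp^2]. The model stays linear in theta.
X = np.column_stack([hp, hp ** 2])

model = lm.LinearRegression()  # the intercept plays the role of theta_0
model.fit(X, mpg)
print(model.intercept_, model.coef_)

For higher degrees, sklearn.preprocessing.PolynomialFeatures can generate these columns automatically.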
Polynomial features
What if we add more polynomial features?
MSE continues to decrease with each additional polynomial term
Complexity and Overfitting
How far can we take this?
Model complexity
As we continue to add more and more polynomial features, the MSE continues to decrease
Equivalently: as the model complexity increases, its training error decreases
(Plots: training MSE vs. polynomial degree in our experiment using vehicles, and the general trend for an arbitrary dataset.)
Seems like a good deal?
An extreme example: perfect polynomial fits
Math fact: given N data points with distinct x-values, we can always find a polynomial of degree N−1 that passes through all of them.
For example, there always exists a degree-4 polynomial curve that can perfectly model a dataset of 5 datapoints
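A quick numerical check of this fact, with illustrative values:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 0.5, 3.0, 1.0, 4.0])

coeffs = np.polyfit(x, y, deg=len(x) - 1)     # degree 4 for 5 points
assert np.allclose(np.polyval(coeffs, x), y)  # passes through every point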
Model performance on unseen data
Our vehicle models from before considered a somewhat artificial scenario – we trained the models on the entire dataset, then evaluated their ability to make predictions on this same dataset
More realistic situation: we train the model on a sample from the population, then use it to make predictions on data it didn’t encounter during training
Model performance on unseen data
New (more realistic) example: we train on a sample of just 6 points from vehicles.
We may be tempted to make a highly complex model (e.g., degree 5).
The complex model makes perfect predictions on the training data… but performs horribly on the rest of the population!
Model performance on unseen data
What went wrong? The complex model overfit the training data: it captured the noise in the specific points it was trained on rather than the underlying pattern.
This is a problem: we want models that are generalizable to "unseen" data.
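A sketch of the phenomenon with synthetic data; x and y are stand-ins for hp and mpg, and the numbers are made up:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200)            # the "population"
y = 2 * x + rng.normal(0, 0.5, 200)

idx = rng.choice(200, size=6, replace=False)  # train on just 6 points
coeffs = np.polyfit(x[idx], y[idx], deg=5)    # 6 parameters, 6 points

train_mse = np.mean((y[idx] - np.polyval(coeffs, x[idx])) ** 2)  # essentially 0
pop_mse = np.mean((y - np.polyval(coeffs, x)) ** 2)              # typically far larger
print(train_mse, pop_mse)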
Model variance
Complex models are sensitive to the specific dataset used to train them: they have high variance, because the fitted model varies depending on which datapoints it is trained on.
Our degree-5 model varies erratically when we fit it to different samples of 6 points from vehicles.
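Continuing the sketch above, refitting on different 6-point samples shows how much the fitted curve moves around:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 200)
y = 2 * x + rng.normal(0, 0.5, 200)

# The fitted degree-5 coefficients swing wildly from sample to sample.
for _ in range(3):
    idx = rng.choice(200, size=6, replace=False)
    print(np.polyfit(x[idx], y[idx], deg=5))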
Error, variance, and complexity
We face a dilemma: as model complexity increases, training error decreases, but model variance increases.
Our goal: find the "sweet spot" of complexity that balances low training error against low model variance.
Stay tuned for Lecture 15!
——End of Midterm Content——
Best of luck studying!