Data 100: Feature Engineering
Slides by:
Joseph E. Gonzalez, Deb Nolan, John DeNero, & Josh Hug
jegonzal@berkeley.edu
Recap of Linear Models
Feature Engineering
Feature Functions
Domain (DataFrame):
uid | age | state | hasBought | review |
0 | 32 | NY | True | “Meh.” |
42 | 50 | WA | True | “Worked out of the box …” |
57 | 16 | CA | NULL | “Hella tots lit...” |
Entirely Quantitative and Transformed Values (different number of features):
AK | … | NY | … | WY | age | age^2 | hasBought missing |
0 | … | 1 | … | 0 | 32 | 32^2 | 0 |
0 | … | 0 | … | 0 | 50 | 50^2 | 0 |
0 | … | 0 | … | 0 | 16 | 16^2 | 1 |
(phi)ture function
Designing the feature functions is a big part of machine learning and data science.
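A minimal sketch of what such a feature function might look like for the records on this slide. The name `phi`, the dict-based record layout, and the five-state `STATES` list are assumptions made for illustration; a real encoding would cover every state.

```python
# Hypothetical, shortened list of states (assumption: a real one has all of them).
STATES = ["AK", "CA", "NY", "WA", "WY"]

def phi(record):
    """Map one raw record (a dict) to a list of purely numeric features."""
    # One indicator per state: 1.0 for the record's state, 0.0 elsewhere.
    one_hot = [1.0 if record["state"] == s else 0.0 for s in STATES]
    age = float(record["age"])
    # Indicator that is 1.0 when hasBought is missing (None stands in for NULL).
    has_bought_missing = 1.0 if record["hasBought"] is None else 0.0
    return one_hot + [age, age ** 2, has_bought_missing]

records = [
    {"uid": 0, "age": 32, "state": "NY", "hasBought": True, "review": "Meh."},
    {"uid": 42, "age": 50, "state": "WA", "hasBought": True, "review": "Worked out of the box"},
    {"uid": 57, "age": 16, "state": "CA", "hasBought": None, "review": "Hella tots lit"},
]
features = [phi(r) for r in records]
```

Each record becomes a row of 8 numbers, regardless of how many columns the original table had. The review text is ignored in this sketch.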
Feature Functions Examples
The Constant Feature Function
[Figure: a column of ones is added to the n × d feature matrix, giving a matrix with n rows and p = 1 + d columns.]
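As a rough illustration, the constant feature can be added by appending a 1 to every row of the feature matrix. The helper names and the parameter values below are hypothetical, and placing the constant in the last column is just one convention.

```python
def add_constant_feature(X):
    """Return a copy of X (a list of rows) with a constant 1.0 appended to each row."""
    return [list(row) + [1.0] for row in X]

def predict(row, theta):
    """Linear prediction: dot product of a feature row with the parameters."""
    return sum(x * t for x, t in zip(row, theta))

X = [[32.0], [50.0], [16.0]]       # n = 3 rows, d = 1 feature (age)
Phi = add_constant_feature(X)      # n = 3 rows, p = 1 + d = 2 features
theta = [0.5, 2.0]                 # made-up parameters; the last one multiplies the constant
predictions = [predict(row, theta) for row in Phi]
```

Because the last feature is always 1, the last parameter acts as an intercept term.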
Modeling Non-linear Relationships
Feature Functions:
Note that feature functions don’t depend on parameters.
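The sketch below illustrates the point with a hypothetical set of feature functions (a constant, x, x^2, and sin(x)); the ones used on the original slide are not recoverable from this text. The features are non-linear in x, but `phi` takes only x as input, so the model stays linear in the parameters.

```python
import math

def phi(x):
    """Hypothetical non-linear features of a scalar x; no parameters involved."""
    return [1.0, x, x ** 2, math.sin(x)]

def model(x, theta):
    """Prediction is a dot product, so it is linear in theta."""
    return sum(f * t for f, t in zip(phi(x), theta))

x = 2.0
theta_a = [1.0, 0.0, 0.0, 0.0]
theta_b = [0.0, 1.0, 1.0, 0.0]
theta_sum = [a + b for a, b in zip(theta_a, theta_b)]

# Linearity in the parameters: f(x; a + b) equals f(x; a) + f(x; b).
lhs = model(x, theta_sum)
rhs = model(x, theta_a) + model(x, theta_b)
```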
Encoding Categorical and Text Features
What if x is a record with numbers, text, booleans, etc.?
Predict rating (Y) from review information (X):
uid | age | state | hasBought | review | rating |
0 | 32 | NY | True | “Meh.” | 2.0 |
42 | 50 | WA | True | “Worked out of the box …” | 4.5 |
57 | 16 | CA | NULL | “Hella tots lit yo ...” | 4.1 |
Schema:
RatingsData(uid INTEGER, age FLOAT, state STRING, hasBought BOOLEAN, review STRING, rating FLOAT)
As a Linear Model?
Can I use X and Y directly in a linear model?
Basic Transformations
One Hot Encoding (dummy encoding)
state |
NY |
WA |
CA |
AK | … | CA | … | NY | … | WA | … | WY |
0 | … | 0 | … | 1 | … | 0 | … | 0 |
0 | … | 0 | … | 0 | … | 1 | … | 0 |
0 | … | 1 | … | 0 | … | 0 | … | 0 |
Corresponding feature functions
See notebook for example code.
Origin of the term: multiple “wires”, one per possible value (e.g., Fish, Dog, Cat); exactly one is “hot” …
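A minimal standard-library sketch of one-hot encoding for the state column above. This is not the notebook code; the helper name `one_hot_encoder` is hypothetical, and the categories are inferred from the three example values only. Libraries such as pandas (`get_dummies`) and scikit-learn (`OneHotEncoder`) offer equivalents.

```python
def one_hot_encoder(values):
    """Build a one-hot encoder from the distinct values observed in a column."""
    categories = sorted(set(values))

    def encode(value):
        # One slot per category; an unseen value encodes as all zeros.
        return [1 if value == c else 0 for c in categories]

    return categories, encode

states = ["NY", "WA", "CA"]
categories, encode = one_hot_encoder(states)
encoded = [encode(s) for s in states]
```

Each category gets its own feature function, an indicator that is 1 when the value matches and 0 otherwise.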
Encoding Missing Values
Encoding categorical data
Bag-of-words Encoding
“Learning about machine learning is fun.”
Vector (one count per vocabulary word):
aardvark | aardwolf | … | fun | … | learning | … | machine | … | zyzzyva |
0 | 0 | … | 1 | … | 2 | … | 1 | … | 0 |
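The bag-of-words vector for the example sentence can be sketched with the standard library as follows. The tokenizer and the six-word vocabulary are simplifying assumptions; a real vocabulary would contain every word of interest.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical, tiny vocabulary in alphabetical order.
vocabulary = ["aardvark", "aardwolf", "fun", "learning", "machine", "zyzzyva"]
vector = bag_of_words("Learning about machine learning is fun.", vocabulary)
```

Words outside the vocabulary (here “about” and “is”) simply contribute nothing to the vector.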
I made this art piece in graduate school
Do you see the stop word?
There used to be a dustbin and broom
… but the janitors got confused …
N-Gram Encoding
The book was not well written but I did enjoy it.
The book was well written but I did not enjoy it.
2-Gram Encoding
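“The book was well written ...”
2-grams: the book, book was, was well, well written
Vector (one count per 2-gram in the vocabulary):
aardvark airlines | apple pen | … | book was | … | the book | … | was well | … | well written | … | zyzzyva sf |
0 | 0 | … | 1 | … | 1 | … | 1 | … | 1 | … | 0 |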
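A short sketch of why n-grams help, using the two example sentences from the N-Gram Encoding slide. The helper names and the tokenizer are assumptions. The two sentences have identical bags of words, but their 2-grams differ.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "The book was not well written but I did enjoy it."
b = "The book was well written but I did not enjoy it."

# Bag-of-words cannot tell the two sentences apart ...
same_bag = Counter(tokenize(a)) == Counter(tokenize(b))

# ... but their 2-grams differ.
bigrams_a = set(ngrams(tokenize(a), 2))
bigrams_b = set(ngrams(tokenize(b), 2))
only_in_a = sorted(bigrams_a - bigrams_b)
only_in_b = sorted(bigrams_b - bigrams_a)
```

The price is a much larger vocabulary, since the number of possible n-grams grows quickly with n.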
Feature Transformations to Capture Domain Knowledge
Could do a database lookup
Diurnal patterns.
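The slide does not say which transformation it has in mind for diurnal patterns. One common choice, shown here purely as an assumption, is to map the hour of day onto a circle so that 23:00 and 00:00 end up close together. The function name is hypothetical.

```python
import math

def hour_of_day_features(hour):
    """Encode an hour in [0, 24) as a point on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return [math.sin(angle), math.cos(angle)]

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

midnight = hour_of_day_features(0)
late_evening = hour_of_day_features(23)
noon = hour_of_day_features(12)

# 23:00 is close to midnight in feature space; noon is as far away as possible.
near = distance(late_evening, midnight)
far = distance(noon, midnight)
```

Using the raw hour as a single number would instead place 23 and 0 at opposite ends of the scale.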