1 of 23

Data 100: Feature Engineering

Slides by:

Joseph E. Gonzalez, Deb Nolan, John DeNero, & Josh Hug

jegonzal@berkeley.edu

deborah_nolan@berkeley.edu

denero@berkeley.edu

josh@joshh.ug


3 of 23

Recap of Linear Models

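A minimal recap, assuming the setup from the earlier linear-model lectures: we model the response as a linear combination of the features,

    \hat{Y} = X \theta

and pick \theta to minimize average squared loss; when X^T X is invertible this gives the least-squares solution

    \hat{\theta} = (X^T X)^{-1} X^T Y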

4 of 23

Feature Engineering

  • The process of transforming the raw features into more informative features that can be used in modeling tasks.
  • Feature Engineering enables you to:
    • capture domain knowledge (e.g., periodicity or relationships between features)
    • express non-linear relationships using simple linear models
    • encode non-numeric features to be used as inputs to models

5 of 23

Feature Functions

  • Feature functions transform features into new features

Domain (DataFrame):

  | uid | age | state | hasBought | review                    |
  |-----|-----|-------|-----------|---------------------------|
  | 0   | 32  | NY    | True      | “Meh.”                    |
  | 42  | 50  | WA    | True      | “Worked out of the box …” |
  | 57  | 16  | CA    | NULL      | “Hella tots lit...”       |

Transformed (entirely quantitative values; note the different number of features):

  | AK | … | NY | … | WY | age | age^2 | hasBought_missing |
  |----|---|----|---|----|-----|-------|-------------------|
  | 0  | … | 1  | … | 0  | 32  | 32^2  | 0                 |
  | 0  | … | 0  | … | 0  | 50  | 50^2  | 0                 |
  | 0  | … | 0  | … | 0  | 16  | 16^2  | 1                 |

A code sketch of this transformation follows below.
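A minimal sketch of such a feature function in pandas; the DataFrame contents follow the slide, but the code itself is an illustration, not the course notebook:

    import pandas as pd

    # Raw Domain DataFrame from the slide.
    df = pd.DataFrame({
        "uid": [0, 42, 57],
        "age": [32.0, 50.0, 16.0],
        "state": ["NY", "WA", "CA"],
        "hasBought": [True, True, None],
        "review": ["Meh.", "Worked out of the box ...", "Hella tots lit..."],
    })

    def phi(df):
        """Map raw records to an entirely quantitative DataFrame."""
        out = pd.get_dummies(df["state"])              # one-hot encode state
        out["age"] = df["age"]
        out["age^2"] = df["age"] ** 2                  # non-linear transform
        out["hasBought_missing"] = df["hasBought"].isna().astype(int)
        return out

    features = phi(df)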

6 of 23

Feature Functions

  • Feature functions transform features into new features

(Same Domain → DataFrame transformation as on the previous slide, now labeling the mapping between them: the feature function φ, i.e., the “φ(eature) function.”)

Designing the feature functions is a big part of machine learning and data science.

Feature Functions

  • capture domain knowledge
  • contribute substantially to expressivity (and complexity)

8 of 23

Feature Function Examples

9 of 23

The Constant Feature Function

  • By adding an all-1s column to our original data, we were already introducing a feature function:

  • Sometimes this feature and its parameter are called:
      • constant feature, offset, intercept, bias

(Appending the all-1s column turns the n × d data matrix into an n × (d + 1) matrix, so the number of features becomes p = d + 1. A code sketch follows below.)
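A minimal sketch of adding the constant feature with numpy (variable names are illustrative):

    import numpy as np

    X = np.array([[32.0], [50.0], [16.0]])   # n x d data matrix (here d = 1)
    ones = np.ones((X.shape[0], 1))          # the all-1s constant feature
    Phi = np.hstack([ones, X])               # n x (d + 1) = n x p design matrix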

10 of 23

Modeling Non-linear Relationships

Feature Functions:

Note that feature functions don’t depend on parameters.
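For example (an illustrative choice of feature function, not recovered from the original slide): the polynomial features

    \phi(x) = [1, x, x^2]^T

turn the linear model

    f_\theta(x) = \theta^T \phi(x) = \theta_0 + \theta_1 x + \theta_2 x^2

into a quadratic function of x while staying linear in the parameters \theta; all the non-linearity lives in \phi, which has no parameters of its own.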

11 of 23

Encoding Categorical and Text Features

X (the raw records):

  | uid | age | state | hasBought | review                    |
  |-----|-----|-------|-----------|---------------------------|
  | 0   | 32  | NY    | True      | “Meh.”                    |
  | 42  | 50  | WA    | True      | “Worked out of the box …” |
  | 57  | 16  | CA    | NULL      | “Hella tots lit yo ...”   |

Y (the response):

  | rating |
  |--------|
  | 2.0    |
  | 4.5    |
  | 4.1    |

What if x is a record with numbers, text, booleans, etc.?

12 of 23

Predict rating from review information

(Same X and Y tables as on the previous slide.)

Schema:

  RatingsData(uid INTEGER, age FLOAT,
              state STRING, hasBought BOOLEAN,
              review STRING, rating FLOAT)

13 of 23

As a Linear Model?

Can I use X and Y directly in a linear model?

    • No! Why not?
    • Text, categorical data, missing values …

(Same X, Y, and schema as on the previous slides.)

14 of 23

Basic Transformations

  • Uninformative features (e.g., uid)
    • Is this informative? (Probably not.)
    • Transformation: remove uninformative features (why?)
      • They could still influence the model.
  • Quantitative features (e.g., age)
    • Transformation: may apply non-linear transformations (e.g., log)
    • Transformation: normalize/standardize (more on this later …)
      • Example: (x – mean) / stdev
  • Categorical features (e.g., state)
    • How do we convert state into meaningful numbers?
      • Alabama = 1, …, Utah = 50?
      • That implies order/magnitude mean something … we don’t want that.
    • Transformation: one-hot encode (see the sketch after this list)

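A minimal sketch of these transformations in pandas (column names follow the running example; the specific code is an assumption, not the course notebook):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "uid": [0, 42, 57],
        "age": [32.0, 50.0, 16.0],
        "state": ["NY", "WA", "CA"],
    })

    df = df.drop(columns=["uid"])               # remove the uninformative feature
    df["log_age"] = np.log(df["age"])           # non-linear transformation
    df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()  # standardize
    df = pd.get_dummies(df, columns=["state"])  # one-hot encode the categorical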
15 of 23

One Hot Encoding (dummy encoding)

  • Transform a categorical feature into many binary features:

  | state | AK | CA | NY | WA | WY |
  |-------|----|----|----|----|----|
  | NY    | 0  | 0  | 1  | 0  | 0  |
  | WA    | 0  | 0  | 0  | 1  | 0  |
  | CA    | 0  | 1  | 0  | 0  | 0  |

Corresponding feature functions: see the notebook for example code (a sketch also follows below).

Origin of the term: multiple “wires,” one per possible value (e.g., Cat, Dog, Fish); only one is “hot” at a time.
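A minimal one-hot-encoding sketch with scikit-learn (pd.get_dummies is a lighter-weight alternative; this code is an illustration, not the referenced notebook):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    states = pd.DataFrame({"state": ["NY", "WA", "CA"]})

    # handle_unknown="ignore" maps categories unseen during fitting to the
    # all-zeros vector at prediction time instead of raising an error.
    enc = OneHotEncoder(handle_unknown="ignore")
    one_hot = enc.fit_transform(states)       # sparse matrix, one column per state
    print(enc.get_feature_names_out())        # ['state_CA' 'state_NY' 'state_WA']
    print(one_hot.toarray())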

16 of 23

Encoding Missing Values

  • Missing values in quantitative data
    • Try to impute (estimate) missing values … (tricky)
      • Example: substitute the sample mean
    • Add a binary field called “missing_col_name”. (Why?)
      • Sometimes missing data is signal! (See the sketch below.)
  • Missing values in categorical data
    • Add an additional category called “missing_col_name”
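A minimal sketch in pandas (the age column is the running example; the code is illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [32.0, np.nan, 16.0]})

    # Record which values were missing first: missingness can be signal.
    df["age_missing"] = df["age"].isna().astype(int)

    # Impute the missing quantitative values with the sample mean.
    df["age"] = df["age"].fillna(df["age"].mean())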

17 of 23

Encoding categorical data

  • Categorical data → one-hot encoding (as above)

  • Text data
    • Bag-of-words & N-gram models

“Learning about machine learning is fun.” → a vector of word counts:

  | word     | count |
  |----------|-------|
  | aardvark | 0     |
  | aardwolf | 0     |
  | fun      | 1     |
  | learning | 2     |
  | machine  | 1     |
  | …        | …     |
  | zyzzyva  | 0     |

18 of 23

Bag-of-words Encoding

  • A generalization of one-hot encoding for a string of text:

  • Encode text as a long vector of word counts. (Issues?)
    • Long = millions of columns → typically high-dimensional and very sparse
    • Word-order information is lost … (is this an issue?)
    • New, unseen words at prediction (test) time → drop them …
  • A “bag” is another term for a multiset: an unordered collection which may contain multiple instances of each element.
  • Stop words: words that do not carry significant information
    • Examples: the, in, at, or, on, a, an, and …
    • Typically removed

(Same word-count vector as on the previous slide; a code sketch follows below.)
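A minimal bag-of-words sketch with scikit-learn (an illustration; the course notebook may differ):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Learning about machine learning is fun."]

    # stop_words="english" drops uninformative words like "about" and "is";
    # tokens are lowercased, so "Learning" and "learning" count together.
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(docs)          # sparse n_docs x vocabulary matrix
    print(vec.get_feature_names_out())        # ['fun' 'learning' 'machine']
    print(counts.toarray())                   # [[1 2 1]]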

19 of 23

I made this art piece in graduate school

Do you see the stop word?

There used to be a dustbin and broom

… but the janitors got confused …

20 of 23

N-Gram Encoding

  • Sometimes word order matters:

  • How do we capture word order in a “vector” model?
    • N-gram: a “bag of sequences of words”

The book was not well written but I did enjoy it.

The book was well written but I did not enjoy it.

21 of 23

2-Gram Encoding

“The book was well written ...” → 2-grams: “the book”, “book was”, “was well”, “well written”

  | 2-gram            | count |
  |-------------------|-------|
  | aardvark airlines | 0     |
  | apple pen         | 0     |
  | book was          | 1     |
  | the book          | 1     |
  | was well          | 1     |
  | well written      | 1     |
  | …                 | …     |
  | zyzzyva sf        | 0     |

A code sketch follows below.
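A minimal 2-gram sketch with scikit-learn (illustrative):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["The book was well written ..."]

    # ngram_range=(2, 2) counts sequences of exactly two consecutive words.
    vec = CountVectorizer(ngram_range=(2, 2))
    counts = vec.fit_transform(docs)
    print(vec.get_feature_names_out())  # ['book was' 'the book' 'was well' 'well written']
    print(counts.toarray())             # [[1 1 1 1]]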

22 of 23

N-Gram Encoding

  • Sometimes word order matters:

  • How do we capture word order in a “vector” model?
    • N-gram: a “bag of sequences of words”
  • Issues:
    • Can be very sparse (many combinations occur only once)
    • Many combinations will only occur at prediction time → drop them …
    • Often use a hashing approximation (see the sketch below):
      • Increment the counter at index hash(“not enjoy”); collisions are okay

The book was not well written but I did enjoy it.

The book was well written but I did not enjoy it.
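A minimal sketch of the hashing trick, hand-rolled for clarity (scikit-learn’s HashingVectorizer implements the same idea):

    import zlib

    def hashed_bigram_counts(text, d=1024):
        """Map 2-gram counts into a fixed-size vector of dimension d."""
        words = text.lower().split()
        counts = [0] * d
        for bigram in zip(words, words[1:]):
            # A stable hash; collisions between different 2-grams are okay.
            index = zlib.crc32(" ".join(bigram).encode()) % d
            counts[index] += 1
        return counts

    vec = hashed_bigram_counts("The book was not well written but I did enjoy it.")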

23 of 23

Feature Transformations to Capture Domain Knowledge

  • Feature functions capture domain knowledge by introducing additional information from other sources and/or combining features
    • Example: a database lookup to pull in information from another source
    • Encoding non-linear patterns (e.g., diurnal, time-of-day patterns; a sketch follows below)
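A minimal sketch of a periodicity-capturing feature function (the hour-of-day encoding is an illustrative assumption):

    import numpy as np

    def diurnal_features(hour):
        """Encode hour-of-day (0-23) so 23:00 and 0:00 map to nearby points."""
        angle = 2 * np.pi * hour / 24
        return np.array([np.sin(angle), np.cos(angle)])

    phi_1pm = diurnal_features(13)   # features for 1 pm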