1 of 23

Feature Engineering

Lecture 7

Wayne Tai Lee

2 of 23

Happening tomorrow!

3 of 23

Agenda

  • Feature Engineering
  • Text data preprocessing

4 of 23

Often, feature engineering matters more than models

5 of 23

Tweets from Jan 2021

6 of 23

How we normally think about tweets

7 of 23

Twitter augments tweets with mentions and annotations

8 of 23

How do you know what’s important?

9 of 23

Bag of Words Model
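A bag-of-words model represents each document as a vector of word counts over a shared vocabulary, ignoring word order. A minimal sketch in Python (the example documents are made up):

```python
from collections import Counter

docs = [
    "the senate certifies the vote",
    "the crowd marches to the capitol",
]

# Build a shared vocabulary across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# Represent each document as a vector of word counts over the vocabulary
def bag_of_words(doc, vocab):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Every document becomes a fixed-length numeric vector, which is what downstream models need.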

10 of 23

Tokenization - splitting the text
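Tokenization splits raw text into units ("tokens"). A minimal sketch using the standard library; real tweet tokenizers handle URLs, hashtags, and emoji with far more care:

```python
import re

tweet = "Congress certifies the vote on January6 @tedcruz"

# Whitespace tokenization keeps punctuation attached to words
whitespace_tokens = tweet.split()

# Regex tokenization: runs of word characters (plus @ to keep mentions intact)
regex_tokens = re.findall(r"@?\w+", tweet)

print(whitespace_tokens)
print(regex_tokens)
```

The choice of tokenizer already changes which features exist downstream, e.g. whether "@tedcruz" survives as one token.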

11 of 23

Pre-processing tokenized text

  • Punctuation (@tedcruz)
  • Numbers (January6)
  • Lowercasing
  • Stemming/Lemmatization
  • Stopword removal
  • N-gram inclusion (Medal of Freedom)
  • Infrequently used terms
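Several of the steps above can be sketched with the standard library; real pipelines typically use a library such as NLTK or spaCy, and the stopword set and suffix rule here are toy stand-ins (n-grams and infrequent-term filtering are omitted):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}  # toy stand-in list

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, strip punctuation
    tokens = [t for t in tokens if not t.isdigit()]  # drop pure numbers
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]  # crude "stemming"
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return tokens

print(preprocess("The Senators vote on January 6!"))
```

Each step is a design decision: stripping punctuation, for example, merges "@tedcruz" with "tedcruz".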

12 of 23

Let’s think about 2 objectives!

The more you say it, the more important it is

But only relative to how often everyone uses it!

13 of 23

Term Frequency - Inverse Document Frequency

Literally, just multiply your objectives:

TF-IDF = TF * IDF
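A minimal sketch of the multiplication using one common variant (raw term count times log inverse document frequency); the toy corpus is made up:

```python
import math
from collections import Counter

docs = [
    ["medal", "of", "freedom"],
    ["medal", "count"],
    ["freedom", "of", "speech"],
]

def tf(term, doc):
    return Counter(doc)[term]  # raw count within one document

def idf(term, docs):
    n_t = sum(term in doc for doc in docs)  # documents containing the term
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "medal" appears in 2 of 3 docs, "count" in only 1, so "count" scores higher
print(tf_idf("medal", docs[1], docs))
print(tf_idf("count", docs[1], docs))
```

The rarer a term is across the corpus, the more a single use of it is worth.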

14 of 23

TFIDF has many variants - Term Frequency (TF)

English / Math pairs:

  • Does the word appear in the document or not:
    1 if term within doc, 0 otherwise
  • Word frequency in document:
    word_count(term) = sum_{w in doc} 1[w = term]
  • Normalized word frequency by document length:
    word_freq(term) = word_count(term) / sum_{w in doc} 1
  • Log scaled word frequency to deal with outlying frequencies:
    log(1 + word_count(term))
  • Normalized word frequency by the most popular word's frequency:
    max_freq = max_{term in doc} sum_{w in doc} 1[w = term]
    k + (1 - k) * word_count(term) / max_freq, for k in [0, 1]
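The TF variants can be sketched as small Python functions over a tokenized document (function names are mine, not standard):

```python
import math
from collections import Counter

def tf_binary(term, doc):
    return 1 if term in doc else 0

def tf_count(term, doc):
    return Counter(doc)[term]

def tf_freq(term, doc):
    return Counter(doc)[term] / len(doc)

def tf_log(term, doc):
    return math.log(1 + Counter(doc)[term])

def tf_max_norm(term, doc, k=0.5):
    max_freq = max(Counter(doc).values())
    return k + (1 - k) * Counter(doc)[term] / max_freq

doc = ["medal", "of", "freedom", "medal"]
print(tf_binary("medal", doc), tf_count("medal", doc), tf_freq("medal", doc))
```

All five agree on which terms are frequent; they differ in how aggressively they damp large counts.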

19 of 23

TFIDF has many variants - Inverse Doc Freq (IDF)

English / Math pairs, where n(t) = num_docs_with_term = |{doc in docs: t in doc}|:

  • Constant weight:
    1
  • Log scaled inverse document frequency relative to the corpus:
    log(|docs| / n(t))
  • Log scaled inverse document frequency relative to the most popular word:
    mf = max_{t in docs} n(t)
    log(mf / (1 + n(t)))
  • Log scaled document frequency relative to complement document frequency:
    log((|docs| - n(t)) / n(t))
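And the IDF variants as functions over a list of tokenized documents (again a sketch; function names are mine):

```python
import math

def n_docs_with_term(term, docs):
    return sum(term in doc for doc in docs)

def idf_constant(term, docs):
    return 1  # reduces TF-IDF to plain term frequency

def idf_log(term, docs):
    return math.log(len(docs) / n_docs_with_term(term, docs))

def idf_max(term, docs, vocab):
    mf = max(n_docs_with_term(t, docs) for t in vocab)
    return math.log(mf / (1 + n_docs_with_term(term, docs)))

def idf_complement(term, docs):
    n_t = n_docs_with_term(term, docs)
    return math.log((len(docs) - n_t) / n_t)

docs = [["medal", "of", "freedom"], ["medal", "count"], ["freedom", "of", "speech"]]
print(idf_log("count", docs))  # the rarest term gets the largest weight
```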

20 of 23

Term Frequency - Inverse Document Frequency

Literally, just multiply your objectives:

TF-IDF = TF * IDF

21 of 23

A feature helps highlight what is relevant - NDVI

Normalized difference vegetation index (NDVI)

NDVI = (NIR - Red) / (NIR + Red)

  • NIR = Near-Infrared
  • Red = visible red

For more on NDVI from NASA
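A minimal sketch of computing NDVI from paired near-infrared and red reflectance values (the numbers are made up):

```python
def ndvi(nir, red):
    # Dense vegetation reflects NIR strongly and absorbs red,
    # so NDVI approaches 1 over healthy plants and sits near 0 over bare soil.
    return (nir - red) / (nir + red)

print(ndvi(0.5, 0.08))  # vegetation-like reflectances: high NDVI
print(ndvi(0.3, 0.25))  # soil-like reflectances: near zero
```

The normalization by (NIR + Red) is what makes the feature comparable across scenes with different illumination.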

22 of 23

Body Mass Index
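BMI is an engineered feature in the same spirit: it combines weight in kilograms and height in meters into one number. A one-line sketch:

```python
def bmi(weight_kg, height_m):
    # BMI = weight (kg) / height (m)^2
    return weight_kg / height_m ** 2

print(bmi(70, 1.75))
```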

23 of 23

A common machine learning feature in classification