Feature Engineering
Lecture 7
Wayne Tai Lee
Agenda
Often, feature engineering matters more than models
Tweets from Jan 2021
How we normally think about tweets
Twitter augments tweets with mentions and annotations
How do you know what’s important?
Bag of Words Model
Tokenization - splitting the text
Pre-processing tokenized text
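The two steps above can be sketched in a few lines. This is a minimal illustration, assuming a toy stopword list (real pipelines use standard lists, e.g. from NLTK or spaCy):

```python
import re

# Tiny illustrative stopword subset, not a standard list.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}

def tokenize(text):
    """Tokenization: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def preprocess(tokens):
    """Pre-processing: drop stopwords and single-character tokens."""
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

tokens = tokenize("The cat sat on the mat, and the mat was flat!")
print(preprocess(tokens))  # ['cat', 'sat', 'on', 'mat', 'mat', 'was', 'flat']
```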
Let’s think about 2 objectives!
The more you say it, the more important it is
But only relative to how often everyone uses it!
Term Frequency - Inverse Document Frequency
Literally, just multiply your objectives:
TF-IDF = TF * IDF
TFIDF has many variants - Term Frequency (TF)
English | Math |
Does the word appear in the document or not | 1 if term within doc, 0 otherwise |
Word frequency in document | word_count(term) = sum_{w in doc} 1[w = term] |
Normalize word frequency by document length | word_freq(term) = word_count(term) / sum_{w in doc} 1 |
Log-scaled word frequency to deal with outlying frequencies | log(1 + word_count(term)) |
Word frequency normalized by the most frequent word's frequency | max_freq = max_{t in doc} word_count(t); k + (1 - k) * word_count(term) / max_freq, for k in [0, 1] |
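The TF variants in the table can be computed side by side. A minimal sketch, with `k = 0.5` as an example value for the max-normalized variant:

```python
from math import log

def tf_variants(doc_tokens, term, k=0.5):
    """Compute the TF variants for one term in one document."""
    count = sum(1 for w in doc_tokens if w == term)       # raw word count
    n = len(doc_tokens)                                   # document length
    max_freq = max(sum(1 for w in doc_tokens if w == t)
                   for t in set(doc_tokens))              # most frequent term's count
    return {
        "binary": 1 if count > 0 else 0,
        "count": count,
        "freq": count / n,
        "log": log(1 + count),
        "max_norm": k + (1 - k) * count / max_freq,
    }

doc = ["cat", "sat", "mat", "mat"]
print(tf_variants(doc, "mat"))
```

Note how the variants disagree: "mat" gets count 2 but freq 0.5, and max_norm 1.0 because it is the most frequent word in the document.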
TFIDF has many variants - Inverse Doc Freq (IDF)
English | Math |
Constant weight | 1 |
Log scaled inverse document frequency relative to corpus size | log(|docs| / n(t)) |
Log scaled inverse document frequency relative to most popular word | mf = max_{t in docs} n(t); log(mf / (1 + n(t))) |
Document frequency relative to complement document frequency | log( (|docs| - n(t)) / n(t) ) |
where n(t) = num_docs_with_term = |{doc in docs: t in doc}|
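The IDF variants from the table, written out for a toy corpus. A sketch only: it assumes the term appears in at least one document, since n(t) = 0 would divide by zero:

```python
from math import log

def idf_variants(docs, term):
    """Compute the IDF variants; docs is a list of token lists."""
    N = len(docs)
    n_t = sum(1 for d in docs if term in d)               # docs containing the term
    mf = max(sum(1 for d in docs if t in d)
             for t in set(w for d in docs for w in d))    # most widespread term's doc freq
    return {
        "constant": 1,
        "idf": log(N / n_t),
        "max_norm": log(mf / (1 + n_t)),
        "complement": log((N - n_t) / n_t),
    }

docs = [["cat", "mat"], ["cat", "hat"], ["dog"]]
print(idf_variants(docs, "cat"))
```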
Term Frequency - Inverse Document Frequency
Literally, just multiply your objectives:
TF-IDF = TF * IDF
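Multiplying the objectives, with the plain word-frequency TF and log-scaled IDF variants from the tables, gives a minimal TF-IDF sketch (libraries like scikit-learn use smoothed variants of the same idea):

```python
from math import log

def tfidf(docs, term, doc_index):
    """TF-IDF = TF * IDF, with TF = word frequency in the document
    and IDF = log(|docs| / number of docs containing the term)."""
    doc = docs[doc_index]
    tf = sum(1 for w in doc if w == term) / len(doc)      # term frequency
    n_t = sum(1 for d in docs if term in d)               # document frequency
    idf = log(len(docs) / n_t)                            # inverse document frequency
    return tf * idf

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
print(tfidf(docs, "cat", 0))  # (1/3) * log(3/2): "cat" is in 2 of 3 docs
```

A word that appears in every document gets IDF = log(1) = 0, so its TF-IDF is zero no matter how often it is repeated: frequency only matters relative to how often everyone uses the word.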
A feature helps highlight what is relevant - NDVI
Normalized difference vegetation index (NDVI)
For more on NDVI from NASA
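NDVI is computed per pixel from the near-infrared and red reflectance bands, (NIR - Red) / (NIR + Red). A small sketch with illustrative reflectance values:

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    Ranges from -1 to 1; healthy vegetation reflects near-infrared strongly,
    so vegetated pixels give values closer to 1."""
    return (nir - red) / (nir + red)

print(ndvi(0.5, 0.1))  # vegetated pixel: 0.666...
print(ndvi(0.2, 0.2))  # bare surface: 0.0
```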
Body Mass Index
A common machine learning feature in classification
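Like NDVI, BMI is an engineered feature: it compresses two raw measurements into one number, weight (kg) divided by height (m) squared:

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in meters, squared."""
    return weight_kg / height_m ** 2

print(round(bmi(70, 1.75), 1))  # 22.9
```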