1 of 23

Feature Engineering

Lecture 7

Wayne Tai Lee

2 of 23

Happening tomorrow!

3 of 23

Agenda

  • Feature Engineering
  • Text data preprocessing

4 of 23

Often, feature engineering matters more than models

5 of 23

Tweets from Jan 2021

6 of 23

How we normally think about tweets

7 of 23

Twitter augments tweets with mentions and annotations

8 of 23

How do you know what’s important?

9 of 23

Bag of Words Model
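A bag-of-words model represents each document as a vector of word counts over a shared vocabulary, ignoring word order. A minimal sketch in Python (the example documents are made up):

```python
from collections import Counter

docs = [
    "the senate certifies the vote",
    "the crowd marches to the capitol",
]

# Build a shared vocabulary across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# Represent each document as a vector of word counts over the vocabulary
def bag_of_words(doc, vocab):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Every document becomes a fixed-length numeric vector, which is what downstream models need.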

10 of 23

Tokenization - splitting the text
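Tokenization splits raw text into units ("tokens"). A minimal sketch using the standard library; real tweet tokenizers handle URLs, hashtags, and emoji with far more care:

```python
import re

tweet = "Congress certifies the vote on January6 @tedcruz"

# Whitespace tokenization keeps punctuation attached to words
whitespace_tokens = tweet.split()

# Regex tokenization: runs of word characters (plus @ to keep mentions intact)
regex_tokens = re.findall(r"@?\w+", tweet)

print(whitespace_tokens)
print(regex_tokens)
```

The choice of tokenizer already changes which features exist downstream, e.g. whether "@tedcruz" survives as one token.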

11 of 23

Pre-processing tokenized text

  • Punctuation (@tedcruz)
  • Numbers (January6)
  • Lowercasing
  • Stemming/Lemmatization
  • Stopword removal
  • N-gram inclusion (Medal of Freedom)
  • Infrequently used terms
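Several of the steps above can be sketched with the standard library; real pipelines typically use a library such as NLTK or spaCy, and the stopword set and suffix rule here are toy stand-ins (n-grams and infrequent-term filtering are omitted):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}  # toy stand-in list

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, strip punctuation
    tokens = [t for t in tokens if not t.isdigit()]  # drop pure numbers
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]  # crude "stemming"
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return tokens

print(preprocess("The Senators vote on January 6!"))
```

Each step is a design decision: stripping punctuation, for example, merges "@tedcruz" with "tedcruz".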

12 of 23

Let’s think about 2 objectives!

The more you say it, the more important it is

But only relative to how often everyone uses it!

13 of 23

Term Frequency - Inverse Document Frequency

Literally, just multiply your objectives:

TF-IDF = TF * IDF
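A minimal sketch of the multiplication using one common variant (raw term count times log inverse document frequency); the toy corpus is made up:

```python
import math
from collections import Counter

docs = [
    ["medal", "of", "freedom"],
    ["medal", "count"],
    ["freedom", "of", "speech"],
]

def tf(term, doc):
    return Counter(doc)[term]  # raw count within one document

def idf(term, docs):
    n_t = sum(term in doc for doc in docs)  # documents containing the term
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "medal" appears in 2 of 3 docs, "count" in only 1, so "count" scores higher
print(tf_idf("medal", docs[1], docs))
print(tf_idf("count", docs[1], docs))
```

The rarer a term is across the corpus, the more a single use of it is worth.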

14 of 23

TFIDF has many variants - Term Frequency (TF)

English / Math pairs:

  • Does the word appear in the document or not:
    1 if term within doc, 0 otherwise
  • Word frequency in document:
    word_count(term) = sum_{w in doc} 1[w = term]
  • Normalized word frequency by document length:
    word_freq(term) = word_count(term) / sum_{w in doc} 1
  • Log scaled word frequency to deal with outlying frequencies:
    log(1 + word_count(term))
  • Normalized word frequency by the most popular word's frequency:
    max_freq = max_{term in doc} sum_{w in doc} 1[w = term]
    k + (1 - k) * word_count(term) / max_freq, for k in [0, 1]
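The TF variants can be sketched as small Python functions over a tokenized document (function names are mine, not standard):

```python
import math
from collections import Counter

def tf_binary(term, doc):
    return 1 if term in doc else 0

def tf_count(term, doc):
    return Counter(doc)[term]

def tf_freq(term, doc):
    return Counter(doc)[term] / len(doc)

def tf_log(term, doc):
    return math.log(1 + Counter(doc)[term])

def tf_max_norm(term, doc, k=0.5):
    max_freq = max(Counter(doc).values())
    return k + (1 - k) * Counter(doc)[term] / max_freq

doc = ["medal", "of", "freedom", "medal"]
print(tf_binary("medal", doc), tf_count("medal", doc), tf_freq("medal", doc))
```

All five agree on which terms are frequent; they differ in how aggressively they damp large counts.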

19 of 23

TFIDF has many variants - Inverse Doc Freq (IDF)

English / Math pairs, where n(t) = num_docs_with_term = |{doc in docs: t in doc}|:

  • Constant weight:
    1
  • Log scaled inverse document frequency relative to the corpus:
    log(|docs| / n(t))
  • Log scaled inverse document frequency relative to the most popular word:
    mf = max_{t in docs} n(t)
    log(mf / (1 + n(t)))
  • Log scaled document frequency relative to complement document frequency:
    log((|docs| - n(t)) / n(t))
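And the IDF variants as functions over a list of tokenized documents (again a sketch; function names are mine):

```python
import math

def n_docs_with_term(term, docs):
    return sum(term in doc for doc in docs)

def idf_constant(term, docs):
    return 1  # reduces TF-IDF to plain term frequency

def idf_log(term, docs):
    return math.log(len(docs) / n_docs_with_term(term, docs))

def idf_max(term, docs, vocab):
    mf = max(n_docs_with_term(t, docs) for t in vocab)
    return math.log(mf / (1 + n_docs_with_term(term, docs)))

def idf_complement(term, docs):
    n_t = n_docs_with_term(term, docs)
    return math.log((len(docs) - n_t) / n_t)

docs = [["medal", "of", "freedom"], ["medal", "count"], ["freedom", "of", "speech"]]
print(idf_log("count", docs))  # the rarest term gets the largest weight
```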

20 of 23

Term Frequency - Inverse Document Frequency

Literally, just multiply your objectives:

TF-IDF = TF * IDF

21 of 23

A feature helps highlight what is relevant - NDVI

Normalized difference vegetation index (NDVI)

NDVI = (NIR - Red) / (NIR + Red)

  • NIR = Near-Infrared
  • Red = visible red

For more on NDVI from NASA
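A minimal sketch of computing NDVI from paired near-infrared and red reflectance values (the numbers are made up):

```python
def ndvi(nir, red):
    # Dense vegetation reflects NIR strongly and absorbs red,
    # so NDVI approaches 1 over healthy plants and sits near 0 over bare soil.
    return (nir - red) / (nir + red)

print(ndvi(0.5, 0.08))  # vegetation-like reflectances: high NDVI
print(ndvi(0.3, 0.25))  # soil-like reflectances: near zero
```

The normalization by (NIR + Red) is what makes the feature comparable across scenes with different illumination.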

22 of 23

Body Mass Index
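BMI is an engineered feature in the same spirit: it combines weight in kilograms and height in meters into one number. A one-line sketch:

```python
def bmi(weight_kg, height_m):
    # BMI = weight (kg) / height (m)^2
    return weight_kg / height_m ** 2

print(bmi(70, 1.75))
```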

23 of 23

A common machine learning feature in classification