1 of 8

Transforming Text Data to Matrix Data via Embeddings

Colleen M. Farrelly, Datasembly

2 of 8

Colleen M. Farrelly

  • Senior data scientist at Datasembly
  • Located in Miami, Florida
  • Expertise in time series, natural language processing, topological data analysis, structural equation modeling, spatial data analysis…
  • cfarrelly@med.miami.edu (or find me on LinkedIn)

3 of 8

Text Data and Machine Learning

  • Common sources of text data:
    • Chatbot or live agent conversations
    • Customer feedback forms
    • Scraped text data
    • Books or articles in a data lake
    • Emails
    • Student essays
  • Common machine learning tasks:
    • Classify email spam
    • Cluster chatbot conversations to derive user groups
    • Predict customer churn likelihood from feedback
    • Predict student educational attainment from writing samples

4 of 8

Wrangling Text to Numeric Matrix

  • Document word counts/frequencies
    • Binary, count, or term frequency-inverse document frequency (TF-IDF) weighting
  • Sparse numeric matrix

She bolted the door shut.

(0,0,1,0,0,1,0,0,1,1,0,1,0)

5 of 8

Embeddings:

High dimension to low dimension
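One classical way to go from high dimension to low dimension is truncated SVD (latent semantic analysis) on the sparse count matrix. A hedged sketch, with an invented four-document corpus and an arbitrary choice of 2 output dimensions:

```python
# Project a high-dimensional sparse bag-of-words matrix into a small dense
# space with truncated SVD (LSA); n_components=2 is purely illustrative.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "She bolted the door shut.",
    "She bolted out the door.",
    "He locked the window.",
    "He sprinted down the street.",
]

X_sparse = CountVectorizer().fit_transform(docs)  # one column per vocabulary term
X_dense = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_sparse)

print(X_sparse.shape)  # (4, vocabulary size) -- high-dimensional, sparse
print(X_dense.shape)   # (4, 2) -- low-dimensional, dense
```

The dense matrix feeds downstream models far more efficiently than the sparse one, at the cost of columns that no longer correspond to individual words.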

6 of 8

Context Matters

She bolted the door shut.

She bolted out the door.
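These two sentences show why word counts alone fall short: their bag-of-words vectors are nearly identical even though "bolted" means something different in each. A small sketch using scikit-learn's cosine similarity:

```python
# Bag-of-words ignores word order and word sense, so these two sentences
# with opposite meanings of "bolted" come out highly similar.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["She bolted the door shut.", "She bolted out the door."]
X = CountVectorizer().fit_transform(sentences)

# Cosine similarity of the two count vectors: 0.8 (they share 4 of 5 terms)
print(cosine_similarity(X)[0, 1])
```

Context-aware models are needed to tell these apart, which is where BERT comes in.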

7 of 8

Meet BERT

  • Pretrained neural network that:
    • Is context-aware (bidirectional Transformer encoder)
    • Produces low-dimensional dense embeddings in place of the sparse matrix
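A hedged sketch of getting a dense sentence vector from BERT with the Hugging Face transformers library. The model checkpoint (`bert-base-uncased`) and the mean-pooling step are illustrative choices not specified in the slides, and running this downloads the pretrained weights.

```python
# Encode a sentence with pretrained BERT and pool token embeddings into
# one dense 768-dimensional vector (bert-base-uncased's hidden size).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("She bolted the door shut.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over tokens: (1, num_tokens, 768) -> (1, 768)
sentence_vec = outputs.last_hidden_state.mean(dim=1)
print(sentence_vec.shape)
```

Because the encoder attends to the whole sentence, the token embedding for "bolted" differs between "bolted the door shut" and "bolted out the door", unlike any count-based vector.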

8 of 8

Demo Use Case

Classifying poem types: serious or humorous
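As a self-contained sketch of the classification setup (the talk's demo uses BERT embeddings; the tiny "poems" below and the TF-IDF + logistic regression baseline are stand-ins invented here, not the talk's data or model):

```python
# Baseline text classifier for the serious-vs-humorous poem task:
# TF-IDF features feeding logistic regression, wrapped in one pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

poems = [
    "The raven mourns beneath a cold grey sky.",
    "Dark waters close above the silent dead.",
    "My cat judges me while I eat cheese in bed.",
    "I wrote this limerick while stuck in a shed.",
]
labels = ["serious", "serious", "humorous", "humorous"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(poems, labels)

# Predict the label of an unseen line
print(clf.predict(["Grief settles like ash on the empty field."]))
```

Swapping the TF-IDF step for BERT sentence embeddings keeps the same fit/predict workflow while giving the classifier context-aware features.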