Introduction to Tidytext
satRday - Cardiff - 23rd June 2018
Nujcharee Haswell
North Yorkshire County Council
Hi - I am Nujcharee (Ped)
Data & Intelligence Specialist / Data Scientist
North Yorkshire County Council
Experience: 10+ years in data warehousing, business intelligence and data engineering; data science (<1 year)
NASA Datanaut Spring 2017
Let’s get to know tidytext
Online Book: https://www.tidytextmining.com/
About tidy data principles
Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham (Wickham 2014), tidy data has a specific structure:
Each variable is a column
Each observation is a row
Each type of observational unit is a table
Why text mining?
Because we want a simple way to interpret natural language
My tidytext story
How I use tidytext to solve problems
Analyse staff surveys for better engagement
Better understand the reasons for delayed transfers of care (DToC)
Spam / non-spam email classification
Build a knowledge management system
Let’s explore tidytext package in 3 steps
Step 1:
Tidy and Tokenising words
Step 2:
Sentiment Analysis
Step 3:
Word Frequency vs TF-IDF
Step 1: Tidy (our) text
unnest_tokens
Break text into individual tokens (a process called tokenisation) and transform it into a tidy data structure (one-word-per-row format)
tokens <- dataset %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
stop_words
Remove stop words from our tokens.
We can also create custom stop words.
tokens <- dataset %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
library(tidytext)  ## to carry out text mining tasks
library(dplyr)     ## for tidy data manipulation
library(wordcloud) ## for visualisation

clean_metoo <- metoo %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#### customised stop words
custom_stop_words <- bind_rows(data_frame(word = c("metoo"),
                                          lexicon = c("custom")),
                               stop_words)

## re-run
clean_metoo <- metoo %>%
  unnest_tokens(word, text) %>%
  anti_join(custom_stop_words)
## quick glance at the most frequent words
clean_metoo %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 100))
Step 2: Sentiment Analysis
Lexicons - words are assigned scores for positive/negative sentiment, and possibly emotions such as joy, anger and sadness
get_sentiments
get_sentiments("afinn")
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faced     negative
##  2 2-faces     negative
##  3 a+          positive
##  4 abnormal    negative
##  5 abolish     negative
##  6 abominable  negative
##  7 abominably  negative
##  8 abominate   negative
##  9 abomination negative
## 10 abort       negative
## # ... with 6,778 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 abacus      trust
##  2 abandon     fear
##  3 abandon     negative
##  4 abandon     sadness
##  5 abandoned   anger
##  6 abandoned   fear
##  7 abandoned   negative
##  8 abandoned   sadness
##  9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
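Once a lexicon is chosen, it can be joined onto the tokens to score them. As a minimal sketch (assuming the clean_metoo tokens from Step 1; the AFINN column is called score in this version of the lexicon), the scores can be summed for an overall net sentiment:

library(dplyr)
library(tidytext)

## join AFINN scores onto the tokens and sum them for a net sentiment score
clean_metoo %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  summarise(net_sentiment = sum(score))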
inner_join
library(reshape2) ## for acast()

clean_metoo %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
comparison.cloud
Step 3: word counts vs TF-IDF
A word that occurs frequently in a document is not necessarily important
TF-IDF = combination of term frequency (tf) and inverse document frequency (idf)
TF - Term Frequency
TF measures how frequently a term occurs in a document.
Term frequency is often divided by the total number of terms in the document as a way of normalization.
IDF - Inverse document frequency
IDF measures how important a term is across the collection: while TF treats every term as equally important, IDF down-weights terms that appear in many documents.
TF-IDF finds the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents
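As a rough illustration of how these pieces combine, here is a sketch computing tf, idf and tf-idf by hand with dplyr on a hypothetical toy corpus of three documents (the words and counts are made up); bind_tf_idf() returns the same quantities:

library(dplyr)
library(tibble)

## hypothetical word counts for a toy three-document corpus
counts <- tribble(
  ~doc, ~word,  ~n,
  "d1", "cats",  2,
  "d1", "purr",  1,
  "d2", "cats",  1,
  "d2", "dogs",  1,
  "d3", "dogs",  2
)

n_docs <- n_distinct(counts$doc)

counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                     ## share of the document's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>% ## penalises words found in many documents
  ungroup() %>%
  mutate(tf_idf = tf * idf)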
bind_tf_idf
Find the words most distinctive to each document
bigram_tf_idf <- bigrams_united %>%
  count(location, bigram) %>%
  bind_tf_idf(bigram, location, n) %>%
  arrange(desc(tf_idf))
bigram
Tokenise pairs of two consecutive words to explore how often word X is followed by word Y; we can then build a model of the relationships between them.
metoo_bigrams <- dataset %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(source_r, bigram, sort = TRUE)
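To see which word Y most often follows a given word X, a minimal sketch (reusing the metoo_bigrams counts above; the leading word is only a hypothetical example) splits each bigram back into two columns with tidyr::separate():

library(dplyr)
library(tidyr) ## for separate()

## split "word1 word2" into two columns, then look at what follows a chosen word
metoo_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 == "sexual") %>%   ## hypothetical leading word of interest
  arrange(desc(n))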
Step 4: case study
Summary
Useful resources
Thank you