Introduction to Tidytext
satRday - Cardiff - 23rd June 2018
Nujcharee Haswell
North Yorkshire County Council
Hi - I am Nujcharee (Ped)
Data & Intelligence Specialist / Data Scientist
North Yorkshire County Council
Experience: 10+ years in data warehousing, business intelligence and data engineering; data science (<1 year)
NASA Datanaut Spring 2017
Let’s get to know tidytext
Online Book: https://www.tidytextmining.com/
About tidy data principles
Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham (Wickham 2014), tidy data has a specific structure:
Each variable is a column
Each observation is a row
Each type of observational unit is a table
Why text mining?
Because we want a simple way to interpret natural language
My tidytext story
How I use tidytext to solve problems
Analyse staff surveys for better engagement
Better understand the reasons for delayed transfers of care (DToC)
Spam / non-spam email classification
Build a knowledge management system
Let’s explore tidytext package in 3 steps
Step 1:
Tidy and Tokenising words
Step 2:
Sentiment Analysis
Step 3:
Word Frequency vs TF-IDF
Step 1: Tidy (our) text
unnest_tokens
Break text into individual tokens (a process called tokenisation) and transform it into a tidy data structure (one-word-per-row format)
tokens <- dataset %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
stop_words
Remove stop words from our tokens.
We can also create custom stop words.
tokens <- dataset %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
library(tidytext)  ## to carry out text mining tasks
library(dplyr)     ## for tidy data manipulation
library(wordcloud) ## for visualisation

clean_metoo <- metoo %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#### customised stop words
custom_stop_words <- bind_rows(data_frame(word = c("metoo"),
                                          lexicon = c("custom")),
                               stop_words)

## re-run
clean_metoo <- metoo %>%
  unnest_tokens(word, text) %>%
  anti_join(custom_stop_words)
## quick glance at the most frequent words
clean_metoo %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 100))
Step 2: Sentiment Analysis
Lexicons - words are assigned scores for positive/negative sentiment, and possibly emotions such as joy, anger and sadness
get_sentiments
get_sentiments("afinn")
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faced     negative
##  2 2-faces     negative
##  3 a+          positive
##  4 abnormal    negative
##  5 abolish     negative
##  6 abominable  negative
##  7 abominably  negative
##  8 abominate   negative
##  9 abomination negative
## 10 abort       negative
## # ... with 6,778 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 abacus      trust
##  2 abandon     fear
##  3 abandon     negative
##  4 abandon     sadness
##  5 abandoned   anger
##  6 abandoned   fear
##  7 abandoned   negative
##  8 abandoned   sadness
##  9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
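Once a lexicon is chosen, it can be joined onto the tokens to score them. As a minimal sketch (assuming the clean_metoo tokens from Step 1; the AFINN column is called score in this version of the lexicon), the scores can be summed for an overall net sentiment:

library(dplyr)
library(tidytext)

## join AFINN scores onto the tokens and sum them for a net sentiment score
clean_metoo %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  summarise(net_sentiment = sum(score))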
inner_join
library(reshape2) ## for acast()

clean_metoo %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
comparison.cloud
Step 3: word counts vs TF-IDF
A word that occurs frequently in a document is not necessarily important
TF-IDF = combination of term frequency (tf) and inverse document frequency (idf)
TF - Term Frequency
TF measures how frequently a term occurs in a document.
Term frequency is often divided by the total number of terms in the document as a way of normalization.
IDF - Inverse document frequency
IDF measures how important a term is across the collection: while TF treats every term as equally important, IDF down-weights terms that appear in many documents.
TF-IDF finds the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents
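As a rough illustration of how these pieces combine, here is a sketch computing tf, idf and tf-idf by hand with dplyr on a hypothetical toy corpus of three documents (the words and counts are made up); bind_tf_idf() returns the same quantities:

library(dplyr)
library(tibble)

## hypothetical word counts for a toy three-document corpus
counts <- tribble(
  ~doc, ~word,  ~n,
  "d1", "cats",  2,
  "d1", "purr",  1,
  "d2", "cats",  1,
  "d2", "dogs",  1,
  "d3", "dogs",  2
)

n_docs <- n_distinct(counts$doc)

counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                     ## share of the document's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>% ## penalises words found in many documents
  ungroup() %>%
  mutate(tf_idf = tf * idf)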
bind_tf_idf
Find the words most distinctive to each document
bigram_tf_idf <- bigrams_united %>%
  count(location, bigram) %>%
  bind_tf_idf(bigram, location, n) %>%
  arrange(desc(tf_idf))
bigram
Tokenise pairs of two consecutive words to explore how often word X is followed by word Y; we can then build a model of the relationships between them.
metoo_bigrams <- dataset %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(source_r, bigram, sort = TRUE)
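To see which word Y most often follows a given word X, a minimal sketch (reusing the metoo_bigrams counts above; the leading word is only a hypothetical example) splits each bigram back into two columns with tidyr::separate():

library(dplyr)
library(tidyr) ## for separate()

## split "word1 word2" into two columns, then look at what follows a chosen word
metoo_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 == "sexual") %>%   ## hypothetical leading word of interest
  arrange(desc(n))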
Step 4: case study
Summary
Useful resources
Thank you