1 of 20

Do English and math teachers have anything in common?

2020

Jerry Tuttle, FCAS, Rocky Mt College Art & Design

2 of 20

What is data science?

  • "Work that takes more programming skills than most statisticians have, and more statistics skills than a programmer has."

3 of 20

Data science

  • Data science tries to find meaning in large amounts of data.
  • Math people like data.
  • Text analysis is a branch of data science.
  • Math people like to count. English teachers like to use words.
  • So let’s count words!
  • We know data can be dirty.
  • Text data is dirtier than numerical data.
  • First, let’s clean dirty text data.

4 of 20

Cleaning dirty data

  • Delete punctuation; convert capitals to lowercase
  • Delete funny characters (œ, â, €) and URLs
  • Delete escape & Unicode characters: \n, \u0009, etc.
  • Stemming: reduce related words to a common stem, e.g. lov = {love, lovable, loving, lovely, love-affair}
  • Lemmatizing: map each word to its dictionary form (lemma), e.g. loving → love
  • Delete stop words: a, an, the, etc.
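
The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the function name `clean_text` and the tiny `STOP_WORDS` set are assumptions, and stemming and lemmatizing are left out because they normally rely on a dedicated library.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and"}

def clean_text(raw):
    """Return lowercase tokens with URLs, punctuation, and stop words removed."""
    text = raw.lower()                                  # remove capitals
    text = re.sub(r"https?://\S+", " ", text)           # delete URLs
    text = text.replace("\n", " ").replace("\t", " ")   # delete escape characters
    # Drop non-ASCII "funny" characters; note this also drops legitimate accents.
    text = text.encode("ascii", "ignore").decode()
    text = text.translate(str.maketrans("", "", string.punctuation))  # delete punctuation
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = clean_text("Something is rotten in the state of Denmark!\nSee https://example.com")
print(tokens)  # ['something', 'rotten', 'state', 'denmark', 'see']
```

The order matters: URLs are removed before punctuation, because stripping punctuation first would turn a URL into an ordinary-looking word.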

5 of 20

How do writers differ quantitatively?

  • Number of characters per word (used by logician De Morgan, 1851); number of words per sentence
  • Percent unique words
  • Use of frequent words
  • Use of sensory adjectives
  • Use of sentiment words
  • Use of positive or negative words
  • Verb/adjective ratio
  • Complexity (grade level readability)
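
A few of these measures are easy to compute directly. The sketch below covers characters per word, words per sentence, and percent unique words. The function name `style_metrics`, the simple regular expressions, and the sample sentence are illustrative assumptions; a real analysis would use a proper tokenizer.

```python
import re

def style_metrics(text):
    """Compute three simple style measures for a piece of text."""
    # Treat runs of ., ! or ? as sentence boundaries (a rough approximation).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "chars_per_word": sum(len(w) for w in words) / len(words),
        "words_per_sentence": len(words) / len(sentences),
        "pct_unique": 100 * len(set(words)) / len(words),
    }

sample = "To be or not to be. That is the question."
metrics = style_metrics(sample)
print(metrics)
# {'chars_per_word': 3.0, 'words_per_sentence': 5.0, 'pct_unique': 80.0}
```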

6 of 20

Sensory adjectives

Word       Sense
bouncy     touch
bouncy     visual
buttery    taste
buttery    touch
chirping   hear
citrusy    smell
citrusy    taste

7 of 20

Sentiment words

Word           Sentiment
mathematical   trust
matrimony      anticipation
matrimony      joy
matrimony      trust
measles        disgust
measles        fear
measles        sadness
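
Counting sentiments comes down to looking up each word in a lexicon like the one above, where one word can carry several sentiments. A minimal sketch, assuming the text has already been cleaned into tokens. The `LEXICON` dictionary holds only the word/sentiment pairs listed above, and `sentiment_counts` is a hypothetical helper name.

```python
from collections import Counter

# Only the pairs shown on the slide; a real lexicon has thousands of entries.
LEXICON = {
    "mathematical": ["trust"],
    "matrimony": ["anticipation", "joy", "trust"],
    "measles": ["disgust", "fear", "sadness"],
}

def sentiment_counts(tokens):
    """Tally every sentiment attached to every token found in the lexicon."""
    counts = Counter()
    for token in tokens:
        counts.update(LEXICON.get(token, []))  # words not in the lexicon add nothing
    return counts

counts = sentiment_counts(["matrimony", "measles", "mathematical", "beer"])
print(counts["trust"], counts["fear"], counts["joy"])  # 2 1 1
```

The same lookup works for the sensory adjectives: swap in a word-to-sense dictionary.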

8 of 20

Let’s compare 2 great works of literature

#1. Hamlet

  • Neither a borrower nor a lender be
  • To thine own self be true
  • Something is rotten in Denmark
  • Brevity is the soul of wit
  • There is nothing either good or bad but thinking makes it so
  • To die, to sleep, perchance to dream
  • The lady protests too much methinks
  • Alas poor Yorick! I knew him

9 of 20

#2. 1 + 1 = 0 excerpt:

The miniskirted waitress brought two more beers to the table. She leaned over as she placed each beer bottle on the table, inviting the two male patrons a teasing look down the top of her blouse. She flipped her blonde hair, batted her long eyelashes, and flashed a big smile through her shiny red lipstick. Each of these gestures was designed to elicit a greater than average tip from the two considerably drunk young men. She did not understand that the two men were actuaries who could easily estimate a 15% tip to within a couple of pennies, no matter how drunk or distracted they were.

11 of 20

Word frequencies after removing stop words

12 of 20

Unique words

Some words from Hamlet not in 1+1=0:

  • encompassment
  • inventorially
  • reconcilement
  • recognizances
  • schoolfellows

             Unique    Total    Ratio
Hamlet        4,719   32,200    14.7%
1 + 1 = 0     1,275    4,789    26.6%
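
The ratios are simply unique words divided by total words, and the list of words found only in Hamlet is a set difference. A short sketch that checks the arithmetic with the counts reported above. The helper `words_only_in` and its toy token lists are illustrative assumptions.

```python
# (unique words, total words) as reported in the table above
word_counts = {"Hamlet": (4719, 32200), "1 + 1 = 0": (1275, 4789)}

ratios = {
    title: round(100 * unique / total, 1)
    for title, (unique, total) in word_counts.items()
}
print(ratios)  # {'Hamlet': 14.7, '1 + 1 = 0': 26.6}

def words_only_in(first_tokens, second_tokens):
    """Return the sorted words that appear in the first text but not the second."""
    return sorted(set(first_tokens) - set(second_tokens))

# Toy token lists, not the real texts.
only_first = words_only_in(["schoolfellows", "the", "table"], ["the", "table", "beer"])
print(only_first)  # ['schoolfellows']
```

One caution when reading the table: the unique-word ratio tends to fall as a text gets longer, so a short excerpt and a full play are not compared on equal footing.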

13 of 20

Compare word lengths

14 of 20

Compare sensory adjectives

15 of 20

Compare sentiments

16 of 20

Hamlet positives & negatives

17 of 20

Hamlet (pos – neg) index every 80 lines
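
An index like this can be built by splitting the text into blocks of 80 lines and subtracting the negative-word count from the positive-word count in each block. The sketch below is one way to do that, not necessarily how the slide's chart was produced. The `POSITIVE` and `NEGATIVE` sets are toy word lists, and the function name is an assumption.

```python
# Hypothetical toy word lists; a real analysis would use a published sentiment lexicon.
POSITIVE = {"good", "true", "love"}
NEGATIVE = {"bad", "rotten", "die"}

def net_sentiment_by_chunk(lines, chunk_size=80):
    """Return (positive count - negative count) for each block of chunk_size lines."""
    scores = []
    for start in range(0, len(lines), chunk_size):
        words = " ".join(lines[start:start + chunk_size]).lower().split()
        positives = sum(word in POSITIVE for word in words)
        negatives = sum(word in NEGATIVE for word in words)
        scores.append(positives - negatives)
    return scores

lines = [
    "to thine own self be true",
    "brevity is the soul of wit",
    "something is rotten",
    "to die to sleep",
]
scores = net_sentiment_by_chunk(lines, chunk_size=2)  # small chunks for the demo
print(scores)  # [1, -2]
```

Plotting the scores in order shows where the text turns more positive or more negative.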

18 of 20

Common authorship?

[Figure: cluster plot with points labeled JT1 and JT9]

I used text analysis on 10 actuarial stories, then used k-means clustering to test for common authorship. The algorithm made 3 clusters and did not pair my 2 stories, but it came close.
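
For readers who have not seen k-means, here is a bare-bones sketch in plain Python. The slides do not say which features or software were used, so everything here is an assumption: the story names, the two features (characters per word, percent unique words), their values, and the choice of starting centroids are all made up for illustration.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Hypothetical stories with (characters per word, percent unique words).
stories = {
    "A": (4.1, 25.0), "B": (4.2, 26.0),
    "C": (5.0, 35.0), "D": (5.1, 34.0),
    "E": (4.6, 30.0), "F": (4.5, 29.5),
}
points = list(stories.values())
labels = kmeans(points, [points[0], points[2], points[4]])
clusters = dict(zip(stories, labels))
print(clusters)  # {'A': 0, 'B': 0, 'C': 1, 'D': 1, 'E': 2, 'F': 2}
```

Stories that land in the same cluster have similar style measures, which is suggestive of common authorship but not proof. In practice the features should be put on a common scale first, since otherwise the one with the largest numbers dominates the distances.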

19 of 20

Uses of text analysis

  • Did Shakespeare really write his works?
  • Did Hamilton write 51 of the Federalist Papers?
  • Which Beatles songs did John write?
  • Which tweets did Trump’s staff write?
  • Compare sentiments of characters
  • Compare books, presidents, judges
  • Opinions on consumer products, politicians, course evaluations
  • Predict spam, crime, fraud, terrorism

20 of 20

Final thoughts

  • Cleaning the data takes much time
  • Hamlet has archaic English and stage directions
  • Use R or Python, not Excel
  • A good reference is Silge, J. & Robinson, D., Text Mining with R, O’Reilly
  • Full text of 1+1=0: bit.ly/jt_oneplusone

https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/