1 of 25

What Is Data Science?

Jerry Tuttle, FCAS. Rocky Mountain College of Art & Design

People Buying a Left Shoe Also Buy a Right Shoe  

2 of 25


What Is Data Science?

The intersection of math/stats, computer science, and domain knowledge, used to extract meaningful insights from data and translate them into tangible business value.

  

3 of 25


Some Data Science Problems

Problem | Category | Techniques

Determine relationship between outcome & inputs | Regression | Linear, logistic regression

Group items by similarity | Clustering | k-nearest neighbors, k-means

Find relationships between items | Association rules | Apriori

Assign known labels to objects | Classification | Decision trees, random forests

Analyze text data | Text analysis | Text analysis

4 of 25


Data Training vs Testing

RMSE on test data = √[∑ (Actual – Predicted)² / n]
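As a quick check of the formula, a minimal Python sketch (the talk's sources use R; this is just an illustration):

```python
import math

def rmse(actual, predicted):
    # Root mean squared error on held-out test data
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

print(rmse([3, 5, 7], [2, 5, 9]))  # sqrt((1 + 0 + 4) / 3) ≈ 1.291
```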

  

5 of 25


Performance Measures

False positive (Type I): the doctor tells a male he is pregnant.

False negative (Type II): the doctor tells an obviously pregnant woman she is not pregnant.

Accuracy = (TN + TP)/n. True Positive Rate = TP/(TP + FN). Positive Predicted Value = TP/(TP + FP).
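These measures all fall out of the 2×2 confusion matrix; a small sketch with made-up counts:

```python
def metrics(tp, fp, tn, fn):
    # Accuracy, true positive rate (sensitivity), and
    # positive predicted value (precision) from confusion-matrix counts
    n = tp + fp + tn + fn
    accuracy = (tn + tp) / n
    tpr = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return accuracy, tpr, ppv

# Hypothetical counts: 40 TP, 10 FP, 45 TN, 5 FN
print(metrics(40, 10, 45, 5))  # (0.85, 0.888..., 0.8)
```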

  

6 of 25


Exploratory Data Analysis

What questions do you have about the data?  

7 of 25


Clustering of similar items

Data science asks: Is a tomato a fruit or a vegetable?

Suppose a tomato is at (6, 4). Which food is closest?

8 of 25


knn: Assign a new object to the class of its closest neighbors

  knn suggests tomato is closest to orange & grape
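A sketch of the distance step in Python. Only the tomato's (6, 4) comes from the slide; the other foods' (sweetness, crunchiness) coordinates are invented for illustration:

```python
import math

# Hypothetical (sweetness, crunchiness) coordinates; only the
# tomato's (6, 4) is from the slide, the rest are made up.
foods = {"grape": (8, 5), "orange": (7, 3), "bean": (3, 7), "bacon": (1, 4)}
tomato = (6, 4)

# Rank foods by Euclidean distance to the tomato;
# the k = 2 nearest neighbors vote on its class.
ranked = sorted(foods, key=lambda f: math.dist(foods[f], tomato))
print(ranked[:2])  # ['orange', 'grape']
```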

9 of 25


K-means assigns clusters without knowing the response variable.

It iteratively defines clusters to minimize the total within-cluster variation.
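A bare-bones Lloyd's-algorithm sketch of k-means in plain Python (real work would use R's kmeans() or a library; the function and names here are my own):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: alternately assign each point to its nearest
    # centroid and recompute centroids, shrinking the total
    # within-cluster variation at each step.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2)
            clusters[j].append((x, y))
        centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

On two well-separated blobs of three points each, this recovers the blobs as the two clusters.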

10 of 25


The student interests dataset

30,000 students × 36 interests (friends, football, etc.)

Assigned to 5 clusters.

Each cluster contained every interest, but in different amounts.

Cluster 1: princesses

Cluster 2: brains

Cluster 3: bad boys & girls

Cluster 4: sports

Cluster 5: unexceptional

Would this be useful to a marketer?

11 of 25


Association rules: market basket

What rules can we create from this, such as

“People who buy bread are likely to buy milk.”

Denote as  {bread} ⇒ {milk}.

12 of 25


Association rules: statistics

A correlation matrix works, but it limits us to 2 items: {a} ⇒ {b}.

Rules can also have multiple items: X = {a, b, …} ⇒ Y = {c}.

Supp(X ⇒ Y) = #(X ∩ Y)/N: the joint popularity of X and Y.

Conf(X ⇒ Y) = #(X ∩ Y)/#X: the probability of buying Y, given buying X.

Lift(X ⇒ Y) = Supp(X ⇒ Y)/[Supp(X) × Supp(Y)]: the probability of buying Y given buying X, controlling for the popularity of Y.

Lift > 1: likely association; lift = 1: no association; lift < 1: unlikely association.
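A tiny worked example in Python, with five invented transactions, confirming the three definitions:

```python
# Toy market basket; these five transactions are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk"},
    {"beer", "eggs"},
]
n = len(transactions)

def supp(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

def conf(x, y):
    # P(buy Y | buy X)
    return supp(x | y) / supp(x)

def lift(x, y):
    # Confidence, controlling for the base popularity of Y
    return conf(x, y) / supp(y)

x, y = {"bread"}, {"milk"}
print(supp(x | y), conf(x, y), lift(x, y))  # 0.6 1.0 1.25
```

Here lift is 1.25 > 1, so {bread} ⇒ {milk} is a likely association.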

  

13 of 25


Prune # possible rules

6 items give C(6,2) + … + C(6,6) = 57 possible rules.

We want rules with a minimum support, say 60%.

Prune the search: if a 1-item itemset is infrequent, then every superset of that item is also infrequent.

Of the six 1-item itemsets: eliminate {eggs}, {soda}; support < 60%.

Of the six remaining 2-item itemsets: eliminate {beer, bread}, {beer, milk}.

The one remaining 3-item candidate: eliminate {bread, diapers, milk}.

{diapers} ⇒ {beer}?
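This pruning scheme is the Apriori algorithm. A self-contained Python sketch; the five transactions are my own reconstruction chosen to match the eliminations above, not the book's exact table:

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """All itemsets with support >= min_support, pruning every candidate
    that has an infrequent subset (the Apriori downward-closure principle)."""
    n = len(transactions)
    supp = lambda s: sum(s <= t for t in transactions) / n
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= min_support]
    frequent, k = list(level), 2
    while level:
        prev = set(level)
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets...
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # ...pruned if any (k-1)-subset is infrequent, before counting support.
        candidates = {c for c in candidates if all(c - {i} in prev for i in c)}
        level = [c for c in candidates if supp(c) >= min_support]
        frequent += level
        k += 1
    return frequent

# Invented five-transaction basket matching the slide's eliminations
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "soda"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "soda"},
]
frequent = apriori_frequent(transactions, 0.6)
print(sorted(map(sorted, frequent)))
```

With this data, {beer, diapers} survives at 60% support, hinting at the famous {diapers} ⇒ {beer} rule.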

14 of 25


The groceries dataset

10,000 transactions, 169 grocery items.

Creates hundreds of rules: actionable, trivial, inexplicable.

May require domain expertise.  

15 of 25


Decision tree

Remember the game “Twenty Questions”?

Suppose we limit to yes/no, non-compound questions (no use of “or”).

I’m thinking of a mathematician who has a chapter in E.T. Bell’s Men of Mathematics.

What are some good questions? What’s your goal with first few questions?

The questions, with yes/no responses, can be plotted as a decision tree.

 

16 of 25


Characteristics of decision trees

Tree-like graph modeling decisions. Non-parametric.

Easy to understand and visualize.

Target variable is usually categorical.

Which statistics predict “Will it rain tomorrow?” (Y/N)

Partition the data on the best split of each input variable, making each branch as homogeneous (pure) as possible.

Continue partitioning each branch.

The best split maximizes the Information Gain.
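Information Gain is the drop in entropy from a split. A small Python sketch with hypothetical rain/no-rain labels:

```python
import math

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def information_gain(parent, splits):
    # Parent entropy minus the weighted average entropy of the children
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

# Hypothetical node: 5 rainy days, 5 dry days, split on a binary predictor
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4
print(information_gain(parent, [left, right]))  # ≈ 0.278 bits
```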

  

17 of 25


Will it rain tomorrow?

18 of 25


Will it rain tomorrow?

19 of 25


Applications of decision trees

Prediction

Assess importance of variables; eliminate irrelevant variables.

Medical diagnoses.

Desirable candidates for loans, college acceptance, financial decisions, etc.

Similarities with regression.

20 of 25


Decision tree: baseball win %

Pitching vs hitting 

21 of 25


The Caravan Insurance Problem

What’s a caravan? (Domain knowledge)

● 5,800 observations, 86 variables.

● Imbalanced target variable: 94% no, 6% yes.

● What to do with outliers, NAs, etc.?

● Predictor variables are correlated, skewed, binned.

● Politically incorrect predictor variables!

● My solution included rebalancing the samples & creating some transformed variables.

  

22 of 25


Text analysis: quantify texts

How do writers differ quantitatively?

  • # characters / word, # words / sentence
  • Percent unique words
  • Use of frequent words
  • Use of sensory adjectives
  • Use of sentiment words
  • Use of positive or negative words
  • Verb/adjective ratio
  • Complexity (grade level readability)
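The first three metrics are easy to compute; a Python sketch (the regexes are deliberately naive, just for illustration):

```python
import re

def text_stats(text):
    # Crude word and sentence splits: letters/apostrophes are words,
    # . ! ? end sentences. Good enough to illustrate the metrics.
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "chars_per_word": sum(len(w) for w in words) / len(words),
        "words_per_sentence": len(words) / len(sentences),
        "pct_unique_words": 100 * len(set(words)) / len(words),
    }

print(text_stats("To be, or not to be. That is the question."))
# 3.0 chars/word, 5.0 words/sentence, 80% unique
```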

  

23 of 25


Uses of text analysis

  • Did Shakespeare really write his works?
  • Did Hamilton write 51 Federalist papers?
  • Which Beatles songs did John write?
  • Which tweets did Trump’s staff write?
  • Compare sentiments of characters
  • Compare books, presidents, judges
  • Opinions on consumer products, politicians, course evaluations
  • Predict spam, crime, fraud, terrorism.

  

24 of 25


Notes

Tomato problem & student interests problem from Machine Learning with R by Brett Lantz.

Beer/diapers problem from Introduction to Data Mining by Pang-Ning Tan et al.

Weather problem from Data Mining with Rattle and R by Graham Williams.

R for Data Science by Hadley Wickham & Garrett Grolemund.

  

25 of 25

Thanks!

Any questions?

You can find me at:

fcas80@yahoo.com

Jerry Tuttle
