What Is Data Science?
Jerry Tuttle, FCAS. Rocky Mountain College of Art & Design
People Buying a Left Shoe Also Buy a Right Shoe
What Is Data Science?
The intersection of math/statistics, computer science, and domain knowledge, used to extract meaningful insights from data and translate them into tangible business value.
Some Data Science Problems
Problem | Category | Techniques
Determine relationship between outcome & input | Regression | Linear, logistic regression
Group items by similarity | Clustering | K-nearest neighbors, k-means
Find relationships between items | Association rules | Apriori
Analyze text data | Text analysis | Text analysis
Assign known labels to objects | Classification | Decision trees, random forests
Data: Training vs. Testing
RMSE on test data = √[ ∑ (Actual − Predicted)² / n ]
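The formula above, evaluated on a few hypothetical actual/predicted test-set values (the numbers are invented for illustration):

```python
import math

# Hypothetical actual and predicted values on held-out test data
actual    = [3.0, 5.0, 2.5, 7.0, 4.5]
predicted = [2.8, 5.5, 2.0, 6.5, 5.0]

n = len(actual)
# RMSE = sqrt( sum of squared errors / n )
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```

Computing RMSE on the *test* data, not the training data, is what guards against overfitting.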
Performance Measures
False positive (Type I): doctor tells a male patient he is pregnant.
False negative (Type II): doctor tells an obviously pregnant woman she is not pregnant.
Accuracy = (TN + TP)/n. True Positive Rate = TP/(TP + FN). Positive Predicted Value = TP/(TP + FP).
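These three measures, computed from hypothetical confusion-matrix counts (the counts are made up for illustration):

```python
# Hypothetical confusion-matrix counts
TP, FP, TN, FN = 40, 10, 45, 5
n = TP + FP + TN + FN

accuracy = (TN + TP) / n     # overall fraction classified correctly
tpr      = TP / (TP + FN)    # true positive rate (sensitivity)
ppv      = TP / (TP + FP)    # positive predicted value (precision)
```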
Exploratory Data Analysis
What questions do you have about the data?
Clustering of similar items
Data science asks: Is a tomato a fruit or a vegetable?
Suppose a tomato is at (6, 4). Which food is closest?
knn: Assign a new object to the class of its closest neighbor
knn suggests the tomato is closest to orange & grape
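A 1-nearest-neighbor sketch of the tomato question in Python. The food coordinates below are hypothetical stand-ins, not the values from the original example:

```python
import math

# Hypothetical 2-D scores for each food; tomato = (6, 4)
foods = {"grape": (8, 5), "orange": (7, 3), "green bean": (3, 7), "nuts": (3, 6)}
tomato = (6, 4)

def dist(a, b):
    # Euclidean distance between two 2-D points
    return math.hypot(a[0] - b[0], a[1] - b[1])

# 1-NN: the food at minimum distance from the tomato
nearest = min(foods, key=lambda f: dist(foods[f], tomato))
```

With these invented coordinates the tomato lands nearest the orange, i.e., knn votes "fruit".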
K-means assigns clusters without knowing response variable
Iteratively defines clusters to minimize total within-cluster variation
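A minimal k-means loop over toy 2-D points (data and initial centroids are invented; the initial guesses are chosen so no cluster empties):

```python
# Toy 2-D points forming two visible groups (hypothetical data)
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centroids = [(1, 1), (9, 8.5)]   # initial guesses, one per cluster

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for _ in range(10):
    # Assignment step: each point joins its nearest centroid's cluster
    clusters = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        clusters[j].append(p)
    # Update step: move each centroid to its cluster's mean
    centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                 for c in clusters]
```

Each pass of assign-then-update can only lower the total within-cluster variation, which is why the iteration converges.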
The student interests dataset
● 30,000 students × 36 interests (friends, football, etc.)
● Assigned to 5 clusters.
● Each cluster contained every interest, but in different amounts.
Cluster 1: princesses
Cluster 2: brains
Cluster 3: bad boys & girls
Cluster 4: sports
Cluster 5: unexceptional
● Would this be useful to a marketer?
Association rules: market basket
● What rules can we create from this, such as
“People who buy bread are likely to buy milk.”
● Denote as {bread} ⇒ {milk}.
Association rules: statistics
● Correlation matrix. But this limits rules to 2 items, {a} ⇒ {b}.
● Can also have multiple items: X = {a, b, …} ⇒ Y = {c}.
● Supp(X ⇒ Y) = #(X ∩ Y)/N: popularity of X and Y together.
● Conf(X ⇒ Y) = #(X ∩ Y)/#X: probability of buying Y, given buying X.
● Lift(X ⇒ Y) = Supp(X ⇒ Y)/[Supp(X) × Supp(Y)]:
probability of buying Y given buying X, controlling for the popularity of Y.
Lift > 1: likely association; lift = 1: no association; lift < 1: unlikely association.
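All three statistics computed directly from a handful of hypothetical transactions (the baskets are invented for illustration):

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "beer"},
    {"milk", "soda"},
    {"bread", "milk", "beer"},
]
N = len(transactions)

def supp(items):
    # Fraction of transactions containing every item in the set
    return sum(items <= t for t in transactions) / N

# Statistics for the rule {bread} => {milk}
support    = supp({"bread", "milk"})
confidence = supp({"bread", "milk"}) / supp({"bread"})
lift       = supp({"bread", "milk"}) / (supp({"bread"}) * supp({"milk"}))
```

Here lift comes out slightly below 1: bread and milk are each so popular on their own that the rule adds little beyond chance.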
Prune the number of possible rules
● 6 items give C(6,2) + … + C(6,6) = 57 possible rules.
● Want rules with a minimum support, say 60%.
● Prune the search: if a 1-item itemset is infrequent, then all supersets of that item are also infrequent.
● Six 1-item itemsets: eliminate {eggs}, {soda}; support < 60%.
● Six 2-item itemsets: eliminate {beer, bread}, {beer, milk}.
● One 3-item itemset: eliminate {bread, diapers, milk}.
{diapers} ⇒ {beer}?
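A sketch of the Apriori pruning idea in Python. The transactions below are hypothetical, with counts invented so that {diapers, beer} survives a 60% support cut:

```python
from itertools import combinations

# Hypothetical transactions; minimum support 60%
transactions = [
    {"bread", "milk", "diapers"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "eggs"},
]
N, min_supp = len(transactions), 0.6

def supp(items):
    return sum(items <= t for t in transactions) / N

# Step 1: keep only frequent single items (the Apriori pruning step);
# any superset of an infrequent item is never examined
frequent1 = {i for t in transactions for i in t if supp({i}) >= min_supp}

# Step 2: only pairs drawn from frequent items can themselves be frequent
frequent2 = [set(c) for c in combinations(sorted(frequent1), 2)
             if supp(set(c)) >= min_supp]
```

With this invented data, {eggs} is pruned at step 1, and {diapers, beer} survives step 2, with Conf({diapers} ⇒ {beer}) = 0.6/0.8 = 75%.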
The groceries dataset
● 10,000 transactions, 169 grocery items.
● Creates hundreds of rules: actionable, trivial, inexplicable.
● May require domain expertise.
Decision tree
● Remember the game “Twenty Questions?”
● Suppose we limit to yes/no, non-compound questions (no use of “or”).
● I’m thinking of a mathematician who has a chapter in E.T. Bell’s Men of Mathematics.
● What are some good questions? What’s your goal with first few questions?
● The questions, with yes/no responses, can be plotted as a decision tree.
Characteristics of decision trees
● Tree-like graph modeling decisions. Non-parametric.
● Easy to understand and visualize.
● Target variable usually categorical.
● Which statistics predict "Will it rain tomorrow?" (Y/N)?
● Partition the data on the best split of each input variable, making each branch as homogeneous (pure) as possible.
● Continue partitioning each branch.
● The best split maximizes the information gain.
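One way to make "best split maximizes information gain" concrete, using entropy as the purity measure and invented rain/no-rain labels:

```python
import math

def entropy(labels):
    # Shannon entropy of a label list, in bits
    n = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical "rain tomorrow" labels before and after a candidate split
parent = ["yes"] * 5 + ["no"] * 5   # maximally impure: entropy = 1 bit
left   = ["yes"] * 4 + ["no"] * 1   # e.g., high-humidity branch
right  = ["yes"] * 1 + ["no"] * 4   # e.g., low-humidity branch

# Information gain = parent entropy minus the weighted child entropies
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
```

The tree algorithm evaluates this gain for each candidate split of each input variable and takes the largest.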
Will it rain tomorrow?
Applications of decision trees
● Prediction
● Assess importance of variables; eliminate irrelevant variables.
● Medical diagnoses.
● Desirable candidates for loans, college acceptance, financial decisions, etc.
● Similarities with regression.
Decision tree: baseball win %
Pitching vs hitting
The Caravan Insurance Problem
● What's a caravan? (Domain knowledge)
● 5,800 observations, 86 variables.
● Imbalanced target variable: 94% no, 6% yes.
● What to do with outliers, NAs, etc.?
● Predictor variables correlated, skewed, binned.
● Politically incorrect predictor variables!
● My solution included rebalancing the samples & creating some transformed variables.
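The slide doesn't say which rebalancing method was used; oversampling the minority class with replacement is one common choice, sketched here on invented data with the same 94/6 imbalance:

```python
import random

random.seed(1)
# Hypothetical imbalanced target: 94% "no", 6% "yes"
data = [("no", i) for i in range(94)] + [("yes", i) for i in range(6)]

majority = [r for r in data if r[0] == "no"]
minority = [r for r in data if r[0] == "yes"]

# Oversample the minority class with replacement until the classes match
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))
balanced = majority + oversampled
```

Rebalancing only the *training* sample matters here; the test sample should keep the true 94/6 mix so performance measures stay honest.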
Text analysis: quantify texts
How do writers differ quantitatively?
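One simple way to quantify a text: word frequencies and average word length, sketched here on a famous opening line (any real comparison would use full texts):

```python
import re
from collections import Counter

# Opening line of A Tale of Two Cities, used as a tiny sample text
text_a = "It was the best of times, it was the worst of times."

def word_freq(text):
    # Lowercase, keep only letters/apostrophes, count occurrences
    return Counter(re.findall(r"[a-z']+", text.lower()))

freq_a = word_freq(text_a)
# Average word length: one crude quantitative fingerprint of a writer
avg_len_a = sum(len(w) * c for w, c in freq_a.items()) / sum(freq_a.values())
```

Comparing such fingerprints (word frequencies, word lengths, vocabulary size) across authors is one quantitative way writers differ.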
Uses of Text analysis
Notes
● Tomato problem & student interests problem from Machine Learning with R by Brett Lantz.
● Beer/diapers problem from Introduction to Data Mining by Pang-Ning Tan et al.
● Weather problem from Data Mining with Rattle and R by Graham Williams.
● R for Data Science by Hadley Wickham & Garrett Grolemund.
Thanks!
Any questions?