What Is Data Science?
Jerry Tuttle, FCAS. Rocky Mountain College of Art & Design
People Buying a Left Shoe Also Buy a Right Shoe
What Is Data Science?
The intersection of math/statistics, computer science, and domain knowledge, used to extract meaningful insights from data and translate them into tangible business value.
Some Data Science Problems
Problem | Category | Techniques
Determine relationship between outcome & input | Regression | Linear, logistic regression
Group items by similarity | Clustering | K-nearest neighbors, k-means
Find relationships between items | Association rules | Apriori
Analyze text data | Text analysis | Text analysis
Assign known labels to objects | Classification | Decision trees, random forests
Data: Training vs. Testing
RMSE on test data = √[ ∑ (Actual − Predicted)² / n ]
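The formula above, evaluated on a few hypothetical actual/predicted test-set values (the numbers are invented for illustration):

```python
import math

# Hypothetical actual and predicted values on held-out test data
actual    = [3.0, 5.0, 2.5, 7.0, 4.5]
predicted = [2.8, 5.5, 2.0, 6.5, 5.0]

n = len(actual)
# RMSE = sqrt( sum of squared errors / n )
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```

Computing RMSE on the *test* data, not the training data, is what guards against overfitting.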
Performance Measures
False positive (Type I): doctor tells a male patient he is pregnant.
False negative (Type II): doctor tells an obviously pregnant woman she is not pregnant.
Accuracy = (TN + TP)/n. True Positive Rate = TP/(TP + FN). Positive Predicted Value = TP/(TP + FP).
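These three measures, computed from hypothetical confusion-matrix counts (the counts are made up for illustration):

```python
# Hypothetical confusion-matrix counts
TP, FP, TN, FN = 40, 10, 45, 5
n = TP + FP + TN + FN

accuracy = (TN + TP) / n     # overall fraction classified correctly
tpr      = TP / (TP + FN)    # true positive rate (sensitivity)
ppv      = TP / (TP + FP)    # positive predicted value (precision)
```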
Exploratory Data Analysis
What questions do you have about the data?
Clustering of similar items
Data science asks: Is a tomato a fruit or a vegetable?
Suppose a tomato is at (6, 4). Which food is closest?
knn: Assign a new object to the class of its closest neighbor
knn suggests the tomato is closest to orange & grape
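A 1-nearest-neighbor sketch of the tomato question in Python. The food coordinates below are hypothetical stand-ins, not the values from the original example:

```python
import math

# Hypothetical 2-D scores for each food; tomato = (6, 4)
foods = {"grape": (8, 5), "orange": (7, 3), "green bean": (3, 7), "nuts": (3, 6)}
tomato = (6, 4)

def dist(a, b):
    # Euclidean distance between two 2-D points
    return math.hypot(a[0] - b[0], a[1] - b[1])

# 1-NN: the food at minimum distance from the tomato
nearest = min(foods, key=lambda f: dist(foods[f], tomato))
```

With these invented coordinates the tomato lands nearest the orange, i.e., knn votes "fruit".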
K-means assigns clusters without knowing response variable
Iteratively defines clusters to minimize total within-cluster variation
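A minimal k-means loop over toy 2-D points (data and initial centroids are invented; the initial guesses are chosen so no cluster empties):

```python
# Toy 2-D points forming two visible groups (hypothetical data)
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centroids = [(1, 1), (9, 8.5)]   # initial guesses, one per cluster

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for _ in range(10):
    # Assignment step: each point joins its nearest centroid's cluster
    clusters = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        clusters[j].append(p)
    # Update step: move each centroid to its cluster's mean
    centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                 for c in clusters]
```

Each pass of assign-then-update can only lower the total within-cluster variation, which is why the iteration converges.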
The student interests dataset
● 30,000 students × 36 interests (friends, football, etc.)
● Assigned to 5 clusters.
● Each cluster contained every interest, but in different amounts.
Cluster 1: princesses
Cluster 2: brains
Cluster 3: bad boys & girls
Cluster 4: sports
Cluster 5: unexceptional
● Would this be useful to a marketer?
Association rules: market basket
● What rules can we create from this, such as
“People who buy bread are likely to buy milk.”
● Denote as {bread} ⇒ {milk}.
Association rules: statistics
● Correlation matrix. But this limits rules to 2 items, {a} ⇒ {b}.
● Can also have multiple items: X = {a, b, …} ⇒ Y = {c}.
● Supp(X ⇒ Y) = #(X ∩ Y)/N: popularity of X and Y together.
● Conf(X ⇒ Y) = #(X ∩ Y)/#X: probability of buying Y, given buying X.
● Lift(X ⇒ Y) = Supp(X ⇒ Y)/[Supp(X) × Supp(Y)]:
probability of buying Y given buying X, controlling for the popularity of Y.
Lift > 1: likely association; lift = 1: no association; lift < 1: unlikely association.
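All three statistics computed directly from a handful of hypothetical transactions (the baskets are invented for illustration):

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "beer"},
    {"milk", "soda"},
    {"bread", "milk", "beer"},
]
N = len(transactions)

def supp(items):
    # Fraction of transactions containing every item in the set
    return sum(items <= t for t in transactions) / N

# Statistics for the rule {bread} => {milk}
support    = supp({"bread", "milk"})
confidence = supp({"bread", "milk"}) / supp({"bread"})
lift       = supp({"bread", "milk"}) / (supp({"bread"}) * supp({"milk"}))
```

Here lift comes out slightly below 1: bread and milk are each so popular on their own that the rule adds little beyond chance.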
Prune the number of possible rules
● 6 items give C(6,2) + … + C(6,6) = 57 possible rules.
● Want rules with a minimum support, say 60%.
● Prune the search: if a 1-item itemset is infrequent, then all supersets of that item are also infrequent.
● Six 1-item itemsets: eliminate {eggs}, {soda}; support < 60%.
● Six 2-item itemsets: eliminate {beer, bread}, {beer, milk}.
● One 3-item itemset: eliminate {bread, diapers, milk}.
{diapers} ⇒ {beer}?
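A sketch of the Apriori pruning idea in Python. The transactions below are hypothetical, with counts invented so that {diapers, beer} survives a 60% support cut:

```python
from itertools import combinations

# Hypothetical transactions; minimum support 60%
transactions = [
    {"bread", "milk", "diapers"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "eggs"},
]
N, min_supp = len(transactions), 0.6

def supp(items):
    return sum(items <= t for t in transactions) / N

# Step 1: keep only frequent single items (the Apriori pruning step);
# any superset of an infrequent item is never examined
frequent1 = {i for t in transactions for i in t if supp({i}) >= min_supp}

# Step 2: only pairs drawn from frequent items can themselves be frequent
frequent2 = [set(c) for c in combinations(sorted(frequent1), 2)
             if supp(set(c)) >= min_supp]
```

With this invented data, {eggs} is pruned at step 1, and {diapers, beer} survives step 2, with Conf({diapers} ⇒ {beer}) = 0.6/0.8 = 75%.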
The groceries dataset
● 10,000 transactions, 169 grocery items.
● Creates hundreds of rules: actionable, trivial, inexplicable.
● May require domain expertise.
Decision tree
● Remember the game “Twenty Questions?”
● Suppose we limit to yes/no, non-compound questions (no use of “or”).
● I’m thinking of a mathematician who has a chapter in E.T. Bell’s Men of Mathematics.
● What are some good questions? What’s your goal with first few questions?
● The questions, with yes/no responses, can be plotted as a decision tree.
Characteristics of decision trees
● Tree-like graph modeling decisions. Non-parametric.
● Easy to understand and visualize.
● Target variable usually categorical.
● Which statistics predict "Will it rain tomorrow?" (Y/N)?
● Partition the data on the best split of each input variable, making each branch as homogeneous (pure) as possible.
● Continue partitioning each branch.
● The best split maximizes the information gain.
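One way to make "best split maximizes information gain" concrete, using entropy as the purity measure and invented rain/no-rain labels:

```python
import math

def entropy(labels):
    # Shannon entropy of a label list, in bits
    n = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical "rain tomorrow" labels before and after a candidate split
parent = ["yes"] * 5 + ["no"] * 5   # maximally impure: entropy = 1 bit
left   = ["yes"] * 4 + ["no"] * 1   # e.g., high-humidity branch
right  = ["yes"] * 1 + ["no"] * 4   # e.g., low-humidity branch

# Information gain = parent entropy minus the weighted child entropies
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
```

The tree algorithm evaluates this gain for each candidate split of each input variable and takes the largest.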
Will it rain tomorrow?
Applications of decision trees
● Prediction
● Assess importance of variables; eliminate irrelevant variables.
● Medical diagnoses.
● Desirable candidates for loans, college acceptance, financial decisions, etc.
● Similarities with regression.
Decision tree: baseball win %
Pitching vs hitting
The Caravan Insurance Problem
● What's a caravan? (Domain knowledge)
● 5,800 observations, 86 variables.
● Imbalanced target variable: 94% no, 6% yes.
● What to do with outliers, NAs, etc.?
● Predictor variables correlated, skewed, binned.
● Politically incorrect predictor variables!
● My solution included rebalancing the samples & creating some transformed variables.
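The slide doesn't say which rebalancing method was used; oversampling the minority class with replacement is one common choice, sketched here on invented data with the same 94/6 imbalance:

```python
import random

random.seed(1)
# Hypothetical imbalanced target: 94% "no", 6% "yes"
data = [("no", i) for i in range(94)] + [("yes", i) for i in range(6)]

majority = [r for r in data if r[0] == "no"]
minority = [r for r in data if r[0] == "yes"]

# Oversample the minority class with replacement until the classes match
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))
balanced = majority + oversampled
```

Rebalancing only the *training* sample matters here; the test sample should keep the true 94/6 mix so performance measures stay honest.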
Text analysis: quantify texts
How do writers differ quantitatively?
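One simple way to quantify a text: word frequencies and average word length, sketched here on a famous opening line (any real comparison would use full texts):

```python
import re
from collections import Counter

# Opening line of A Tale of Two Cities, used as a tiny sample text
text_a = "It was the best of times, it was the worst of times."

def word_freq(text):
    # Lowercase, keep only letters/apostrophes, count occurrences
    return Counter(re.findall(r"[a-z']+", text.lower()))

freq_a = word_freq(text_a)
# Average word length: one crude quantitative fingerprint of a writer
avg_len_a = sum(len(w) * c for w, c in freq_a.items()) / sum(freq_a.values())
```

Comparing such fingerprints (word frequencies, word lengths, vocabulary size) across authors is one quantitative way writers differ.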
Uses of Text analysis
Notes
● Tomato problem & student interests problem from Machine Learning with R by Brett Lantz.
● Beer/diapers problem from Introduction to Data Mining by Pang-Ning Tan et al.
● Weather problem from Data Mining with Rattle and R by Graham Williams.
● R for Data Science by Hadley Wickham & Garrett Grolemund.
Thanks!
Any questions?