Data Science Aboard the Titanic
Jerry Tuttle, FCAS, Rocky Mountain College of Art & Design
Titanic – the 1997 movie
What is data science?
The data science process
What is our goal?
Dataset: 9 independent variables, 1309 passengers
Variable | Definition | Key |
survival | survival | 0 = No, 1 = Yes |
pclass | passenger ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | sex | |
age | age in years | |
sibsp | # siblings/spouses aboard | |
parch | # parents/children aboard | |
ticket | ticket # | |
fare | passenger fare | |
cabin | cabin # | A (upper) through G |
embarked | port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Data files train.csv and test.csv are from Kaggle.com/c/titanic/data. Passenger ID and name are omitted. Other sources have different data.
Data training versus data testing
RMSE on test data = $\sqrt{\sum (\text{Actual} - \text{Predicted})^2 / n}$
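The RMSE formula above can be sketched in a few lines of Python. This is only an illustration: the function name `rmse` and the toy values are made up for the example and are not from the Titanic data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Made-up labels (1 = survived) and made-up predicted probabilities.
toy_actual = [1, 0, 0, 1]
toy_predicted = [0.9, 0.2, 0.4, 0.7]
toy_rmse = rmse(toy_actual, toy_predicted)
```

A lower value means the predictions sit closer to the actual outcomes; a perfect prediction gives 0.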
Performance measures
What is a decision tree?
Example of decision tree
Claim | RearEnd | Whiplash | Fraud |
1 | True | True | True |
2 | True | True | True |
3 | False | True | True |
4 | False | True | False |
5 | False | True | False |
6 | False | False | False |
7 | False | False | False |
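One way to see how a tree might choose its first split on the fraud table above is to compare the impurity left behind by each candidate variable. The sketch below assumes Gini impurity as the splitting criterion; the slides do not say which criterion was used, and the function names are illustrative.

```python
# Rows of the fraud table above: (rear_end, whiplash, fraud) per claim.
claims = [
    (True, True, True),
    (True, True, True),
    (False, True, True),
    (False, True, False),
    (False, True, False),
    (False, False, False),
    (False, False, False),
]

def gini(labels):
    """Gini impurity of a list of boolean labels (0 means pure)."""
    if not labels:
        return 0.0
    p_true = sum(labels) / len(labels)
    return 1.0 - p_true ** 2 - (1.0 - p_true) ** 2

def split_impurity(rows, feature_index):
    """Size-weighted Gini impurity after splitting on one boolean feature."""
    yes = [row[-1] for row in rows if row[feature_index]]
    no = [row[-1] for row in rows if not row[feature_index]]
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

root_impurity = gini([row[-1] for row in claims])
rear_end_impurity = split_impurity(claims, 0)
whiplash_impurity = split_impurity(claims, 1)
best_first_split = "RearEnd" if rear_end_impurity < whiplash_impurity else "Whiplash"
```

Under this assumption, splitting on RearEnd leaves a weighted impurity of about 0.229 versus about 0.343 for Whiplash (the unsplit table is about 0.490), so RearEnd would be the first split.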
Can we use linear regression on Titanic?
What is logistic regression?
$p/(1-p)$ is the odds that $y = 1$, where $p$ is the probability that $y = 1$. $\ln[p/(1-p)]$ is called the log odds.
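As a small illustration of the log odds and its inverse, the sketch below converts a probability to log odds and back. The function names are chosen for this example only.

```python
import math

def log_odds(p):
    """Logit: natural log of the odds p / (1 - p), for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def probability(z):
    """Inverse logit: convert a log-odds value z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

even_odds = log_odds(0.5)       # odds of 1 to 1
three_to_one = log_odds(0.75)   # odds of 3 to 1
round_trip = probability(log_odds(0.2))
```

Logistic regression models the log odds as a linear function of the predictors, which is why the inverse transform is needed to get back to a probability between 0 and 1.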
Data exploration & preparation
Create summaries and plots; examine distributions.
Variable | NA or blank | Impute, delete obs, or delete variable? |
Age | 263 NA | |
Fare | 1 blank | |
Cabin | 1014 blank | |
Embarked | 2 blank | |
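The table above leaves open whether to impute, delete observations, or delete the variable. As one possible option only (the slides do not state which choice was made), the sketch below shows median imputation on made-up ages, and computes the share of missing values from the counts in the table, which is a common input to that decision.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Made-up ages for illustration; None marks an NA.
toy_ages = [22.0, None, 38.0, 26.0, None, 35.0]
filled_ages = impute_median(toy_ages)

# Missing counts from the table above, out of 1309 passengers.
age_missing_share = 263 / 1309
cabin_missing_share = 1014 / 1309
```

Roughly 20% of Age values are missing compared with roughly 77% of Cabin values, which suggests the two variables may call for different treatment.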
Numerical variables
Multivariate data plots
Disadvantages of decision trees
Decision tree 1
| Predicted Did Not Survive | Predicted Survived |
Actual = Did Not Survive | 260 | 6 |
Actual = Survived | 12 | 140 |
Accuracy = (260 + 140) / 418 = 95.7%
True Pos Rate = 140 / (140 + 12) = 92.1%
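The two measures for decision tree 1 can be reproduced from the four cells of the confusion matrix. The sketch below uses the counts from the table above; the function and key names are illustrative.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Accuracy and true positive rate from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "n": total,
        "accuracy": (tn + tp) / total,
        "true_positive_rate": tp / (tp + fn),
    }

# Decision tree 1: rows are actual outcomes, columns are predictions.
tree1 = confusion_metrics(tn=260, fp=6, fn=12, tp=140)
```

Accuracy counts both kinds of correct prediction over all 418 test passengers, while the true positive rate looks only at the 152 passengers who actually survived.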
Decision tree 2
Accuracy = (250 + 123) / 418 = 89.2%
True Pos Rate = 123 / (123 + 29) = 80.9%
Decision tree 3
Accuracy = (260 + 146) / 418 = 97.1%
True Pos Rate = 146 / (146 + 6) = 96.1%
Regression 1
Regression 2
Interpreting the coefficients
Regression measures
Model | AIC | Resid Dev | True Pos Rate | Accuracy | AUC |
1 | 804.3 | 784.3 | .941 | .947 | .981 |
2 | 802.8 | 790.8 | .954 | .943 | .984 |
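The AIC and residual deviance columns above are linked. Assuming the models were fit to one-row-per-passenger 0/1 data, so that residual deviance equals minus twice the log-likelihood (an assumption, not stated in the slides), AIC = residual deviance + 2k, where k is the number of estimated coefficients including the intercept. The sketch below backs out k for each model under that assumption.

```python
def implied_parameter_count(aic, residual_deviance):
    """Solve AIC = residual deviance + 2k for k (assumes ungrouped binary data)."""
    return (aic - residual_deviance) / 2.0

# (AIC, residual deviance) pairs from the table above.
models = {1: (804.3, 784.3), 2: (802.8, 790.8)}
implied_k = {
    m: round(implied_parameter_count(aic, dev), 6)
    for m, (aic, dev) in models.items()
}
# Lower AIC is preferred when comparing models fit to the same data.
aic_preferred = min(models, key=lambda m: models[m][0])
```

Under this assumption, model 1 would have 10 coefficients and model 2 would have 6. Model 2 has the lower AIC even though its residual deviance is higher, because AIC penalizes the extra parameters in model 1.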
Most likely to survive
Based on a combination of trees and regression:
Many online references cover individual passengers, e.g., encyclopedia-titanic.org. The last survivor died in 2009 at age 97; she was two months old as a passenger.
This was a classification problem
Classification is the process of assigning a record to one of a finite number of classes. Given a new record, we can now predict its class and the probability of that class.
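Predicting both a class and a probability for a new record might look like the sketch below. The coefficients are hypothetical placeholders, not the fitted values from this talk, and the 0.5 threshold is a conventional default rather than something the slides specify.

```python
import math

# Hypothetical coefficients for illustration only, NOT the fitted model.
HYPOTHETICAL_COEFS = {"intercept": 1.0, "is_male": -2.5, "pclass": -0.5}

def predict(record, coefs=HYPOTHETICAL_COEFS, threshold=0.5):
    """Return (predicted_class, probability_of_survival) for one record."""
    z = coefs["intercept"] + sum(
        coefs[name] * value for name, value in record.items()
    )
    p = 1.0 / (1.0 + math.exp(-z))  # inverse logit of the linear score
    return (1 if p >= threshold else 0), p

female_first_class = predict({"is_male": 0, "pclass": 1})
male_third_class = predict({"is_male": 1, "pclass": 3})
```

The probability comes from the model; the class comes from comparing that probability with the threshold, so changing the threshold trades true positives against false positives.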
Other applications of classification techniques:
Be careful of disparate impact on a protected class.
More free fun datasets to amuse you
Thanks!