1 of 28

Data Science Aboard the Titanic

Jerry Tuttle, FCAS
Rocky Mountain College of Art & Design


Data Science Aboard The Titanic

  • The British luxury ship Titanic sank in 1912.
  • Of an estimated 2,224 passengers and crew, more than 1,500 died.
  • Counts differ by source: some names on the passenger list were canceled, some passengers traveled under aliases and were counted twice, and some were rescued but died soon after.
  • 1997 movie with Leonardo DiCaprio and Kate Winslet; there was also a 1953 film.
  • The passenger list is a popular dataset for various data analyses.


Titanic – the 1997 movie

  • Leonardo DiCaprio played Jack Dawson, age 20. He has a third-class ticket and is traveling with a male friend.
  • Kate Winslet played Rose DeWitt Bukater, age 17. She has a first-class ticket and is traveling with her mother.
  • How did Dawson get his ship ticket?
  • Why is Bukater not on the passenger list?


What is data science?

  • The intersection of math/stats, computer science, and domain knowledge, used to extract meaningful insights from data and translate them into tangible business value.


The data science process

  • Similarities to Polya’s problem solving techniques in How to Solve It.


What is our goal?

  • We have the dataset of Titanic passengers, including various features about them. (I am using the Kaggle version.)
  • We want to develop a data science model that will predict whether any particular passenger will survive.
  • There are many possible techniques, but we will limit this discussion to Decision Trees and Regression.


Dataset: 9 independent variables, 1,309 passengers

Variable  | Definition                 | Key
----------|----------------------------|---------------------------
survival  | survival                   | 0 = No, 1 = Yes
pclass    | passenger ticket class     | 1 = 1st, 2 = 2nd, 3 = 3rd
sex       | sex                        |
age       | age in years               |
sibsp     | # siblings/spouses aboard  |
parch     | # parents/children aboard  |
ticket    | ticket #                   |
fare      | passenger fare             |
cabin     | cabin #                    | A (upper) through G
embarked  | port of embarkation        | cities C, Q, S

Data files train.csv, test.csv from kaggle.com/c/titanic/data. Omit ID and name. Other sources have different data.


Training data versus testing data


RMSE on test data = √[∑(Actual − Predicted)² / n]
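The RMSE formula above can be computed directly. A minimal Python sketch with made-up values (not Titanic predictions):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired actual/predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Toy values for illustration:
print(rmse([1, 0, 1, 1], [1, 0, 0, 1]))  # one miss in four -> sqrt(1/4) = 0.5
```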


Performance measures


  • False pos (Type I): Dr tells male he is pregnant. False neg (Type II): Dr tells obviously pregnant woman she is not pregnant.
  • Accuracy: (TN + TP)/n. True Pos Rate = Recall = TP/(TP + FN). Pos Predicted Value = Precision = TP/(TP + FP).
  • Area under curve: True Pos Rate vs False Pos Rate, over varying probability thresholds.
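These measures follow directly from raw confusion-matrix counts. A short Python sketch, using the decision tree 1 counts shown later in the deck:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, recall (true positive rate), and precision from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,       # fraction of all predictions that are correct
        "recall": tp / (tp + fn),        # of actual positives, fraction caught
        "precision": tp / (tp + fp),     # of predicted positives, fraction correct
    }

# Counts from the decision tree 1 confusion matrix later in the deck:
m = metrics(tp=140, fp=6, tn=260, fn=12)
print(round(m["accuracy"], 3), round(m["recall"], 3))  # 0.957 0.921
```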


What is a decision tree?

  • A decision tree predicts the value of a target variable by successively splitting the data into subsets, starting at the root node, until the final nodes are homogeneous in the target variable or further splitting no longer adds value.
  • The algorithm tests each variable for each possible split.
  • It splits one variable at each node, and then moves forward (down the tree, never backward).
  • Each split attempts to maximize homogeneity of the target variable within the subsets. Gini = 1 − ∑(pi)², where pi is the probability of each class.
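The Gini formula above in Python, as a minimal sketch:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([1, 1, 1, 1]))  # pure node -> 0.0
print(gini([1, 1, 0, 0]))  # 50/50 node -> 0.5
```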


Example of decision tree

  • Root node: on 100% of the data, 57% are not fraud (Fraud = False).
  • If RearEnd = False: on 71% of the data, 80% are not fraud and 20% are fraud. If RearEnd = True: on 29% of the data, 100% are fraud.
  • The tree chose RearEnd as a better variable to split the data than Whiplash.


Claim | RearEnd | Whiplash | Fraud
------|---------|----------|------
1     | True    | True     | True
2     | True    | True     | True
3     | False   | True     | True
4     | False   | True     | False
5     | False   | True     | False
6     | False   | False    | False
7     | False   | False    | False
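Under the Gini criterion, the split is chosen by comparing the weighted impurity of the child nodes. A Python sketch on the seven claims above, confirming that RearEnd is the better split (this reimplements the idea by hand; it is not any tree package's own code):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(feature, target):
    """Average child-node impurity after splitting on a binary feature."""
    n = len(target)
    total = 0.0
    for value in (True, False):
        subset = [t for f, t in zip(feature, target) if f == value]
        if subset:
            total += len(subset) / n * gini(subset)
    return total

# The seven claims from the table:
rear_end = [True, True, False, False, False, False, False]
whiplash = [True, True, True, True, True, False, False]
fraud    = [True, True, True, False, False, False, False]

print(round(weighted_gini(rear_end, fraud), 3))  # 0.229
print(round(weighted_gini(whiplash, fraud), 3))  # 0.343 -> RearEnd gives the purer children
```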


Can we use linear regression on Titanic?

  • Given n independent variables x1, …, xn, linear regression fits a model y = β0 + β1x1 + β2x2 + … + βnxn.
  • There are several important assumptions of linear regression, but the one we will emphasize is that the dependent variable y must be continuous; survival is 0/1, so linear regression is a poor fit here.


What is logistic regression?

  • Given n independent variables x1, …, xn, and a dependent variable y that takes values 0 and 1, logistic regression fits a model p = 1/[1 + e^(−(β0 + β1x1 + β2x2 + … + βnxn))], where p is the probability that y = 1.
  • Then p/(1 − p) is the odds that y = 1, and LN[p/(1 − p)] is called the log odds.
  • If p > .5, set y = 1; else y = 0.
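A minimal Python sketch of scoring one observation with such a model. The coefficients here are made up for illustration, not the fitted Titanic model:

```python
import math

def predict(beta0, betas, x, threshold=0.5):
    """Logistic model: p = 1/(1 + exp(-(b0 + b1*x1 + ...))); classify 1 if p > threshold."""
    log_odds = beta0 + sum(b * xi for b, xi in zip(betas, x))
    p = 1 / (1 + math.exp(-log_odds))
    return p, int(p > threshold)

# Hypothetical coefficients and one observation:
p, yhat = predict(beta0=0.5, betas=[-1.0, 2.0], x=[3, 1])
print(round(p, 3), yhat)  # 0.378 0
```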


Data exploration & preparation

  • Examine variable types (confirm factors, check levels), create summaries & plots, examine distributions.

  • Look for outliers, missing data, duplicate records.
  • What to do about missing data (NA or as blank)?


Variable | NA or blank | Impute, delete obs, or delete variable?
---------|-------------|----------------------------------------
Age      | 263 NA      |
Fare     | 1 blank     |
Cabin    | 1014 blank  |
Embarked | 2 blank     |
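One common choice for the 263 missing ages is median imputation. A minimal pure-Python sketch on toy values (real work would use a data-frame library):

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

# Toy ages with missing entries (the real data has 263 missing ages):
ages = [22, None, 38, 26, None, 35]
print(impute_median(ages))  # missing ages filled with the median, 30.5
```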


Numerical variables


Multivariate data plots


Disadvantages of decision trees

  • They are greedy: they make the best decision at the current node but can't look ahead for a better one. Example: make change for 30c with the fewest coins from 25c, a hypothetical 15c, and 1c coins. Greedy starts with the 25c coin because it leaves the smallest amount to pay out, but then needs five 1c coins; the less greedy choice of two 15c coins uses fewer coins.
  • Sensitive to the particular observations in the training data, and hence to the particular variables chosen to split each node.
  • Overfitting: the model fits the training data too well, at the expense of test data.
  • Solution: multiple trees (random forest).
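The coin example can be run directly. A Python sketch, with the 15c coin purely hypothetical as noted above:

```python
def greedy_change(amount, coins):
    """Greedy: repeatedly take the largest coin that fits (as a tree greedily picks splits)."""
    used = []
    for c in sorted(coins, reverse=True):
        while amount >= c:
            amount -= c
            used.append(c)
    return used

# With 25c, a hypothetical 15c, and 1c coins, greedy needs six coins for 30c...
print(greedy_change(30, [25, 15, 1]))  # [25, 1, 1, 1, 1, 1]
# ...but two 15c coins would do. Greedy trees can likewise miss a globally better split.
```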


Decision tree 1


                         | Predicted Did Not Survive | Predicted Survived
Actual = Did Not Survive | 260                       | 6
Actual = Survived        | 12                        | 140

Accuracy = (260 + 140) / 418 = 95.7%

True Pos Rate = 140 / (140 + 12) = 92.1%


Decision tree 2


Accuracy = (250 + 123) / 418 = 89.2%

True Pos Rate = 123 / (123 + 29) = 80.9%


Decision tree 3


Accuracy = (260 + 146) / 418 = 97.1%

True Pos Rate = 146 / (146 + 6) = 96.1%


Regression 1

  • LN[p/(1 − p)] = β0 + β1 Pclass + β2 Sex + β3 Age + β4 Sibsp + β5 Parch + β6 Fare + β7 Embarked


Regression 2

  • LN[p/(1 − p)] = β0 + β1 Pclass + β2 Sex + β3 Age + β4 Sibsp


Interpreting the coefficients

  • For a 1-unit increase in xi, the expected change in the LN odds of success (survival) is βi, holding all other independent variables fixed.
  • Pclass (passenger ticket class) is a categorical variable, with 3 = 3rd class < 2 = 2nd class < 1 = 1st class. The regression converted Pclass into dummy variables, with 3rd class as the base level.
  • As Pclass improves, the LN odds of survival increase.
  • Being male (Sexmale = 1) decreases the LN odds of survival.
  • An increase in Age decreases the LN odds of survival.
  • An increase in # siblings/spouses decreases the LN odds of survival.
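Since a coefficient βi is a change in LN odds, exp(βi) is the multiplicative change in the odds themselves. A sketch with a hypothetical coefficient (the fitted values are not shown on the slide):

```python
import math

# A coefficient beta on Sexmale changes the log odds by beta per unit;
# exp(beta) is the multiplicative change in the odds themselves.
beta_sexmale = -2.5  # hypothetical value, for illustration only
odds_ratio = math.exp(beta_sexmale)
print(round(odds_ratio, 3))  # 0.082: male odds of survival ~8% of female odds, other vars fixed
```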


Regression measures

  • AIC (Akaike Information Criterion) measures the relative quality of models; lower is better. Goodness of fit, adjusted for the number of variables.
  • Residual deviance of predicted vs actual; lower is better.
  • AUC (area under the ROC curve). The ROC curve plots the false pos vs true pos rates over various probability thresholds.
  • There is no standard R² statistic for logistic regression.
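AUC can be computed without plotting the ROC curve: it equals the fraction of (positive, negative) pairs that the model ranks in the right order, counting ties as half. A minimal Python sketch on toy scores:

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities and outcomes:
print(auc([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]))  # 1.0: every positive outranks every negative
print(auc([0.9, 0.3, 0.4, 0.2], [1, 1, 0, 0]))  # 0.75: one of four pairs is mis-ranked
```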


Model | AIC   | Resid Dev | True Pos Rate | Accuracy | AUC
------|-------|-----------|---------------|----------|-----
1     | 804.3 | 784.3     | .941          | .947     | .981
2     | 802.8 | 790.8     | .954          | .943     | .984


Most likely to survive

Based on a combination of trees and regression:

  • Women over men
  • Younger ages over older
  • First class over third class
  • Smaller families over larger families
  • Higher fares over lower fares, although this is weak.

Many online references to passengers, e.g., encyclopedia-titanica.org. The last survivor died in 2009 at age 97; she was two months old as a passenger.


This was a classification problem

Classification is the process of classifying a record into one of a finite number of classes. Now we could take a new record and predict its class and its probability.

Other applications of classification techniques:

  • College admission approval & potential graduation
  • Loan approval & potential default
  • Medical diagnosis of particular disease
  • Spam detection in e-mails.

Be careful of disparate impact on a protected class.


More free fun datasets to amuse you

  • kaggle.com/datasets (Kaggle is a subsidiary of Google)
  • archive.ics.uci.edu/ml/datasets.php from UC Irvine
  • seanlahman.com/baseball-archive/statistics has baseball stats since 1871
  • vincentarelbundock.io/Rdatasets/articles/data.html contains datasets originally part of R packages
  • github.com/awesomedata/awesome-public-datasets/blob/master/README.rst
  • Happy data sciencing!


Thanks!

Any questions?

You can find me at

fcas80@yahoo.com

Jerry Tuttle
