1 of 28

Data Science Aboard the Titanic

Jerry Tuttle, FCAS
Rocky Mountain College of Art & Design


Data Science Aboard The Titanic

  • The British luxury ship Titanic sank in 1912.
  • Of an estimated 2,224 passengers and crew, more than 1,500 died.
  • Counts differ by source: some names on the passenger list were canceled, some passengers traveled under aliases and were counted twice, and some were rescued but died soon after.
  • 1997 movie with Leonardo DiCaprio and Kate Winslet; there was also a 1953 film.
  • The passenger list is a popular dataset for various data analyses.


Titanic – the 1997 movie

  • Leonardo DiCaprio played Jack Dawson, age 20. He has a third-class ticket and is traveling with a male friend.
  • Kate Winslet played Rose DeWitt Bukater, age 17. She has a first-class ticket and is traveling with her mother.
  • How did Dawson get his ship ticket?
  • Why is Bukater not on the passenger list?


What is data science?

  • The intersection of math/stats, computer science, and domain knowledge, used to extract meaningful insights from data and translate them into tangible business value.


The data science process

  • Similarities to Polya’s problem solving techniques in How to Solve It.


What is our goal?

  • We have the dataset of Titanic passengers, including various features about them. (I am using the Kaggle version.)
  • We want to develop a data science model that will predict whether any particular passenger will survive.
  • There are many possible techniques, but we will limit this discussion to Decision Trees and Regression.


Dataset: 9 independent variables, 1,309 passengers

Variable  | Definition                 | Key
----------|----------------------------|---------------------------
survival  | survival                   | 0 = No, 1 = Yes
pclass    | passenger ticket class     | 1 = 1st, 2 = 2nd, 3 = 3rd
sex       | sex                        |
age       | age in years               |
sibsp     | # siblings/spouses aboard  |
parch     | # parents/children aboard  |
ticket    | ticket #                   |
fare      | passenger fare             |
cabin     | cabin #                    | A (upper) through G
embarked  | port of embarkation        | cities C, Q, S

Data files train.csv, test.csv from kaggle.com/c/titanic/data. Omit ID and name. Other sources have different data.


Training data versus testing data


RMSE on test data = √[∑(Actual − Predicted)² / n]
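The RMSE formula above can be computed directly. A minimal Python sketch with made-up values (not Titanic predictions):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired actual/predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Toy values for illustration:
print(rmse([1, 0, 1, 1], [1, 0, 0, 1]))  # one miss in four -> sqrt(1/4) = 0.5
```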


Performance measures


  • False pos (Type I): Dr tells male he is pregnant. False neg (Type II): Dr tells obviously pregnant woman she is not pregnant.
  • Accuracy: (TN + TP)/n. True Pos Rate = Recall = TP/(TP + FN). Pos Predicted Value = Precision = TP/(TP + FP).
  • Area under curve: True Pos Rate vs False Pos Rate, over varying probability thresholds.
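These measures follow directly from raw confusion-matrix counts. A short Python sketch, using the decision tree 1 counts shown later in the deck:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, recall (true positive rate), and precision from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,       # fraction of all predictions that are correct
        "recall": tp / (tp + fn),        # of actual positives, fraction caught
        "precision": tp / (tp + fp),     # of predicted positives, fraction correct
    }

# Counts from the decision tree 1 confusion matrix later in the deck:
m = metrics(tp=140, fp=6, tn=260, fn=12)
print(round(m["accuracy"], 3), round(m["recall"], 3))  # 0.957 0.921
```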


What is a decision tree?

  • A decision tree predicts the value of a target variable by successively splitting the data into subsets, starting at the root node, until the final nodes are homogeneous in the target variable or further splitting no longer adds value.
  • The algorithm tests each variable for each possible split.
  • It splits one variable at each node, and then moves forward (down the tree, never backward).
  • Each split attempts to maximize homogeneity of the target variable within the subsets. Gini = 1 − ∑(pi)², where pi is the probability of each class.
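The Gini formula above in Python, as a minimal sketch:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([1, 1, 1, 1]))  # pure node -> 0.0
print(gini([1, 1, 0, 0]))  # 50/50 node -> 0.5
```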


Example of decision tree

  • Root node: on 100% of the data, 57% are not fraud (Fraud = False).
  • If RearEnd = False: on 71% of the data, 80% are not fraud and 20% are fraud. If RearEnd = True: on 29% of the data, 100% are fraud.
  • The tree chose RearEnd as a better variable to split the data than Whiplash.


Claim | RearEnd | Whiplash | Fraud
------|---------|----------|------
1     | True    | True     | True
2     | True    | True     | True
3     | False   | True     | True
4     | False   | True     | False
5     | False   | True     | False
6     | False   | False    | False
7     | False   | False    | False
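Under the Gini criterion, the split is chosen by comparing the weighted impurity of the child nodes. A Python sketch on the seven claims above, confirming that RearEnd is the better split (this reimplements the idea by hand; it is not any tree package's own code):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(feature, target):
    """Average child-node impurity after splitting on a binary feature."""
    n = len(target)
    total = 0.0
    for value in (True, False):
        subset = [t for f, t in zip(feature, target) if f == value]
        if subset:
            total += len(subset) / n * gini(subset)
    return total

# The seven claims from the table:
rear_end = [True, True, False, False, False, False, False]
whiplash = [True, True, True, True, True, False, False]
fraud    = [True, True, True, False, False, False, False]

print(round(weighted_gini(rear_end, fraud), 3))  # 0.229
print(round(weighted_gini(whiplash, fraud), 3))  # 0.343 -> RearEnd gives the purer children
```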


Can we use linear regression on Titanic?

  • Given n independent variables x1, …, xn, linear regression fits a model y = β0 + β1x1 + β2x2 + … + βnxn.
  • There are several important assumptions of linear regression, but the one we will emphasize is that the dependent variable y must be continuous; survival is 0/1, so linear regression is a poor fit here.


What is logistic regression?

  • Given n independent variables x1, …, xn, and a dependent variable y that takes values 0 and 1, logistic regression fits a model p = 1/[1 + e^(−(β0 + β1x1 + β2x2 + … + βnxn))], where p is the probability that y = 1.
  • Then p/(1 − p) is the odds that y = 1, and LN[p/(1 − p)] is called the log odds.
  • If p > .5, set y = 1; else y = 0.
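A minimal Python sketch of scoring one observation with such a model. The coefficients here are made up for illustration, not the fitted Titanic model:

```python
import math

def predict(beta0, betas, x, threshold=0.5):
    """Logistic model: p = 1/(1 + exp(-(b0 + b1*x1 + ...))); classify 1 if p > threshold."""
    log_odds = beta0 + sum(b * xi for b, xi in zip(betas, x))
    p = 1 / (1 + math.exp(-log_odds))
    return p, int(p > threshold)

# Hypothetical coefficients and one observation:
p, yhat = predict(beta0=0.5, betas=[-1.0, 2.0], x=[3, 1])
print(round(p, 3), yhat)  # 0.378 0
```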


Data exploration & preparation

  • Examine variable types (confirm factors, check levels), create summaries & plots, examine distributions.

  • Look for outliers, missing data, duplicate records.
  • What to do about missing data (NA or as blank)?


Variable | NA or blank | Impute, delete obs, or delete variable?
---------|-------------|----------------------------------------
Age      | 263 NA      |
Fare     | 1 blank     |
Cabin    | 1014 blank  |
Embarked | 2 blank     |
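One common choice for the 263 missing ages is median imputation. A minimal pure-Python sketch on toy values (real work would use a data-frame library):

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

# Toy ages with missing entries (the real data has 263 missing ages):
ages = [22, None, 38, 26, None, 35]
print(impute_median(ages))  # missing ages filled with the median, 30.5
```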


Numerical variables


Multivariate data plots


Disadvantages of decision trees

  • They are greedy: they make the best decision at the current node but can't look ahead for a better one. Example: make change for 30c with the fewest coins from 25c, a hypothetical 15c, and 1c coins. Greedy starts with the 25c coin because it leaves the smallest amount to pay out, but then needs five 1c coins; the less greedy choice of two 15c coins uses fewer coins.
  • Sensitive to the particular observations in the training data, and hence to the particular variables chosen to split each node.
  • Overfitting: the model fits the training data too well, at the expense of test data.
  • Solution: multiple trees (random forest).
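The coin example can be run directly. A Python sketch, with the 15c coin purely hypothetical as noted above:

```python
def greedy_change(amount, coins):
    """Greedy: repeatedly take the largest coin that fits (as a tree greedily picks splits)."""
    used = []
    for c in sorted(coins, reverse=True):
        while amount >= c:
            amount -= c
            used.append(c)
    return used

# With 25c, a hypothetical 15c, and 1c coins, greedy needs six coins for 30c...
print(greedy_change(30, [25, 15, 1]))  # [25, 1, 1, 1, 1, 1]
# ...but two 15c coins would do. Greedy trees can likewise miss a globally better split.
```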


Decision tree 1


                         | Predicted Did Not Survive | Predicted Survived
Actual = Did Not Survive | 260                       | 6
Actual = Survived        | 12                        | 140

Accuracy = (260 + 140) / 418 = 95.7%

True Pos Rate = 140 / (140 + 12) = 92.1%


Decision tree 2


Accuracy = (250 + 123) / 418 = 89.2%

True Pos Rate = 123 / (123 + 29) = 80.9%


Decision tree 3


Accuracy = (260 + 146) / 418 = 97.1%

True Pos Rate = 146 / (146 + 6) = 96.1%


Regression 1

  • LN[p/(1 − p)] = β0 + β1 Pclass + β2 Sex + β3 Age + β4 Sibsp + β5 Parch + β6 Fare + β7 Embarked


Regression 2

  • LN[p/(1 − p)] = β0 + β1 Pclass + β2 Sex + β3 Age + β4 Sibsp


Interpreting the coefficients

  • For a 1-unit increase in xi, the expected change in the LN odds of success (survival) is βi, holding all other independent variables fixed.
  • Pclass (passenger ticket class) is a categorical variable, with 3 = 3rd class < 2 = 2nd class < 1 = 1st class. The regression converted Pclass into dummy variables, with 3rd class as the base level.
  • As Pclass improves, the LN odds of survival increase.
  • Being male (Sexmale = 1) decreases the LN odds of survival.
  • An increase in Age decreases the LN odds of survival.
  • An increase in # siblings/spouses decreases the LN odds of survival.
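Since a coefficient βi is a change in LN odds, exp(βi) is the multiplicative change in the odds themselves. A sketch with a hypothetical coefficient (the fitted values are not shown on the slide):

```python
import math

# A coefficient beta on Sexmale changes the log odds by beta per unit;
# exp(beta) is the multiplicative change in the odds themselves.
beta_sexmale = -2.5  # hypothetical value, for illustration only
odds_ratio = math.exp(beta_sexmale)
print(round(odds_ratio, 3))  # 0.082: male odds of survival ~8% of female odds, other vars fixed
```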


Regression measures

  • AIC (Akaike Information Criterion) measures the relative quality of models; lower is better. Goodness of fit, adjusted for the number of variables.
  • Residual deviance of predicted vs actual; lower is better.
  • AUC (area under the ROC curve). The ROC curve plots the false pos vs true pos rates over various probability thresholds.
  • There is no standard R² statistic for logistic regression.
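AUC can be computed without plotting the ROC curve: it equals the fraction of (positive, negative) pairs that the model ranks in the right order, counting ties as half. A minimal Python sketch on toy scores:

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities and outcomes:
print(auc([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]))  # 1.0: every positive outranks every negative
print(auc([0.9, 0.3, 0.4, 0.2], [1, 1, 0, 0]))  # 0.75: one of four pairs is mis-ranked
```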


Model | AIC   | Resid Dev | True Pos Rate | Accuracy | AUC
------|-------|-----------|---------------|----------|-----
1     | 804.3 | 784.3     | .941          | .947     | .981
2     | 802.8 | 790.8     | .954          | .943     | .984


Most likely to survive

Based on a combination of trees and regression:

  • Women over men
  • Younger ages over older
  • First class over third class
  • Smaller families over larger families
  • Higher fares over lower fares, although this is weak.

Many online references to passengers, e.g., encyclopedia-titanica.org. The last survivor died in 2009 at age 97; she was two months old as a passenger.


This was a classification problem

Classification is the process of classifying a record into one of a finite number of classes. Now we could take a new record and predict its class and its probability.

Other applications of classification techniques:

  • College admission approval & potential graduation
  • Loan approval & potential default
  • Medical diagnosis of particular disease
  • Spam detection in e-mails.

Be careful of disparate impact on a protected class.


More free fun datasets to amuse you

  • kaggle.com/datasets (Kaggle is a subsidiary of Google)
  • archive.ics.uci.edu/ml/datasets.php from UC Irvine
  • seanlahman.com/baseball-archive/statistics has baseball stats since 1871
  • vincentarelbundock.io/Rdatasets/articles/data.html contains datasets originally part of R packages
  • github.com/awesomedata/awesome-public-datasets/blob/master/README.rst
  • Happy data sciencing!


Thanks!

Any questions?

You can find me at

fcas80@yahoo.com

Jerry Tuttle
