1 of 48

BioStatsII: Lecture 16, Trees

Amanda Hart

3/31/20

Acknowledgements: Steve Cadrin, Gavin Fay

2 of 48

Lecture Outline

  • Chapter 9 Univariate Tree models
    • 9.1 Introduction
    • 9.2 Pruning the tree
    • 9.3 Classification trees
    • 9.4 A detailed example: Ditch data
  • Case Studies
    • 21.4 Portuguese sole survey
    • 24 A tree for the birds
  • Brief introduction to bagged trees, boosted trees and random forests (Hastie 2003)


3 of 48

Evaluate Management Performance: Tree Analysis (An Example)

4 of 48

Explain variability in the dataset

[Diagram: the data are split into Group A and Group B; Group B is further split into Groups B.1 and B.2]

5 of 48

Explain variability in the dataset

[Diagram: the data are split into Group A and Group B; Group B is further split into Groups B.1 and B.2]

Explore relationships between the response variable (the data) and multiple explanatory variables (which determine the grouping)

Classification trees

  • Nominal response variables

Regression trees

  • Continuous response variables

6 of 48

9.1 Tree Introduction: Bee Example

  • Example: predicting bee counts

  • Total deviance in bee counts = 28:

Site   Bees   Plant Spp   Time   Deviation   devSq
A      0      1           AM     -3          9
B      1      1           PM     -2          4
C      2      2           AM     -1          1
D      3      2           PM      0          0
E      4      3           AM      1          1
F      5      3           PM      2          4
G      6      3           PM      3          9
Mean   3                         Sum         28

7 of 48

9.1 Tree Introduction: Bee Example

  • Deviance can be partitioned between the AM and PM groups:

  • … and partitioned into between-group variation (deviations of group means from the grand mean) and within-group variation (deviations from group means):

D-between-groups = Σ n_j (group mean_j − grand mean)² = 3(2 − 3)² + 4(3.75 − 3)² = 5.25

D-within-groups = Σ Σ (y_ij − group mean_j)² = 8 + 14.75 = 22.75

D = D-between-groups (5.25) + D-within-groups (22.75) = 28

[Diagram: the data are split into an AM group and a PM group]
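This partition is easy to verify by hand. As a check, here is the arithmetic as a short Python sketch (the lecture itself uses R's rpart; this is just the sums of squares on the bee table):

```python
# Bee counts from the example table, grouped by time of day.
groups = {"AM": [0, 2, 4], "PM": [1, 3, 5, 6]}

all_y = [y for g in groups.values() for y in g]
grand_mean = sum(all_y) / len(all_y)                       # 3.0

# Total deviance: squared deviations from the grand mean.
D_total = sum((y - grand_mean) ** 2 for y in all_y)        # 28.0

# Between-group deviance: group size times squared deviation
# of the group mean from the grand mean.
D_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                for g in groups.values())                  # 5.25

# Within-group deviance: squared deviations from each group's own mean.
D_within = sum((y - sum(g) / len(g)) ** 2
               for g in groups.values() for y in g)        # 22.75

print(D_total, D_between, D_within)
```

The three numbers reproduce the slide: 28 = 5.25 + 22.75.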

8 of 48

9.1 Tree Introduction: Bee Example

  • Time effect

Time = morning?                     (root: mean = 3, n = 7)
  True:  mean = 2,    n = 3   (AM)
  False: mean = 3.75, n = 4   (PM)

Site   Bees   Plant Spp   Time
A      0      1           AM
B      1      1           PM
C      2      2           AM
D      3      2           PM
E      4      3           AM
F      5      3           PM
G      6      3           PM

9 of 48

9.1 Tree Introduction: Bee Example

  • Partition deviance by plant species
    • Continuous explanatory variables need to be split into ordered groups (e.g., <=1 spp vs. >1 spp, or <=2 spp vs. >2 spp).

D = within group AB + within group C-G + AB group effect + C-G group effect

D = 0.5 + 10 + 12.5 + 5 = 28

D = D-within-groups (10.5) + D-between-groups (17.5) = 28

Plants <= 1?                      (root: mean = 3, n = 7)
  True:  mean = 0.5, n = 2   (sites A, B)
  False: mean = 4,   n = 5   (sites C-G)

Site   Bees   Plant Spp   Time
A      0      1           AM
B      1      1           PM
C      2      2           AM
D      3      2           PM
E      4      3           AM
F      5      3           PM
G      6      3           PM
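The four-term decomposition above can be checked the same way (a Python sketch of the arithmetic, not rpart):

```python
# Bee counts split at <=1 plant species: sites A,B vs. sites C-G.
groups = {"AB": [0, 1], "CG": [2, 3, 4, 5, 6]}
grand_mean = 3.0  # mean of all seven counts

terms = {}
for name, g in groups.items():
    m = sum(g) / len(g)
    # within-group deviance for this group
    terms["within " + name] = sum((y - m) ** 2 for y in g)
    # group effect: n times squared deviation of group mean from grand mean
    terms[name + " effect"] = len(g) * (m - grand_mean) ** 2

print(terms, sum(terms.values()))  # the four terms sum to 28
```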

10 of 48

9.1 Tree Introduction: Bee Example

  • Patterns of variation are summarized by the deviance explained by each candidate split of the data into branches.

  • Which split explained the most deviance in the data? The plant-species split (between-group deviance 17.5) explains more than the time split (5.25), so it is chosen first.


11 of 48

9.1 Tree Introduction: Bee Example

  • A tree diagram represents the effect of the explanatory variables on the response variable as dichotomous groups, split at threshold values of an explanatory variable ('branches') and ending in terminal groups ('leaves').

Plants < 2.5?                         (root: mean = 3, n = 7)
  True:  mean = 1.5, n = 4
    Plants < 1.5?
      True:  mean = 0.5, n = 2   (sites A, B)
      False: mean = 2.5, n = 2   (sites C, D)
  False: mean = 5, n = 3
    Time = AM?
      True:  mean = 4.0, n = 1   (site E)
      False: mean = 5.5, n = 2   (sites F, G)
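The first split in this tree can be reproduced by the greedy search that rpart performs: evaluate every candidate split and keep the one leaving the smallest within-group deviance. A minimal Python sketch on the bee data (an illustration, not rpart's implementation):

```python
# Bee data: (bees, plants, time) for sites A-G.
data = [(0, 1, "AM"), (1, 1, "PM"), (2, 2, "AM"), (3, 2, "PM"),
        (4, 3, "AM"), (5, 3, "PM"), (6, 3, "PM")]

def within_dev(groups):
    """Sum of squared deviations from each group's own mean."""
    total = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            total += sum((y - m) ** 2 for y in g)
    return total

candidates = {}
# Thresholds on Plants: midpoints between the distinct values 1, 2, 3.
for t in (1.5, 2.5):
    left = [b for b, p, _ in data if p < t]
    right = [b for b, p, _ in data if p >= t]
    candidates["Plants<%s" % t] = within_dev([left, right])
# The nominal Time variable offers one split: AM vs. PM.
am = [b for b, _, tm in data if tm == "AM"]
pm = [b for b, _, tm in data if tm == "PM"]
candidates["Time=AM"] = within_dev([am, pm])

best = min(candidates, key=candidates.get)
print(candidates)
print("best split:", best)  # Plants<2.5, as in the tree diagram
```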

12 of 48

9.1 Tree Introduction: Bee Example


Advanced Stats

Use the rpart package to produce trees

Bees <- read.table("Bees.txt", header = TRUE)
library(rpart)
bee_tree <- rpart(Bees ~ Plants + as.factor(Time),
                  data = Bees,
                  control = rpart.control(minsplit = 3, cp = 0.001))
bee_tree
plot(bee_tree)

13 of 48

9.1 Tree Introduction: Bee Example


Advanced Stats

Use the rpart package to produce trees

bee_tree
plot(bee_tree)

n = 7
node), split, n, deviance, yval
      * denotes terminal node

1) root 7 28.0 3.0
  2) Plants< 2.5 4 5.0 1.5
    4) Plants< 1.5 2 0.5 0.5 *
    5) Plants>=1.5 2 0.5 2.5 *
  3) Plants>=2.5 3 2.0 5.0
    6) as.factor(Time)=1 1 0.0 4.0 *
    7) as.factor(Time)=2 2 0.5 5.5 *

(Each row gives the branch, n, deviance and mean; * marks a terminal 'leaf'.)

14 of 48

The bee_tree object contains more information than you think!

Using bee_tree$ notation gives lots of other options:

  • bee_tree$frame — which explanatory variables were used to make the splits
  • bee_tree$where — the leaf each data point lands in
  • bee_tree$terms
  • bee_tree$cptable — information used in cross-validation (this version is easy to manipulate)
  • bee_tree$method
  • bee_tree$parms
  • bee_tree$control — shows how rpart.control() was used
  • bee_tree$functions
  • bee_tree$numresp
  • bee_tree$splits
  • bee_tree$variable.importance — how important each variable was in determining the tree splits
  • bee_tree$y
  • bee_tree$ordered

15 of 48

9.1 Tree Introduction: Bee Example


Advanced Stats

Use the rpart package to produce trees

Bees <- read.table("Bees.txt", header = TRUE)
library(rpart)
bee_tree <- rpart(Bees ~ Plants + as.factor(Time),
                  data = Bees,
                  control = rpart.control(minsplit = 3, cp = 0.001))
bee_tree
plot(bee_tree)

16 of 48

Tree analysis: 3 Steps

  1. Produce a complex (overfit) tree.

  2. Cross-validation: pick the optimal tree complexity.

  3. Produce the tree with the optimal complexity.

17 of 48

9.2 Pruning the Tree

  • As with other models, trees can become too large to interpret, and model selection is needed to explain the data with the fewest branches.
    • Small trees have few branches, with large groups that may not be homogeneous.
    • Large trees have small groups that may be difficult to interpret.
    • Tree size is measured by the number of splits (= number of leaves − 1).
  • Controlled by the minimum split size (minsplit) and the complexity parameter (cp)


18 of 48

9.2 Pruning the Tree

  • Model selection criteria
    • D: deviance within leaves
    • For continuous response variables, D = RSS.
    • cp: complexity parameter (0.001 is a common starting value; cp = 0 gives no penalty and the largest possible tree, cp = 1 gives no tree).
  • As with AIC, smaller is more parsimonious:
    • Small trees have relatively large within-group deviance but a relatively small penalty for size.
    • Large trees have relatively small within-group deviance but a relatively large penalty for size.
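rpart implements this trade-off as cost-complexity pruning: it minimizes R_cp(T) = R(T) + cp × |T| × R(root), where R is the within-leaf deviance and |T| the number of splits. With the deviances from the bee example (taken from the rpart output slide), the effect of cp can be sketched in Python:

```python
# (within-leaf deviance R(T), number of splits) for the nested bee trees,
# from the rpart output: root 28.0; 2-leaf tree 5.0 + 2.0; 4-leaf tree 1.5.
trees = {"root only": (28.0, 0), "2 leaves": (7.0, 1), "4 leaves": (1.5, 3)}
R_root = 28.0

def r_cp(R, n_splits, cp):
    # rpart's penalized deviance: R(T) + cp * (number of splits) * R(root)
    return R + cp * n_splits * R_root

for cp in (0.001, 0.5, 1.0):
    best = min(trees, key=lambda k: r_cp(*trees[k], cp))
    print("cp =", cp, "-> best tree:", best)
```

A tiny cp keeps the full tree, a large cp prunes back to the root, and intermediate values select the two-leaf tree.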


19 of 48

Parrotfish Example: Step 1 Initial Tree

  • Parrotfish density survey
  • rpart:

Recursive Partitioning and Regression Trees


parrot_tree <- rpart(Parrot ~ CoralTotal + as.factor(Month) +
                       as.factor(Station) + as.factor(Method),
                     data = Bahama,
                     control = rpart.control(cp = 0.02))
parrot_tree

20 of 48

Parrotfish Example: Step 1 Initial Tree


plot(parrot_tree)

21 of 48

Parrotfish Example: Step 1 Initial Tree


parrot_tree

n = 402
node), split, n, deviance, yval
      * denotes terminal node

1) root 402 50188.3500 10.781120
  2) as.factor(Method)=2 244 7044.2100 6.449180
    4) CoralTotal< 4.955 87 1401.1290 3.526437 *
    5) CoralTotal>=4.955 157 4488.0570 8.068790 *
  3) as.factor(Method)=1 158 31494.2000 17.470950
    6) as.factor(Station)=3,4 25 781.7439 3.157200 *
    7) as.factor(Station)=1,2,5,6,7,8,9,10 133 24627.5800 20.161500
      14) as.factor(Month)=5,7,8,10 94 11587.5500 17.093940 *
      15) as.factor(Month)=11 39 10023.5200 27.555130 *

(Each row gives the branch, n, deviance and mean; * marks a terminal 'leaf'.)

22 of 48

Parrotfish Example: Step 2 Pruning the Tree

22

1) Overfit the tree using cp = 0.001:

parrot_tree <- rpart(formula = Parrot ~ CoralTotal + as.factor(Month) +
                       as.factor(Station) + as.factor(Method),
                     data = Bahama,
                     control = rpart.control(cp = 0.001))

2) Print the cross-validation table:

printcp(parrot_tree)

23 of 48

Parrotfish Example: Step 2 Pruning the Tree

  • Cross-validation is preferred for model selection and determination of cp.
  • In 10-fold cross-validation, one tenth of the data is held out in turn and classified by a tree fit to the remaining data, providing 10 out-of-sample estimates of within-leaf deviance.

plotcp(parrot_tree)

[plotcp output: cross-validated relative error vs. size of tree (number of branches), with a dashed line marking the upper confidence limit of the optimal tree]
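The 10-fold procedure itself is simple to sketch: partition the data into ten folds and, for each fold, predict the held-out observations from a model fit to the other nine. The Python below uses a trivial "predict the training mean" model as a stand-in for the tree, on simulated data (rpart does all of this internally via its xval setting):

```python
import random

random.seed(1)
y = [random.gauss(10, 3) for _ in range(50)]   # simulated response data

k = 10
idx = list(range(len(y)))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]          # 10 disjoint folds

cv_errors = []
for held_out in folds:
    train = [y[i] for i in idx if i not in held_out]
    pred = sum(train) / len(train)             # stand-in "model"
    # out-of-sample squared error (deviance) for this fold
    cv_errors.append(sum((y[i] - pred) ** 2 for i in held_out))

print(len(cv_errors), "fold errors; total =", round(sum(cv_errors), 1))
```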

24 of 48

Parrotfish Example: Step 3 Produce Optimal Tree


3) Refit the tree using the optimal cp chosen by cross-validation:

parrot_tree <- rpart(formula = Parrot ~ CoralTotal + as.factor(Month) +
                       as.factor(Station) + as.factor(Method),
                     data = Bahama,
                     control = rpart.control(cp = 0.0230138))

25 of 48

9.3 Classification Trees

  • If the response variable is nominal, the deviance is not calculated as the sum of squared deviations.
    • For a dichotomous response variable (0, 1) in leaf j:

      D_j = −2 [ n_j1 ln(p_j1) + n_j0 ln(p_j0) ]

    • For a response variable with m > 2 classes:

      D_j = −2 Σ_k n_jk ln(p_jk),   k = 1, …, m
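As a worked example, the multi-class formula applied to one leaf of the ditch data that follows (class counts 2, 2, 2, 9, 0, with 0·ln 0 taken as 0) can be sketched in Python:

```python
import math

# Class counts (ditches 1-5) in one leaf of the ditch tree.
counts = [2, 2, 2, 9, 0]
n = sum(counts)

# Multinomial leaf deviance: D_j = -2 * sum_k n_jk * ln(p_jk),
# where p_jk = n_jk / n and 0 * ln(0) is taken as 0.
D = -2 * sum(c * math.log(c / n) for c in counts if c > 0)
print(round(D, 2))  # 33.37
```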

26 of 48

9.4. Ditch Data

26

Response variable is ditch number (nominal response):

rpart(formula = as.factor(Site) ~ Depth + Conductivity + Total_Calcium,
      data = Ditch, method = "class",
      control = rpart.control(cp = 0.001))

Variables used in tree construction:
[1] Conductivity  Depth  Total_Calcium

Root node error: 38/48 = 0.79167
n = 48

   CP       nsplit  rel error  xerror   xstd
1  0.23684  0       1.00000    1.18421  0.044133
2  0.18421  1       0.76316    1.10526  0.060297
3  0.13158  2       0.57895    1.02632  0.071162
4  0.00100  3       0.44737    0.89474  0.082870
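In printcp output the CP and rel error columns are linked: each split's CP value is the drop in relative (training) error that the split achieves, so the rel error column can be rebuilt from the CP values (a Python check on the table above):

```python
# CP values of the three splits from the ditch printcp table.
cp_values = [0.23684, 0.18421, 0.13158]

# Relative error starts at 1 at the root and drops by CP at each split.
rel_error = [1.0]
for cp in cp_values:
    rel_error.append(rel_error[-1] - cp)

print([round(r, 5) for r in rel_error])  # [1.0, 0.76316, 0.57895, 0.44737]
```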

27 of 48

9.4. Ditch Data


Advanced Stats

[plotcp output: cross-validated relative error vs. size of tree]

28 of 48

9.4. Ditch Data


Different leaf statistics: each leaf predicts the ditch with the most observations.

For example, one leaf contains 2 observations in ditch 1, 2 in ditch 2, 2 in ditch 3, 9 in ditch 4 and 0 in ditch 5, so it predicts ditch 4.

29 of 48

9.4. Ditch Data


n = 48
node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 48 38 1 (0.21 0.19 0.21 0.21 0.19)
  2) Total_Calcium>=118 25 16 5 (0.32 0.28 0 0.04 0.36)
    4) Conductivity< 1505 11 6 1 (0.45 0.45 0 0.091 0) *
    5) Conductivity>=1505 14 5 5 (0.21 0.14 0 0 0.64) *
  3) Total_Calcium< 118 23 13 3 (0.087 0.087 0.43 0.39 0)
    6) Depth>=0.505 8 0 3 (0 0 1 0 0) *
    7) Depth< 0.505 15 6 4 (0.13 0.13 0.13 0.6 0) *

(Each row gives the branch, n, loss, predicted class and class probabilities p1–p5; * marks a terminal 'leaf'.)

30 of 48

9.4. Ditch Data


Advanced Stats

dotchart(Ditch$Total_Calcium, pch = Ditch$Site,
         xlab = "Range", ylab = "Sample", main = "Total Calcium")

[Dotchart of Total_Calcium by sample, with plotting symbol = ditch number (1–5)]

31 of 48

21.4 Sole Survey


Solea  salinity  mud    depth
0      30        71.18  3.00
0      29        87.63  2.60
1      30        71.29  2.60
0      29        55.03  2.10
0      30        42.04  3.20
0      32        50.72  3.50
1      29        68.80  1.60
1      28        36.75  1.70
0      29        72.24  1.80
1      12        36.42  4.50
0      17        16.65  6.00
1      3         23.73  4.00
1      2         20.00  4.00
0      28        72.94  2.70
1      29        91.40  2.20

32 of 48

21.4 Sole Survey


Advanced Stats

solea_tree <- rpart(as.factor(Solea_solea) ~ depth + temperature +
                      salinity + transparency + gravel + mud +
                      factor(month) + Area,
                    data = Solea, method = "class",
                    minsplit = 5, cp = 0.001)

33 of 48

21.4 Sole Survey

The same absent/present response was modeled with logistic regression in MAR 596.

[Plot of Solea solea absence/presence]

34 of 48

21.4 Sole Survey


Advanced Stats

n = 65
node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 65 26 0 (0.60000000 0.40000000)
  2) salinity>=15.5 49 12 0 (0.75510204 0.24489796)
    4) gravel>=1.42 43 8 0 (0.81395349 0.18604651)
      8) gravel< 17.275 33 3 0 (0.90909091 0.09090909) *
      9) gravel>=17.275 10 5 0 (0.50000000 0.50000000)
        18) mud>=50.55 4 0 0 (1.00000000 0.00000000) *
        19) mud< 50.55 6 1 1 (0.16666667 0.83333333) *
    5) gravel< 1.42 6 2 1 (0.33333333 0.66666667)
      10) salinity>=29.5 2 0 0 (1.00000000 0.00000000) *
      11) salinity< 29.5 4 0 1 (0.00000000 1.00000000) *
  3) salinity< 15.5 16 2 1 (0.12500000 0.87500000)
    6) gravel< 1.34 3 1 0 (0.66666667 0.33333333) *
    7) gravel>=1.34 13 0 1 (0.00000000 1.00000000) *

35 of 48

21.4 Sole Survey

  • The cross-validated relative error shows the worse performance of larger trees.


36 of 48

24 A Tree for the Birds

  • Classification of radar echoes to distinguish bird species from waves, ships and air clutter for wind-farm siting.


Advanced Stats

37 of 48

24 A Tree for the Birds

  • Objective: determine which variables help distinguish birds from other signals.


Advanced Stats

tree1 <- rpart(g9 ~ EPT + TKQ + TKT + AVV + VEL + MXA + AREA +
                 MAXREF + TRKDIS + MAXSEG + ORIENT + ELLRATIO +
                 ELONG + COMPACT + CHY + MAXREF1 + MINREF + SDREF,
               data = TreeData.1, method = "class",
               control = rpart.control(cp = 0.001))

38 of 48

24 A Tree for the Birds

  • Two branches appear to be optimal.


Advanced Stats

39 of 48

24 A Tree for the Birds


Advanced Stats

tree1a <- prune(tree1, cp = 0.0323)
printcp(tree1a, digits = 3)
tree1a

n = 629
node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 629 217 7 (0.083 0.059 0.054 0.017 0.035 0.043 0.66 0.029 0.025)
  2) EPT< 1.5 93 41 1 (0.56 0.4 0.043 0 0 0 0 0 0) *
  3) EPT>=1.5 536 124 7 (0 0 0.056 0.021 0.041 0.05 0.77 0.034 0.03) *

40 of 48

24 A Tree for the Birds

  • Echoes per track (EPT) was the most informative variable for separating birds from other signals.


Advanced Stats

Groups:

Air clutter, waves, ships, birdspp1… birdspp6

41 of 48

24 A Tree for the Birds

  • A second attempt grouped all bird species into a single category, to be separated from ships, waves and air clutter.


Advanced Stats

tree3 <- rpart(g4 ~ EPT + TKQ + TKT + AVV + VEL + MXA + AREA +
                 MAXREF + TRKDIS + MAXSEG + ORIENT + ELLRATIO +
                 ELONG + COMPACT + CHY + MAXREF1 + MINREF + SDREF,
               data = TreeData.1, method = "class",
               control = rpart.control(cp = 0.001))

42 of 48

24 A Tree for the Birds

  • Echoes per track, ellipse ratio, area and track quality were important predictors


Advanced Stats

Groups:

Air clutter, waves, ships, birds

43 of 48

Regression Tree Models

  • Regression trees predict the value of the response variable (Y; e.g., bees) as a step function of each explanatory variable (X; e.g., #plants, AM/PM) resulting in a multi-dimensional step function.


Advanced Stats


44 of 48

Classification Tree

  • Simple example with two predictors (7% error).


Advanced Stats

45 of 48

Bagged Tree

  • Bagged trees (bootstrap aggregation): many trees are fit to bootstrap resamples of the data, and classification is based on a majority vote (3% error).


Advanced Stats

46 of 48

Random Forests & Boosted Trees

  • Random forests are a refinement of bagged trees: at each split only a random subset of the predictors is considered, which reduces the correlation among trees, and 'out-of-bag' (extrinsic) observations provide classification error estimates.
  • Boosted trees average many trees that are grown on re-weighted versions of the training data.
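Bagging can be sketched in a few lines: draw bootstrap resamples, fit a weak classifier to each, and classify new points by majority vote. In this Python illustration a one-threshold "stump" stands in for a full tree and the data are simulated (a toy sketch, not the implementation behind the error rates quoted above):

```python
import random

random.seed(42)
# Toy 1-D data: class 0 clustered near 0, class 1 near 5.
X = [random.gauss(0, 1.5) for _ in range(40)]
X += [random.gauss(5, 1.5) for _ in range(40)]
y = [0] * 40 + [1] * 40

def fit_stump(xs, ys):
    """Pick the threshold minimizing training misclassifications
    (predict class 1 when x >= threshold)."""
    best_t, best_err = None, len(xs) + 1
    for t in sorted(set(xs)):
        err = sum((x >= t) != (label == 1) for x, label in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Bag 25 stumps, each fit to a bootstrap resample of the data.
stumps = []
for _ in range(25):
    boot = [random.randrange(len(X)) for _ in range(len(X))]
    stumps.append(fit_stump([X[i] for i in boot], [y[i] for i in boot]))

def predict(x):
    votes = sum(x >= t for t in stumps)       # majority vote over stumps
    return int(votes > len(stumps) / 2)

print(predict(-1.0), predict(6.0))
```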


Advanced Stats

47 of 48

Summary – Tree Models

  • Tree models identify dichotomous groups in a response variable.
    • They choose the dichotomous splits with the greatest between-group differences.
    • Leaf groups are assumed to be homogeneous.
  • The response variable can be continuous (regression tree) or discrete (classification tree).
  • Tree models are commonly used as an exploratory tool for predictive modeling.
  • Determining the optimal tree can be tricky.
  • Bagging, boosting and random forests resample or re-weight the data (a form of pseudo-replication) to improve prediction or classification.


48 of 48

Other Resources: