1 of 48

BioStatsII: Lecture 16, Trees

Amanda Hart

3/31/20

Acknowledgements: Steve Cadrin, Gavin Fay

2 of 48

Lecture Outline

  • Chapter 9 Univariate Tree models
    • 9.1 Introduction
    • 9.2 Pruning the tree
    • 9.3 Classification trees
    • 9.4 A detailed example: Ditch data
  • Case Studies
    • 21.4 Portuguese sole survey
    • 24 A tree for the birds
  • Brief introduction to bagged trees, boosted trees and random forests (Hastie 2003)


3 of 48

Evaluate Management Performance: Tree Analysis (An Example)

4 of 48

Explain variability in the dataset

[Diagram: the data are split into Group A and Group B; Group B is further split into Groups B.1 and B.2]

5 of 48

Explain variability in the dataset

[Diagram: the data are split into Group A and Group B; Group B is further split into Groups B.1 and B.2]

Explore relationships between the response variable (the data) and multiple explanatory variables (which determine the grouping)

Classification trees

  • Nominal response variables

Regression trees

  • Continuous response variables

6 of 48

9.1 Tree Introduction: Bee Example

  • Example: predicting bee counts

  • Total deviance in bee counts = 28:

Site   Bees   Plant Spp   Time   Deviation   devSq
A      0      1           AM     -3          9
B      1      1           PM     -2          4
C      2      2           AM     -1          1
D      3      2           PM      0          0
E      4      3           AM      1          1
F      5      3           PM      2          4
G      6      3           PM      3          9
Mean   3                         Sum         28

7 of 48

9.1 Tree Introduction: Bee Example

  • Deviance can be partitioned between the AM and PM groups:

  • … and partitioned into between-group variation (deviations of group means from the grand mean) and within-group variation (deviations from group means):

D-between-groups = Σ n_j (group mean_j − grand mean)² = 3(2 − 3)² + 4(3.75 − 3)² = 5.25

D-within-groups = Σ Σ (y_ij − group mean_j)² = 8 + 14.75 = 22.75

D = D-between-groups (5.25) + D-within-groups (22.75) = 28

[Diagram: the data are split into an AM group and a PM group]
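This partition is easy to verify by hand. As a check, here is the arithmetic as a short Python sketch (the lecture itself uses R's rpart; this is just the sums of squares on the bee table):

```python
# Bee counts from the example table, grouped by time of day.
groups = {"AM": [0, 2, 4], "PM": [1, 3, 5, 6]}

all_y = [y for g in groups.values() for y in g]
grand_mean = sum(all_y) / len(all_y)                       # 3.0

# Total deviance: squared deviations from the grand mean.
D_total = sum((y - grand_mean) ** 2 for y in all_y)        # 28.0

# Between-group deviance: group size times squared deviation
# of the group mean from the grand mean.
D_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                for g in groups.values())                  # 5.25

# Within-group deviance: squared deviations from each group's own mean.
D_within = sum((y - sum(g) / len(g)) ** 2
               for g in groups.values() for y in g)        # 22.75

print(D_total, D_between, D_within)
```

The three numbers reproduce the slide: 28 = 5.25 + 22.75.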

8 of 48

9.1 Tree Introduction: Bee Example

  • Time effect

Time = morning?                     (root: mean = 3, n = 7)
  True:  mean = 2,    n = 3   (AM)
  False: mean = 3.75, n = 4   (PM)

Site   Bees   Plant Spp   Time
A      0      1           AM
B      1      1           PM
C      2      2           AM
D      3      2           PM
E      4      3           AM
F      5      3           PM
G      6      3           PM

9 of 48

9.1 Tree Introduction: Bee Example

  • Partition deviance by plant species
    • Continuous explanatory variables need to be split into ordered groups (e.g., <=1 spp vs. >1 spp, or <=2 spp vs. >2 spp).

D = within group AB + within group C-G + AB group effect + C-G group effect

D = 0.5 + 10 + 12.5 + 5 = 28

D = D-within-groups (10.5) + D-between-groups (17.5) = 28

Plants <= 1?                      (root: mean = 3, n = 7)
  True:  mean = 0.5, n = 2   (sites A, B)
  False: mean = 4,   n = 5   (sites C-G)

Site   Bees   Plant Spp   Time
A      0      1           AM
B      1      1           PM
C      2      2           AM
D      3      2           PM
E      4      3           AM
F      5      3           PM
G      6      3           PM
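The four-term decomposition above can be checked the same way (a Python sketch of the arithmetic, not rpart):

```python
# Bee counts split at <=1 plant species: sites A,B vs. sites C-G.
groups = {"AB": [0, 1], "CG": [2, 3, 4, 5, 6]}
grand_mean = 3.0  # mean of all seven counts

terms = {}
for name, g in groups.items():
    m = sum(g) / len(g)
    # within-group deviance for this group
    terms["within " + name] = sum((y - m) ** 2 for y in g)
    # group effect: n times squared deviation of group mean from grand mean
    terms[name + " effect"] = len(g) * (m - grand_mean) ** 2

print(terms, sum(terms.values()))  # the four terms sum to 28
```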

10 of 48

9.1 Tree Introduction: Bee Example

  • Patterns of variation are summarized by the deviance explained by each candidate split of the data into branches.

  • Which split explained the most deviance in the data? The plant-species split (between-group deviance 17.5) explains more than the time split (5.25), so it is chosen first.


11 of 48

9.1 Tree Introduction: Bee Example

  • A tree diagram represents the effect of the explanatory variables on the response variable as dichotomous groups, split at threshold values of an explanatory variable ('branches') and ending in terminal groups ('leaves').

Plants < 2.5?                         (root: mean = 3, n = 7)
  True:  mean = 1.5, n = 4
    Plants < 1.5?
      True:  mean = 0.5, n = 2   (sites A, B)
      False: mean = 2.5, n = 2   (sites C, D)
  False: mean = 5, n = 3
    Time = AM?
      True:  mean = 4.0, n = 1   (site E)
      False: mean = 5.5, n = 2   (sites F, G)
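The first split in this tree can be reproduced by the greedy search that rpart performs: evaluate every candidate split and keep the one leaving the smallest within-group deviance. A minimal Python sketch on the bee data (an illustration, not rpart's implementation):

```python
# Bee data: (bees, plants, time) for sites A-G.
data = [(0, 1, "AM"), (1, 1, "PM"), (2, 2, "AM"), (3, 2, "PM"),
        (4, 3, "AM"), (5, 3, "PM"), (6, 3, "PM")]

def within_dev(groups):
    """Sum of squared deviations from each group's own mean."""
    total = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            total += sum((y - m) ** 2 for y in g)
    return total

candidates = {}
# Thresholds on Plants: midpoints between the distinct values 1, 2, 3.
for t in (1.5, 2.5):
    left = [b for b, p, _ in data if p < t]
    right = [b for b, p, _ in data if p >= t]
    candidates["Plants<%s" % t] = within_dev([left, right])
# The nominal Time variable offers one split: AM vs. PM.
am = [b for b, _, tm in data if tm == "AM"]
pm = [b for b, _, tm in data if tm == "PM"]
candidates["Time=AM"] = within_dev([am, pm])

best = min(candidates, key=candidates.get)
print(candidates)
print("best split:", best)  # Plants<2.5, as in the tree diagram
```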

12 of 48

9.1 Tree Introduction: Bee Example


Advanced Stats

Use the rpart package to produce trees

Bees <- read.table("Bees.txt", header = TRUE)
library(rpart)
bee_tree <- rpart(Bees ~ Plants + as.factor(Time),
                  data = Bees,
                  control = rpart.control(minsplit = 3, cp = 0.001))
bee_tree
plot(bee_tree)

13 of 48

9.1 Tree Introduction: Bee Example


Advanced Stats

Use the rpart package to produce trees

bee_tree
plot(bee_tree)

n = 7
node), split, n, deviance, yval
      * denotes terminal node

1) root 7 28.0 3.0
  2) Plants< 2.5 4 5.0 1.5
    4) Plants< 1.5 2 0.5 0.5 *
    5) Plants>=1.5 2 0.5 2.5 *
  3) Plants>=2.5 3 2.0 5.0
    6) as.factor(Time)=1 1 0.0 4.0 *
    7) as.factor(Time)=2 2 0.5 5.5 *

(Each row gives the branch, n, deviance and mean; * marks a terminal 'leaf'.)

14 of 48

The bee_tree object contains more information than you think!

Using bee_tree$ notation gives lots of other options:

  • bee_tree$frame — which explanatory variables were used to make the splits
  • bee_tree$where — the leaf each data point lands in
  • bee_tree$terms
  • bee_tree$cptable — information used in cross-validation (this version is easy to manipulate)
  • bee_tree$method
  • bee_tree$parms
  • bee_tree$control — shows how rpart.control() was used
  • bee_tree$functions
  • bee_tree$numresp
  • bee_tree$splits
  • bee_tree$variable.importance — how important each variable was in determining the tree splits
  • bee_tree$y
  • bee_tree$ordered

15 of 48

9.1 Tree Introduction: Bee Example


Advanced Stats

Use the rpart package to produce trees

Bees <- read.table("Bees.txt", header = TRUE)
library(rpart)
bee_tree <- rpart(Bees ~ Plants + as.factor(Time),
                  data = Bees,
                  control = rpart.control(minsplit = 3, cp = 0.001))
bee_tree
plot(bee_tree)

16 of 48

Tree analysis: 3 Steps

  1. Produce a complex (overfit) tree.

  2. Cross-validation: pick the optimal tree complexity.

  3. Produce the tree with the optimal complexity.

17 of 48

9.2 Pruning the Tree

  • As with other models, trees can become too large to interpret, and model selection is needed to explain the data with the fewest branches.
    • Small trees have few branches, with large groups that may not be homogeneous.
    • Large trees have small groups that may be difficult to interpret.
    • Tree size is measured by the number of splits (= number of leaves − 1).
  • Controlled by the minimum split size (minsplit) and the complexity parameter (cp)


18 of 48

9.2 Pruning the Tree

  • Model selection criteria
    • D: deviance within leaves
    • For continuous response variables, D = RSS.
    • cp: complexity parameter (0.001 is a common starting value; cp = 0 gives no penalty and the largest possible tree, cp = 1 gives no tree).
  • As with AIC, smaller is more parsimonious:
    • Small trees have relatively large within-group deviance but a relatively small penalty for size.
    • Large trees have relatively small within-group deviance but a relatively large penalty for size.
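rpart implements this trade-off as cost-complexity pruning: it minimizes R_cp(T) = R(T) + cp × |T| × R(root), where R is the within-leaf deviance and |T| the number of splits. With the deviances from the bee example (taken from the rpart output slide), the effect of cp can be sketched in Python:

```python
# (within-leaf deviance R(T), number of splits) for the nested bee trees,
# from the rpart output: root 28.0; 2-leaf tree 5.0 + 2.0; 4-leaf tree 1.5.
trees = {"root only": (28.0, 0), "2 leaves": (7.0, 1), "4 leaves": (1.5, 3)}
R_root = 28.0

def r_cp(R, n_splits, cp):
    # rpart's penalized deviance: R(T) + cp * (number of splits) * R(root)
    return R + cp * n_splits * R_root

for cp in (0.001, 0.5, 1.0):
    best = min(trees, key=lambda k: r_cp(*trees[k], cp))
    print("cp =", cp, "-> best tree:", best)
```

A tiny cp keeps the full tree, a large cp prunes back to the root, and intermediate values select the two-leaf tree.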


19 of 48

Parrotfish Example: Step 1 Initial Tree

  • Parrotfish density survey
  • rpart:

Recursive Partitioning and Regression Trees


parrot_tree <- rpart(Parrot ~ CoralTotal + as.factor(Month) +
                       as.factor(Station) + as.factor(Method),
                     data = Bahama,
                     control = rpart.control(cp = 0.02))
parrot_tree

20 of 48

Parrotfish Example: Step 1 Initial Tree


plot(parrot_tree)

21 of 48

Parrotfish Example: Step 1 Initial Tree


parrot_tree

n = 402
node), split, n, deviance, yval
      * denotes terminal node

1) root 402 50188.3500 10.781120
  2) as.factor(Method)=2 244 7044.2100 6.449180
    4) CoralTotal< 4.955 87 1401.1290 3.526437 *
    5) CoralTotal>=4.955 157 4488.0570 8.068790 *
  3) as.factor(Method)=1 158 31494.2000 17.470950
    6) as.factor(Station)=3,4 25 781.7439 3.157200 *
    7) as.factor(Station)=1,2,5,6,7,8,9,10 133 24627.5800 20.161500
      14) as.factor(Month)=5,7,8,10 94 11587.5500 17.093940 *
      15) as.factor(Month)=11 39 10023.5200 27.555130 *

(Each row gives the branch, n, deviance and mean; * marks a terminal 'leaf'.)

22 of 48

Parrotfish Example: Step 2 Pruning the Tree

22

1) Overfit the tree using cp = 0.001:

parrot_tree <- rpart(formula = Parrot ~ CoralTotal + as.factor(Month) +
                       as.factor(Station) + as.factor(Method),
                     data = Bahama,
                     control = rpart.control(cp = 0.001))

2) Print the cross-validation table:

printcp(parrot_tree)

23 of 48

Parrotfish Example: Step 2 Pruning the Tree

  • Cross-validation is preferred for model selection and determination of cp.
  • In 10-fold cross-validation, one tenth of the data is held out in turn and classified by a tree fit to the remaining data, providing 10 out-of-sample estimates of within-leaf deviance.

plotcp(parrot_tree)

[plotcp output: cross-validated relative error vs. size of tree (number of branches), with a dashed line marking the upper confidence limit of the optimal tree]
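The 10-fold procedure itself is simple to sketch: partition the data into ten folds and, for each fold, predict the held-out observations from a model fit to the other nine. The Python below uses a trivial "predict the training mean" model as a stand-in for the tree, on simulated data (rpart does all of this internally via its xval setting):

```python
import random

random.seed(1)
y = [random.gauss(10, 3) for _ in range(50)]   # simulated response data

k = 10
idx = list(range(len(y)))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]          # 10 disjoint folds

cv_errors = []
for held_out in folds:
    train = [y[i] for i in idx if i not in held_out]
    pred = sum(train) / len(train)             # stand-in "model"
    # out-of-sample squared error (deviance) for this fold
    cv_errors.append(sum((y[i] - pred) ** 2 for i in held_out))

print(len(cv_errors), "fold errors; total =", round(sum(cv_errors), 1))
```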

24 of 48

Parrotfish Example: Step 3 Produce Optimal Tree


3) Refit the tree using the optimal cp chosen by cross-validation:

parrot_tree <- rpart(formula = Parrot ~ CoralTotal + as.factor(Month) +
                       as.factor(Station) + as.factor(Method),
                     data = Bahama,
                     control = rpart.control(cp = 0.0230138))

25 of 48

9.3 Classification Trees

  • If the response variable is nominal, the deviance is not calculated as the sum of squared deviations.
    • For a dichotomous response variable (0, 1) in leaf j:

      D_j = −2 [ n_j1 ln(p_j1) + n_j0 ln(p_j0) ]

    • For a response variable with m > 2 classes:

      D_j = −2 Σ_k n_jk ln(p_jk),   k = 1, …, m
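As a worked example, the multi-class formula applied to one leaf of the ditch data that follows (class counts 2, 2, 2, 9, 0, with 0·ln 0 taken as 0) can be sketched in Python:

```python
import math

# Class counts (ditches 1-5) in one leaf of the ditch tree.
counts = [2, 2, 2, 9, 0]
n = sum(counts)

# Multinomial leaf deviance: D_j = -2 * sum_k n_jk * ln(p_jk),
# where p_jk = n_jk / n and 0 * ln(0) is taken as 0.
D = -2 * sum(c * math.log(c / n) for c in counts if c > 0)
print(round(D, 2))  # 33.37
```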

26 of 48

9.4. Ditch Data

26

Response variable is ditch number (nominal response):

rpart(formula = as.factor(Site) ~ Depth + Conductivity + Total_Calcium,
      data = Ditch, method = "class",
      control = rpart.control(cp = 0.001))

Variables used in tree construction:
[1] Conductivity  Depth  Total_Calcium

Root node error: 38/48 = 0.79167
n = 48

   CP       nsplit  rel error  xerror   xstd
1  0.23684  0       1.00000    1.18421  0.044133
2  0.18421  1       0.76316    1.10526  0.060297
3  0.13158  2       0.57895    1.02632  0.071162
4  0.00100  3       0.44737    0.89474  0.082870
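In printcp output the CP and rel error columns are linked: each split's CP value is the drop in relative (training) error that the split achieves, so the rel error column can be rebuilt from the CP values (a Python check on the table above):

```python
# CP values of the three splits from the ditch printcp table.
cp_values = [0.23684, 0.18421, 0.13158]

# Relative error starts at 1 at the root and drops by CP at each split.
rel_error = [1.0]
for cp in cp_values:
    rel_error.append(rel_error[-1] - cp)

print([round(r, 5) for r in rel_error])  # [1.0, 0.76316, 0.57895, 0.44737]
```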

27 of 48

9.4. Ditch Data


Advanced Stats

[plotcp output: cross-validated relative error vs. size of tree]

28 of 48

9.4. Ditch Data


Different leaf statistics: each leaf predicts the ditch with the most observations.

For example, one leaf contains 2 observations in ditch 1, 2 in ditch 2, 2 in ditch 3, 9 in ditch 4 and 0 in ditch 5, so it predicts ditch 4.

29 of 48

9.4. Ditch Data


n = 48
node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 48 38 1 (0.21 0.19 0.21 0.21 0.19)
  2) Total_Calcium>=118 25 16 5 (0.32 0.28 0 0.04 0.36)
    4) Conductivity< 1505 11 6 1 (0.45 0.45 0 0.091 0) *
    5) Conductivity>=1505 14 5 5 (0.21 0.14 0 0 0.64) *
  3) Total_Calcium< 118 23 13 3 (0.087 0.087 0.43 0.39 0)
    6) Depth>=0.505 8 0 3 (0 0 1 0 0) *
    7) Depth< 0.505 15 6 4 (0.13 0.13 0.13 0.6 0) *

(Each row gives the branch, n, loss, predicted class and class probabilities p1–p5; * marks a terminal 'leaf'.)

30 of 48

9.4. Ditch Data


Advanced Stats

dotchart(Ditch$Total_Calcium, pch = Ditch$Site,
         xlab = "Range", ylab = "Sample", main = "Total Calcium")

[Dotchart of Total_Calcium by sample, with plotting symbol = ditch number (1–5)]

31 of 48

21.4 Sole Survey


Solea  salinity  mud    depth
0      30        71.18  3.00
0      29        87.63  2.60
1      30        71.29  2.60
0      29        55.03  2.10
0      30        42.04  3.20
0      32        50.72  3.50
1      29        68.80  1.60
1      28        36.75  1.70
0      29        72.24  1.80
1      12        36.42  4.50
0      17        16.65  6.00
1      3         23.73  4.00
1      2         20.00  4.00
0      28        72.94  2.70
1      29        91.40  2.20

32 of 48

21.4 Sole Survey


Advanced Stats

solea_tree <- rpart(as.factor(Solea_solea) ~ depth + temperature +
                      salinity + transparency + gravel + mud +
                      factor(month) + Area,
                    data = Solea, method = "class",
                    minsplit = 5, cp = 0.001)

33 of 48

21.4 Sole Survey

The same absent/present response was modeled with logistic regression in MAR 596.

[Plot of Solea solea absence/presence]

34 of 48

21.4 Sole Survey


Advanced Stats

n = 65
node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 65 26 0 (0.60000000 0.40000000)
  2) salinity>=15.5 49 12 0 (0.75510204 0.24489796)
    4) gravel>=1.42 43 8 0 (0.81395349 0.18604651)
      8) gravel< 17.275 33 3 0 (0.90909091 0.09090909) *
      9) gravel>=17.275 10 5 0 (0.50000000 0.50000000)
        18) mud>=50.55 4 0 0 (1.00000000 0.00000000) *
        19) mud< 50.55 6 1 1 (0.16666667 0.83333333) *
    5) gravel< 1.42 6 2 1 (0.33333333 0.66666667)
      10) salinity>=29.5 2 0 0 (1.00000000 0.00000000) *
      11) salinity< 29.5 4 0 1 (0.00000000 1.00000000) *
  3) salinity< 15.5 16 2 1 (0.12500000 0.87500000)
    6) gravel< 1.34 3 1 0 (0.66666667 0.33333333) *
    7) gravel>=1.34 13 0 1 (0.00000000 1.00000000) *

35 of 48

21.4 Sole Survey

  • The cross-validated relative error shows the worse performance of larger trees.


36 of 48

24 A Tree for the Birds

  • Classification of radar echoes to distinguish bird species from waves, ships and air clutter for wind-farm siting.


Advanced Stats

37 of 48

24 A Tree for the Birds

  • Objective: determine which variables help distinguish birds from other signals.


Advanced Stats

tree1 <- rpart(g9 ~ EPT + TKQ + TKT + AVV + VEL + MXA + AREA +
                 MAXREF + TRKDIS + MAXSEG + ORIENT + ELLRATIO +
                 ELONG + COMPACT + CHY + MAXREF1 + MINREF + SDREF,
               data = TreeData.1, method = "class",
               control = rpart.control(cp = 0.001))

38 of 48

24 A Tree for the Birds

  • Two branches appear to be optimal.


Advanced Stats

39 of 48

24 A Tree for the Birds


Advanced Stats

tree1a <- prune(tree1, cp = 0.0323)
printcp(tree1a, digits = 3)
tree1a

n = 629
node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 629 217 7 (0.083 0.059 0.054 0.017 0.035 0.043 0.66 0.029 0.025)
  2) EPT< 1.5 93 41 1 (0.56 0.4 0.043 0 0 0 0 0 0) *
  3) EPT>=1.5 536 124 7 (0 0 0.056 0.021 0.041 0.05 0.77 0.034 0.03) *

40 of 48

24 A Tree for the Birds

  • Echoes per track (EPT) was the most informative variable for separating birds from other signals.


Advanced Stats

Groups:

Air clutter, waves, ships, birdspp1… birdspp6

41 of 48

24 A Tree for the Birds

  • A second attempt grouped all bird species into a single category, to be separated from ships, waves and air clutter.


Advanced Stats

tree3 <- rpart(g4 ~ EPT + TKQ + TKT + AVV + VEL + MXA + AREA +
                 MAXREF + TRKDIS + MAXSEG + ORIENT + ELLRATIO +
                 ELONG + COMPACT + CHY + MAXREF1 + MINREF + SDREF,
               data = TreeData.1, method = "class",
               control = rpart.control(cp = 0.001))

42 of 48

24 A Tree for the Birds

  • Echoes per track, ellipse ratio, area and track quality were important predictors


Advanced Stats

Groups:

Air clutter, waves, ships, birds

43 of 48

Regression Tree Models

  • Regression trees predict the value of the response variable (Y; e.g., bees) as a step function of each explanatory variable (X; e.g., #plants, AM/PM) resulting in a multi-dimensional step function.


Advanced Stats


44 of 48

Classification Tree

  • Simple example with two predictors (7% error).


Advanced Stats

45 of 48

Bagged Tree

  • Bagged trees (bootstrap aggregation): many trees are fit to bootstrap resamples of the data, and classification is based on a majority vote (3% error).


Advanced Stats

46 of 48

Random Forests & Boosted Trees

  • Random forests are a refinement of bagged trees: at each split only a random subset of the predictors is considered, which reduces the correlation among trees, and 'out-of-bag' (extrinsic) observations provide classification error estimates.
  • Boosted trees average many trees that are grown on re-weighted versions of the training data.
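Bagging can be sketched in a few lines: draw bootstrap resamples, fit a weak classifier to each, and classify new points by majority vote. In this Python illustration a one-threshold "stump" stands in for a full tree and the data are simulated (a toy sketch, not the implementation behind the error rates quoted above):

```python
import random

random.seed(42)
# Toy 1-D data: class 0 clustered near 0, class 1 near 5.
X = [random.gauss(0, 1.5) for _ in range(40)]
X += [random.gauss(5, 1.5) for _ in range(40)]
y = [0] * 40 + [1] * 40

def fit_stump(xs, ys):
    """Pick the threshold minimizing training misclassifications
    (predict class 1 when x >= threshold)."""
    best_t, best_err = None, len(xs) + 1
    for t in sorted(set(xs)):
        err = sum((x >= t) != (label == 1) for x, label in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Bag 25 stumps, each fit to a bootstrap resample of the data.
stumps = []
for _ in range(25):
    boot = [random.randrange(len(X)) for _ in range(len(X))]
    stumps.append(fit_stump([X[i] for i in boot], [y[i] for i in boot]))

def predict(x):
    votes = sum(x >= t for t in stumps)       # majority vote over stumps
    return int(votes > len(stumps) / 2)

print(predict(-1.0), predict(6.0))
```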


Advanced Stats

47 of 48

Summary – Tree Models

  • Tree models identify dichotomous groups in a response variable.
    • They choose the dichotomous splits with the greatest between-group differences.
    • Leaf groups are assumed to be homogeneous.
  • The response variable can be continuous (regression tree) or discrete (classification tree).
  • Tree models are commonly used as an exploratory tool for predictive modeling.
  • Determining the optimal tree can be tricky.
  • Bagging, boosting and random forests resample or re-weight the data (a form of pseudo-replication) to improve prediction or classification.


48 of 48

Other Resources: