BioStatsII: Lecture 16, Trees
Amanda Hart
3/31/20
Acknowledgements: Steve Cadrin, Gavin Fay
Lecture Outline
Evaluate Management Performance: Tree Analysis (An Example)
Explain variability in dataset
[Diagram: Data splits into Group A and Group B; Group B splits into Group B.1 and Group B.2]
Explore relationships between response variable (data) and multiple explanatory variables (determine grouping)
Classification trees
Regression trees
9.1 Tree Introduction: Bee Example
Site | Bees | Plant Spp | Time | Deviation (Bees - Mean) | Deviation squared |
A | 0 | 1 | AM | -3 | 9 |
B | 1 | 1 | PM | -2 | 4 |
C | 2 | 2 | AM | -1 | 1 |
D | 3 | 2 | PM | 0 | 0 |
E | 4 | 3 | AM | 1 | 1 |
F | 5 | 3 | PM | 2 | 4 |
G | 6 | 3 | PM | 3 | 9 |
Mean | 3 | | | Sum | 28 |
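The deviation and squared-deviation columns of the table can be reproduced in a few lines of R. This is a minimal sketch; the vector name bees and the other variable names are ours, and the counts are typed in from the table rather than read from the lecture data file.

```r
# Bee counts at sites A-G, typed in from the table above
bees <- c(A = 0, B = 1, C = 2, D = 3, E = 4, F = 5, G = 6)

deviation <- bees - mean(bees)   # deviation of each site from the overall mean
dev_sq    <- deviation^2         # squared deviation
total_D   <- sum(dev_sq)         # total deviance D of the ungrouped data

print(rbind(deviation, dev_sq))
cat("Mean:", mean(bees), " Total deviance D:", total_D, "\n")
```

The total deviance D is the quantity that every candidate split of the data then tries to explain.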
D = D-between-groups (5.25) + D-within-groups (22.75) = 28
[Diagram: Data split by Time into Group A (AM) and Group B (PM)]
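The AM/PM decomposition above can be checked numerically. The sketch below is one way to do it in base R; the helper ss() and the variable names are ours, not part of the lecture code.

```r
bees <- c(0, 1, 2, 3, 4, 5, 6)
time <- c("AM", "PM", "AM", "PM", "AM", "PM", "PM")

# Sum of squared deviations of x around its own mean
ss <- function(x) sum((x - mean(x))^2)

# Within-group deviance: add up the deviance inside each time group
d_within <- sum(tapply(bees, time, ss))

# Between-group deviance: group size times the squared distance of the
# group mean from the overall mean, summed over groups
grp_mean  <- tapply(bees, time, mean)
grp_n     <- tapply(bees, time, length)
d_between <- sum(grp_n * (grp_mean - mean(bees))^2)

cat("within =", d_within, " between =", d_between,
    " total =", d_within + d_between, "\n")
```

Splitting on Time explains only a small share of the total deviance (5.25 of 28), which is why it makes a poor first split.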
Root: Mean=3, n=7
Split: Time=morning
  True: Mean=2, n=3
  False: Mean=3.75, n=4
Site | Bees | Plant Spp | Time |
A | 0 | 1 | AM |
B | 1 | 1 | PM |
C | 2 | 2 | AM |
D | 3 | 2 | PM |
E | 4 | 3 | AM |
F | 5 | 3 | PM |
G | 6 | 3 | PM |
D = within group AB + within group C-G + AB group effect + C-G group effect
D = 0.5 + 10 + 12.5 + 5 = 28
D = D-within-groups (10.5) + D-between-groups (17.5) = 28
Root: Mean=3, n=7
Split: <=1 Plant Species
  True: Mean=0.5, n=2
  False: Mean=4, n=5
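The four terms of the plant-species decomposition can be computed one at a time. This is an illustrative sketch in base R with our own variable names; it is not taken from the lecture code.

```r
bees   <- c(0, 1, 2, 3, 4, 5, 6)
plants <- c(1, 1, 2, 2, 3, 3, 3)

low  <- bees[plants <= 1]   # sites A and B
high <- bees[plants >  1]   # sites C to G

within_low  <- sum((low  - mean(low))^2)                     # within group AB
within_high <- sum((high - mean(high))^2)                    # within group C-G
effect_low  <- length(low)  * (mean(low)  - mean(bees))^2    # AB group effect
effect_high <- length(high) * (mean(high) - mean(bees))^2    # C-G group effect

cat("D =", within_low, "+", within_high, "+", effect_low, "+", effect_high,
    "=", within_low + within_high + effect_low + effect_high, "\n")
```

Compared with the Time split, this split moves most of the deviance into the between-groups part (17.5 of 28).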
Root: Mean=3, n=7
Split: Plants<2.5
  True: Mean=1.5, n=4
    Split: Plants<1.5
      True: Mean=0.5, n=2 (sites A, B)
      False: Mean=2.5, n=2 (sites C, D)
  False: Mean=5, n=3
    Split: Time=AM
      True: Mean=4.0, n=1 (site E)
      False: Mean=5.5, n=2 (sites F, G)
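Why the first split lands on Plants<2.5 can be seen by trying every candidate split of the full dataset and keeping the one that leaves the least within-group deviance. The sketch below illustrates that idea in base R; it is not rpart's internal code, and the helper names ss() and within_after() are ours.

```r
bees   <- c(0, 1, 2, 3, 4, 5, 6)
plants <- c(1, 1, 2, 2, 3, 3, 3)
time   <- c("AM", "PM", "AM", "PM", "AM", "PM", "PM")

# Sum of squared deviations of x around its own mean
ss <- function(x) sum((x - mean(x))^2)

# Within-group deviance left after splitting y by a TRUE/FALSE rule
within_after <- function(y, rule) ss(y[rule]) + ss(y[!rule])

# All possible first splits for this small dataset
candidates <- c(
  "Plants<1.5" = within_after(bees, plants < 1.5),
  "Plants<2.5" = within_after(bees, plants < 2.5),
  "Time=AM"    = within_after(bees, time == "AM")
)
print(candidates)

best <- names(which.min(candidates))
cat("Best first split:", best, "\n")
```

The same search is then repeated inside each of the two resulting groups, which is what makes the partitioning recursive.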
Advanced Stats
Use the rpart package to produce trees
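> # Read the bee data (the formula below needs columns Bees, Plants and Time)
> Bees <- read.table("Bees.txt", header = TRUE)
> library(rpart)
> # Regression tree; minsplit = 3 allows splits in this very small dataset
> bee_tree <- rpart(Bees ~ Plants + as.factor(Time),
+ data = Bees,
+ control = rpart.control(minsplit = 3, cp = 0.001))
> bee_tree        # print the splits
> plot(bee_tree)  # draw the tree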
> bee_tree
n = 7
node), split, n, deviance, yval
* denotes terminal node
1) root 7 28.0 3.0
2) Plants< 2.5 4 5.0 1.5
4) Plants< 1.5 2 0.5 0.5 *
5) Plants>=1.5 2 0.5 2.5 *
3) Plants>=2.5 3 2.0 5.0
6) as.factor(Time)=1 1 0.0 4.0 *
7) as.factor(Time)=2 2 0.5 5.5 *
> plot(bee_tree)
Each row reads: node number, branch (split), n, deviance, mean; * marks a terminal node ('leaf').
The bee_tree object contains more information than you think!
Using bee_tree$ notation gives lots of other options
bee_tree$frame | Contains information on which explanatory variables were used to make splits |
bee_tree$where | Tree location of every data point |
bee_tree$terms | |
bee_tree$cptable | Information used in cross validation (this version is easy to manipulate) |
bee_tree$method | |
bee_tree$parms | |
bee_tree$control | Shows how rpart.control() was used |
bee_tree$functions | |
bee_tree$numresp | |
bee_tree$splits | |
bee_tree$variable.importance | Lists how important each variable was in determining tree splits |
bee_tree$y | |
bee_tree$ordered | |
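A few of these components can be inspected directly. The sketch below assumes the rpart package is installed and rebuilds the bee data as a data frame typed in from the table earlier in the lecture (a stand-in for Bees.txt, with Time stored as "AM"/"PM" rather than the 1/2 coding seen in the printed output).

```r
library(rpart)

# Bee data typed in from the lecture table (stand-in for Bees.txt)
Bees <- data.frame(
  Bees   = c(0, 1, 2, 3, 4, 5, 6),
  Plants = c(1, 1, 2, 2, 3, 3, 3),
  Time   = c("AM", "PM", "AM", "PM", "AM", "PM", "PM")
)

bee_tree <- rpart(Bees ~ Plants + as.factor(Time),
                  data = Bees,
                  control = rpart.control(minsplit = 3, cp = 0.001))

# One row per node: split variable ("<leaf>" for terminal nodes), n, deviance, mean
print(bee_tree$frame[, c("var", "n", "dev", "yval")])

# For each observation, the row of bee_tree$frame holding its leaf
print(bee_tree$where)

# Mean bee count of the leaf that each site was assigned to
leaf_mean <- bee_tree$frame$yval[bee_tree$where]
print(leaf_mean)
```

Indexing bee_tree$frame with bee_tree$where is a convenient way to attach leaf-level summaries back onto the original rows.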
Tree analysis: 3 Steps
9.2 Pruning the Tree
Parrotfish Example: Step 1 Initial Tree
Recursive Partitioning and Regression Trees
parrot_tree <- rpart(Parrot ~ CoralTotal + as.factor(Month) + as.factor(Station) + as.factor(Method),
data = Bahama,
control = rpart.control(cp = 0.02))
parrot_tree
plot(parrot_tree)
parrot_tree
n= 402
node), split, n, deviance, yval
* denotes terminal node
1) root 402 50188.3500 10.781120
2) as.factor(Method)=2 244 7044.2100 6.449180
4) CoralTotal< 4.955 87 1401.1290 3.526437 *
5) CoralTotal>=4.955 157 4488.0570 8.068790 *
3) as.factor(Method)=1 158 31494.2000 17.470950
6) as.factor(Station)=3,4 25 781.7439 3.157200 *
7) as.factor(Station)=1,2,5,6,7,8,9,10 133 24627.5800 20.1615
14) as.factor(Month)=5,7,8,10 94 11587.5500 17.093940 *
15) as.factor(Month)=11 39 10023.5200 27.555130 *
Each row reads: node number, branch (split), n, deviance, mean; * marks a terminal node ('leaf').
Parrotfish Example: Step 2 Pruning the Tree
1) Overfit tree using cp = 0.001
parrotTree <- rpart(formula = Parrot ~ CoralTotal +
as.factor(Month) +
as.factor(Station) +
as.factor(Method),
data = Bahama,
control = rpart.control(cp = 0.001))
2) Print cross validation
printcp(parrotTree)
plotcp(parrotTree)
[Figure: cross-validation plot from plotcp. Annotations: upper confidence limit of optimal tree; size of tree (number of branches).]
Parrotfish Example: Step 3 Produce Optimal Tree
1) Refit the tree using the optimal cp = 0.0230138
parrotTree <- rpart(formula = Parrot ~ CoralTotal +
as.factor(Month) +
as.factor(Station) +
as.factor(Method),
data = Bahama,
control = rpart.control(cp = 0.0230138))
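The three steps (overfit, cross-validate, produce the optimal tree) can be sketched end to end on a dataset that ships with R. Everything here is an assumption for illustration: mtcars stands in for the Bahama data, which are not reproduced in these notes; the one-standard-error rule is one common way to read the cross-validation table, not necessarily how the cp above was chosen; and prune() is used in place of refitting with the chosen cp.

```r
library(rpart)
set.seed(1)  # cross-validation folds are random, so fix the seed

# Step 1: deliberately overfit with a small cp (mtcars is a stand-in dataset)
full_tree <- rpart(mpg ~ wt + hp + disp + as.factor(cyl),
                   data = mtcars,
                   control = rpart.control(cp = 0.001))

# Step 2: cross-validation table (the same numbers that printcp() reports)
cp_tab <- full_tree$cptable
print(cp_tab)

# 1-SE rule: smallest tree whose cross-validated error is within one
# standard error of the minimum cross-validated error
i_min     <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]
i_best    <- min(which(cp_tab[, "xerror"] <= threshold))
best_cp   <- cp_tab[i_best, "CP"]

# Step 3: prune the overfit tree back to the chosen cp
pruned_tree <- prune(full_tree, cp = best_cp)
print(pruned_tree)
```

Because the folds are random, a different seed can select a different cp, so it is worth rerunning the cross validation a few times before settling on a tree size.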
9.3 Classification Trees
9.4. Ditch Data
Response variable is Ditch Number (nominal response)
rpart(formula = as.factor(Site) ~ Depth + Conductivity +
Total_Calcium, data = Ditch, method = "class",
control = rpart.control(cp = 0.001))
Variables used in tree construction:
[1] Conductivity Depth Total_Calcium
Root node error: 38/48 = 0.79167
n= 48
CP nsplit rel error xerror xstd
1 0.23684 0 1.00000 1.18421 0.044133
2 0.18421 1 0.76316 1.10526 0.060297
3 0.13158 2 0.57895 1.02632 0.071162
4 0.00100 3 0.44737 0.89474 0.082870
[Figure: axis label "Size of Tree"]
Different leaf statistics:
Each leaf is labeled with the ditch that has the most observations in it, followed by the number of observations from each ditch.
Example leaf: 2 in ditch 1, 2 in ditch 2, 2 in ditch 3, 9 in ditch 4, 0 in ditch 5
n= 48
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 48 38 1 (0.21 0.19 0.21 0.21 0.19)
2) Total_Calcium>=118 25 16 5 (0.32 0.28 0 0.04 0.36)
4) Conductivity< 1505 11 6 1 (0.45 0.45 0 0.091 0) *
5) Conductivity>=1505 14 5 5 (0.21 0.14 0 0 0.64) *
3) Total_Calcium< 118 23 13 3 (0.087 0.087 0.43 0.39 0)
6) Depth>=0.505 8 0 3 (0 0 1 0 0) *
7) Depth< 0.505 15 6 4 (0.13 0.13 0.13 0.6 0) *
Each row reads: node number, branch (split), n, D (loss), Y (predicted ditch), (p1 p2 p3 p4 p5); * marks a leaf.
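The loss column of the printed tree ties back to the cross-validation table shown earlier for the ditch data. The sketch below redoes that arithmetic using numbers copied from the output; the variable names are ours.

```r
n_total   <- 48
root_loss <- 38              # root node: 38 of 48 samples misclassified
leaf_loss <- c(6, 5, 0, 6)   # loss at the four leaves (nodes 4, 5, 6, 7)

root_error <- root_loss / n_total          # "Root node error" in the cp table
rel_error  <- sum(leaf_loss) / root_loss   # "rel error" at nsplit = 3
tree_error <- rel_error * root_error       # misclassification rate of the tree

cat(sprintf("root error = %.5f, rel error = %.5f, tree error = %.3f\n",
            root_error, rel_error, tree_error))
```

In other words, rel error is always relative to the root node, so it has to be multiplied by the root node error to get an absolute misclassification rate (here 17 of 48 samples).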
dotchart(Ditch$Total_Calcium, pch = Ditch$Site, xlab = "Range", ylab = "Sample", main = "Total Calcium")
[Figure: dot chart of Total Calcium by sample; legend: Ditch 5, 4, 3, 2, 1]
21.4 Sole Survey
solea_tree <- rpart(as.factor(Solea_solea) ~ depth +
temperature +
salinity +
transparency +
gravel +
mud +
factor(month) +
Area,
data = Solea,
method = "class",
minsplit=5,
cp = 0.001)
[Figure: annotation "absent/present"]
n= 65
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 65 26 0 (0.60000000 0.40000000)
2) salinity>=15.5 49 12 0 (0.75510204 0.24489796)
4) gravel>=1.42 43 8 0 (0.81395349 0.18604651)
8) gravel< 17.275 33 3 0 (0.90909091 0.09090909) *
9) gravel>=17.275 10 5 0 (0.50000000 0.50000000)
18) mud>=50.55 4 0 0 (1.00000000 0.00000000) *
19) mud< 50.55 6 1 1 (0.16666667 0.83333333) *
5) gravel< 1.42 6 2 1 (0.33333333 0.66666667)
10) salinity>=29.5 2 0 0 (1.00000000 0.00000000) *
11) salinity< 29.5 4 0 1 (0.00000000 1.00000000) *
3) salinity< 15.5 16 2 1 (0.12500000 0.87500000)
6) gravel< 1.34 3 1 0 (0.66666667 0.33333333) *
7) gravel>=1.34 13 0 1 (0.00000000 1.00000000) *
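One way to read the printed tree is as a set of nested rules. The sketch below transcribes the splits above into a plain R function. The function name predict_sole() is hypothetical, and we assume that class 0 means absent and class 1 means present; it illustrates how to follow the branches and is not a replacement for predict() on the fitted object.

```r
# Follow the printed splits; returns 1 (assumed: present) or 0 (assumed: absent)
predict_sole <- function(salinity, gravel, mud) {
  if (salinity >= 15.5) {                      # node 2
    if (gravel >= 1.42) {                      # node 4
      if (gravel < 17.275) return(0)           # node 8 (leaf)
      if (mud >= 50.55) return(0)              # node 18 (leaf)
      return(1)                                # node 19 (leaf)
    }
    if (salinity >= 29.5) return(0)            # node 10 (leaf)
    return(1)                                  # node 11 (leaf)
  }
  if (gravel < 1.34) return(0)                 # node 6 (leaf)
  1                                            # node 7 (leaf)
}

# Hypothetical stations, chosen only to exercise different branches
print(predict_sole(salinity = 10, gravel = 5,  mud = 0))    # ends in node 7
print(predict_sole(salinity = 20, gravel = 10, mud = 0))    # ends in node 8
print(predict_sole(salinity = 20, gravel = 20, mud = 40))   # ends in node 19
```

Note that mud only matters for stations that reach node 9 (salinity>=15.5 and gravel>=17.275), which is typical of how trees capture interactions between variables.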
24 A Tree for the Birds
tree1 <- rpart(g9 ~ EPT + TKQ + TKT + AVV +
VEL + MXA + AREA + MAXREF +
TRKDIS + MAXSEG + ORIENT +
ELLRATIO + ELONG + COMPACT +
CHY + MAXREF1 + MINREF + SDREF,
data = TreeData.1,
method = "class",
control = rpart.control(cp = 0.001))
tree1a <- prune(tree1, cp = 0.0323)
printcp(tree1a, digits = 3)
tree1a
n= 629
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 629 217 7 (0.083 0.059 0.054 0.017 0.035 0.043 0.66 0.029 0.025)
2) EPT< 1.5 93 41 1 (0.56 0.4 0.043 0 0 0 0 0 0) *
3) EPT>=1.5 536 124 7 (0 0 0.056 0.021 0.041 0.05 0.77 0.034 0.03) *
Groups:
Air clutter, waves, ships, birdspp1… birdspp6
tree3 <- rpart(g4 ~ EPT + TKQ + TKT + AVV + VEL + MXA +
AREA + MAXREF + TRKDIS + MAXSEG + ORIENT +
ELLRATIO + ELONG + COMPACT + CHY + MAXREF1 +
MINREF + SDREF,
data = TreeData.1,
method = "class",
control = rpart.control(cp = 0.001))
Groups:
Air clutter, waves, ships, birds
Regression Tree Models
Classification Tree
Bagged Tree
Random Forests & Boosted Trees
Summary – Tree Models
Other Resources: