1 of 31

An Investigation and Predictive Analysis of Tree Cover Types in Roosevelt National Forest

Jacob Swe, Iva Porfirova, Lee Ding, Derek Huang

2 of 31

The Plan

  1. The Big Idea
  2. The Features
  3. An Approach to Modeling
  4. Linear Discriminant Analysis Classifier
  5. Decision Tree Classifier
  6. Random Forest Classifier
  7. Comparative Analysis
  8. Final Takeaways

3 of 31

Forest Conservation

4 of 31

Cartographic measurements of distances and orientations for the different forest cover types, ten continuous variables in total.

Elevation

Aspect

Slope

Horizontal_Distance_To_Hydrology

Vertical_Distance_To_Hydrology

Horizontal_Distance_To_Roadways

Hillshade_9am

Hillshade_Noon

Hillshade_3pm

Horizontal_Distance_To_Fire_Points

Continuous Variables
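For reference, the ten continuous features above can be kept as a named list; loading the data via scikit-learn's `fetch_covtype` is one option (shown commented out, since it downloads the dataset on first call):

```python
# The ten continuous (cartographic) features of the Covertype dataset,
# named as on the slide above.
CONTINUOUS_FEATURES = [
    "Elevation",
    "Aspect",
    "Slope",
    "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am",
    "Hillshade_Noon",
    "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]

# One way to load the real data (downloads ~11 MB on first call):
# from sklearn.datasets import fetch_covtype
# X, y = fetch_covtype(return_X_y=True)  # the first 10 of 54 columns are these
```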

5 of 31

Continuous Variables

6 of 31

Continuous Variables

7 of 31

Continuous Variables

8 of 31

Binary indicator variables for the wilderness area and soil type categories.

Wilderness_Area1 (Rawah)

Wilderness_Area2 (Neota)

Wilderness_Area3 (Comanche)

Wilderness_Area4 (Cache la Poudre)

Soil_Type1 … Soil_Type40

Binary Variables

0 = absence

1 = presence

9 of 31

Binary Variables

10 of 31

Binary Variables

11 of 31

Visual Analysis

12 of 31

Clustering

13 of 31

Our Approach to Modeling

Data Partitioning:

3-fold CV

Hyperparameter Tuning:

CV Grid Search

Resampling:

Rebalance Class Membership
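The partitioning and tuning setup above (3-fold CV plus a cross-validated grid search) maps directly onto scikit-learn. A minimal sketch on synthetic stand-in data; the grid values here are illustrative, not the ones actually searched in the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass stand-in for the cover-type data.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 3-fold cross-validated grid search over an illustrative hyperparameter grid.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [5, 10, 25], "min_samples_leaf": [1, 5, 10]},
    cv=3,
)
grid.fit(X_train, y_train)
holdout = grid.score(X_test, y_test)  # holdout accuracy of the best model
```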

14 of 31

Our Approach to Modeling

Feature Selection:

CV Recursive Feature Elimination

15 of 31

Our Approach to Modeling

  1. Elevation
  2. Horizontal_Distance_To_Hydrology
  3. Vertical_Distance_To_Hydrology
  4. Horizontal_Distance_To_Roadways
  5. Hillshade_9am
  6. Hillshade_Noon
  7. Horizontal_Distance_To_Fire_Points
  8. Wilderness_Area1
  9. Soil_Type38
  10. Soil_Type39
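A cross-validated recursive feature elimination pass like the one that produced the ten-feature list above can be sketched with scikit-learn's `RFECV` (again on synthetic data; the project ran this on the Covertype features):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           random_state=0)

# Recursively drop the weakest feature, scoring each subset with 3-fold CV.
selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=3)
selector.fit(X, y)
kept = selector.support_      # boolean mask of retained features
ranking = selector.ranking_   # 1 = selected; higher = eliminated earlier
```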

16 of 31

Linear Discriminant Analysis

LDA:

Regularization: 0.3, Tolerance: 0.01

3-fold CV Accuracy: 0.6597

Holdout Accuracy: 0.6612
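In scikit-learn terms, the configuration above is `LinearDiscriminantAnalysis` with `shrinkage=0.3` (the regularization) and `tol=0.01`. Note that shrinkage requires the `lsqr` or `eigen` solver; the solver choice below is an assumption, not stated on the slide, and in scikit-learn `tol` is only consulted by the `svd` solver:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

# shrinkage=0.3 regularizes the covariance estimate; it needs solver="lsqr"
# or "eigen". tol=0.01 mirrors the slide's setting.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.3, tol=0.01)
lda.fit(X, y)
acc = lda.score(X, y)  # training accuracy, just to show the fitted model works
```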

17 of 31

Decision Tree Classifier

18 of 31

Decision Tree Classifier

LDA:

Regularization: 0.3, Tolerance: 0.01

Holdout Accuracy: 0.6612

DTC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

Holdout Accuracy: 0.92190
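The tuned tree above translates directly to scikit-learn: "Optimizer: Entropy" is the split criterion, with `max_depth=25` and `min_samples_leaf=5`. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Optimizer: Entropy" on the slide is the split criterion in scikit-learn.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=25,
                              min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
holdout_acc = tree.score(X_test, y_test)
```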

19 of 31

Random Forest Classifier

20 of 31

Random Forest Classifier

RFC:

Optimizer: Entropy, Depth: 15, Min Samples in Leaf: 10

# Trees: 15

Holdout Accuracy: 0.82850

LDA:

Regularization: 0.3, Tolerance: 0.01

Holdout Accuracy: 0.6612

DTC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

Holdout Accuracy: 0.92190

21 of 31

Comparative Performance

DTC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

Holdout Accuracy: 0.92190

RFC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

# Trees: 15

Holdout Accuracy: 0.94908
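The winning forest above (15 entropy trees, depth at most 25, at least 5 samples per leaf) also maps one-to-one onto scikit-learn. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The comparative slide's best configuration: 15 entropy trees,
# max depth 25, minimum 5 samples per leaf.
forest = RandomForestClassifier(n_estimators=15, criterion="entropy",
                                max_depth=25, min_samples_leaf=5,
                                random_state=0)
forest.fit(X_train, y_train)
holdout_acc = forest.score(X_test, y_test)
```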

22 of 31

Best Model Selection

One vs. Many

23 of 31

Best Model Selection

One vs. Many

24 of 31

Resampling

Interpolating New Observations

25 of 31

Resampling

~500,000 Samples ➝ ~1,500,000 Samples
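"Interpolating new observations" is SMOTE-style oversampling: synthetic minority-class samples are placed on line segments between existing ones. A minimal numpy sketch of just the interpolation idea (the project likely used a library such as imbalanced-learn, which additionally restricts pairs to nearest neighbors; that is an assumption here):

```python
import numpy as np

def interpolate_samples(X_minority, n_new, rng):
    """SMOTE-style sketch: each synthetic point lies between two real ones.

    Real SMOTE pairs each point with one of its nearest neighbors; this
    simplified version pairs arbitrary points to show the interpolation step.
    """
    n = len(X_minority)
    a = X_minority[rng.integers(0, n, size=n_new)]
    b = X_minority[rng.integers(0, n, size=n_new)]
    t = rng.random((n_new, 1))       # interpolation weight per new sample
    return a + t * (b - a)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(50, 4))     # toy minority class, 50 points in 4-D
X_new = interpolate_samples(X_min, n_new=100, rng=rng)
```

Because each new point is a convex combination of two real ones, the synthetic samples stay inside the coordinate-wise range of the original minority class.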

26 of 31

Changing the Goal

What does resampling do to accuracy calculations?

What metrics should we pay attention to?

How can we decide if the new model is better?

27 of 31

Tuning a New Model

Accuracy now means Precision

28 of 31

Evaluating Performance

29 of 31

Evaluating Performance

Resampled

Raw Data

Accuracy:

~93% ➝ ~93%

ROC AUC:

~99% ➝ ~99%

Precision (Macro):

~96% ➝ ~97%

Precision-Recall AUC:

~96% ➝ ~98%
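The four metrics above are all available in scikit-learn; a sketch of computing macro precision, one-vs-rest ROC AUC, and macro precision-recall AUC on synthetic data (the classifier and values here are illustrative, not the project's results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=15, random_state=0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)
y_bin = label_binarize(y_test, classes=[0, 1, 2])  # one column per class

# Macro-averaged precision treats every class equally, regardless of size.
prec_macro = precision_score(y_test, clf.predict(X_test), average="macro")
# One-vs-rest multiclass ROC AUC from predicted probabilities.
roc_auc = roc_auc_score(y_test, proba, multi_class="ovr")
# Macro-averaged area under the precision-recall curve.
pr_auc = average_precision_score(y_bin, proba, average="macro")
```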

30 of 31

Final Conclusions

  • Did we beat the existing records?
    1. Yes! By ~2 percentage points.
  • Other unused modeling techniques?
    • Various resampling techniques, other boosting tree implementations, neural networks, multi-model ensemble.
  • Other questions in the data?
    • Trees and where they like to grow.
    • Impact of roadways on nature.

31 of 31

Questions & Answers