1 of 31

An Investigation and Predictive Analysis of Tree Cover Types in Roosevelt National Forest

Jacob Swe, Iva Porfirova, Lee Ding, Derek Huang

2 of 31

The Plan

  1. The Big Idea
  2. The Features
  3. An Approach to Modeling
  4. Linear Discriminant Analysis Classifier
  5. Decision Tree Classifier
  6. Random Forest Classifier
  7. Comparative Analysis
  8. Final Takeaways

3 of 31

Forest Conservation

4 of 31

Cartographic measurements of distances and orientations for the different forest cover types, ten continuous variables in total.

Elevation

Aspect

Slope

Horizontal_Distance_To_Hydrology

Vertical_Distance_To_Hydrology

Horizontal_Distance_To_Roadways

Hillshade_9am

Hillshade_Noon

Hillshade_3pm

Horizontal_Distance_To_Fire_Points

Continuous Variables
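For reference, the ten continuous features above can be kept as a named list; loading the data via scikit-learn's `fetch_covtype` is one option (shown commented out, since it downloads the dataset on first call):

```python
# The ten continuous (cartographic) features of the Covertype dataset,
# named as on the slide above.
CONTINUOUS_FEATURES = [
    "Elevation",
    "Aspect",
    "Slope",
    "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am",
    "Hillshade_Noon",
    "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]

# One way to load the real data (downloads ~11 MB on first call):
# from sklearn.datasets import fetch_covtype
# X, y = fetch_covtype(return_X_y=True)  # the first 10 of 54 columns are these
```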

5 of 31

Continuous Variables

6 of 31

Continuous Variables

7 of 31

Continuous Variables

8 of 31

Binary indicator variables for the wilderness area and soil type categories.

Wilderness_Area1 (Rawah)

Wilderness_Area2 (Neota)

Wilderness_Area3 (Comanche)

Wilderness_Area4 (Cache la Poudre)

Soil_Type1 … Soil_Type40

Binary Variables

0 = absence

1 = presence

9 of 31

Binary Variables

10 of 31

Binary Variables

11 of 31

Visual Analysis

12 of 31

Clustering

13 of 31

Our Approach to Modeling

Data Partitioning:

3-fold CV

Hyperparameter Tuning:

CV Grid Search

Resampling:

Rebalance Class Membership
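The partitioning and tuning setup above (3-fold CV plus a cross-validated grid search) maps directly onto scikit-learn. A minimal sketch on synthetic stand-in data; the grid values here are illustrative, not the ones actually searched in the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass stand-in for the cover-type data.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 3-fold cross-validated grid search over an illustrative hyperparameter grid.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [5, 10, 25], "min_samples_leaf": [1, 5, 10]},
    cv=3,
)
grid.fit(X_train, y_train)
holdout = grid.score(X_test, y_test)  # holdout accuracy of the best model
```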

14 of 31

Our Approach to Modeling

Feature Selection:

CV Recursive Feature Elimination

15 of 31

Our Approach to Modeling

  1. Elevation
  2. Horizontal_Distance_To_Hydrology
  3. Vertical_Distance_To_Hydrology
  4. Horizontal_Distance_To_Roadways
  5. Hillshade_9am
  6. Hillshade_Noon
  7. Horizontal_Distance_To_Fire_Points
  8. Wilderness_Area1
  9. Soil_Type38
  10. Soil_Type39
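A cross-validated recursive feature elimination pass like the one that produced the ten-feature list above can be sketched with scikit-learn's `RFECV` (again on synthetic data; the project ran this on the Covertype features):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           random_state=0)

# Recursively drop the weakest feature, scoring each subset with 3-fold CV.
selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=3)
selector.fit(X, y)
kept = selector.support_      # boolean mask of retained features
ranking = selector.ranking_   # 1 = selected; higher = eliminated earlier
```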

16 of 31

Linear Discriminant Analysis

LDA:

Regularization: 0.3, Tolerance: 0.01

3-fold CV Accuracy: 0.6597

Holdout Accuracy: 0.6612
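In scikit-learn terms, the configuration above is `LinearDiscriminantAnalysis` with `shrinkage=0.3` (the regularization) and `tol=0.01`. Note that shrinkage requires the `lsqr` or `eigen` solver; the solver choice below is an assumption, not stated on the slide, and in scikit-learn `tol` is only consulted by the `svd` solver:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

# shrinkage=0.3 regularizes the covariance estimate; it needs solver="lsqr"
# or "eigen". tol=0.01 mirrors the slide's setting.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.3, tol=0.01)
lda.fit(X, y)
acc = lda.score(X, y)  # training accuracy, just to show the fitted model works
```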

17 of 31

Decision Tree Classifier

18 of 31

Decision Tree Classifier

LDA:

Regularization: 0.3, Tolerance: 0.01

Holdout Accuracy: 0.6612

DTC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

Holdout Accuracy: 0.92190
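The tuned tree above translates directly to scikit-learn: "Optimizer: Entropy" is the split criterion, with `max_depth=25` and `min_samples_leaf=5`. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Optimizer: Entropy" on the slide is the split criterion in scikit-learn.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=25,
                              min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
holdout_acc = tree.score(X_test, y_test)
```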

19 of 31

Random Forest Classifier

20 of 31

Random Forest Classifier

RFC:

Optimizer: Entropy, Depth: 15, Min Samples in Leaf: 10

# Trees: 15

Holdout Accuracy: 0.82850

LDA:

Regularization: 0.3, Tolerance: 0.01

Holdout Accuracy: 0.6612

DTC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

Holdout Accuracy: 0.92190

21 of 31

Comparative Performance

DTC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

Holdout Accuracy: 0.92190

RFC:

Optimizer: Entropy, Depth: 25, Min Samples in Leaf: 5

# Trees: 15

Holdout Accuracy: 0.94908
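The winning forest above (15 entropy trees, depth at most 25, at least 5 samples per leaf) also maps one-to-one onto scikit-learn. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The comparative slide's best configuration: 15 entropy trees,
# max depth 25, minimum 5 samples per leaf.
forest = RandomForestClassifier(n_estimators=15, criterion="entropy",
                                max_depth=25, min_samples_leaf=5,
                                random_state=0)
forest.fit(X_train, y_train)
holdout_acc = forest.score(X_test, y_test)
```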

22 of 31

Best Model Selection

One vs. Many

23 of 31

Best Model Selection

One vs. Many

24 of 31

Resampling

Interpolating New Observations

25 of 31

Resampling

~500,000 Samples ➝ ~1,500,000 Samples
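"Interpolating new observations" is SMOTE-style oversampling: synthetic minority-class samples are placed on line segments between existing ones. A minimal numpy sketch of just the interpolation idea (the project likely used a library such as imbalanced-learn, which additionally restricts pairs to nearest neighbors; that is an assumption here):

```python
import numpy as np

def interpolate_samples(X_minority, n_new, rng):
    """SMOTE-style sketch: each synthetic point lies between two real ones.

    Real SMOTE pairs each point with one of its nearest neighbors; this
    simplified version pairs arbitrary points to show the interpolation step.
    """
    n = len(X_minority)
    a = X_minority[rng.integers(0, n, size=n_new)]
    b = X_minority[rng.integers(0, n, size=n_new)]
    t = rng.random((n_new, 1))       # interpolation weight per new sample
    return a + t * (b - a)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(50, 4))     # toy minority class, 50 points in 4-D
X_new = interpolate_samples(X_min, n_new=100, rng=rng)
```

Because each new point is a convex combination of two real ones, the synthetic samples stay inside the coordinate-wise range of the original minority class.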

26 of 31

Changing the Goal

What does resampling do to accuracy calculations?

What metrics should we pay attention to?

How can we decide if the new model is better?

27 of 31

Tuning a New Model

Accuracy now means Precision

28 of 31

Evaluating Performance

29 of 31

Evaluating Performance

Resampled

Raw Data

Accuracy:

~93% ➝ ~93%

ROC AUC:

~99% ➝ ~99%

Precision (Macro):

~96% ➝ ~97%

Precision-Recall AUC:

~96% ➝ ~98%
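The four metrics above are all available in scikit-learn; a sketch of computing macro precision, one-vs-rest ROC AUC, and macro precision-recall AUC on synthetic data (the classifier and values here are illustrative, not the project's results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=15, random_state=0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)
y_bin = label_binarize(y_test, classes=[0, 1, 2])  # one column per class

# Macro-averaged precision treats every class equally, regardless of size.
prec_macro = precision_score(y_test, clf.predict(X_test), average="macro")
# One-vs-rest multiclass ROC AUC from predicted probabilities.
roc_auc = roc_auc_score(y_test, proba, multi_class="ovr")
# Macro-averaged area under the precision-recall curve.
pr_auc = average_precision_score(y_bin, proba, average="macro")
```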

30 of 31

Final Conclusions

  • Did we beat the existing records?
    1. Yes! By ~2 percentage points.
  • Other unused modeling techniques?
    • Various resampling techniques, other boosting tree implementations, neural networks, multi-model ensemble.
  • Other questions in the data?
    • Trees and where they like to grow.
    • Impact of roadways on nature.

31 of 31

Questions & Answers