
Tree Methods

Let’s learn something!


Python and Spark

  • A very powerful group of algorithms falls under the “Tree Methods” title.
  • We’ll discuss decision trees, random forests, and gradient boosted trees!

Reading Assignment

Chapter 8 of Introduction to Statistical Learning, by Gareth James, et al.


Tree Methods

  • Let’s start off with a thought experiment to give some motivation for using a decision tree method.


  • Imagine that I play tennis every Saturday, and I always invite a friend to come with me.
  • Sometimes my friend shows up, sometimes not.
  • For him, it depends on a variety of factors, such as weather, temperature, humidity, and wind.
  • I start keeping track of these features and whether or not he showed up to play with me.


I want to use this data to predict whether or not he will show up to play.

An intuitive way to do this is through a Decision Tree.


In this tree we have:

  • Nodes
    • Split on the value of a certain attribute
  • Edges
    • The outcome of a split, leading to the next node

  • Root
    • The node that performs the first split
  • Leaves
    • Terminal nodes that predict the outcome


Intuition Behind Splits

Imaginary data with three features (X, Y, and Z) and two possible classes.


Splitting on Y gives us a clear separation between the classes.


We could also have tried splitting on the other features first.


Entropy and information gain are the mathematical methods for choosing the best split; refer to the reading assignment.
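
As a hedged illustration (plain Python with numpy; the tiny label arrays are made up, not from the slides), here is how entropy and information gain score a candidate split:

import numpy as np

def entropy(labels):
    # Shannon entropy: -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Parent entropy minus the size-weighted entropy of the two children.
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

labels = np.array([0, 0, 0, 1, 1, 1])         # labels at the parent node
left, right = labels[:3], labels[3:]          # a split that separates them perfectly
print(information_gain(labels, left, right))  # 1.0, the maximum gain for binary labels

The split with the highest information gain (equivalently, the largest drop in entropy) is the one chosen at each node.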


Random Forests

To improve performance, we can train many trees, each considering only a random sample of the features at each split.

  • A new random sample of features is chosen for every single tree at every single split.
  • Works for both classification and regression tasks!


What's the point?

  • Suppose there is one very strong feature in the data set. Most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated.


  • Averaging highly correlated quantities does not significantly reduce variance.
  • By randomly leaving out candidate features at each split, random forests "decorrelate" the trees, so that the averaging process can reduce the variance of the resulting model.
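
Here is a minimal sketch of what this looks like in Spark ML (the toy DataFrame is an assumption for illustration; numTrees and featureSubsetStrategy are the knobs controlling the ensemble size and the per-split feature sampling):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName('rf_sketch').getOrCreate()

# Toy training data in Spark ML's expected (features, label) format.
train = spark.createDataFrame([
    (Vectors.dense([0.0, 1.0, 0.0]), 0.0),
    (Vectors.dense([1.0, 0.0, 1.0]), 1.0),
    (Vectors.dense([0.0, 1.0, 1.0]), 0.0),
    (Vectors.dense([1.0, 1.0, 1.0]), 1.0),
], ['features', 'label'])

# numTrees: how many decorrelated trees to average;
# featureSubsetStrategy: how many candidate features each split may consider.
rf = RandomForestClassifier(numTrees=100, featureSubsetStrategy='auto')
model = rf.fit(train)
print(model.featureImportances)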


Python and Spark

  • Finally, let’s discuss Gradient Boosted Trees!


  • Gradient boosting involves three elements:
    • A loss function to be optimized.
    • A weak learner to make predictions.
    • An additive model to add weak learners to minimize the loss function.


  • Loss Function
    • A loss function, in basic terms, is the function/equation you use to determine how “far off” your predictions are.
    • For example, regression may use a squared error and classification may use logarithmic loss.
    • We won’t have to deal with this directly when using Spark.
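
Spark handles the loss internally, but as a hedged plain-Python sketch, the two example losses look like this:

import numpy as np

def squared_error(y_true, y_pred):
    # Typical regression loss: the mean of (y - yhat)^2.
    return np.mean((y_true - y_pred) ** 2)

def log_loss(y_true, p_pred, eps=1e-15):
    # Typical binary classification loss; p_pred is the predicted
    # probability of class 1, clipped to avoid log(0).
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(squared_error(np.array([3.0, 5.0]), np.array([2.5, 5.5])))  # 0.25
print(log_loss(np.array([1.0, 0.0]), np.array([0.9, 0.2])))       # ~0.16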


  • Weak Learner
    • Decision trees are used as the weak learner in gradient boosting.
    • It is common to constrain the weak learners, for example with a maximum number of layers, nodes, splits, or leaf nodes.


  • Additive Model
    • Trees are added one at a time, and existing trees in the model are not changed.
    • A gradient descent procedure is used to minimize the loss when adding trees.
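
A minimal sketch of this additive, residual-fitting loop (using scikit-learn's DecisionTreeRegressor as the weak learner and made-up data, purely for illustration; Spark's GBT classes do the equivalent internally):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)  # toy regression target

# Start from a constant prediction, then repeatedly fit a small tree to
# the residuals (the negative gradient of squared-error loss) and add it
# to the ensemble, scaled by a learning rate. Existing trees never change.
prediction = np.full_like(y, y.mean())
learning_rate = 0.1
for _ in range(100):
    residuals = y - prediction                     # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2)      # constrained weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # additive update

print(np.mean((y - prediction) ** 2))  # training loss shrinks as trees are added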


  • So what is the most intuitive way to think about all this if Spark does this all for us?
  • Here is a nice way to think about it, in three “easy” steps:


  1. Train a weak model m using data samples drawn according to some weight distribution.


  2. Increase the weight of samples that are misclassified by model m, and decrease the weight of samples that are classified correctly by model m.


  3. Train the next weak model using samples drawn according to the updated weight distribution.


  • In this way, the algorithm always trains models using data samples that were "difficult" to learn in previous rounds, which results in an ensemble of models that are good at learning different "parts" of the training data.


  • Basically, we are “boosting” the weights of samples that were difficult to get correct.
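
A minimal sketch of this reweighting loop (this sample-weight view is essentially the classic AdaBoost scheme; the decision stumps and toy data are assumptions for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary labels

weights = np.full(len(y), 1 / len(y))    # start from a uniform weight distribution
for _ in range(10):
    # Step 1: train a weak model using the current weight distribution.
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    missed = stump.predict(X) != y
    # Step 2: increase weights of misclassified samples, decrease the rest.
    error = max(np.sum(weights[missed]), 1e-10)
    alpha = 0.5 * np.log((1 - error) / error)
    weights *= np.exp(np.where(missed, alpha, -alpha))
    weights /= weights.sum()             # renormalize to a distribution
    # Step 3: the next round repeats with the updated distribution.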


  • The real details of gradient boosting lie in the mathematics, which you may or may not be ready for, depending on your background.
  • The full details are at the end of Chapter 8 of ISLR!


  • Spark handles all of this “under the hood” for you, so you can use the defaults if you want, or dive into ISLR and begin to play around with the parameters!
  • Let’s continue!


Tree Methods

Documentation Example

Let’s learn something!


Python and Spark

  • Now that we have some intuitive understanding of how these algorithms work, let’s dive into the documentation examples!


  • We’ll work through Decision Tree, Random Forest, and Gradient Boosted Trees; a quick sketch of the shared pattern follows below.
  • We will also expand a little more on the documentation example and show some useful evaluation features!
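
As a hedged preview (the toy data below is made up so the snippet runs end to end; the lectures use the documentation datasets), all three classifiers share the same fit/transform/evaluate pattern:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       RandomForestClassifier,
                                       GBTClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName('tree_methods_sketch').getOrCreate()

# Toy binary-label data so that GBTClassifier (binary only) also works.
data = spark.createDataFrame(
    [(Vectors.dense([float(i % 2), float(i % 3)]), float(i % 2)) for i in range(60)],
    ['features', 'label'])
train, test = data.randomSplit([0.7, 0.3], seed=42)

evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
for clf in (DecisionTreeClassifier(), RandomForestClassifier(numTrees=100), GBTClassifier()):
    model = clf.fit(train)
    print(type(clf).__name__, evaluator.evaluate(model.transform(test)))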


Tree Methods

Code Along


Python and Spark

  • We’ll work through all three Tree Methods discussed and compare their results on a college dataset.
  • The data set has features of universities, each labeled either Private or Public.
  • Let’s get started!
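
As a hedged sketch of the data preparation for such a dataset (the column names here are assumptions for illustration, not necessarily the dataset's actual schema):

from pyspark.ml.feature import StringIndexer, VectorAssembler

# Convert the Private/Public string label into a numeric 'label' column.
indexer = StringIndexer(inputCol='Private', outputCol='label')

# Combine the numeric university features into a single 'features' vector.
assembler = VectorAssembler(
    inputCols=['Apps', 'Accept', 'Enroll'],  # assumed feature columns
    outputCol='features')

# Assuming the data has been loaded into a DataFrame called df:
# df = indexer.fit(df).transform(df)
# df = assembler.transform(df)
# train, test = df.select('features', 'label').randomSplit([0.7, 0.3])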


Tree Methods

Consulting Project

Let’s learn something!


Python and Spark

  • You’ve been contracted by the Purina Dog Food company and flown out to their HQ in St. Louis, Missouri!


  • You've been hired by a dog food company to try to predict why some batches of their dog food are spoiling much quicker than intended!


  • Unfortunately, this dog food company hasn't upgraded to the latest machinery, meaning that the amounts of the five preservative chemicals they use can vary a lot. But which chemical has the strongest effect?


  • The dog food company first mixes up a batch of preservative that contains four different preservative chemicals (A, B, C, and D), and the batch is then completed with a "filler" chemical.


  • The food scientists believe one of the A, B, C, or D preservatives is causing the problem, but they need your help to figure out which one!


  • Use machine learning with a Random Forest to find out which feature has the most predictive power, thus finding out which chemical causes the early spoiling!
  • So create a model and then find out how you can decide which chemical is the problem!


  • Their data looks like this:
  • Pres_A : Percentage of preservative A in the mix
  • Pres_B : Percentage of preservative B in the mix
  • Pres_C : Percentage of preservative C in the mix
  • Pres_D : Percentage of preservative D in the mix
  • Spoiled: Label indicating whether or not the dog food batch was spoiled.


  • Think carefully about what this problem is really asking you to solve.
  • While we will use Machine Learning to solve this, it won't be with your typical train/test split workflow. If this confuses you, skip ahead to the solution code along!


Tree Methods

Consulting Project

Solutions


Python and Spark

  • Our main task was to figure out which preservative chemical (A, B, C, or D) was causing the dog food to spoil.
  • So how can we use machine learning models to solve this?


  • Many machine learning models produce some sort of coefficient value for each feature involved, indicating their “importance” or predictive power.
  • Remember back to the documentation lecture for this section of the course!


  • We mentioned that these tree method classifiers had a .featureImportances attribute available!
  • So we can create a model, fit it on all the data, and then check which feature (preservative) was causing the spoilage!


  • .featureImportances returns:
    • SparseVector(4, {0: 0.0026, 1: 0.0089, 2: 0.9686, 3: 0.0199})
  • Corresponding to a features column like:
    • Row(features=DenseVector([4.0, 2.0, 12.0, 3.0]), Spoiled=1.0)
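
A hedged sketch of that workflow (it assumes the dog food data has been loaded into a Spark DataFrame called df with the columns described earlier):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

assembler = VectorAssembler(
    inputCols=['Pres_A', 'Pres_B', 'Pres_C', 'Pres_D'],
    outputCol='features')
data = assembler.transform(df).select('features', 'Spoiled')

# Fit on ALL the data; there is no train/test split because the goal is
# to inspect the fitted model, not to measure predictive performance.
rf = RandomForestClassifier(labelCol='Spoiled', featuresCol='features')
model = rf.fit(data)
print(model.featureImportances)  # index 2 (Pres_C) dominates, as shown above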


  • There are many different ways to solve this problem, including just using “pure” statistics instead of a machine learning model.


  • Hopefully this consulting project shows how we can apply machine learning in a different way from previous examples.
  • In this case we don’t really care about test/train splits or deployments.


  • What we really want to understand is the fundamental relationship between each feature column and the label itself.