1 of 50

Machine Learning

Introduction to Machine Learning

Created By:

Tom Mattson

Copyright © Tom Mattson. All rights reserved.


2 of 50

Module Overview

Set the foundation for you to explore using machine learning in a responsible and ethical manner

  1. Machine learning concepts, terminology, and processes
  2. The types of research & problems where we can apply machine learning
  3. Ethical issues with machine learning


3 of 50

Explanation or Prediction?

You are conducting research for a human resources department. You obtain data on the Myers-Briggs personality scores along with cultural background for all your employees. You have a final sample size of 2500 observations.

Using all 2,500 observations, you run a linear regression or ANOVA to test the effect of the Myers-Briggs factors on employee performance. You find that extroverts with low power distance are significantly more likely to outperform introverts with high power distance (p < 0.05). You draw a similar conclusion for sensing versus intuition at different levels of uncertainty avoidance (p < 0.01). Your model explains 35% of the historical variance (R² = 0.35).

What type of analysis did you do?
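A minimal sketch of the analysis described above (hypothetical data, not the author's HR dataset), using the statsmodels formula API:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the HR dataset described on the slide.
df = pd.DataFrame({
    "performance":    [3.2, 4.1, 2.8, 4.6, 3.9, 2.5],
    "extraversion":   [0, 1, 0, 1, 1, 0],          # 1 = extrovert
    "power_distance": [4.5, 2.1, 4.8, 1.9, 2.6, 4.2],
})

model = smf.ols("performance ~ extraversion * power_distance", data=df).fit()
print(model.summary())   # conclusions come from the coefficients, p-values, and R-squared
```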


4 of 50

Explanation or Prediction?

When you get a regression result, what values do you interpret to draw your conclusions related to your hypotheses?

This type of analysis is certainly scientifically valuable, but it represents explanatory modelling and not predictive modelling.

In scientific publications and textbooks, we are often careless with our language when reporting regression results.


5 of 50

Explanation or Prediction?

If I were to give you the entire population and you ran a regression, what would a “p-value” less than 0.05 represent?


6 of 50

Explanation or Prediction?


7 of 50

Theory Driven vs Data Driven?

Machine learning is data-driven (bottom-up). There are no a priori hypotheses (top-down), formulated from expert intuition or existing management theories, that we are testing.


8 of 50

Can Machines Learn?

“Field of study that gives computers the ability to learn without being explicitly programmed.” Arthur Samuel (1959)


9 of 50

Can Machines Learn?

Let’s say, I give you a data set with 1000 columns and 9 million rows. I then ask you to find patterns. Would you be able to do it? How would you begin?


10 of 50

What is Machine Learning?

Machine learning is the process of using a variety of algorithms (i.e., classifiers or regressors) to train one or more models that make accurate predictions about some phenomenon, based on the patterns and anomalies in the historical data fed into those algorithms.


11 of 50

What is Machine Learning?

The ML terms regressor (algorithm) and regression (continuous targets) are different from an explanatory OLS regression.

Goal
  Explanatory regression: explain the relationships between variables; there is no attempt to generalize to unseen data.
  Machine learning: make the most accurate predictions possible on unseen data; the relationships between variables are not the focus.

Method
  Explanatory regression: analyze all the data together.
  Machine learning: split the data into training, validation, and testing sets.

Assumptions
  Explanatory regression: linearity, independence, homoscedasticity (equal variance), no multicollinearity, and normally distributed errors.
  Machine learning: many algorithms make no assumptions about the underlying data distribution.

Complexity
  Explanatory regression: simple; not good at modelling complex patterns, which makes interpretation easier.
  Machine learning: ranges from simple (KNN or a decision tree) to highly complex (neural networks); very good with non-linear, complex patterns, which makes interpretation more challenging.

Model Tuning
  Explanatory regression: not done, except possibly adding control variables.
  Machine learning: hyperparameter tuning is almost always required.

In ML, saying that I have a regression problem means that my target is continuous; I use a "regressor" algorithm to train, validate, and test my model.
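A brief illustration (hypothetical toy data) of this terminology: a regressor for a continuous target, a classifier for a categorical one.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 1], [40, 3], [33, 2], [51, 8]]               # features (e.g., age, tenure)

y_continuous = [22_000, 35_500, 28_750, 41_200]        # continuous target -> a "regression" problem
regressor = DecisionTreeRegressor(random_state=0).fit(X, y_continuous)

y_categorical = ["stay", "leave", "stay", "leave"]     # categorical target -> a "classification" problem
classifier = DecisionTreeClassifier(random_state=0).fit(X, y_categorical)

print(regressor.predict([[30, 2]]), classifier.predict([[30, 2]]))
```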


12 of 50

What is Machine Learning?

What are algorithms?

An algorithm is a set of step-by-step instructions or processes constructed from three basic components: 1) sequencing, 2) selection, and 3) iteration.

An algorithm may be used to solve a specific problem, perform a computation, find a pattern, discover anomalies, or construct a machine learning model.

In machine learning, we use machine learning algorithms to construct predictive models.
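A toy sketch (not from the slides) showing the three components in ordinary Python:

```python
def count_above_threshold(values, threshold):
    count = 0                      # sequencing: steps execute in order
    for v in values:               # iteration: repeat a step for every item
        if v > threshold:          # selection: branch on a condition
            count += 1
    return count

print(count_above_threshold([3, 9, 12, 5], threshold=6))   # -> 2
```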


13 of 50

What is Machine Learning?

Different machine learning algorithms include:

Machine learning projects may take several forms: 1) build new algorithms, 2) refine existing algorithms or 3) use existing algorithms to build models to solve business problems.


14 of 50

What is Machine Learning?

What is the difference between a machine learning algorithm and a machine learning model?

We would say "I have a random forest model," meaning that we used the random forest machine learning algorithm to train the model.
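A hedged sketch (hypothetical toy data) of that distinction using scikit-learn:

```python
from sklearn.ensemble import RandomForestClassifier

X_train = [[3, 1], [5, 0], [2, 1], [8, 0]]   # hypothetical features
y_train = [0, 1, 0, 1]                       # hypothetical targets

algorithm = RandomForestClassifier(n_estimators=100, random_state=42)  # the algorithm (and its blueprint)
model = algorithm.fit(X_train, y_train)      # training the algorithm on data produces the model
print(model.predict([[4, 1]]))               # the trained model makes the predictions
```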


15 of 50

What is Machine Learning?

As business analysts, we probably will not build or refine algorithms. Instead, we select an algorithm or multiple algorithms to construct one or more predictive models to solve a specific business problem.

Ensemble Machine Learning Models

“x” represents an algorithm in this case

Depending on the nature of our target (outcome variable), we can employ hard voting or soft voting when we have multiple models (ensemble).
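One way such an ensemble might look in scikit-learn (a hedged sketch with hypothetical data): hard voting takes a majority vote of the predicted classes, while soft voting averages the predicted probabilities.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]   # hypothetical features
y = [0, 0, 0, 1, 1, 1]                                 # hypothetical categorical target

estimators = [("lr", LogisticRegression()),
              ("rf", RandomForestClassifier(random_state=42)),
              ("knn", KNeighborsClassifier(n_neighbors=3))]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)   # majority vote on predicted classes
soft = VotingClassifier(estimators, voting="soft").fit(X, y)   # average of predicted probabilities
print(hard.predict([[3, 3]]), soft.predict([[3, 3]]))
```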


16 of 50

What is Machine Learning?

All machine learning algorithms and resulting models have error. We have two broad categories of model errors:

  1. reducible
  2. irreducible

We attempt to reduce the reducible error component by, among other things, adding new data, modifying hyperparameters, and using different algorithms.

Simple is better than complex.

Complex is better than complicated.

Our goal is to construct the simplest model that is useful for our research question or business problem that reduces the reducible error as much as possible.
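For squared-error loss, the standard textbook decomposition (not taken from the slide) separates the two components:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big] =
  \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big]}_{\text{reducible}}
  + \underbrace{\sigma^2_{\varepsilon}}_{\text{irreducible}}
```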


17 of 50

What is Machine Learning?


18 of 50

What is Machine Learning?


19 of 50

Terminology

Labels: The term has two common uses. First, labels are the codes assigned to each observation (often by human labelers); used this way, the labels become the targets. Second, labels are the values or categories that a machine learning model assigns to examples. In classification, the labels are categories (customer or not, employee or not); in regression, the labels are real numbers (e.g., 22K in profit).

Targets: The correct labels for each example. Conceptually similar to the outcome variable (i.e., the dependent variable) in a regression. The model should correctly predict the target.

There are no labels and targets in unsupervised machine learning.

Examples: Items, instances, rows, and observations used for training or evaluation.

Features or Dimensions: Set of attributes or properties of your data (i.e., the columns).
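A small illustration (hypothetical data) of how these terms map onto a table:

```python
import pandas as pd

df = pd.DataFrame({
    "age":            [29, 41, 35],   # feature (column / dimension)
    "tenure_years":   [2, 10, 6],     # feature
    "high_performer": [0, 1, 1],      # target: the correct label for each example
})
# Each row of df is an example (item, instance, observation).
X = df[["age", "tenure_years"]]       # features
y = df["high_performer"]              # targets (the labels we want the model to predict)
print(X.shape, y.shape)
```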


20 of 50

Terminology

This image is one example; we probably have 5K other examples to train our model and 1K more to test it.

We want a high percentage of our final ML model's labels to match the targets!

Target: Cat

In many instances, we have human labelers to label each example. What happens if the labelers disagree?


21 of 50

Terminology

Column Vectors

Row Vectors
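A small NumPy sketch (not from the slide) to make the distinction concrete:

```python
import numpy as np

column_vector = np.array([[1], [2], [3]])   # shape (3, 1): one column, e.g. one feature across examples
row_vector = np.array([[1, 2, 3]])          # shape (1, 3): one row, e.g. one example across features
print(column_vector.shape, row_vector.shape)
```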


22 of 50

Terminology

Variables are conceptually similar to targets, features, or dimensions: a variable is something observable or measurable. An employee's education level is one variable, and where the employee went to school is another. The Hofstede cultural dimensions of an individual or country are variables. They vary from example (observation) to example.

A parameter is a numerical property of the model (not of the example or the data point).
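A hedged sketch (hypothetical data): after fitting a linear regression, the coefficients and intercept are parameters of the model, not of any single example.

```python
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]               # a feature that varies from example to example
y = [3.1, 4.9, 7.2, 8.8]
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # parameters: numerical properties of the fitted model
```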


23 of 50

Terminology


24 of 50

Terminology


25 of 50

Unsupervised Machine Learning Workflow

The workflow will vary for supervised versus unsupervised machine learning projects. Both start with understanding the business context and problem. Both also require data acquisition, cleaning, and preparation.

Typical flow: Data Acquisition → Data Cleaning → Output → Evaluation → Deployment → Monitor

After evaluation, we could go back to the beginning, tune the algorithm, select a different algorithm, or deploy the solution if the solution is "good enough".

Notice how there is typically no train-test split.
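A minimal sketch (hypothetical data) of that flow for a clustering problem; note that the whole dataset is used, with no train-test split:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = [[25, 40_000], [47, 90_000], [31, 52_000], [52, 98_000], [29, 47_000]]   # e.g. age, salary
X_scaled = StandardScaler().fit_transform(X)     # cleaning / preparation

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
print(kmeans.labels_)   # the output we evaluate (e.g., with a silhouette score) before deploying
```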


26 of 50

Supervised Machine Learning Workflow


27 of 50

Data Collection

  • We often use historical or archival data, but that need not always be the case. We may certainly use our own data from surveys or experiments.
  • Some algorithms require more data than others.

Bad data are no better than no data


28 of 50

Data Cleaning & Preparation

Data wrangling or data munging refer to the tasks required to get data ready for analyses. We spend considerable time performing these tasks.


29 of 50

Data Cleaning & Preparation

Unit of analysis might be individuals, organizations, transactions, or countries.

We should not have the same observation appear multiple times in our table.
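A hedged sketch (hypothetical columns): if the unit of analysis is the employee, each employee should appear only once.

```python
import pandas as pd

df = pd.DataFrame({"employee_id": [101, 102, 102, 103],
                   "age":         [29, 41, 41, 35]})
df = df.drop_duplicates(subset="employee_id", keep="first")
print(df)   # employee 102 now appears only once
```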


30 of 50

Data Cleaning & Preparation

We might have to standardize our numerical data, dummy code our categorical data, create an n-gram matrix for our text data, or calculate time-to-target for date features.

Image Data
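A hedged sketch (hypothetical data) of the preparation steps named above for numeric, categorical, date, and text features:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"salary":  [40_000, 90_000, 52_000],
                   "country": ["US", "DE", "US"],
                   "hired":   pd.to_datetime(["2020-01-15", "2018-06-01", "2021-03-20"])})

df["salary_std"] = StandardScaler().fit_transform(df[["salary"]]).ravel()    # standardize numeric data
df = pd.get_dummies(df, columns=["country"])                                 # dummy code categorical data
df["days_employed"] = (pd.Timestamp("2024-01-01") - df["hired"]).dt.days     # time-based feature from dates

ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(
    ["great place to work", "hard place to work"])                           # n-gram matrix for text data
print(df.head(), ngrams.shape)
```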


31 of 50

Data Cleaning & Preparation


32 of 50

Data Cleaning & Preparation


33 of 50

Data Cleaning & Preparation

Feature reduction (form of feature engineering) using principal component analysis (PCA)
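A minimal PCA sketch (hypothetical data): compress many features into a few components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))                        # pretend: 100 examples, 10 features

X_scaled = StandardScaler().fit_transform(X)          # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=3).fit(X_scaled)
X_reduced = pca.transform(X_scaled)                   # engineered features: 3 principal components
print(X_reduced.shape, pca.explained_variance_ratio_)
```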


34 of 50

Data Cleaning & Preparation

We could perform an unsupervised cluster analysis to engineer new features

Note: An unsupervised cluster analysis may be a result (e.g., segment customers) and/or used to engineer features to be used in a supervised machine learning problem.
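A hedged sketch (hypothetical data) of using cluster membership as an engineered feature:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"age":    [25, 47, 31, 52, 29],
                   "salary": [40_000, 90_000, 52_000, 98_000, 47_000]})
df["segment"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(df[["age", "salary"]])
print(df)   # "segment" can be a result on its own and/or a feature for a supervised model
```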


35 of 50

Data Cleaning & Preparation


36 of 50

Data Cleaning & Preparation

Target leakage happens when you train your model on a dataset that includes features or information that would not be available at the time of prediction (i.e., you train your model on data that can only be collected in the future). For example, when predicting employee attrition, an exit-interview score leaks the target because it exists only after an employee has already left.

During the data preparation phase, we must make sure that we do not have any target leakage.


37 of 50

Data Cleaning & Preparation

Are the outliers real (but inconvenient) observations? If so, we cannot remove them.

Are the outliers “bad” data (i.e., errors in the data)? If so, we can remove them.


38 of 50

Model Training (Supervised)

Now that our data are clean and prepared, we are finally ready to train a model!

In the training step, we first randomly split our data into a training and a testing data set. The model is developed using only the training data. We don’t touch the test data set until we are ready to see how accurate our trained model is.
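A minimal sketch (hypothetical data) of that random split with scikit-learn:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]        # hypothetical features
y = [i % 2 for i in range(100)]      # hypothetical targets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train (and tune) with X_train/y_train only; hold X_test/y_test back for the final evaluation.
print(len(X_train), len(X_test))
```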


39 of 50

Model Training (Supervised)

Instead of relying on a single training/validation split, we can use cross-validation, which is sometimes referred to as rotation estimation or out-of-sample testing.

  • Train the model on folds 1 to 4, validate on fold 5.
  • Train a new model on folds 1-3 & 5, validate on fold 4.
  • Repeat so that each fold is used once for validation.

(Diagram: the training data are divided into folds; the testing data remain held out.)
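A minimal sketch (hypothetical data) of 5-fold cross-validation on the training data:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_train = [[i, (i * 3) % 10] for i in range(100)]   # hypothetical features
y_train = [i % 2 for i in range(100)]               # hypothetical targets

scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_train, y_train, cv=5)
print(scores, scores.mean())   # one accuracy per validation fold, then the average
```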


40 of 50

Model Training (Supervised)

We cannot have the same observation in both training and testing. If this happens, the entire process is invalidated ☹

We refer to this type of contamination as data leakage: an observation leaks its way into both training and testing. The two data sets must be independent (and randomly created).

Observation "Mattson" sneaks his way into training & testing, which creates data leakage.

It is the conceptual equivalent of poison for machine learning: one small drop can pollute the entire body of water.
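A quick, hedged check (hypothetical observation IDs) for this kind of leakage:

```python
train_ids = {"Alice", "Bob", "Chen", "Mattson"}   # hypothetical observation IDs in training
test_ids = {"Dana", "Mattson", "Eve"}             # hypothetical observation IDs in testing

overlap = train_ids & test_ids
if overlap:
    print(f"Data leakage: {overlap} appear in both training and testing")
```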


41 of 50

Model Training (Supervised)

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 25, 'min_samples_split': 75, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': None, 'splitter': 'best'}

We must create a blueprint that tells the algorithm what constraints it must work within when training the model.

These are the model hyperparameters. As analysts and researchers, we can set these hyperparameters manually or use an algorithm to help us select the best (optimal) hyperparameters.

Sample Hyperparameters & Blueprint for machine learning algorithm
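A hedged sketch (hypothetical data and grid) of letting an algorithm search for the hyperparameters instead of setting them manually:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X = [[i, (i * 7) % 13] for i in range(60)]   # hypothetical features
y = [i % 2 for i in range(60)]               # hypothetical targets

grid = {"max_depth": [4, 8, 12], "min_samples_leaf": [5, 25]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), grid, cv=5).fit(X, y)
print(search.best_params_)   # the blueprint the final model is trained under
```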


42 of 50

Model Evaluation (Supervised)

Now that we have trained and validated a model, we are ready to test it using the uncontaminated (unseen) testing sample.

We want a model that balances bias and variance.


43 of 50

Model Evaluation (Supervised)

An overfitted model is a model that fits the training data very well but does not generalize to other samples including our testing sample.

Low bias (high accuracy) in training but a large gap in performance between training and testing is bad!
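A hedged sketch (deliberately noisy, hypothetical data) of how an overfitted model shows up as a train/test gap:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)             # pure noise: there is nothing genuine to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
deep_tree = DecisionTreeClassifier(max_depth=None).fit(X_tr, y_tr)   # unconstrained: memorizes the training data
print(deep_tree.score(X_tr, y_tr))   # near 1.0 in training (low bias)
print(deep_tree.score(X_te, y_te))   # near 0.5 in testing: the model does not generalize
```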


44 of 50

Model Evaluation (Supervised)

Ways to Reduce High Bias

Ways to Reduce High Variance


45 of 50

Model Evaluation (Supervised)

Continuous targets but also good for categorical targets
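A hedged sketch (hypothetical values) of metrics commonly reported for continuous targets:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [22.0, 35.5, 28.8, 41.2]   # hypothetical actual values
y_pred = [24.1, 33.0, 30.2, 39.5]   # hypothetical predictions

print(mean_absolute_error(y_true, y_pred))   # MAE
print(mean_squared_error(y_true, y_pred))    # MSE (take the square root for RMSE)
print(r2_score(y_true, y_pred))              # R-squared
```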


46 of 50

Model Evaluation (Supervised)

Type 1 Error

Type 2 Error

Categorical targets
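A hedged sketch (hypothetical labels) of a confusion matrix, where false positives are Type 1 errors and false negatives are Type 2 errors:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical targets
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives (Type 1 errors): {fp}")
print(f"false negatives (Type 2 errors): {fn}")
```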


47 of 50

Model Evaluation (Supervised)

You construct a model to predict employee well-being based on cultural values and work-life balance features. You use the random forest algorithm to construct an ensemble model that was 72% accurate in training, 60% accurate in validation, and 55% accurate in testing.

What additional information is needed?

If I say that the minimum accuracy threshold for the model to be acceptable is 65% (or 80%), what are your next steps?


48 of 50

Model Evaluation (Supervised)


49 of 50

Model Evaluation (Supervised)

1. Relevance: Chosen solution addresses a genuine problem.

2. Representativeness: Data are representative of the scenarios where the models will be deployed.

3. Value: Machine learning models should offer practical benefits over traditional methods.

4. Explainability: Machine learning models are interpretable and understandable to end-users.

5. Auditability: Machine learning models and processes should be verifiable and replicable by outsiders.

6. Equity: Machine learning models should be fair and not disproportionately benefit or harm certain demographics.

7. Accountability/Responsibility: Ensuring there are mechanisms in place for accountability and handling grievances related to machine learning predictions.


50 of 50

Model Deployment (Supervised)

  • Develop detailed instructions outlining how the model is to be used (and how it should not be used)
  • Clearly articulate to the users how the model was trained, validated, and tested
  • Determine key performance indicators and monitor model performance against them
  • If performance drops below a threshold, identify and communicate the process for taking the model offline
  • Determine and implement a model re-training schedule
  • Develop documentation to "explain" in plain English how the model makes its predictions
  • Monitor the use cases the model is applied to, to ensure it is not having negative (possibly unethical) consequences
