1 of 50

Machine Learning

Introduction to Machine Learning

Created By:

Tom Mattson

2 of 50

Module Overview

Set the foundation for you to explore using machine learning in a responsible and ethical manner

Machine learning concepts, terminology, and processes
The types of research & problems where we can apply machine learning
Ethical issues with machine learning

3 of 50

Explanation or Prediction?

You are conducting research for a human resources department. You obtain data on the Myers-Briggs personality scores along with cultural background for all your employees. You have a final sample size of 2500 observations.

Using all 2500 observations, you run a linear regression or ANOVA to test the effect of the Myers-Briggs factors on employee performance. You find that extroverts with low power distance are significantly more likely to outperform introverts with high power distance (p<0.05). You draw a similar conclusion for sensing versus intuition based on different levels of uncertainty avoidance (p<0.01). Your model explains 35% (R²) of the historical variance.

What type of analysis did you do?

4 of 50

Explanation or Prediction?

When you get a regression result, what values do you interpret to draw your conclusions related to your hypotheses?

This type of analysis is certainly scientifically valuable, but it represents explanatory modelling and not predictive modelling.

In scientific publications and textbooks, we are often careless with our language when reporting regression results.

5 of 50

Explanation or Prediction?

If I were to give you the entire population and you ran a regression, what would a “p-value” less than 0.05 represent?

6 of 50

Explanation or Prediction?

7 of 50

Theory Driven vs Data Driven?

Machine learning is data-driven (bottom-up). We are not testing a priori hypotheses (top-down) formulated from expert intuition or existing management theories.

8 of 50

Can Machines Learn?

“Field of study that gives computers the ability to learn without being explicitly programmed.” Arthur Samuel (1959)

9 of 50

Can Machines Learn?

Let’s say I give you a data set with 1000 columns and 9 million rows and then ask you to find patterns. Would you be able to do it? How would you begin?

10 of 50

What is Machine Learning?

Machine learning is the process of using algorithms (e.g., classifiers or regressors) to train one or more models that make accurate predictions about some phenomenon, based on the patterns and anomalies in the historical data fed into those algorithms.

11 of 50

What is Machine Learning?

The ML terms regressor (algorithm) and regression (continuous targets) are different from an explanatory OLS regression.

	Explanatory Regression	Machine Learning
Goal	Explain the relationships between variables; no attempt to generalize to unseen data	Make the most accurate predictions possible on unseen data; not concerned with the relationships between variables
Method	Analyze all data together	Split data into training, validation, and testing sets
Assumptions	Linearity, independence, homoscedasticity (equal variance), no multicollinearity, and normally distributed errors	Many algorithms make no assumptions about the underlying data distribution
Complexity	Simple; not good at modelling complex patterns, which makes interpretability easier	May range from simple (KNN or decision tree) to highly complex (neural networks); very good with non-linear, complex patterns, which makes interpretability more challenging
Model Tuning	Not done, except possibly adding control variables	Hyperparameter tuning is almost always required

In ML, if I say that I have a regression problem, then that means that my target is continuous. I use a “regressor” algorithm to train, validate, and test my model when the target is continuous.

12 of 50

What is Machine Learning?

An algorithm is a set of step-by-step instructions or processes constructed with three basic components: 1) sequencing, 2) selection, and 3) iteration.

An algorithm may be used to solve a specific problem, to perform a computation, find a pattern, discover anomalies, or construct a machine learning model.

In machine learning, we use machine learning algorithms to construct predictive models.

What are algorithms?

13 of 50

What is Machine Learning?

Different machine learning algorithms include:

Machine learning projects may take several forms: 1) build new algorithms, 2) refine existing algorithms or 3) use existing algorithms to build models to solve business problems.

14 of 50

What is Machine Learning?

What is the difference between a machine learning algorithm and a machine learning model?

When we say that we have a random forest model, we mean that we used the random forest machine learning algorithm to train a model.

15 of 50

What is Machine Learning?

As business analysts, we probably will not build or refine algorithms. Instead, we select an algorithm or multiple algorithms to construct one or more predictive models to solve a specific business problem.

Ensemble Machine Learning Models

“x” represents an algorithm in this case

Depending on the nature of our target (outcome variable), we can employ hard voting or soft voting when we have multiple models (ensemble).
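
A minimal sketch of the two voting schemes for a categorical target, using only the Python standard library. The three models, their labels ("stay", "leave"), and their probabilities are hypothetical values chosen for illustration; they do not come from the slides.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most models (majority vote)."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(predicted_probabilities):
    """Average each class probability across models and return the top class."""
    classes = predicted_probabilities[0].keys()
    n_models = len(predicted_probabilities)
    averages = {
        c: sum(p[c] for p in predicted_probabilities) / n_models for c in classes
    }
    return max(averages, key=averages.get)


# Hypothetical outputs from three models for a single employee.
labels = ["stay", "leave", "stay"]
probabilities = [
    {"stay": 0.55, "leave": 0.45},
    {"stay": 0.10, "leave": 0.90},
    {"stay": 0.60, "leave": 0.40},
]

hard_result = hard_vote(labels)         # two of three models say "stay"
soft_result = soft_vote(probabilities)  # one very confident model tips the average
print(hard_result, soft_result)
```

In this made-up case the two schemes disagree, which is one reason the choice between them deserves some thought.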

16 of 50

What is Machine Learning?

All machine learning algorithms and resulting models have error. We have two broad categories of model errors:

reducible
irreducible

We attempt to reduce the reducible error component by, among other things, adding new data, tuning hyperparameters, and trying different algorithms.
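
One common way to formalize the two categories is the bias-variance decomposition. The notation is an assumption and not from the slides: squared-error loss, a target generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and a trained model $\hat{f}$.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible}}
```

Under these assumptions, the first two terms make up the reducible error; the noise term cannot be removed by any choice of algorithm or hyperparameters.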

Simple is better than complex.

Complex is better than complicated.

Our goal is to construct the simplest model that is useful for our research question or business problem that reduces the reducible error as much as possible.

17 of 50

What is Machine Learning?

18 of 50

What is Machine Learning?

19 of 50

Terminology

Labels: Two common uses. Labels are the codes assigned to each observation. In this sense, the labels become the targets. They are also the values or categories assigned to examples by a machine learning model. In classification, the labels are the categories (customer or not, employee or not). In regression, the labels are the real numbers (22K in profit).

Targets: The correct labels for each example. Conceptually similar to the outcome variable (e.g., the dependent variable) in a regression. Model should correctly predict the target.

There are no labels or targets in unsupervised machine learning.

Examples: Items, instances, rows, and observations used for training or evaluation.

Features or Dimensions: Set of attributes or properties of your data (i.e., the columns).

20 of 50

Terminology

This image is one example. We probably have 5K other examples to train our model and 1K other examples to test our model.

We want our final ML models to produce labels that match the targets for a high percentage of examples!

Target

Cat

In many instances, we have human labelers to label each example. What happens if the labelers disagree?

21 of 50

Terminology

Column Vectors

Row Vectors

22 of 50

Terminology

Variables are conceptually similar to targets, features, or dimensions. A variable is something that is observable or measurable. An employee’s education level is a variable, and where the employee went to school is another variable. The Hofstede cultural dimensions of an individual or country are variables. Their values vary from one example (observation) to the next.

A parameter is a numerical property of the model (not of the example or the data point).
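
A small sketch of the distinction, assuming a hypothetical one-feature linear model. The function name, the feature `years_experience`, and the numeric values are illustrative only.

```python
def predict_performance(years_experience, intercept, slope):
    """Linear model: the prediction depends on one variable and two parameters."""
    return intercept + slope * years_experience


# Parameters belong to the model: they stay fixed across all examples.
intercept, slope = 2.0, 0.5  # hypothetical values a training step might produce

# The variable belongs to the data: it changes from example to example.
years = [1, 4, 10]
predictions = [predict_performance(x, intercept, slope) for x in years]
print(predictions)
```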

23 of 50

Terminology

24 of 50

Terminology

25 of 50

Unsupervised Machine Learning Workflow

The workflow will vary for supervised versus unsupervised machine learning projects. Both start with understanding the business context and problem. Both also require data acquisition, cleaning, and preparation.

Data Acquisition

Data Cleaning

Output

Evaluation

Deployment

After evaluation, we could go back to the beginning, tune the algorithm, select a different algorithm, or deploy the solution if the solution is “good enough”.

Monitor

Notice that there is typically no train-test split.

26 of 50

Supervised Machine Learning Workflow

27 of 50

Data Collection

We often use historical or archival data, but that need not always be the case. We may certainly use our own data from surveys or experiments.
Some algorithms require more data than others.

Bad data are no better than no data

28 of 50

Data Cleaning & Preparation

Data wrangling and data munging both refer to the tasks required to get data ready for analysis. We spend considerable time performing these tasks.

29 of 50

Data Cleaning & Preparation

Unit of analysis might be individuals, organizations, transactions, or countries.

We should not have the same observation appear multiple times in our table.
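
A minimal sketch of removing repeated observations, assuming each record carries an identifier for the unit of analysis. The field name `employee_id` and the records are hypothetical.

```python
# Hypothetical records; "employee_id" identifies the unit of analysis.
records = [
    {"employee_id": 101, "score": 3.2},
    {"employee_id": 102, "score": 4.1},
    {"employee_id": 101, "score": 3.2},  # the same observation appears again
]

seen_ids = set()
deduplicated = []
for record in records:
    # Keep only the first occurrence of each unit of analysis.
    if record["employee_id"] not in seen_ids:
        seen_ids.add(record["employee_id"])
        deduplicated.append(record)

print(len(records), "->", len(deduplicated))
```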

30 of 50

Data Cleaning & Preparation

We might have to standardize our numerical data, dummy code our categorical data, create an n-gram matrix for our text data, or calculate time to target for date features.
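
The four preparation steps can be sketched with the standard library alone. The helper names and sample values are assumptions made for illustration, and a real project would more likely rely on a data-preparation library.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def standardize(values):
    """Rescale numbers to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def dummy_code(values):
    """Turn categories into 0/1 columns, dropping the first level as baseline."""
    levels = sorted(set(values))[1:]
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]


def bigram_counts(text):
    """Count adjacent word pairs (2-grams) in a piece of text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def days_to_target(start, target):
    """Convert a date feature into the number of days until a target date."""
    return (target - start).days


z_scores = standardize([10, 20, 30])
dummies = dummy_code(["HR", "IT", "HR"])
bigrams = bigram_counts("the model learns the model")
days = days_to_target(date(2024, 1, 1), date(2024, 1, 31))
```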

Image Data

31 of 50

Data Cleaning & Preparation

32 of 50

Data Cleaning & Preparation

33 of 50

Data Cleaning & Preparation

Feature reduction (form of feature engineering) using principal component analysis (PCA)

34 of 50

Data Cleaning & Preparation

We could perform an unsupervised cluster analysis to engineer new features

Note: An unsupervised cluster analysis may be a result (e.g., segment customers) and/or used to engineer features to be used in a supervised machine learning problem.

35 of 50

Data Cleaning & Preparation

36 of 50

Data Cleaning & Preparation

Target leakage happens when you train your model on a dataset that includes features or information that would not be available at the time of prediction (i.e., you train your model on data that can only be collected in the future).

During the data preparation phase, we must make sure that we do not have any target leakage.
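
One simple guard is to list which features exist at prediction time and drop everything else. The sketch below assumes a hypothetical attrition model; the feature names, including `exit_interview_score`, are invented for illustration.

```python
# Hypothetical attrition example: an exit interview only happens after an
# employee has decided to leave, so its score cannot exist at prediction time.
available_at_prediction = {"tenure_years", "job_level", "commute_minutes"}

candidate_features = [
    "tenure_years",
    "job_level",
    "exit_interview_score",
    "commute_minutes",
]

# Anything not available at prediction time is a potential source of leakage.
leaky_features = [f for f in candidate_features if f not in available_at_prediction]
safe_features = [f for f in candidate_features if f in available_at_prediction]

print("drop:", leaky_features)
print("keep:", safe_features)
```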

37 of 50

Data Cleaning & Preparation

Are the outliers real (but inconvenient) observations? If so, we cannot remove them.

Are the outliers “bad” data (i.e., errors in the data)? If so, we can remove them.
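
Before deciding, we first need to find candidate outliers. A minimal sketch using the common 1.5 × IQR rule follows; the rule and the sample values are assumptions rather than something the slides prescribe. The code only flags values, and whether a flagged value is real or an error remains the analyst's judgment.

```python
from statistics import quantiles


def flag_outliers(values, multiplier=1.5):
    """Return values outside [Q1 - m*IQR, Q3 + m*IQR] for manual review."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical weekly hours worked; 120 may be a typo or a real extreme.
hours = [48, 50, 51, 52, 53, 55, 120]
flagged = flag_outliers(hours)
print(flagged)
```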

38 of 50

Model Training (Supervised)

Now that our data are clean and prepared, we are finally ready to train a model!

In the training step, we first randomly split our data into a training and a testing data set. The model is developed using only the training data. We don’t touch the test data set until we are ready to see how accurate our trained model is.
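
A minimal sketch of a random split, using only the standard library. The function name `random_split`, the 80/20 ratio, and the seed are assumptions chosen for illustration.

```python
import random


def random_split(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(2500))  # stand-ins for 2500 observations
train, test = random_split(examples)
print(len(train), len(test))
```

Each observation lands in exactly one of the two sets, so the test set stays unseen until evaluation.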

39 of 50

Model Training (Supervised)

Instead of relying on a single training data set, we can use a process of cross-validation, which is sometimes referred to as rotation estimation or out-of-sample testing.

Train the model on folds 1 to 4, validate on fold 5.
Train a new model using folds 1-3 & 5, validate on fold 4.
Repeat such that each fold is used once for validation purposes.
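
The rotation above can be sketched as a generator of index splits. This is an illustrative standard-library version, assuming five folds applied to the training portion of the data; the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (training_indices, validation_indices) once for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(20, k=5))
for training, validation in splits:
    print(len(training), "train /", len(validation), "validate")
```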

Training Data

Testing Data

40 of 50

Model Training (Supervised)

We cannot have the same observation in both training and testing. If this happens, the entire process is invalidated ☹

We refer to this type of contamination as data leakage. That is, an observation leaks its way into both training and testing. The two data sets must be independent (and randomly created).

Observation “Mattson” sneaks his way into training & testing, which creates data leakage.

It is the conceptual equivalent to poison for machine learning. One small drop can pollute the entire body of water.
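
A quick check for this kind of contamination is to intersect the identifiers of the two sets. The names below are hypothetical, apart from "Mattson", which echoes the example above.

```python
def find_leaked(train_ids, test_ids):
    """Return identifiers that appear in both the training and testing sets."""
    return set(train_ids) & set(test_ids)


train_ids = ["Adams", "Mattson", "Nguyen"]
test_ids = ["Mattson", "Okafor"]

leaked = find_leaked(train_ids, test_ids)
if leaked:
    print("Data leakage detected:", sorted(leaked))
```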

41 of 50

Model Training (Supervised)

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 25, 'min_samples_split': 75, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': None, 'splitter': 'best'}

We must create a blueprint that tells the algorithm the constraints under which it may train the model.

These are the model hyperparameters. As analysts and researchers, we can set these hyperparameters manually or use an algorithm to help us select the best (optimal) hyperparameters.
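
One simple selection procedure is a grid search: try every combination of candidate values and keep the best-scoring one. The sketch below borrows two hyperparameter names from the sample blueprint, but the candidate values are assumptions and `toy_validation_score` is a placeholder. In practice the score would come from training a model with those settings and measuring it on validation data.

```python
from itertools import product


def grid_search(grid, score):
    """Try every combination in `grid`; return the best parameters and score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical candidate values for two decision tree hyperparameters.
grid = {"max_depth": [4, 8, 12], "min_samples_leaf": [10, 25, 50]}


def toy_validation_score(params):
    """Placeholder score (not a real model) that peaks at depth 8, leaf size 25."""
    depth_penalty = abs(params["max_depth"] - 8) / 100
    leaf_penalty = abs(params["min_samples_leaf"] - 25) / 1000
    return 1.0 - depth_penalty - leaf_penalty


best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params)
```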

Sample Hyperparameters & Blueprint for machine learning algorithm

42 of 50

Model Evaluation (Supervised)

Now that we have trained and validated a model, we are ready to test the model using the uncontaminated (unseen) testing sample.

We want a model that balances bias and variance.

43 of 50

Model Evaluation (Supervised)

An overfitted model is a model that fits the training data very well but does not generalize to other samples including our testing sample.

Low bias (high accuracy) in training combined with a large gap in performance between training and testing (high variance) is bad!

44 of 50

Model Evaluation (Supervised)

Ways to Reduce High Bias

Ways to Reduce High Variance

45 of 50

Model Evaluation (Supervised)

Continuous targets but also good for categorical targets

46 of 50

Model Evaluation (Supervised)

Type 1 Error

Type 2 Error

Categorical targets
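
For a two-class target, the two error types can be counted directly by comparing predictions with targets. The sketch below assumes "cat" is the positive class and uses invented predictions; a Type 1 error is a false positive and a Type 2 error is a false negative.

```python
def confusion_counts(targets, predictions, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for target, predicted in zip(targets, predictions):
        if predicted == positive and target == positive:
            tp += 1
        elif predicted == positive:
            fp += 1  # Type 1 error: predicted positive, actually negative
        elif target == positive:
            fn += 1  # Type 2 error: predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical targets and model predictions for six images.
targets = ["cat", "cat", "dog", "dog", "cat", "dog"]
predictions = ["cat", "dog", "dog", "cat", "cat", "dog"]

tp, fp, fn, tn = confusion_counts(targets, predictions, positive="cat")
accuracy = (tp + tn) / len(targets)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```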

47 of 50

Model Evaluation (Supervised)

You construct a model to predict employee well-being based on cultural values and work-life balance features. You use the random forest algorithm to construct an ensemble model that was 72% accurate in training, 60% accurate in validation, and 55% accurate in testing.

What additional information is needed?

If I say that your minimum threshold for accuracy for the model to be acceptable is 65% or 80%, what are your next steps?
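
As a starting point for the discussion, the numbers in this scenario can be compared directly. The sketch only computes the gaps and checks the two thresholds; it does not decide what an acceptable gap is, which is left as a judgment call.

```python
# Accuracies from the scenario above.
accuracy = {"training": 0.72, "validation": 0.60, "testing": 0.55}

# A large drop from training to testing suggests overfitting.
train_test_gap = accuracy["training"] - accuracy["testing"]
train_validation_gap = accuracy["training"] - accuracy["validation"]


def meets_threshold(test_accuracy, threshold):
    """Return True when test accuracy reaches the minimum acceptable level."""
    return test_accuracy >= threshold


acceptable_at_65 = meets_threshold(accuracy["testing"], 0.65)
acceptable_at_80 = meets_threshold(accuracy["testing"], 0.80)
print(round(train_test_gap, 2), acceptable_at_65, acceptable_at_80)
```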

48 of 50

Model Evaluation (Supervised)

49 of 50

Model Evaluation (Supervised)

1. Relevance: Chosen solution addresses a genuine problem.

2. Representativeness: Data are representative of the scenarios where the models will be deployed.

3. Value: Machine learning models should offer practical benefits over traditional methods.

4. Explainability: Machine learning models are interpretable and understandable to end-users.

5. Auditability: Machine learning models and processes should be verifiable and replicable by outsiders.

6. Equity: Machine learning models should be fair and not disproportionately benefit or harm certain demographics.

7. Accountability/Responsibility: Ensuring there are mechanisms in place for accountability and handling grievances related to machine learning predictions.

50 of 50

Model Deployment (Supervised)

Develop detailed instructions outlining how the model is to be used (and how the model should not be used)
Clearly articulate to the users how the model was trained, validated, and tested
Monitor performance and determine key performance indicators of model performance
If performance drops below a threshold, identify and communicate the process to take the model offline
Determine and implement a model re-training schedule
Develop documentation to “explain” in plain English how the model makes its predictions
Monitor the use cases in which the model is being used to ensure that the model is not having negative (possibly unethical) consequences
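
The threshold check in the list above can be sketched as a small monitoring rule. The function name, the three-period window, the 0.65 threshold, and the weekly accuracies are all hypothetical; the right indicator and threshold depend on the business problem.

```python
def should_take_offline(recent_scores, threshold, window=3):
    """Flag the model when the mean of the last `window` scores is below threshold."""
    recent = recent_scores[-window:]
    return sum(recent) / len(recent) < threshold


# Hypothetical weekly accuracy of a deployed model.
weekly_accuracy = [0.71, 0.70, 0.66, 0.62, 0.60]

flag = should_take_offline(weekly_accuracy, threshold=0.65)
if flag:
    print("Performance below threshold: start the offline or re-training process")
```

Averaging over a short window, rather than reacting to a single bad week, is one way to avoid taking the model offline because of noise.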
