Machine Learning
Introduction to Machine Learning
Created By: Tom Mattson
Copyright © Tom Mattson. All rights reserved.
Module Overview
Set the foundation for you to explore using machine learning in a responsible and ethical manner
Explanation or Prediction?
You are conducting research for a human resources department. You obtain data on the Myers-Briggs personality scores along with cultural background for all your employees. You have a final sample size of 2500 observations.
Using all 2500 observations, you run a linear regression or ANOVA to test the effect of the Myers-Briggs factors on employee performance. You find that extroverts with low power distance have a significantly greater likelihood of outperforming introverts with high power distance (p<0.05). You draw a similar conclusion for sensing versus intuition based on different levels of uncertainty avoidance (p<0.01). Your model explains 35% (R²) of the historical variance.
What type of analysis did you do?
Explanation or Prediction?
When you get a regression result, what values do you interpret to draw your conclusions related to your hypotheses?
This type of analysis is certainly scientifically valuable, but it represents explanatory modelling and not predictive modelling.
In scientific publications and textbooks, we are often careless with our language when reporting regression results.
Explanation or Prediction?
If I were to give you the entire population and you ran a regression, what would a “p-value” less than 0.05 represent?
Theory Driven vs Data Driven?
Machine learning is data-driven (bottom-up). We are not testing a priori hypotheses (top-down) formulated from expert intuition or existing management theories.
Can Machines Learn?
“Field of study that gives computers the ability to learn without being explicitly programmed.” Arthur Samuel (1959)
Can Machines Learn?
Let’s say I give you a data set with 1000 columns and 9 million rows and ask you to find patterns. Would you be able to do it? How would you begin?
What is Machine Learning?
Machine learning is the process of using a variety of different algorithms (i.e., classifiers or regressors) to train a model or models to make accurate predictions about some phenomenon based on the patterns and anomalies in the historical data that get fed into the different algorithms.
What is Machine Learning?
The ML terms regressor (algorithm) and regression (continuous targets) are different from an explanatory OLS regression.
| Aspect | Explanatory Regression | Machine Learning |
| --- | --- | --- |
| Goal | Explain the relationship between variables; not trying to generalize to unseen data | Make the most accurate predictions possible on unseen data; the relationship between variables is not the focus |
| Method | Analyze all data together | Split data into training, validation, and testing sets |
| Assumptions | Linearity, independence, homoscedasticity (equal variance), no multicollinearity, and normally distributed errors | Many algorithms make no assumptions about the underlying data distribution |
| Complexity | Simple; not good at modelling complex patterns, which makes interpretability easier | May range from simple (KNN or decision tree) to highly complex (neural networks); very good with non-linear, complex patterns, which makes interpretability more challenging |
| Model Tuning | Not done, except possibly adding control variables | Hyperparameter tuning is almost always required |
In ML, if I say that I have a regression problem, then that means that my target is continuous. I use a “regressor” algorithm to train, validate, and test my model when the target is continuous.
What is Machine Learning?
An algorithm is a set of step-by-step instructions or processes constructed with three basic components: 1) sequencing, 2) selection, and 3) iteration.
An algorithm may be used to solve a specific problem, perform a computation, find a pattern, discover anomalies, or construct a machine learning model.
In machine learning, we use machine learning algorithms to construct predictive models.
What are algorithms?
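As a minimal sketch of the three components named above, the short function below uses sequencing, selection, and iteration together. The function name and the sample values are illustrative assumptions, not part of the original slides.

```python
def count_above_threshold(values, threshold):
    """Count how many values exceed a threshold."""
    count = 0                      # sequencing: statements run one after another
    for value in values:           # iteration: repeat the same steps for every value
        if value > threshold:      # selection: take a branch only when a condition holds
            count += 1
    return count


# Hypothetical data: five weekly overtime hours, threshold of 4 hours.
result = count_above_threshold([3, 8, 5, 12, 1], threshold=4)
print(result)
```

Machine learning algorithms are built from the same three ingredients, only with many more steps.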
What is Machine Learning?
Different machine learning algorithms include:
Machine learning projects may take several forms: 1) build new algorithms, 2) refine existing algorithms or 3) use existing algorithms to build models to solve business problems.
What is Machine Learning?
What is the difference between a machine learning algorithm and a machine learning model?
The algorithm is the training procedure; the model is what that procedure produces from the data. When we say “I have a random forest model,” we mean that we used the random forest machine learning algorithm to train a model.
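The distinction can be sketched with a deliberately tiny algorithm. The example below assumes a one-feature nearest-centroid classifier (not a random forest) purely to keep the code short; the data and names are hypothetical. The training function is the algorithm, and the dictionary it returns is the model.

```python
from statistics import mean


def train_nearest_centroid(features, targets):
    """The algorithm: learn one centroid (mean feature value) per class."""
    centroids = {}
    for label in set(targets):
        centroids[label] = mean(x for x, y in zip(features, targets) if y == label)
    return centroids  # the learned centroids are the model


def predict(model, x):
    """Use the trained model: choose the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


# Hypothetical training data: weekly hours worked and a burnout flag.
hours_worked = [35, 38, 40, 55, 60, 62]
burnout = ["no", "no", "no", "yes", "yes", "yes"]

model = train_nearest_centroid(hours_worked, burnout)
print(model)
print(predict(model, 58))
```

Running the same algorithm on different data would produce a different model, which is why the two terms are not interchangeable.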
What is Machine Learning?
As business analysts, we probably will not build or refine algorithms. Instead, we select an algorithm or multiple algorithms to construct one or more predictive models to solve a specific business problem.
Ensemble Machine Learning Models
“x” represents an algorithm in this case
Depending on the nature of our target (outcome variable), we can employ hard voting or soft voting when we have multiple models (ensemble).
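A minimal sketch of the two voting schemes for a categorical target, assuming three already-trained models whose outputs are written out by hand (the labels and probabilities are hypothetical). Hard voting counts predicted labels; soft voting averages predicted class probabilities.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most models (majority vote)."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(predicted_probabilities):
    """Average class probabilities across models and return the top class."""
    classes = predicted_probabilities[0].keys()
    n_models = len(predicted_probabilities)
    averaged = {
        c: sum(p[c] for p in predicted_probabilities) / n_models for c in classes
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs from three models for one employee.
labels = ["stay", "stay", "leave"]
probabilities = [
    {"stay": 0.55, "leave": 0.45},
    {"stay": 0.60, "leave": 0.40},
    {"stay": 0.10, "leave": 0.90},
]

hard_result = hard_vote(labels)
soft_result, averaged = soft_vote(probabilities)
print(hard_result, soft_result, averaged)
```

In this example the two schemes disagree: two weakly confident "stay" votes win the hard vote, while one very confident "leave" prediction wins the soft vote. For a continuous target, the usual analogue is to average the models' numeric predictions.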
What is Machine Learning?
All machine learning algorithms and resulting models have error. We have two broad categories of model error: reducible and irreducible.
We attempt to reduce the reducible error component by, among other things, adding new data, modifying hyperparameters, and using different algorithms.
Simple is better than complex.
Complex is better than complicated.
Our goal is to construct the simplest model that is useful for our research question or business problem that reduces the reducible error as much as possible.
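One common way to make the reducible error concrete is the standard decomposition of expected squared error. The sketch below assumes a continuous target generated as $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, where $\hat{f}$ is the trained model; these symbols are assumptions introduced here, not notation from the slides.

```latex
E\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Under these assumptions, the first two terms are the part we can try to reduce with more data, different hyperparameters, or a different algorithm; the noise term $\sigma^2$ remains no matter which model we choose.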
Terminology
Labels: Two common uses. Labels are the codes assigned to each observation. In this sense, the labels become the targets. They are also the values or categories assigned to examples by a machine learning model. In classification, the labels are the categories (customer or not, employee or not). In regression, the labels are the real numbers (22K in profit).
Targets: The correct labels for each example. Conceptually similar to the outcome variable (i.e., the dependent variable) in a regression. The model should correctly predict the target.
There are no labels and targets in unsupervised machine learning.
Examples: Items, instances, rows, and observations used for training or evaluation.
Features or Dimensions: Set of attributes or properties of your data (i.e., the columns).
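The terms above map directly onto a small table of data. In this sketch, each dictionary is one example (row), the keys other than the target are the features (columns), and the target column holds the correct labels. The column names and values are hypothetical.

```python
# Each dictionary is one example (observation / row).
examples = [
    {"tenure_years": 2, "department": "sales", "left_company": "yes"},
    {"tenure_years": 7, "department": "hr", "left_company": "no"},
    {"tenure_years": 4, "department": "it", "left_company": "no"},
]

target_name = "left_company"  # the column the model should predict

# Features are every column except the target.
feature_names = [name for name in examples[0] if name != target_name]

# X holds one row vector of feature values per example; y holds the targets.
X = [[row[name] for name in feature_names] for row in examples]
y = [row[target_name] for row in examples]

print(feature_names)
print(X)
print(y)
```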
Terminology
This image is an example and we probably have 5K other examples to train our model and 1K other examples to test our model.
We want our final ML models to produce labels that match the targets in a high percentage of examples!
Target: Cat
In many instances, we have human labelers to label each example. What happens if the labelers disagree?
Terminology
Column Vectors
Row Vectors
Terminology
Variables are conceptually similar to targets, features, or dimensions. A variable is something that is observable or measurable. An employee’s education level is a variable, and where the employee went to school is another variable. The Hofstede cultural dimensions of an individual or country are variables. Variables vary from example (observation) to example (observation).
A parameter is a numerical property of the model (not of the example or the data point).
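To illustrate the difference, the sketch below fits a straight line by ordinary least squares. The x and y values are variables (they differ from example to example), while the slope and intercept are parameters (numerical properties of the fitted model). The data are hypothetical and chosen so the result is easy to check by hand.

```python
from statistics import mean


def fit_line(x, y):
    """Estimate slope and intercept by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    variance = sum((a - x_bar) ** 2 for a in x)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return {"slope": slope, "intercept": intercept}


# Variables: one value per example.
years_of_education = [1, 2, 3, 4]
performance_score = [3, 5, 7, 9]

# Parameters: properties of the model, learned from all examples together.
parameters = fit_line(years_of_education, performance_score)
print(parameters)
```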
Unsupervised Machine Learning Workflow
The workflow will vary for supervised versus unsupervised machine learning projects. Both start with understanding the business context and problem. Both also require data acquisition, cleaning, and preparation.
Workflow steps shown in the diagram: Data Acquisition, Data Cleaning, Output, Evaluation, Deployment, Monitor.
After evaluation, we could go back to the beginning, tune the algorithm, select a different algorithm, or deploy the solution if the solution is “good enough”.
Notice how there is typically no train-test split
Supervised Machine Learning Workflow
Data Collection
Bad data are no better than no data
Data Cleaning & Preparation
Data wrangling (or data munging) refers to the tasks required to get data ready for analysis. We spend considerable time performing these tasks.
Data Cleaning & Preparation
Unit of analysis might be individuals, organizations, transactions, or countries.
We should not have the same observation appear multiple times in our table.
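A minimal sketch of removing repeated observations, assuming each row carries an identifier for the unit of analysis (here a hypothetical `employee_id`). The first occurrence of each identifier is kept and later repeats are dropped.

```python
def drop_duplicate_observations(rows, key):
    """Keep the first row for each value of `key`; drop later repeats."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows


# Hypothetical table where employee 102 appears twice.
rows = [
    {"employee_id": 101, "score": 3.4},
    {"employee_id": 102, "score": 4.1},
    {"employee_id": 102, "score": 4.1},
    {"employee_id": 103, "score": 2.9},
]

clean_rows = drop_duplicate_observations(rows, key="employee_id")
print(len(rows), "->", len(clean_rows))
```

Whether to keep the first, the last, or a merged version of a repeated observation is a judgment call that depends on how the duplicates arose.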
Data Cleaning & Preparation
We might have to standardize our numerical data, dummy code our categorical data, create an n-gram matrix for our text data, or calculate time to target for date features.
Image Data
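The preparation steps for numerical, categorical, text, and date features can be sketched with the standard library alone. The helper names and sample values below are assumptions for illustration; in practice a data-preparation library would usually handle these steps.

```python
from datetime import date
from statistics import mean, pstdev


def standardize(values):
    """Rescale numbers to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def dummy_code(values):
    """Turn each category into its own 0/1 indicator column."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def ngrams(text, n=2):
    """Split text into overlapping runs of n words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def days_until(event_date, target_date):
    """Time to target, measured in days."""
    return (target_date - event_date).days


z_scores = standardize([40, 50, 60])
dummies = dummy_code(["hr", "it", "hr"])
bigrams = ngrams("Great place to work")
lead_time = days_until(date(2024, 1, 1), date(2024, 1, 31))
print(z_scores, dummies, bigrams, lead_time)
```

This sketch keeps an indicator column for every category; explanatory regressions often drop one level as a reference category to avoid perfect multicollinearity.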
Data Cleaning & Preparation
Feature reduction (form of feature engineering) using principal component analysis (PCA)
Data Cleaning & Preparation
We could perform an unsupervised cluster analysis to engineer new features
Note: An unsupervised cluster analysis may be a result (e.g., segment customers) and/or used to engineer features to be used in a supervised machine learning problem.
Data Cleaning & Preparation
Target leakage happens when you train your model on a dataset that includes features or information that would not be available at the time of prediction (i.e., you train your model on data that can only be collected in the future).
During the data preparation phase, we must make sure that we do not have any target leakage.
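One simple guard, sketched below, is to record for every feature when it becomes available and to keep only those known before the prediction is made. The feature names and availability tags are hypothetical; the tagging itself still requires domain knowledge.

```python
# Hypothetical catalogue: when does each feature become known?
FEATURE_AVAILABILITY = {
    "tenure_years": "before_prediction",
    "last_review_score": "before_prediction",
    # Only exists once the employee has already left, so it leaks the target.
    "exit_interview_score": "after_outcome",
}


def leakage_free_features(availability):
    """Keep only features that exist at the time a prediction is made."""
    return [
        name
        for name, when in availability.items()
        if when == "before_prediction"
    ]


safe_features = leakage_free_features(FEATURE_AVAILABILITY)
print(safe_features)
```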
Data Cleaning & Preparation
Are the outliers real (but inconvenient) observations? If so, we cannot remove them.
Are the outliers “bad” data (i.e., errors in the data)? If so, we can remove them.
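Before deciding whether an outlier is real or an error, we first have to find it. The sketch below flags candidates with the common 1.5 × IQR rule, which is an assumption chosen for illustration and not a rule stated in the slides. It only flags values; the decision to keep or remove them remains with the analyst.

```python
from statistics import quantiles


def flag_outliers(values, multiplier=1.5):
    """Return values lying outside the IQR fences (candidates for review)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical salaries in thousands; 250 may be an executive or a typo.
salaries = [48, 50, 52, 51, 49, 53, 250]
flagged = flag_outliers(salaries)
print(flagged)
```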
Model Training (Supervised)
Now that our data are clean and prepared, we are finally ready to train a model!
In the training step, we first randomly split our data into a training and a testing data set. The model is developed using only the training data. We don’t touch the test data set until we are ready to see how accurate our trained model is.
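A minimal sketch of a random train/test split, assuming a list of 100 observation identifiers and a 20% test fraction (both numbers are illustrative). A fixed seed makes the split reproducible.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]                       # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


employee_ids = list(range(100))  # hypothetical observation identifiers
train, test = train_test_split(employee_ids)
print(len(train), len(test))
```

Because each identifier lands in exactly one of the two lists, the test set stays untouched until the final evaluation.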
Model Training (Supervised)
Instead of relying on a single split of our sample, we can use cross-validation, which is sometimes referred to as rotation estimation or out-of-sample testing.
Training Data
Testing Data
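Cross-validation can be sketched as rotating which slice of the data is held out. The generator below assumes k-fold cross-validation with five folds over ten examples (illustrative numbers) and yields index lists rather than training a real model.

```python
import random


def k_fold_indices(n_examples, k=5, seed=42):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(10, k=5))
for training, validation in splits:
    print(sorted(validation), "held out;", len(training), "used for training")
```

Every example is held out exactly once, so the averaged score across folds uses all of the data without ever scoring a model on an example it was trained on.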
Model Training (Supervised)
We cannot have the same observation in both training and testing. If this happens, the entire process is invalidated ☹
We refer to this type of contamination as data leakage. That is, an observation leaks its way into both training and testing. The two data sets must be independent (and randomly created).
Observation “Mattson” sneaks his way into both training and testing, which creates data leakage.
Data leakage is the conceptual equivalent of poison for machine learning. One small drop can pollute the entire body of water.
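A quick check for this kind of contamination is to intersect the identifiers of the two data sets. The sketch assumes each row has a unique identifier (a hypothetical `employee_id` field) and reuses the “Mattson” observation from the slide.

```python
def find_leaked_ids(training_rows, testing_rows, key="employee_id"):
    """Return identifiers that appear in both training and testing."""
    training_ids = {row[key] for row in training_rows}
    testing_ids = {row[key] for row in testing_rows}
    return training_ids & testing_ids


training_rows = [{"employee_id": "Mattson"}, {"employee_id": "Lee"}]
testing_rows = [{"employee_id": "Garcia"}, {"employee_id": "Mattson"}]

leaked = find_leaked_ids(training_rows, testing_rows)
print(leaked)
```

An empty result is necessary but not sufficient: near-duplicates or multiple rows for the same person under different identifiers would not be caught by this check.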
Model Training (Supervised)
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 25, 'min_samples_split': 75, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': None, 'splitter': 'best'}
We must create a blueprint that tells the algorithm the constraints under which it can train the model.
These are the model hyperparameters. As analysts and researchers, we can set these hyperparameters manually or use an algorithm to help us select the best (optimal) hyperparameters.
Sample Hyperparameters & Blueprint for machine learning algorithm
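Letting an algorithm select hyperparameters can be sketched as a grid search: train once per candidate value and keep the value that scores best on validation data. To stay short, the example assumes a toy one-feature k-nearest-neighbours classifier, where k is the hyperparameter, instead of the decision tree settings shown above. All data are hypothetical.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k nearest training examples."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def accuracy(train, validation, k):
    """Share of validation examples whose predicted label matches the target."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


train = [(1, "low"), (2, "low"), (3, "low"), (4, "high"),
         (6, "high"), (7, "high"), (8, "high"), (9, "high")]
validation = [(2.5, "low"), (3.4, "low"), (6.5, "high"), (8.5, "high")]

# Grid search: score every candidate value of the hyperparameter k.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The search uses validation data, not the test data, so the final test score remains an honest estimate of performance on unseen examples.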
Model Evaluation (Supervised)
Now that we trained and validated a model, we are ready to test the model using the uncontaminated (unseen) testing sample.
We want a model that balances bias and variance.
Model Evaluation (Supervised)
An overfitted model is a model that fits the training data very well but does not generalize to other samples including our testing sample.
Low bias (high accuracy) in training combined with a large gap in performance between training and testing (high variance) is bad!
Model Evaluation (Supervised)
Ways to Reduce High Bias
Ways to Reduce High Variance
Model Evaluation (Supervised)
Continuous targets but also good for categorical targets
Model Evaluation (Supervised)
Confusion matrix (categorical targets): Type 1 error (false positive) and Type 2 error (false negative).
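For a two-class target, the four cells of a confusion matrix can be counted directly. In the sketch below, a false positive corresponds to a Type 1 error and a false negative to a Type 2 error. The targets and predictions are hypothetical.

```python
def confusion_counts(targets, predictions, positive="yes"):
    """Count true/false positives and negatives for a two-class target."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for target, predicted in zip(targets, predictions):
        if predicted == positive:
            counts["TP" if target == positive else "FP"] += 1  # FP = Type 1 error
        else:
            counts["FN" if target == positive else "TN"] += 1  # FN = Type 2 error
    return counts


targets = ["yes", "yes", "no", "no", "no", "yes"]
predictions = ["yes", "no", "no", "yes", "no", "yes"]

counts = confusion_counts(targets, predictions)
accuracy = (counts["TP"] + counts["TN"]) / len(targets)
print(counts, round(accuracy, 2))
```

Accuracy alone hides which kind of error the model makes, which matters when a false positive and a false negative carry different costs.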
Model Evaluation (Supervised)
You construct a model to predict employee well-being based on cultural values and work-life balance features. You use the random forest algorithm to construct an ensemble model that was 72% accurate in training, 60% accurate in validation, and 55% accurate in testing.
What additional information is needed?
If I say that your minimum threshold for accuracy for the model to be acceptable is 65% or 80%, what are your next steps?
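As a starting point for the discussion, the sketch below turns the three accuracy figures from the scenario into the gaps between data sets and checks the testing accuracy against each proposed threshold. The function name and the choice to judge acceptability on testing accuracy are assumptions made for illustration.

```python
def summarize_fit(train_acc, validation_acc, test_acc, minimum_acceptable):
    """Report performance gaps and whether testing accuracy meets the threshold."""
    return {
        "train_validation_gap": round(train_acc - validation_acc, 2),
        "train_test_gap": round(train_acc - test_acc, 2),
        "meets_threshold": test_acc >= minimum_acceptable,
    }


# Accuracy figures from the scenario: 72% training, 60% validation, 55% testing.
summary_65 = summarize_fit(0.72, 0.60, 0.55, minimum_acceptable=0.65)
summary_80 = summarize_fit(0.72, 0.60, 0.55, minimum_acceptable=0.80)
print(summary_65)
print(summary_80)
```

The 17-point drop from training to testing points toward overfitting, and the testing accuracy misses both thresholds. With an 80% threshold, even the training accuracy falls short, which suggests the model may also be underfitting relative to that goal.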
Model Evaluation (Supervised)
1. Relevance: Chosen solution addresses a genuine problem.
2. Representativeness: Data are representative of the scenarios where the models will be deployed.
3. Value: Machine learning models should offer practical benefits over traditional methods.
4. Explainability: Machine learning models are interpretable and understandable to end-users.
5. Auditability: Machine learning models and processes should be verifiable and replicable by outsiders.
6. Equity: Machine learning models should be fair and not disproportionately benefit or harm certain demographics.
7. Accountability/Responsibility: Mechanisms are in place for accountability and for handling grievances related to machine learning predictions.
Model Deployment (Supervised)