Intro to Machine Learning

Summer 2023

Melissa Haller

What is Machine Learning?

We’ve heard the terms Machine Learning and AI over and over again, but what is ML really?
According to IBM, machine learning is “a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy”
Rather than simply focus on uncovering patterns and correlations, machine learning is concerned with prediction and classification
There have been major advances in machine learning algorithms in the past decade

Types of Machine Learning

There are 3 major Machine Learning algorithm types
Supervised Learning: the algorithm “learns” with the help of user supervision. In practice, this usually means that labels or values have already been assigned to each observation in the dataset, and can be used both to train the model and to evaluate its accuracy

E.g. Let’s say I have data on books and their genres, and I’m building a model to classify book genres. Since I already have genres assigned in my training dataset, this is considered supervised learning

When the labels are categories (such as genres), we would build a classification model
When the labels are numeric values, we would use a regression model instead (prediction with continuous variables)
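
To make the book-genre example concrete, here is a minimal sketch of supervised classification using only the Python standard library. The two features, the four training books, and the nearest-neighbor rule are all assumptions chosen for illustration; the slides do not prescribe a particular algorithm.

```python
# Hypothetical labeled training data:
# (average sentence length, share of dialogue) -> genre label.
training_books = [
    ((12.0, 0.60), "romance"),
    ((11.0, 0.55), "romance"),
    ((22.0, 0.10), "history"),
    ((24.0, 0.05), "history"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_genre(features):
    """Return the genre of the closest labeled book (1-nearest-neighbor)."""
    nearest = min(training_books,
                  key=lambda book: squared_distance(book[0], features))
    return nearest[1]

print(predict_genre((13.0, 0.50)))
print(predict_genre((23.0, 0.08)))
```

Because every training book already carries a genre label, the model can be checked against known answers, which is what makes this supervised.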

Unsupervised Learning

Unsupervised learning trains an algorithm on data that is not labeled, so the algorithm works without any guidance about the correct answer

For example, if I have data on books without genre information, I could use different book characteristics to determine how similar the books are to one another, and use this information to sort them into groups of similar books

Types of unsupervised learning:

Clustering: A clustering problem is where the machine identifies the inherent groupings in the data - e.g. clustering books based on similar word choice
Association: An association problem is where we find the relationship between two events or items - e.g. making book recommendations based on other books users liked
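
The clustering idea above can be sketched with a small k-means routine in plain Python. The six books, their two made-up word-choice scores, and the starting centers are all hypothetical; k-means is one common clustering method, not one the slides name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the average of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(point, centers[i]))
            clusters[nearest].append(point)
        # An empty cluster keeps its previous center.
        centers = [mean_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled books, each described by two word-choice scores.
books = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
         (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centers, clusters = k_means(books, centers=[books[0], books[-1]])
for number, group in enumerate(clusters):
    print(f"cluster {number}: {group}")
```

No genre labels are used anywhere: the groups come only from how close the books are to one another.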

Reinforcement Learning

Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences
Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior
Example: video games! Think of a game like Pac-Man - while traversing the grid or maze, Pac-Man receives rewards from eating food and loses a life if he runs into a ghost
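
As a rough illustration of learning from rewards and punishments, the sketch below shrinks the Pac-Man idea to a five-cell corridor with a ghost at one end and food at the other. It uses tabular Q-learning, one common reinforcement learning method that the slides do not name; the corridor, the reward values, the learning settings, and the purely random exploration are all simplifying assumptions.

```python
import random

GHOST, FOOD = 0, 4                    # terminal cells at the two ends
ACTIONS = {"left": -1, "right": +1}   # each action moves one cell

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn a value for every (cell, action) pair by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, 4) for a in ACTIONS}
    for _ in range(episodes):
        state = 2                     # every episode starts in the middle
        while state not in (GHOST, FOOD):
            action = rng.choice(list(ACTIONS))   # explore at random
            next_state = state + ACTIONS[action]
            if next_state == FOOD:
                reward, future = 1.0, 0.0        # reward for eating food
            elif next_state == GHOST:
                reward, future = -1.0, 0.0       # punishment for the ghost
            else:
                reward = 0.0
                future = max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the estimate toward reward plus discounted future value.
            q[(state, action)] += alpha * (reward + gamma * future
                                           - q[(state, action)])
            state = next_state
    return q

q_values = train()
policy = {s: max(ACTIONS, key=lambda a: q_values[(s, a)])
          for s in range(1, 4)}
print(policy)
```

The agent is never told which move is correct. It only sees the reward or punishment that follows each move, and the learned values end up favoring the moves that lead toward the food.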

Machine Learning: Steps

Obtain and clean a dataset for analysis
Determine the correct machine learning algorithm for your task

I.e. Do I want to categorize my data? Do I want to make predictions using my data? Am I working with text or numeric data? How simple or complex will my model need to be?

Split the data into a “training” set and a “testing” set

The training data will be used to fit or build the model, while the testing data will be used to test its accuracy on observations it has not seen
The ML algorithm ‘learns’ by fitting a function or model that best explains the patterns in the training data

Predict your results, and assess the quality of the model’s fit
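
The steps above can be sketched end to end in a few lines of standard-library Python. The dataset (book lengths labeled "novella" or "novel"), the 75/25 split, and the threshold model are all hypothetical choices made for illustration.

```python
import random

# Step 1: a small, already clean, hypothetical dataset:
# (length in thousands of words, label).
books = [(20, "novella"), (25, "novella"), (30, "novella"),
         (35, "novella"), (40, "novella"), (45, "novella"),
         (100, "novel"), (105, "novel"), (110, "novel"),
         (115, "novel"), (120, "novel"), (125, "novel")]

# Step 3: shuffle, then keep 75% for training and 25% for testing.
rng = random.Random(42)
shuffled = books[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]

def fit_threshold(rows):
    """'Learn' a cut-off halfway between the average length of each class."""
    novellas = [length for length, label in rows if label == "novella"]
    novels = [length for length, label in rows if label == "novel"]
    return (sum(novellas) / len(novellas) + sum(novels) / len(novels)) / 2

def predict(length, threshold):
    """Step 2's chosen model: a single cut-off on book length."""
    return "novel" if length > threshold else "novella"

# Fit on the training data only, then assess on the held-out testing data.
threshold = fit_threshold(train)
correct = sum(predict(length, threshold) == label for length, label in test)
accuracy = correct / len(test)
print(f"threshold={threshold:.1f}, test accuracy={accuracy:.2f}")
```

The testing rows play no part in choosing the threshold, so the accuracy figure reflects how the model handles data it has not seen.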

Overfitting in ML

Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data
Why might overfitting happen?

The training dataset may be too small
The training data may be too noisy or have too many outliers or cases that are not representative of the broader population being studied

This is why we need to assess the fit of our model on data that was not used to train it
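
A small sketch can show what overfitting looks like in numbers. Everything here is made up for illustration: a single numeric feature, a true pattern of "yes when the value is 5 or more", two deliberately mislabeled training rows standing in for noise, and a "memorizer" model that simply copies the label of the closest training row.

```python
# Hypothetical training data. The rows (2, True) and (8, False)
# are noisy outliers that break the underlying pattern.
train = [(1, False), (2, True), (3, False), (4, False),
         (6, True), (7, True), (8, False), (9, True)]

# Hypothetical new data that follows the underlying pattern.
test = [(1.2, False), (2.2, False), (3.4, False),
        (6.4, True), (7.8, True), (9.1, True)]

def memorizer(x):
    """Overfit model: copy the label of the nearest training row."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def simple_rule(x):
    """Simpler model: one cut-off (assumed here, not learned)."""
    return x >= 5

def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == label for x, label in rows) / len(rows)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(f"{name}: train={accuracy(model, train):.2f}, "
          f"test={accuracy(model, test):.2f}")
```

The memorizer is perfect on the training data because it has learned the noise along with the pattern, and it pays for that on the new data. The simple rule misses the two noisy training rows but handles the new data better.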

Why Should I Care About Machine Learning?

There are many ways that ML can be implemented in the humanities:

Identifying key themes or categories in text data (newspapers, social media, books, letters, historical documents etc.)
Analyzing multimedia data (audio, images, etc.) and digital collections
ML and Natural Language Processing (NLP) for translation and linguistics

Even if you don’t plan to use ML, humanities scholars also have much to contribute to discussions of ethics and machine learning

AI will have impacts on culture, literature, language, etc.
More broadly, humanities can help us to better understand and center the human side of AI creation and its impacts

Caution with Machine Learning

I recommend using machine learning algorithms cautiously and critically

Results of an ML model are only as good as the data that you put into it
If there is pre-existing bias in your data, the results of the model will reproduce that bias

Sometimes, ML feels a bit like a black box - it’s easy to use, but it’s not always easy to understand where the output is coming from, or what the results really mean
It’s important to recognize that ML is another tool in our toolbox, but we need to think carefully about the output of our models and their implications - ML algorithms can have real, tangible impacts on people’s lives

Part II: Machine Learning Models

One very popular machine learning algorithm is a random forest model

We can use it to predict categories (classification) or continuous variables (regression) - we’ll focus on categories today!

In order to understand how random forests work, we must first understand the basic model that they are built on: decision trees

This type of model makes predictions based on how a series of questions is answered
Let’s look at an example - this might help make sense of the model

Example Decision Tree

This is a simple illustration of how decision trees work for prediction
This simple model predicts types of animals based on characteristics or features of the animals

[Figure: example decision tree, annotated with the terms Decision Node, Splitting, and Tree Depth]
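
In code, a decision tree is just a series of nested questions. The sketch below is a hand-written stand-in, not the tree from the slide: the animals and the three yes/no features are hypothetical, and a real decision tree algorithm would choose the questions automatically from training data.

```python
def predict_animal(has_feathers, can_fly, has_fins):
    """Walk down a small, hand-written decision tree of depth 2."""
    if has_feathers:          # decision node at the root: first split
        if can_fly:           # second-level decision node
            return "hawk"
        return "penguin"
    if has_fins:              # second-level decision node on the other branch
        return "dolphin"
    return "bear"

print(predict_animal(has_feathers=True, can_fly=False, has_fins=False))
print(predict_animal(has_feathers=False, can_fly=False, has_fins=True))
```

Each `if` is a decision node that splits the animals into two groups, and the tree depth is the number of questions asked on the longest path from the first question to a prediction (two here).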

Decision Tree vs. Random Forests

Decision trees are limited because they are prone to overfitting, and are extremely sensitive to outliers in the data
The random forest model improves upon this greatly using the following steps:

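Take random samples (with replacement) from the original dataset
For each random sample, build a decision tree using a random subset of the predictor variables (in the standard algorithm, a fresh random subset is considered at each split)
Combine the predictions of all the trees to produce the final prediction: a majority vote for categories, or an average for continuous values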

As a result, random forest models tend to perform much better on test data, and are less sensitive to outliers

Downside: can be computationally intensive (i.e. slow), and there is no single final tree that we can visualize
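
The three steps above can be sketched in standard-library Python. This is a heavily simplified stand-in for a real random forest: each "tree" asks only one question (a so-called decision stump) about one randomly chosen predictor, and the data, labels, and number of trees are hypothetical. In practice one would normally use an existing library implementation rather than writing this by hand.

```python
import random
from collections import Counter

# Hypothetical labeled data: two numeric predictors per row, label "A" or "B".
data = [((1, 1), "A"), ((2, 2), "A"), ((3, 1), "A"), ((2, 3), "A"),
        ((8, 9), "B"), ((9, 8), "B"), ((7, 9), "B"), ((9, 7), "B")]

def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows with replacement.
    Redraw if a class is missing (an assumption to keep the sketch short)."""
    while True:
        sample = [rng.choice(rows) for _ in rows]
        if len({label for _, label in sample}) == 2:
            return sample

def fit_stump(sample, feature):
    """Step 2: a one-question tree on a single predictor, with the cut-off
    halfway between the two class averages for that predictor."""
    means = {}
    for label in ("A", "B"):
        values = [x[feature] for x, y in sample if y == label]
        means[label] = sum(values) / len(values)
    low, high = sorted(means, key=means.get)
    threshold = (means[low] + means[high]) / 2
    return feature, threshold, low, high

def predict_stump(stump, x):
    feature, threshold, low, high = stump
    return low if x[feature] <= threshold else high

def predict_forest(forest, x):
    """Step 3: majority vote across all the trees."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
forest = [fit_stump(bootstrap_sample(data, rng), rng.randrange(2))
          for _ in range(25)]

print(predict_forest(forest, (2, 2)))
print(predict_forest(forest, (8, 8)))
```

Because each tree sees a different sample and a different predictor, an unusual row can mislead a few trees but rarely the majority, which is the intuition behind the improved performance on test data.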

Data for Practice

Elections Data: https://geodacenter.github.io/data-and-lab/county_election_2012_2016-variables/