Intro to Machine Learning

Summer 2023

Melissa Haller

What is Machine Learning?

  • We’ve heard the terms Machine Learning and AI over and over again, but what is ML really?
  • According to IBM, machine learning is “a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy”
  • Rather than simply focusing on uncovering patterns and correlations, machine learning is concerned with prediction and classification
  • There have been major advances in machine learning algorithms in the past decade

Types of Machine Learning

  • There are three major types of machine learning algorithms: supervised, unsupervised, and reinforcement learning
  • Supervised Learning: the algorithm “learns” with the help of user supervision. In practice, this usually means that labels or values have already been assigned to each observation in the dataset, and can be used to train the model and evaluate its accuracy
    • E.g. Let’s say I have data on books and their genres, and I’m building a model to classify book genres. Since I already have genres assigned in my training dataset, this is considered supervised learning
  • When the labels are text or categories, we would build a classification model
  • When the labels are numeric values, we might use a regression model (prediction with continuous variables)
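
To make the classification vs. regression distinction concrete, here is a minimal sketch in plain Python. The data, the feature choices (word counts, page counts), and the function names are all made up for illustration; a real project would more likely use a library. The classifier simply copies the genre of the most similar labeled book, and the regression fits a straight line by least squares.

```python
# Hypothetical labeled data: (mentions of "dragon", mentions of "detective") -> genre
labeled_books = [
    ((12, 0), "fantasy"),
    ((9, 1), "fantasy"),
    ((0, 14), "mystery"),
    ((1, 10), "mystery"),
]

def classify(features):
    """Classification: predict the genre of the closest labeled book."""
    def squared_distance(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], features))
    return min(labeled_books, key=squared_distance)[1]

# Hypothetical numeric data: page count -> hours needed to read the book
pages = [100, 200, 300, 400]
hours = [2.0, 4.0, 6.0, 8.0]

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(pages, hours)
predicted_genre = classify((10, 2))          # a category
predicted_hours = slope * 250 + intercept    # a continuous value
print(predicted_genre, predicted_hours)
```

Both models are "supervised" in the sense above: the genres and reading times were already known for the training examples.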

Unsupervised Learning

  • Unsupervised learning trains an algorithm on data that is not labeled, so the algorithm must find structure in the data without any guidance
    • For example, if I have data on books without genre information, I could use different book characteristics to determine how similar the books are to one another, and use this information to train a model that groups similar books together
  • Types of unsupervised learning:
    • Clustering: A clustering problem is where the machine identifies the inherent groupings in the data - e.g. clustering books based on similar word choice
    • Association: An association problem is where we find the relationship between two events or items - e.g. making book recommendations based on other books users liked
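
As an illustration of clustering, here is a bare-bones k-means sketch in plain Python. The book features (average sentence length, percentage of dialogue) and the numbers are invented, and starting from the first k points is a simplifying assumption rather than how a library would initialize. Notice that no genre labels appear anywhere.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly reassigning and re-centering."""
    centroids = list(points[:k])  # naive starting guess (assumption for this sketch)
    clusters = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the middle of its cluster
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

# Hypothetical unlabeled books: (average sentence length, % of text that is dialogue)
books = [(8, 60), (25, 10), (9, 55), (27, 12), (7, 65), (24, 8)]
clusters, centroids = k_means(books, k=2)
print(clusters)
```

The algorithm only returns groups; deciding what each group means (for example, "dialogue-heavy books") is still up to the researcher.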

Reinforcement Learning

  • Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences
  • Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior
  • Example: video games! Think of a game like Pac-Man - while traversing the maze, Pac-Man receives rewards for eating food and loses a life if he runs into a ghost
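
Here is a toy version of the Pac-Man idea, sketched with tabular Q-learning (one common reinforcement learning method; the slides do not name a specific algorithm). Everything about the setup is an assumption made for illustration: a five-cell corridor with a ghost at one end (reward -1) and food at the other (reward +1), and an agent that tries moves at random and keeps score of how well each move turned out.

```python
import random

GHOST, FOOD = 0, 4        # terminal cells at the two ends of the corridor
ACTIONS = (-1, +1)        # move left or move right

def step(state, action):
    """Return (next_state, reward, done) for one move."""
    next_state = state + action
    if next_state == FOOD:
        return next_state, 1.0, True
    if next_state == GHOST:
        return next_state, -1.0, True
    return next_state, 0.0, False

def train(episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 2, False          # every game starts in the middle
        while not done:
            action = rng.choice(ACTIONS)  # trial and error: explore at random
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the estimate toward reward + discounted future value
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_table = train()
# The learned policy: in each cell, pick the action with the highest estimated value
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

Nobody tells the agent that "right" is correct; it works that out from the rewards and punishments alone.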

Machine Learning: Steps

  • Obtain and clean a dataset for analysis
  • Determine the correct machine learning algorithm for your task
    • I.e. Do I want to categorize my data? Do I want to make predictions using my data? Am I working with text or numeric data? How simple or complex will my model need to be?
  • Split the data into a “training” set and a “testing” set
    • The training data will be used to fit or build the model, while the testing data will be used to test its accuracy on data it has not seen
    • The ML algorithm ‘learns’ by fitting a function or model to the training data that best explains the patterns in it
  • Predict your results, and assess the quality of the model’s fit
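
The steps above can be sketched end to end in a few lines of plain Python. The dataset (average sentence length for two kinds of books), the 80/20 split, and the very simple "nearest average" model are all assumptions chosen to keep the example short, not a recommendation for real projects.

```python
import random

# Step 1: a hypothetical, already-cleaned dataset: (average sentence length, label)
dataset = ([(float(x), "children") for x in range(10, 20)]
           + [(float(x), "academic") for x in range(30, 40)])

# Step 3: shuffle, then set aside 80% for training and 20% for testing
rng = random.Random(42)
shuffled = dataset[:]
rng.shuffle(shuffled)
cut = len(shuffled) * 4 // 5
train, test = shuffled[:cut], shuffled[cut:]

# Step 2 + fitting: a deliberately simple classifier that stores each label's average
def fit(rows):
    totals = {}
    for x, label in rows:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}

def predict(model, x):
    """Predict the label whose training average is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

model = fit(train)  # the model only ever sees the training data

# Step 4: predict on the held-out test data and assess the fit
accuracy = sum(predict(model, x) == label for x, label in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The key design choice is that `test` is never passed to `fit`, so the accuracy reflects how the model handles data it has not seen.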

Overfitting in ML

  • Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data
  • Why might overfitting happen?
    • The training dataset may be too small
    • The training data may be too noisy or have too many outliers or cases that are not representative of the broader population being studied
  • This is why we need to assess the fit of our model
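
A small sketch of what overfitting looks like in practice, using invented data. Two training labels are deliberately wrong (the "noise"). The `memorizer` model copies the label of the nearest training point, so it reproduces the training data perfectly, noise included; the `simple_model` only compares against each label's average. Both function names and the data are hypothetical.

```python
def true_label(x):
    return "A" if x < 20 else "B"

def flip(label):
    return "B" if label == "A" else "A"

train_x = list(range(0, 40, 2))
mislabeled = {6, 30}  # hypothetical noisy observations in the training data
train = [(x, flip(true_label(x)) if x in mislabeled else true_label(x))
         for x in train_x]
test = [(x + 0.5, true_label(x + 0.5)) for x in train_x]  # new, correctly labeled data

def memorizer(x):
    """Overfit model: copy the label of the single nearest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def average_of(label):
    xs = [x for x, lab in train if lab == label]
    return sum(xs) / len(xs)

CENTERS = {"A": average_of("A"), "B": average_of("B")}

def simple_model(x):
    """Simpler model: pick the label whose training average is closest."""
    return min(CENTERS, key=lambda label: abs(CENTERS[label] - x))

def accuracy(model, rows):
    return sum(model(x) == label for x, label in rows) / len(rows)

results = {
    "memorizer": (accuracy(memorizer, train), accuracy(memorizer, test)),
    "simple": (accuracy(simple_model, train), accuracy(simple_model, test)),
}
print(results)  # (training accuracy, testing accuracy) for each model
```

The memorizer looks perfect on the training data but does worse on the new data, while the simpler model shows the opposite pattern. Comparing training and testing accuracy is how the gap becomes visible.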

Why Should I Care About Machine Learning?

  • There are many ways that ML can be implemented in the humanities:
    • Identifying key themes or categories in text data (newspapers, social media, books, letters, historical documents etc.)
    • Analyzing multimedia data (audio, images, etc.) and digital collections
    • ML and Natural Language Processing (NLP) for translation and linguistics
  • Even if you don’t plan to use ML, humanities scholars also have much to contribute to discussions of ethics and machine learning
    • AI will have impacts on culture, literature, language, etc.
    • More broadly, humanities can help us to better understand and center the human side of AI creation and its impacts

Caution with Machine Learning

  • I recommend using machine learning algorithms cautiously and critically
    • Results of an ML model are only as good as the data that you put into it
    • If there is pre-existing bias in your data, the results of the model will reproduce that bias
  • Sometimes, ML feels a bit like a black box - it’s easy to use, but it’s not always easy to understand where the output is coming from, or what the results really mean
  • It’s important to recognize that ML is another tool in our toolbox, but we need to think carefully about the output of our models and their implications - ML algorithms can have real, tangible impacts on people’s lives

Part II: Machine Learning Models

  • One very popular machine learning algorithm is a random forest model
    • We can use it to predict categories (classification) or continuous variables (regression) - we’ll focus on categories today!
  • In order to understand how random forests work, we must first understand the basic model that they are built on: decision trees
    • This type of model makes predictions based on how a series of questions is answered
    • Let’s look at an example - this might help make sense of the model

Example Decision Tree

  • This is a simple illustration of how decision trees work for prediction
  • This simple model predicts types of animals based on characteristics or features of the animals

[Figure: example decision tree for predicting animal types, annotated with a decision node, splitting, and tree depth]
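
Since the tree diagram itself is not reproduced here, the sketch below shows the same idea as code. The animals and the yes/no features are hypothetical stand-ins, not the ones from the original figure. Each `if` is a decision node, each branch is a split, and the nesting level corresponds to tree depth.

```python
def classify_animal(has_feathers, can_fly, has_fins):
    """A hand-written decision tree of depth 2 (features are illustrative only)."""
    if has_feathers:          # decision node at depth 1: splits birds from the rest
        if can_fly:           # decision node at depth 2
            return "hawk"
        return "penguin"
    if has_fins:              # decision node at depth 2 on the other branch
        return "dolphin"
    return "bear"

print(classify_animal(has_feathers=True, can_fly=False, has_fins=False))
```

In a real model the questions and their order are not written by hand; the algorithm chooses them from the training data.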

Decision Tree vs. Random Forests

  • Decision trees are limited because they are prone to overfitting, and are extremely sensitive to outliers in the data
  • The random forest model improves upon this greatly using the following steps:
    • Take random samples (with replacement) from the original dataset
    • For each random sample, build a decision tree using a random subset of the predictor variables
    • Combine the predictions of all the trees to produce the final prediction: a majority vote for categories, or an average for continuous values
  • As a result, random forest models tend to perform much better on test data, and are less sensitive to outliers
    • Downside: can be computationally intensive (i.e. slow), and there is no single final tree to visualize
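
The steps above can be sketched in plain Python. To keep it short, each "tree" here is only a stump that asks a single question, which is a big simplification of a real random forest; the book features (average sentence length, number of chapters), the data, and all function names are invented for illustration.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """Find the single question (feature <= threshold?) with the fewest errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # all rows identical on these features: nothing to split on
        label = majority(labels)
        return (features[0], 0.0, label, label)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        # Step 1: random sample of the data, drawn with replacement
        picks = [rng.randrange(len(rows)) for _ in rows]
        # Step 2: build a tree on that sample using a random subset of predictors
        features = rng.sample(all_features, n_features)
        forest.append(fit_stump([rows[i] for i in picks],
                                [labels[i] for i in picks], features))
    return forest

def predict_forest(forest, row):
    # Step 3: combine the trees by majority vote (we are predicting categories)
    return majority([predict_stump(stump, row) for stump in forest])

# Hypothetical data: (average sentence length, number of chapters) -> genre
rows = [(10 + i, 1 + i) for i in range(10)] + [(30 + i, 21 + i) for i in range(10)]
labels = ["poetry"] * 10 + ["novel"] * 10
forest = fit_forest(rows, labels)
print(predict_forest(forest, (12, 3)), predict_forest(forest, (35, 25)))
```

Because every tree sees a slightly different sample and a different subset of predictors, one unusual observation can mislead a few trees but rarely the majority.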

Data for Practice