Module 4 : Machine Learning
Session 4A : Introduction to AI and Machine Learning
Dr Daniel Chalk
Do HSMAs dream of electric sheep?
What is AI?
AI (Artificial Intelligence) refers to the simulation of intelligence (or aspects of intelligence) by machines. Such simulated intelligence may include :
- learning (e.g. to make decisions, such as to classify)
- reasoning (e.g. applying logical rules to come to a conclusion)
- having coherent conversations
- perception (e.g. recognising images, speech or other sounds)
In the HSMA course, we mostly focus on the first of these : learning. Specifically, we look at learning to simulate human decision making (this module), and learning of natural language patterns to extract information from written text (Natural Language Processing module).
Machine Learning
Machine Learning (ML) is a field of AI which is focused on developing algorithms that enable machines to learn by experience and by being shown data.
Often the purpose of developing ML algorithms is to make a prediction / classification given some data. This may be to provide decision support to human decision making by trying to learn inherent and possibly hidden patterns in data.
Machines should rarely replace human decision making; rather, they should be seen as Decision Support Tools.
Unless you’re building Skynet / SHODAN / HAL9000 / (insert name of AI with too much power that inevitably went wrong here)
Pros and Cons
Machine Learning allows us to :
- automate tasks undertaken by humans that would consume time and / or money resources
- try to better understand some of our own decision making, or hidden patterns in data
- allow for data to be used to generate insights
But you also need to be aware that :
- machines can and will get things wrong
- there may be ethical considerations, particularly in terms of getting things wrong (see Ethics session)
- you may end up with more questions than answers
Features
When we say that we want to teach a machine to make a prediction or classification, we mean that we want it to predict an output (or label) given a certain set of inputs. In ML terminology, inputs are referred to as features.
Examples :
- Predicting whether a patient has a certain condition or can receive a treatment based on aspects of their health, demographics, time since onset etc
- Predicting the likelihood of someone reoffending based on features such as the nature of their original crime, whether they served a custodial sentence etc
- Classifying whether the image presented is an image of a flower based on the combination of pixels
- Classifying whether the text presented is positive or negative in tone based on the words used, the structure of the text etc
Feature Weighting
Let’s imagine we are trying to classify whether or not someone is a member of our research team (PenCHORD) based on a number of features.
In Machine Learning, the algorithm assigns a weight to each feature, reflecting that feature's relative contribution to predicting the label. Features with higher weights are more important for making a prediction, and those with lower weights less so.
Looking at the above, which features do you think are more important for predicting if someone is a PenCHORDian?
Types of Learning
There are three main types of Machine Learning algorithm :
Supervised Learning algorithms are ones in which the machine is shown examples of features and their outputs (labels) (a labelled dataset), and improves by learning to map a given set of input values to the correct output.* Supervised Learning problems tend to be either classification problems (predict a discrete “class”) or regression problems (predict a continuous value).
Unsupervised Learning algorithms do not have access to any label data, and instead try to find inherent structures in the data (e.g. clusters of similar data, anomalies etc). This is useful when we don’t have data that has the “right answer”, often because we don’t know what the right answer is.
Reinforcement Learning algorithms learn by taking actions that are either rewarded or punished, and the machine gradually learns to make actions that lead to higher reward.
*There is also Semi-Supervised Learning, in which we have label data for some examples, but not all.
Classification vs Regression
Classification problems are ones in which we are trying to predict to which “class” an example belongs. Classification may be binary (two classes – often True or False) or multi-class (multiple different classes). There is also multi-label classification, where we are trying to predict more than one label / class for each example.
Examples :
- does this patient have condition x or not? (Binary classification)
- which condition does this patient have? (Multi-Class classification)
- what are the multiple conditions that this patient has? (Multi-Label classification)
Regression problems are ones in which we are trying to predict a continuous value as the output for an example. Often, the outputs represent a probability of something occurring or being true.
Examples :
- what is the probability this patient can safely receive this treatment?
- what is the probability that this patient will be imminently admitted to hospital?
Exercise 1
In your groups, you’ll now spend 15 minutes discussing potential applications for Machine Learning in your organisations. Keep a note of the ideas you discuss, as I’ll ask a couple of groups to share their ideas when we return!
Train / Test Splits
When we’re trying to teach a machine to make a prediction using Supervised Learning, we need to present it with a set of data from which it will learn. This is known as the Training Set.
The Training Set contains examples of combinations of feature values, and the associated labels (i.e. the “answers”).
But once the algorithm has “learned”, we also need a way of checking how well it’s learned. For that, we need to provide it with a set of data (complete with feature values and labels) that it has not yet seen, and against which it can check to see how well its learning translates beyond the data from which it has learned (generalisability). This is known as the Test Set.
Therefore, we split our data into some that is used for training, and some that is used for testing. This is known as a Train / Test Split.
Train / Validation / Test Splits
Later in the course, you will also hear about Validation Sets, and splitting your data into train / validation / test splits.
A Validation Set is a bit like a Test Set, but we use it whilst we’re still “tuning” our model to help us adjust the parameters for it. This differs from a Test Set, where the intention is for it to be used as our final measure of performance.
Validation Sets are commonly used when using Neural Networks for Machine Learning, which we’ll come onto later in the course.
One way to think about a validation set is a bit like taking a mock exam before the final exam (the Test Set). The result of the mock exam allows you to try to improve before the final exam.
Overfitting
When training a ML algorithm, we need to be careful about training too much, and running into a problem known as overfitting. Let’s demonstrate using some examples.
Earlier, we said that, from this data, it looks like the most important features for determining whether someone was a member of the PenCHORD team were whether they liked programming and had questionable fashion sense.
Let’s imagine that was our Training Set. Let’s now have a look at our Test Set.
Overfitting
In our Test Set, two people who like programming are not PenCHORDians compared to just one who is. Similarly, only one of the PenCHORDians has questionable fashion sense.
If this were our training set, we likely wouldn’t have come to the conclusions we did. Let’s now look at the data as a whole (training + test sets).
Overfitting
Looking at the whole dataset we see :
So what’s happened?
Overfitting
Basically, the “training” in this example (that concluded that programming and questionable fashion sense were the most important features) became too specific to the nuances of the data in the training set. These conclusions don’t hold in the wider data.
This is known as an overfit.
In Machine Learning, training is referred to as fitting (because we are trying to fit a model - essentially a line - that correctly classifies / predicts our data). Therefore, overfitting quite literally means “training too much”.
Overfitted models lack generalisability, and are typically useless for real-world use. However, we often tend to deliberately train a model too much at first, because this confirms the model is at least capable of learning the patterns in the data - it’s then easier to rein an overfitted model back in than to coax more learning out of one that isn’t learning at all.
Overfitting
Let’s consider another example. Study the following plot of two classifications of data - YES (ticks) and NO (crosses). Where would you say is the best dividing line between these classes?
Overfitting
A model that has been trained (fitted) well :
Overfitting
A model that has been overfitted :
On first inspection, it appears this model is very good. It correctly predicts 100% of Yes and No classifications in the training set. Now let’s add some points from the wider data…
Overfitting
We’ve learned a pattern that is too specific to the training data, and which isn’t generalisable. It’s overfitted.
Underfitting
Underfitting can also be a problem. This is the opposite problem – where you haven’t trained enough (or the model is otherwise too simple), and your model is rubbish on both the training set and the test set. Fortunately, because it’s rubbish on the training set too, that’s usually easier to spot!
Performance Metrics
When developing a ML model, it’s important that we know how well it’s working.
We can use various metrics to assess how well the model has learned, depending on the type of model we’re building.
Today we’ll look at a few of the basic ones for classifier models, but we’ll be covering other performance metrics later in the course.
Accuracy
When we’re building a Classifier (a ML model that makes a classification prediction), the three most commonly used basic metrics to assess performance are Accuracy, Precision and Recall.
Accuracy gives the percentage of predictions that the model makes correctly in the Test Set. E.g. “How often am I right?”. In other words :
accuracy = correct predictions / total predictions
Assuming the circles are the true values, and the words are the predictions of colours of the circles, what is the accuracy of these predictions?
Precision
Precision gives the percentage of correct predictions for the classification of interest, out of the total number of times this classification was predicted (both correctly and incorrectly). E.g. “when I say something is ‘blue’, how often am I right?”. So :
precision_c = true positives_c / (true positives_c + false positives_c)
Precision is a useful metric when classes aren’t evenly represented.
What is the precision for classifying blue circles in the above example?
- True positive : said it was blue and it is!
- False positive : said it was blue and it isn’t :(
Recall
Recall gives the percentage of correct predictions for the classification of interest, out of the total number of examples that truly have that classification. E.g. “What percentage of blues did I correctly find?”. So :
recall_c = true positives_c / (true positives_c + false negatives_c)
Recall, like Precision, is a useful metric when classes aren’t evenly represented.
- True positive : said it was blue and it is!
- False negative : said it wasn’t blue and I was wrong - it is :(
What is the recall for classifying green circles in the example above?
Accuracy, Precision and Recall
So, what do these metrics mean in reality? Imagine we are trying to build a classification model to determine whether someone has cancer or not.
A high accuracy would mean that the model correctly diagnosed people (as having or not having cancer) most of the time.
A high precision of cancer diagnoses would mean that most of the time, when it says someone has cancer, they do have cancer, and it is rarely telling someone they have cancer when they don’t.
A high recall of cancer diagnoses would mean that it’s picking up most cases of cancer, and rarely saying someone does not have cancer when they do.
Question – why might a high accuracy on its own be misleading?
World’s Best ML Model
Today I unveil to you the world’s best Machine Learning model. This model can predict whether an illegal felling of a tree in England will result in a conviction, and it is 99% accurate!!!!!
Here is the code. Take your time - it’s quite complex :
Don’t believe me? Well I tried it and it was correct 99% of the time!
Although, I’m not sure if this might have had something to do with it…
World’s Best ML Model
Accuracy on its own can be a highly misleading performance metric for a ML model, particularly if one of the classes is significantly underrepresented. This happens a lot in healthcare problems (in most cases, someone does not go into hospital, and most people do not have a condition etc).
You should always look at other metrics besides accuracy to get a picture of the performance of your model. We will explore other metrics too during this course.
There are also things you can do to try to balance up data with an underrepresented class. One such thing is to create more data for the underrepresented class using synthetic data - we’ll cover that later in the module.
Exercise 2
After a 10 minute break, you will work in your groups. Dan, having grown weary of the human race, has decided to relocate to a remote desert island. To keep himself occupied, he’s decided to pack a DVD player and a number of DVDs. We’ll suspend our disbelief about some of the logistical issues, such as lack of an electricity supply.
After 1 hour, you will be presented with a series of DVDs. Your task is to develop an “algorithm” that will attempt to classify whether or not Dan would want to take each DVD to the desert island. Dan enjoys a lot of films, but he’s got to be selective, as he can only pack so much. So he only wants to take the films he loves the most.
To help you develop your algorithm over the first hour, you will be provided with some “training data” – a list of films along with whether or not Dan decided they would make the “cut” to be taken onto the island. From this, you should decide which features you will incorporate into your training (e.g. things like director, genre, year of release etc). Use online resources (such as IMDb) to find out information about the films. For this exercise, your features should be boolean (e.g. “Is Director Steven Spielberg?”, “Was film released in 60s?”).
Your algorithm should give a weight score to each feature, so that a score is added when that feature is true (you could also score features that are false, or sometimes use negative scores). You then add up the scores for the feature values for that film, and use a threshold total score (which you decide) to determine whether the film is predicted to be a Desert Island DVD.
Exercise 2
Example :
Use the training set to try to optimise your algorithm, using the three metrics of accuracy, precision and recall to monitor how well you are training. Adjust your algorithm by changing weights, throwing out unimportant features, bringing in new features, changing the threshold etc. But remember – be wary of overfitting (and underfitting!).
After 1 hour, you’ll be given the Test Set. You’ll apply your algorithm to the Test Set examples and calculate your algorithm’s accuracy, precision and recall (you will have 30 minutes for this). Your algorithm cannot be changed once the Test Set has been released. At the end of the exercise, we’ll find out how you got on! Tip - you’ll need to work efficiently together in this exercise to gather data and make decisions.