Module 4 : Machine Learning
Session 4I : Synthetic Data using SMOTE
Dr Daniel Chalk
I am the one and phoney
With Thanks
Big thanks to my colleague Mike Allen for developing the content on which much of this session is based.
Please do check out his excellent materials as further reading here :
https://bit.ly/titanic_machine_learning
Synthetic Data
Synthetic Data refers to data that is not real (it is “fake”). In Machine Learning, we can use AI-based methods to create Synthetic Data.
Why?
- Because we may want to share or publish data for use in a model, but the original data cannot be shared or published
- Because we may not have enough data for our Machine Learning model (e.g. for a particular outcome) and we need to create more to train and improve the performance of our model
Synthetic Data Approaches
There are a number of different approaches that we can take to generate Synthetic Data.
Today, we’re going to focus on the first of these, but we’ll also give a brief overview of the others :
- Oversampling techniques, such as SMOTE
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)
Variational Autoencoders
Variational Autoencoders (VAEs) are a type of Deep Neural Network. They are very complex, but basically work using a two-step structure : an encoder compresses the original data down into a probability distribution in a latent (hidden) layer, and a decoder then reconstructs data from samples of that distribution. Training tries to minimise reconstruction error whilst keeping the latent distribution close to a simple reference distribution (competing objectives).
In other words, a VAE is trying to create data that is similar enough but also different enough. You can tweak the parameters to get closer to one objective or the other. Both the encoder and the decoder are neural networks.
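To make the structure concrete, here is a minimal, illustrative VAE sketch in Python using PyTorch. The library choice, layer sizes and latent dimension are all assumptions made for this example; nothing here is prescribed by the session :

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n_features, latent_dim=2):
        super().__init__()
        # Encoder: maps data to the parameters of a latent distribution
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU())
        self.mu = nn.Linear(16, latent_dim)       # mean of the latent distribution
        self.log_var = nn.Linear(16, latent_dim)  # log-variance of the latent distribution
        # Decoder: reconstructs data from a sample of the latent distribution
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Sample from the latent distribution (the "reparameterisation trick")
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var):
    # The two competing objectives: reconstruction error vs keeping the
    # latent distribution close to a standard normal (KL divergence)
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

Once trained on real data, you would sample from a standard normal distribution and pass the samples through the decoder to generate new synthetic rows.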
These links give some quite nice explanations :
https://visualstudiomagazine.com/articles/2021/05/06/variational-autoencoder.aspx
https://www.kdnuggets.com/2021/02/overview-synthetic-data-types-generation-methods.html
Generative Adversarial Networks
A Generative Adversarial Network (GAN) is an approach in which there are two networks working competitively – a Generator and a Discriminator.
The Generator network tries to create synthetic data from random noise (a random input vector). The Discriminator is presented with samples from the real data and samples of the synthetic data from the Generator, and has to try to work out which is real and which is synthetic.
The two networks are linked during training, so the Generator learns how the Discriminator is making its predictions and gradually gets better at making more realistic synthetic data. At the same time, the Discriminator gradually gets better at telling apart data samples that are increasingly difficult to differentiate. Each reinforces the other, and both get increasingly better – each trying to “outsmart” the other.
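As a rough illustration of that competition, here is a minimal GAN training-loop sketch in PyTorch. The network sizes, optimisers and the placeholder “real” data are all assumptions made purely for this example :

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

latent_dim, n_features = 8, 4
real_data = torch.randn(256, n_features)  # placeholder standing in for real data
real_loader = DataLoader(real_data, batch_size=32)

generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                          nn.Linear(32, n_features))
discriminator = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                              nn.Linear(32, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for real_batch in real_loader:
    n = real_batch.size(0)
    # Train the Discriminator: label real samples 1, synthetic samples 0
    fake_batch = generator(torch.randn(n, latent_dim)).detach()
    d_loss = (loss_fn(discriminator(real_batch), torch.ones(n, 1)) +
              loss_fn(discriminator(fake_batch), torch.zeros(n, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Train the Generator: try to make the Discriminator output 1 for fakes
    fake_batch = generator(torch.randn(n, latent_dim))
    g_loss = loss_fn(discriminator(fake_batch), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()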
SMOTE
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling (or “data augmentation”) approach.
It is one of the most commonly used approaches for creating synthetic data for the purposes of augmenting the data in an under-represented class.
SMOTE works by creating new data points that are close to existing data points. This allows us to expand the number of data points we can use (or replace our real data points with synthetic data points).
SMOTE
We’re going to focus on SMOTE today.
Let’s have a look at how SMOTE works. Here we’ll use a 2-dimensional example (two features) to make this easier to visualise, but obviously in reality this would typically be happening in multi-dimensional space. However, the principle remains the same.
SMOTE
1. Pick a data point at random from the class being oversampled
SMOTE
2. Find k nearest neighbours (default = 6)
SMOTE
3. Pick one of the k nearest neighbours at random
SMOTE
4. Generate a new data point at a random distance between the two
SMOTE
5. Repeat 1 – 4 until required number of new data points created (see the sketch below)
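Here is a from-scratch sketch of these steps in Python, using numpy and scikit-learn’s NearestNeighbors. The data, k and n_new values are purely illustrative – in the exercises we’ll use the imbalanced-learn implementation instead :

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X, n_new, k=3, rng=np.random.default_rng(42)):
    # Fit nearest neighbours on the (minority class) data points;
    # +1 because each point counts itself as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neighbour_idx = nn.kneighbors(X)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))            # 1. pick a data point at random
        neighbours = neighbour_idx[i][1:]   # 2. its k nearest neighbours (excluding itself)
        j = rng.choice(neighbours)          # 3. pick one of those neighbours at random
        step = rng.random()                 # 4. new point at a random distance between the two
        new_points.append(X[i] + step * (X[j] - X[i]))
    return np.array(new_points)             # 5. repeat until n_new points created

X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5],
              [1.8, 1.9], [2.2, 2.4], [1.4, 2.1]])
print(smote_sketch(X, n_new=5))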
SMOTE
Quick question : why is it important that the distance between the two data points (at which the new synthetic data point will sit) is random?
Integer, Binary and Categorical Data
The data points that SMOTE generates are floating point numbers (e.g. 25.7, 6.17359).
But very often our features might contain data that is :
- Integer (whole numbers)
- Binary / Boolean (1 or 0, True / False)
- Categorical (e.g. Module 1 / Module 2 / Module 3)
There are different ways in which we can deal with these kinds of data so we can generate synthetic data points for these kinds of features.
Integer, Binary and Categorical Data
Here’s how we’ll deal with each of these types of data.
Integer (whole numbers)
If we’ve generated synthetic data for a feature represented by integers, we’ll take the synthetic data floating point number, and round it to the nearest integer.
Binary / Boolean (1 or 0, True / False)
We’ll code our binary / boolean values as 0s and 1s before we generate the synthetic data. Then, we’ll round our synthetic data points to the nearest integer (as above).
Categorical (e.g. Module 1 / Module 2 / Module 3)
See next slide...
Categorical Data
Categorical (e.g. Module 1 / Module 2 / Module 3)
First, before we generate the synthetic data, we’ll One-Hot Encode the categorical data, turning each category into its own 0 / 1 column.
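For example, here’s a minimal sketch using pandas (the column and category names are invented for illustration) :

import pandas as pd

df = pd.DataFrame({"module": ["Module 1", "Module 2", "Module 3", "Module 1"]})
one_hot = pd.get_dummies(df["module"], dtype=float)
print(one_hot)
#    Module 1  Module 2  Module 3
# 0       1.0       0.0       0.0
# 1       0.0       1.0       0.0
# 2       0.0       0.0       1.0
# 3       1.0       0.0       0.0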
Categorical Data
Then we can generate the synthetic data. Because the one-hot categories are separate features, SMOTE will generate a floating point number for each category column of a new data point. So we’ll just find the category with the highest value, set that value to 1, and set the values of the other categories to 0.
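For example, here’s a sketch of this post-processing step in Python. The feature names and values are invented for illustration, with synthetic standing in for a DataFrame of raw SMOTE output :

import numpy as np
import pandas as pd

synthetic = pd.DataFrame({
    "age": [25.7, 61.2],      # integer feature
    "male": [0.83, 0.12],     # binary feature
    "Module 1": [0.6, 0.1],   # one-hot encoded categorical group
    "Module 2": [0.3, 0.2],
    "Module 3": [0.1, 0.7],
})

# Integer and binary features: round to the nearest whole number
synthetic["age"] = synthetic["age"].round().astype(int)
synthetic["male"] = synthetic["male"].round().astype(int)

# Categorical features: find the category with the highest value per row,
# set it to 1, and set the other categories to 0
one_hot_cols = ["Module 1", "Module 2", "Module 3"]
winners = synthetic[one_hot_cols].values.argmax(axis=1)
restored = np.zeros((len(synthetic), len(one_hot_cols)))
restored[np.arange(len(synthetic)), winners] = 1
synthetic[one_hot_cols] = restored
print(synthetic)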
Removing Duplicate / Close Data
When we generate synthetic data, we remove data points that are identical or very close to the real world data.
Remember, there are typically two reasons for creating Synthetic Data :
- Because we may want to share or publish data for use in a model, but the original data cannot be shared or published
- Because we may not have enough data for our Machine Learning model (e.g. for a particular outcome) and we need to create more to train and improve the performance of our model
For the first reason, we need to ensure that somebody can’t tell what the original data was. We need it to be a good fake, but not too good (or identical). For the second, we want to give the model more information than we currently have (values that are too close to existing ones don’t add much).
Removing Duplicate / Close Data
So, once we’ve generated our synthetic data, we typically :
- look to remove synthetic data points that are identical to real world data
- remove the x% of synthetic data points that are closest to the real world data
The specific % we choose to remove will vary depending on our data (and the synthetic data that’s been generated). 10% is a good initial value to go with. You could also do a more in-depth analysis that looks at how close your synthetic values are.
Because we need to get rid of duplicates and close values, when we tell SMOTE how many synthetic data points we want, it’s useful to ask for far more than we need (e.g. double), to allow for the fact that we’ll have to remove some of them.
Removing Duplicate / Close Data
We identify the closeness of points using Cartesian (straight-line, or Euclidean) distances – the same way in which nearest neighbours are selected in SMOTE.
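Here’s a sketch of this removal step in Python. X_real and X_synth are assumed to be numpy arrays of real and synthetic points, and the 10% default matches the initial value suggested above :

import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_close_points(X_real, X_synth, frac_to_remove=0.10):
    # Distance from each synthetic point to its nearest real point
    nn = NearestNeighbors(n_neighbors=1).fit(X_real)
    dist, _ = nn.kneighbors(X_synth)
    dist = dist.ravel()
    keep = dist > 0  # remove exact duplicates of real data points
    # Of the rest, also remove the frac_to_remove closest to the real data
    threshold = np.quantile(dist[keep], frac_to_remove)
    keep &= dist > threshold
    return X_synth[keep]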
imbalanced-learn
imbalanced-learn is a Python package which contains resampling techniques that are useful when you have imbalance between classes in your machine learning data. This includes an implementation of SMOTE, and it’s the implementation we will use in this session.
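A minimal usage sketch (the toy data here is random and purely illustrative) :

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 2))
y = np.array([0] * 100 + [1] * 20)  # imbalanced: 100 vs 20

# By default, SMOTE oversamples the minority class up to the majority count
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y_res))  # [100 100]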
Total Data Points
The imbalanced-learn implementation of SMOTE requires us to give it the total number of data points needed for each class. This is the number of data points we have in the original data for a class, plus however many new (synthetic) data points we want to create for that class.
Total Data Points
Say we ask for 200 data points per class. SMOTE now knows it has 200 “slots” to fill for each class. First, it fills the slots in each class with the original data for that class.
Total Data Points
Then it applies the SMOTE oversampling technique we described earlier to generate synthetic data points to fill the remaining slots in each class.
Total Data Points
If we’re using SMOTE to generate additional data points to augment our data, we will use all of the data it returns.
But if we’re using the new data points to replace our data, we only want the synthetic data. We can just grab it from the point in the array after however many original data points we’ve got (see the sketch below).
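Here’s a sketch of both uses, passing the total number of data points per class via SMOTE’s sampling_strategy argument. The counts and toy data are illustrative, matching the 200-slot example above :

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = np.array([0] * 100 + [1] * 50)

# Total data points wanted per class = original + synthetic ("slots")
smote = SMOTE(sampling_strategy={0: 200, 1: 200}, random_state=0)
X_res, y_res = smote.fit_resample(X, y)

# Augmenting: use everything in X_res / y_res.
# Replacing: imbalanced-learn appends the synthetic points after the
# originals, so slice past the original data to keep only the synthetic
X_synth = X_res[len(X):]
y_synth = y_res[len(y):]
print(len(X_res), len(X_synth))  # 400 250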
Exercise 1
After a 10 minute break, you’ll work in your groups on the following tasks (note - you’ll need to activate the smote_hsma environment for this exercise) :
Exercise 1
3. This task is a little more advanced. Here, you will generate some new synthetic data from the same stroke data, but this time with the purpose of augmenting the number of data points for the Positive Class (Class 1). In the original data, you’ll see that there are 209 positive class examples and 4,700 negative class examples. You will generate synthetic data for the positive class only, creating another 2,000 data points. Then you will add these to the original data and try fitting a Logistic Regression model on this new dataset (with 2,209 positive class examples and 4,700 negative class examples).
Each code cell in this section contains some comments describing broadly what the code should do, but you’ll need to work out how to do it. Think carefully about what you’re doing, and adapt the code you’ve seen accordingly.
Once you’ve trained a model using this new dataset, what do you conclude? Has this helped or hindered model performance in this instance?
You have a total of 1 hour 45 minutes to complete this exercise. When we come back, I’ll ask a couple of groups to present what they came up with and talk about the results they got.