Module 4 : Machine Learning
Session 4I : Synthetic Data using SMOTE
Dr Daniel Chalk
I am the one and phoney
With Thanks
Big thanks to my colleague Mike Allen for developing the content on which much of this session is based.
Please do check out his excellent materials as further reading here :
https://bit.ly/titanic_machine_learning
Synthetic Data
Synthetic Data refers to data that is not real (it is “fake”). In Machine Learning, we can use AI-based methods to create Synthetic Data.
Why?
- Because we may want to share or publish data for use in a model, but the original data cannot be shared or published
- Because we may not have enough data for our Machine Learning model (e.g. for a particular outcome) and we need to create more to train and improve the performance of our model
Synthetic Data Approaches
There are a number of different approaches that we can take to generate Synthetic Data.
Today, we’re going to focus on the first of these, but we’ll also give a brief overview of the others :
- Oversampling techniques, such as SMOTE
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)
Variational Autoencoders
Variational Autoencoders (VAEs) are a type of Deep Neural Network. They are very complex, but basically work using a two-step structure : an encoder compresses the original data down into a probability distribution in a latent (hidden) layer, and a decoder then reconstructs data from samples of that distribution. Training tries to minimise reconstruction error whilst keeping the latent distribution close to a simple reference distribution (competing objectives).
In other words, a VAE is trying to create data that is similar enough but also different enough. You can tweak the parameters to get closer to one objective or the other. Both the encoder and the decoder are neural networks.
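To make the structure concrete, here is a minimal, illustrative VAE sketch in Python using PyTorch. The library choice, layer sizes and latent dimension are all assumptions made for this example; nothing here is prescribed by the session :

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n_features, latent_dim=2):
        super().__init__()
        # Encoder: maps data to the parameters of a latent distribution
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU())
        self.mu = nn.Linear(16, latent_dim)       # mean of the latent distribution
        self.log_var = nn.Linear(16, latent_dim)  # log-variance of the latent distribution
        # Decoder: reconstructs data from a sample of the latent distribution
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Sample from the latent distribution (the "reparameterisation trick")
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var):
    # The two competing objectives: reconstruction error vs keeping the
    # latent distribution close to a standard normal (KL divergence)
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

Once trained on real data, you would sample from a standard normal distribution and pass the samples through the decoder to generate new synthetic rows.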
These links give some quite nice explanations :
https://visualstudiomagazine.com/articles/2021/05/06/variational-autoencoder.aspx
https://www.kdnuggets.com/2021/02/overview-synthetic-data-types-generation-methods.html
Generative Adversarial Networks
A Generative Adversarial Network (GAN) is an approach in which there are two networks working competitively – a Generator and a Discriminator.
The Generator network tries to create synthetic data from random noise (a random input vector). The Discriminator is presented with samples from the real data and samples of the synthetic data from the Generator, and has to try to work out which is real and which is synthetic.
The two networks are linked during training, so the Generator learns how the Discriminator is making its predictions and gradually gets better at making more realistic synthetic data. At the same time, the Discriminator gradually gets better at telling apart data samples that are increasingly difficult to differentiate. Each reinforces the other, and both get increasingly better – each trying to “outsmart” the other.
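As a rough illustration of that competition, here is a minimal GAN training-loop sketch in PyTorch. The network sizes, optimisers and the placeholder “real” data are all assumptions made purely for this example :

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

latent_dim, n_features = 8, 4
real_data = torch.randn(256, n_features)  # placeholder standing in for real data
real_loader = DataLoader(real_data, batch_size=32)

generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                          nn.Linear(32, n_features))
discriminator = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                              nn.Linear(32, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for real_batch in real_loader:
    n = real_batch.size(0)
    # Train the Discriminator: label real samples 1, synthetic samples 0
    fake_batch = generator(torch.randn(n, latent_dim)).detach()
    d_loss = (loss_fn(discriminator(real_batch), torch.ones(n, 1)) +
              loss_fn(discriminator(fake_batch), torch.zeros(n, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Train the Generator: try to make the Discriminator output 1 for fakes
    fake_batch = generator(torch.randn(n, latent_dim))
    g_loss = loss_fn(discriminator(fake_batch), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()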
SMOTE
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling (or “data augmentation”) approach.
It is one of the most commonly used approaches for creating synthetic data for the purposes of augmenting the data in an under-represented class.
SMOTE works by creating new data points that are close to existing data points. This allows us to expand the number of data points we can use (or replace our real data points with synthetic data points).
SMOTE
We’re going to focus on SMOTE today.
Let’s have a look at how SMOTE works. Here we’ll use a 2-dimensional example (two features) to make this easier to visualise, but obviously in reality this would typically be happening in multi-dimensional space. However, the principle remains the same.
SMOTE
1. Pick a data point at random from the class being oversampled
SMOTE
2. Find k nearest neighbours (default = 6)
SMOTE
3. Pick one of the k nearest neighbours at random
SMOTE
4. Generate a new data point at a random distance between the two
SMOTE
5. Repeat 1 – 4 until required number of new data points created (see the sketch below)
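Here is a from-scratch sketch of these steps in Python, using numpy and scikit-learn’s NearestNeighbors. The data, k and n_new values are purely illustrative – in the exercises we’ll use the imbalanced-learn implementation instead :

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X, n_new, k=3, rng=np.random.default_rng(42)):
    # Fit nearest neighbours on the (minority class) data points;
    # +1 because each point counts itself as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neighbour_idx = nn.kneighbors(X)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))            # 1. pick a data point at random
        neighbours = neighbour_idx[i][1:]   # 2. its k nearest neighbours (excluding itself)
        j = rng.choice(neighbours)          # 3. pick one of those neighbours at random
        step = rng.random()                 # 4. new point at a random distance between the two
        new_points.append(X[i] + step * (X[j] - X[i]))
    return np.array(new_points)             # 5. repeat until n_new points created

X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5],
              [1.8, 1.9], [2.2, 2.4], [1.4, 2.1]])
print(smote_sketch(X, n_new=5))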
SMOTE
Quick question : why is it important that the distance between the two data points (at which the new synthetic data point will sit) is random?
Integer, Binary and Categorical Data
The data points that SMOTE generates are floating point numbers (e.g. 25.7, 6.17359).
But very often our features might contain data that is :
- Integer (whole numbers)
- Binary / Boolean (1 or 0, True / False)
- Categorical (e.g. Module 1 / Module 2 / Module 3)
There are different ways in which we can deal with these kinds of data so we can generate synthetic data points for these kinds of features.
Integer, Binary and Categorical Data
Here’s how we’ll deal with each of these types of data.
Integer (whole numbers)
If we’ve generated synthetic data for a feature represented by integers, we’ll take the synthetic data floating point number, and round it to the nearest integer.
Binary / Boolean (1 or 0, True / False)
We’ll code our binary / boolean values as 0s and 1s before we generate the synthetic data. Then, we’ll round our synthetic data points to the nearest integer (as above).
Categorical (e.g. Module 1 / Module 2 / Module 3)
See next slide...
Categorical Data
Categorical (e.g. Module 1 / Module 2 / Module 3)
First, before we generate the synthetic data, we’ll One-Hot Encode the categorical data, turning each category into its own 0 / 1 column.
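For example, here’s a minimal sketch using pandas (the column and category names are invented for illustration) :

import pandas as pd

df = pd.DataFrame({"module": ["Module 1", "Module 2", "Module 3", "Module 1"]})
one_hot = pd.get_dummies(df["module"], dtype=float)
print(one_hot)
#    Module 1  Module 2  Module 3
# 0       1.0       0.0       0.0
# 1       0.0       1.0       0.0
# 2       0.0       0.0       1.0
# 3       1.0       0.0       0.0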
Categorical Data
Then we can generate the synthetic data. Because the one-hot categories are separate features, SMOTE will generate a floating point number for each category column of a new data point. So we’ll just find the category with the highest value, set that value to 1, and set the values of the other categories to 0.
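For example, here’s a sketch of this post-processing step in Python. The feature names and values are invented for illustration, with synthetic standing in for a DataFrame of raw SMOTE output :

import numpy as np
import pandas as pd

synthetic = pd.DataFrame({
    "age": [25.7, 61.2],      # integer feature
    "male": [0.83, 0.12],     # binary feature
    "Module 1": [0.6, 0.1],   # one-hot encoded categorical group
    "Module 2": [0.3, 0.2],
    "Module 3": [0.1, 0.7],
})

# Integer and binary features: round to the nearest whole number
synthetic["age"] = synthetic["age"].round().astype(int)
synthetic["male"] = synthetic["male"].round().astype(int)

# Categorical features: find the category with the highest value per row,
# set it to 1, and set the other categories to 0
one_hot_cols = ["Module 1", "Module 2", "Module 3"]
winners = synthetic[one_hot_cols].values.argmax(axis=1)
restored = np.zeros((len(synthetic), len(one_hot_cols)))
restored[np.arange(len(synthetic)), winners] = 1
synthetic[one_hot_cols] = restored
print(synthetic)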
Removing Duplicate / Close Data
When we generate synthetic data, we remove data points that are identical or very close to the real world data.
Remember, there are typically two reasons for creating Synthetic Data :
- Because we may want to share or publish data for use in a model, but the original data cannot be shared or published
- Because we may not have enough data for our Machine Learning model (e.g. for a particular outcome) and we need to create more to train and improve the performance of our model
For the first reason, we need to ensure that somebody can’t tell what the original data was. We need it to be a good fake, but not too good (or identical). For the second, we want to give the model more information than we currently have (values that are too close to existing ones don’t add much).
Removing Duplicate / Close Data
So, once we’ve generated our synthetic data, we typically :
- look to remove synthetic data points that are identical to real world data
- remove the x% of synthetic data points that are closest to the real world data
The specific % we choose to remove will vary depending on our data (and the synthetic data that’s been generated). 10% is a good initial value to go with. You could also do a more in-depth analysis that looks at how close your synthetic values are.
Because we need to get rid of duplicates and close values, when we tell SMOTE how many synthetic data points we want, it’s useful to ask for far more than we need (e.g. double), to allow for the fact that we’ll have to remove some of them.
Removing Duplicate / Close Data
We identify the closeness of points using Cartesian (straight-line, or Euclidean) distances – the same way in which nearest neighbours are selected in SMOTE.
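Here’s a sketch of this removal step in Python. X_real and X_synth are assumed to be numpy arrays of real and synthetic points, and the 10% default matches the initial value suggested above :

import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_close_points(X_real, X_synth, frac_to_remove=0.10):
    # Distance from each synthetic point to its nearest real point
    nn = NearestNeighbors(n_neighbors=1).fit(X_real)
    dist, _ = nn.kneighbors(X_synth)
    dist = dist.ravel()
    keep = dist > 0  # remove exact duplicates of real data points
    # Of the rest, also remove the frac_to_remove closest to the real data
    threshold = np.quantile(dist[keep], frac_to_remove)
    keep &= dist > threshold
    return X_synth[keep]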
imbalanced-learn
imbalanced-learn is a Python package which contains resampling techniques that are useful when you have imbalance between classes in your machine learning data. This includes an implementation of SMOTE, and it’s the implementation we will use in this session.
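A minimal usage sketch (the toy data here is random and purely illustrative) :

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 2))
y = np.array([0] * 100 + [1] * 20)  # imbalanced: 100 vs 20

# By default, SMOTE oversamples the minority class up to the majority count
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y_res))  # [100 100]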
Total Data Points
The imbalanced-learn implementation of SMOTE requires us to give it the total number of data points needed for each class. This is the number of data points we have in the original data for a class, plus however many new (synthetic) data points we want to create for that class.
Total Data Points
Say we ask for 200 data points per class. SMOTE now knows it has 200 “slots” to fill for each class. First, it fills the slots in each class with the original data for that class.
Total Data Points
Then it applies the SMOTE oversampling technique we described earlier to generate synthetic data points to fill the remaining slots in each class.
Total Data Points
If we’re using SMOTE to generate additional data points to augment our data, we will use all of the data it returns.
But if we’re using the new data points to replace our data, we only want the synthetic data. We can just grab it from the point in the array after however many original data points we’ve got (see the sketch below).
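Here’s a sketch of both uses, passing the total number of data points per class via SMOTE’s sampling_strategy argument. The counts and toy data are illustrative, matching the 200-slot example above :

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = np.array([0] * 100 + [1] * 50)

# Total data points wanted per class = original + synthetic ("slots")
smote = SMOTE(sampling_strategy={0: 200, 1: 200}, random_state=0)
X_res, y_res = smote.fit_resample(X, y)

# Augmenting: use everything in X_res / y_res.
# Replacing: imbalanced-learn appends the synthetic points after the
# originals, so slice past the original data to keep only the synthetic
X_synth = X_res[len(X):]
y_synth = y_res[len(y):]
print(len(X_res), len(X_synth))  # 400 250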
Exercise 1
After a 10 minute break, you’ll work in your groups on the following tasks (note - you’ll need to activate the smote_hsma environment for this exercise) :
Exercise 1
3. This task is a little more advanced. Here, you will generate some new synthetic data from the same stroke data, but this time with the purpose of augmenting the number of data points for the Positive Class (Class 1). In the original data, you’ll see that there are 209 positive class examples and 4,700 negative class examples. You will generate synthetic data for the positive class only, creating another 2,000 data points. Then you will add these to the original data and try fitting a Logistic Regression model on this new dataset (with 2,209 positive class examples and 4,700 negative class examples).
Each code cell in this section contains some comments describing broadly what the code should do, but you’ll need to work out how to do it. Think carefully about what you’re doing, and adapt the code you’ve seen accordingly.
Once you’ve trained a model using this new dataset, what do you conclude? Has this helped or hindered model performance in this instance?
You have a total of 1 hour 45 minutes to complete this exercise. When we come back, I’ll ask a couple of groups to present what they came up with and talk about the results they got.