1 of 44


Wine Quality Dataset

Individual project – Beginner group

Yujin Kim (01815358)

2 of 44

Contents

  1. Introduction to dataset
  2. Data preprocessing with visualization
  3. Modeling
  4. Evaluation


3 of 44

1. Introduction to Wine Quality Dataset

  • The dataset is related to red variants of the Portuguese “Vinho Verde” wine.
  • The dataset describes the amount of various chemicals present in wine and their effect on its quality.

Challenges

  • Imbalanced classes

Goal

- Creating a classification model to predict wine quality


4 of 44

1. Introduction to Dataset

  • Load the data
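The loading step is a one-liner in pandas. A minimal sketch, assuming the standard UCI `winequality-red.csv` file (which is semicolon-separated); two illustrative rows stand in for the file here:

```python
import pandas as pd
from io import StringIO

# In the notebook this would simply be:
#   df = pd.read_csv("winequality-red.csv", sep=";")
# (the UCI file is semicolon-separated). Two illustrative rows stand in here.
csv_text = (
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.70;0.00;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0.00;2.6;0.098;25;67;0.9968;3.20;0.68;9.8;5\n"
)
df = pd.read_csv(StringIO(csv_text), sep=";")
print(df.shape)  # the full red-wine dataset has shape (1599, 12)
```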


5 of 44

1. Introduction to Dataset

  • Columns (Wine characteristics)
  • 11 Input variables (based on physicochemical tests)
  • 1 Output variable (based on sensory data)


  1. Fixed acidity
  2. Volatile acidity
  3. Citric acid
  4. Residual sugar
  5. Chlorides
  6. Free sulfur dioxide
  7. Total sulfur dioxide
  8. Density
  9. pH
  10. Sulphates
  11. Alcohol
  12. Quality (score between 0 and 10)

6 of 44

2. Data preprocessing

  • Check the dataset for null and missing values
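The check itself is a couple of pandas calls (sketched here on a toy frame with one missing value planted on purpose):

```python
import pandas as pd

# Toy frame standing in for the wine DataFrame; one missing value on purpose.
df = pd.DataFrame({"fixed acidity": [7.4, 7.8, None], "quality": [5, 5, 6]})

df.info()                        # dtypes and non-null counts per column
null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)
```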


7 of 44

2. Data preprocessing

  • Detect the outliers
  • Use the IQR method to set up a “fence”
  • Any values that fall outside this fence are considered outliers
  • Minimum & maximum fence
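A minimal sketch of the IQR fence and the drop step. The slides do not show which columns are fenced; this version fences every numeric column, which on the real data would drop more rows than the 37 reported:

```python
import pandas as pd

def iqr_fences(s: pd.Series):
    # Fence = [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; values outside are outliers.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def drop_outliers(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows inside the fence for every numeric column.
    mask = pd.Series(True, index=df.index)
    for col in df.select_dtypes(include="number").columns:
        lo, hi = iqr_fences(df[col])
        mask &= df[col].between(lo, hi)
    return df[mask]

# Tiny demo: 99 falls far outside the fence of column "a" and is dropped.
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5, 99]})
print(drop_outliers(demo))
```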


8 of 44

2. Data preprocessing

  • Detect outliers to drop

  • Drop the outliers

  • Result: 1599 -> 1562 rows (37 rows dropped)


9 of 44

2. Data preprocessing with visualization

  • Correlation between the variables


10 of 44

2. Data preprocessing with visualization

  • Visualize the correlation matrix with heatmap
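In a notebook this is typically the seaborn one-liner `sns.heatmap(df.corr(), annot=True)`; a matplotlib-only sketch of the same plot (on a random toy frame) looks like this:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy frame standing in for the wine DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["alcohol", "density", "quality"])

corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
for i in range(len(corr)):       # annotate each cell with its value
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im)
fig.savefig("heatmap.png")
```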


11 of 44

2. Data preprocessing with visualization


12 of 44

2. Data preprocessing with visualization


13 of 44

2. Data preprocessing with visualization

  • Correlation Heatmap with values


14 of 44

2. Data preprocessing with visualization

  • Find the columns that are weakly correlated with quality
  • Correlation below 0.1

  • 3 columns detected

-> Residual sugar, free sulfur dioxide, pH


15 of 44

2. Data preprocessing

  • Drop less correlated columns

  • After that, we have 9 columns (12 -> 9)
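The select-and-drop step can be sketched as below. The toy frame is built so that `weak` has zero correlation with quality while `strong` tracks it exactly:

```python
import pandas as pd

# Toy frame: "strong" tracks quality exactly; "weak" is constructed to have
# zero correlation with it.
df = pd.DataFrame({
    "strong":  [1, 2, 3, 4, 5, 6],
    "weak":    [1, -1, 0, 0, -1, 1],
    "quality": [1, 2, 3, 4, 5, 6],
})

corr_q = df.corr()["quality"].drop("quality").abs()
to_drop = corr_q[corr_q < 0.1].index.tolist()  # columns below the 0.1 cut-off
df = df.drop(columns=to_drop)
print(to_drop, df.shape)
```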


16 of 44

2. Data preprocessing

  • Re-scaling

- First, check the skewness of each column


17 of 44

2. Data preprocessing

  • Re-scaling
  • For fixed acidity & total sulfur dioxide

-> log transformation

  • For chlorides & sulphates
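The transform used for chlorides & sulphates is not shown on the slide, so only the log step is sketched here, on a right-skewed toy column (using `np.log1p`, i.e. log(1 + x), which is safe at zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed toy column standing in for e.g. total sulfur dioxide.
df = pd.DataFrame({"total sulfur dioxide": rng.exponential(scale=30, size=500)})

before = df["total sulfur dioxide"].skew()
# Log transform for the right-skewed columns.
df["total sulfur dioxide"] = np.log1p(df["total sulfur dioxide"])
after = df["total sulfur dioxide"].skew()
print(f"skew before: {before:.2f}, after: {after:.2f}")
```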


18 of 44

2. Data preprocessing

  • Re-scaling
  • Finally, check the skewness of each column to see the difference


19 of 44

3. Modeling

  • Creating a dataset -> divide into input data (x) and output data (y)

  • As you can see, the quality column has imbalanced classes


There are many more normal wines than excellent or poor ones
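The split and the imbalance check are two lines of pandas (sketched on a toy frame):

```python
import pandas as pd

# Toy frame standing in for the preprocessed wine DataFrame.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 9.5, 11.0, 12.0],
    "quality": [5, 5, 5, 6, 6, 7],
})

x = df.drop(columns=["quality"])      # input data
y = df["quality"]                     # output data
print(y.value_counts().sort_index())  # class counts reveal the imbalance
```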

20 of 44

3. Modeling Artificial Neural Network(ANN)

  • Seed fix

- Seed function

: fixes the state of the random number generator so that it produces the same random numbers on multiple executions of the code, on the same machine or on different machines

  • Define a loss function & optimizer
  • use Cross-Entropy loss
  • Adam for gradient descent with learning rate 0.001
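The setup above can be sketched as follows; the placeholder model and its sizes are assumptions, while the loss, optimizer, and learning rate come from the slide:

```python
import random
import numpy as np
import torch
import torch.nn as nn

def set_seed(seed: int = 42) -> None:
    # Fix every RNG the pipeline touches so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)

model = nn.Linear(8, 6)            # placeholder model (8 features, 6 classes)
criterion = nn.CrossEntropyLoss()  # Cross-Entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001
```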


21 of 44

3. Modeling – ANN1 model

  • Define an Artificial Neural Network:

- ANN1 model

- 7 hidden layers
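A sketch of such a network; the slide states only "7 hidden layers", so the layer widths below are illustrative guesses, not the actual ANN1:

```python
import torch
import torch.nn as nn

class ANN1(nn.Module):
    # 7 hidden layers as on the slide; the widths are illustrative guesses.
    def __init__(self, n_features: int = 8, n_classes: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, n_classes),  # output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ANN1()
print(model(torch.randn(4, 8)).shape)
```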


22 of 44

3. Modeling

  • Train the model
  • Loop over the data iterator, feed the inputs to the network, and optimize
  • The following code collects the loss and accuracy calculated while training the model
  • Train accuracy and train loss are stored in lists to plot the ‘Train Accuracy vs Loss’ graph
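A sketch of that training loop, run here on random stand-in data (the model and data are placeholders; the loop structure and the accuracy/loss bookkeeping are the point):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer):
    # One pass over the data: forward, loss, backward, optimizer step,
    # while accumulating running loss and accuracy.
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for inputs, labels in loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * labels.size(0)
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return running_loss / total, 100.0 * correct / total

# Demo on random data; the lists feed the accuracy/loss plots later on.
torch.manual_seed(0)
model = nn.Linear(8, 6)
loader = DataLoader(
    TensorDataset(torch.randn(64, 8), torch.randint(0, 6, (64,))),
    batch_size=16,
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

train_losses, train_accu = [], []
for epoch in range(3):
    loss, acc = train_one_epoch(model, loader, criterion, optimizer)
    train_losses.append(loss)
    train_accu.append(acc)
```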


23 of 44

3. Modeling

  • Test the network
  • Test accuracy and test loss are likewise stored in lists to plot the ‘Test Accuracy vs Loss’ graph
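The test pass mirrors the training loop, just without gradient updates (again on random stand-in data):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def evaluate(model, loader, criterion):
    # Same loss/accuracy bookkeeping as training, but no gradient updates.
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    for inputs, labels in loader:
        outputs = model(inputs)
        running_loss += criterion(outputs, labels).item() * labels.size(0)
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return running_loss / total, 100.0 * correct / total

torch.manual_seed(0)
model = nn.Linear(8, 6)
loader = DataLoader(
    TensorDataset(torch.randn(32, 8), torch.randint(0, 6, (32,))),
    batch_size=16,
)
eval_losses, eval_accu = [], []
loss, acc = evaluate(model, loader, nn.CrossEntropyLoss())
eval_losses.append(loss)
eval_accu.append(acc)
```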


24 of 44

3. Modeling

  • Train and validate the network on the training data.

- At first, set the number of epochs to 1000


25 of 44

3. Modeling

  • Plotting the Accuracy graph
  • Using the ‘train_accu’ and ‘eval_accu’ lists from the training and validation above
  • Make a plot that shows train and valid accuracy


26 of 44

3. Modeling

  • Plotting the Losses graph
  • Using the ‘train_losses’ and ‘eval_losses’ lists from the training and validation above
  • Make a plot that shows train and valid losses
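Both plots follow the same pattern, so one helper covers them (sketch with dummy values; in the deck it is called with the `train_accu`/`eval_accu` and `train_losses`/`eval_losses` lists):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def plot_curves(train_vals, eval_vals, ylabel, fname):
    # One figure showing the train and valid curves over epochs.
    plt.figure()
    plt.plot(train_vals, label="train")
    plt.plot(eval_vals, label="valid")
    plt.xlabel("epoch")
    plt.ylabel(ylabel)
    plt.legend()
    plt.savefig(fname)
    plt.close()

# e.g. plot_curves(train_accu, eval_accu, "accuracy", "accuracy.png")
#      plot_curves(train_losses, eval_losses, "loss", "losses.png")
plot_curves([50, 55, 60], [48, 52, 54], "accuracy", "accuracy.png")
```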


27 of 44

3. Modeling

  • To see them all together, plot them as subplots
  • Because the number of epochs is so large, it is hard to see the tendency right away


28 of 44

3. Modeling

  • Decreasing the number of epochs to 300 by slicing each list, we can see the ‘Train vs Valid Loss & Accuracy’ at a glance
  • It looks similar to the graph below


29 of 44

3. Modeling

  • Both of them show overfitting


30 of 44

3. Modeling

  • From the AI Winter School, we have learned how to prevent overfitting in deep learning.
  • First, decrease the model complexity (hidden layers)


31 of 44

3. Modeling – ANN2 model

  • Decrease the model complexity (hidden layers)

- Decreased the hidden layers from 7 to 4


32 of 44

3. Modeling – ANN2 model

  • Decreased the number of epochs to 300
  • But the accuracy is still very low


Accuracy: 51 -> 55

33 of 44

3. Modeling – ANN3 model

  • Decrease the model complexity (hidden layers)

- Decreased the hidden layers further, from 4 to 2


34 of 44

3. Modeling – ANN3 model

  • The accuracy is still low


Accuracy: 55 -> 60

35 of 44

3. Modeling – ANN3

  • Since we know that the data is imbalanced, we need to look at the F1 score as well
  • The F1 score should be used when the dataset is imbalanced
  • It provides reliable results for a wide range of datasets, whether imbalanced or not
  • Accuracy, on the other hand, struggles to perform well outside of well-balanced datasets


F1 score: 0.363

36 of 44

3. Modeling – ANN4 model

  • Giving weights to each class, to see if it increases the F1 score
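Class weights plug straight into the loss. The slide does not show the weighting scheme, so inverse-frequency weights are used below as one common choice, on hypothetical imbalanced labels:

```python
import torch
import torch.nn as nn

# Hypothetical imbalanced labels; the actual weighting scheme on the slide is
# not shown, so inverse-frequency weights are one common choice.
y_train = torch.tensor([2] * 80 + [1] * 15 + [0] * 5)
counts = torch.bincount(y_train).float()
weights = counts.sum() / (len(counts) * counts)  # rare classes get larger weight
criterion = nn.CrossEntropyLoss(weight=weights)
print(weights)
```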


 

37 of 44

3. Modeling – ANN4 model

  • Unfortunately, it did not work


F1 score: 0.363

38 of 44

3. Modeling – ANN5 model

  • Used reciprocal weights (divided 1 by the weights) to see if there is any change
  • But it got even worse


F1 score: 0.275

39 of 44

3. Modeling – ANN6 model

  • SMOTE oversampling
  • As changing the weights did not work well
  • Tried to oversample the classes that have few samples
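In practice this is `SMOTE` from the imbalanced-learn package; as a from-scratch sketch of the idea, new minority samples are synthesised by interpolating between a minority sample and one of its k nearest minority neighbours:

```python
import numpy as np

def smote_like(X, y, target, n_new, k=3, seed=0):
    # Sketch of the SMOTE idea: interpolate between a minority sample and one
    # of its k nearest minority neighbours. In practice use
    # imblearn.over_sampling.SMOTE.
    rng = np.random.default_rng(seed)
    Xm = X[y == target]
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(Xm))
        dists = np.linalg.norm(Xm - Xm[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                       # interpolation factor in [0, 1)
        new_rows.append(Xm[i] + lam * (Xm[j] - Xm[i]))
    X_out = np.vstack([X, new_rows])
    y_out = np.concatenate([y, np.full(n_new, target)])
    return X_out, y_out

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = np.array([0] * 16 + [1] * 4)                 # class 1 is the minority
X2, y2 = smote_like(X, y, target=1, n_new=12)
print(np.bincount(y2))
```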


40 of 44

3. Modeling – ANN6 model

  • Could observe that the number of samples increased

- But the F1 score is still very bad


F1 score: 0.363

41 of 44

3. Modeling – ANN7 model

  • Stratified sampling
  • Slightly Better!
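Stratified sampling is usually done via `train_test_split(..., stratify=y)`, which keeps the class ratios identical in the train and test splits (sketched on imbalanced toy labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels; stratify=y preserves the 5/15/80 class ratio in both
# splits.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([2] * 80 + [1] * 15 + [0] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_te))  # 20% of each class: 1, 3, and 16 samples
```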


F1 score: 0.45

42 of 44

3. Modeling – RandomForest Classifier Model

  • Use the RandomForest classifier
  • Print its classification report


F1 score: 0.58

43 of 44

3. Modeling - Gaussian Naïve Bayes Model

  • Use Gaussian Naïve Bayes
  • Print its classification report
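Both classical models follow the same fit-predict-report pattern, sketched here on synthetic stand-in data (`make_classification`) rather than the wine frame:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed wine data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for clf in (RandomForestClassifier(random_state=0), GaussianNB()):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__)
    print(classification_report(y_te, clf.predict(X_te)))
```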


F1 score: 0.55

44 of 44

4. Evaluation


Model          Method                Accuracy   F1-score
ANN1           7 hidden layers       51.25      –
ANN2           4 hidden layers       55.00      –
ANN3           2 hidden layers       60.62      –
ANN4           Giving weights        58.75      0.36
ANN5           Reciprocal weights    58.75      0.27
ANN6           SMOTE oversampling    60.52      0.36
ANN7           Stratified sampling   57.31      0.45
Random Forest  –                     63.05      0.58
Gaussian NB    –                     56.66      0.55