1 of 44


Wine Quality Dataset

Individual project – Beginner group

Yujin Kim (01815358)

2 of 44

Contents

  1. Introduction to dataset
  2. Data preprocessing with visualization
  3. Modeling
  4. Evaluation


3 of 44

1. Introduction to Wine Quality Dataset

  • The dataset is related to red variants of the Portuguese “Vinho Verde” wine.
  • The dataset describes the amount of various chemicals present in wine and their effect on its quality.

Challenges

  • Imbalanced classes

Goal

- Creating a classification model to predict wine quality


4 of 44

1. Introduction to Dataset

  • Load the data
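The loading step is a one-liner in pandas. A minimal sketch, assuming the standard UCI `winequality-red.csv` file (which is semicolon-separated); two illustrative rows stand in for the file here:

```python
import pandas as pd
from io import StringIO

# In the notebook this would simply be:
#   df = pd.read_csv("winequality-red.csv", sep=";")
# (the UCI file is semicolon-separated). Two illustrative rows stand in here.
csv_text = (
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.70;0.00;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0.00;2.6;0.098;25;67;0.9968;3.20;0.68;9.8;5\n"
)
df = pd.read_csv(StringIO(csv_text), sep=";")
print(df.shape)  # the full red-wine dataset has shape (1599, 12)
```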


5 of 44

1. Introduction to Dataset

  • Columns (Wine characteristics)
  • 11 Input variables (based on physicochemical tests)
  • 1 Output variable (based on sensory data)


  1. Fixed acidity
  2. Volatile acidity
  3. Citric acid
  4. Residual sugar
  5. Chlorides
  6. Free sulfur dioxide
  7. Total sulfur dioxide
  8. Density
  9. pH
  10. Sulphates
  11. Alcohol
  12. Quality (score between 0 and 10)

6 of 44

2. Data preprocessing

  • Check the dataset for null and missing values
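The check itself is a couple of pandas calls (sketched here on a toy frame with one missing value planted on purpose):

```python
import pandas as pd

# Toy frame standing in for the wine DataFrame; one missing value on purpose.
df = pd.DataFrame({"fixed acidity": [7.4, 7.8, None], "quality": [5, 5, 6]})

df.info()                        # dtypes and non-null counts per column
null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)
```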


7 of 44

2. Data preprocessing

  • Detect the outliers
  • Use the IQR method to set up a “fence”
  • Any values that fall outside this fence are considered outliers
  • Minimum & maximum fence
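A minimal sketch of the IQR fence and the drop step. The slides do not show which columns are fenced; this version fences every numeric column, which on the real data would drop more rows than the 37 reported:

```python
import pandas as pd

def iqr_fences(s: pd.Series):
    # Fence = [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; values outside are outliers.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def drop_outliers(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows inside the fence for every numeric column.
    mask = pd.Series(True, index=df.index)
    for col in df.select_dtypes(include="number").columns:
        lo, hi = iqr_fences(df[col])
        mask &= df[col].between(lo, hi)
    return df[mask]

# Tiny demo: 99 falls far outside the fence of column "a" and is dropped.
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5, 99]})
print(drop_outliers(demo))
```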


8 of 44

2. Data preprocessing

  • Detect outliers to drop

  • Drop the outliers

  • Result: 1599 -> 1562 rows (37 rows dropped)


9 of 44

2. Data preprocessing with visualization

  • Correlation between the variables


10 of 44

2. Data preprocessing with visualization

  • Visualize the correlation matrix with heatmap
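In a notebook this is typically the seaborn one-liner `sns.heatmap(df.corr(), annot=True)`; a matplotlib-only sketch of the same plot (on a random toy frame) looks like this:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy frame standing in for the wine DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["alcohol", "density", "quality"])

corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
for i in range(len(corr)):       # annotate each cell with its value
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im)
fig.savefig("heatmap.png")
```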


11 of 44

2. Data preprocessing with visualization


12 of 44

2. Data preprocessing with visualization


13 of 44

2. Data preprocessing with visualization

  • Correlation Heatmap with values


14 of 44

2. Data preprocessing with visualization

  • Find the columns that are weakly correlated with quality
  • Correlation below 0.1

  • 3 columns detected

-> Residual sugar, free sulfur dioxide, pH


15 of 44

2. Data preprocessing

  • Drop less correlated columns

  • After that, we have 9 columns (12 -> 9)
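The select-and-drop step can be sketched as below. The toy frame is built so that `weak` has zero correlation with quality while `strong` tracks it exactly:

```python
import pandas as pd

# Toy frame: "strong" tracks quality exactly; "weak" is constructed to have
# zero correlation with it.
df = pd.DataFrame({
    "strong":  [1, 2, 3, 4, 5, 6],
    "weak":    [1, -1, 0, 0, -1, 1],
    "quality": [1, 2, 3, 4, 5, 6],
})

corr_q = df.corr()["quality"].drop("quality").abs()
to_drop = corr_q[corr_q < 0.1].index.tolist()  # columns below the 0.1 cut-off
df = df.drop(columns=to_drop)
print(to_drop, df.shape)
```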


16 of 44

2. Data preprocessing

  • Re-scaling

- First, check the skewness of each column


17 of 44

2. Data preprocessing

  • Re-scaling
  • For fixed acidity & total sulfur dioxide

-> log transformation

  • For chlorides & sulphates
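The transform used for chlorides & sulphates is not shown on the slide, so only the log step is sketched here, on a right-skewed toy column (using `np.log1p`, i.e. log(1 + x), which is safe at zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed toy column standing in for e.g. total sulfur dioxide.
df = pd.DataFrame({"total sulfur dioxide": rng.exponential(scale=30, size=500)})

before = df["total sulfur dioxide"].skew()
# Log transform for the right-skewed columns.
df["total sulfur dioxide"] = np.log1p(df["total sulfur dioxide"])
after = df["total sulfur dioxide"].skew()
print(f"skew before: {before:.2f}, after: {after:.2f}")
```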


18 of 44

2. Data preprocessing

  • Re-scaling
  • Finally, check the skewness of each column to see the difference


19 of 44

3. Modeling

  • Creating a dataset -> divide into input data (x) and output data (y)

  • As you can see, the quality column has imbalanced classes


There are many more normal wines than excellent or poor ones
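The split and the imbalance check are two lines of pandas (sketched on a toy frame):

```python
import pandas as pd

# Toy frame standing in for the preprocessed wine DataFrame.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 9.5, 11.0, 12.0],
    "quality": [5, 5, 5, 6, 6, 7],
})

x = df.drop(columns=["quality"])      # input data
y = df["quality"]                     # output data
print(y.value_counts().sort_index())  # class counts reveal the imbalance
```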

20 of 44

3. Modeling Artificial Neural Network(ANN)

  • Seed fix

- Seed function

: fixes the state of the random number generator so that it produces the same random numbers on multiple executions of the code, on the same machine or on different machines

  • Define a loss function & optimizer
  • use Cross-Entropy loss
  • Adam for gradient descent with learning rate 0.001
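The setup above can be sketched as follows; the placeholder model and its sizes are assumptions, while the loss, optimizer, and learning rate come from the slide:

```python
import random
import numpy as np
import torch
import torch.nn as nn

def set_seed(seed: int = 42) -> None:
    # Fix every RNG the pipeline touches so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)

model = nn.Linear(8, 6)            # placeholder model (8 features, 6 classes)
criterion = nn.CrossEntropyLoss()  # Cross-Entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001
```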


21 of 44

3. Modeling – ANN1 model

  • Define an Artificial Neural Network:

- ANN1 model

- 7 hidden layers
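A sketch of such a network; the slide states only "7 hidden layers", so the layer widths below are illustrative guesses, not the actual ANN1:

```python
import torch
import torch.nn as nn

class ANN1(nn.Module):
    # 7 hidden layers as on the slide; the widths are illustrative guesses.
    def __init__(self, n_features: int = 8, n_classes: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, n_classes),  # output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ANN1()
print(model(torch.randn(4, 8)).shape)
```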


22 of 44

3. Modeling

  • Train the model
  • Loop over the data iterator, feed the inputs to the network, and optimize
  • The following code collects the loss and accuracy calculated while training the model
  • Train accuracy and train loss are stored in lists to plot the ‘Train Accuracy vs Loss’ graph
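A sketch of that training loop, run here on random stand-in data (the model and data are placeholders; the loop structure and the accuracy/loss bookkeeping are the point):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer):
    # One pass over the data: forward, loss, backward, optimizer step,
    # while accumulating running loss and accuracy.
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for inputs, labels in loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * labels.size(0)
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return running_loss / total, 100.0 * correct / total

# Demo on random data; the lists feed the accuracy/loss plots later on.
torch.manual_seed(0)
model = nn.Linear(8, 6)
loader = DataLoader(
    TensorDataset(torch.randn(64, 8), torch.randint(0, 6, (64,))),
    batch_size=16,
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

train_losses, train_accu = [], []
for epoch in range(3):
    loss, acc = train_one_epoch(model, loader, criterion, optimizer)
    train_losses.append(loss)
    train_accu.append(acc)
```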


23 of 44

3. Modeling

  • Test the network
  • Test accuracy and test loss are likewise stored in lists to plot the ‘Test Accuracy vs Loss’ graph
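The test pass mirrors the training loop, just without gradient updates (again on random stand-in data):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def evaluate(model, loader, criterion):
    # Same loss/accuracy bookkeeping as training, but no gradient updates.
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    for inputs, labels in loader:
        outputs = model(inputs)
        running_loss += criterion(outputs, labels).item() * labels.size(0)
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return running_loss / total, 100.0 * correct / total

torch.manual_seed(0)
model = nn.Linear(8, 6)
loader = DataLoader(
    TensorDataset(torch.randn(32, 8), torch.randint(0, 6, (32,))),
    batch_size=16,
)
eval_losses, eval_accu = [], []
loss, acc = evaluate(model, loader, nn.CrossEntropyLoss())
eval_losses.append(loss)
eval_accu.append(acc)
```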


24 of 44

3. Modeling

  • Train and validate the network on the training data.

- At first, set the number of epochs to 1000


25 of 44

3. Modeling

  • Plotting the Accuracy graph
  • Using the ‘train_accu’ and ‘eval_accu’ lists from the training and validation above
  • Make a plot that shows train and valid accuracy


26 of 44

3. Modeling

  • Plotting the Losses graph
  • Using the ‘train_losses’ and ‘eval_losses’ lists from the training and validation above
  • Make a plot that shows train and valid losses
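Both plots follow the same pattern, so one helper covers them (sketch with dummy values; in the deck it is called with the `train_accu`/`eval_accu` and `train_losses`/`eval_losses` lists):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def plot_curves(train_vals, eval_vals, ylabel, fname):
    # One figure showing the train and valid curves over epochs.
    plt.figure()
    plt.plot(train_vals, label="train")
    plt.plot(eval_vals, label="valid")
    plt.xlabel("epoch")
    plt.ylabel(ylabel)
    plt.legend()
    plt.savefig(fname)
    plt.close()

# e.g. plot_curves(train_accu, eval_accu, "accuracy", "accuracy.png")
#      plot_curves(train_losses, eval_losses, "loss", "losses.png")
plot_curves([50, 55, 60], [48, 52, 54], "accuracy", "accuracy.png")
```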


27 of 44

3. Modeling

  • To see them all together, plot them as subplots
  • Because the number of epochs is so large, it is hard to see the tendency right away


28 of 44

3. Modeling

  • Decreasing the number of epochs to 300 by slicing each list, we can see the ‘Train vs Valid Loss & Accuracy’ at a glance
  • It looks similar to the graph below


29 of 44

3. Modeling

  • Both of them show overfitting


30 of 44

3. Modeling

  • From the AI Winter School, we have learned how to prevent overfitting in deep learning.
  • First, decrease the model complexity (hidden layers)


31 of 44

3. Modeling – ANN2 model

  • Decrease the model complexity (hidden layers)

- Decreased the hidden layers from 7 to 4


32 of 44

3. Modeling – ANN2 model

  • Decreased the number of epochs to 300
  • But the accuracy is still very low


Accuracy: 51 -> 55

33 of 44

3. Modeling – ANN3 model

  • Decrease the model complexity (hidden layers)

- Decreased the hidden layers further, from 4 to 2


34 of 44

3. Modeling – ANN3 model

  • The accuracy is still low


Accuracy: 55 -> 60

35 of 44

3. Modeling – ANN3

  • Since we know that the data is imbalanced, we need to look at the F1 score as well
  • The F1 score should be used when the dataset is imbalanced
  • It provides reliable results for a wide range of datasets, whether imbalanced or not
  • Accuracy, on the other hand, struggles to perform well outside of well-balanced datasets


F1 score: 0.363

36 of 44

3. Modeling – ANN4 model

  • Giving weights to each class, to see if it increases the F1 score
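Class weights plug straight into the loss. The slide does not show the weighting scheme, so inverse-frequency weights are used below as one common choice, on hypothetical imbalanced labels:

```python
import torch
import torch.nn as nn

# Hypothetical imbalanced labels; the actual weighting scheme on the slide is
# not shown, so inverse-frequency weights are one common choice.
y_train = torch.tensor([2] * 80 + [1] * 15 + [0] * 5)
counts = torch.bincount(y_train).float()
weights = counts.sum() / (len(counts) * counts)  # rare classes get larger weight
criterion = nn.CrossEntropyLoss(weight=weights)
print(weights)
```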


 

37 of 44

3. Modeling – ANN4 model

  • Unfortunately, it did not work


F1 score: 0.363

38 of 44

3. Modeling – ANN5 model

  • Used reciprocal weights (divided 1 by the weights) to see if there is any change
  • But it got even worse


F1 score: 0.275

39 of 44

3. Modeling – ANN6 model

  • SMOTE oversampling
  • As changing the weights did not work well
  • Tried to oversample the classes that have few samples
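In practice this is `SMOTE` from the imbalanced-learn package; as a from-scratch sketch of the idea, new minority samples are synthesised by interpolating between a minority sample and one of its k nearest minority neighbours:

```python
import numpy as np

def smote_like(X, y, target, n_new, k=3, seed=0):
    # Sketch of the SMOTE idea: interpolate between a minority sample and one
    # of its k nearest minority neighbours. In practice use
    # imblearn.over_sampling.SMOTE.
    rng = np.random.default_rng(seed)
    Xm = X[y == target]
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(Xm))
        dists = np.linalg.norm(Xm - Xm[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                       # interpolation factor in [0, 1)
        new_rows.append(Xm[i] + lam * (Xm[j] - Xm[i]))
    X_out = np.vstack([X, new_rows])
    y_out = np.concatenate([y, np.full(n_new, target)])
    return X_out, y_out

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = np.array([0] * 16 + [1] * 4)                 # class 1 is the minority
X2, y2 = smote_like(X, y, target=1, n_new=12)
print(np.bincount(y2))
```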


40 of 44

3. Modeling – ANN6 model

  • Could observe that the number of samples increased

- But the F1 score is still very bad


F1 score: 0.363

41 of 44

3. Modeling – ANN7 model

  • Stratified sampling
  • Slightly Better!
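Stratified sampling is usually done via `train_test_split(..., stratify=y)`, which keeps the class ratios identical in the train and test splits (sketched on imbalanced toy labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels; stratify=y preserves the 5/15/80 class ratio in both
# splits.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([2] * 80 + [1] * 15 + [0] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_te))  # 20% of each class: 1, 3, and 16 samples
```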


F1 score: 0.45

42 of 44

3. Modeling – RandomForest Classifier Model

  • Use the RandomForest classifier
  • Print its classification report


F1 score: 0.58

43 of 44

3. Modeling - Gaussian Naïve Bayes Model

  • Use Gaussian Naïve Bayes
  • Print its classification report
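Both classical models follow the same fit-predict-report pattern, sketched here on synthetic stand-in data (`make_classification`) rather than the wine frame:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed wine data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for clf in (RandomForestClassifier(random_state=0), GaussianNB()):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__)
    print(classification_report(y_te, clf.predict(X_te)))
```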


F1 score: 0.55

44 of 44

4. Evaluation


Model          Method                Accuracy   F1-score
ANN1           7 hidden layers       51.25      –
ANN2           4 hidden layers       55.00      –
ANN3           2 hidden layers       60.62      –
ANN4           Giving weights        58.75      0.36
ANN5           Reciprocal weights    58.75      0.27
ANN6           SMOTE oversampling    60.52      0.36
ANN7           Stratified sampling   57.31      0.45
Random Forest  –                     63.05      0.58
Gaussian NB    –                     56.66      0.55