1 of 52

Pima Indians Diabetes Dataset

Classification

Minjae Chung (01715597)

1

2 of 52

Introduction to dataset

  • 768 Subjects


3 of 52

Introduction to dataset

[figure]

4 of 52

Introduction to dataset

[figure]

5 of 52

Why is it worth analyzing?

  • Association between diabetes and other health features

  • Predicting diabetes from medical examination results with AI

  • Prevention and treatment

6 of 52

EDA (Exploratory Data Analysis)

Outlier Detection

  • 17 subjects with at least 2 outliers
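The slides don't state which rule flagged these subjects; a common choice is the 1.5×IQR fence. A minimal sketch on made-up data (not the real dataset), counting subjects with outliers in at least two features under that assumed rule:

```python
from statistics import quantiles

def iqr_outlier_flags(column):
    """True where a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(column, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < lo or v > hi for v in column]

# Toy data: rows are subjects, columns are two features
data = [
    [1.0, 5.0], [1.1, 5.1], [0.9, 4.9], [1.2, 5.2], [1.0, 5.0],
    [0.95, 4.95], [1.05, 5.05], [1.15, 5.15], [0.85, 4.85],
    [9.0, 50.0],   # extreme in both features
]
per_feature = [iqr_outlier_flags(col) for col in zip(*data)]
per_subject = list(zip(*per_feature))                 # flags back per subject
n_multi = sum(1 for flags in per_subject if sum(flags) >= 2)
```

Only the last subject is flagged in two features, so `n_multi` is 1 here; on the real data the same count would come out as 17.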

7 of 52

EDA (Exploratory Data Analysis)

  • Checking for null values

-> 0 null values

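Assuming the usual pandas workflow (the slides don't show the code), the null check might look like this, on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame mirroring a few of the dataset's columns (values are made up)
df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

null_counts = df.isnull().sum()        # null count per column
total_nulls = int(null_counts.sum())   # 0, matching the slide's finding
```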

8 of 52

EDA (Exploratory Data Analysis)

  • Summary and statistics


9 of 52

EDA (Exploratory Data Analysis)

  • Correlation matrix between variables: Heatmap (1/2)


10 of 52

EDA (Exploratory Data Analysis)

  • Correlation matrix between variables: Heatmap (2/2)


11 of 52

EDA (Exploratory Data Analysis)

  • Correlation with Outcome (descending):

  1. Glucose: 0.47
  2. BMI: 0.29
  3. Age: 0.24
  4. Pregnancies: 0.22
  5. Pedigree: 0.17
  6. Insulin: 0.13
  7. Skin Thickness: 0.07
  7. Blood Pressure: 0.07 (tied)
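A sketch of how such a ranking can be computed, assuming pandas and using toy values (not the real data):

```python
import pandas as pd

# Toy data (made up): Glucose tracks Outcome closely, BMI only loosely
df = pd.DataFrame({
    "Glucose": [80.0, 150.0, 85.0, 160.0],
    "BMI": [25.0, 30.0, 26.0, 24.0],
    "Outcome": [0, 1, 0, 1],
})

# Pearson correlation of every feature with Outcome, highest first
corr_with_outcome = (
    df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
)
```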

12 of 52

EDA (Exploratory Data Analysis)

  • Density curve of Glucose distribution
  • Subjects with diabetes show higher glucose concentrations

13 of 52

EDA (Exploratory Data Analysis)

  • Density plot of Age distribution
    • Outcome 0: No diabetes
    • Outcome 1: Diabetes
  • Subjects with diabetes tend to be older

14 of 52

EDA (Exploratory Data Analysis)

  • Bar plot of Pregnancies
  • Higher chance of having diabetes with a higher number of pregnancies

15 of 52

EDA (Exploratory Data Analysis)

  • Density curve of BMI distribution
  • Subjects with diabetes were more often overweight or obese
  • Zero BMI values?

Healthy (18.5 to 24.9)

Overweight (25.0 to 29.9)

Obese (30.0 or higher)

(Centers for Disease Control and Prevention)

Mean BMI: 30 vs. 35

16 of 52

EDA (Exploratory Data Analysis)

  • Zero values represent missing (null) values for particular features

17 of 52

Feature Engineering

  • Outliers deleted: 768 subjects - 17 subjects with outliers

-> 751 subjects

18 of 52

Feature Engineering

  • Delete the 2 features with the least correlation:

  • 1. Skin Thickness
  • 2. Blood Pressure

19 of 52

Feature Engineering

  • Converting zero values into null values
  • Glucose, Insulin, BMI


20 of 52

Feature Engineering

  • Replacing NaN with the mean value of the feature
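The two steps on slides 19–20 — convert zeros to NaN in Glucose, Insulin and BMI, then fill NaN with the column mean — could be sketched as follows, assuming pandas and using made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); zeros in these columns stand in for missing data
df = pd.DataFrame({
    "Glucose": [148.0, 0.0, 183.0, 89.0],
    "Insulin": [0.0, 94.0, 0.0, 94.0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

zero_as_null = ["Glucose", "Insulin", "BMI"]
df[zero_as_null] = df[zero_as_null].replace(0.0, np.nan)             # zeros -> NaN
df[zero_as_null] = df[zero_as_null].fillna(df[zero_as_null].mean())  # NaN -> column mean
```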

21 of 52

Feature Engineering

  • Data Scaling
  • Equalizing the scales of the features
  • -> Improves the performance of the models

  • StandardScaler()
  • Scales features to have zero mean and unit variance
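Standardization itself is just (x - mean) / std; note that scikit-learn's StandardScaler uses the population standard deviation (ddof = 0). A dependency-free sketch of the same transform:

```python
from statistics import fmean, pstdev

def standardize(values):
    """(x - mean) / std with population std (ddof=0), as StandardScaler does."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up glucose readings, scaled to zero mean and unit variance
glucose = [148.0, 85.0, 183.0, 89.0, 137.0]
z = standardize(glucose)
```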

22 of 52

Feature Engineering

  • Standardization


23 of 52

Artificial Neural Network 1

Input layer -> Hidden layer 1 -> Hidden layer 2 -> Hidden layer 3 -> Output layer

Neurons: 8 -> 7 -> 5 -> 3 -> 2

24 of 52

Artificial Neural Network 2

Input layer -> Hidden layers 1-5 -> Output layer

Neurons: 8 -> 10 -> 7 -> 7 -> 5 -> 5 -> 2

  • More neurons, and added hidden layers repeat the width of the previous layer

25 of 52

Artificial Neural Network 3

Input layer -> Hidden layers 1-11 -> Output layer

Neurons: 8 -> 15 -> 10 -> 10 -> 7 -> 7 -> 7 -> 7 -> 7 -> 7 -> 5 -> 5 -> 2

  • More neurons and more hidden layers added
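Assuming fully connected layers with biases (the slides show only the layer widths), the trainable-parameter counts of the three architectures can be computed directly:

```python
def dense_param_count(layer_sizes):
    """Trainable parameters of a fully connected net: (n_in + 1) * n_out per layer."""
    return sum((a + 1) * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Layer widths from the three ANN slides
ann1 = [8, 7, 5, 3, 2]
ann2 = [8, 10, 7, 7, 5, 5, 2]
ann3 = [8, 15, 10, 10, 7, 7, 7, 7, 7, 7, 5, 5, 2]
counts = [dense_param_count(a) for a in (ann1, ann2, ann3)]  # 129, 305, 844
```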

26 of 52

Network Modelling / Optimizer = Adam

Model | Accuracy | Precision | Recall | F1-score
ANN1  | 0.7597   | 0.7428    | 0.4814 | 0.5842
ANN2  | 0.7792   | 0.7631    | 0.5370 | 0.6304
ANN3  | 0.8181   | 0.8095    | 0.6296 | 0.7083

  • ANN1 < ANN2 < ANN3
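Precision, recall and F1 are all derived from confusion-matrix counts. A sketch; tp=17, fp=4, fn=10 are hypothetical counts chosen because they reproduce the ANN3 row:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts consistent with the ANN3 metrics above
p, r, f1 = prf(17, 4, 10)   # 0.8095..., 0.6296..., 0.7083...
```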

27 of 52

Model Evaluation (Original Data)

Model                    | Accuracy | Precision | Recall | F1-score
ANN3                     | 0.8181   | 0.8095    | 0.6296 | 0.7083
SVC classifier           | 0.7272   | 0.4629    | 0.6578 | 0.5434
Random Forest classifier | 0.7597   | 0.5925    | 0.6808 | 0.6336
Extra Trees classifier   | 0.7662   | 0.6296    | 0.6800 | 0.6538
Logistic Regression      | 0.7987   | 0.6111    | 0.7674 | 0.6804
Ensemble model           | 0.7727   | 0.5925    | 0.7111 | 0.6464

28 of 52

Confusion Matrix


29 of 52

Model Evaluation (Data modified)

Model                    | Recall (original) | Recall (modified)
ANN3                     | 0.6296            | 0.7735
SVC classifier           | 0.6578            | 0.7500
Random Forest classifier | 0.6808            | 0.7500
Extra Trees classifier   | 0.6800            | 0.7317
Logistic Regression      | 0.7674            | 0.7105
Ensemble model           | 0.7111            | 0.7631

30 of 52

Summary

  • Dataset:

Pregnancies, Glucose, Blood Pressure, Skin Thickness

Insulin, BMI, Diabetes Pedigree, Age, Outcome

  • Feature Engineering
  • Deleting outliers
  • Deleting features “Blood Pressure” & “Skin Thickness”
  • Replacing null values with mean values
  • Standard data scaling

  • Classifiers

ANN, SVC, Random Forest, Extra Trees, Logistic Regression, Ensemble

Evaluation metric -> Recall


31 of 52

Conclusion

  • ANN – better evaluation scores with more neurons and hidden layers

  • ANN3 had the best accuracy, precision and F1-score; Logistic Regression had the best recall among the classifiers

  • Data modification improved recall for every model except Logistic Regression

32 of 52

Interpretability/ Explainable AI (XAI)

  • Machine learning -> black box

  • Difficult to interpret

  • How does each feature affect the model's prediction?

33 of 52

Interpretability/ Explainable AI (XAI)

  • SHAP (SHapley Additive exPlanation)
  • Black-Box Model : f
  • Surrogate Model : g
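For intuition (this is not the shap library), the Shapley value of feature i is its marginal contribution averaged over all feature orderings; for a linear model with absent features set to their means, it reduces to w_i * (x_i - mean_i). A brute-force toy sketch:

```python
from itertools import permutations
from math import factorial

# Toy linear "black box" f(z) = w . z; explain instance x against the feature means
weights = [2.0, -1.0, 0.5]
x       = [1.0,  2.0, 3.0]
mean    = [0.5,  1.0, 1.5]

def f_masked(subset):
    """Model output with features in `subset` taken from x, the rest from the mean."""
    return sum(w * (x[i] if i in subset else mean[i]) for i, w in enumerate(weights))

n = len(weights)
phi = [0.0] * n
for order in permutations(range(n)):   # exact Shapley: average marginal
    seen = set()                       # contribution over all orderings
    for i in order:
        before = f_masked(seen)
        seen.add(i)
        phi[i] += (f_masked(seen) - before) / factorial(n)
```

By the additivity property, the SHAP values sum to f(x) minus the baseline prediction — exactly what the force plots on the next slides visualize per subject.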


34 of 52

SHAP value

  • Force Plot
  • Subject 0:

Positive SHAP value vs. Negative SHAP value

Glucose contributed the most


35 of 52

SHAP value


  • Force Plot
  • Subject 300:

Positive SHAP value vs. Negative SHAP value

Glucose contributed the most

36 of 52

Glucose Dependence Plot

  • High values of the Glucose variable have a high positive contribution to the prediction

37 of 52

Age Dependence Plot

  • High values of the Age variable tend to have a high positive contribution to the prediction up to about age 50

38 of 52

SkinThickness Dependence Plot

  • Values of the Skin Thickness variable show no correlation with their SHAP values

39 of 52

Summary Plot

  • Order of global feature importance
  • Glucose contributed the most and Skin Thickness the least

40 of 52

Summary Plot

  • 1. Glucose
  • 2. BMI
  • 3. Age
  • 4. DiabetesPedigree
  • 5. Pregnancies
  • 6. Insulin
  • 7. Blood Pressure
  • 8. SkinThickness


41 of 52

Summary Plot

  • 1. Glucose
  • 2. BMI
  • 3. Age
  • 4. DiabetesPedigree
  • 5. Pregnancies
  • 6. Insulin
  • 7. Blood Pressure
  • 8. Skin Thickness

  1. Glucose: 0.47
  2. BMI: 0.29
  3. Age: 0.24
  4. Pregnancies: 0.22
  5. DiabetesPedigree: 0.17
  6. Insulin: 0.13
  7. Skin Thickness: 0.07
  7. Blood Pressure: 0.07 (tied)

42 of 52

Thank you for listening

43 of 52

SVC classifier (Support Vector Machine)

  • Supervised Learning Algorithm
  • Support Vector
  • Hyperplane

  • Creates a decision boundary that segregates n-dimensional space into classes

44 of 52

Random Forest Classifier

  • Supervised Learning Algorithm
  • Builds a number of decision trees on various subsets of the given dataset
  • Ensemble learning (majority voting)
  • Bagging (bootstrap + aggregation)

45 of 52

Extra Trees Classifier

  • Supervised Learning Algorithm

  • Variant of Random Forest

  • Ensemble of decision trees

  • Majority voting

  • No bootstrap samples = uses the whole original sample

  • Random split points (Random Forest: optimal split points)

46 of 52

Logistic Regression Model

  • Supervised Learning Algorithm
  • Predicts a categorical dependent variable from a given set of independent variables
  • Outputs a probability between 0 and 1
  • Used for classification problems
  • A threshold converts the probability into a class
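The probability-plus-threshold idea can be sketched without any library (the weights below are made up, not fitted coefficients):

```python
from math import exp

def predict_proba(x, weights, bias):
    """Logistic regression: the sigmoid squashes the linear score into (0, 1)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + exp(-z))

def predict(x, weights, bias, threshold=0.5):
    """Class label from probability via a decision threshold."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0
```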

47 of 52

Ensemble Model

  • Combines multiple classifiers to solve a complex problem and improve overall performance
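Majority voting, the simplest way to combine classifiers, fits in a few lines (the votes below are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """Final class = most common prediction among the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

votes = [1, 0, 1, 1, 0]   # hypothetical outputs of five classifiers
final = majority_vote(votes)
```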

48 of 52

Feature Engineering

  • Histogram: Data distribution

[histograms: Glucose, BMI, Insulin]

49 of 52

Feature Engineering

  • Histogram: Data distribution

[histograms: Glucose, BMI, Insulin]

50 of 52

BMI Dependence Plot

  • High values of the BMI variable have a high positive contribution to the prediction

51 of 52

Pregnancies Dependence Plot

  • High values of the Pregnancies variable tend to have a high positive contribution

52 of 52

Optimizer

  • Gradient Descent – slow, can get stuck in local minima

  • Gradient Descent + momentum – faster, can escape saddle points

  • Adagrad (Adaptive Gradient algorithm) – slow (per-parameter learning rates decay), escapes saddle points

  • RMSProp (Root Mean Square Propagation) – faster, adds a decaying factor to the accumulated squared gradients

  • Adam – momentum + RMSProp
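The momentum idea can be sketched on a toy quadratic (hypothetical hyperparameters, not the slides' setup): the velocity accumulates past gradients, so the iterate keeps moving through flat regions.

```python
def gd_momentum(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: velocity is a decaying sum of gradients."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)   # update velocity with the current gradient
        x += v                        # step along the velocity
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2 * (x - 3)
x_min = gd_momentum(lambda x: 2.0 * (x - 3.0), x0=0.0)
```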