1 of 52

Pima Indians Diabetes Dataset

Classification

Minjae Chung (01715597)

1

2 of 52

Introduction to dataset

  • 768 Subjects


3 of 52

Introduction to dataset

[figure]

4 of 52

Introduction to dataset

[figure]

5 of 52

Why is it worth analyzing?

  • Association between diabetes and other health features

  • Predicting diabetes from medical examination results with AI

  • Prevention and treatment

6 of 52

EDA (Exploratory Data Analysis)

Outlier Detection

  • 17 subjects with at least 2 outliers
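The slides don't state which rule flagged these subjects; a common choice is the 1.5×IQR fence. A minimal sketch on made-up data (not the real dataset), counting subjects with outliers in at least two features under that assumed rule:

```python
from statistics import quantiles

def iqr_outlier_flags(column):
    """True where a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(column, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < lo or v > hi for v in column]

# Toy data: rows are subjects, columns are two features
data = [
    [1.0, 5.0], [1.1, 5.1], [0.9, 4.9], [1.2, 5.2], [1.0, 5.0],
    [0.95, 4.95], [1.05, 5.05], [1.15, 5.15], [0.85, 4.85],
    [9.0, 50.0],   # extreme in both features
]
per_feature = [iqr_outlier_flags(col) for col in zip(*data)]
per_subject = list(zip(*per_feature))                 # flags back per subject
n_multi = sum(1 for flags in per_subject if sum(flags) >= 2)
```

Only the last subject is flagged in two features, so `n_multi` is 1 here; on the real data the same count would come out as 17.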

7 of 52

EDA (Exploratory Data Analysis)

  • Checking for null values

-> 0 null values

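Assuming the usual pandas workflow (the slides don't show the code), the null check might look like this, on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame mirroring a few of the dataset's columns (values are made up)
df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

null_counts = df.isnull().sum()        # null count per column
total_nulls = int(null_counts.sum())   # 0, matching the slide's finding
```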

8 of 52

EDA (Exploratory Data Analysis)

  • Summary and statistics


9 of 52

EDA (Exploratory Data Analysis)

  • Correlation matrix between variables: Heatmap (1/2)


10 of 52

EDA (Exploratory Data Analysis)

  • Correlation matrix between variables: Heatmap (2/2)


11 of 52

EDA (Exploratory Data Analysis)

  • Correlation with Outcome (descending):

  1. Glucose: 0.47
  2. BMI: 0.29
  3. Age: 0.24
  4. Pregnancies: 0.22
  5. Pedigree: 0.17
  6. Insulin: 0.13
  7. Skin Thickness: 0.07
  7. Blood Pressure: 0.07 (tied)
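A sketch of how such a ranking can be computed, assuming pandas and using toy values (not the real data):

```python
import pandas as pd

# Toy data (made up): Glucose tracks Outcome closely, BMI only loosely
df = pd.DataFrame({
    "Glucose": [80.0, 150.0, 85.0, 160.0],
    "BMI": [25.0, 30.0, 26.0, 24.0],
    "Outcome": [0, 1, 0, 1],
})

# Pearson correlation of every feature with Outcome, highest first
corr_with_outcome = (
    df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
)
```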

12 of 52

EDA (Exploratory Data Analysis)

  • Density curve of Glucose distribution
  • Subjects with diabetes show higher glucose concentrations

13 of 52

EDA (Exploratory Data Analysis)

  • Density plot of Age distribution
    • Outcome 0: No diabetes
    • Outcome 1: Diabetes
  • Subjects with diabetes tend to be older

14 of 52

EDA (Exploratory Data Analysis)

  • Bar plot of Pregnancies
  • Higher chance of having diabetes with a higher number of pregnancies

15 of 52

EDA (Exploratory Data Analysis)

  • Density curve of BMI distribution
  • Subjects with diabetes were more often overweight or obese
  • Zero BMI values?

Healthy (18.5 to 24.9)

Overweight (25.0 to 29.9)

Obese (30.0 or higher)

(Centers for Disease Control and Prevention)

Mean BMI: 30 vs. 35

16 of 52

EDA (Exploratory Data Analysis)

  • Zero values represent missing (null) values for particular features

17 of 52

Feature Engineering

  • Outliers deleted: 768 subjects - 17 subjects with outliers

-> 751 subjects

18 of 52

Feature Engineering

  • Delete the 2 features with the least correlation:

  • 1. Skin Thickness
  • 2. Blood Pressure

19 of 52

Feature Engineering

  • Converting zero values into null values
  • Glucose, Insulin, BMI


20 of 52

Feature Engineering

  • Replacing NaN with the mean value of the feature
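The two steps on slides 19–20 — convert zeros to NaN in Glucose, Insulin and BMI, then fill NaN with the column mean — could be sketched as follows, assuming pandas and using made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); zeros in these columns stand in for missing data
df = pd.DataFrame({
    "Glucose": [148.0, 0.0, 183.0, 89.0],
    "Insulin": [0.0, 94.0, 0.0, 94.0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

zero_as_null = ["Glucose", "Insulin", "BMI"]
df[zero_as_null] = df[zero_as_null].replace(0.0, np.nan)             # zeros -> NaN
df[zero_as_null] = df[zero_as_null].fillna(df[zero_as_null].mean())  # NaN -> column mean
```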

21 of 52

Feature Engineering

  • Data Scaling
  • Equalizing the scales of the features
  • -> Improves the performance of the models

  • StandardScaler()
  • Scales features to have zero mean and unit variance
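Standardization itself is just (x - mean) / std; note that scikit-learn's StandardScaler uses the population standard deviation (ddof = 0). A dependency-free sketch of the same transform:

```python
from statistics import fmean, pstdev

def standardize(values):
    """(x - mean) / std with population std (ddof=0), as StandardScaler does."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up glucose readings, scaled to zero mean and unit variance
glucose = [148.0, 85.0, 183.0, 89.0, 137.0]
z = standardize(glucose)
```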

22 of 52

Feature Engineering

  • Standardization


23 of 52

Artificial Neural Network 1

Input layer -> Hidden layer 1 -> Hidden layer 2 -> Hidden layer 3 -> Output layer

Neurons: 8 -> 7 -> 5 -> 3 -> 2

24 of 52

Artificial Neural Network 2

Input layer -> Hidden layers 1-5 -> Output layer

Neurons: 8 -> 10 -> 7 -> 7 -> 5 -> 5 -> 2

  • More neurons, and added hidden layers repeat the width of the previous layer

25 of 52

Artificial Neural Network 3

Input layer -> Hidden layers 1-11 -> Output layer

Neurons: 8 -> 15 -> 10 -> 10 -> 7 -> 7 -> 7 -> 7 -> 7 -> 7 -> 5 -> 5 -> 2

  • More neurons and more hidden layers added
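Assuming fully connected layers with biases (the slides show only the layer widths), the trainable-parameter counts of the three architectures can be computed directly:

```python
def dense_param_count(layer_sizes):
    """Trainable parameters of a fully connected net: (n_in + 1) * n_out per layer."""
    return sum((a + 1) * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Layer widths from the three ANN slides
ann1 = [8, 7, 5, 3, 2]
ann2 = [8, 10, 7, 7, 5, 5, 2]
ann3 = [8, 15, 10, 10, 7, 7, 7, 7, 7, 7, 5, 5, 2]
counts = [dense_param_count(a) for a in (ann1, ann2, ann3)]  # 129, 305, 844
```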

26 of 52

Network Modelling / Optimizer = Adam

Model | Accuracy | Precision | Recall | F1-score
ANN1  | 0.7597   | 0.7428    | 0.4814 | 0.5842
ANN2  | 0.7792   | 0.7631    | 0.5370 | 0.6304
ANN3  | 0.8181   | 0.8095    | 0.6296 | 0.7083

  • ANN1 < ANN2 < ANN3
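Precision, recall and F1 are all derived from confusion-matrix counts. A sketch; tp=17, fp=4, fn=10 are hypothetical counts chosen because they reproduce the ANN3 row:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts consistent with the ANN3 metrics above
p, r, f1 = prf(17, 4, 10)   # 0.8095..., 0.6296..., 0.7083...
```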

27 of 52

Model Evaluation (Original Data)

Model                    | Accuracy | Precision | Recall | F1-score
ANN3                     | 0.8181   | 0.8095    | 0.6296 | 0.7083
SVC classifier           | 0.7272   | 0.4629    | 0.6578 | 0.5434
Random Forest classifier | 0.7597   | 0.5925    | 0.6808 | 0.6336
Extra Trees classifier   | 0.7662   | 0.6296    | 0.6800 | 0.6538
Logistic Regression      | 0.7987   | 0.6111    | 0.7674 | 0.6804
Ensemble model           | 0.7727   | 0.5925    | 0.7111 | 0.6464

28 of 52

Confusion Matrix


29 of 52

Model Evaluation (Data modified)

Model                    | Recall (original) | Recall (modified)
ANN3                     | 0.6296            | 0.7735
SVC classifier           | 0.6578            | 0.7500
Random Forest classifier | 0.6808            | 0.7500
Extra Trees classifier   | 0.6800            | 0.7317
Logistic Regression      | 0.7674            | 0.7105
Ensemble model           | 0.7111            | 0.7631

30 of 52

Summary

  • Dataset:

Pregnancies, Glucose, Blood Pressure, Skin Thickness

Insulin, BMI, Diabetes Pedigree, Age, Outcome

  • Feature Engineering
  • Deleting outliers
  • Deleting features “Blood Pressure” & “Skin Thickness”
  • Replacing null values with mean values
  • Standard data scaling

  • Classifiers

ANN, SVC, Random Forest, Extra Trees, Logistic Regression, Ensemble

Evaluation metric -> Recall


31 of 52

Conclusion

  • ANN – better evaluation scores with more neurons and hidden layers

  • ANN3 had the best accuracy, precision and F1-score; Logistic Regression had the best recall among the classifiers

  • Data modification improved recall for every model except Logistic Regression

32 of 52

Interpretability/ Explainable AI (XAI)

  • Machine learning -> black box

  • Difficult to interpret

  • How does each feature affect the model's prediction?

33 of 52

Interpretability/ Explainable AI (XAI)

  • SHAP (SHapley Additive exPlanation)
  • Black-Box Model : f
  • Surrogate Model : g
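For intuition (this is not the shap library), the Shapley value of feature i is its marginal contribution averaged over all feature orderings; for a linear model with absent features set to their means, it reduces to w_i * (x_i - mean_i). A brute-force toy sketch:

```python
from itertools import permutations
from math import factorial

# Toy linear "black box" f(z) = w . z; explain instance x against the feature means
weights = [2.0, -1.0, 0.5]
x       = [1.0,  2.0, 3.0]
mean    = [0.5,  1.0, 1.5]

def f_masked(subset):
    """Model output with features in `subset` taken from x, the rest from the mean."""
    return sum(w * (x[i] if i in subset else mean[i]) for i, w in enumerate(weights))

n = len(weights)
phi = [0.0] * n
for order in permutations(range(n)):   # exact Shapley: average marginal
    seen = set()                       # contribution over all orderings
    for i in order:
        before = f_masked(seen)
        seen.add(i)
        phi[i] += (f_masked(seen) - before) / factorial(n)
```

By the additivity property, the SHAP values sum to f(x) minus the baseline prediction — exactly what the force plots on the next slides visualize per subject.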


34 of 52

SHAP value

  • Force Plot
  • Subject 0:

Positive SHAP value vs. Negative SHAP value

Glucose contributed the most


35 of 52

SHAP value


  • Force Plot
  • Subject 300:

Positive SHAP value vs. Negative SHAP value

Glucose contributed the most

36 of 52

Glucose Dependence Plot

  • High values of the Glucose variable have a high positive contribution to the prediction

37 of 52

Age Dependence Plot

  • High values of the Age variable tend to have a high positive contribution to the prediction up to about age 50

38 of 52

SkinThickness Dependence Plot

  • Values of the Skin Thickness variable show no correlation with their SHAP values

39 of 52

Summary Plot

  • Order of global feature importance
  • Glucose contributed the most and Skin Thickness the least

40 of 52

Summary Plot

  • 1. Glucose
  • 2. BMI
  • 3. Age
  • 4. DiabetesPedigree
  • 5. Pregnancies
  • 6. Insulin
  • 7. Blood Pressure
  • 8. SkinThickness


41 of 52

Summary Plot

  • 1. Glucose
  • 2. BMI
  • 3. Age
  • 4. DiabetesPedigree
  • 5. Pregnancies
  • 6. Insulin
  • 7. Blood Pressure
  • 8. Skin Thickness

  1. Glucose: 0.47
  2. BMI: 0.29
  3. Age: 0.24
  4. Pregnancies: 0.22
  5. DiabetesPedigree: 0.17
  6. Insulin: 0.13
  7. Skin Thickness: 0.07
  7. Blood Pressure: 0.07 (tied)

42 of 52

Thank you for listening

43 of 52

SVC classifier (Support Vector Machine)

  • Supervised Learning Algorithm
  • Support Vector
  • Hyperplane

  • Creates a decision boundary that segregates n-dimensional space into classes

44 of 52

Random Forest Classifier

  • Supervised Learning Algorithm
  • Builds a number of decision trees on various subsets of the given dataset
  • Ensemble learning (majority voting)
  • Bagging (bootstrap + aggregation)

45 of 52

Extra Trees Classifier

  • Supervised Learning Algorithm

  • Variant of Random Forest

  • Ensemble of decision trees

  • Majority voting

  • No bootstrap samples = uses the whole original sample

  • Random split points (Random Forest: optimal split points)

46 of 52

Logistic Regression Model

  • Supervised Learning Algorithm
  • Predicts a categorical dependent variable from a given set of independent variables
  • Outputs a probability between 0 and 1
  • Used for classification problems
  • A threshold converts the probability into a class
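The probability-plus-threshold idea can be sketched without any library (the weights below are made up, not fitted coefficients):

```python
from math import exp

def predict_proba(x, weights, bias):
    """Logistic regression: the sigmoid squashes the linear score into (0, 1)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + exp(-z))

def predict(x, weights, bias, threshold=0.5):
    """Class label from probability via a decision threshold."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0
```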

47 of 52

Ensemble Model

  • Combines multiple classifiers to solve a complex problem and improve overall performance
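Majority voting, the simplest way to combine classifiers, fits in a few lines (the votes below are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """Final class = most common prediction among the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

votes = [1, 0, 1, 1, 0]   # hypothetical outputs of five classifiers
final = majority_vote(votes)
```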

48 of 52

Feature Engineering

  • Histogram: Data distribution

[histograms: Glucose, BMI, Insulin]

49 of 52

Feature Engineering

  • Histogram: Data distribution

[histograms: Glucose, BMI, Insulin]

50 of 52

BMI Dependence Plot

  • High values of the BMI variable have a high positive contribution to the prediction

51 of 52

Pregnancies Dependence Plot

  • High values of the Pregnancies variable tend to have a high positive contribution

52 of 52

Optimizer

  • Gradient Descent – slow, can get stuck in local minima

  • Gradient Descent + momentum – faster, can escape saddle points

  • Adagrad (Adaptive Gradient algorithm) – slow (per-parameter learning rates decay), escapes saddle points

  • RMSProp (Root Mean Square Propagation) – faster, adds a decaying factor to the accumulated squared gradients

  • Adam – momentum + RMSProp
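The momentum idea can be sketched on a toy quadratic (hypothetical hyperparameters, not the slides' setup): the velocity accumulates past gradients, so the iterate keeps moving through flat regions.

```python
def gd_momentum(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: velocity is a decaying sum of gradients."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)   # update velocity with the current gradient
        x += v                        # step along the velocity
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2 * (x - 3)
x_min = gd_momentum(lambda x: 2.0 * (x - 3.0), x0=0.0)
```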