Pima Indians Diabetes Dataset
Classification
Minjae Chung (01715597)
Introduction to dataset
Why is it worth analyzing?
EDA (Exploratory Data Analysis)
Outlier Detection
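The slide does not record which outlier rule was applied. One common choice is the interquartile range (IQR) rule, sketched below as an assumption rather than the method actually used; the sample readings are made up for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical insulin readings, not taken from the dataset.
sample = [80, 85, 90, 94, 100, 105, 110, 120, 846]
print(iqr_outliers(sample))
```

The multiplier `k=1.5` is the conventional default; a larger value flags fewer points.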
-> 0 Null Values
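A null count like the one above can be reproduced with a few lines of standard-library Python. The sketch assumes a CSV with the dataset's column names; the three rows are a hypothetical excerpt. It also counts zeros separately, since a column with no nulls can still hold placeholder zeros.

```python
import csv
import io

# Hypothetical three-row excerpt using a subset of the dataset's columns.
raw = """Pregnancies,Glucose,BloodPressure,BMI,Outcome
6,148,72,33.6,1
1,85,66,26.6,0
8,183,0,23.3,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# A cell counts as null when it is empty after stripping whitespace.
null_counts = {c: sum(1 for r in rows if r[c].strip() == "") for c in columns}

# Zeros are not nulls, so they are tallied on their own.
zero_counts = {
    c: sum(1 for r in rows if r[c].strip() and float(r[c]) == 0)
    for c in columns
}

print(null_counts)
print(zero_counts)
```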
7. Blood Pressure 0.07
BMI categories (Centers for Disease Control and Prevention):
Healthy (18.5 to 24.9)
Overweight (25.0 to 29.9)
Obese (30.0 or higher)
Mean BMI: 30 35
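The CDC bands listed on this slide translate directly into a lookup function. This is a minimal sketch; the "Underweight" band (below 18.5) is not on the slide and is added here from the same CDC scale so that every input maps to a label.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value to its CDC adult weight-status band."""
    if bmi < 18.5:
        return "Underweight"  # not listed on the slide; added for completeness
    if bmi < 25.0:
        return "Healthy"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"

# The two mean BMI figures quoted on the slide.
for value in (30, 35):
    print(value, bmi_category(value))
```

Both mean values quoted on the slide fall in the obese band.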
Feature Engineering
-> 751 subjects
Artificial Neural Network 1
Layer | Neurons |
Input layer | 8 |
Hidden layer1 | 7 |
Hidden layer2 | 5 |
Hidden layer3 | 3 |
Output layer | 2 |
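To make the 8-7-5-3-2 layout concrete, the sketch below runs one forward pass through a network of that shape in plain Python. The slides do not state the activation functions or the initialization, so ReLU hidden layers, a softmax output, and random weights are all assumptions made for illustration.

```python
import math
import random

def forward(x, weights, biases):
    """Forward pass: ReLU on hidden layers, softmax on the output layer."""
    a = x
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(w * v for w, v in zip(row, a)) + bj for row, bj in zip(W, b)]
        if i < last:
            a = [max(0.0, v) for v in z]          # ReLU (assumed)
        else:
            m = max(z)                             # shift for numerical stability
            exps = [math.exp(v - m) for v in z]
            total = sum(exps)
            a = [e / total for e in exps]          # softmax (assumed)
    return a

sizes = [8, 7, 5, 3, 2]          # layer widths of ANN1 from the slide
rng = random.Random(0)           # fixed seed, untrained random weights
weights = [
    [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(sizes, sizes[1:])
]
biases = [[0.0] * n_out for n_out in sizes[1:]]

x = [rng.random() for _ in range(sizes[0])]   # hypothetical scaled input row
probs = forward(x, weights, biases)
print(probs)
```

The two outputs are class probabilities that sum to one; with untrained weights they carry no meaning beyond showing the data flow.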
Artificial Neural Network 2
Layer | Neurons |
Input layer | 8 |
Hidden layer1 | 10 |
Hidden layer2 | 7 |
Hidden layer3 | 7 |
Hidden layer4 | 5 |
Hidden layer5 | 5 |
Output layer | 2 |
Artificial Neural Network 3
Layer | Neurons |
Input layer | 8 |
HL1 | 15 |
HL2 | 10 |
HL3 | 10 |
HL4 | 7 |
HL5 | 7 |
HL6 | 7 |
HL7 | 7 |
HL8 | 7 |
HL9 | 7 |
HL10 | 5 |
HL11 | 5 |
Output layer | 2 |
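One way to compare the three architectures is by trainable parameter count. The sketch below assumes every layer is fully connected with a bias term, which the slides do not state explicitly.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected stack of the given widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Neuron counts as listed on the three architecture slides.
architectures = {
    "ANN1": [8, 7, 5, 3, 2],
    "ANN2": [8, 10, 7, 7, 5, 5, 2],
    "ANN3": [8, 15, 10, 10, 7, 7, 7, 7, 7, 7, 5, 5, 2],
}

for name, sizes in architectures.items():
    hidden = len(sizes) - 2
    print(name, hidden, "hidden layers,", dense_param_count(sizes), "parameters")
```

Under these assumptions even the deepest network stays under a thousand parameters.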
Network Modelling / Optimizer = Adam
Model | Accuracy | Precision | Recall | F1-score |
ANN1 | 0.7597 | 0.7428 | 0.4814 | 0.5842 |
ANN2 | 0.7792 | 0.7631 | 0.5370 | 0.6304 |
ANN3 | 0.8181 | 0.8095 | 0.6296 | 0.7083 |
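The F1-score column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. A minimal sketch using the three rows of the table above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table.
reported = {
    "ANN1": (0.7428, 0.4814, 0.5842),
    "ANN2": (0.7631, 0.5370, 0.6304),
    "ANN3": (0.8095, 0.6296, 0.7083),
}

for name, (p, r, f1) in reported.items():
    print(name, "recomputed:", round(f1_score(p, r), 4), "reported:", f1)
```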
Model Evaluation (Original Data)
Model | Accuracy | Precision | Recall | F1-score |
ANN3 | 0.8181 | 0.8095 | 0.6296 | 0.7083 |
SVC classifier | 0.7272 | 0.4629 | 0.6578 | 0.5434 |
Random Forest classifier | 0.7597 | 0.5925 | 0.6808 | 0.6336 |
Extra Trees classifier | 0.7662 | 0.6296 | 0.6800 | 0.6538 |
Logistic Regression | 0.7987 | 0.6111 | 0.7674 | 0.6804 |
Ensemble model | 0.7727 | 0.5925 | 0.7111 | 0.6464 |
Confusion Matrix
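For reference, the four cells of a binary confusion matrix and the metrics derived from them can be computed as follows. The labels are hypothetical and are not the model outputs behind the slide.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from the confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(metrics(*counts))
```

Recall is the share of truly diabetic subjects that the model catches, so a false negative (a missed case) lowers it directly.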
Model Evaluation (Data modified)
Model | Recall (original) | Recall (modified) |
ANN3 | 0.6296 | 0.7735 |
SVC classifier | 0.6578 | 0.7500 |
Random Forest classifier | 0.6808 | 0.7500 |
Extra Trees classifier | 0.6800 | 0.7317 |
Logistic Regression | 0.7674 | 0.7105 |
Ensemble model | 0.7111 | 0.7631 |
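The change in recall per model can be read off the table with a short script. The values below are copied from the table; nothing else is assumed.

```python
# (recall on original data, recall on modified data) from the table.
recall = {
    "ANN3": (0.6296, 0.7735),
    "SVC classifier": (0.6578, 0.7500),
    "Random Forest classifier": (0.6808, 0.7500),
    "Extra Trees classifier": (0.6800, 0.7317),
    "Logistic Regression": (0.7674, 0.7105),
    "Ensemble model": (0.7111, 0.7631),
}

changes = {name: round(mod - orig, 4) for name, (orig, mod) in recall.items()}
declined = [name for name, delta in changes.items() if delta < 0]
best_modified = max(recall, key=lambda name: recall[name][1])

for name, delta in changes.items():
    print(f"{name}: {delta:+.4f}")
print("declined:", declined)
print("highest recall on modified data:", best_modified)
```

Five of the six models gain recall on the modified data; Logistic Regression is the only one that drops.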
Summary
Dataset columns: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree, Age, Outcome
Models: ANN, SVC, Random Forest, Extra Trees, Logistic Regression, Ensemble
Evaluation metric -> Recall
Conclusion
Interpretability / Explainable AI (XAI)
SHAP value
Positive SHAP value vs. Negative SHAP value
Glucose contributed the most
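To illustrate what a positive or negative SHAP value means, the sketch below computes exact Shapley values by brute force for a toy model. It does not use the shap library and does not reproduce the deck's results: the linear score, its weights, the baseline, and the subject are all hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to actual
            new = predict(current)
            phi[i] += new - prev
            prev = new
    return [p / len(orders) for p in phi]

# Hypothetical linear risk score; the weights are invented for this example.
features = ["Glucose", "BMI", "Age"]
weights = [0.03, 0.05, 0.02]

def predict(v):
    return sum(w * xi for w, xi in zip(weights, v))

baseline = [120.0, 32.0, 33.0]   # hypothetical reference subject
x = [170.0, 28.0, 45.0]          # hypothetical subject being explained

phi = shapley_values(predict, x, baseline)
for name, value in zip(features, phi):
    print(f"{name}: {value:+.3f}")
```

A positive value pushes the prediction above the baseline, a negative one pulls it below, and the values sum to the difference between the subject's prediction and the baseline prediction.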
Prediction
Glucose Dependence Plot
Age Dependence Plot
SkinThickness Dependence Plot
Summary Plot
7. Blood Pressure 0.07
Thank you for listening
SVC classifier (Support Vector Machine)
Segregates n-dimensional space into classes
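In the linear case, the separation described above is a hyperplane: a point is assigned to a class by the sign of a weighted sum. The sketch below shows only that decision rule, not the training step; the weights, bias, and inputs are hypothetical, and the kernel used in the deck is not stated.

```python
def decision_function(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if decision_function(w, b, x) >= 0 else 0

# Hypothetical hyperplane over two scaled features (e.g. glucose, BMI).
w = [1.0, 0.5]
b = -1.0

print(classify(w, b, [0.9, 0.6]))
print(classify(w, b, [0.2, 0.4]))
```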
Random Forest Classifier
Extra Trees Classifier
Logistic Regression Model
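As a reminder of how a fitted logistic regression turns features into a class, the sketch below applies the sigmoid to a weighted sum and thresholds the result. The coefficients and the input are hypothetical, not the fitted values from this project.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def predict(weights, bias, x, threshold=0.5):
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# Hypothetical coefficients for two standardized features.
weights, bias = [1.2, 0.6], -0.3
x = [1.0, 0.5]

print(predict_proba(weights, bias, x), predict(weights, bias, x))
```

Lowering `threshold` below 0.5 trades precision for recall, which matters when recall is the metric of interest.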
Ensemble Model
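The slides do not say how the ensemble combines its members. One simple scheme is hard (majority) voting, sketched here as an assumption; the three prediction lists are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical outputs of three base classifiers on five subjects.
model_a = [1, 0, 1, 0, 0]
model_b = [1, 1, 0, 0, 0]
model_c = [0, 1, 1, 0, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

With an odd number of binary classifiers there are no ties; an even number would need an explicit tie-breaking rule.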
Feature Engineering
Glucose
BMI
Insulin
BMI Dependence Plot
Pregnancies Dependence Plot
Optimizer
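The networks in this deck were trained with Adam. As a reminder of what that optimizer does, the sketch below implements a single-parameter Adam update with the commonly used default decay rates and applies it to a toy quadratic. The learning rate and the objective are chosen for illustration and are not the settings used for the models.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (theta - 3)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)

print(theta)
```

Because the update divides by the running gradient magnitude, the first step has size close to `lr` regardless of how large the gradient is.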