3 of 23

Behind the Dataset

This dataset contains around 5,100 observations and 12 attributes, with each observation representing one patient. The attributes describe the patient’s health history, demographics and, most notably, whether the patient has had a stroke.

Our dependent variable is the binary variable ‘Stroke’, which we use to build models that predict whether a patient will have a stroke.

Our independent variables are Gender, Age, Heart Disease, Marriage Status, Work Type, Residence Type, Glucose Level and BMI.

WHY RESEARCH THIS? Our findings could help doctors advise patients on lifestyle and habit changes that decrease their risk of having a stroke.

The numeric and factor variables are summarized below. The tables show the range of values for the numeric variables and the breakdown of categories for the factor variables.

Table 1: Numeric Variables

Table 2: Factored Variables

4 of 23

Behind the Dataset & Techniques for Research

Stroke statistics

In 2020, 1 in 6 deaths from cardiovascular disease was due to stroke. [1]
Every 40 seconds, someone in the United States has a stroke. Every 3.5 minutes, someone dies of stroke. [2]
Every year, more than 795,000 people in the United States have a stroke. About 610,000 of these are first or new strokes. [2]
About 185,000 strokes, nearly 1 in 4, are in people who have had a previous stroke. [2]

Techniques used in this project:

For data analysis:

Cleaning data
Data visualization (Bar graphs, bar plots, histograms, and box plots)

For data prediction:

GLM model
Decision Tree Model
Random Forest Model
SVM w/ Linear Kernel

5 of 23

Data Cleansing

Changing BMI Datatype

Dropping “Unknown” Instances in Smoking Status Column

Check and Drop Any Other Missing Values

Dropping One Instance of “Other” in Genders Column
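
The four cleaning steps above might look roughly like the following in base R. This is a minimal sketch on a made-up five-row data frame; the column names (gender, bmi, smoking_status, stroke) and the "N/A" text used for missing BMI values are assumptions, not taken from the project code.

```r
# Toy data standing in for the real dataset; column names and values are assumed.
patients <- data.frame(
  gender         = c("Male", "Female", "Other", "Female", "Male"),
  bmi            = c("36.6", "N/A", "22.1", "28.9", "31.4"),
  smoking_status = c("never smoked", "smokes", "smokes", "Unknown", "formerly smoked"),
  stroke         = c(1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# 1. Change the BMI datatype: text that is not a number (here "N/A") becomes NA.
patients$bmi <- suppressWarnings(as.numeric(patients$bmi))

# 2. Drop rows whose smoking status is "Unknown".
patients <- patients[patients$smoking_status != "Unknown", ]

# 3. Check for missing values per column, then drop any rows that contain them.
print(colSums(is.na(patients)))
patients <- na.omit(patients)

# 4. Drop the single "Other" instance in the gender column.
patients <- patients[patients$gender != "Other", ]

# Store the categorical columns as factors for later modelling.
patients$gender         <- factor(patients$gender)
patients$smoking_status <- factor(patients$smoking_status)

print(patients)
```

In this toy example two rows survive: the others are removed for an unknown smoking status, a missing BMI, or the "Other" gender value.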

6 of 23

Data Visualization / Descriptive Analysis

7 of 23

Stroke in different age groups

The data has a slight left skew and most of the data falls between 45 and 60.

The mean age is 42.87 and the median age is 44.

Overall there is a low number of stroke cases and the majority of them happen past the age of 45.

Plot 1: Patient Age vs Patient Count

8 of 23

Female

We have more data for females.
There are some stroke cases among patients younger than 40.
The youngest patient with a stroke in the dataset is a female in her 30s.

Male

The data has a slight left skew and most of the data falls between 45 and 60.
We have less data for males.
The 20-to-40 age group has fewer observations than the same age group for females.

Plot 2: Patient Age vs Patient Count (Filtered by Gender)

9 of 23

Plot 3: Age Distribution in Dataset

- Data is almost normally distributed.

- Modal age between 50-55 years.

10 of 23

Plot 5: Average Glucose Levels in Patients

11 of 23

The data is skewed to the left, implying a higher concentration of strokes in older patients.

Very few strokes were reported in patients under the age of 40.

Plot 6: Patient Count vs Patient Age

12 of 23

Models & Results

13 of 23

Balancing the Dataset

Our original dataset had only approximately 5% of rows with a stroke, making it difficult to create accurate models.
We used the Synthetic Minority Oversampling Technique (SMOTE) to balance it.
The dataset became almost 50-50 between stroke rows and non-stroke rows.
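
To illustrate the idea behind SMOTE, here is a hand-written sketch in base R. It is not the package or the call used in this project: the function smote_sketch, the toy columns age and glucose, and the choice of k = 5 are all assumptions for illustration. Each synthetic row is placed at a random point on the line between a minority row and one of its nearest minority neighbours.

```r
set.seed(42)  # make the toy data and synthetic rows reproducible

# Minimal SMOTE sketch for numeric features only (hypothetical helper).
# minority: numeric matrix/data frame holding the minority-class rows
# n_new:    number of synthetic rows to create
# k:        number of nearest minority neighbours to choose from
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  k <- min(k, n - 1)
  d <- as.matrix(dist(minority))  # pairwise Euclidean distances
  diag(d) <- Inf                  # a row is never its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)                      # pick a random minority row
    neighbours <- order(d[base, ])[seq_len(k)]    # its k nearest neighbours
    nb <- neighbours[sample.int(length(neighbours), 1)]
    gap <- runif(1)                               # position along the segment
    synthetic[i, ] <- minority[base, ] + gap * (minority[nb, ] - minority[base, ])
  }
  synthetic
}

# Toy imbalanced data: 95 non-stroke rows and 5 stroke rows (about 5%).
toy <- data.frame(
  age     = c(rnorm(95, 40, 15), rnorm(5, 70, 8)),
  glucose = c(rnorm(95, 100, 20), rnorm(5, 140, 30)),
  stroke  = c(rep(0, 95), rep(1, 5))
)

minority <- toy[toy$stroke == 1, c("age", "glucose")]
new_rows <- smote_sketch(minority, n_new = 90)

# Append the synthetic stroke rows so both classes have 95 rows.
balanced <- rbind(toy, data.frame(new_rows, stroke = 1))
print(table(balanced$stroke))
```

Because every synthetic row is an interpolation between two real minority rows, its values always stay within the range already seen in the stroke group.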

14 of 23

Creating Training/Testing Data

Training data is also known as the training dataset, learning set, or training set. It is an essential component of every machine learning model and helps the model make accurate predictions or perform a desired task.

Training data is what the machine learning model is built from. It teaches the model what the expected output should look like.

Our data is split into 70% training and 30% testing. The testing data is held back to measure how well each model predicts cases it has not seen.
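
A 70/30 split can be sketched in base R as follows. The data frame toy, its size, and the seed are assumptions for illustration; the project may have used a different function to split the data.

```r
set.seed(123)  # fix the random split so it can be reproduced

# Toy data: 1,000 hypothetical patients with a roughly balanced outcome.
n <- 1000
toy <- data.frame(id = seq_len(n), stroke = rbinom(n, 1, 0.5))

# Randomly choose 70% of the row indices for training.
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train <- toy[train_idx, ]   # 70% used to build the models
test  <- toy[-train_idx, ]  # remaining 30% used only for evaluation

print(c(train = nrow(train), test = nrow(test)))
```

Indexing with -train_idx guarantees that no patient appears in both sets.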

15 of 23

GLM Model

Used a GLM w/ Age, Heart Disease, Avg Glucose Level, Hypertension & Ever Married
Accuracy -> 0.7633
Sensitivity -> 0.8192
Specificity -> 0.7095
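
As an illustration of this kind of model, the sketch below fits a logistic GLM with base R's glm() on simulated data. The data, the coefficients used to simulate it, the column names, and the 0.5 cutoff are all assumptions, and only three of the five predictors are included to keep it short. The numbers it prints are not the project's results.

```r
set.seed(1)

# Simulated patients (assumed column names, not the real dataset).
n <- 400
toy <- data.frame(
  age               = runif(n, 20, 85),
  avg_glucose_level = rnorm(n, 110, 30),
  hypertension      = rbinom(n, 1, 0.2)
)

# Simulated outcome: stroke risk rises with age, glucose and hypertension.
log_odds <- -6 + 0.08 * toy$age + 0.01 * toy$avg_glucose_level +
  0.5 * toy$hypertension
toy$stroke <- rbinom(n, 1, plogis(log_odds))

# 70/30 train/test split.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Logistic regression: a GLM with a binomial family.
fit <- glm(stroke ~ age + avg_glucose_level + hypertension,
           data = train, family = binomial)

# Predicted probabilities on the test set, cut at 0.5 to get a class.
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

accuracy <- mean(pred == test$stroke)
print(summary(fit)$coefficients)
print(accuracy)
```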

16 of 23

Decision Tree Model

Created a Decision Tree Model which ends up using Age & Ever Married
Accuracy -> 0.7678
Sensitivity -> 0.6547
Specificity -> 0.8763
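
A classification tree of this kind could be fitted with the rpart package, as in the sketch below. Whether the project used rpart is an assumption, and the simulated data, column names and outcome rule are invented for illustration only.

```r
library(rpart)  # recursive partitioning trees; assumed package choice

set.seed(7)

# Simulated patients: stroke depends on age only, by construction.
n <- 500
toy <- data.frame(
  age          = runif(n, 18, 85),
  ever_married = factor(sample(c("Yes", "No"), n, replace = TRUE))
)
toy$stroke <- factor(ifelse(toy$age > 55 & runif(n) < 0.85, "yes", "no"))

# method = "class" requests a classification tree.
tree <- rpart(stroke ~ age + ever_married, data = toy, method = "class")

# Predicted class for each row (here on the same data, for brevity).
pred <- predict(tree, newdata = toy, type = "class")
tree_accuracy <- mean(pred == toy$stroke)

print(tree)
print(tree_accuracy)
```

Printing the fitted tree shows which variables it actually split on, which is how one can see that a tree ends up using only some of the predictors it was offered.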

17 of 23

Random Forest Model

Created a Random Forest Model
Accuracy -> 0.9009
Sensitivity -> 0.8663
Specificity -> 0.9332

18 of 23

SVM w/ Linear Kernel

Created an SVM model with a Linear Kernel
Accuracy -> 0.7913
Sensitivity -> 0.7408
Specificity -> 0.8399

19 of 23

Model Selection

Based on the Confusion Matrices of the four models that we created, we have chosen the Random Forest Model for our data.

Highest Accuracy -> 0.9009 (Predictability)

Highest Sensitivity -> 0.8663 (Least Type 2 Error)

Highest Specificity -> 0.9332 (Least Type 1 Error)

This model is the most accurate and has the fewest errors (Type 1 & Type 2).
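
The three metrics compared above all come from the counts in a confusion matrix. The sketch below shows the arithmetic in base R; the helper name confusion_metrics and the ten example labels are made up for illustration. Sensitivity equals 1 minus the Type 2 (false negative) rate, and specificity equals 1 minus the Type 1 (false positive) rate.

```r
# Hypothetical helper: accuracy, sensitivity and specificity from labels.
confusion_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(predicted == positive & actual == positive)  # strokes caught
  tn <- sum(predicted != positive & actual != positive)  # non-strokes cleared
  fp <- sum(predicted == positive & actual != positive)  # Type 1 errors
  fn <- sum(predicted != positive & actual == positive)  # Type 2 errors
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp)
  )
}

# Ten made-up patients: 4 actual strokes, 6 actual non-strokes.
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

metrics <- confusion_metrics(actual, predicted)
print(round(metrics, 4))
```

Here 3 of the 4 strokes are caught (sensitivity 0.75), 4 of the 6 non-strokes are cleared (specificity about 0.6667), and 7 of the 10 predictions are correct (accuracy 0.7).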

21 of 23

Results

We chose the Random Forest Model for this project because it was the most accurate and showed the least error among the four models we created. Using this model, we should be able to better predict the likelihood of a stroke from a patient’s variables.

We learned many lessons throughout this project. We learned that it is difficult to model an unbalanced dataset, and how to address that issue with the SMOTE technique. We learned that we need to evaluate and refine our models further to get even more accurate results. We gained good project experience and learned how to work as a group on a data analytics project. Thank you, Professor, for all of your help and guidance this semester.

22 of 23

Lessons Learned

Difficulty of Using a Dataset w/ Few Instances of the Response We Wanted
Understood the Importance of Balancing Our Dataset
Need to Further Refine Our Models for Even More Accurate Results
Connecting the Research Question Back to the Models
Became More Fluent in R and Gained Project Experience
How to Work in a Group Setting on a Data Analysis Project
