Stroke Prediction using R
Group 1: Andres T. | Gregory H. | Maria S. | Jaycii C.
Introduction
Behind the Dataset
This dataset contains around 5,100 observations and 12 attributes, with each observation representing a patient. The attributes cover the patient’s health history and demographics and, most notably, whether the patient has had a stroke.
Our dependent variable is the binary variable ‘Stroke’, which we use to build models that predict whether a patient will have a stroke.
Our independent variables are Gender, Age, Heart Disease, Marriage Status, Work Type, Residence Type, Glucose Level, and BMI.
Why research this? Our findings could help doctors and patients change lifestyle and habits to decrease the risk of having a stroke.
The numeric and factor variables are summarized below. The tables give the range of values for each numeric variable and the breakdown of categories for each factor variable.
Table 1: Numeric Variables
Table 2: Factored Variables
Behind the Dataset & Techniques for Research
Stroke statistics
Techniques used in this project:
For data analysis:
For data prediction:
Data Cleansing
Changing BMI Datatype
Dropping “Unknown” Instances in Smoking Status Column
Check and Drop Any Other Missing Values
Dropping One Instance of “Other” in Genders Column
Data Visualization / Descriptive Analysis
Stroke in different age groups
The data has a slight left skew, and most of the data falls between ages 45 and 60.
The mean age is 42.87 and the median age is 44.
Overall, there is a low number of stroke cases, and the majority of them occur after the age of 45.
Plot 1: Patient Age vs Patient Count
Plot 2: Patient Age vs Patient Count (Filtered by Gender: female, male)
Plot 3: Age Distribution in Dataset
- Data is almost normally distributed.
- Modal age between 50-55 years.
Plot 5: Average Glucose Levels in Patients
The data is skewed to the left, implying that strokes are concentrated among older patients.
Very few strokes were reported in patients under the age of 40.
Plot 6: Patient Count vs Patient Age
Models & Results
Balancing the Dataset
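The lessons-learned slide names SMOTE as the balancing technique. The sketch below illustrates the core idea in base R on made-up numeric data: each synthetic minority-class row is placed at a random position on the line segment between a minority observation and one of its k nearest minority neighbours. It is not the project's code; `smote_sketch` and its parameters are hypothetical names, it handles numeric features only, and a real analysis would more likely rely on a package implementation.

```r
# Illustrative SMOTE for numeric features only (hypothetical helper, base R).
# minority: numeric matrix of minority-class rows
# n_new:    number of synthetic rows to generate
# k:        number of nearest minority neighbours to sample from
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  stopifnot(n >= 2)                       # need at least one neighbour
  k <- min(k, n - 1)                      # cannot exceed the number of other points
  distances <- as.matrix(dist(minority))  # pairwise Euclidean distances
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (i in seq_len(n_new)) {
    base_idx <- sample.int(n, 1)
    # Nearest neighbours of the base point, excluding the point itself.
    neighbour_order <- order(distances[base_idx, ])
    neighbours <- setdiff(neighbour_order, base_idx)[seq_len(k)]
    neighbour_idx <- neighbours[sample.int(k, 1)]
    gap <- runif(1)                       # position along the connecting segment
    synthetic[i, ] <- minority[base_idx, ] +
      gap * (minority[neighbour_idx, ] - minority[base_idx, ])
  }
  synthetic
}

set.seed(1)
# Made-up minority class: 6 stroke cases with two numeric features.
stroke_cases <- cbind(age = c(61, 67, 72, 79, 80, 58),
                      avg_glucose = c(202, 229, 106, 174, 186, 105))
new_cases <- smote_sketch(stroke_cases, n_new = 20, k = 3)
```

Because every synthetic row is an interpolation between two real minority rows, its values always stay within the range observed in the minority class.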
Creating Training/Testing Data
Training data (also called the training dataset, training set, or learning set) is the portion of the data used to fit a machine learning model. The model learns from it what the expected output should look like for a given input.
The held-out testing data is then used to check how well the model predicts cases it has not seen.
Our data is split into 70% training and 30% testing.
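A 70/30 split like the one described above can be sketched in base R as follows. The data frame is made up, and the seed and variable names are arbitrary choices for illustration rather than the project's own.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Made-up data standing in for the cleaned, balanced dataset.
patients <- data.frame(
  age    = round(runif(100, 20, 85)),
  stroke = factor(rep(c(0, 1), each = 50))
)

# Randomly choose 70% of the row indices for training.
n_rows <- nrow(patients)
train_idx <- sample(seq_len(n_rows), size = round(0.7 * n_rows))

train_data <- patients[train_idx, ]   # used to fit the models
test_data  <- patients[-train_idx, ]  # held out for evaluation
```

Sampling row indices without replacement guarantees that no patient appears in both sets.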
GLM Model
Decision Tree Model
Random Forest Model
SVM w/ Linear Kernel
Model Selection
Based on the confusion matrices of the four models we created, we chose the Random Forest model for our data.
Highest Accuracy -> 0.9009 (share of correct predictions)
Highest Sensitivity -> 0.8663 (Least Type 2 Error)
Highest Specificity -> 0.9332 (Least Type 1 Error)
This model is the most accurate and has the fewest errors (Type 1 & Type 2).
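For reference, the three metrics are computed from confusion-matrix counts as sketched below. The counts are made up for illustration and are not the project's results; "positive" is assumed to mean a stroke case.

```r
# Made-up confusion-matrix counts (NOT the project's results).
# "Positive" means stroke.
true_positive  <- 80   # stroke predicted, stroke observed
false_negative <- 20   # no stroke predicted, stroke observed (Type 2 error)
false_positive <- 10   # stroke predicted, no stroke observed (Type 1 error)
true_negative  <- 90   # no stroke predicted, no stroke observed

total <- true_positive + false_negative + false_positive + true_negative

accuracy    <- (true_positive + true_negative) / total            # 170 / 200 = 0.85
sensitivity <- true_positive / (true_positive + false_negative)   # 80 / 100  = 0.80
specificity <- true_negative / (true_negative + false_positive)   # 90 / 100  = 0.90
```

Higher sensitivity means fewer missed strokes (Type 2 errors), and higher specificity means fewer false alarms (Type 1 errors).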
Conclusion
Results
We chose the Random Forest model for this project because it was the most accurate and showed the least error of the four models we created. Using this model, we should be able to better predict the likelihood of a stroke from a patient’s demographic and health variables.
Lessons Learned
We learned many lessons throughout this project. We learned that an unbalanced dataset is difficult to model, and how to address that issue with the SMOTE technique. We also learned that our models need further evaluation to yield more accurate results. We gained good project experience and learned how to work as a group on a data analytics project. Thank you, Professor, for all of your help and guidance this semester.