Breast Cancer Analysis and Prediction
PRESENTED BY:
TABLE OF CONTENTS
Introduction
Breast cancer is the most common cancer in women worldwide, and its incidence is rising, particularly in developing countries, where the majority of cases are diagnosed at a late stage.
Early detection remains the cornerstone of breast cancer control because it improves outcomes and survival.
Tests such as the fine needle aspirate (FNA) biopsy are carried out on a breast mass to determine whether it is malignant (cancerous) or benign (non-cancerous).
Source: Johns Hopkins Medicine
Useful Statistics - Nigeria
Total population: 195.8 million
Number of new cases: 26,310+
Deaths: 12,000+
Prevalent cases (5-year): 53,000+
Source: Globocan 2018
More Useful Statistics
Source: Johns Hopkins Medicine
Steps
The steps used in this classification problem include:
Step One: Get and analyze the dataset
Step Two: Visualize the data
Step Three: Pre-process and train the data
Step Four: Test the model
Step Five: Evaluate the model
Aims & Objectives
This analysis aims to reduce the time it takes to determine whether a lump is malignant or benign by training a model on datasets containing the features recorded from lump samples of tested patients.
Methodology
1 Explore dataset
2 Visualization (swarm plot, correlation matrix, frequency distribution, etc.)
3 Label Encoder
4 Train_Test_Split
5 Scale the Features
6 Classification Algorithm
7 Model Evaluation
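The methodology above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the project's actual code: the slides do not name the dataset, so scikit-learn's bundled Wisconsin Diagnostic breast cancer data (features computed from FNA samples) stands in for it, and the 25% test split, the stratification, the random seed, and the choice of a Random Forest are all assumptions. The visualization step is omitted to keep the sketch free of plotting dependencies.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumption: the bundled Wisconsin Diagnostic dataset stands in for the
# project's data, which the slides do not identify.
data = load_breast_cancer()
X = data.data  # 30 numeric features per sample

# scikit-learn ships the target as integers (0 = malignant, 1 = benign).
# Recreate text labels so the Label Encoder step has something to encode.
diagnosis = np.where(data.target == 0, "M", "B")

# Explore dataset: size and class balance.
print("samples, features:", X.shape)
print("class counts:", dict(zip(*np.unique(diagnosis, return_counts=True))))

# Label Encoder: classes are sorted alphabetically, so B -> 0 and M -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(diagnosis)

# Train_Test_Split: hold out 25% of the samples (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the Features: fit the scaler on the training split only, so no
# information from the test split leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Classification Algorithm: a Random Forest with default-style settings.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_scaled, y_train)

# Model Evaluation: accuracy and confusion matrix on the held-out split.
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
print(f"test accuracy: {accuracy:.3f}")
print(cm)
```

Because the data source, split, and seed are assumptions, the accuracy printed here should not be expected to match the figures reported in the slides.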
Result
Of the machine learning classification algorithms compared, the Random Forest Classifier (RFC) performed best on the test data, with 98.6% accuracy.
Feature Selection and its Significance
Correlation matrix of selected features and their accuracies
Challenges
Conclusion
In this project, we applied seven machine learning classification algorithms to see how different classifiers yield different accuracies. We observed that most feature selection techniques do not significantly change accuracy and are therefore mainly useful for feature analysis and model simplification.
In conclusion, among the classifiers compared (Logistic Regression, K-Nearest Neighbors, Support Vector Classifier with a linear kernel, Support Vector Classifier with a radial basis function kernel, Gaussian Naive Bayes, Decision Tree Classifier, and Random Forest Classifier (RFC)), the RFC performed best on the test data, with 98.6% accuracy.
ML Classification Algorithms
Algorithm | Status | Comment | Test accuracy |
Logistic Regression | | | 95.80% |
KNeighbors Classifier | | | 95.10% |
SVC_lin | | | 97.20% |
SVC_rbf | | | 96.50% |
GaussianNB | | | 91.60% |
Decision Tree Classifier | | | 95.80% |
Random Forest Classifier | | | 98.60% |
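A comparison like the one in the table can be sketched by fitting each classifier on the same split and scoring it on the same held-out data. This is an illustration under assumptions rather than the project's code: the dataset is scikit-learn's bundled breast cancer data, the split ratio and seed are guesses, and every classifier uses near-default hyperparameters, so the printed accuracies will generally differ from the table.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled dataset, 25% test split, fixed seed.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The seven classifiers named in the slides, with assumed hyperparameters.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNeighbors Classifier": KNeighborsClassifier(n_neighbors=5),
    "SVC_lin": SVC(kernel="linear"),
    "SVC_rbf": SVC(kernel="rbf"),
    "GaussianNB": GaussianNB(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}

results = {}
for name, clf in classifiers.items():
    # Scaling lives inside the pipeline, so it is fitted on training data only.
    pipeline = make_pipeline(StandardScaler(), clf)
    pipeline.fit(X_train, y_train)
    results[name] = pipeline.score(X_test, y_test)  # accuracy on the test split

# Print the classifiers from most to least accurate.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26} {acc:.2%}")
```

A single train/test split gives a noisy ranking on a dataset this small, so differences of one or two percentage points between classifiers are best confirmed with cross-validation (for example `sklearn.model_selection.cross_val_score`).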
Recommendations
THANK YOU!