1 of 16

Breast Cancer Analysis and Prediction

PRESENTED BY:

Blessing Orji
Chidinma Ukaegbu
Damilola Onanuga
Fatimah Salami
Genevieve Nwagwu
Kingsley Makinde
Motunrayo Koyejo
Taiwo Bello

2 of 16

TABLE OF CONTENTS

Introduction
Aims & Objectives
Methodology
Result
Feature Selection and its Significance
Challenges
Conclusion
Recommendations

3 of 16

Introduction

Breast cancer is the most common cancer among women worldwide, and its incidence is rising, particularly in developing countries, where the majority of cases are diagnosed at late stages.

Early detection, which improves breast cancer outcomes and survival, remains the cornerstone of breast cancer control.

Tests such as the FNA (Fine Needle Aspirate) biopsy are carried out on a breast mass to determine whether it is malignant (cancerous) or benign (non-cancerous).

Source: Johns Hopkins Medicine

4 of 16

Useful Statistics - Nigeria

Total population: 195.8 million
Number of new cases: 26,310+
Deaths: 12,000+
Prevalent cases (5-year): 53,000+

Source: Globocan 2018

5 of 16

More Useful Statistics

Source: Johns Hopkins Medicine

6 of 16

Steps

The steps used in this classification problem include:

01 Get and analyze the dataset
02 Visualize the data
03 Pre-process the data and train the model
04 Test the model
05 Evaluate the model

7 of 16

Aims & Objectives

This analysis aims to reduce the time it takes to determine whether a lump is malignant or benign by training a model on datasets containing the features recorded from lump samples of tested patients.

8 of 16

Methodology

1. Explore dataset
2. Visualization (swarm plot, correlation matrix, frequency distribution, etc.)
3. Label Encoder
4. Train_Test_Split
5. Scale the Features
6. Classification Algorithm
7. Model Evaluation
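The methodology steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not the team's notebook: it assumes scikit-learn's bundled FNA-based dataset (`load_breast_cancer`) as a stand-in for the project data, a 25% hold-out, and fixed random seeds, none of which are stated in the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for the project's data.
data = load_breast_cancer()
X = data.data
diagnosis = data.target_names[data.target]  # string labels: "malignant" / "benign"

# Label Encoder: map the string diagnosis to integers (benign -> 0, malignant -> 1).
encoder = LabelEncoder()
y = encoder.fit_transform(diagnosis)

# Train_Test_Split: hold out 25% of the samples (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the Features: fit the scaler on the training split only,
# then apply the same transformation to the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Classification Algorithm: a Random Forest Classifier with default-like settings.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_scaled, y_train)

# Model Evaluation: accuracy on the held-out test split.
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {accuracy:.3f}")
```

Fitting the scaler on the training split alone keeps information from the test split out of pre-processing, which is why the split comes before scaling in this sketch.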

9 of 16

Methodology (continued)


10 of 16

Result

Of the machine learning classification algorithms compared, the Random Forest Classifier (RFC) performed best on the test data, with 98.6% accuracy.


11 of 16

Feature Selection and its Significance

Correlation matrix of the selected features and the resulting accuracies
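One common way to select features from a correlation matrix is to drop one feature from every highly correlated pair and then compare accuracy before and after. The sketch below illustrates that idea only; the 0.9 threshold, the greedy selection rule, the dataset (`load_breast_cancer`), and the helper names `select_uncorrelated` and `rfc_accuracy` are assumptions, not the technique used in the project.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled dataset stands in for the project's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Absolute Pearson correlation between features, computed on the training split.
corr = np.abs(np.corrcoef(X_train, rowvar=False))


def select_uncorrelated(corr, threshold):
    """Greedily keep a feature only if it is not highly correlated
    with any feature that has already been kept."""
    kept = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept


def rfc_accuracy(columns):
    """Train a Random Forest on the given feature columns; return test accuracy."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, columns], y_train)
    return model.score(X_test[:, columns], y_test)


kept = select_uncorrelated(corr, threshold=0.9)  # threshold is an assumption
acc_all = rfc_accuracy(list(range(X.shape[1])))
acc_kept = rfc_accuracy(kept)

print(f"All {X.shape[1]} features: {acc_all:.3f}")
print(f"{len(kept)} selected features: {acc_kept:.3f}")
```

Comparing the two printed accuracies is one way to check whether a reduced feature set costs any predictive performance.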

12 of 16

Challenges

Deciding which classifiers to use and which to leave out.
Understanding and applying evaluation metrics.
Cleaning the data.
Feature selection (deciding which features to keep and which to drop).
Interpreting the results.
Visualizations repeatedly crashed or froze the notebook.
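On the evaluation-metrics point above, it can help to see how the usual metrics follow from the four confusion-matrix counts. The sketch below treats malignant as the positive class; the counts are hypothetical numbers chosen for illustration, not project results, and the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts.

    tp: malignant predicted as malignant    fp: benign predicted as malignant
    fn: malignant predicted as benign       tn: benign predicted as benign
    Assumes no denominator below is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted malignant, how many truly are
    recall = tp / (tp + fn)     # of truly malignant, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=50, fp=2, fn=3, tn=88)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a diagnostic setting, recall on the malignant class deserves particular attention, because a false negative means a missed cancer, which accuracy alone does not reveal.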

13 of 16

Conclusion

In this project, we applied several machine learning classification algorithms to understand how different classifiers yield different accuracies. We observed that most feature selection techniques did not significantly change accuracy, so they are mainly useful for feature analysis and model simplification.

In conclusion, of the machine learning classification algorithms compared (Logistic Regression, K-Nearest Neighbors, Support Vector Classifier (Linear), Support Vector Classifier (Radial Basis Function), Gaussian Naive Bayes, Decision Tree Classifier, and Random Forest Classifier (RFC)), the RFC performed best on the test data, with 98.6% accuracy.

ML Classification Algorithms
Algorithm	Test accuracy
Logistic Regression	95.80%
K-Nearest Neighbors Classifier	95.10%
Support Vector Classifier (Linear)	97.20%
Support Vector Classifier (RBF)	96.50%
Gaussian Naive Bayes	91.60%
Decision Tree Classifier	95.80%
Random Forest Classifier	98.60%
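A comparison like the one in the table can be sketched by fitting each classifier on the same split and scoring it on the same test data. This is an illustrative sketch under assumptions (scikit-learn's bundled `load_breast_cancer` dataset, a 25% hold-out, default hyperparameters, fixed seeds), so the accuracies it prints are not expected to match the table.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled dataset stands in for the project's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The seven classifiers named in the slides, with assumed hyperparameters.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors Classifier": KNeighborsClassifier(),
    "Support Vector Classifier (Linear)": SVC(kernel="linear"),
    "Support Vector Classifier (RBF)": SVC(kernel="rbf"),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

results = {}
for name, classifier in classifiers.items():
    # Scaling lives inside the pipeline so it is fitted on training data only.
    pipeline = make_pipeline(StandardScaler(), classifier)
    pipeline.fit(X_train, y_train)
    results[name] = pipeline.score(X_test, y_test)

# Print the classifiers from most to least accurate on the test split.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<36} {acc:.2%}")
```

Because the ranking can shift with the random split, repeating the comparison with cross-validation would give a steadier picture than a single hold-out.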

14 of 16

Recommendations

We recommend that the RFC be deployed in hospitals to predict whether lumps in patients are malignant or benign.
In addition, further research could explore using an ensemble of the classifiers to predict benign or malignant cancer.
Also, a model using computer vision could be embedded in an app to interpret mammography scans and give a benign or malignant diagnosis.
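As a starting point for the ensemble recommendation above, scikit-learn's `VotingClassifier` combines several fitted models by majority vote. The sketch below is one possible setup, not a tested design: the choice of three member classifiers, hard voting, the dataset (`load_breast_cancer`), and the split are all assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled dataset stands in for the project's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each member votes for a class; the majority class is the ensemble prediction.
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("svc", make_pipeline(StandardScaler(), SVC(kernel="linear"))),
        ("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

ensemble_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble test accuracy: {ensemble_accuracy:.3f}")
```

An odd number of members avoids tied votes in a two-class problem; whether the ensemble actually beats the single best classifier would need to be checked on the project's own data.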

15 of 16

THANK YOU!