Logistic Regression
TD ONE - Understanding and Preparing the Data
Fall 2025
Table of Contents
The 3 Types of ML
What is Logistic Regression?
Picking a Dataset
Pandas Methods and Attributes
Finding Null Values
Removing Null Values
Encoding
Checking data types
Next Lesson
Three Types of Machine Learning
The three types are supervised learning (models learn from labeled examples), unsupervised learning (models find structure in unlabeled data), and reinforcement learning (agents learn from rewards).
Logistic regression is supervised learning!
What is Logistic Regression?
Logistic regression is a supervised learning algorithm used for classification. It passes a weighted sum of the predictors through the sigmoid function to produce a probability between 0 and 1, which is then thresholded into a class (e.g., malignant vs. benign).
Linear Regression vs Logistic Regression
Linear regression predicts a continuous value (e.g., a price), while logistic regression predicts the probability of a discrete class (e.g., malignant or benign).
Picking a dataset
In this slideshow I will be using the Breast Cancer Dataset to build and test a classification model to predict whether a tumor is malignant or benign.
Create a folder for TD and add your chosen dataset to it
Importing Python Libraries
pandas
numpy
scikit-learn (imported as sklearn)
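The libraries above are typically imported once at the top of your notebook; the aliases shown are the standard community conventions, and the scikit-learn pieces are imported individually from their modules as you need them:

```python
# Standard import conventions for the libraries used in this TD
import pandas as pd   # DataFrames for loading and cleaning data
import numpy as np    # numerical arrays

# scikit-learn pieces are imported from their modules as needed
from sklearn.preprocessing import StandardScaler      # feature scaling
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.linear_model import LogisticRegression   # the model itself
```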
Pandas DataFrame Methods and Attributes
Methods: actions you call with parentheses, e.g. df.head(), df.info(), df.describe(), df.isnull(), df.dropna()
Attributes: stored properties accessed without parentheses, e.g. df.shape, df.dtypes, df.columns
Reading and exploring data
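A minimal sketch of this step. In the TD you would load your own CSV with pd.read_csv(); the filename below is commented out because it is an assumption, so a tiny stand-in DataFrame with Breast-Cancer-style columns is built inline to keep the example runnable:

```python
import pandas as pd

# In the TD you would load your dataset from the project folder, e.g.:
# df = pd.read_csv("breast_cancer.csv")   # filename is an assumption
# Tiny stand-in DataFrame for illustration:
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 11.42, 20.57],
    "area_mean": [1001.0, 566.3, 386.1, 1326.0],
})

print(df.head())      # method: first rows of the data
df.info()             # method: column types and non-null counts
print(df.shape)       # attribute: (rows, columns)
print(df.columns)     # attribute: column names
```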
Find null values
This dataset contains no null values, but checking is still a necessary step.
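The check is one line of pandas: isnull() flags missing cells and sum() counts them per column. A tiny stand-in frame with one deliberately missing value shows the output shape:

```python
import numpy as np
import pandas as pd

# Stand-in frame with one missing value (np.nan) for illustration
df = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 11.42],
    "area_mean": [1001.0, 566.3, 386.1],
})

# isnull() marks missing cells True; sum() counts them per column
null_counts = df.isnull().sum()
print(null_counts)   # radius_mean: 1, area_mean: 0
```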
Remove Null Values
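If nulls were present, dropna() removes every row that contains one. A sketch on the same kind of stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 11.42],
    "area_mean": [1001.0, 566.3, 386.1],
})

# dropna() removes any row containing a null;
# reset_index(drop=True) renumbers the remaining rows
clean = df.dropna().reset_index(drop=True)
print(clean.shape)   # one row removed: (2, 2)
```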
Encoding
Machine learning models work with numbers, so categorical (text) columns must be encoded as numeric values. In the Breast Cancer dataset, the diagnosis column holds "M" (malignant) and "B" (benign), which can be mapped to 1 and 0.
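One simple way to encode a two-category column is the Series map() method, passing a dictionary from each category to its number:

```python
import pandas as pd

df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# map() replaces each category with a number: Malignant -> 1, Benign -> 0
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
print(df["diagnosis"].tolist())   # [1, 0, 0, 1]
```

scikit-learn's LabelEncoder does the same job when there are more than two categories.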
Checking the data types
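After encoding, the dtypes attribute confirms that every column is numeric. select_dtypes() can double-check that no text ("object") columns remain:

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": [1, 0, 0, 1],                    # already encoded to numbers
    "radius_mean": [17.99, 13.54, 11.42, 20.57],
})

print(df.dtypes)  # every column should be numeric (int64 / float64)

# Programmatic check: no "object" (text) columns remain
text_cols = df.select_dtypes(include="object")
print(text_cols.empty)   # True
```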
Next Lesson
Logistic Regression
TD TWO - Model Tuning and Evaluation
Fall 2025
Target and Predictor Variables
Target Variable: the column your model is trying to predict.
Predictor Variable: a column the model uses to make that prediction.
The target is the answer you want to find, and the predictors are the clues you use to find it.
Target and Predictor Variables
In the Breast Cancer dataset, my target variable is the diagnosis column. I’m trying to predict whether the diagnosis is Malignant or Benign.
My predictor variables include every column except the diagnosis column. I’ll use these predictors to train the model to determine whether a tumor is malignant or benign.
Splitting the data into X (predictors) and y (target):
Your target variable is also known as the dependent variable, while your predictors are your independent variables.
They are also referred to as features (predictors) and labels (targets) in machine learning.
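The split itself is two lines: take the target column as y, and drop it from the frame to leave every other column as X. Sketch on a tiny stand-in frame with the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": [1, 0, 0, 1],
    "radius_mean": [17.99, 13.54, 11.42, 20.57],
    "area_mean": [1001.0, 566.3, 386.1, 1326.0],
})

# y (target / label) is the diagnosis column
y = df["diagnosis"]
# X (predictors / features) is every other column
X = df.drop(columns=["diagnosis"])

print(X.columns.tolist())   # ['radius_mean', 'area_mean']
print(y.tolist())           # [1, 0, 0, 1]
```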
Normalizing your data
Adjusting the values of your predictor variables so they’re all on a similar scale.
For example, in the Breast Cancer dataset some features are in the 10s, others in the 100s or 1000s, and others are small decimal values.
Because these values are on very different scales, a model might unintentionally give more attention to features with larger numbers, like area_mean, even though smaller-scale features like smoothness_mean are just as important.
Normalizing your data
Scikit-learn provides a tool called StandardScaler that helps you standardize your data.
For each column (feature), it calculates the mean and standard deviation, then standardizes the values by subtracting the mean and dividing by the standard deviation.
For example, say you have a feature "Age" with values [20, 40, 60, 80]. The mean is 50 and the standard deviation is ≈ 22.4.
After applying StandardScaler, you get approximately [−1.34, −0.45, 0.45, 1.34].
Now, all values are centered around 0 and scaled based on how far they are from the mean, making every feature comparable and model-friendly.
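The "Age" example from this slide, run through the actual StandardScaler. Note that StandardScaler expects a 2-D array (rows × features), so the four values go in as a single column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The "Age" feature from the slide, as one column of four rows
ages = np.array([[20.0], [40.0], [60.0], [80.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)   # (x - mean) / std for each value

print(scaler.mean_)                  # [50.]
print(np.round(scaled.ravel(), 2))   # [-1.34 -0.45  0.45  1.34]
```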
Splitting our data
We split our data into training and testing sets to measure how well our model performs on new, unseen data, not just the data it learned from.
This step is important because if we train and test on the same data, the model might memorize the answers instead of learning general patterns, a problem called overfitting.
By splitting the data, we can check if our model can generalize to new, real-world examples.
Splitting our data
The train_test_split() function from Scikit-learn’s model_selection module is used to split your dataset into a training set and a testing set.
This helps create separate datasets for training the model and evaluating how well it performs on unseen data.
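A sketch of the call on stand-in data. test_size sets the fraction held out for testing, and random_state fixes the shuffle so the split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix (8 rows, 2 columns) and labels
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 25% of rows for testing; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)   # (6, 2) (2, 2)
```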
Training the model
After splitting the data, we use the training set to teach the model how to recognize patterns and relationships between the predictors and the target variable.
In logistic regression, the model learns the best coefficients (or weights) for each feature so it can make accurate predictions.
Once the model is trained, we use the testing set to see how well it performs on new, unseen data.
This process helps the model learn patterns during training and then test whether it can generalize those patterns to real-world data.
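The steps above can be sketched end to end. The tiny one-feature dataset below is a stand-in (label 1 when the value is large), chosen so the learned decision boundary is easy to see:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny linearly separable stand-in: one feature, label 1 for large values
X_train = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# fit() learns a coefficient (weight) and intercept for the feature
model = LogisticRegression()
model.fit(X_train, y_train)

# predict() applies the learned weights to unseen data
X_test = np.array([[2.5], [9.5]])
print(model.predict(X_test))          # [0 1]
print(model.score(X_train, y_train))  # training accuracy: 1.0
```

On the real dataset, X_train/y_train come from train_test_split, and score() on the test set measures generalization.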
Next Lesson