2 of 28

Table of Contents

The 3 Types of ML

What is Logistic Regression?

Picking a Dataset

Pandas Methods and Attributes

Finding Null Values

Removing Null Values

Encoding

Checking data types

Next Lesson

3 of 28

Three types of Machine Learning

Logistic regression is supervised learning!

4 of 28

What is Logistic Regression?

A machine learning algorithm used when the answer we want is a binary outcome (Yes or No, etc.)
Linear regression predicts a number, while logistic regression predicts a probability between 0 and 1
To do this, it takes a weighted sum of the input features and squeezes it through an S-shaped curve (the sigmoid)
Finally, we set a threshold:

If probability ≥ 0.5 → predict Yes
If probability < 0.5 → predict No
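The sigmoid and the 0.5 threshold can be sketched in a few lines of plain Python. This is only an illustration: the input z stands for the weighted sum of the features, and the function names are made up for this example.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z: float, threshold: float = 0.5) -> str:
    """Turn the sigmoid probability into a Yes/No prediction."""
    return "Yes" if sigmoid(z) >= threshold else "No"

# z is the weighted sum of the input features (hypothetical values here)
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```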

5 of 28

Linear Regression vs Logistic Regression

6 of 28

Picking a dataset

Categorical target variable + binary
Numerical or categorical predictors
Each row in the dataset should be an independent case (e.g. one patient, one email, one transaction)
A reasonable number of rows, with the two classes not heavily imbalanced

In this slideshow I will be using the Breast Cancer Dataset to build and test a classification model to predict whether a tumor is malignant or benign.

7 of 28

Create a folder for TD and add your chosen dataset to it

Pick one of the given datasets from Kaggle, download it, and add it to a folder where it’s easily accessible
Open Jupyter and create a new file - this is where you will be doing all your coding!

8 of 28

Importing Python Libraries

pandas

Used for working with datasets
Can analyze, clean, explore, and manipulate data

numpy

Used for working with arrays
Faster than using lists

Sklearn (scikit-learn)

Tool for predictive data analysis
We will use it for classification
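A typical import cell for these three libraries might look like the sketch below. The aliases pd and np are common conventions, not requirements, and the single scikit-learn import shown is just the classifier used later in this slideshow.

```python
import numpy as np                      # arrays and fast numerical operations
import pandas as pd                     # loading, cleaning, and exploring datasets
from sklearn.linear_model import LogisticRegression  # the classifier we will train

# Printing the versions is a quick way to confirm the imports worked
print(np.__version__, pd.__version__)
```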

9 of 28

Pandas DataFrame Methods and Attributes

Methods:

.head() : shows the first 5 rows of the dataset (by default)
.isnull() : checks for missing values
.sum() : adds numerical values or counts true values
.drop() : removes rows or columns
.info() : shows a summary of the data types and null values

Attributes:

.shape : returns (rows, columns)
.columns : returns column names
.index : returns row index
.dtypes : returns data type of columns
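The methods and attributes above can be tried on a tiny DataFrame. The data below is a made-up stand-in; its column names only mimic the real dataset.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, None],
})

print(df.head())           # first 5 rows (all 3 rows here)
print(df.shape)            # (rows, columns)
print(list(df.columns))    # column names
print(df.index)            # row index
print(df.dtypes)           # data type of each column
print(df.isnull().sum())   # number of missing values per column
df.info()                  # prints a summary of dtypes and non-null counts
```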

10 of 28

Reading and exploring data

Use .head() to look at the first five rows of your data
Take note of any null values
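Reading the data might look like the sketch below. To keep it self-contained, a short in-memory CSV stands in for the downloaded file; with a real file you would pass its path to pd.read_csv instead. The three columns are an assumption for illustration.

```python
import io
import pandas as pd

# Stand-in for the downloaded file (hypothetical contents).
# With a real file: df = pd.read_csv("path/to/your_file.csv")
csv_text = """id,diagnosis,radius_mean
1,M,17.99
2,B,13.54
3,B,12.45
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # look at the first rows and scan for missing values
```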

11 of 28

Find null values

This dataset has no nulls
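One common way to count nulls is to chain .isnull() and .sum(). The toy data below deliberately contains one missing value so the output shows a non-zero count; it is not the real dataset.

```python
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, None, 12.45],
})

null_counts = df.isnull().sum()   # missing values per column
print(null_counts)
print("Total nulls:", int(null_counts.sum()))
```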

12 of 28

Remove Null Values

The id column will be removed because it does not contribute to predicting if a tumor is malignant or benign.
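Dropping the id column, and dropping rows with nulls if a dataset has any, could be sketched as follows. Using .dropna() is one possible approach and is an assumption here; the toy data is made up.

```python
import pandas as pd

# Hypothetical toy data with an id column and one missing value
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, None, 12.45],
})

df = df.drop(columns=["id"])  # remove a column that does not help prediction
df = df.dropna()              # remove any rows that contain missing values
print(df.shape)
```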

13 of 28

Encoding

Machine learning models can’t work with text labels, so we must convert them to numerical values
We encoded the target variable: malignant = 1, benign = 0.
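A minimal sketch of this encoding with .map() is shown below. It assumes the diagnosis column stores the labels as "M" and "B"; check the actual values in your own file first.

```python
import pandas as pd

# Hypothetical labels; "M"/"B" is an assumption about how the column is stored
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# malignant = 1, benign = 0
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
print(df["diagnosis"].tolist())  # [1, 0, 0, 1]
```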

14 of 28

Encoding (cont.)

Checking that the target was encoded correctly into 0s and 1s
Checking whether the two classes are imbalanced
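Both checks can be done with .value_counts(), as in this sketch. The five labels are made up for illustration.

```python
import pandas as pd

# Hypothetical encoded target column
y = pd.Series([1, 0, 0, 1, 0], name="diagnosis")

counts = y.value_counts()                 # how many 0s and 1s
shares = y.value_counts(normalize=True)   # the same as proportions
print(counts)
print(shares)
```

If only 0 and 1 appear, the encoding worked; if one class has a much larger share than the other, the data is imbalanced.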

15 of 28

Checking the data types

All values should be numerical
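One way to spot leftover non-numerical columns is sketched below. The "note" column is a made-up example of a text column that still needs encoding or removal.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with one text column left over
df = pd.DataFrame({
    "diagnosis": [1, 0],
    "radius_mean": [17.99, 13.54],
    "note": ["a", "b"],
})

print(df.dtypes)

# Collect the columns that are not numerical
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
print(non_numeric)  # ['note']
```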

16 of 28

Next Lesson

Splitting our data into target and predictor variables
Normalizing our data
Training our model

17 of 28

Logistic Regression

TD TWO - Model Tuning and Evaluation

Fall 2025

18 of 28

Target and Predictor Variables

Target Variable:

This is what you’re trying to predict or explain. For example, if you want to predict whether a customer will buy a product, the target variable would be “Buy” (Yes/No).

Predictor Variable:

These are the inputs or features that help make the prediction. In the same example, predictors could be things like age or income.

The target is the answer you want to find, and the predictors are the clues you use to find it.

19 of 28

Target and Predictor Variables

In the Breast Cancer dataset, my target variable is the diagnosis column. I’m trying to predict whether the diagnosis is Malignant or Benign.

My predictor variables include every column except the diagnosis column. I’ll use these predictors to train the model to determine whether a tumor is malignant or benign.

Splitting the data into x (predictors) and y (target):

Your target variable is also known as the dependent variable, while your predictors are your independent variables.

They are also referred to as features (predictors) and labels (targets) in machine learning.
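The split into x and y described above might look like this. The toy values and the two predictor columns are stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more predictor columns
df = pd.DataFrame({
    "diagnosis": [1, 0, 0],
    "radius_mean": [17.99, 13.54, 12.45],
    "area_mean": [1001.0, 566.3, 477.1],
})

y = df["diagnosis"]                 # target (label)
x = df.drop(columns=["diagnosis"])  # predictors (features): every other column
print(x.shape, y.shape)
```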

20 of 28

Normalizing your data

Adjusting the values of your predictor variables so they’re all on a similar scale.

For example in the breast cancer dataset:

In this dataset, some features are in the 10s, others in the 100s, 1000s, or decimal values.

Because these values are on very different scales, a model might unintentionally give more attention to features with larger numbers, like area_mean, even though smaller-scale features like smoothness_mean are just as important.

21 of 28

Normalizing your data

Scikit-learn provides a tool called StandardScaler that helps you standardize your data.

For each column (feature), it calculates the mean and standard deviation, then standardizes the values by subtracting the mean and dividing by the standard deviation.

For example:
Say you have a feature “Age” with values:
[20, 40, 60, 80]
Mean = 50
Standard deviation ≈ 22.4 (StandardScaler uses the population standard deviation)

After using the StandardScaler, you get:
[−1.34, −0.45, 0.45, 1.34]

Now, all values are centered around 0 and scaled based on how far they are from the mean, making every feature comparable and model-friendly.
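The Age example can be checked directly with StandardScaler, as in the sketch below. Note that StandardScaler divides by the population standard deviation (dividing by n, not n - 1), which for these four values is the square root of 500, about 22.36.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature ("Age") with four samples; scikit-learn expects a 2D array
ages = np.array([[20.0], [40.0], [60.0], [80.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)  # subtract the mean, divide by the std

print(scaler.mean_)             # mean of the column: 50
print(scaler.scale_)            # population standard deviation, about 22.36
print(scaled.ravel().round(2))  # approximately [-1.34 -0.45  0.45  1.34]
```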

22 of 28

Normalizing your data

23 of 28

Splitting our data

We split our data into training and testing sets to measure how well our model performs on new, unseen data, not just the data it learned from.

Training set: The model learns patterns from this data.
Testing set: The model is evaluated on this part, which it hasn’t seen before.

This step is important because if we train and test on the same data, a model that has simply memorized the training examples would still score well, so we could not detect the problem. Memorizing instead of learning is called overfitting.

By splitting the data, we can check if our model can generalize to new, real-world examples.

24 of 28

Splitting our data

The train_test_split() function from Scikit-learn’s model_selection module is used to split your dataset into a training set and a testing set.

It randomly divides your features (x) and labels (y) based on the ratio you choose.
It takes in x_scaled (features) and y (labels).
test_size=0.3 → 30% of the data is used for testing, 70% for training.
random_state=42 → ensures the split is the same every time (you can use any number).

This helps create separate datasets for training the model and evaluating how well it performs on unseen data.
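A sketch of the split is shown below. To stay self-contained it uses scikit-learn's bundled copy of the Wisconsin breast cancer data as a stand-in for the Kaggle CSV; that copy codes malignant as 0, so the labels are flipped to match the malignant = 1 encoding used here. The scale-then-split order follows the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()   # stand-in for the downloaded CSV
x = data.data
y = 1 - data.target           # bundled copy uses 0 = malignant; flip to malignant = 1

x_scaled = StandardScaler().fit_transform(x)

# 30% of the rows go to the test set; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.3, random_state=42
)
print(x_train.shape, x_test.shape)
```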

25 of 28

Training the model

After splitting the data, we use the training set to teach the model how to recognize patterns and relationships between the predictors and the target variable.

In logistic regression, the model learns the best coefficients (or weights) for each feature so it can make accurate predictions.

Once the model is trained, we use the testing set to see how well it performs on new, unseen data.

This process helps the model learn patterns during training and then test whether it can generalize those patterns to real-world data.

26 of 28

Training the model

LogisticRegression is a classification model that uses the sigmoid function to estimate the probability that a sample belongs to one of two classes.
lr = LogisticRegression() creates an untrained model; it hasn’t learned anything yet.
lr.fit(x_train, y_train) trains the model by finding the best coefficients that separate the classes.
y_pred = lr.predict(x_test) uses the trained model to make predictions on unseen data.
We use .fit() to learn relationships and .predict() to test those learned relationships on new data.
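Putting the steps together, training and prediction might look like the sketch below. It again assumes scikit-learn's bundled breast cancer data as a stand-in for the Kaggle CSV, with labels flipped so that malignant = 1, and uses the default LogisticRegression settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()   # stand-in for the downloaded CSV
x_scaled = StandardScaler().fit_transform(data.data)
y = 1 - data.target           # flip so that malignant = 1, benign = 0

x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.3, random_state=42
)

lr = LogisticRegression()     # untrained model
lr.fit(x_train, y_train)      # learn one coefficient per feature
y_pred = lr.predict(x_test)   # predicted labels (0 or 1) for unseen rows

print(y_pred[:10])
print(lr.coef_.shape)         # (1, number of features)
```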

27 of 28

Training the model

accuracy_score is a function from Scikit-learn that measures how accurate your model’s predictions are.
It compares the true labels (y_test) with the predicted labels (y_pred).
Formula:
Accuracy = Number of correct predictions / Total number of predictions
Finally, we print the accuracy score to evaluate how well the model performs on unseen data.
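The formula can be checked on a small hand-made example. The labels below are invented for illustration: 4 of the 5 predictions match, so the accuracy is 4 / 5 = 0.8.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_test = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]   # the third prediction is wrong

accuracy = accuracy_score(y_test, y_pred)

# The same number computed by hand from the formula
correct = sum(t == p for t, p in zip(y_test, y_pred))
manual = correct / len(y_test)

print(accuracy, manual)    # 0.8 0.8
```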
