Tidy datasets are all alike,
But every messy dataset is messy in its own way.
–Hadley Wickham
https://r4ds.had.co.nz/tidy-data.html
Data Cleaning (P1)
Lecture 7
CMSC 320: Introduction to Data Science
2025
What we will learn
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset
Why do we need Data Cleaning?
We often get data that is “messy”
These issues can impede our analysis and distort our results!
Looking for issues
Sometimes, it’s simple and obvious
Sometimes it’s non-obvious!
Our first step is cleaning!
Where to start
Look to see what you have, and check for any potential issues in the data.
🐼 Pandas has some nice features to help us do that!
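For instance, a minimal first-look sketch (the file name here is hypothetical):
import pandas as pd

df = pd.read_csv('file.csv')   # hypothetical input file

print(df.head())       # peek at the first few rows
df.info()              # column names, dtypes, and non-null counts (prints directly)
print(df.describe())   # summary statistics for numeric columns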
Data Cleaning Principles
Slide Courtesy: Karl Broman, UWisconsin–Madison
Fig: A genetics project where almost 20% of the DNA samples had been mislabeled.
The dots indicate the correct DNA was placed in the well, but the arrows point from where a sample should have been to where it actually was placed.
Focus on the labels (which are more likely correct), rather than the position of variables in a file (which are more likely to change).
The Easy Stuff
Data Typing
Sometimes the data is in the wrong format!
What to do:
Check if all IDs are unique: df['ID'].is_unique returns True/False
Check the total number of unique IDs: df['ID'].nunique()
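Putting these together, a small sketch (the file name and the 'date' column are hypothetical):
import pandas as pd

df = pd.read_csv('file.csv')          # hypothetical input file

print(df.dtypes)                      # what type each column was parsed as
print(df['ID'].is_unique)             # True if no duplicate IDs
print(df['ID'].nunique())             # total number of distinct IDs

# Fix a column parsed as the wrong type, e.g. dates read in as strings
df['date'] = pd.to_datetime(df['date'])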
Oftentimes you may need to merge various tables.
Combining and Merging Data Sources
What to do: Make sure the file formats all match → check that columns have consistent data types and matching column names
Combining and Merging Data Sources
Check Column Names And Data Types
import pandas as pd

# Load datasets
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# Check for matching column names
common_columns = df1.columns.intersection(df2.columns)
print("Common columns:", list(common_columns))

# Check for inconsistent data types in common columns
for column in common_columns:
    if df1[column].dtype != df2[column].dtype:
        print(f"Column '{column}' has inconsistent types: "
              f"df1 is {df1[column].dtype}, df2 is {df2[column].dtype}")
Pandas code to check for consistent data types in common columns across two DataFrames
Combining and Merging Data Sources
Resolve Data Type Inconsistencies
# Convert a column to a consistent type (e.g., string)
df1['column'] = df1['column'].astype(str)
df2['column'] = df2['column'].astype(str)
Combining and Merging Data Sources
Check for Missing Columns
Decide how to handle missing columns
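A sketch of one option, continuing with df1 and df2 from the earlier code: find the columns each table is missing, then fill them with NaN so the tables line up.
import numpy as np

# Columns present in one DataFrame but not the other
missing_from_df2 = df1.columns.difference(df2.columns)
missing_from_df1 = df2.columns.difference(df1.columns)
print("Missing from df2:", list(missing_from_df2))
print("Missing from df1:", list(missing_from_df1))

# One option: add the missing columns, filled with NaN, so the tables align
for col in missing_from_df2:
    df2[col] = np.nan
for col in missing_from_df1:
    df1[col] = np.nan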
Combining and Merging Data Sources
Resolve Data Conflicts
If the same column has different values in df1 and df2, decide how to resolve the conflict. For example:
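A sketch of one approach, continuing with df1 and df2; the 'ID' merge key and the rule of preferring df1's value are assumptions.
# Merge on a shared key column (assumed here to be 'ID')
merged = df1.merge(df2, on='ID', suffixes=('_df1', '_df2'))

# Prefer df1's value; fall back to df2's where df1 is missing
merged['column'] = merged['column_df1'].combine_first(merged['column_df2'])
merged = merged.drop(columns=['column_df1', 'column_df2'])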
visdat (https://docs.ropensci.org/visdat/) provides a heatmap indicating which data points are missing, and also the variable types.
naniar (http://naniar.njtierney.com/) provides a scatterplot that includes the cases that are missing one or both variables.
Data cleaning is not a single step in the analysis chain; rather, it is an ongoing process that you will need to continually revisit as you delve deeper into the data. Keep an eye out for hints of problems, and arrange your work with the expectation that you’ll need to re-run everything at some point.
Evolving Labeling Schemes
Situations where the labels or categories used to classify data change over time.
For example, one field gets split into two subfields, or people decide it makes more sense to change how something is recorded midway through.
You’ll have to cope with these as best you can (coping strategies follow).
Challenges with Inconsistent Labels:
Example: evolving labeling scheme for product quality ratings.
SCENARIO: Suppose you are analyzing customer reviews for a product over a period of time, and you notice a change in the way product quality is rated during the analysis.
Initial Labeling Scheme (Phase 1): products are rated simply as "Good," "Excellent," or "Fair".
Example: evolving labeling scheme for product quality ratings.
However, midway through the project, the company decides to change the labeling scheme to include more detailed quality ratings, introducing new subcategories such as "Excellent," "Very Good," "Good," "Fair," and "Poor".
Evolving Labeling Scheme (Phase 2):
The updated labeling scheme now includes "Excellent," "Very Good," "Good," "Fair," and "Poor".
Example: evolving labeling scheme for product quality ratings: Managing Data Changes
Possible Coping Strategies:
Step 1: Splitting the Dataset (by Phases):
To accommodate changes in the labeling scheme, consider dividing the dataset into two separate groups: one with the Phase 1 records and one with the Phase 2 records (see the sketch below).
This division helps maintain consistency in your analysis for each phase.
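A minimal sketch, assuming the reviews carry a date column and the scheme changed on a known (hypothetical) cutover date:
import pandas as pd

reviews = pd.read_csv('reviews.csv', parse_dates=['date'])   # hypothetical file and column

cutover = pd.Timestamp('2024-01-01')    # assumed date the labeling scheme changed
phase1 = reviews[reviews['date'] < cutover]
phase2 = reviews[reviews['date'] >= cutover]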
Example: evolving labeling scheme for product quality ratings: Managing Data Changes
Possible Coping Strategies:
Step 2: Infer/Interpret Old Ratings:
For records in Phase 1, where you only have "Good," "Excellent," and "Fair" ratings, you can infer how they might map to the new labeling scheme.
Keep in mind that this mapping might depend on what makes the most sense in your specific situation.
Practical Example: Evolving labeling scheme
Step 1: Create a Mapping Dictionary:
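A sketch of such a dictionary; exactly how the old labels map onto the new scheme is an assumption you must justify for your own data.
# Map Phase 1 labels onto the Phase 2 scheme (this particular mapping is an assumption)
rating_map = {
    'Excellent': 'Excellent',   # carries over directly
    'Good': 'Good',             # some records might arguably be 'Very Good'
    'Fair': 'Fair',             # sits between the new 'Good' and 'Poor'
}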
Practical Example: Evolving labeling scheme
Step 2: Apply the Mapping
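Continuing the sketch above; 'rating' is an assumed column name.
phase1 = phase1.copy()
phase1['rating'] = phase1['rating'].map(rating_map)

# Recombine once both phases use the same scheme
reviews_clean = pd.concat([phase1, phase2], ignore_index=True)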
Dropping Columns in a DataFrame
Not all the categories of data in a dataset are useful!
Remove unwanted columns, for example:
# Columns that aren't useful for our analysis
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(to_drop, axis=1, inplace=True)
# equivalently: df.drop(columns=to_drop, inplace=True)
Duplicated Records
Sometimes people will put a bunch of duplicate records in your system!
ALWAYS CHECK FOR THIS
Example: Duplicated Records
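A minimal sketch of how you'd check for and drop duplicates with pandas (the toy data here is illustrative):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'name': ['Ana', 'Ben', 'Ben', 'Cy'],
})

print(df.duplicated())        # True for rows that repeat an earlier row
print(df.duplicated().sum())  # how many duplicate rows there are

df = df.drop_duplicates()     # keep the first occurrence of each row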
Outlier Detection
What defines an outlier?
The formal definition of an outlier may vary depending on the context and statistical method employed.
One common approach is to identify outliers as data points that deviate significantly from the mean, often defined as being several standard deviations away.
What defines an outlier?
Outliers are data points that significantly deviate from the majority of the data in a dataset.
Detecting outliers is important because they can skew statistical analyses and machine learning models.
Outlier Detection: Some common approaches
You can find outliers in multiple ways. Looking for extreme z-scores is one
A simple box and whisker plot is another
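The box plot's whiskers follow the IQR rule: points beyond 1.5 × IQR from the quartiles are flagged. A sketch:
import pandas as pd

scores = pd.Series([80, 85, 88, 90, 92, 95, 98, 100, 150])

q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 × IQR from the quartiles are the plot's "whisker" outliers
outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]
print(outliers)   # 150 is flagged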
Outlier Detection using Z-Scores
Z-scores are a statistical measure that quantifies how far a data point is from the mean (average) of a dataset in terms of standard deviations.
Example:
Outlier Detection using Z-Scores
Step 01: Calculate the Mean and Standard Deviation:
Test Scores: [80, 85, 88, 90, 92, 95, 98, 100, 150]
Here the mean is ≈ 97.6 and the sample standard deviation is ≈ 20.6.
Example:
Outlier Detection using Z-Scores
Step 02: Calculate Z-scores for each test score using the formula:
z = (x − μ) / σ, where μ is the mean and σ is the standard deviation
Test Scores: [80, 85, 88, 90, 92, 95, 98, 100, 150]
Example:
Outlier Detection using Z-Scores
Step 03: Identify Extreme Z-scores:
A common threshold for an extreme z-score is “greater than 2 or less than −2”, which suggests that a data point is significantly far from the mean.
Test Scores: [80, 85, 88, 90, 92, 95, 98, 100, 150]
Here 150 has a z-score of ≈ 2.5, past the threshold, while every other score falls well within it.
Pandas code for detecting outliers
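The code itself isn't reproduced on the slide; a sketch of what it would look like for the test scores above:
import pandas as pd

scores = pd.Series([80, 85, 88, 90, 92, 95, 98, 100, 150])

# z = (x − mean) / standard deviation
z_scores = (scores - scores.mean()) / scores.std()

# Flag scores whose z-score magnitude exceeds the threshold of 2
outliers = scores[z_scores.abs() > 2]
print(outliers)   # 150 is flagged (z ≈ 2.5 with pandas' sample std dev)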
IsolationForest: an anomaly detection algorithm
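Not covered in detail here, but for reference, a sketch using scikit-learn's implementation on the same scores (the contamination value is an assumption):
import numpy as np
from sklearn.ensemble import IsolationForest

scores = np.array([80, 85, 88, 90, 92, 95, 98, 100, 150]).reshape(-1, 1)

iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(scores)   # -1 = anomaly, 1 = normal

print(scores[labels == -1])        # 150 should be flagged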
Should we always remove an outlier?
It depends!
Ultimately, when doing a data science project, you have some goal in mind; remove the outlier if it hurts that goal. Whether to remove an outlier depends on how it affects your project's objectives and whether it's an essential or merely unusual part of your data. Consider:
The Context of the Outlier: Is the outlier an unusual event that will likely never happen again? (Something like “light intensity during an eclipse”.)
→ If it is, removing the outlier may be appropriate.
Evaluate Impact on Model Performance: Is the outlier something that will actually hurt your model? (You don’t really need a model factoring in Jeff Bezos when doing financial transactions.)
Assess Relevance: Is the outlier a core element of your data? (Profits during a sale, wait times during traffic.) These should not be removed without careful consideration.
Next Class:
Missing Data
Check out this article: https://realpython.com/python-data-cleaning-numpy-pandas/