Tidy datasets are all alike,
But every messy dataset is messy in its own way.
–Hadley Wickham
https://r4ds.had.co.nz/tidy-data.html
Understanding and fixing messy data.
Data Cleaning (P1)
CMSC 320: Introduction to Data Science
What we will learn
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset
Why do we need Data Cleaning?
We often get data that is “messy”
These issues can impede our analysis and distort our results!
Looking for issues
Sometimes, it’s simple and obvious
Sometimes it’s non-obvious!
Our first step is cleaning!
Where to start
Look to see what you have. 🐼 Pandas has some nice features to help us do that!
Slide Courtesy: Karl Broman, UWisconsin–Madison
Fig: A genetics project where almost 20% of the DNA samples had been mislabeled.
The dots indicate the correct DNA was placed in the well, but the arrows point from where a sample should have been to where it actually was placed.
Focus on the labels (which are more likely correct), rather than the position of variables in a file (which are more likely to change).
Data Typing
The Easy Stuff
Sometimes the data is in the wrong format!
What to do:
Check if all IDs are unique: df['ID'].is_unique returns True/False
Check the total number of unique IDs: df['ID'].nunique()
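A minimal sketch of these checks, using a toy DataFrame with a hypothetical 'ID' column (the column name is an assumption — substitute your own key):

```python
import pandas as pd

# Toy data: ID 103 appears twice
df = pd.DataFrame({'ID': [101, 102, 103, 103], 'value': [10, 20, 30, 40]})

print(df['ID'].is_unique)                   # False: 103 is repeated
print(df['ID'].nunique())                   # 3 distinct IDs
print(df[df['ID'].duplicated(keep=False)])  # all rows sharing a duplicated ID
```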
Oftentimes you may need to merge various tables.
Ensure file formats match before merging:
Combining and Merging Data Sources
import pandas as pd

# Load datasets
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# Check for matching column names
common_columns = df1.columns.intersection(df2.columns)
print("Common columns:", list(common_columns))

# Check for inconsistent data types in common columns
for column in common_columns:
    if df1[column].dtype != df2[column].dtype:
        print(f"Column '{column}' has inconsistent types: "
              f"df1 is {df1[column].dtype}, df2 is {df2[column].dtype}")
Pandas code to check for consistent data types in common columns across multiple DataFrames
# Convert a column to a consistent type (e.g., string)
df1['column'] = df1['column'].astype(str)
df2['column'] = df2['column'].astype(str)
(b) Resolve Data Type Inconsistencies
(c) Check for Missing Columns
Decide how to handle missing columns
Resolve Data Conflicts
If the same column has different values in df1 and df2, decide how to resolve the conflict. For example:
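One way to surface and resolve such conflicts, sketched with hypothetical 'ID' and 'price' columns (the names and the "trust df2" policy are assumptions — pick the policy that fits your data):

```python
import pandas as pd

# Both frames describe the same IDs, but 'price' disagrees for ID 2
df1 = pd.DataFrame({'ID': [1, 2], 'price': [9.99, 19.99]})
df2 = pd.DataFrame({'ID': [1, 2], 'price': [9.99, 24.99]})

merged = df1.merge(df2, on='ID', suffixes=('_df1', '_df2'))

# Surface the conflicts so you can decide how to resolve them
conflicts = merged[merged['price_df1'] != merged['price_df2']]
print(conflicts)

# One possible policy: trust df2 (e.g., the newer source) where values differ
merged['price'] = merged['price_df2'].where(
    merged['price_df1'] != merged['price_df2'], merged['price_df1'])
```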
visdat (https://docs.ropensci.org/visdat/) provides a heatmap indicating which data points are missing, and also the variable types.
naniar (http://naniar.njtierney.com/) provides a scatterplot that includes the cases that are missing one or both variables.
Data cleaning is not a single step in the analysis chain; rather, it is an ongoing process that you will need to continually revisit as you delve deeper into the data. Keep an eye out for hints of problems, and arrange your work with the expectation that you’ll need to re-run everything at some point.
Evolving Labeling Schemes
Labels or categories may change over time, creating inconsistencies in the dataset.
Examples
How to Handle
Challenges with Inconsistent Labels:
Example: evolving labeling scheme for product quality ratings.
SCENARIO: Suppose you are analyzing customer reviews for a product over a period of time, and you notice a change in the way product quality is rated during the analysis.
Initial Labeling Scheme (Phase 1): products are rated "Good," "Fair," or "Excellent".
However, midway through the project, the company decides to change the labeling scheme to include more detailed quality ratings, introducing new subcategories such as "Excellent," "Very Good," "Good," "Fair," and "Poor".
Evolving Labeling Scheme (Phase 2):
The updated labeling scheme now includes "Excellent," "Very Good," "Good," "Fair," and "Poor".
Example: evolving labeling scheme for product quality ratings: Managing Data Changes
Possible Coping Strategies:
Step 1: Splitting the Dataset (by Phases):
To accommodate changes in the labeling scheme, consider dividing the dataset into two separate groups: Phase 1 records (old scheme) and Phase 2 records (new scheme).
This division helps maintain consistency in your analysis for each phase.
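A minimal sketch of the split, assuming a 'review_date' column and a known cutover date (both hypothetical — use whatever marks the scheme change in your data):

```python
import pandas as pd

# Hypothetical reviews with a date column; the labeling scheme
# changed at a known cutover date
df = pd.DataFrame({
    'review_date': pd.to_datetime(['2023-01-15', '2023-03-10', '2023-08-02']),
    'quality': ['Good', 'Excellent', 'Very Good'],
})
cutover = pd.Timestamp('2023-06-01')

phase1 = df[df['review_date'] < cutover]   # old labeling scheme
phase2 = df[df['review_date'] >= cutover]  # new labeling scheme
print(len(phase1), len(phase2))
```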
Example: evolving labeling scheme for product quality ratings: Managing Data Changes
Possible Coping Strategies:
Step 2: Infer/Interpret Old Ratings:
For records in Phase 1, where you only have "Good," "Excellent," and "Fair" ratings, you can infer how they might map to the new labeling scheme.
Keep in mind that this mapping might depend on what makes the most sense in your specific situation.
Practical Example: Evolving labeling scheme
Step 1: Create a Mapping Dictionary:
Step 2: Apply the Mapping
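Both steps can be sketched as follows. The particular mapping (e.g., old "Good" → new "Very Good") is an assumption for illustration — choose whatever interpretation makes sense for your data:

```python
import pandas as pd

# Phase 1 records, still using the old three-level scheme
phase1 = pd.DataFrame({'quality': ['Good', 'Excellent', 'Fair', 'Good']})

# Step 1: create a mapping dictionary from the old scheme to the new one
rating_map = {
    'Excellent': 'Excellent',
    'Good': 'Very Good',   # assumed interpretation; could also stay 'Good'
    'Fair': 'Fair',
}

# Step 2: apply the mapping
phase1['quality_new'] = phase1['quality'].map(rating_map)
print(phase1)
```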
Dropping Columns in a DataFrame
Not all the categories of data in a dataset are useful!
Remove unwanted columns: examples:
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

# Two equivalent ways to drop the columns in place:
df.drop(to_drop, axis=1, inplace=True)
# df.drop(columns=to_drop, inplace=True)
Duplicate Records
Example: Duplicated Records:
Sometimes people will put a bunch of duplicate records in your system!
ALWAYS CHECK FOR THIS
Example: Removing the Duplicated Records
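A minimal sketch of detecting and dropping duplicates with pandas, on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'name': ['Ada', 'Bob', 'Bob', 'Cy'],
})

# Flag fully duplicated rows (the second 'Bob' row is marked True)
print(df.duplicated())

# Drop them, keeping the first occurrence of each row
deduped = df.drop_duplicates()
print(deduped)
```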
Outlier Detection
Outliers are data points that differ significantly from most values in a dataset.
Detecting outliers is important because they can skew statistical analyses and machine learning models.
Outlier Detection: Some common approaches
There are multiple ways to find outliers. Looking for extreme z-scores is one; a simple box-and-whisker plot is another.
A z-score measures how far a data point is from the mean in units of standard deviations.
Z-Score Example: Detecting an Outlier
Data: Test Scores = [80, 85, 88, 90, 92, 95, 98, 100, 150]
Step 1: Compute statistics
Mean ≈ 97.56, sample standard deviation ≈ 20.64
Step 2: Compute z-score
For score 150: z = (150 − 97.56) / 20.64 ≈ 2.54
Step 3: Identify outlier
|z| > 2 → possible outlier (moderate rule)
|z| > 3 → strong outlier (more conservative, very common in statistics)
Pandas code for detecting outlier
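A sketch of the z-score rule in pandas, on the test scores above (pandas' .std() uses the sample standard deviation by default):

```python
import pandas as pd

scores = pd.Series([80, 85, 88, 90, 92, 95, 98, 100, 150])

# z-score with the sample standard deviation (pandas default, ddof=1)
z = (scores - scores.mean()) / scores.std()

# Moderate rule: |z| > 2 flags possible outliers
outliers = scores[z.abs() > 2]
print(outliers)   # only 150 is flagged (z ≈ 2.54)
```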
IsolationForest: an anomaly detection algorithm
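A hedged sketch using scikit-learn's IsolationForest on the same test scores; the contamination value (expected fraction of anomalies) is an assumption you tune for your data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same test scores as above, reshaped into a 2-D feature matrix
scores = np.array([[80], [85], [88], [90], [92], [95], [98], [100], [150]])

model = IsolationForest(contamination=0.1, random_state=0)
labels = model.fit_predict(scores)   # -1 = anomaly, 1 = normal

print(scores[labels == -1].ravel())
```

Unlike the z-score rule, IsolationForest makes no normality assumption and extends naturally to multiple features.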
Should we always remove an outlier?
It depends!
Ultimately, when doing a data science project, you have some goal in mind: whether to remove an outlier depends on how it affects that goal, and on whether it is an essential or merely unusual part of your data. Consider:
Context
If it’s a rare, one-time event (e.g., eclipse data), removal may be appropriate.
Model Impact
Remove it if it distorts or harms model performance.
Relevance
Keep it if it represents a meaningful or expected part of the data (e.g., sale spikes, traffic delays).
Next Class:
Missing Data
Check this Article: https://realpython.com/python-data-cleaning-numpy-pandas/