1 of 38

Tidy datasets are all alike,

But every messy dataset is messy in its own way.

Hadley Wickham

https://r4ds.had.co.nz/tidy-data.html

Understanding and fixing messy data.

Data Cleaning (P1)

  • Fardina F. Alam

CMSC 320: Introduction to Data Science

2 of 38

What we will learn

  1. What is Data Cleaning
  2. How to clean data
  3. Duplicated records
  4. Outlier detection, z-score
  5. Data missing at random / Data missing not at random
    • Different types of imputation
  6. Incorrect Data

3 of 38

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset

4 of 38

Why do we need Data Cleaning?

We often get data that is “messy”:

  • Missing values
  • Weird outliers
  • Columns that need to be combined
  • Multiple tables that need to be joined
  • Inconsistency:
    • Ex. Mixed date format usage
      • "MM/DD/YYYY," "YYYY-MM-DD," and "DD-Mon-YY"
      • 01/15/2023, 2023-01-15, Jan-15-23

These can impede and affect our analysis + results!
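Mixed date formats like the ones above can usually be normalized programmatically. A minimal sketch in pandas (the format list is an assumption matching the example values):

```python
import pandas as pd

# Mixed date formats, as in the example above
dates = pd.Series(["01/15/2023", "2023-01-15", "Jan-15-23"])

# Try each known format; keep the first successful parse per value
formats = ["%m/%d/%Y", "%Y-%m-%d", "%b-%d-%y"]
parsed = pd.Series(pd.NaT, index=dates.index, dtype="datetime64[ns]")
for fmt in formats:
    attempt = pd.to_datetime(dates, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

print(parsed.dt.strftime("%Y-%m-%d").tolist())  # all three become 2023-01-15
```

Passing an explicit format per pattern avoids silent misreads (e.g., day/month swaps) that can happen with fully automatic parsing.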

5 of 38

Looking for issues

Sometimes, it’s simple and obvious

  • Maybe a word is misspelled in a categorical column
  • You can fix this using apply() in 🐼 Pandas
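A minimal sketch of the apply() fix (the column name and correction table are hypothetical):

```python
import pandas as pd

# Hypothetical categorical column with a misspelled label
df = pd.DataFrame({"fruit": ["apple", "appel", "banana", "apple"]})

corrections = {"appel": "apple"}
df["fruit"] = df["fruit"].apply(lambda v: corrections.get(v, v))

print(df["fruit"].tolist())  # ['apple', 'apple', 'banana', 'apple']
```

For a pure lookup table like this, `df["fruit"].replace(corrections)` works as well; apply() generalizes to arbitrary per-value logic.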

Sometimes it’s non-obvious!

  • Not all data cleaning issues are immediately apparent or easy to solve.

Our first step is cleaning!

6 of 38

7 of 38

Where to start

Look to see what you have. 🐼 Pandas has some nice features to help us do that!

8 of 38

Slide Courtesy: Karl Broman, UWisconsin–Madison

9 of 38

10 of 38

Fig: A genetics project where almost 20% of the DNA samples had been mislabeled.

The dots indicate the correct DNA was placed in the well, but the arrows point from where a sample should have been to where it actually was placed.

11 of 38

Focus on the labels (which are more likely correct), rather than the position of variables in a file (which are more likely to change).

12 of 38

Data Typing

The Easy Stuff

Sometimes the data is in the wrong format!

What to do:

  • Date-Time data (very common)
    • 🐼 Pandas offers a robust datetime library
  • Other data types
    • Use df[COLUMN].astype(SOME_TYPE) for straightforward conversions

  • More complex issues
    • Utilize df[COLUMN].apply(conversion_function)
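The three cases above, sketched on a hypothetical DataFrame (column names and cleaning rules are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "when": ["2023-01-15", "2023-02-01"],
    "count": ["3", "7"],
    "price": ["$1,200", "$950"],
})

# Date-time data: pandas' datetime parsing
df["when"] = pd.to_datetime(df["when"])

# Straightforward conversion: astype
df["count"] = df["count"].astype(int)

# More complex issue: apply a conversion function
df["price"] = df["price"].apply(lambda s: float(s.replace("$", "").replace(",", "")))

print(df.dtypes)
```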

13 of 38

Check if all IDs are unique: df['ID'].is_unique returns True/False

Check the total number of unique IDs: df['ID'].nunique()
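A quick sketch of both checks on a hypothetical ID column:

```python
import pandas as pd

df = pd.DataFrame({"ID": [101, 102, 102, 103]})

print(df["ID"].is_unique)   # False: 102 appears twice
print(df["ID"].nunique())   # 3 distinct IDs
```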

14 of 38

Combining and Merging Data Sources

Oftentimes you may need to merge various tables.

Ensure file formats match before merging:

  • check that column names and data types are consistent
  • fix formatting differences and resolve data conflicts
  • choose the appropriate join type (inner, outer, etc.)
  • always sanity check the final result.

15 of 38

Combining and Merging Data Sources

(a) Check Column Names and Data Types

Pandas code to check for consistent data types in common columns across multiple DataFrames:

import pandas as pd

# Load datasets
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# Check for matching column names
common_columns = df1.columns.intersection(df2.columns)
print("Common columns:", common_columns)

# Check for inconsistent data types in common columns
for column in common_columns:
    if df1[column].dtype != df2[column].dtype:
        print(f"Column '{column}' has inconsistent types: "
              f"df1 is {df1[column].dtype}, df2 is {df2[column].dtype}")

(b) Resolve Data Type Inconsistencies

# Convert a column to a consistent type (e.g., string)
df1['column'] = df1['column'].astype(str)
df2['column'] = df2['column'].astype(str)

16 of 38

Combining and Merging Data Sources

(c) Check for Missing Columns

Decide how to handle missing columns
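One way to find and handle missing columns before combining (the column names and the fill-with-NA strategy are assumptions, not a fixed recipe):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2], "name": ["a", "b"], "score": [90, 85]})
df2 = pd.DataFrame({"ID": [3, 4], "name": ["c", "d"]})

# Find columns present in df1 but missing from df2
missing = set(df1.columns) - set(df2.columns)
print(missing)  # {'score'}

# One option: add them as missing values so the frames align
for col in missing:
    df2[col] = pd.NA

combined = pd.concat([df1, df2], ignore_index=True)
print(combined.shape)  # (4, 3)
```

The alternative is dropping the extra columns from df1; which choice is right depends on whether the column matters downstream.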

17 of 38

Combining and Merging Data Sources

Resolve Data Conflicts

If the same column has different values in df1 and df2, decide how to resolve the conflict. For example:

  • Prioritize values from one DataFrame.

  • Combine values (e.g., use non-null values): Use combine_first() to fill missing values in df1 with values from df2:
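A minimal combine_first() sketch (the data is hypothetical; values in df1 take priority, and its missing entries are filled from df2):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "city": ["NYC", None, "LA"]})
df2 = pd.DataFrame({"ID": [1, 2, 3], "city": ["Boston", "Chicago", "Seattle"]})

# Align on ID, then fill df1's missing values with df2's
resolved = df1.set_index("ID").combine_first(df2.set_index("ID")).reset_index()
print(resolved["city"].tolist())  # ['NYC', 'Chicago', 'LA']
```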

18 of 38

visdat (https://docs.ropensci.org/visdat/) provides a heatmap indicating which data points are missing, and also the variable types.

Naniar (http://naniar.njtierney.com/) provides a scatterplot that includes the cases that are missing one or both variables.

19 of 38

Data cleaning is not a single step in the analysis chain; rather, it is an ongoing process that you will need to continually revisit as you delve deeper into the data. Keep an eye out for hints of problems, and arrange your work with the expectation that you’ll need to re-run everything at some point.

20 of 38

21 of 38

Evolving Labeling Schemes

Labels or categories may change over time, creating inconsistencies in the dataset.

Examples

  • One field splits into multiple categories. Example: “Failure” → “Catastrophic failure” and “Partial failure”
  • Category definitions are updated. Example: “Electronics” → “Consumer Electronics”

How to Handle

  • Split the dataset based on when the labeling change occurred
  • Map or infer old labels to match the new system when possible

Challenges with Inconsistent Labels:

  • Mixed labels — old and new categories create confusion
  • Integration problems — mismatches when merging datasets
  • Loss of context — older labels become unclear without documentation
  • Biased analysis — inconsistent labels can distort results and visualizations

22 of 38

Example: evolving labeling scheme for product quality ratings.

SCENARIO: Suppose you are analyzing customer reviews for a product over a period of time, and you notice a change in the way product quality is rated during the analysis.

Initial Labeling Scheme (Phase 1):

23 of 38

Example: evolving labeling scheme for product quality ratings.

However, midway through the project, the company decides to change the labeling scheme to include more detailed quality ratings, introducing new subcategories such as "Excellent," "Very Good," "Good," "Fair," and "Poor".

Evolving Labeling Scheme (Phase 2):

The updated labeling scheme now includes "Excellent," "Very Good," "Good," "Fair," and "Poor".

24 of 38

Example: evolving labeling scheme for product quality ratings: Managing Data Changes

Possible Coping Strategies:

Step 1: Splitting the Dataset (by Phases):

To accommodate changes in the labeling scheme, consider dividing the dataset into two separate groups:

  1. One containing data from the initial phase (Phase 1).
  2. Another for the data collected after the labeling scheme change (Phase 2).

This division helps maintain consistency in your analysis for each phase.

25 of 38

Example: evolving labeling scheme for product quality ratings: Managing Data Changes

Possible Coping Strategies:

Step 2: Infer/Interpret Old Ratings:

For records in Phase 1, where you only have "Good," "Excellent," and "Fair" ratings, you can infer how they might map to the new labeling scheme.

  • For instance, you might consider mapping "Good" to "Very Good" and "Fair" to "Fair" in the updated scheme.

Keep in mind that this mapping might depend on what makes the most sense in your specific situation.

26 of 38

Practical Example: Evolving labeling scheme

Step 1: Create a Mapping Dictionary:

27 of 38

Practical Example: Evolving labeling scheme

Step 2: Apply the Mapping

  • Document Changes: Maintain a data dictionary or change log to track how labeling schemes have evolved over time.
  • Automate Label Updates: Use scripts or pipelines to automatically update labels when new schemes are introduced.
  • Always preserve historical context for traceability.
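The two steps above can be sketched as follows (the column name and the exact mapping are assumptions; choose a mapping that makes sense for your situation, and record it in your change log):

```python
import pandas as pd

phase1 = pd.DataFrame({"quality": ["Good", "Excellent", "Fair", "Good"]})

# Step 1: mapping dictionary from the old scheme to the new one
rating_map = {
    "Excellent": "Excellent",
    "Good": "Very Good",   # per the inference discussed earlier
    "Fair": "Fair",
}

# Step 2: apply the mapping
phase1["quality"] = phase1["quality"].map(rating_map)
print(phase1["quality"].tolist())  # ['Very Good', 'Excellent', 'Fair', 'Very Good']
```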

28 of 38

Dropping Columns in a DataFrame

Not all the categories of data in a dataset are useful!

Remove unwanted columns, for example:

>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']
>>> df.drop(to_drop, inplace=True, axis=1)

# Equivalent form:
>>> df.drop(columns=to_drop, inplace=True)

29 of 38

Duplicate Records

30 of 38

Duplicated Records

Example: Duplicated Records:

Sometimes people will put a bunch of duplicate records in your system!

  • If they are exact duplicates: just do df.drop_duplicates()
  • If they differ slightly → identify the correct records and remove the rest

ALWAYS CHECK FOR THIS

31 of 38

Example: Solution to the Duplicated Records
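The slide's screenshot is not reproduced here; a minimal sketch of both cases (the DataFrame is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "email": ["a@x.com", "a@x.com", "b@x.com", "B@X.COM"],
})

# Exact duplicates: rows 0 and 1 are identical
print(len(df.drop_duplicates()))  # 3

# Slightly different duplicates: normalize first, then deduplicate
df["email"] = df["email"].str.lower()
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```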

32 of 38

Outlier Detection

Outliers are data points that differ significantly from most values in a dataset.

  • They are often identified as values several standard deviations away from the mean.
  • They may represent: errors, anomalies, rare but real events

Detecting outliers is important because they can skew statistical analyses and machine learning models.

33 of 38

Outlier Detection: Some common approaches

You can find outliers in multiple ways. Looking for extreme z-scores is one; a simple box-and-whisker plot is another.

A z-score measures how far a data point is from the mean in units of standard deviations.

  • Large positive or negative z-scores indicate values far from the mean (common rule: |z| > 2 or |z| > 3 → possible outlier)
  • Such values may be potential outliers

34 of 38

Z-Score Example: Detecting an Outlier

Data: Test Scores = [80, 85, 88, 90, 92, 95, 98, 100, 150]

Step 1: Compute statistics

  • Mean ≈ 97.56
  • Standard deviation (population) ≈ 19.46

Step 2: Compute z-score

For score 150: z = (150 - 97.56) / 19.46 ≈ 2.69

Step 3: Identify outlier

  • Common rule: |z| > 2 → potential outlier
  • Since 2.69 > 2 → 150 is a potential outlier

|z| > 2 → possible outlier (moderate rule)

|z| > 3 → strong outlier (more conservative, very common in statistics)

35 of 38

Pandas code for detecting outlier
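The slide's code is not reproduced here; a minimal version using the test scores from the previous slide might look like:

```python
import pandas as pd

scores = pd.Series([80, 85, 88, 90, 92, 95, 98, 100, 150])

# z-score: distance from the mean in units of (population) standard deviations
z = (scores - scores.mean()) / scores.std(ddof=0)

# Moderate rule: |z| > 2 → possible outlier
outliers = scores[z.abs() > 2]
print(outliers.tolist())  # [150]
```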

36 of 38

IsolationForest: Anomaly Detection Algorithm
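scikit-learn ships an IsolationForest implementation; a minimal sketch on the same one-dimensional scores (the contamination value is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[80], [85], [88], [90], [92], [95], [98], [100], [150]], dtype=float)

clf = IsolationForest(contamination=0.1, random_state=0)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = normal

print(X[labels == -1].ravel())
```

Unlike z-scores, isolation forests make no normality assumption and extend naturally to many dimensions.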

37 of 38

Should we always remove an outlier?

It depends! Ultimately, when doing a data science project, you have some goal in mind: remove the outlier if it hurts that goal, and keep it if it is an essential part of your data. Consider:

Context

If it’s a rare, one-time event (e.g., eclipse data), removal may be appropriate.

Model Impact

Remove it if it distorts or harms model performance.

Relevance

Keep it if it represents a meaningful or expected part of the data (e.g., sale spikes, traffic delays).

38 of 38

Next Class:

Missing Data

Check this Article: https://realpython.com/python-data-cleaning-numpy-pandas/