1 of 55

Tidy datasets are all alike,

But every messy dataset is messy in its own way.

Hadley Wickham

https://r4ds.had.co.nz/tidy-data.html

2 of 55

Data Cleaning (P1)

Lecture 7

CMSC 320: Introduction to Data Science

2025

  • Fardina F. Alam

3 of 55

What we will learn

  1. What is Data Cleaning
  2. How to clean data
  3. Duplicated records
  4. Outlier detection, z-score
  5. Data missing at random / Data missing not at random
    • Different types of imputation
  6. Incorrect Data

4 of 55

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset

5 of 55

Why do we need Data Cleaning?

We often get data that is “messy”:

  • Missing values
  • Weird outliers
  • Columns that need to be combined
  • Multiple tables need to be joined
  • Inconsistency:
    • Ex. Mixed date format usage
      • "MM/DD/YYYY," "YYYY-MM-DD," and "DD-Mon-YY"
      • 01/15/2023, 2023-01-15, Jan-15-23

These can impede and affect our analysis + results!

6 of 55

Looking for issues

Sometimes, it’s simple and obvious

  • Maybe a word is misspelled in a categorical column
  • You can fix this using apply() in 🐼 Pandas, as in the sketch below
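A minimal sketch (the column name and the misspellings are hypothetical):

fixes = {'bleu': 'blue', 'grene': 'green'}  # known misspellings → correct labels
df['color'] = df['color'].apply(lambda v: fixes.get(v, v))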

Sometimes it’s non-obvious!

  • Not all data cleaning issues are immediately apparent or easy to solve.

Our first step is cleaning!

7 of 55

8 of 55

Where to start

Look to see what you have.

🐼 Pandas has some nice features to help us do that!

9 of 55

Where to start

We need to know if there are any potential issues in the data.

🐼 Pandas has some nice features to help us do that!
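For example, a first look often uses Pandas' built-in inspection methods (assuming a DataFrame df is already loaded):

df.head()        # peek at the first few rows
df.info()        # column names, dtypes, and non-null counts
df.describe()    # summary statistics for numeric columns
df.isna().sum()  # count of missing values per column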

10 of 55

Data Cleaning Principles

Slide Courtesy: Karl Broman, UWisconsin–Madison

11 of 55

12 of 55


13 of 55

Fig: A genetics project where almost 20% of the DNA samples had been mislabeled.

The dots indicate the correct DNA was placed in the well, but the arrows point from where a sample should have been to where it actually was placed.

14 of 55

Focus on the labels (which are more likely correct), rather than the position of variables in a file (which are more likely to change).

15 of 55

16 of 55

The Easy Stuff

17 of 55

Data Typing

Sometimes the data is in the wrong format!

What to do (see the sketch after this list):

  • Date-Time data (very common)
    • 🐼 Pandas offers a robust datetime library
  • Other data types
    • Use df[COLUMN].astype(SOME_TYPE) for straightforward conversions
  • More complex issues
    • Utilize df[COLUMN].apply(conversion_function)
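A minimal sketch of the conversions above (column names are hypothetical):

import pandas as pd

# Parse date strings into proper datetime values
df['order_date'] = pd.to_datetime(df['order_date'])

# Straightforward conversion with astype
df['quantity'] = df['quantity'].astype(int)

# More complex conversion via a custom function
df['price'] = df['price'].apply(lambda s: float(s.replace('$', '')))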

18 of 55

Check if all IDs are unique: df['ID'].is_unique returns True/False

Check the total number of unique IDs: df['ID'].nunique()
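A small illustration with made-up IDs:

import pandas as pd

ids = pd.Series([101, 102, 103, 103], name='ID')
ids.is_unique   # False, because 103 appears twice
ids.nunique()   # 3 distinct values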

19 of 55

Oftentimes you may need to merge various tables.

20 of 55

Combining and Merging Data Sources

What to do: Make sure the file formats all match → check that columns have consistent data types and matching column names.

  • This can be a little tricky sometimes.
      • You may run into various data format issues, need to resolve data conflicts, and have to decide on the appropriate type of join (e.g., inner join, outer join) when merging tables.
  • Remember to sanity check your work!

21 of 55

Combining and Merging Data Sources

Check Column Names And Data Types

import pandas as pd

# Load datasets
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# Check for matching column names
common_columns = df1.columns.intersection(df2.columns)
print("Common columns:", common_columns)

# Check for inconsistent data types in common columns
for column in common_columns:
    if df1[column].dtype != df2[column].dtype:
        print(f"Column '{column}' has inconsistent types: "
              f"df1 is {df1[column].dtype}, df2 is {df2[column].dtype}")

Pandas code to check for consistent data types in common columns across multiple DataFrames

22 of 55

Combining and Merging Data Sources

Resolve Data Type Inconsistencies

# Convert a column to a consistent type (e.g., string)
df1['column'] = df1['column'].astype(str)
df2['column'] = df2['column'].astype(str)

23 of 55

Combining and Merging Data Sources

Check for Missing Columns

Decide how to handle missing columns
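One possible sketch, reusing df1 and df2 from the earlier slides:

# Columns present in one DataFrame but not the other
print("Missing from df2:", df1.columns.difference(df2.columns).tolist())
print("Missing from df1:", df2.columns.difference(df1.columns).tolist())

# One option: add any missing columns to df2, filled with NaN
df2 = df2.reindex(columns=df2.columns.union(df1.columns))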

24 of 55

Combining and Merging Data Sources

Resolve Data Conflicts

If the same column has different values in df1 and df2, decide how to resolve the conflict. For example:

  • Prioritize values from one DataFrame.

  • Combine values (e.g., use non-null values): Use combine_first() to fill missing values in df1 with values from df2:
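A minimal sketch, assuming df1 and df2 share a common index:

# Values from df1 take priority; missing (NaN) entries are filled from df2
merged = df1.combine_first(df2)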

25 of 55

Two R packages are helpful for visualizing missingness: visdat (https://docs.ropensci.org/visdat/) provides a heatmap indicating which data points are missing, along with the variable types.

naniar (http://naniar.njtierney.com/) provides a scatterplot that includes the cases that are missing one or both variables.

26 of 55

27 of 55

Data cleaning is not a single step in the analysis chain; rather, it is an ongoing process that you will need to continually revisit as you delve deeper into the data. Keep an eye out for hints of problems, and arrange your work with the expectation that you’ll need to re-run everything at some point.

28 of 55

29 of 55

30 of 55

Evolving Labeling Schemes

Situations where the labels or categories used to classify data change over time.

This happens when how things are labeled changes halfway through.

Example: one field gets split into two subfields, or people decide it makes more sense to change how something is recorded midway through.

  • Example Scenario: Initially, a column that typically records "failure" is modified midway through the project to include "catastrophic failure" and "partial failure".
  • Example Scenario: A product category system that gets updated (e.g., "Electronics" → "Consumer Electronics").

You’ll have to cope with these as best you can. Coping strategies:

  • Divide the dataset in two (based on the time of change)?
  • Infer old ratings (based on available data)?

31 of 55

Evolving Labeling Schemes

For example, one field gets split into two subfields, or people decide it makes more sense to change how something is recorded midway through.

  • Example Scenario: Initially, a column that typically records "failure" is modified midway through the project to include "catastrophic failure" and "partial failure".

You’ll have to cope with these as best you can. Coping strategies:

  • Divide the dataset in two (based on the time of change)?
  • Infer old ratings (based on available data)?

Challenges with Inconsistent Labels:

  • Coexistence of Labels: Old and new labels can cause confusion.
  • Data Integration Issues: Merging data leads to mismatches.
  • Loss of Context: Old labels lose meaning without proper documentation.
  • Skewed Analysis: Inconsistent labels distort analysis and visualizations.

32 of 55

Example: evolving labeling scheme for product quality ratings.

SCENARIO: Suppose you are analyzing customer reviews for a product over a period of time, and you notice a change in the way product quality is rated during the analysis.

Initial Labeling Scheme (Phase 1):

33 of 55

Example: evolving labeling scheme for product quality ratings.

However, midway through the project, the company decides to change the labeling scheme to include more detailed quality ratings, introducing new subcategories such as "Excellent," "Very Good," "Good," "Fair," and "Poor".

Evolving Labeling Scheme (Phase 2):

The updated labeling scheme now includes "Excellent," "Very Good," "Good," "Fair," and "Poor".

34 of 55

Example: evolving labeling scheme for product quality ratings: Managing Data Changes

Possible Coping Strategies:

Step 1: Splitting the Dataset (by Phases):

To accommodate changes in the labeling scheme, consider dividing the dataset into two separate groups:

  1. One containing data from the initial phase (Phase 1).
  2. Another for the data collected after the labeling scheme change (Phase 2).

This division helps maintain consistency in your analysis for each phase.

35 of 55

Example: evolving labeling scheme for product quality ratings: Managing Data Changes

Possible Coping Strategies:

Step 2: Infer/Interpret Old Ratings:

For records in Phase 1, where you only have "Good," "Excellent," and "Fair" ratings, you can infer how they might map to the new labeling scheme.

  • For instance, you might consider mapping "Good" to "Very Good" and "Fair" to "Fair" in the updated scheme.

Keep in mind that this mapping might depend on what makes the most sense in your specific situation.

36 of 55

Practical Example: Evolving labeling scheme

Step 1: Create a Mapping Dictionary:
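A minimal sketch following the mapping suggested in the coping strategies above ("Good" → "Very Good", "Fair" → "Fair"); the exact dictionary contents are illustrative:

label_mapping = {
    'Excellent': 'Excellent',
    'Good': 'Very Good',
    'Fair': 'Fair',
}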

37 of 55

Practical Example: Evolving labeling scheme

Step 2: Apply the Mapping
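A sketch of applying the dictionary from Step 1 (the column name quality is hypothetical):

# Map Phase 1 labels to the Phase 2 scheme; leave unmapped (Phase 2) labels untouched
df['quality'] = df['quality'].map(label_mapping).fillna(df['quality'])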

  • Document Changes: Maintain a data dictionary or change log to track how labeling schemes have evolved over time.
  • Automate Label Updates: Use scripts or pipelines to automatically update labels when new schemes are introduced.
  • Always preserve historical context for traceability.

38 of 55

Dropping Columns in a DataFrame

Not all the categories of data in a dataset are useful!

Remove unwanted columns: examples:

to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(columns=to_drop, inplace=True)
# Equivalent: df.drop(to_drop, axis=1, inplace=True)

39 of 55

Duplicate Records

40 of 55

Duplicated Records

Sometimes people will put a bunch of duplicate records in your system!

  • If they are exact duplicates: just do df.drop_duplicates()
  • If they are duplicates where some of them are subtly different: identify the ones that are the true values and drop the rest

ALWAYS CHECK FOR THIS

41 of 55

Example: Duplicated Records
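A minimal illustration with made-up records:

import pandas as pd

data = pd.DataFrame({'name': ['Ann', 'Bob', 'Bob'],
                     'score': [90, 85, 85]})
data.duplicated()        # flags the second 'Bob' row as a duplicate
data = data.drop_duplicates()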

42 of 55

Example: Duplicated Records

43 of 55

Outlier Detection

44 of 55

What defines an outlier?

The formal definition of an outlier may vary depending on the context and statistical method employed.

One common approach is to identify outliers as data points that deviate significantly from the mean, often defined as being several standard deviations away.

45 of 55

What defines an outlier?

Outliers are data points that significantly deviate from the majority of the data in a dataset.

  • Example: unusually high or low values that may be indicative of errors, anomalies, or rare events.

Detecting outliers is important because they can skew statistical analyses and machine learning models.

46 of 55

Outlier Detection: Some common approaches

You can find outliers multiple ways. Looking for extreme z-scores is one way; a simple box-and-whisker plot is another.

47 of 55

Outlier Detection using Z-Scores

Z-scores are a statistical measure that quantifies how far a data point is from the mean (average) of a dataset in terms of standard deviations.

  • A high positive or negative z-score suggests that a data point is far from the mean and is a potential outlier.

48 of 55

Example:

Outlier Detection using Z-Scores

Step 01: Calculate the Mean and Standard Deviation:

  • Mean (μ) = (80 + 85 + 88 + 90 + 92 + 95 + 98 + 100 + 150) / 9 = 878 / 9 ≈ 97.56
  • Standard Deviation (σ) ≈ 19.46 (population standard deviation, rounded)

Test Scores: [80, 85, 88, 90, 92, 95, 98, 100, 150]

49 of 55

Example:

Outlier Detection using Z-Scores

Step 02: Calculate Z-scores for each test score using the formula Z = (x − μ) / σ:

  • For the test score of 150:
    • Z = (150 − 97.56) / 19.46 ≈ 2.69
  • For the other scores, calculate their respective z-scores.

Test Scores: [80, 85, 88, 90, 92, 95, 98, 100, 150]

50 of 55

Example:

Outlier Detection using Z-Scores

Step 03: Identify Extreme Z-scores:

A common threshold for an extreme z-score is greater than 2 or less than −2, which suggests that a data point is significantly far from the mean.

  • Here, the z-score for the score of 150 is approximately 2.69, which is greater than 2.
      • This indicates that the score of 150 is significantly above the mean and is a potential outlier.

Test Scores: [80, 85, 88, 90, 92, 95, 98, 100, 150]

51 of 55

Pandas code for detecting outliers
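A minimal sketch using the z-score approach from the previous slides:

import pandas as pd

scores = pd.Series([80, 85, 88, 90, 92, 95, 98, 100, 150])

# Population z-scores: (x − μ) / σ
z = (scores - scores.mean()) / scores.std(ddof=0)

# Flag values beyond the |z| > 2 threshold
outliers = scores[z.abs() > 2]
print(outliers)  # the score of 150 is flagged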

52 of 55

IsolationForest: Anomaly Detection Algorithm
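scikit-learn's IsolationForest is one off-the-shelf option; a hedged sketch (the parameters shown are illustrative, not prescribed by the slide):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[80], [85], [88], [90], [92], [95], [98], [100], [150]])

# fit_predict returns 1 for inliers and -1 for anomalies
clf = IsolationForest(contamination=0.1, random_state=42)
labels = clf.fit_predict(X)
print(X[labels == -1])  # the flagged anomalous value(s)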

53 of 55

Should we remove an outlier always?

54 of 55

It depends!

Ultimately, when doing a data science project, you have some goal in mind. Remove the outlier if it hurts that goal. Whether to remove an outlier depends on how it affects your project’s objectives and whether it is an essential or merely unusual part of your data. Consider:

The Context of the Outlier: Is the outlier an unusual event that will likely never happen again? (Something like “light intensity during an eclipse”.)

→ If it is, removing the outlier may be appropriate.

Evaluate Impact on Model Performance: Is the outlier something that will actually hurt your model? (You don’t really need a model factoring in Jeff Bezos when analyzing typical financial transactions.)

Assess Relevance: Is the outlier a core element of your data? (Profits during a sale, wait times during traffic.) Such outliers should not be removed without careful consideration.

55 of 55

Next Class:

Missing Data

Check this Article: https://realpython.com/python-data-cleaning-numpy-pandas/