Cleaning Data

Why?

Raw data is often messy, inconsistent, and difficult for computers to process.

Example

Recall: how did you read files with Python?

---
iris-virginica, 3.4, width, 5.6, length
iris versicolor, “3.7”, 2.6, width, length
IRIS VIRGINICA, “Width 2.3”, “Length 8.2”
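
As a refresher, a minimal sketch of reading a raw file like the one above with Python’s built-in csv module (the filename messy_iris.csv is made up for this example):

    import csv

    # Read the raw rows exactly as they appear in the file -- no cleaning yet.
    with open("messy_iris.csv", newline="") as f:
        for row in csv.reader(f):
            print(row)
    # e.g. ['iris-virginica', ' 3.4', ' width', ' 5.6', ' length']
    # Note the stray spaces, quotes, and inconsistent casing we will have to clean up.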

Data Quality

Validity

Does the data conform to sensible constraints? Does it relate to the question at hand? Does it accurately reflect the real world?

---
iris-virginica, 3.4, width, 105.6, length
lilium-tigrinum, 2.4, width, 6.1, length

Accuracy

How close are our data points to the “true” values?

  • Very hard to discern without outside sources
  • A collection of “true” values might not be readily available

---
iris-virginica, 3.4, width, 5.6, length

Completeness

Are there any missing components of our dataset?

  • Incomplete datasets introduce biases
  • Impossible to fix by cleaning

---
iris-virginica, 3.4, width, N/A, length

Consistency

Do any data points in our dataset contradict each other?

  • Cleaning requires knowing which data point to trust

---
horace-mann, 231 W246 St
horace-mann, 305 W243 St
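
One quick way to surface contradictions like this: group records by a key field and flag any key that maps to more than one value. A minimal sketch, assuming a two-column name, address file like the example above (schools.csv is made up):

    import csv
    from collections import defaultdict

    addresses = defaultdict(set)
    with open("schools.csv", newline="") as f:
        for name, address in csv.reader(f):
            addresses[name].add(address.strip())

    # A name with more than one address is a contradiction we must resolve.
    for name, seen in addresses.items():
        if len(seen) > 1:
            print(name, "has conflicting addresses:", sorted(seen))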

Uniformity

Are the same measures being used for all the data points?

---
iris-virginica, 3.4, width, 5.6, length
iris-virginica, 8.6, width, 14.2, length

Methods

Standardization

This addresses consistency and uniformity. Think about our labs using the csv module: what makes data “nice” to work with? A sketch of this follows the list.

  • Consistent formatting
  • Same data types in the same locations (float, str, int, etc.)
  • Same units of measure
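
A minimal sketch of standardizing one row, assuming rows follow the species, width value, “width”, length value, “length” layout from the earlier examples (real files will need more cases than this):

    def standardize(row):
        """Normalize one row of the form: species, width, 'width', length, 'length'."""
        species = row[0].strip().lower().replace(" ", "-")  # 'IRIS VIRGINICA' -> 'iris-virginica'
        width = float(row[1].strip().strip('“”"'))          # '“3.7”' -> 3.7
        length = float(row[3].strip().strip('“”"'))
        return [species, width, length]

    print(standardize(["iris-virginica", "3.4", "width", "5.6", "length"]))
    # ['iris-virginica', 3.4, 5.6] -- same types, same casing, same field order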

Data Validation

This addresses validity and accuracy: go through each data point looking for issues. Mark the data points with issues, but do not remove them yet. A sketch follows the list below.

  • Typos, capitalization, formatting
  • Strange values
  • Field restrictions
  • Value accuracy
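
A sketch of marking (not removing) rows, assuming standardized [species, width, length] rows; the species list and the plausible length range are invented restrictions for illustration:

    KNOWN_SPECIES = {"iris-setosa", "iris-versicolor", "iris-virginica"}

    def find_issues(row):
        """Return a list of problems found in a standardized [species, width, length] row."""
        species, width, length = row
        issues = []
        if species not in KNOWN_SPECIES:
            issues.append("unknown species")
        if not 0 < length < 20:                     # field restriction: plausible length
            issues.append("strange length value")
        return issues

    for row in [["iris-virginica", 3.4, 105.6], ["lilium-tigrinum", 2.4, 6.1]]:
        print(row, find_issues(row))                # mark the row; keep it for now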

Missing Data

It’s a judgment call, but it is often very handy to keep data points with some missing values. A sketch of the placeholder approach follows the list.

  • Fill in any missing data that can be filled in
  • Replace missing values with a standardized placeholder
  • Treat the placeholder as another category
  • Keep track of how many values are missing for a property
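
A minimal sketch of the placeholder idea, assuming missing values show up as empty strings or “N/A” in standardized rows:

    MISSING = "MISSING"   # standardized placeholder, treated as its own category

    def fill_missing(row):
        return [MISSING if str(value).strip() in ("", "N/A") else value for value in row]

    rows = [fill_missing(r) for r in [["iris-virginica", 3.4, "N/A"]]]
    print(rows)                                                   # [['iris-virginica', 3.4, 'MISSING']]
    print(sum(r[2] == MISSING for r in rows), "missing lengths")  # track counts per property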

Removing Unwanted Data

At this point we have a good sense of which data we’re confident in, and we’ve marked data to look at more closely.

  • Remove duplicate data points
  • Remove irrelevant observations / properties
  • Remove data that is beyond repair
    • Too many missing values
    • Can’t recover correct values

Record what data is removed, and the reason for removal! A sketch of both steps follows.
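
A sketch of removing duplicates while keeping the required record of removals (rows are assumed to be lists that compare equal when duplicated):

    def remove_duplicates(rows):
        """Drop exact duplicate rows, logging each removal and the reason."""
        seen, kept, removal_log = set(), [], []
        for row in rows:
            key = tuple(row)
            if key in seen:
                removal_log.append((row, "duplicate"))
            else:
                seen.add(key)
                kept.append(row)
        return kept, removal_log

    kept, log = remove_duplicates([["iris-virginica", 3.4, 5.6], ["iris-virginica", 3.4, 5.6]])
    print(kept)   # one copy survives
    print(log)    # [(['iris-virginica', 3.4, 5.6], 'duplicate')] -- save this record!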

Outliers

Once we’ve done all of the other checks, we might still have data that seems a little far-fetched. These data points are called outliers.

  • Do not remove unless there is strong evidence that the outliers are fake
  • Keep an eye on outliers as you analyze your data
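
One common way to flag (not remove) outliers is the 1.5 × IQR rule; a minimal sketch using the standard library’s statistics module (the cutoff is a convention, not a law):

    import statistics

    def flag_outliers(values):
        """Flag values more than 1.5 * IQR outside the middle 50% of the data."""
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return [v for v in values if v < low or v > high]

    lengths = [5.6, 6.1, 5.9, 6.0, 5.8, 14.2]
    print(flag_outliers(lengths))   # [14.2] -- keep an eye on it, don't delete it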

Transparency (always!)

Whenever we alter raw data, we introduce bias, since we are now working with a non-random subset of our original sample.

  • Be careful interpreting results
  • Be transparent about what cleaning methods were used
  • Record any edits that were made to the data

Key Concept

It’s a lot easier to record good data than it is to clean bad data!

Keep data quality in mind as you gather data.