Cleaning Data
Why?
Raw data is often messy, inconsistent, and difficult for computers to process.
Example
Recall: how did you read files with Python?
---
iris-virginica, 3.4, width, 5.6, length
iris versicolor, "3.7", 2.6, width, length
IRIS VIRGINICA, "Width 2.3", "Length 8.2"
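For instance, a minimal sketch using Python's built-in csv module (assuming the rows above are saved in a hypothetical file named iris_raw.csv):

```python
import csv

# Read the raw rows exactly as they appear -- no cleaning yet.
with open("iris_raw.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)  # each row is a list of strings, quirks and all
```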
Data Quality
Validity
Does the data relate to the question at hand? Does it accurately reflect the real world?
---
iris-virginica, 3.4, width, 105.6, length
lilium-tigrinum, 2.4, width, 6.1, length
Accuracy
How close are our data points to the “true” values?
---
iris-virginica, 3.4, width, 5.6, length
Completeness
Are there any missing components of our dataset?
---
iris-virginica, 3.4, width, N/A, length
Consistency
Do any data points in our dataset contradict each other?
---
horace-mann, 231 W246 St
horace-mann, 305 W243 St
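One way to spot contradictions like this is to group rows by their identifying field and flag any key that maps to more than one value; a sketch, assuming records are (name, address) pairs:

```python
from collections import defaultdict

records = [
    ("horace-mann", "231 W246 St"),
    ("horace-mann", "305 W243 St"),
]

# Collect every address seen for each name; more than one is a contradiction.
addresses = defaultdict(set)
for name, address in records:
    addresses[name].add(address)

for name, found in addresses.items():
    if len(found) > 1:
        print(f"inconsistent: {name} -> {sorted(found)}")
```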
Uniformity
Are the same measures being used for all the data points?
---
iris-virginica, 3.4, width, 5.6, length
iris-virginica, 8.6, width, 14.2, length
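In the rows above, 8.6 ≈ 3.4 × 2.54 and 14.2 ≈ 5.6 × 2.54, so one plausible reading (an assumption, not stated in the data) is that the first row is in inches and the second in centimeters. A sketch that converts everything to one unit, assuming each row has been tagged with its unit:

```python
CM_PER_INCH = 2.54

# (width, length, unit) -- the unit tag is our assumption; real data often
# omits it, which is exactly what makes uniformity hard to check.
rows = [
    (3.4, 5.6, "in"),
    (8.6, 14.2, "cm"),
]

uniform = []
for width, length, unit in rows:
    if unit == "in":  # convert inches to centimeters
        width, length = width * CM_PER_INCH, length * CM_PER_INCH
    uniform.append((round(width, 1), round(length, 1)))

print(uniform)  # both rows now agree: [(8.6, 14.2), (8.6, 14.2)]
```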
Methods
Standardization
This addresses consistency and uniformity. Think back to our labs using the csv module: what makes data “nice” to work with?
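A sketch of what standardizing the iris rows might look like: one naming style, plain numbers, a fixed column order (the exact rules here are assumptions about this dataset, and rows whose columns are shuffled would still need hand attention):

```python
def standardize(row):
    """Normalize one raw row into (species, width, length)."""
    species = row[0].strip().lower().replace(" ", "-")  # one naming style
    width = float(row[1].strip().strip('"'))   # drop stray quotes first
    length = float(row[3].strip().strip('"'))
    return (species, width, length)

print(standardize(["IRIS VIRGINICA", '"3.4"', "width", "5.6", "length"]))
# -> ('iris-virginica', 3.4, 5.6)
```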
Data Validation
This addresses validity and accuracy. This involves going through each data point and looking for issues. Mark the data points with issues, but do not remove them yet.
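A sketch of marking rather than removing, assuming iris petal lengths should fall between 0 and 10 cm (a made-up plausibility range):

```python
def flag_issues(row):
    """Return a list of problems with a (species, width, length) row."""
    species, width, length = row
    issues = []
    if not species.startswith("iris-"):
        issues.append("not an iris: may not relate to our question")
    if not (0 < length <= 10):  # assumed plausible range, in cm
        issues.append(f"implausible length: {length}")
    return issues

for row in [("iris-virginica", 3.4, 105.6), ("lilium-tigrinum", 2.4, 6.1)]:
    print(row, flag_issues(row))
```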
Missing Data
It’s a judgment call, but it is often very handy to keep data points with some missing values: the fields that are present still carry information.
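One common approach (an assumption about how you want to handle it, not the only option) is to parse N/A into None and keep the row, so the other fields remain usable:

```python
def parse_measurement(value):
    """Convert a field to float, mapping 'N/A' (and blanks) to None."""
    value = value.strip()
    if value in ("", "N/A", "NA"):
        return None  # keep the row; just mark this field as missing
    return float(value)

row = ["iris-virginica", "3.4", "width", "N/A", "length"]
width, length = parse_measurement(row[1]), parse_measurement(row[3])
print(width, length)  # 3.4 None -- the width is still usable
```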
Removing Unwanted Data
At this point we have a good sense of which data we’re confident in, and we’ve marked data to look at more closely.
Record what data is removed, and the reason for removal!
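A sketch of removal with a paper trail, reusing the assumed plausibility check from the validation step:

```python
rows = [("iris-virginica", 3.4, 5.6), ("iris-virginica", 3.4, 105.6)]

kept, removal_log = [], []
for row in rows:
    species, width, length = row
    if not (0 < length <= 10):  # same assumed plausible range as before
        removal_log.append((row, f"implausible length: {length}"))
    else:
        kept.append(row)

print("kept:", kept)
print("removed, with reasons:", removal_log)
```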
Outliers
Once we’ve done all of the other checks, we might still have data that seems a little far-fetched. These data points are called outliers.
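One common rule of thumb (one of several, and not necessarily the one used in class) is the 1.5×IQR fence: values more than 1.5 interquartile ranges outside the middle half of the data are suspect. A sketch with made-up lengths:

```python
import statistics

lengths = [5.6, 6.1, 5.9, 6.0, 5.8, 14.2]

q1, _, q3 = statistics.quantiles(lengths, n=4)  # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in lengths if x < low or x > high]
print(outliers)  # [14.2] -- worth a closer look, not automatic deletion
```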
Transparency (always!)
Whenever we alter raw data, we introduce bias, since we are now working with a non-random subset of our original sample.
Key Concept
It’s a lot easier to record good data than to clean bad data!
Keep data quality in mind as you gather data.