Cleaning Data with
12:00 - 1:00 am
dscommons@uvic.ca
For the UVic Libraries
Digital Scholarship Commons
What we will cover in today’s lesson:
Activities: https://lib.uvic.ca/or
Not all data are created equal...
Data arise from many sources
> Automatically generated
> Downloaded from online database/repository
> Scraped from the web
> Shared by colleagues
> Manually created and managed
Signs of “messy” data:
> missing values
> inconsistent values/formats
> unclear column headings
> poor/absent structure
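The first two signs can be checked mechanically. The following is a minimal sketch in Python, not part of the workshop materials: the rows, the column names, and the helper names `shape` and `profile` are all hypothetical, invented for illustration. It counts empty cells per column and reduces each value to a pattern, so that a column holding more than one pattern hints at inconsistent formats.

```python
import re

# Hypothetical rows, invented purely for illustration.
ROWS = [
    {"id": "1", "date": "2017-03-01", "city": "Victoria"},
    {"id": "2", "date": "03/02/2017", "city": "victoria "},
    {"id": "3", "date": "", "city": "Victoria"},
]


def shape(value):
    """Reduce a value to its pattern: digits become 9, letters become a."""
    return re.sub(r"[A-Za-z]", "a", re.sub(r"\d", "9", value))


def profile(rows):
    """Count missing values and list distinct value patterns per column."""
    report = {}
    for column in rows[0]:
        values = [row.get(column) or "" for row in rows]
        present = [v for v in values if v.strip()]
        report[column] = {
            "missing": len(values) - len(present),
            "patterns": sorted({shape(v) for v in present}),
        }
    return report


if __name__ == "__main__":
    for column, stats in profile(ROWS).items():
        print(column, stats)
```

In this sample the date column has one missing value and two patterns (an ISO-style date and a slash-separated date), and the city column has two patterns because one value carries a trailing space.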
Signs of “high integrity” data:
> validity (conformance to defined rules/constraints)
> accuracy (conformity of a measure to “true” value)
> completeness (degree to which all required measures are known)
> consistency (degree to which measures are equivalent, i.e. do not contradict)
> uniformity (degree to which measures use same units)
https://en.wikipedia.org/wiki/Data_cleansing
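Four of these five properties can be tested by simple rules; accuracy usually cannot, because it needs an external "true" value to compare against. The sketch below is one possible way to express such rules in Python. Everything in it is an assumption made for illustration: the records, the field names, the year range, and the function name `check_integrity` are hypothetical and do not come from the workshop data.

```python
# Hypothetical records, invented purely for illustration.
RECORDS = [
    {"id": "A1", "length": "12.5", "unit": "cm", "year": "1952"},
    {"id": "A2", "length": "", "unit": "cm", "year": "1948"},
    {"id": "A3", "length": "130", "unit": "mm", "year": "2952"},
    {"id": "A1", "length": "13.0", "unit": "cm", "year": "1952"},
]


def check_integrity(records, year_range=(1800, 2025)):
    """Return (row index, property, field) tuples for each rule violation."""
    issues = []
    first_seen = {}  # id -> index of the first row that used it
    for i, record in enumerate(records):
        # Completeness: every field should hold a value.
        for field, value in record.items():
            if not value.strip():
                issues.append((i, "completeness", field))
        # Validity: the year must fall inside the assumed range.
        year = record["year"]
        if year.isdigit() and not year_range[0] <= int(year) <= year_range[1]:
            issues.append((i, "validity", "year"))
        # Consistency: rows sharing an id should not contradict each other.
        earlier = first_seen.get(record["id"])
        if earlier is not None and records[earlier]["length"] != record["length"]:
            issues.append((i, "consistency", "length"))
        first_seen.setdefault(record["id"], i)
    # Uniformity: the whole column should use a single unit.
    units = {record["unit"] for record in records if record["unit"]}
    if len(units) > 1:
        issues.append((None, "uniformity", "unit"))
    return issues


if __name__ == "__main__":
    for issue in check_integrity(RECORDS):
        print(issue)
```

Each rule here is deliberately narrow. Real checks depend on what the dataset's own documentation defines as valid, complete, and consistent.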
2017 Data Scientist Survey
https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport.pdf
KEEP CALM AND CLEAN YOUR DATA
> Free, open source tool
> Visualize and manipulate large quantities of data
> Key functions: Data Normalization, Column Reorganization, Faceting and Clustering, Tracking Processes, Exporting Data
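To give a feel for what faceting and clustering do, here is a minimal Python sketch. It is not the tool's own implementation: it assumes a facet is a count of each distinct value in a column, and it assumes a simple key-collision approach to clustering, in which values that normalize to the same key are grouped as likely variants. The function names `facet`, `fingerprint`, and `cluster` and the sample values are hypothetical.

```python
import string
from collections import Counter, defaultdict

# Hypothetical column values, invented purely for illustration.
NAMES = ["Maclure, Samuel", "Samuel Maclure", "samuel maclure ", "McTaggart Cowan"]


def facet(values):
    """Count how often each distinct value occurs, most common first."""
    return Counter(values).most_common()


def fingerprint(value):
    """Normalize a value: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint but are spelled differently."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    print(facet(NAMES))
    print(cluster(NAMES))
```

With the sample values, the three spellings of the first name fall into one cluster, and the remaining value stays on its own. A person still has to decide whether a cluster really is one entity before merging it.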
Samuel Maclure Architectural Drawings
McTaggart Cowan Mammal Skeletons
Workshop Badges
If you’d like to earn a DSC badge, email xxx to dscommons@uvic.ca
Questions or Comments?