1 of 16

Cleaning Data with OpenRefine

12:00 - 1:00 pm

dscommons@uvic.ca

uvic.ca/library/dsc/

For the UVic Libraries

Digital Scholarship Commons

2 of 16

What we will cover in today’s lesson:

Activities: https://lib.uvic.ca/or

  1. What is messy vs clean data
  2. OpenRefine for key cleaning practices:
    • Analysing the frequency of values
    • Clustering and standardizing values
    • Separating values in the same field
    • Joining multiple values in separate fields
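
To give a rough idea of what these four practices do, here is a minimal Python sketch using only the standard library. It is not OpenRefine's implementation: the sample values, the `fingerprint` helper, and the `;` separator are assumptions invented for this example. The fingerprint key is a simplified version of the key-collision idea behind OpenRefine's clustering.

```python
import string
from collections import Counter, defaultdict

# Hypothetical sample column; values are invented for illustration.
names = ["Victoria", "victoria ", "Vancouver", "Victoria", "VANCOUVER", "Nanaimo"]

# 1. Frequency of values (roughly what a text facet shows).
counts = Counter(names)

# 2. Clustering: group values that share a simplified "fingerprint" key.
def fingerprint(value):
    """Trim, lowercase, drop punctuation, then sort the unique tokens."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

# Standardizing: replace each value with the most common variant in its cluster.
canonical = {key: Counter(vals).most_common(1)[0][0] for key, vals in clusters.items()}
standardized = [canonical[fingerprint(name)] for name in names]

# 3. Separating values stored in the same field.
subjects = "houses; churches; schools"
parts = [part.strip() for part in subjects.split(";")]

# 4. Joining values held in separate fields.
city, province = "Victoria", "BC"
joined = ", ".join([city, province])
```

In OpenRefine itself these steps are done through the interface (facets, clustering, and column editing) rather than in code.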

3 of 16

Not all data are created equal...

4 of 16

Data arise from many sources

> Automatically generated

> Downloaded from online database/repository

> Scraped from the web

> Shared by colleagues

> Manually created and managed

5 of 16

Signs of “messy” data:

> missing values

> inconsistent values/formats

> unclear column headings

> poor/absent structure
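
Three of these signs can be spotted with a few lines of code. The sketch below is only an illustration under stated assumptions: the small table, the `YYYY-MM-DD` date convention, and the pattern used to flag generic headings are all invented here. Poor or absent structure is harder to detect automatically and is not covered.

```python
import csv
import io
import re

# Hypothetical messy table, invented for illustration.
raw = """id,Date,col3
1,2021-03-04,12
2,04/03/2021,
3,2021-03-05,15
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Missing values: count empty cells in each column.
missing = {col: sum(1 for row in rows if not row[col].strip()) for col in rows[0]}

# Inconsistent formats: dates that do not match the assumed YYYY-MM-DD pattern.
iso_date = re.compile(r"\d{4}-\d{2}-\d{2}")
bad_dates = [row["Date"] for row in rows if not iso_date.fullmatch(row["Date"])]

# Unclear column headings: flag generic names such as "col3".
unclear = [col for col in rows[0] if re.fullmatch(r"(col|column|field)\d*", col.lower())]
```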

6 of 16

Signs of “high integrity” data:

> validity (conformity to defined rules/constraints)

> accuracy (conformity of a measure to its “true” value)

> completeness (degree to which all required measures are known)

> consistency (degree to which measures are equivalent, i.e. do not contradict)

> uniformity (degree to which measures use the same units)

https://en.wikipedia.org/wiki/Data_cleansing
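
Several of these dimensions can be expressed as simple checks. The following Python sketch is illustrative only: the records, the plausible year range, and the choice of centimetres as the target unit are assumptions made for this example. Accuracy is left out because it requires an external "true" value to compare against.

```python
# Hypothetical specimen records, invented for illustration.
records = [
    {"id": "A1", "length": 12.0, "unit": "cm", "year": 1935},
    {"id": "A2", "length": 4.5, "unit": "in", "year": 1941},
    {"id": "A3", "length": None, "unit": "cm", "year": 2150},
    {"id": "A1", "length": 12.0, "unit": "cm", "year": 1936},
]

# Validity: an assumed rule that the year falls within a plausible range.
invalid = [r["id"] for r in records if not 1800 <= r["year"] <= 2025]

# Completeness: share of records with a known length.
complete = sum(r["length"] is not None for r in records) / len(records)

# Uniformity: convert every length to centimetres (1 in = 2.54 cm).
TO_CM = {"cm": 1.0, "in": 2.54}
lengths_cm = [
    None if r["length"] is None else round(r["length"] * TO_CM[r["unit"]], 2)
    for r in records
]

# Consistency: the same id should not be recorded with contradicting years.
years_by_id = {}
for r in records:
    years_by_id.setdefault(r["id"], set()).add(r["year"])
conflicts = [rec_id for rec_id, years in years_by_id.items() if len(years) > 1]
```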

7 of 16

Signs of “high integrity” data:

> validity

> accuracy

> completeness

> consistency

> uniformity

8 of 16

9 of 16

10 of 16

KEEP

CALM

AND

CLEAN YOUR

DATA

11 of 16

> Free, open source tool

> Visualize and manipulate large quantities of data

> Key functions:
    • Data Normalization
    • Column Reorganization
    • Faceting and Clustering
    • Tracking Processes
    • Exporting Data
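
The idea behind tracking processes and exporting data can be sketched in Python: every cleaning step is logged so it can be reviewed or repeated, and the cleaned table is written out at the end. This is a loose analogy only. The `apply_step` helper, the log structure, and the sample rows are hypothetical and do not reflect OpenRefine's actual history format.

```python
import csv
import io
import json

rows = [{"city": " victoria"}, {"city": "VICTORIA "}]  # invented sample data
history = []  # hypothetical step log, not OpenRefine's history format

def apply_step(description, column, func):
    """Transform one column in every row and record the step."""
    for row in rows:
        row[column] = func(row[column])
    history.append({"step": len(history) + 1, "description": description, "column": column})

apply_step("trim whitespace", "city", str.strip)
apply_step("convert to title case", "city", str.title)

# Tracking: the log can be saved alongside the data.
history_json = json.dumps(history, indent=2)

# Exporting: write the cleaned rows as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
exported = buffer.getvalue()
```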

12 of 16

13 of 16

Samuel Maclure Architectural Drawings

14 of 16

McTaggart Cowan Mammal Skeletons

15 of 16

Workshop Badges

If you’d like to earn a DSC badge, email xxx to dscommons@uvic.ca

16 of 16

Questions or Comments?

dscommons@uvic.ca

uvic.ca/library/dsc/