1 of 14

Cleaning Data with OpenRefine

12:00 - 1:00 am

dscommons@uvic.ca

uvic.ca/library/dsc/

For the UVic Libraries

Digital Scholarship Commons

2 of 14

What we will cover in today’s lesson:

  1. What is messy vs clean data
  2. OpenRefine for key cleaning practices:
    • Analysing the frequency of values
    • Clustering and standardizing values
    • Separating values in the same field
    • Joining multiple values in separate fields
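
The four practices above can be sketched outside OpenRefine in a few lines of Python. This is a minimal illustration, not OpenRefine's own code: `fingerprint` is a simplified take on the key-collision idea behind its default clustering method, and the function names, the `;` separator, and the sample values are all assumptions made for this example.

```python
import string
from collections import Counter, defaultdict


def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def value_frequencies(values):
    """Count how often each distinct value occurs (similar to a text facet)."""
    return Counter(values)


def cluster(values):
    """Group values whose keys collide; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


def standardize(values):
    """Replace each value with the most frequent spelling in its cluster."""
    counts = Counter(values)
    best = {}
    for value, n in counts.items():
        key = fingerprint(value)
        if key not in best or n > counts[best[key]]:
            best[key] = value
    return [best[fingerprint(value)] for value in values]


def split_field(value, sep=";"):
    """Separate several values stored in one field (separator is an assumption)."""
    return [part.strip() for part in value.split(sep) if part.strip()]


def join_fields(*parts, sep=" "):
    """Join values held in separate fields, skipping empty ones."""
    return sep.join(p.strip() for p in parts if p and p.strip())


# Hypothetical sample column, invented for illustration only.
cities = ["Victoria", "victoria", " Victoria ", "Victoria", "VICTORIA.", "Vancouver"]

if __name__ == "__main__":
    print(value_frequencies(cities))
    print(cluster(cities))
    print(standardize(cities))
    print(split_field("pencil; ink ;watercolour"))
    print(join_fields("Samuel", "Maclure"))
```

Choosing the most frequent spelling as the standard is only one possible rule; in OpenRefine the merged value is picked by hand for each cluster.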

3 of 14

Not all data are created equal...

4 of 14

Data arise from many sources

> Automatically generated

> Downloaded from online database/repository

> Scraped from the web

> Shared by colleagues

> Manually created and managed

5 of 14

Signs of “messy” data:

> missing values

> inconsistent values/formats

> unclear column headings

> poor/absent structure
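
Some of these signs can be spotted with a quick script before any cleaning begins. The sketch below is a rough illustration under stated assumptions: the list of "missing" markers, the placeholder-heading pattern, and the sample rows are all invented for this example and would need adjusting for a real dataset.

```python
import re
from collections import Counter

# Assumption: strings treated as "missing"; real datasets may use other markers.
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "-"}

# Assumption: headings such as "col3", "Unnamed: 2" or "" count as unclear.
PLACEHOLDER = re.compile(r"^(col|column|field|unnamed|var)?[\s_:]*\d*$", re.IGNORECASE)


def is_missing(value):
    """True for None or for a string that matches one of the missing markers."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def missing_counts(rows):
    """Number of missing values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if is_missing(value):
                counts[column] += 1
    return dict(counts)


def shape(value):
    """Reduce a value to a pattern: every digit becomes 9, every letter becomes a."""
    text = re.sub(r"[0-9]", "9", str(value).strip())
    return re.sub(r"[A-Za-z]", "a", text)


def format_patterns(rows, column):
    """Distinct patterns in a column; more than one hints at inconsistent formats."""
    return Counter(shape(r[column]) for r in rows if not is_missing(r.get(column)))


def unclear_headings(rows):
    """Column names that look like placeholders rather than descriptions."""
    columns = dict.fromkeys(name for row in rows for name in row)
    return [name for name in columns if PLACEHOLDER.match(name.strip())]


# Hypothetical rows, invented for illustration only.
rows = [
    {"id": "1", "date": "2019-03-01", "col3": "12"},
    {"id": "2", "date": "01/03/2019", "col3": ""},
    {"id": "3", "date": "N/A", "col3": "7"},
]

if __name__ == "__main__":
    print(missing_counts(rows))
    print(format_patterns(rows, "date"))
    print(unclear_headings(rows))
```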

6 of 14

Signs of “high integrity” data:

> validity (conformance to defined rules/constraints)

> accuracy (conformity of a measure to a “true” value)

> completeness (degree to which all required measures are known)

> consistency (degree to which measures are equivalent, i.e. do not contradict)

> uniformity (degree to which measures use same units)

https://en.wikipedia.org/wiki/Data_cleansing
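
Several of these criteria can be turned into simple checks. The following is a sketch under stated assumptions: the record layout, the unit table, and the "length must be positive" rule are hypothetical. Accuracy is left out because it requires a trusted reference value to compare against.

```python
from collections import defaultdict

# Assumption: lengths are recorded in one of these units.
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def is_known(value):
    """A value counts as known when it is neither None nor an empty string."""
    return value not in (None, "")


def completeness(rows, column):
    """Fraction of rows with a known value in the column."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if is_known(r.get(column))) / len(rows)


def invalid_rows(rows, column, rule):
    """Validity: rows whose known value breaks the given rule."""
    return [r for r in rows if is_known(r.get(column)) and not rule(r[column])]


def to_millimetres(length, unit):
    """Uniformity: express every length in the same unit."""
    return length * UNIT_TO_MM[unit]


def contradictions(rows, key, column):
    """Consistency: keys that are recorded with more than one distinct value."""
    seen = defaultdict(set)
    for r in rows:
        if is_known(r.get(column)):
            seen[r[key]].add(r[column])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Hypothetical records, invented for illustration only.
records = [
    {"specimen": "A1", "species": "deer", "length": 42.0, "unit": "cm"},
    {"specimen": "A1", "species": "elk", "length": 420.0, "unit": "mm"},
    {"specimen": "B2", "species": "", "length": -5.0, "unit": "mm"},
]

if __name__ == "__main__":
    print(completeness(records, "species"))
    print(invalid_rows(records, "length", lambda x: x > 0))
    print([to_millimetres(r["length"], r["unit"]) for r in records])
    print(contradictions(records, "specimen", "species"))
```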

7 of 14


8 of 14

9 of 14

10 of 14

KEEP CALM AND CLEAN YOUR DATA

11 of 14

> Free, open source tool

> Visualize and manipulate large quantities of data

> Key functions:

    • Data Normalization
    • Column Reorganization
    • Faceting and Clustering
    • Tracking Processes
    • Exporting Data

12 of 14

13 of 14

Samuel Maclure Architectural Drawings

14 of 14

McTaggart Cowan Mammal Skeletons