Open Data Quality Working Group
30 May 2019
Open Data Quality Scoring
The second meeting of the working group will focus on data quality scoring, building on:
During our first working group session we discussed the elements we consider important for data quality
The OD team has successfully used scoring decision trees to understand and prioritize complex work tasks (examples provided)
The OD team has developed ideas on quality scoring, with examples and discussion points for our second working group session
Data Quality Perspectives
Data Quality - An Open Data Perspective
Change the 80:20 rule of thumb (80% of time spent cleaning data, 20% analyzing it)
Focus on giving users an understanding of the overall state of the data, rather than the accuracy of individual records
Give users an understanding of where data quality improvements may be needed in each dataset
Currently covers structured data only; unstructured, real-time, and other data may follow
Tidy Data as a Principle
The principles of tidy data provide a standard way to organise data values within a dataset
Tidy data is a standard way of mapping the meaning of a dataset to its structure
A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
Links: Tidy Data examples; paper on Tidy Data
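The tidying step above can be sketched in a few lines of pandas; the site/year data below is a made-up example, not one of our datasets:

```python
import pandas as pd

# Hypothetical messy layout: one row per site, one column per year,
# so the variable "year" is spread across the column headers.
messy = pd.DataFrame({
    "site": ["A", "B"],
    "2017": [10, 20],
    "2018": [15, 25],
})

# Tidy layout: each variable (site, year, count) is a column,
# and each observation (one site in one year) is a row.
tidy = messy.melt(id_vars="site", var_name="year", value_name="count")
print(tidy)
```

Each row of `tidy` is now a single observation, which is what downstream tools (grouping, joining, plotting) expect.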
Identified dimensions used to determine data quality via a literature review of academic journals and a horizon scan of industry sources
What DQ means to Data Consumers (Journal of MIS)
DQ framework for health care (Journal of Decision Systems)
DQ Framework (Bank of England)
Open Data & Data Quality Report (Open Data Enterprise)
Open Data & Metadata Quality (European Union)
Relevancy
Accuracy
Timeliness
Accessibility
Comparability
Coherence
Precision
Completeness
Reliability
Interpretability
Metadata
Machine Readability
Granularity
Non-Redundancy
Credibility
Identified the dimensions relevant to Open Data; the goal is for data to fit its purpose rather than to meet every dimension (data that meets them all is what we call “unicorn” data)
DIMENSIONS FOR OPEN DATA
Interpretability
Metadata
Machine Readability
Granularity
Accessibility
Relevance
Comparability
Completeness
Timeliness

DIMENSIONS FOR DATA PUBLISHERS
Coherence
Accuracy
Reliability
Precision
Non-Redundancy

OUTSIDE SCOPE
Credibility
Definitions of the identified data quality dimensions
Accessibility: Is the data accessible now and over time?
Relevancy: How useful is the data for solving the problem at hand?
Accuracy: How representative of reality is the data?
Timeliness: How close to production time is the publication of the data?
Comparability: Does the data follow accepted standards?
Coherence: Is the data free of contradictions? (It is generally easier to show cases of incoherence than to prove coherence.)
Precision: How exact is the data?
Completeness: How much of the data is missing?
Reliability: How trustworthy is the data?
Interpretability: How easy is the data to understand, given its structure?
Metadata: Is the data well described?
Machine Readability: How easy is it for users to work with the data?
Granularity: How atomic is the data? What is the scale/level of detail within the data?
Non-Redundancy: Do the records of the data represent unique items?
Credibility: How trustworthy is the source of the data?
Converging on Dimensions & Metrics
Settled on dimensions and identified key metrics to signal performance, but we’ve yet to determine how to measure and weight some of them
| Open data dimension | User-friendly definition | Example metrics |
|---|---|---|
| Metadata | Is the data well described? | Metadata field filled (True/False); high-quality metadata content (Unknown) |
| Granularity | How atomic is the data? | Degree of aggregation in open dataset compared to source system (Unknown); row count difference between source system and published data (Number) |
| Accessibility | Is the data easy to access? | Data can be accessed directly via API (True/False) |
| Completeness | How much data is missing? | Percentage of observations with missing values (Number) |
| Usability | How easy is it to work with the data? | Ability to join with other datasets (Unknown); valid geometry (True/False); long dataset shape, for visualization in OD Portal (Unknown); geo slivers (Number); nested JSON fields, number or depth (Number) |
| Interpretability | How easy is it to understand the data? | Percentage of columns with constant values (Number); consistency within dataset, e.g. terms, structures (Unknown) |
| Freshness | How close to creation time is the data published? | Gap between collection and publication datetime (Number); gap between published refresh rate and date last refreshed (Number) |
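A few of these metrics are straightforward to compute; the sketch below shows completeness, constant columns (interpretability), and the freshness gap with pandas. The dataset, column names, and dates are made up for illustration:

```python
import pandas as pd

# Hypothetical published dataset (not a real catalogue entry).
df = pd.DataFrame({
    "permit_id": [1, 2, 3, 4],
    "status": ["open", "open", None, "closed"],
    "city": ["Toronto", "Toronto", "Toronto", "Toronto"],  # constant column
})

# Completeness metric: percentage of cells with missing values.
pct_missing = df.isna().sum().sum() / df.size * 100

# Interpretability metric: percentage of columns holding one constant value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
pct_constant = len(constant_cols) / len(df.columns) * 100

# Freshness metric: gap in days between the scheduled and actual refresh.
expected = pd.Timestamp("2019-05-01")  # published refresh schedule (assumed)
actual = pd.Timestamp("2019-05-15")    # date last refreshed (assumed)
freshness_gap_days = (actual - expected).days

print(pct_missing, pct_constant, freshness_gap_days)
```

True/False metrics (API availability, valid geometry) would come from the portal and geospatial tooling rather than the data frame itself.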
Diagram: open data dimensions (Interpretability, Metadata, Machine Readability, Granularity, Accessibility, Relevance, Comparability, Completeness, Timeliness), with some moved to the prioritization framework
Weighting Approach
Proof of concept demo (Jupyter Notebook)
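Since how to weight the dimensions is still an open question, one candidate is a simple weighted sum over normalized metric values. The weights and example values below are illustrative assumptions, not the working group's agreed figures:

```python
# Illustrative weights per dimension (assumed, not agreed); they sum to 1.
weights = {
    "metadata": 0.15,
    "granularity": 0.15,
    "accessibility": 0.20,
    "completeness": 0.20,
    "usability": 0.15,
    "freshness": 0.15,
}

def quality_score(metrics):
    """Weighted sum of normalized (0-1) metric values, scaled to 0-100."""
    return sum(weights[dim] * value for dim, value in metrics.items()) * 100

# Hypothetical dataset with normalized per-dimension scores.
example = {
    "metadata": 1.0,       # all metadata fields filled
    "granularity": 0.8,
    "accessibility": 1.0,  # API available
    "completeness": 0.9,
    "usability": 0.5,
    "freshness": 0.7,
}
print(quality_score(example))
```

A decision-tree scheme (as used for prioritization) could replace the weighted sum if some dimensions should act as gates rather than contribute linearly.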
Example data-driven decision making in Open Data: Dataset Migration
Migrating datasets in mixed formats, from multiple sources and stages of readiness, using different collection methods presents a significant challenge
CURRENT CATALOGUE
292+ datasets
1,300+ individual files
15+ file formats
5+ publishing methods
SAMPLE ISSUES
― Unable to visualize data in new portal due to dataset shape
― Data is not current to the specified refresh rate
― Dataset contains missing values or columns
― No visibility into data refresh status
― Cannot automate in pipeline due to file format (e.g. PDF)
― Data extract does not reflect source, making lineage difficult
To address these issues, the migration approach is to score datasets against six criteria, group them into batches, and migrate them from easiest to hardest
SOURCE CONNECTIVITY: Ability to extract directly from the source system
MACHINE READABILITY: Ability to extract in formats suitable for analysis and manipulation
USAGE: Degree to which a dataset is downloaded
FRESHNESS: Currency of data relative to the established refresh rate
GRANULARITY: Closeness of data to the atomic level of detail
OWNERSHIP: Level of proprietary or similar licensing requirements
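The easiest-first batching idea can be sketched as follows; the dataset names, the 1-5 scale (5 = easiest), and the scores are all illustrative assumptions:

```python
# Score each dataset on the six migration criteria (values assumed).
datasets = {
    "parking-tickets": {"source_connectivity": 5, "machine_readability": 5,
                        "usage": 4, "freshness": 4, "granularity": 5, "ownership": 5},
    "zoning-maps":     {"source_connectivity": 2, "machine_readability": 3,
                        "usage": 3, "freshness": 2, "granularity": 3, "ownership": 4},
    "scanned-permits": {"source_connectivity": 1, "machine_readability": 1,  # PDFs
                        "usage": 2, "freshness": 1, "granularity": 2, "ownership": 3},
}

# Rank datasets from easiest to hardest to migrate by total score.
ranked = sorted(datasets, key=lambda d: sum(datasets[d].values()), reverse=True)

# Split the ranked list into three batches (Batch 1 = easiest).
batch_size = max(1, len(ranked) // 3)
batches = [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]
print(batches)
```

In practice each criterion would likely need its own scale and weighting (e.g. a PDF-only dataset may be disqualifying for automation regardless of its other scores).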
Diagram: Current Catalogue → Migration Criteria → Dataset Prioritization → Migration Schedule (Batch 1, Batch 2, Batch 3, Archive), starting Dec 2018
Data Migration Scoring and Rationale
Dataset Migration Batches
Dataset by Owner and Batch
Tracking Data Migration (Internal and External audiences)
Open Floor