
Open Data Quality Working Group

30 May 2019


Open Data Quality Scoring

The second meeting of the working group will focus on the elements of data quality scoring, primarily:

  1. Choosing the parameters to use during data quality scoring
  2. Prioritizing the weighting of the parameters
  3. Developing criteria / a decision tree for data quality scoring
  4. Discussing the minimum viable product (MVP) and version development

During our first working group session, we discussed the elements we consider important for data quality.

The OD team has successfully used scoring decision trees to understand and prioritize complex work tasks (examples provided).

The OD team has prepared ideas, examples, and discussion points related to quality scoring for our second working group session.


Data Quality Perspectives



Data Quality - An Open Data Perspective

Change the 80:20 rule of thumb: 80% of time spent cleaning data, 20% analyzing it

Focus on giving users an understanding of the overall state of the data, rather than the accuracy of individual values

Provide users an understanding of the areas where data quality attention may be required for each dataset

Currently covers structured data only; future versions to include unstructured, real-time, etc.


Tidy Data as a Principle

The principles of tidy data provide a standard way to organise data values within a dataset

Tidy data is a standard way of mapping the meaning of a dataset to its structure

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

Each variable forms a column

Each observation forms a row

Each type of observational unit forms a table

Links: Tidy Data examples · Paper on Tidy Data
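As an illustrative sketch (not part of the deck), the pandas snippet below reshapes a hypothetical "messy" table, where years appear as column headers, into tidy form where each variable is a column and each observation is a row:

import pandas as pd

# Hypothetical "messy" table: one row per region, one column per year.
messy = pd.DataFrame({
    "region": ["North", "South"],
    "2017": [120, 95],
    "2018": [130, 101],
})

# Tidy form: each variable (region, year, value) is a column and each
# observation (one region-year measurement) is a row.
tidy = messy.melt(id_vars="region", var_name="year", value_name="value")
print(tidy)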


Identified dimensions used to determine data quality via a literature review of academic journals and a horizon scan of industry sources


Sources reviewed:

  What DQ Means to Data Consumers (Journal of MIS)
  DQ framework for health care (Journal of Decision Systems)
  DQ Framework (Bank of England)
  Open Data & Data Quality Report (Open Data Enterprise)
  Open Data & Metadata Quality (European Union)

Dimensions identified across sources:

  Relevancy, Accuracy, Timeliness, Accessibility, Comparability, Coherence, Precision, Completeness, Reliability, Interpretability, Metadata, Machine Readability, Granularity, Non-Redundancy, Credibility


Identified the dimensions relevant to Open Data, since the goal is data that fits its purpose rather than data that meets every dimension (we call the latter "unicorn" data)


Credibility: outside scope

DIMENSIONS FOR OPEN DATA: Interpretability, Metadata, Machine Readability, Granularity, Accessibility, Relevance, Comparability, Completeness, Timeliness

DIMENSIONS FOR DATA PUBLISHERS: Coherence, Accuracy, Reliability, Precision, Non-Redundancy


Definitions of the identified data quality dimensions


Accessibility: Is the data accessible now and over time?

Relevancy: How useful is the data for solving the problem at hand?

Accuracy: How representative of reality is the data?

Timeliness: How close to production time is the publication of the data?

Comparability: Does the data follow accepted standards?

Coherence: Is the data free of contradictions? (It is generally easier to show cases of incoherence than to prove coherence.)

Precision: How exact is the data?

Completeness: How much of the data is missing?

Reliability: How trustworthy is the data?

Interpretability: How easy is the data to understand (given its structure)?

Metadata: Is the data well described?

Machine Readability: How easy is it for users to work with the data?

Granularity: How atomic is the data? What is the scale/level of detail within the data?

Non-Redundancy: Do the records of the data represent unique items?

Credibility: How trustworthy is the source of the data?


Converging on Dimensions & Metrics



Settled on dimensions and identified key metrics to signal performance, but we’ve yet to determine how to measure and weight some of them


Open data dimension | User-friendly dimension | Example metrics

Metadata | Is the data well described? | Metadata field filled (True/False); High-quality metadata content (Unknown)

Granularity | How atomic is the data? | Degree of aggregation in open dataset compared to source system (Unknown); Row count difference between source system and published data (Number)

Accessibility | Is the data easy to access? | Data can be accessed directly via API (True/False)

Completeness | How much data is missing? | Percentage of observations with missing values (Number)

Usability | How easy is it to work with the data? | Ability to join with other datasets (Unknown); Valid geometry (True/False); Long dataset shape, for visualization in OD Portal (Unknown); Geo slivers (Number); Nested JSON fields, number or depth (Number)

Interpretability | How easy is it to understand the data? | Percentage of columns with constant values (Number); Consistency within dataset, e.g. terms, structures (Unknown)

Freshness | How close to creation time is the data published? | Gap between collection and publication datetime (Number); Gap between published refresh rate and date last refreshed (Number)

Original open data dimensions (Interpretability, Metadata, Machine Readability, Granularity, Accessibility, Relevance, Comparability, Completeness, Timeliness) map to the user-friendly dimensions above; some were moved to the prioritization framework.
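As a hedged illustration (not from the deck), the snippet below sketches how a few of the numeric metrics above could be computed with pandas; the dataset and column names are hypothetical:

import pandas as pd

def example_metrics(df: pd.DataFrame) -> dict:
    """Sketch of a few of the example metrics; field choices are assumptions."""
    n_cols = df.shape[1]
    return {
        # Completeness: percentage of observations (rows) with at least one missing value
        "pct_rows_with_missing": 100 * df.isna().any(axis=1).mean(),
        # Interpretability: percentage of columns holding a single constant value
        "pct_constant_columns": 100 * sum(df[c].nunique(dropna=False) <= 1 for c in df.columns) / n_cols,
        # Non-Redundancy: do the records represent unique items?
        "has_duplicate_rows": bool(df.duplicated().any()),
    }

# Hypothetical dataset with a missing value and a constant column.
df = pd.DataFrame({"id": [1, 2, 3], "value": [10, None, 5], "source": ["A", "A", "A"]})
print(example_metrics(df))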


Weighting Approach

Proof of concept demo (Jupyter Notebook)
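The proof-of-concept notebook itself is not reproduced here. As a hedged sketch of the general idea, a weighting approach could normalize each metric to a 0-1 score and combine the scores with per-dimension weights; the weights, scores, and dimension names below are illustrative assumptions, not the working group's decisions:

# Illustrative weighted quality score; weights and scores are made-up examples.
weights = {
    "metadata": 0.25,
    "completeness": 0.25,
    "freshness": 0.20,
    "accessibility": 0.15,
    "usability": 0.15,
}

# Per-dimension scores normalized to the range [0, 1] for one hypothetical dataset.
scores = {
    "metadata": 1.0,        # metadata fields filled
    "completeness": 0.92,   # 1 - fraction of rows with missing values
    "freshness": 0.50,      # scaled gap between refresh rate and last refresh
    "accessibility": 1.0,   # accessible directly via API
    "usability": 0.75,      # e.g. valid geometry, joinable with other datasets
}

quality_score = sum(weights[d] * scores[d] for d in weights)
print(f"Overall quality score: {quality_score:.2f}")  # out of 1.0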


Example data-driven decision making in Open Data: Dataset Migration



Migrating datasets in mixed formats, from multiple sources and stages of readiness, using different collection methods presents a significant challenge


CURRENT CATALOGUE: 292+ datasets · 1300+ individual files · 15+ file formats · 5+ publishing methods

SAMPLE ISSUES

― Unable to visualize data in new portal due to dataset shape

― Data is not current to the specified refresh rate

― Dataset contains missing values or columns

― No visibility into data refresh status

― Cannot automate in pipeline due to file format (e.g. PDF)

― Data extract does not reflect source, making lineage difficult


To address these issues, the migration approach is to score datasets against six criteria, group them into batches, and migrate them from easiest to hardest (a scoring sketch follows the criteria below)


SOURCE CONNECTIVITY: Ability to extract directly from the source system

MACHINE READABILITY: Ability to extract in formats for analysis and manipulation

USAGE: Degree to which a dataset is downloaded

FRESHNESS: Currency of data relative to the established refresh rate

GRANULARITY: Closeness of data to the atomic level of detail

OWNERSHIP: Level of proprietary or similar licensing requirements

(Migration flow, Dec 2018: Current Catalogue → Migration Criteria → Dataset Prioritization → Migration Schedule: Batch 1, Batch 2, Batch 3, Archive)
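As an illustrative sketch under assumed data (the criteria names follow the slide, but the datasets, scores, and batch size are hypothetical), datasets could be scored against the six criteria, ranked by total score, and grouped into batches from easiest to hardest:

# Hypothetical per-criterion scores (higher = easier to migrate) for a few datasets.
criteria = ["source_connectivity", "machine_readability", "usage",
            "freshness", "granularity", "ownership"]

datasets = {
    "street_trees":   {"source_connectivity": 3, "machine_readability": 3, "usage": 2,
                       "freshness": 3, "granularity": 3, "ownership": 3},
    "permit_history": {"source_connectivity": 1, "machine_readability": 1, "usage": 3,
                       "freshness": 1, "granularity": 2, "ownership": 2},
    "zoning_map":     {"source_connectivity": 2, "machine_readability": 2, "usage": 2,
                       "freshness": 2, "granularity": 2, "ownership": 1},
}

# Rank datasets by total score, easiest (highest score) first.
ranked = sorted(datasets, key=lambda d: sum(datasets[d][c] for c in criteria), reverse=True)

# Group into batches of an assumed size; earlier batches migrate first.
batch_size = 2
batches = [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]
for i, batch in enumerate(batches, start=1):
    print(f"Batch {i}: {batch}")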


Data Migration Scoring and Rationale


Dataset Migration Batches


Dataset by Owner and Batch


Tracking Data Migration (Internal and External audiences)


Open Floor