Open Data Quality Working Group
30 May 2019
Open Data Quality Scoring
The second meeting of the working group will focus on data quality scoring, building on:
During our first working group session we discussed the elements we consider important for data quality
The OD team has successfully used scoring decision trees to understand and prioritize complex work tasks (examples provided)
The OD team has developed ideas on quality scoring, with examples and discussion points for our second working group session
Data Quality Perspectives
Data Quality - An Open Data Perspective
Change the 80:20 rule of thumb (80% of time spent cleaning data, 20% analyzing it)
Focus on giving users an understanding of the overall state of the data, rather than the accuracy of individual records
Give users an understanding of where data quality improvements may be needed in each dataset
Currently covers structured data only; unstructured, real-time, and other data may follow
Tidy Data as a Principle
The principles of tidy data provide a standard way to organise data values within a dataset
Tidy data is a standard way of mapping the meaning of a dataset to its structure
A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
Links: Tidy Data examples; paper on Tidy Data
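The tidying step above can be sketched in a few lines of pandas; the site/year data below is a made-up example, not one of our datasets:

```python
import pandas as pd

# Hypothetical messy layout: one row per site, one column per year,
# so the variable "year" is spread across the column headers.
messy = pd.DataFrame({
    "site": ["A", "B"],
    "2017": [10, 20],
    "2018": [15, 25],
})

# Tidy layout: each variable (site, year, count) is a column,
# and each observation (one site in one year) is a row.
tidy = messy.melt(id_vars="site", var_name="year", value_name="count")
print(tidy)
```

Each row of `tidy` is now a single observation, which is what downstream tools (grouping, joining, plotting) expect.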
Identified dimensions used to determine data quality via a literature review of academic journals and a horizon scan of industry sources
What DQ means to Data Consumers (Journal of MIS)
DQ framework for health care (Journal of Decision Systems)
DQ Framework (Bank of England)
Open Data & Data Quality Report (Open Data Enterprise)
Open Data & Metadata Quality (European Union)
Relevancy
Accuracy
Timeliness
Accessibility
Comparability
Coherence
Precision
Completeness
Reliability
Interpretability
Metadata
Machine Readability
Granularity
Non-Redundancy
Credibility
Identified the dimensions relevant to Open Data; the goal is for data to fit its purpose rather than to meet every dimension (data that meets them all is what we call “unicorn” data)
DIMENSIONS FOR OPEN DATA
Interpretability
Metadata
Machine Readability
Granularity
Accessibility
Relevance
Comparability
Completeness
Timeliness

DIMENSIONS FOR DATA PUBLISHERS
Coherence
Accuracy
Reliability
Precision
Non-Redundancy

OUTSIDE SCOPE
Credibility
Definitions of the identified data quality dimensions
Accessibility: Is the data accessible now and over time?
Relevancy: How useful is the data for solving the problem at hand?
Accuracy: How representative of reality is the data?
Timeliness: How close to production time is the publication of the data?
Comparability: Does the data follow accepted standards?
Coherence: Is the data free of contradictions? (It is generally easier to show cases of incoherence than to prove coherence.)
Precision: How exact is the data?
Completeness: How much of the data is missing?
Reliability: How trustworthy is the data?
Interpretability: How easy is the data to understand, given its structure?
Metadata: Is the data well described?
Machine Readability: How easy is it for users to work with the data?
Granularity: How atomic is the data? What is the scale/level of detail within the data?
Non-Redundancy: Do the records of the data represent unique items?
Credibility: How trustworthy is the source of the data?
Converging on Dimensions & Metrics
Settled on dimensions and identified key metrics to signal performance, but we’ve yet to determine how to measure and weight some of them
| Open data dimension | User-friendly definition | Example metrics |
|---|---|---|
| Metadata | Is the data well described? | Metadata field filled (True/False); high-quality metadata content (Unknown) |
| Granularity | How atomic is the data? | Degree of aggregation in open dataset compared to source system (Unknown); row count difference between source system and published data (Number) |
| Accessibility | Is the data easy to access? | Data can be accessed directly via API (True/False) |
| Completeness | How much data is missing? | Percentage of observations with missing values (Number) |
| Usability | How easy is it to work with the data? | Ability to join with other datasets (Unknown); valid geometry (True/False); long dataset shape, for visualization in OD Portal (Unknown); geo slivers (Number); nested JSON fields, number or depth (Number) |
| Interpretability | How easy is it to understand the data? | Percentage of columns with constant values (Number); consistency within dataset, e.g. terms, structures (Unknown) |
| Freshness | How close to creation time is the data published? | Gap between collection and publication datetime (Number); gap between published refresh rate and date last refreshed (Number) |
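A few of these metrics are straightforward to compute; the sketch below shows completeness, constant columns (interpretability), and the freshness gap with pandas. The dataset, column names, and dates are made up for illustration:

```python
import pandas as pd

# Hypothetical published dataset (not a real catalogue entry).
df = pd.DataFrame({
    "permit_id": [1, 2, 3, 4],
    "status": ["open", "open", None, "closed"],
    "city": ["Toronto", "Toronto", "Toronto", "Toronto"],  # constant column
})

# Completeness metric: percentage of cells with missing values.
pct_missing = df.isna().sum().sum() / df.size * 100

# Interpretability metric: percentage of columns holding one constant value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
pct_constant = len(constant_cols) / len(df.columns) * 100

# Freshness metric: gap in days between the scheduled and actual refresh.
expected = pd.Timestamp("2019-05-01")  # published refresh schedule (assumed)
actual = pd.Timestamp("2019-05-15")    # date last refreshed (assumed)
freshness_gap_days = (actual - expected).days

print(pct_missing, pct_constant, freshness_gap_days)
```

True/False metrics (API availability, valid geometry) would come from the portal and geospatial tooling rather than the data frame itself.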
Diagram: open data dimensions (Interpretability, Metadata, Machine Readability, Granularity, Accessibility, Relevance, Comparability, Completeness, Timeliness), with some moved to the prioritization framework
Weighting Approach
Proof of concept demo (Jupyter Notebook)
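Since how to weight the dimensions is still an open question, one candidate is a simple weighted sum over normalized metric values. The weights and example values below are illustrative assumptions, not the working group's agreed figures:

```python
# Illustrative weights per dimension (assumed, not agreed); they sum to 1.
weights = {
    "metadata": 0.15,
    "granularity": 0.15,
    "accessibility": 0.20,
    "completeness": 0.20,
    "usability": 0.15,
    "freshness": 0.15,
}

def quality_score(metrics):
    """Weighted sum of normalized (0-1) metric values, scaled to 0-100."""
    return sum(weights[dim] * value for dim, value in metrics.items()) * 100

# Hypothetical dataset with normalized per-dimension scores.
example = {
    "metadata": 1.0,       # all metadata fields filled
    "granularity": 0.8,
    "accessibility": 1.0,  # API available
    "completeness": 0.9,
    "usability": 0.5,
    "freshness": 0.7,
}
print(quality_score(example))
```

A decision-tree scheme (as used for prioritization) could replace the weighted sum if some dimensions should act as gates rather than contribute linearly.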
Example data-driven decision making in Open Data: Dataset Migration
Migrating datasets in mixed formats, from multiple sources and stages of readiness, using different collection methods presents a significant challenge
CURRENT CATALOGUE
292+ datasets
1,300+ individual files
15+ file formats
5+ publishing methods
SAMPLE ISSUES
― Unable to visualize data in new portal due to dataset shape
― Data is not current to the specified refresh rate
― Dataset contains missing values or columns
― No visibility into data refresh status
― Cannot automate in pipeline due to file format (e.g. PDF)
― Data extract does not reflect source, making lineage difficult
To address these issues, the migration approach is to score datasets against six criteria, group them into batches, and migrate them from easiest to hardest
SOURCE CONNECTIVITY: Ability to extract directly from the source system
MACHINE READABILITY: Ability to extract in formats suitable for analysis and manipulation
USAGE: Degree to which a dataset is downloaded
FRESHNESS: Currency of data relative to the established refresh rate
GRANULARITY: Closeness of data to the atomic level of detail
OWNERSHIP: Level of proprietary or similar licensing requirements
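The easiest-first batching idea can be sketched as follows; the dataset names, the 1-5 scale (5 = easiest), and the scores are all illustrative assumptions:

```python
# Score each dataset on the six migration criteria (values assumed).
datasets = {
    "parking-tickets": {"source_connectivity": 5, "machine_readability": 5,
                        "usage": 4, "freshness": 4, "granularity": 5, "ownership": 5},
    "zoning-maps":     {"source_connectivity": 2, "machine_readability": 3,
                        "usage": 3, "freshness": 2, "granularity": 3, "ownership": 4},
    "scanned-permits": {"source_connectivity": 1, "machine_readability": 1,  # PDFs
                        "usage": 2, "freshness": 1, "granularity": 2, "ownership": 3},
}

# Rank datasets from easiest to hardest to migrate by total score.
ranked = sorted(datasets, key=lambda d: sum(datasets[d].values()), reverse=True)

# Split the ranked list into three batches (Batch 1 = easiest).
batch_size = max(1, len(ranked) // 3)
batches = [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]
print(batches)
```

In practice each criterion would likely need its own scale and weighting (e.g. a PDF-only dataset may be disqualifying for automation regardless of its other scores).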
Diagram: Current Catalogue → Migration Criteria → Dataset Prioritization → Migration Schedule (Batch 1, Batch 2, Batch 3, Archive), starting Dec 2018
Data Migration Scoring and Rationale
Dataset Migration Batches
Dataset by Owner and Batch
Tracking Data Migration (Internal and External audiences)
Open Floor