1 of 26

Darwin Core Hour

connecting biodiversity data communities

Data Management Interest Group

Webinar Series

2 of 26

Darwin Core Hour

Webinar Series

A Bite from The Core

Testing for Data Quality

Lee Belbin

&

Arthur Chapman

TDWG

DQIG

3 of 26

Previous Darwin Core Webinars

John Wieczorek “Even Simple is Hard” (Chapter 2)

Paula Zermoglio “Controlled Vocabularies” (Chapter 3)

4 of 26

Previous Darwin Core Webinars

John Wieczorek “Where am I Exactly?” (Chapter 6)

5 of 26

Previous Darwin Core Webinars

Kyle Braak, Andrea Hahn and Deb Paul: “Aggregators – GBIF and iDigBio” (Chapter 7)

6 of 26

TDWG Data Quality Interest Group

Task Groups

    • TG1: Framework for Data Quality
    • TG2: Data Quality Tests and Assertions
    • TG3: Use Case Library
    • [Controlled Vocabularies]
    • [Data Fitness for Use in Research on Invasive Alien Species]

Formally Established at TDWG2014 (Jönköping, Sweden)

7 of 26

TG1 - Framework on Data Quality

Veiga, A.K. et al. (2017). A conceptual framework for quality assessment and management of biodiversity data. PLOS ONE 12(6): e0178731. https://doi.org/10.1371/journal.pone.0178731

8 of 26

Data Quality: Fitness for Use

Example: verbatimEventDate = 03/06/1989

This is ambiguous:

USA -> 6 March 1989

Australia -> 3 June 1989

Use 1: Species Distribution Model -> Data OK

Use 2: Phenology -> Data NOT OK

Improve:

    • Check other records in dataset
    • Check other records by same collector
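The ambiguity in the example above can be detected mechanically. The following is a minimal sketch, assuming the verbatim value has the numeric form A/B/YYYY. The function name `classify_verbatim_date` and its status labels are illustrative only and are not taken from any TG2 specification. It checks coarse ranges (month 1 to 12, day 1 to 31) and does not validate days per month.

```python
import re


def classify_verbatim_date(value):
    """Classify a numeric A/B/YYYY date string.

    Returns (status, candidates), where candidates is a sorted list of
    plausible (year, month, day) readings of the string.
    """
    match = re.fullmatch(r"\s*(\d{1,2})/(\d{1,2})/(\d{4})\s*", value)
    if match is None:
        return "unparsed", []
    first, second, year = (int(group) for group in match.groups())
    candidates = set()
    # Day-first reading (e.g. Australia): first = day, second = month.
    if 1 <= second <= 12 and 1 <= first <= 31:
        candidates.add((year, second, first))
    # Month-first reading (e.g. USA): first = month, second = day.
    if 1 <= first <= 12 and 1 <= second <= 31:
        candidates.add((year, first, second))
    ordered = sorted(candidates)
    if len(ordered) > 1:
        return "ambiguous", ordered
    if len(ordered) == 1:
        return "unambiguous", ordered
    return "invalid", []


status, readings = classify_verbatim_date("03/06/1989")
print(status, readings)
```

Both readings of 03/06/1989 share the year, which is presumably why the record stays usable where only the year matters, while a phenology study, which depends on month and day, cannot use it without further evidence.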

9 of 26

Task Group 2: Tests and Assertions

Abby Benson

Alex Thompson

Allan Koch Veiga

Anne-Sophie Archambeau

Arthur Chapman

Arturo H. Ariño

Bertram Ludaescher

Christian Gendreau

Dairo Escobar

Daniel Amariles

Daniel Lins

Danny Velez Velandia

(Dave Watts)

David Bloom

David Fichtmueller

Debbie Paul

Dimitri Brosens

Dmitry Schigel

Elspeth Haston

Gloeckler Falko

Hanna Koivula

James Macklin

Javier Otegui

John Wieczorek

Lee Belbin

Luiz Gadelha

Marie-Elise Lecoq

Matt Collins

Matthias Obst

Nelyda Beltran

Nicolas Noe

Paul J. Morris

Paula Zermoglio

Rui Figueira

Shelley James

Sophie Pamerlon

Vijay Barve

(9/37)

10 of 26

Basics

TG2 was originally tasked with reviewing “Tools, services and workflows”. Because these components of ‘data quality’ (fitness for use) depend on the results of ‘automated’ tests, it seemed wise to me to focus on the tests first.

11 of 26

The Task

If we could come up with a suite of core tests that was implemented by everyone from data collector to data aggregator, we would move one step closer to interoperability.

Or, to put it in practical terms:

The tests run at the Atlas of Living Australia (the GBIF Australian Node) and associated with its records are ignored by GBIF, which runs its own (different) tests.

12 of 26

Agencies Searched for Tests-Assertions

ALA

BISON

CRIA

GBIF

iDigBio

OBIS

VertNet

13 of 26

An (easy) Example

Test 12a: The value given for DwC:day is less than 1 or greater than 31

Most are not quite that simple ☺
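Test 12a is simple enough to write down directly. This is a minimal sketch, not reference code from TG2: the function name `day_out_of_range` is hypothetical, and returning `None` for empty or non-integer values (leaving those cases to other tests) is an assumption.

```python
def day_out_of_range(day):
    """Return True when the day value is less than 1 or greater than 31.

    Returns False when the value is an integer in 1..31, and None when the
    value is empty or not an integer (assumed to be out of scope here).
    """
    if day is None or str(day).strip() == "":
        return None
    try:
        value = int(str(day).strip())
    except ValueError:
        return None
    return value < 1 or value > 31


for sample in ("15", "32", 0, ""):
    print(repr(sample), day_out_of_range(sample))
```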

14 of 26

Principles (derived in the process)

4. The tests and resulting assertions will be based on Darwin Core Terms

15 of 26

Principles (derived in the process)

5. Darwin Core terms are either verbatim (cannot be validated, but can be interpreted or used to improve other term values) or bound by a vocabulary* or by extents (and therefore checkable), so each DwC term has a valid test type!

*Thank you Paula!
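Principle 5 can be illustrated by attaching a test type to each term. The sketch below is an assumption-laden illustration: the `TERM_RULES` table, the three kind labels and the function `check_term` are hypothetical, and the vocabulary shown for basisOfRecord is only a subset of commonly used values.

```python
# Hypothetical rule table: term -> (kind of term, rule used to check it).
TERM_RULES = {
    "verbatimEventDate": ("verbatim", None),
    "day": ("extent", (1, 31)),
    "month": ("extent", (1, 12)),
    "decimalLatitude": ("extent", (-90, 90)),
    "basisOfRecord": (
        "vocabulary",
        {"PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
         "HumanObservation", "MachineObservation", "MaterialSample"},
    ),
}


def check_term(term, value):
    """Return True if valid, False if invalid, None if not checkable."""
    kind, rule = TERM_RULES[term]
    if kind == "verbatim":
        return None  # verbatim terms can be interpreted, not validated
    if kind == "vocabulary":
        return value in rule
    low, high = rule  # kind == "extent"
    try:
        return low <= float(value) <= high
    except (TypeError, ValueError):
        return False


print(check_term("decimalLatitude", "-23.712"))
print(check_term("day", "32"))
print(check_term("verbatimEventDate", "03/06/1989"))
```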

16 of 26

Principles (derived in the process)

6. Criteria for test inclusion:

    • Informative
    • Easy to implement
    • Mandatory for amendments
    • In use with decent % of hits

17 of 26

Principles (derived in the process)

11. Missing (null) Darwin Core Terms will not normally create an assertion unless all name, all spatial or all temporal terms are missing values
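A minimal sketch of principle 11 follows. The grouping of terms into name, spatial and temporal sets in `TERM_GROUPS` is an illustrative assumption (the slide does not list the members of each group), and the function names are hypothetical.

```python
# Assumed, incomplete groupings of Darwin Core terms for illustration only.
TERM_GROUPS = {
    "name": ["scientificName", "genus", "specificEpithet"],
    "spatial": ["decimalLatitude", "decimalLongitude", "country", "locality"],
    "temporal": ["eventDate", "verbatimEventDate", "year", "month", "day"],
}


def is_missing(value):
    """Treat None and blank strings as missing (null) values."""
    return value is None or str(value).strip() == ""


def missing_group_assertions(record):
    """Return the groups in which every term is missing from the record.

    A single missing term raises nothing; only a whole empty group does.
    """
    return [
        group
        for group, terms in TERM_GROUPS.items()
        if all(is_missing(record.get(term)) for term in terms)
    ]


record = {"scientificName": "Harpullia pendula", "year": "2013", "day": ""}
print(missing_group_assertions(record))
```

In this example the missing day raises nothing on its own, because year is present, while the record is reported as lacking all spatial terms.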

18 of 26

Principles (derived in the process)

15. We anticipate non-core domain-specific tests.

For example, OBIS may want to implement “minimum depth in meters is greater than indicated on GEBCO chart”.

19 of 26

Fields Describing the Tests

Term                        Description
Test#                       Local test number
GUID                        Globally Unique Identifier
Original test ID            Test identifier from original source
Variable name               Name to be used in code
Description (test - FAIL)   Description if the test fails
Description (test - PASS)*  Description if test passes (TG1)
Specification               Technical description
Record Resolution           Single or multiple record dependency
Term Resolution             Single or multiple term dependency
Data Dependency             Internal (Darwin Core) occurrence records or external data
Warning Type                Nature of the issue
Example                     At least one simple example
Darwin Core Class           The DwC Class focus for the test
Darwin Core Terms           The DwC Terms used for the test
DQ Dimension                Conformance, consistency, completeness, resolution etc (TG1)
Source                      The source of the test (agency or individual)
References                  Directly related references
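The fields above map naturally onto a record type. The following is a sketch only: the class name `DataQualityTest` and the attribute names are hypothetical, and the instance is filled with the values of test 157 as given elsewhere in this deck.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataQualityTest:
    """One test/assertion, described with the fields listed on the slide."""
    test_number: int
    guid: str
    variable_name: str
    description_fail: str
    specification: str
    record_resolution: str  # e.g. "SingleRecord"
    term_resolution: str    # e.g. "MultiTerm"
    data_dependency: str    # "Internal" or "External"
    warning_type: str
    example: str
    dwc_class: str
    dwc_terms: List[str]
    dq_dimension: str
    source: str
    original_test_id: str = ""
    description_pass: str = ""
    references: List[str] = field(default_factory=list)


event_date_test = DataQualityTest(
    test_number=157,
    guid="6d0a0c10-5e4a-4759-b448-88932f399812",
    variable_name="EVENTDATE_FROM_VERBATIM",
    description_fail="Event date was interpreted from other date fields",
    specification="eventDate=interpret(all event date fields)",
    record_resolution="SingleRecord",
    term_resolution="MultiTerm",
    data_dependency="Internal",
    warning_type="Amended",
    example="day=2,month=3,year=2013, therefore eventDate=2013-03-02",
    dwc_class="Event",
    dwc_terms=["eventDate", "verbatimEventDate", "year", "month", "day"],
    dq_dimension="Completeness",
    source="VertNet",
)
print(event_date_test.variable_name, event_date_test.warning_type)
```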

20 of 26

Warnings

Ambiguous: e.g. 03/06/1989 (i.e. 3 June or 6 March?)

Amended: e.g. Latitude made negative to conform with country

Incomplete: e.g. Event date only year and month

Inconsistent: e.g. Lat/long is inconsistent with country

Invalid: e.g. Day=32

Unlikely: e.g. Event Date = 01/01/1900
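In code, the warning types listed on this slide could be held as a closed set of values. This is a sketch under the assumption that the six types are an enumeration; the names `WarningType` and `parse_warning_type` are hypothetical.

```python
from enum import Enum


class WarningType(Enum):
    """The six warning types named on this slide."""
    AMBIGUOUS = "Ambiguous"
    AMENDED = "Amended"
    INCOMPLETE = "Incomplete"
    INCONSISTENT = "Inconsistent"
    INVALID = "Invalid"
    UNLIKELY = "Unlikely"


def parse_warning_type(label):
    """Map a free-text label to a WarningType; raises ValueError if unknown."""
    return WarningType(label.strip().capitalize())


print(parse_warning_type(" invalid "))
```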

21 of 26

#: 157
GUID: 6d0a0c10-5e4a-4759-b448-88932f399812
IDs:
Variable: EVENTDATE_FROM_VERBATIM
Description (test - FAIL): Event date (dwc:eventDate) was interpreted from other date fields (dwc:verbatimEventDate, dwc:year, dwc:month and dwc:day)
Specification: eventDate=interpret(all event date fields)
Record Resolution: SingleRecord
Term Resolution: MultiTerm
Data Dependency: Internal
Example: day=2, month=3, year=2013, therefore eventDate=2013-03-02
Darwin Core Class: Event
Darwin Core Terms: eventDate, verbatimEventDate, year, month, day
DQ Dimension: Completeness
Warning Type: Amended
Source: VertNet
References:
Mechanisms: event_date_qc
Link to Specification Source Code:
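The example for this test can be sketched as follows. This is not the VertNet `event_date_qc` code: the function name `event_date_from_parts` is hypothetical, it only handles the year/month/day case shown in the example (verbatimEventDate is not interpreted), and filling eventDate only when it is empty is an assumption.

```python
from datetime import date


def event_date_from_parts(record):
    """Fill eventDate from year, month and day when eventDate is empty.

    Returns (record, amended). The input record is never modified; when an
    amendment is made, a changed copy is returned with amended=True.
    """
    if str(record.get("eventDate") or "").strip():
        return record, False  # nothing to do, eventDate already present
    try:
        interpreted = date(
            int(record["year"]), int(record["month"]), int(record["day"])
        )
    except (KeyError, TypeError, ValueError):
        return record, False  # parts missing or not a real calendar date
    amended = dict(record)
    amended["eventDate"] = interpreted.isoformat()
    return amended, True


result, flagged = event_date_from_parts({"day": "2", "month": "3", "year": "2013"})
print(result["eventDate"], flagged)
```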

22 of 26

#: 1
GUID: 453844ae-9df4-439f-8e24-c52498eca84a
IDs:
Variable: TESTS_FLAGGED_REPORT
Description: Number of tests that have been run against the record that were flagged
Specification (Technical Description): e.g. Count the number of tests that resulted in "TRUE" (i.e., where a test has failed)
Record Resolution: SingleRecord
Term Resolution: MultiTerm
Data Dependency: Internal
Example: 7 issues were flagged
Darwin Core Class: All
Darwin Core Terms: All
DQ Dimension: Reliability?
Warning Type: Report
Source: Lee Belbin

23 of 26

#: 59
GUID: 620749b9-7d9c-4890-97d2-be3d1cde6da8
IDs: 45
Variable: DECIMAL_LAT_LONG_CONVERTED
Description (test - FAIL): Decimal latitude and longitude and geodeticDatum were converted from another datum, with resulting implications for coordinate uncertainty and precision
Specification (Technical Description): georeference was converted to a new datum
Record Resolution: SingleRecord
Term Resolution: MultiTerm
Data Dependency: External
Example: decimalLatitude=-23.712, decimalLongitude=139.923, geodeticDatum=GDA94 converted to decimalLatitude=-23.712, decimalLongitude=139.923, geodeticDatum=WGS84 (EPSG:4326)
Darwin Core Class: Location
Darwin Core Terms: decimalLatitude, decimalLongitude, geodeticDatum
DQ Dimension: Conformance
Warning Type: Amended
Source: ALA, GBIF

24 of 26

#: 147
GUID: f01fb3f9-2f7e-418b-9f51-adf50f202aea
IDs:
Variable: SCIENTIFIC_NAME_ADDED
Description (test - FAIL): The scientific name (dwc:scientificName) has been added by concatenating genus, specificEpithet, infraspecificEpithet and scientificNameAuthorship
Specification (Technical Description): scientificName=genus+specificEpithet+infraspecificEpithet+scientificNameAuthorship
Record Resolution: SingleRecord
Term Resolution: MultiTerm
Data Dependency: Internal
Example: scientificName="Harpullia pendula F.Muell." from genus="Harpullia" + specificEpithet="pendula" + scientificNameAuthorship="F.Muell."
Darwin Core Class: Taxon
Darwin Core Terms: scientificName, genus, specificEpithet, infraspecificEpithet, scientificNameAuthorship
DQ Dimension: Completeness
Warning Type: Amended
Source: iDigBio
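The concatenation in this test can be sketched as below. This is not iDigBio's implementation: the function name `add_scientific_name` is hypothetical, and two behaviours are assumptions, namely that an existing scientificName is left untouched and that no name is built without a genus.

```python
NAME_PARTS = (
    "genus",
    "specificEpithet",
    "infraspecificEpithet",
    "scientificNameAuthorship",
)


def add_scientific_name(record):
    """Build scientificName from its parts when it is empty.

    Returns (record, amended); the input record is never modified.
    """
    if str(record.get("scientificName") or "").strip():
        return record, False  # a name is already present
    parts = [str(record.get(term) or "").strip() for term in NAME_PARTS]
    if not parts[0]:
        return record, False  # assumed: a genus is required to build a name
    amended = dict(record)
    # Join only the parts that are present, separated by single spaces.
    amended["scientificName"] = " ".join(part for part in parts if part)
    return amended, True


record = {
    "genus": "Harpullia",
    "specificEpithet": "pendula",
    "scientificNameAuthorship": "F.Muell.",
}
print(add_scientific_name(record))
```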

25 of 26

Next Steps?

  1. There are currently 117 tests/assertions, including some ad hoc additions, so a request has been sent to TG2 to score their utility (by September 8)
  2. Need a test data suite that exercises all tests (by TDWG 2017)
  3. Need generic demonstration code for all tests (ideally by TDWG 2017)
  4. There is a commitment by GBIF, iDigBio and the ALA to implement these tests/assertions as soon as possible (dependent on 1-3). There is also a commitment by the same agencies to better align basic interfaces and data queries and exports.
  5. Development of the Darwin Core Controlled Vocabularies (TG4)
  6. TDWG 2017 all day meeting on ‘data quality’ to resolve any outstanding issues
  7. Submit the suite to TDWG as a standard (first quarter 2018)

26 of 26

The End
