Darwin Core Hour
connecting biodiversity data communities
Data Management Interest Group
Webinar Series
Webinar Survey: https://tinyurl.com/yagqg2mj Questions: https://tinyurl.com/zja2muz
A Bite from The Core
Testing for Data Quality
Lee Belbin
&
Arthur Chapman
TDWG
DQIG
Previous Darwin Core Webinars
John Wieczorek “Even Simple is Hard” (Chapter 2)
Paula Zermoglio “Controlled Vocabularies” (Chapter 3)
John Wieczorek “Where am I Exactly?” (Chapter 6)
Kyle Braak, Andrea Hahn and Deb Paul: “Aggregators – GBIF and iDigBio” (Chapter 7)
TDWG Data Quality Interest Group
GitHub:
Task Groups
Formally Established at TDWG2014 (Jönköping, Sweden)
TG1 - Framework on Data Quality
Veiga, A.K. et al., A conceptual framework for quality assessment and management of biodiversity data. PLOS ONE 12 (6): https://doi.org/10.1371/journal.pone.0178731
Data Quality: Fitness for Use
Example: verbatimEventDate = 03/06/1989
This is ambiguous:
USA -> 6 March 1989
Australia -> 3 June 1989
Use 1: Species Distribution Model:
Data OK
Use 2: Phenology
Data NOT OK
Improve: Check other records in dataset
Check other records by same collector
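The ambiguity in the example above can be sketched in Python. This is a minimal illustration, not part of the TG2 test suite: the function names are hypothetical, and it assumes (for illustration only) that a species distribution model needs just the year while a phenology study needs the month.

```python
from datetime import date
from typing import Dict


def read_slash_date(verbatim: str) -> Dict[str, date]:
    """Return every valid reading of an 'a/b/yyyy' string (hypothetical helper)."""
    first, second, year = (int(part) for part in verbatim.split("/"))
    candidates = {"month_first": (first, second), "day_first": (second, first)}
    readings = {}
    for label, (month, day) in candidates.items():
        try:
            readings[label] = date(year, month, day)
        except ValueError:
            continue  # e.g. month 25 does not exist, so this reading is dropped
    return readings


def is_ambiguous(readings: Dict[str, date]) -> bool:
    """Ambiguous when the valid readings point to different dates."""
    return len(set(readings.values())) > 1


def fit_for_use(readings: Dict[str, date], needs_month: bool) -> bool:
    """Assumption: a use that only needs the year tolerates day/month ambiguity."""
    if not readings:
        return False
    if not needs_month:
        return len({d.year for d in readings.values()}) == 1
    return not is_ambiguous(readings)


readings = read_slash_date("03/06/1989")
print(is_ambiguous(readings))                     # True
print(fit_for_use(readings, needs_month=False))   # True  (Use 1: year is enough)
print(fit_for_use(readings, needs_month=True))    # False (Use 2: month is needed)
```

A value such as 25/06/1989 has only one valid reading, so it would not be flagged as ambiguous.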
Task Group 2: Tests and Assertions
Abby Benson
Alex Thompson
Allan Koch Veiga
Anne-Sophie Archambeau
Arthur Chapman
Arturo H. Ariño
Bertram Ludaescher
Christian Gendreau
Dairo Escobar
Daniel Amariles
Daniel Lins
Danny Velez Velandia
(Dave Watts)
David Bloom
David Fichtmueller
Debbie Paul
Dimitri Brosens
Dmitry Schigel
Elspeth Haston
Gloeckler Falko
Hanna Koivula
James Macklin
Javier Otegui
John Wieczorek
Lee Belbin
Luiz Gadelha
Marie-Elise Lecoq
Matt Collins
Matthias Obst
Nelyda Beltran
Nicolas Noe
Paul J. Morris
Paula Zermoglio
Rui Figueira
Shelley James
Sophie Pamerlon
Vijay Barve
(9/37)
Basics
We were originally tasked to review “Tools, services and workflows”, but because these components of ‘data quality’ (fitness for use) depend on the results of ‘automated’ tests, it seemed wise to me to focus on the tests first.
The Task
If we could come up with a suite of core tests that were implemented by everyone from data collector to data aggregator, we would move one step closer to interoperability
or to put it in a practical form…
The test results that the Atlas of Living Australia (the GBIF Australian Node) associates with records are ignored by GBIF, which runs its own (different) tests.
Agencies Searched for Tests-Assertions
ALA
BISON
CRIA
GBIF
iDigBio
OBIS
VertNet
An (easy) Example
Test 12a: The value given for DwC:day is less than 1 or greater than 31
Most are not quite that simple ☺
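A minimal Python sketch of Test 12a, under stated assumptions: the function name is hypothetical, and returning None for an empty or non-integer value (rather than raising an assertion) is an illustrative choice, not something the test description specifies.

```python
from typing import Optional


def day_out_of_range(day: Optional[str]) -> Optional[bool]:
    """True = flagged (test fails), False = passes, None = not testable.

    Treating empty or non-integer values as "not testable" is an assumption
    made for this sketch.
    """
    if day is None or not day.strip():
        return None
    try:
        value = int(day.strip())
    except ValueError:
        return None
    return value < 1 or value > 31


print(day_out_of_range("32"))  # True
print(day_out_of_range("15"))  # False
print(day_out_of_range(""))    # None
```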
Principles (derived in the process)
4. The tests and resulting assertions will be based on Darwin Core Terms
5. Darwin Core terms are either verbatim (can't be validated, but can be interpreted or used to improve other term values) or bound by a vocabulary* or extents (and are therefore checkable), so each DwC term has a valid test type!
*Thank you Paula!
6. Criteria for test inclusion
11. Missing (null) Darwin Core Terms will not normally create an assertion unless all name, all spatial or all temporal terms are missing values
15. We anticipate non-core domain-specific tests.
For example, OBIS may want to implement “minimum depth in meters is greater than indicated on GEBCO chart”.
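Principle 11 (missing terms raise an assertion only when a whole group is empty) could be sketched as follows. The grouping of Darwin Core terms into name, spatial and temporal sets is an assumption for illustration; the principle does not list which terms belong to each group, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed grouping of Darwin Core terms; the real groups may differ.
TERM_GROUPS = {
    "name": ("scientificName", "genus", "specificEpithet"),
    "spatial": ("decimalLatitude", "decimalLongitude", "country", "locality"),
    "temporal": ("eventDate", "verbatimEventDate", "year", "month", "day"),
}


def is_missing(value: Optional[str]) -> bool:
    """A term counts as missing when it is absent, None, or blank."""
    return value is None or str(value).strip() == ""


def missing_group_assertions(record: Dict[str, Optional[str]]) -> List[str]:
    """Return the groups in which every term is missing."""
    return [
        group
        for group, terms in TERM_GROUPS.items()
        if all(is_missing(record.get(term)) for term in terms)
    ]


record = {"scientificName": "Harpullia pendula", "year": "1989", "month": ""}
print(missing_group_assertions(record))  # ['spatial']
```

The blank month does not raise an assertion on its own, because the year is still present in the temporal group.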
Fields Describing the Tests
Term | Description |
Test# | Local test number |
GUID | Globally Unique Identifier |
Original test ID | Test identifier from original source |
Variable name | Name to be used in code |
Description (test - FAIL) | Description if the test fails |
Description (test - PASS)* | Description if test passes (TG1) |
Specification | Technical description |
Record Resolution | Single or multiple record dependency |
Term Resolution | Single or multiple term dependency |
Data Dependency | Internal (Darwin Core) occurrence records or external data |
Warning Type | Nature of the issue |
Example | At least one simple example |
Darwin Core Class | The DwC Class focus for the test |
Darwin Core Terms | The DwC Terms used for the test |
DQ Dimension | Conformance, consistency, completeness, resolution etc (TG1) |
Source | The source of the test (agency or individual) |
References | Directly related references |
Warnings
Ambiguous e.g. 03/06/1989 (i.e. 3 June or 6 March?)
Amended e.g. Latitude made negative to conform with country
Incomplete e.g. Event date only year and month
Inconsistent e.g. Lat/long is inconsistent with country
Invalid e.g. Day=32
Unlikely e.g. Event Date = 01/01/1900
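The descriptor fields and warning types listed above can be sketched as a small Python data structure. The class and attribute names are hypothetical and only a subset of the fields is modelled; the example values are those given for test 157 in the worksheet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class WarningType(Enum):
    """The six warning types listed above."""
    AMBIGUOUS = "Ambiguous"
    AMENDED = "Amended"
    INCOMPLETE = "Incomplete"
    INCONSISTENT = "Inconsistent"
    INVALID = "Invalid"
    UNLIKELY = "Unlikely"


@dataclass(frozen=True)
class QualityTestDescriptor:
    """Subset of the fields describing a test (attribute names are assumptions)."""
    number: int
    guid: str
    variable_name: str
    description_fail: str
    specification: str
    record_resolution: str
    term_resolution: str
    data_dependency: str
    warning_type: WarningType
    darwin_core_class: str
    darwin_core_terms: Tuple[str, ...]
    dq_dimension: str
    source: str


example = QualityTestDescriptor(
    number=157,
    guid="6d0a0c10-5e4a-4759-b448-88932f399812",
    variable_name="EVENTDATE_FROM_VERBATIM",
    description_fail="Event date was interpreted from other date fields",
    specification="eventDate=interpret(all event date fields)",
    record_resolution="SingleRecord",
    term_resolution="MultiTerm",
    data_dependency="Internal",
    warning_type=WarningType.AMENDED,
    darwin_core_class="Event",
    darwin_core_terms=("eventDate", "verbatimEventDate", "year", "month", "day"),
    dq_dimension="Completeness",
    source="VertNet",
)
print(example.variable_name, example.warning_type.value)  # EVENTDATE_FROM_VERBATIM Amended
```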
# | 157 |
GUID | 6d0a0c10-5e4a-4759-b448-88932f399812 |
IDs | |
Variable | EVENTDATE_FROM_VERBATIM |
Description (test - FAIL) | Event date (dwc:eventDate) was interpreted from other date fields (dwc:verbatimEventDate, dwc:year, dwc:month and dwc:day) |
Specification | eventDate=interpret(all event date fields) |
Record Resolution | SingleRecord |
Term Resolution | MultiTerm |
Data Dependency | Internal |
Example | day=2,month=3,year=2013, therefore eventDate=2013-03-02 |
Darwin Core Class | Event |
Darwin Core Terms | eventDate, verbatimEventDate, year, month, day |
DQ Dimension | Completeness |
Warning Type | Amended |
Source | VertNet |
References | |
Mechanisms | event_date_qc |
Link to Specification Source Code |
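A hedged sketch of how EVENTDATE_FROM_VERBATIM might behave for the example in the table (day, month and year present, eventDate empty). The function name and return shape are hypothetical, parsing of verbatimEventDate is left out, and this is not the VertNet event_date_qc implementation.

```python
from datetime import date
from typing import Dict, Optional, Tuple


def event_date_from_parts(record: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Return (eventDate, amended).

    amended is True only when eventDate was empty and could be built from
    year, month and day. verbatimEventDate is ignored in this sketch.
    """
    existing = (record.get("eventDate") or "").strip()
    if existing:
        return existing, False
    try:
        interpreted = date(
            int(record["year"]), int(record["month"]), int(record["day"])
        )
    except (KeyError, TypeError, ValueError):
        return None, False  # parts missing or not a real calendar date
    return interpreted.isoformat(), True


print(event_date_from_parts({"day": "2", "month": "3", "year": "2013"}))
# ('2013-03-02', True)
```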
# | 1 |
GUID | 453844ae-9df4-439f-8e24-c52498eca84a |
IDs | |
Variable | TESTS_FLAGGED_REPORT |
Description | Number of tests that have been run against the record that were flagged |
Specification (Technical Description) | e.g. Count the number of tests that resulted in "TRUE" (i.e., where a test has failed) |
Record Resolution | SingleRecord |
Term Resolution | MultiTerm |
Data Dependency | Internal |
Example | 7 issues were flagged |
Darwin Core Class | All |
Darwin Core Terms | All |
DQ Dimension | Reliability? |
Warning Type | Report |
Source | Lee Belbin |
# | 59 |
GUID | 620749b9-7d9c-4890-97d2-be3d1cde6da8 |
IDs | 45 |
Variable | DECIMAL_LAT_LONG_CONVERTED |
Description (test - FAIL) | Decimal latitude and longitude and geodeticDatum were converted from another datum, with resulting implications for coordinate uncertainty and precision |
Specification (Technical Description) | georeference was converted to a new datum |
Record Resolution | SingleRecord |
Term Resolution | MultiTerm |
Data Dependency | External |
Example | decimalLatitude=-23.712, decimalLongitude=139.923, geodeticDatum=GDA94 converted to decimalLatitude=-23.712, decimalLongitude=139.923, geodeticDatum=WGS84(EPSG4326) |
Darwin Core Class | Location |
Darwin Core Terms | decimalLatitude, decimalLongitude, geodeticDatum |
DQ Dimension | Conformance |
Warning Type | Amended |
Source | ALA, GBIF |
# | 147 |
GUID | f01fb3f9-2f7e-418b-9f51-adf50f202aea |
IDs | |
Variable | SCIENTIFIC_NAME_ADDED |
Description (test - FAIL) | The scientific name (dwc:scientificName) has been added by concatenating genus, specificEpithet, infraspecificEpithet and scientificNameAuthorship |
Specification (Technical Description) | scientificName=genus+specificEpithet+infraspecificEpithet+scientificNameAuthorship |
Record Resolution | SingleRecord |
Term Resolution | MultiTerm |
Data Dependency | Internal |
Example | scientificName="Harpullia pendula F.Muell." from genus="Harpullia" + specificEpithet="pendula" + scientificNameAuthorship="F.Muell." |
Darwin Core Class | Taxon |
Darwin Core Terms | scientificName, genus, specificEpithet, infraspecificEpithet, scientificNameAuthorship |
DQ Dimension | Completeness |
Warning Type | Amended |
Source | iDigBio |
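A minimal sketch of SCIENTIFIC_NAME_ADDED based on the description above. The function name is hypothetical, requiring a genus before building a name is an assumption, and the sketch joins parts with single spaces without inserting a rank marker (such as "var." or "subsp.") before an infraspecific epithet. It is not the iDigBio implementation.

```python
from typing import Dict, Tuple

NAME_PARTS = (
    "genus",
    "specificEpithet",
    "infraspecificEpithet",
    "scientificNameAuthorship",
)


def add_scientific_name(record: Dict[str, str]) -> Tuple[Dict[str, str], bool]:
    """Return (record, amended); fill scientificName only when it is empty."""
    if (record.get("scientificName") or "").strip():
        return record, False
    parts = [(record.get(term) or "").strip() for term in NAME_PARTS]
    if not parts[0]:
        return record, False  # assumption: no genus, nothing sensible to build
    amended = dict(record)  # leave the caller's record untouched
    amended["scientificName"] = " ".join(part for part in parts if part)
    return amended, True


record = {
    "genus": "Harpullia",
    "specificEpithet": "pendula",
    "scientificNameAuthorship": "F.Muell.",
}
updated, amended = add_scientific_name(record)
print(updated["scientificName"], amended)  # Harpullia pendula F.Muell. True
```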
Next Steps?
TG2 Worksheet: http://bit.ly/2uY3FoA
GitHub: https://github.com/tdwg/bdq
The End