Linked Data breakout session
Jonathan Yu, Mark Hedley
CSIRO and UK Met Office
25 May 2016 | EarthCube netCDF-CF Workshop
Session agenda
We’re not data poor
“90% of the world’s data has been produced over the last two years”
Problem
Users – find the right data, access it, use it (cite it?)
Data providers – collect data, describe, publish, (update)
Problem
Data
XML
CSV
JSON
netCDF
HDF
THREDDS
Data implicit in
webpages
Common formats
not well handled by machines across formats and sources
Discovering, accessing, parsing data held in databases and via APIs
netCDF conventions – level of agreement
CF
Individual
Scientist/Researcher
Agencies
Working group/
Committee/CoP
Project
Teams
Company
International/
Global bodies
ACDD
Argo
eReefs
Streamflow forecast
Internal but shared
Local
Private
Community wide
(int’l)
Cross organisation
OceanSites
SeaDataNet
RAF
My.C.
Challenges with conventions
Keeping up to date/Updating them/Need something now
Suitability – which one?
Validation – have I done the right thing?
Compatibility between versions
Cost/Benefits of adopting conventions – why should I?
Tooling – help me adopt the convention? Make data useful...
netCDF is not alone in these challenges
Data
netCDF
HDF
json-ld
geojson
json-rpc
seadas
csv-au-geo
Numerous bespoke
csv…
Too many
CF
ACDD
OceanSites
RAF
Temperature
Project/initiative
conventions
Numerous bespoke
json
Linked Data / Web of Data
Method to connect related data and semantics using web links (HTTP URIs)
Data is self-describing
Standardised – HTTP + RDF
Applications can then look up embedded web links to get more info, find more connections, and infer new insights from the data
Linked Data principles
http://linkeddata.org/
http://linkeddatabook.com
Linked Open Data Cloud
570 Datasets
295 Datasets
95 Datasets
Science/Domain vocabularies as Linked Data
Vocabulary services | Cox & Yu
JSON-LD
{
"name": "John Lennon",
"born": "1940-10-09",
"spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
JSON-LD
Decorators
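To illustrate the decorators, here is a minimal sketch of how `@context` and `@id` could turn the plain JSON record into Linked Data. The schema.org property URLs in the context are assumptions chosen for illustration, and `expand_terms` is a hypothetical helper that handles only flat term definitions, not the full JSON-LD expansion algorithm.

```python
import json

# Plain JSON record from the slide, decorated with JSON-LD keywords.
# The schema.org mappings below are illustrative assumptions.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "born": "http://schema.org/birthDate",
        "spouse": {"@id": "http://schema.org/spouse", "@type": "@id"},
    },
    "@id": "http://dbpedia.org/resource/John_Lennon",
    "name": "John Lennon",
    "born": "1940-10-09",
    "spouse": "http://dbpedia.org/resource/Cynthia_Lennon",
}


def expand_terms(document):
    """Replace short keys with the URLs given in a flat @context.

    Hypothetical helper: handles only simple term definitions.
    """
    context = document.get("@context", {})
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue
        if key.startswith("@"):
            expanded[key] = value  # keywords such as @id pass through
            continue
        definition = context.get(key, key)
        if isinstance(definition, dict):
            url = definition["@id"]
            # "@type": "@id" marks the value as a link, not a string
            if definition.get("@type") == "@id":
                value = {"@id": value}
        else:
            url = definition
        expanded[url] = value
    return expanded


if __name__ == "__main__":
    print(json.dumps(expand_terms(doc), indent=2))
```

The record stays readable as ordinary JSON; only the added `@`-keys carry the link semantics.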
CSV-on-the-web
Linked data for CSV tabular data
Add context to tables via metadata file
Use cases: documentation, validation, transformation (e.g. RDF, JSON, XML…), annotate semantics, enhanced discovery
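As a rough sketch of the metadata-file idea, the example below pairs a small CSV with a description that uses keys from the W3C CSV on the Web metadata vocabulary (`@context`, `url`, `tableSchema`, `columns`, `name`, `titles`, `datatype`, `propertyUrl`). The file name, column names, and `propertyUrl` value are invented for illustration, and `validate_rows` is a hypothetical helper that checks only the `number` datatype.

```python
import csv
import io

# Hypothetical CSV content and metadata description (CSVW-style keys).
CSV_TEXT = "site,temp\nA1,21.5\nB2,not-a-number\n"

METADATA = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "observations.csv",
    "tableSchema": {
        "columns": [
            {"name": "site", "titles": "site", "datatype": "string"},
            {
                "name": "temp",
                "titles": "temp",
                "datatype": "number",
                # Assumed vocabulary link giving the column its meaning
                "propertyUrl": "http://example.org/def/air_temperature",
            },
        ]
    },
}


def validate_rows(csv_text, metadata):
    """Return (row_number, column, value) for cells that fail their datatype."""
    columns = {c["name"]: c for c in metadata["tableSchema"]["columns"]}
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for name, value in row.items():
            if columns[name]["datatype"] == "number":
                try:
                    float(value)
                except ValueError:
                    problems.append((row_number, name, value))
    return problems


if __name__ == "__main__":
    print(validate_rows(CSV_TEXT, METADATA))
```

The CSV itself is untouched; all context lives in the separate metadata description, which is the pattern the deck proposes to borrow for netCDF.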
Towards linked data conventions for delivery of environmental data using netCDF | Jonathan Yu
How is all this Linked Data stuff relevant for netCDF and CF conventions?
Have all the building blocks to enhance netCDF(-CF)!
CSV-on-the-web
Vocabularies
as Linked Data
Linked Data
General approach
Principles
Tools
Patterns for linkifying
common formats
Content for annotating metadata
Tools to create, manage, publish vocabs
‘linkify’ netCDF!
5 stars of Linked Open Data
netCDF(-CF) online
netCDF(-CF) online + additional context: links to standardised vocab URLs, e.g. NERC vocabs, QUDT, Observable Properties vocabs, CF standard name URLs online, DBpedia URLs
★ | make your stuff available on the web (whatever format) |
★★ | make it available as structured data (e.g. Excel instead of image scan of a table) |
★★★ | non-proprietary format (e.g. CSV instead of Excel) |
★★★★ | use URLs to identify things, so that people can point at your stuff |
★★★★★ | link your data to other people’s data to provide context |
Benefits of linkifying netCDF(-CF)
1. Improve discoverability and reduce ambiguity
2. Improve data integration
3. Potential to translate netCDF to other formats
4. Potentially ease metadata generation
5. Easier to build applications
Increased discoverability -> More usage -> Greater impact
Current examples / thought exercises
#1: Injecting vocabulary URIs in netCDF headers using special attributes
THREDDS
THREDDS Catalog
Domain Vocabs
(Water Quality at environment.data.gov.au)
Quantities/ Units ontology
(QUDT)
substanceOrTaxon_id = http://environment.data.gov.au
Allows for harmonised access to those binding to conventions
Enviro Application #1
Enviro Application #2
Enviro Application #3
Data
Data
Data
DB
DB
DB
substanceOrTaxon= http://environment.data.gov.au/def/object/chlorophyll
scaledQuantityKind = http://environment.data.gov.au/def/property/chlorophyll_concentration
Yu, J., Simons, B. A., Car, N. J., & Cox, S. J. (2014). Enhancing water quality data service discovery and access using standard vocabularies. Hydroinformatics conference. New York.
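A minimal sketch of what harmonised access could look like from an application's side: variables are selected by the vocabulary URI they bind to, not by their local names. The catalogue entries, dataset names, and variable names are invented; only the attribute names and the two chlorophyll URIs come from the slide.

```python
# Hypothetical harvested catalogue: each provider uses its own variable
# name, but all bind to the same vocabulary URIs via special attributes.
CHLOROPHYLL = "http://environment.data.gov.au/def/object/chlorophyll"
CHL_CONC = "http://environment.data.gov.au/def/property/chlorophyll_concentration"

CATALOGUE = [
    {"dataset": "provider_a.nc", "variable": "Chl_MIM",
     "substanceOrTaxon": CHLOROPHYLL, "scaledQuantityKind": CHL_CONC},
    {"dataset": "provider_b.nc", "variable": "chla",
     "substanceOrTaxon": CHLOROPHYLL, "scaledQuantityKind": CHL_CONC},
    {"dataset": "provider_c.nc", "variable": "temp",
     "substanceOrTaxon": "http://example.org/def/object/water"},
]


def find_variables(catalogue, attribute, uri):
    """Return (dataset, variable) pairs whose attribute equals the URI."""
    return [
        (entry["dataset"], entry["variable"])
        for entry in catalogue
        if entry.get(attribute) == uri
    ]


if __name__ == "__main__":
    # Every application can issue the same query against any provider.
    print(find_variables(CATALOGUE, "substanceOrTaxon", CHLOROPHYLL))
```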
Other examples of injecting vocab URIs
2. netCDF-U - uncertainty URIs (Bigagli & Nativi 2013)
3. Use of flags to encode URIs for categorical data
It’s already happening out in the community! Should we co-ordinate how we do this?
#2: netCDF-LD
‘Context’ boilerplates
… Apply JSON-LD pattern
Note: Not yet tested - more a thought experiment...
netCDF-LD: Linkifying netCDF
Air Temperature Definition
Air
Temperature
Medium
Quantity Kind
Kelvin
UoM
Ref vocabularies
netCDF-LD: Global Attributes
Assigning URIs to variable level attributes
z:units = "meters";
z:units_ref = "http://qudt.org/vocab/unit#Meter";
z:a = "http://environment.data.gov.au/def/op#quantityKind";
z:dcPartOf = "http://foo.bar/linked_netCDF_example";
z:valid_range = 0., 5000.;
netCDF-LD to RDF
@prefix unit: <http://qudt.org/vocab/unit#> .
@prefix qudt: <http://qudt.org/1.1/schema/qudt#> .
@prefix op: <http://environment.data.gov.au/def/op#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
_:z qudt:unit unit:Meter ;
    a op:ScaledQuantityKind ;
    dcterms:isPartOf <http://foo.bar/linked_netCDF_example> .
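The translation from variable-level attributes to RDF could be sketched as below. The attribute-to-predicate mapping is an assumption inferred from the two slides (`units_ref` to `qudt:unit`, `a` to `rdf:type`, `dcPartOf` to `dcterms:isPartOf`), and `to_ntriples` is a hypothetical helper. Attribute values are passed through unchanged, so the type URI is the one given in the header example (`op#quantityKind`), not the `op:ScaledQuantityKind` shown in the Turtle.

```python
# Assumed mapping from the example attribute names to RDF predicates.
PREDICATES = {
    "units_ref": "http://qudt.org/1.1/schema/qudt#unit",
    "a": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "dcPartOf": "http://purl.org/dc/terms/isPartOf",
}

# Variable-level attributes as they appear in the example header for `z`.
Z_ATTRIBUTES = {
    "units": "meters",
    "units_ref": "http://qudt.org/vocab/unit#Meter",
    "a": "http://environment.data.gov.au/def/op#quantityKind",
    "dcPartOf": "http://foo.bar/linked_netCDF_example",
    "valid_range": (0.0, 5000.0),
}


def to_ntriples(variable, attributes, predicates=PREDICATES):
    """Emit one N-Triples line per attribute that has a known predicate."""
    subject = f"_:{variable}"  # blank node named after the variable
    lines = []
    for name, value in attributes.items():
        predicate = predicates.get(name)
        if predicate is None:
            continue  # plain attributes (units, valid_range) are skipped
        lines.append(f"{subject} <{predicate}> <{value}> .")
    return lines


if __name__ == "__main__":
    print("\n".join(to_ntriples("z", Z_ATTRIBUTES)))
```

N-Triples is used here only because it needs no prefix handling; the same triples can be serialised as Turtle.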
Use of netCDF-LD to support Data Discovery
Yu, J., Car, N. J., Leadbetter, A., Simons, B. A., & Cox, S. J. (2015). Towards linked data conventions for delivery of environmental data using netCDF. In Environmental Software Systems. Infrastructures, Services and Applications (pp. 102-112). Springer International Publishing. https://dx.doi.org/10.1007/978-3-319-15994-2_9
#3: BALD (Binary Array LD)
Linked Data Conventions for netCDF, HDF, …
https://github.com/binary-array-ld/bald
http://binary-array-ld.net/latest
http://binary-array-ld.net/latest?classView=true
https://github.com/binary-array-ld/bald/issues
#3: BALD (Binary Array LD)
#2 prefix identification
#3 prefix container
validation - do URIs resolve? - are array references consistent?
Aim: create an RDF graph of the metadata within a file (collection of files)
Identifying a file is an interesting question: OpenDAP presents an interesting angle on this
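The two validation questions could be approached along these lines. This is not BALD's implementation: the in-memory metadata layout is hypothetical, the rule that attributes ending in `_ref` must hold HTTP URIs is an assumption borrowed from the `units_ref` example, and array references are represented by the CF `coordinates` attribute. The live "does it resolve?" check is an optional callable, so the sketch runs offline.

```python
from urllib.parse import urlparse

# Hypothetical in-memory view of a file's metadata: variable -> attributes.
FILE_METADATA = {
    "temp": {
        "units_ref": "http://qudt.org/vocab/unit#Kelvin",
        "coordinates": "time",
    },
    "time": {"units_ref": "not a uri"},
    "salt": {"coordinates": "depth"},  # "depth" is not defined in the file
}


def looks_like_http_uri(value):
    """Syntactic check only: scheme is http(s) and a host is present."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate(metadata, resolves=None):
    """Return a list of human-readable problems.

    `resolves` is an optional callable(uri) -> bool for a live HTTP check.
    """
    problems = []
    for variable, attributes in metadata.items():
        for name, value in attributes.items():
            if name.endswith("_ref"):
                if not looks_like_http_uri(value):
                    problems.append(f"{variable}:{name} is not an HTTP URI")
                elif resolves is not None and not resolves(value):
                    problems.append(f"{variable}:{name} does not resolve")
            elif name == "coordinates":
                for target in value.split():
                    if target not in metadata:
                        problems.append(
                            f"{variable}:{name} refers to missing variable {target}"
                        )
    return problems


if __name__ == "__main__":
    for problem in validate(FILE_METADATA):
        print(problem)
```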
Discussion: Value for this community (and broader) and draft use cases
Does the community see value in a Linked Data profile/convention for netCDF? Part of netCDF-CF?
What are the use cases?
Spend some time drafting use cases
Draft use cases
Discovery
Use
Encoding
Compliance checking
cf__standard_name = cf__air_temp
ereefs__quantity = wq__
Options:
cf namespace default mixed with other conventions
standard_name = “xxx”
acdd__
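One way to read the prefix options above is as a lookup table that expands `prefix__name` into a full URI, with CF as the default namespace for unprefixed names. The sketch below assumes the double-underscore separator seen in the examples; the namespace URLs are placeholders, not agreed identifiers, and `expand` is a hypothetical helper.

```python
# Hypothetical prefix table; the namespace URLs are placeholders.
PREFIXES = {
    "cf__": "http://example.org/cf/",
    "ereefs__": "http://example.org/ereefs/",
    "acdd__": "http://example.org/acdd/",
}
DEFAULT_PREFIX = "cf__"  # option: treat unprefixed names as CF


def expand(name, prefixes=PREFIXES, default=DEFAULT_PREFIX):
    """Expand `prefix__local` into a full URI using the prefix table."""
    prefix, sep, local = name.partition("__")
    if sep and prefix + sep in prefixes:
        return prefixes[prefix + sep] + local
    # No recognised prefix: fall back to the default namespace
    return prefixes[default] + name


if __name__ == "__main__":
    for attribute in ("cf__standard_name", "standard_name", "ereefs__quantity"):
        print(attribute, "->", expand(attribute))
```

With a CF default, existing files that use plain `standard_name` would expand to the same URI as `cf__standard_name`, which is the mixing of conventions the options list raises.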
Challenges
Governance of prefix namespace
Persistence of URIs
Principles
Benefits
Draft a plan for activities to engineer prototype(s), test and validate against use cases
2. What would an activity look like?
3. How do we organise it? Next steps and timeframes
Participation
Thank you
CSIRO Land and Water
Jonathan Yu
Research Software Engineer
t +61 3 9252 6440
e jonathan.yu@csiro.au
UK Met Office
Mark Hedley
e mark.hedley@metoffice.gov.uk
http://www.fireflyim.com/docs/smart_enterprise_data.pdf
Break up components in standard_name into multiple attributes – ref (Yu et al. 2014)
float Nap_MIM(time, latitude, longitude) ;
    Nap_MIM:_FillValue = -999.f ;
    Nap_MIM:long_name = "TSS, MIM SVDC on Rrs" ;
    Nap_MIM:units = "mg/L" ;
    Nap_MIM:valid_min = 0.01209607f ;
    Nap_MIM:valid_max = 226.9626f ;
    Nap_MIM:scaledQuantityKind_id = "http://environment.data.gov.au/water/quality/def/property/solids-total_suspended" ;
    Nap_MIM:unit_id = "http://environment.data.gov.au/water/quality/def/unit/MilliGramsPerLitre" ;
    Nap_MIM:substanceOrTaxon_id = "http://environment.data.gov.au/water/quality/def/object/solids" ;
    Nap_MIM:medium_id = "http://environment.data.gov.au/water/quality/def/object/ocean" ;
    Nap_MIM:procedure_id = "http://data.ereefs.org.au/ocean-colour/MIM_SVDC_RRS" ;
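A harvester could pick these `_id` attributes out of a header dump with very little machinery. The sketch below is one possible approach, not the method used in the cited work: it assumes a text header in the style of `ncdump -h`, one string attribute per line terminated by `;`, and the `harvest_ids` helper is hypothetical.

```python
import re

# A few lines of the header above, in the style of `ncdump -h` output.
CDL_HEADER = '''
float Nap_MIM(time, latitude, longitude) ;
    Nap_MIM:long_name = "TSS, MIM SVDC on Rrs" ;
    Nap_MIM:units = "mg/L" ;
    Nap_MIM:unit_id = "http://environment.data.gov.au/water/quality/def/unit/MilliGramsPerLitre" ;
    Nap_MIM:medium_id = "http://environment.data.gov.au/water/quality/def/object/ocean" ;
'''

# variable:attribute = "string value" ;
ATTRIBUTE = re.compile(r'^\s*(\w+):(\w+)\s*=\s*"([^"]*)"\s*;\s*$')


def harvest_ids(cdl_text):
    """Collect string attributes ending in `_id`, grouped by variable."""
    found = {}
    for line in cdl_text.splitlines():
        match = ATTRIBUTE.match(line)
        if not match:
            continue  # declarations and numeric attributes are ignored
        variable, name, value = match.groups()
        if name.endswith("_id"):
            found.setdefault(variable, {})[name] = value
    return found


if __name__ == "__main__":
    for variable, ids in harvest_ids(CDL_HEADER).items():
        for name, uri in ids.items():
            print(variable, name, uri)
```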
JSON-LD and Semantic Web
http://dbpedia.org/resource/John_Lennon
http://dbpedia.org/resource/Cynthia_Lennon
“John Lennon”
1940-10-09
DBL Harvesting and End Use
Data Brokering Layer
THREDDS
THREDDS Catalog
Domain Vocabs
(Water Quality at environment.data.gov.au)
Quantities/ Units ontology
(QUDT)
substanceOrTaxon = http://environment.data.gov.au
End users
Client application
chlorophyll
eReefs visualisation portal
Summary of approach #1: injecting vocab URIs
Various communities are developing approaches to add context and semantics to complement netCDF-CF.
Clearly, there are use cases for adding more semantics to current netCDF-CF metadata specifications.
Approaches are currently fragmented.
Would benefit from agreement and common approaches to profile.
Deep (or invisible) web
400-500x more public information than the Surface Web
1000-2000x greater quality than the Surface Web
95% of the Deep Web is publicly accessible
Deep Web sites tend to be narrower, with deeper content
Is netCDF (scientific) data part of this Deep Web?