1 of 49

Linked Data breakout session

Jonathan Yu (CSIRO) and Mark Hedley (UK Met Office)

25 May 2016 | EarthCube netCDF-CF Workshop

2 of 49

Session agenda

  1. Context + Purpose
  2. What is Linked Data? (a.k.a. web of data)
  3. How is it relevant for netCDF and CF conventions? Benefits?
  4. Work done to date
  5. Discuss its value for this community (and broader) and draft use cases
  6. Draft a plan for activities to engineer prototype(s), test and validate against use cases

3 of 49

We’re not data poor

“90% of the world’s data has been produced over the last two years”

Problem

Users – find the right data, access it, use it, (cite it?)

Data providers – collect data, describe, publish, (update)

4 of 49

Problem

Data

XML

CSV

JSON

netCDF

HDF

THREDDS

Data implicit in webpages

Common formats not well handled by machines across formats and sources

Discovering, accessing, parsing data held in databases and via APIs

5 of 49

netCDF conventions – level of agreement

[Diagram: netCDF conventions arranged by level of agreement]

Who agrees: Individual scientist/researcher; Project teams; Company; Agencies; Working group/Committee/CoP; International/Global bodies

Scope: Private; Local; Internal but shared; Cross organisation; Community wide (international)

Conventions shown: CF, ACDD, Argo, eReefs, Streamflow forecast, OceanSites, SeaDataNet, RAF, My.C.

6 of 49

Challenges with conventions

Keeping up to date/Updating them/Need something now

Suitability – which one?

Validation – have I done the right thing?

Compatibility between versions

Cost/Benefits of adopting conventions – why should I?

Tooling – help me adopt the convention? Make data useful...

7 of 49

netCDF not alone in these challenges

[Diagram labels: Data; netCDF; HDF; json-ld; geojson; json-rpc; seadas; csv-au-geo; Numerous bespoke csv…; Too many; CF; ACDD; OceanSites; RAF; Temperature; Project/initiative conventions; Numerous bespoke json]

8 of 49

Linked Data / Web of Data

Method to connect related data and semantics using web links (HTTP URIs)

Data is self-describing

Standardised – HTTP + RDF

Applications can then look up embedded web links to get more info, find more connections, and infer new insights from the data

9 of 49

Linked Data principles

  1. Use URIs as names for things.

  2. Use HTTP URIs, so that people can look up those names.

  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).

  4. Include links to other URIs, so that they can discover more things.
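
Principles 2 and 3 rely on HTTP content negotiation: a client asks for a machine-readable representation of a name by setting the Accept header. The following is a minimal sketch that only prepares such a request without sending it. The function name is hypothetical, and whether a given server honours `text/turtle` is an assumption.

```python
from urllib.request import Request


def build_lookup_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET that asks for an RDF serialisation."""
    # The Accept header is how a client requests machine-readable data
    # instead of an HTML page (HTTP content negotiation).
    return Request(uri, headers={"Accept": media_type})


# URI taken from the JSON-LD example in this deck.
request = build_lookup_request("http://dbpedia.org/resource/Cynthia_Lennon")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup; that step is left out here so the sketch runs offline.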

http://linkeddata.org/

http://linkeddatabook.com

10 of 49

Linked Open Data Cloud

570 Datasets

295 Datasets

95 Datasets

11 of 49

Science/Domain vocabularies as Linked Data

Vocabulary services | Cox & Yu

12 of 49

13 of 49

JSON-LD

{

"name": "John Lennon",

"born": "1940-10-09",

"spouse": "http://dbpedia.org/resource/Cynthia_Lennon"

}

JSON-LD

Decorators
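
The JSON above becomes Linked Data once a context maps each key to a URI. Below is a minimal sketch of that idea, assuming schema.org terms for the mappings (the slide does not show its context) and using a hand-written helper rather than a real JSON-LD processor.

```python
import json

# The plain JSON from the slide, plus an illustrative @context and @id.
# Assumption: the schema.org mappings below are not part of the original slide.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "born": {
            "@id": "http://schema.org/birthDate",
            "@type": "http://www.w3.org/2001/XMLSchema#date",
        },
        "spouse": {"@id": "http://schema.org/spouse", "@type": "@id"},
    },
    "@id": "http://dbpedia.org/resource/John_Lennon",
    "name": "John Lennon",
    "born": "1940-10-09",
    "spouse": "http://dbpedia.org/resource/Cynthia_Lennon",
}


def expand_terms(document: dict) -> dict:
    """Replace short keys with the URIs the context assigns to them.

    Hypothetical helper: it handles only flat term definitions, not the
    full JSON-LD expansion algorithm.
    """
    context = document.get("@context", {})
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue  # the context is consumed, not copied
        if key.startswith("@"):
            expanded[key] = value  # keywords such as @id pass through
            continue
        definition = context.get(key, key)
        uri = definition["@id"] if isinstance(definition, dict) else definition
        expanded[uri] = value
    return expanded


print(json.dumps(expand_terms(doc), indent=2))
```

The data keys stay short and readable for people, while the context carries the globally unique meaning for machines.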

14 of 49

CSV-on-the-web

Linked data for CSV tabular data

Add context to tables via metadata file

Use cases: documentation, validation, transformation (e.g. RDF, JSON, XML…), annotate semantics, enhanced discovery
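
As a rough illustration of the metadata-file idea, the sketch below pairs a small CSV with a CSV-on-the-web style description and uses it for one simple validation step. The file name, column names, and the `example.org` property URL are hypothetical, and the check is a hand-written stand-in for a real CSVW validator.

```python
import csv
import io

# Sidecar metadata in the style of the CSV-on-the-web vocabulary.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "observations.csv",  # hypothetical file name
    "tableSchema": {
        "columns": [
            {"name": "time", "titles": "time", "datatype": "dateTime"},
            {
                "name": "air_temperature",
                "titles": "air_temperature",
                "datatype": "number",
                # Hypothetical vocabulary URL giving the column its meaning.
                "propertyUrl": "http://example.org/def/air_temperature",
            },
        ]
    },
}

CSV_TEXT = "time,air_temperature\n2016-05-25T00:00:00Z,287.4\n"


def header_matches(csv_text: str, meta: dict) -> bool:
    """Check that the CSV header row matches the column titles in the metadata."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = [column["titles"] for column in meta["tableSchema"]["columns"]]
    return header == expected


print(header_matches(CSV_TEXT, metadata))
```

The CSV itself is untouched; all added context lives in the separate metadata document.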

Towards linked data conventions for delivery of environmental data using netCDF | Jonathan Yu

15 of 49

How is all this Linked Data stuff relevant for netCDF and CF conventions?

Have all the building blocks to enhance netCDF(-CF)!

Linked Data: general approach, principles, tools

CSV-on-the-web: patterns for linkifying common formats

Vocabularies as Linked Data: content for annotating metadata; tools to create, manage, publish vocabs

‘Linkify’ netCDF!

16 of 49

5 stars of Linked Open Data

★

make your stuff available on the web (whatever format)

★★

make it available as structured data (e.g. Excel instead of an image scan of a table)

★★★

use a non-proprietary format (e.g. CSV instead of Excel)

★★★★

use URLs to identify things, so that people can point at your stuff

★★★★★

link your data to other people’s data to provide context

netCDF(-CF) online

netCDF(-CF) online + additional context: links to standardised vocab URLs, e.g. NERC vocabs, QUDT, Observable Properties vocabs, CF standard name URLs online, DBpedia URLs

17 of 49

Benefits of linkifying netCDF(-CF)

1. Improve discoverability and reduce ambiguity

  • link to vocabularies to add context
  • easier to support community profiles and validation

2. Improve data integration

3. Potential to translate netCDF to other formats

4. Potentially ease metadata generation

5. Easier to build applications

Increased discoverability -> More usage -> Greater impact

18 of 49

Current examples / thought exercises

  1. Injecting vocabulary URIs in netCDF headers using special attributes

    • eReefs/Observable property model convention
    • SeaDataNet
    • ‘Smuggling’ semantics into flags

  2. netCDF-LD

  3. Binary-array-LD (BALD)

19 of 49

#1: Injecting vocabulary URIs in netCDF headers using special attributes

THREDDS

THREDDS Catalog

Domain Vocabs (Water Quality at environment.data.gov.au)

Quantities/Units ontology (QUDT)

20 of 49

Allows for harmonised access to those binding to conventions

[Diagram labels: Enviro Application #1, #2, #3; Data (×3); DB (×3)]

Yu, J., Simons, B. A., Car, N. J., & Cox, S. J. (2014). Enhancing water quality data service discovery and access using standard vocabularies. Hydroinformatics conference. New York.

21 of 49

Other examples of injecting vocab URIs

1. SeaDataNet CF Profile

  • Specifies minimum info content as attributes: sdn_parameter_urn, sdn_parameter_name, sdn_uom_urn, sdn_uom_name
  • Binding to NERC P01 (parameters), P06 (units) vocabulary collections

2. netCDF-U - uncertainty URIs (Bigagli & Nativi 2013)

  • Use of “ref” attribute for uncertainty concept URI, further references using ancillary_variables

3. Use of flags to encode URIs for categorical data

  • Use of flag_namespace attribute to give vocab URI prefix to values in flag_meanings
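
The flag pattern in item 3 can be sketched as follows: each word in `flag_meanings` is appended to the prefix in `flag_namespace` to give a concept URI per flag value. The attribute values and the `example.org` namespace are hypothetical, and the function is an illustration rather than an agreed convention.

```python
def flag_concept_uris(attrs: dict) -> dict:
    """Map each flag value to a vocabulary URI built from flag_namespace."""
    namespace = attrs["flag_namespace"]
    # CF encodes flag_meanings as one blank-separated string.
    meanings = attrs["flag_meanings"].split()
    values = list(attrs["flag_values"])
    if len(meanings) != len(values):
        raise ValueError("flag_values and flag_meanings differ in length")
    return {value: namespace + meaning for value, meaning in zip(values, meanings)}


# Hypothetical variable attributes for a quality-control flag.
qc_attrs = {
    "flag_values": [0, 1, 2],
    "flag_meanings": "good probably_good bad",
    "flag_namespace": "http://example.org/def/quality_flag/",
}

for value, uri in flag_concept_uris(qc_attrs).items():
    print(value, uri)
```

Existing readers that ignore `flag_namespace` still see ordinary CF flags, which is what makes this pattern backwards compatible.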

It’s already happening out in the community! Should we co-ordinate how we do this?

22 of 49

#2: netCDF-LD

‘Context’ boilerplates

… Apply JSON-LD pattern

Note: Not yet tested - more a thought experiment...

23 of 49

netCDF-LD: Linkifying netCDF

Air Temperature Definition

Medium: Air

Quantity Kind: Temperature

UoM: Kelvin

Ref vocabularies

Note: Not yet tested - more a thought experiment...

24 of 49

netCDF-LD: Global Attributes

Note: Not yet tested - more a thought experiment...

25 of 49

Assigning URIs to variable level attributes

z:units = "meters";

z:units_ref = "http://qudt.org/vocab/unit#Meter";

z:a = "http://environment.data.gov.au/def/op#quantityKind";

z:dcPartOf = "http://foo.bar/linked_netCDF_example";

z:valid_range = 0., 5000.;

Note: Not yet tested - more a thought experiment...

26 of 49

netCDF-LD to RDF

@prefix unit: <http://qudt.org/vocab/unit#> .

@prefix qudt: <http://qudt.org/1.1/schema/qudt#> .

@prefix op: <http://environment.data.gov.au/def/op#> .

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix dcterms: <http://purl.org/dc/terms/> .

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

_:z qudt:unit unit:Meter;

a op:ScaledQuantityKind;

dcterms:isPartOf <http://foo.bar/linked_netCDF_example>.
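
One way to picture the translation from variable attributes to RDF is a lookup table from attribute names to predicates. The sketch below is untested against real files, like the rest of this thought experiment. The mapping table is an assumption read off the two slides, and the output is N-Triples rather than the Turtle shown above.

```python
# Assumed mapping from netCDF-LD attribute names to RDF predicate URIs.
ATTRIBUTE_TO_PREDICATE = {
    "units_ref": "http://qudt.org/1.1/schema/qudt#unit",
    "a": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "dcPartOf": "http://purl.org/dc/terms/isPartOf",
}


def attributes_to_ntriples(var_name: str, attrs: dict) -> list:
    """Emit one N-Triples line per attribute that has a known predicate."""
    subject = f"_:{var_name}"  # blank node standing in for the variable
    lines = []
    for name, value in attrs.items():
        predicate = ATTRIBUTE_TO_PREDICATE.get(name)
        if predicate is None:
            continue  # plain attributes such as units carry no link
        lines.append(f"{subject} <{predicate}> <{value}> .")
    return lines


# Attribute values copied from the variable-level example slide.
z_attrs = {
    "units": "meters",
    "units_ref": "http://qudt.org/vocab/unit#Meter",
    "a": "http://environment.data.gov.au/def/op#quantityKind",
    "dcPartOf": "http://foo.bar/linked_netCDF_example",
}

print("\n".join(attributes_to_ntriples("z", z_attrs)))
```

Attributes without a mapping are skipped, so classic CF metadata passes through unaffected.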

Note: Not yet tested - more a thought experiment...

27 of 49

Use of netCDF-LD to support Data Discovery

Yu, J., Car, N. J., Leadbetter, A., Simons, B. A., & Cox, S. J. (2015). Towards linked data conventions for delivery of environmental data using netCDF. In Environmental Software Systems. Infrastructures, Services and Applications (pp. 102-112). Springer International Publishing. https://dx.doi.org/10.1007/978-3-319-15994-2_9

28 of 49

#3: BALD (Binary Array LD)

29 of 49

#3: BALD (Binary Array LD)

#2 prefix identification

#3 prefix container

validation - do URIs resolve? - are array references consistent?

Aim: create an RDF graph of the metadata within a file (collection of files)

Identifying a file is an interesting question: OpenDAP presents an interesting angle on this
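
The two validation questions can be approximated offline. The sketch below assumes file metadata is held as a plain dictionary, checks that URI-valued attributes are at least well-formed HTTP URIs (it does not test whether they resolve), and checks that names listed in `ancillary_variables` exist in the file. The function names, the `units_ref` attribute choice, and the sample variables are hypothetical and are not taken from the BALD software.

```python
from urllib.parse import urlparse


def is_http_uri(value: str) -> bool:
    """Syntactic check only; resolving the URI would need a network call."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def malformed_uris(variables: dict, uri_attrs=("units_ref",)) -> list:
    """List (variable, attribute, value) where a URI attribute is not an HTTP URI."""
    return [
        (var_name, attr, attrs[attr])
        for var_name, attrs in variables.items()
        for attr in uri_attrs
        if attr in attrs and not is_http_uri(attrs[attr])
    ]


def dangling_references(variables: dict, reference_attrs=("ancillary_variables",)) -> list:
    """List (variable, attribute, target) where the target variable is missing."""
    problems = []
    for var_name, attrs in variables.items():
        for attr in reference_attrs:
            for target in attrs.get(attr, "").split():
                if target not in variables:
                    problems.append((var_name, attr, target))
    return problems


# Hypothetical file contents: z_error is referenced but never defined.
variables = {
    "z": {
        "units_ref": "http://qudt.org/vocab/unit#Meter",
        "ancillary_variables": "z_qc z_error",
    },
    "z_qc": {"units_ref": "not a uri"},
}

print(malformed_uris(variables))
print(dangling_references(variables))
```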

30 of 49

Discussion: Value for this community (and broader) and draft use cases

Does the community see value in a Linked Data profile/convention for netCDF? Part of netCDF-CF?

  • General consensus
  • Acceptable to reference external resources?

What are the use cases?

Spend some time drafting use cases

31 of 49

Draft use cases

Discovery

Use

  • Machine readable content
  • DOI for a dataset - create links for a URI

Encoding

  • help data providers reference external sources
  • reference features (geoms, stations, platform, instrument, sensor)

Compliance checking

  • help data providers check which conventions a file is bound to, e.g. where one or more conventions are applied
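
A compliance check could start from the global `Conventions` attribute, which may name several conventions separated by blanks or commas. The sketch below parses that attribute and reports missing global attributes per convention. The registry of required attributes is a small illustrative subset chosen for this example, not an authoritative rule set.

```python
# Hypothetical registry: a few global attributes each convention family expects.
REQUIRED_GLOBAL_ATTRS = {
    "CF": ["Conventions"],
    "ACDD": ["title", "summary"],
}


def parse_conventions(value: str) -> list:
    """Split a Conventions string on commas if present, otherwise on blanks."""
    parts = value.split(",") if "," in value else value.split()
    return [part.strip() for part in parts if part.strip()]


def missing_attributes(global_attrs: dict) -> dict:
    """Return {convention: [missing attribute names]} for the declared conventions."""
    missing = {}
    for convention in parse_conventions(global_attrs.get("Conventions", "")):
        family = convention.split("-", 1)[0]  # "CF-1.6" -> "CF"
        required = REQUIRED_GLOBAL_ATTRS.get(family, [])
        absent = [name for name in required if name not in global_attrs]
        if absent:
            missing[convention] = absent
    return missing


# Hypothetical global attributes of a file that declares two conventions.
file_attrs = {"Conventions": "CF-1.6, ACDD-1.3", "title": "Example dataset"}
print(missing_attributes(file_attrs))  # {'ACDD-1.3': ['summary']}
```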

32 of 49

cf__standard_name = cf__air_temp

ereefs__quantity = wq__

Options:

cf namespace default mixed with other conventions

standard_name = “xxx”

acdd__
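
The "cf namespace default" option can be sketched as a lookup in which a double underscore separates prefix from local name and unprefixed names fall back to CF. The prefix table and its `example.org` URIs are hypothetical placeholders; no namespace URIs were agreed in the session.

```python
# Hypothetical prefix container; the URIs are placeholders, not agreed namespaces.
PREFIXES = {
    "cf": "http://example.org/cf/",
    "acdd": "http://example.org/acdd/",
    "ereefs": "http://example.org/ereefs/",
}
DEFAULT_PREFIX = "cf"


def expand_attribute_name(name: str, prefixes: dict = PREFIXES,
                          default: str = DEFAULT_PREFIX) -> str:
    """Turn 'prefix__local' (or a bare 'local') into a full URI."""
    prefix, separator, local = name.partition("__")
    if not separator:
        # No double underscore: treat the name as classic CF.
        prefix, local = default, name
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand_attribute_name("cf__standard_name"))
print(expand_attribute_name("standard_name"))
print(expand_attribute_name("acdd__summary"))
```

With a default in place, existing files that use plain `standard_name` need no changes, while other conventions opt in with an explicit prefix.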

33 of 49

Challenges

  • Wary of introducing XML-ism into netCDF
    • Perhaps have default namespaces for each convention
    • People like netCDF because there’s no namespace
    • not as elegant
    • alternatives for specifying LD - using ‘@’ to prefix incl. standard_name?

Governance of prefix namespace

  • falls under Unidata?
  • governance of other namespace

Persistence of URIs -

  • injecting fragility
  • already exists - references to convention documentation
  • doi?

34 of 49

Principles

  • Doesn’t break classic CF - Backwards compatible
  • Prefer elegance of classic CF
  • Forward looking approach

Benefits

  • able to pull in content from external sources e.g. labels, features/geometries?

35 of 49

Draft a plan for activities to engineer prototype(s), test and validate against use cases

1. What would we need to make this work? Examples, qualified use cases from existing projects/data, endorsement?

  • principles (see previous slide)
  • project use cases
  • endorsement - CF/ACDD/CMIP (conventions level) or netCDF (at an API level)?

2. What would an activity look like?

  • GitHub
  • test BALD software on GitHub

3. How do we organise it? Next steps and timeframes

  • 6 months, propose monthly telecon in this period
  • GitHub

36 of 49

Participation

  • contribute use cases, test cases from projects
    • e.g. features, grid specs, ship track
    • netcdf groups?
  • monthly telecons
  • github code and issue tracker

37 of 49

Thank you

CSIRO Land and Water

Jonathan Yu

Research Software Engineer

t +61 3 9252 6440

e jonathan.yu@csiro.au

UK Met Office

Mark Hedley

e mark.hedley@metoffice.gov.uk

40 of 49

http://www.fireflyim.com/docs/smart_enterprise_data.pdf

41 of 49

Break up components in standard_name into multiple attributes – ref (Yu et al. 2014)

float Nap_MIM(time, latitude, longitude) ;

   Nap_MIM:_FillValue = -999.f ;

   Nap_MIM:long_name = "TSS, MIM SVDC on Rrs" ;

   Nap_MIM:units = "mg/L" ;

   Nap_MIM:valid_min = 0.01209607f ;

   Nap_MIM:valid_max = 226.9626f ;

   Nap_MIM:scaledQuantityKind_id = "http://environment.data.gov.au/water/quality/def/property/solids-total_suspended" ;

   Nap_MIM:unit_id = "http://environment.data.gov.au/water/quality/def/unit/MilliGramsPerLitre" ;

   Nap_MIM:substanceOrTaxon_id = "http://environment.data.gov.au/water/quality/def/object/solids" ;

   Nap_MIM:medium_id = "http://environment.data.gov.au/water/quality/def/object/ocean" ;

   Nap_MIM:procedure_id = "http://data.ereefs.org.au/ocean-colour/MIM_SVDC_RRS" ;
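
Splitting the description into `*_id` attributes makes it straightforward to index variables by concept URI for discovery. The sketch below assumes the header has already been read into a dictionary (shown here with a subset of the attributes above). The function name and index layout are hypothetical.

```python
from collections import defaultdict

# A subset of the Nap_MIM attributes from the header above.
variables = {
    "Nap_MIM": {
        "long_name": "TSS, MIM SVDC on Rrs",
        "units": "mg/L",
        "scaledQuantityKind_id": "http://environment.data.gov.au/water/quality/def/property/solids-total_suspended",
        "medium_id": "http://environment.data.gov.au/water/quality/def/object/ocean",
    },
}


def build_concept_index(variables: dict) -> dict:
    """Map each concept URI to the (variable, role) pairs that reference it."""
    index = defaultdict(list)
    for var_name, attrs in variables.items():
        for attr_name, value in attrs.items():
            if attr_name.endswith("_id"):
                role = attr_name[: -len("_id")]  # e.g. "medium_id" -> "medium"
                index[value].append((var_name, role))
    return dict(index)


index = build_concept_index(variables)
ocean = "http://environment.data.gov.au/water/quality/def/object/ocean"
print(index[ocean])  # [('Nap_MIM', 'medium')]
```

A catalogue built this way can answer "which variables were measured in the ocean?" without parsing free-text names.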

43 of 49

JSON-LD and Semantic Web

http://dbpedia.org/resource/John_Lennon

http://dbpedia.org/resource/Cynthia_Lennon

“John Lennon”

1940-10-09

45 of 49

DBL Harvesting and End Use

Data Brokering Layer

THREDDS

THREDDS Catalog

Domain Vocabs (Water Quality at environment.data.gov.au)

Quantities/Units ontology (QUDT)

End users

Client application

chlorophyll

46 of 49

eReefs visualisation portal

47 of 49

48 of 49

Summary of approach #1: injecting vocab URIs

Various communities are developing approaches to add context and semantics to complement netCDF-CF.

Clearly, there are use cases for adding more semantics to current netCDF-CF metadata specifications.

Approaches are currently fragmented.

Would benefit from agreement and common approaches to profile.

49 of 49

Deep (or invisible) web

400-500x more public information than the Surface Web

1000-2000x greater quality than Surface Web

95% of the Deep Web is publicly accessible

Deep Web sites tend to be narrower, with deeper content

Is netCDF (scientific) data part of this Deep Web?