Developing Community standards-based Search Tools for Earth System Model Data using STAC
Richard Smith¹, Philip Kershaw¹, Ag Stephens², Rhys Evans², Aparna Radhakrishnan³, V. Balaji³, Ryan Abernathey⁴
Overview
1 What are we looking to solve?
A look at our shared problem
2 STAC Overview
What is STAC?
3 Proposed Solution
System including indexing framework, server and clients
Background
The CEDA Archive
ESGF
International collaboration that develops, deploys and maintains software infrastructure for the management, dissemination, and analysis of model output and observational data.
The data catalog comprises several observational and modelled datasets.
Pangeo
Community promoting open, reproducible, scalable science. Aims to coordinate scientists, software and computing to further research.
Looking for a cataloguing standard (familiar with Intake)
Converting popular data to Cloud Optimised Formats (Zarr) with Google Cloud and Amazon
What do we want?
Develop a search tool that lets users perform faceted search and find the data relevant to their use case, taking into account the heterogeneity of the data.
It needs to
What have we tried?
Intake Catalogs (PANGEO)
ESGF Search
File-based Search (CEDA)
Benefits of Shared Approach
Shared Problem
We think this problem is common among data providers with heterogeneous data.
We are actively looking for collaborators to work with us on this.
STAC Overview
What is STAC?
“The SpatioTemporal Asset Catalog (STAC) specification provides a common language to describe a range of geospatial information, so it can more easily be indexed and discovered. A 'spatiotemporal asset' is any file that represents information about the earth captured in a certain space and time.
The goal is for all providers of spatiotemporal assets (Imagery, SAR, Point Clouds, Data Cubes, Full Motion Video, etc) to expose their data as SpatioTemporal Asset Catalogs (STAC), so that new code doesn't need to be written whenever a new data set or API is released.”
What is STAC?
“Catalog - a simple, flexible JSON file of links that provides a structure to organize and browse STAC Items.”
“Collection - an extension of the STAC Catalog with additional information such as the extents, license, keywords, providers, etc. that describe STAC Items that fall within the Collection.”
“Item - the core atomic unit, representing a single spatiotemporal asset as a GeoJSON feature plus datetime and links.”
“Assets - an object that contains a URI to data associated with the Item that can be downloaded or streamed.”
What is STAC? - For us
Collection - Set of items with a common vocabulary/DRS (e.g. CMIP6)
Item - Meaningful group of 1+ files (e.g. ESGF atomic Dataset)
Asset - Single data file (e.g. one NetCDF file)
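The mapping above can be sketched as a STAC Item held in a plain Python dictionary. The top-level fields follow the STAC Item specification; the identifier, facet names, dates and URLs are hypothetical placeholders and do not describe a real CEDA or ESGF record.

```python
import json

# All identifiers, facet names, dates and URLs below are hypothetical.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-cmip6-dataset",
    "collection": "cmip6",  # the Collection sharing a common vocabulary/DRS
    "geometry": None,  # the Item spec allows a null geometry
    "properties": {
        # datetime may be null when a start/end range is given instead
        "datetime": None,
        "start_datetime": "1850-01-01T00:00:00Z",
        "end_datetime": "2014-12-30T00:00:00Z",
        # free-form facets taken from the collection's vocabulary
        "variable": "tas",
        "experiment_id": "historical",
    },
    "links": [],
    # one Asset per data file in the group
    "assets": {
        "file-0001": {
            "href": "https://example.org/data/tas_185001-194912.nc",
            "type": "application/netcdf",
            "roles": ["data"],
        },
        "file-0002": {
            "href": "https://example.org/data/tas_195001-201412.nc",
            "type": "application/netcdf",
            "roles": ["data"],
        },
    },
}


def asset_hrefs(stac_item):
    """Return the URI of every asset attached to an item."""
    return [asset["href"] for asset in stac_item["assets"].values()]


print(json.dumps(item, indent=2))
print(asset_hrefs(item))
```

Grouping the files as assets of one item means a search returns one hit per dataset rather than one hit per file.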
STAC – Static or Dynamic?
Static Catalogs – A set of interlinked JSON files which can be navigated hierarchically (navigation is set)
Dynamic Catalogs – Use the STAC API to provide enhanced navigation and search capability (can perform item search, faceted search, etc.)
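As a rough illustration of what a dynamic catalog adds, the sketch below builds the JSON body for a STAC API item search request. The `collections`, `datetime` and `limit` fields are core item search parameters; the `query` field comes from the STAC API Query extension, and whether a given server supports it is an assumption here. The helper name `build_search_body` and the facet `variable` are hypothetical.

```python
import json


def build_search_body(collections, start, end, limit=10, property_filters=None):
    """Build a JSON body for a STAC API item search (POST /search)."""
    body = {
        "collections": list(collections),
        # closed interval written as two RFC 3339 timestamps
        "datetime": f"{start}/{end}",
        "limit": limit,
    }
    if property_filters:
        # Query extension syntax; server support is assumed, not guaranteed
        body["query"] = {
            name: {"eq": value} for name, value in property_filters.items()
        }
    return body


body = build_search_body(
    ["cmip6"],
    "2000-01-01T00:00:00Z",
    "2010-12-31T23:59:59Z",
    property_filters={"variable": "tas"},
)
print(json.dumps(body, indent=2))
```

A static catalog has no equivalent of this request: a client can only follow the links already written into the JSON files.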
STAC Ecosystem
Documentation
Active community
STAC Ecosystem
Some issues
STAC initially gained traction with the Earth Observation community. Challenges with model data include:
Given the flexibility of STAC and the engagement of its community, we feel STAC fulfils the majority of our requirements and can be adapted to fill the others.
Progress So Far
Overview
Indexing
Server
Clients
Indexing Framework – Asset Scanner
Indexing Framework
Plugin architecture can satisfy different use cases:
This section is driven by a YAML file that describes the workflow. It also uses pre- and post-processors to condition the raw information.
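The idea of a plugin workflow with pre- and post-processors can be sketched in a few lines of Python. This is not the asset scanner's code or configuration format: the registry, the processor names, the stage names and the file path are all hypothetical, and a plain dictionary stands in for the parsed YAML file.

```python
import re

# Hypothetical plugin registry: maps a name used in the workflow to a function.
REGISTRY = {}


def register(name):
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap


@register("basename")
def basename(record):
    """Pre-processor: derive the filename from the full path."""
    return {**record, "filename": record["path"].rsplit("/", 1)[-1]}


@register("regex")
def regex_extract(record, pattern):
    """Extractor: pull facets out of the filename with named groups."""
    match = re.match(pattern, record["filename"])
    return {**record, "facets": match.groupdict() if match else {}}


@register("lowercase_facets")
def lowercase_facets(record):
    """Post-processor: normalise facet values to lower case."""
    facets = {key: value.lower() for key, value in record["facets"].items()}
    return {**record, "facets": facets}


# Stand-in for the parsed YAML workflow description (structure is assumed).
workflow = {
    "pre": [{"name": "basename"}],
    "extract": [{
        "name": "regex",
        "kwargs": {
            "pattern": r"(?P<variable>[^_]+)_(?P<table>[^_]+)_(?P<model>[^_]+)_"
        },
    }],
    "post": [{"name": "lowercase_facets"}],
}


def run(workflow, path):
    """Apply each stage of the workflow, in order, to one file path."""
    record = {"path": path}
    for stage in ("pre", "extract", "post"):
        for step in workflow.get(stage, []):
            record = REGISTRY[step["name"]](record, **step.get("kwargs", {}))
    return record


result = run(workflow, "/data/tas_Amon_UKESM1-0-LL_historical.nc")
print(result["facets"])
```

Keeping the workflow in data rather than code means a different archive could reuse the same processors with a different configuration.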
STAC Server
STAC Extensions
Extensions allow you to write content and feature specifications that extend the basic STAC specification.
Some things we want:
STAC Extensions
Two categories:
Content
API
STAC Extensions
So far we haven’t looked at content extensions.
The flexible properties attribute can hold rich metadata.
On the feature side:
Free-text Search
Powered by the Elasticsearch query string query
https://github.com/cedadev/stac-freetext-search
Free-text Search any “property”
Free-text Search with wildcard
Free-text Search with logical operators
Free-text Search for specific fields
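The four kinds of free-text search listed above correspond to features of the Elasticsearch `query_string` syntax. The sketch below builds the Elasticsearch request body for each case; it does not show how the stac-freetext-search extension passes the text to the server, and the field name `properties.experiment_id` and the search terms are hypothetical.

```python
import json


def build_query_string_search(text, fields=None, size=10):
    """Build an Elasticsearch search body using the query_string query."""
    query_string = {"query": text}
    if fields:
        # restrict the search to the listed fields
        query_string["fields"] = list(fields)
    return {"size": size, "query": {"query_string": query_string}}


# Hypothetical search terms, one per kind of search.
examples = {
    "any property": "historical",
    "wildcard": "hist*",
    "logical operators": "historical AND (tas OR pr)",
    "specific field": "properties.experiment_id:historical",
}

for label, text in examples.items():
    print(label, json.dumps(build_query_string_search(text)))
```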
Clients - Web
Why make our own?
What's different?
Start, end, orbit number.
https://stac.ceda.ac.uk/search
Clients - Python
Fully exposes all the capabilities provided by the STAC API server.
Retrieves collections and items as their own models, each with functions specific to that model.
The response models are no different in structure from their JSON counterparts!
Where next?
Thank you!
JASMIN: support@jasmin.ac.uk
CEDA: support@ceda.ac.uk
Twitter - @cedanews
Website - www.ceda.ac.uk
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement N°824084