1 of 31

Developing Community standards-based Search Tools for Earth System Model Data using STAC

Richard Smith1, Philip Kershaw1, Ag Stephens2, Rhys Evans2, Aparna Radhakrishnan3, V. Balaji3, Ryan Abernathey4

2 of 31

Overview

1 What are we looking to solve?

A look at our shared problem

2 STAC Overview

What is STAC?

3 Proposed Solution

System including indexing framework, server and clients

3 of 31

Background

The CEDA Archive

  • Holds > 13 PB of atmospheric and earth observation data
  • Data from many different sources
    • 7300 datasets in the CEDA catalogue
    • 344+ million files
    • +200,000 files/day
    • Our data is stored on POSIX disk, Tape archive and Object Store

4 of 31

Background

ESGF

International collaboration that develops, deploys and maintains software infrastructure for the management, dissemination, and analysis of model output and observational data.

Data catalog comprises several observational and modelled datasets.

Pangeo

Community promoting open, reproducible, scalable science. Aims to coordinate scientists, software and computing to further research.

Looking for a cataloguing standard (familiar with Intake)

Converting popular data to Cloud Optimised Formats (Zarr) with Google Cloud and Amazon

5 of 31

What do we want?

Develop a search tool which allows users to perform faceted search and find the relevant data for their use-case, taking into account the heterogeneity of the data.

It needs to

  • Allow low level search of all items (granules)
  • Work with different domains/vocabularies
  • Provide faceted search
  • Handle heterogeneous datasets
  • Be scalable
  • Also comprise an indexing framework to generate content

6 of 31

What have we tried?

Intake Catalogs (PANGEO)

    • Static
    • Scales to a certain extent
    • Familiarity in the community

ESGF Search

  • Proprietary API format
  • Complex publication process
  • Provides rich faceted search

File-based Search (CEDA)

  • Leverages Elasticsearch
  • Quick and scalable search
  • Powers light-touch directory browser

7 of 31

Benefits of Shared Approach

    • Community only needs to learn one API
    • Shared development effort
    • Expanding scope and use cases allows solutions to be found for common issues

8 of 31

Shared Problem

We think this problem is common among data providers with heterogeneous data.

We are actively looking for collaborators to work with us on this

9 of 31

STAC Overview

10 of 31

What is STAC?

“The SpatioTemporal Asset Catalog (STAC) specification provides a common language to describe a range of geospatial information, so it can more easily be indexed and discovered. A 'spatiotemporal asset' is any file that represents information about the earth captured in a certain space and time.

The goal is for all providers of spatiotemporal assets (Imagery, SAR, Point Clouds, Data Cubes, Full Motion Video, etc) to expose their data as SpatioTemporal Asset Catalogs (STAC), so that new code doesn't need to be written whenever a new data set or API is released.”

https://stacspec.org/

11 of 31

What is STAC?

  • Based on OGC API Features
  • JSON formatted response and data standard
  • Extension is encouraged
  • Requires temporal and spatial attributes
  • Available as either a static or dynamic API
  • Strong community engagement

12 of 31

What is STAC?

“Catalog - a simple, flexible JSON file of links that provides a structure to organize and browse STAC Items.”

“Collection - an extension of the STAC Catalog with additional information such as the extents, license, keywords, providers, etc. that describe STAC Items that fall within the Collection.”

“Item - the core atomic unit, representing a single spatiotemporal asset as a GeoJSON feature plus datetime and links.”

“Assets - an object that contains a URI to data associated with the Item that can be downloaded or streamed.”

13 of 31

What is STAC? - For us

Collection - Set of items with a common vocabulary/DRS (e.g. CMIP6)

Item - Meaningful group of 1+ files (e.g. ESGF atomic Dataset)

Asset - Single data file (e.g. one NetCDF file)
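As a rough illustration of this mapping, the sketch below builds one STAC Item for a model dataset whose assets are individual NetCDF files. The identifiers, property names (`variable`, `experiment_id`) and URLs are hypothetical placeholders, not values taken from the CEDA or ESGF catalogues.

```python
import json

# Hypothetical Item: one "atomic dataset" grouping two NetCDF files.
# All ids, property names and hrefs are illustrative assumptions.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-cmip6-dataset",
    "collection": "cmip6",
    # Global model output: the footprint covers the whole globe.
    "bbox": [-180.0, -90.0, 180.0, 90.0],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-180.0, -90.0], [180.0, -90.0], [180.0, 90.0],
            [-180.0, 90.0], [-180.0, -90.0],
        ]],
    },
    "properties": {
        # STAC allows a null datetime when a start/end range is given.
        "datetime": None,
        "start_datetime": "1850-01-01T00:00:00Z",
        "end_datetime": "2014-12-31T00:00:00Z",
        "variable": "tas",
        "experiment_id": "historical",
    },
    "links": [],
    "assets": {
        "part_1": {
            "href": "https://example.org/data/tas_1850-1949.nc",
            "type": "application/x-netcdf",
        },
        "part_2": {
            "href": "https://example.org/data/tas_1950-2014.nc",
            "type": "application/x-netcdf",
        },
    },
}


def asset_hrefs(stac_item):
    """Return the download locations of every asset in an Item, sorted."""
    return sorted(asset["href"] for asset in stac_item["assets"].values())


if __name__ == "__main__":
    print(json.dumps(asset_hrefs(item), indent=2))
```

The facets a user would search on live in `properties`, while the files themselves are reached through `assets`.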

14 of 31

STAC – Static or Dynamic?

Static Catalogs – A set of interlinked JSON files which can be navigated hierarchically (navigation is set)

Dynamic Catalogs – Use the STAC API to provide enhanced navigation and search capability (can perform item search, faceted search, etc.)
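To make the dynamic case more concrete, here is a minimal sketch of the JSON body a client might POST to a STAC API `/search` endpoint. It assumes the server supports the Filter extension with CQL2 JSON; the facet names and the helper `build_search_body` are hypothetical and not part of any existing client.

```python
import json


def build_search_body(collections, datetime_range=None, limit=10, facets=None):
    """Assemble an item-search request body (illustrative helper).

    `facets` maps property names to required values and is translated
    into a CQL2 JSON filter, assuming the server supports that extension.
    """
    body = {"collections": list(collections), "limit": limit}
    if datetime_range is not None:
        start, end = datetime_range
        # STAC expresses a closed interval as "start/end".
        body["datetime"] = f"{start}/{end}"
    if facets:
        clauses = [
            {"op": "=", "args": [{"property": name}, value]}
            for name, value in sorted(facets.items())
        ]
        body["filter-lang"] = "cql2-json"
        # A single clause is used directly; several are combined with "and".
        body["filter"] = (
            clauses[0] if len(clauses) == 1 else {"op": "and", "args": clauses}
        )
    return body


if __name__ == "__main__":
    body = build_search_body(
        ["cmip6"],
        datetime_range=("2000-01-01T00:00:00Z", "2010-01-01T00:00:00Z"),
        facets={"variable": "tas", "experiment_id": "historical"},
    )
    print(json.dumps(body, indent=2))
```

A static catalog offers no equivalent: the client can only follow the links that were written into the JSON files.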

15 of 31

STAC Ecosystem

Documentation

Active community

16 of 31

STAC Ecosystem

17 of 31

Some issues

    • "ST" (Spatial and Temporal) in STAC is mandatory - CEDA holds Martian datasets
    • Dynamic facet reduction (based on search context) not yet in the spec
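One way to picture dynamic facet reduction: once a search has narrowed the result set, only the facet values that still occur in the matching items are offered. The sketch below does this in memory over a handful of made-up items; a real server would more likely use aggregations in its search backend. The function and property names are assumptions for illustration.

```python
from collections import Counter


def facet_counts(items, facet_names):
    """Count the facet values present in the given items only."""
    counts = {name: Counter() for name in facet_names}
    for item in items:
        properties = item.get("properties", {})
        for name in facet_names:
            if name in properties:
                counts[name][properties[name]] += 1
    return {name: dict(counter) for name, counter in counts.items()}


# Made-up items carrying two hypothetical facets.
ITEMS = [
    {"properties": {"experiment_id": "historical", "variable": "tas"}},
    {"properties": {"experiment_id": "historical", "variable": "pr"}},
    {"properties": {"experiment_id": "ssp585", "variable": "tas"}},
]

if __name__ == "__main__":
    # Search context: the user has already selected experiment_id=historical.
    matching = [
        item for item in ITEMS
        if item["properties"]["experiment_id"] == "historical"
    ]
    print(facet_counts(matching, ["experiment_id", "variable"]))
```

After the selection, `ssp585` no longer appears among the available values, which is the behaviour a faceted interface needs from the API.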

Initially gained traction with Earth Observation community. Challenges with model data include:

    • Rotated grids
    • Non-standard dates (paleo-climatology, non-standard calendars)

18 of 31

Given the flexibility of STAC and its community engagement, we feel STAC fulfils the majority of our requirements and can be adapted to meet the others.

19 of 31

Progress So Far

20 of 31

Overview

Indexing

Server

Clients

21 of 31

Indexing Framework – Asset Scanner

22 of 31

Indexing Framework

Plugin architecture can satisfy different use cases:

This section is fed by a YAML file to describe the workflow. Also makes use of pre/post-processors to condition the raw information.
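A minimal sketch of how a configuration-driven plugin workflow of this kind could look. The `WORKFLOW` dict stands in for the parsed YAML file, and every processor name and function here is hypothetical; this is not the actual Asset Scanner code or configuration schema.

```python
import posixpath


# --- Hypothetical plugins, each registered under a name used in the config ---
def strip_whitespace(path):
    """Pre-processor: tidy the raw path before extraction."""
    return path.strip()


def extract_filename(path):
    """Extractor: derive the file name from the path."""
    return {"filename": posixpath.basename(path)}


def extract_extension(path):
    """Extractor: derive the file extension, without the leading dot."""
    return {"extension": posixpath.splitext(path)[1].lstrip(".")}


def drop_empty(metadata):
    """Post-processor: remove facets that came out empty."""
    return {key: value for key, value in metadata.items() if value not in ("", None)}


PRE_PROCESSORS = {"strip_whitespace": strip_whitespace}
EXTRACTORS = {"filename": extract_filename, "extension": extract_extension}
POST_PROCESSORS = {"drop_empty": drop_empty}

# Stand-in for the parsed YAML workflow description (assumed layout).
WORKFLOW = {
    "pre_processors": ["strip_whitespace"],
    "extractors": ["filename", "extension"],
    "post_processors": ["drop_empty"],
}


def run_workflow(path, workflow):
    """Apply pre-processors, then extractors, then post-processors in order."""
    for name in workflow.get("pre_processors", []):
        path = PRE_PROCESSORS[name](path)
    metadata = {}
    for name in workflow.get("extractors", []):
        metadata.update(EXTRACTORS[name](path))
    for name in workflow.get("post_processors", []):
        metadata = POST_PROCESSORS[name](metadata)
    return metadata


if __name__ == "__main__":
    print(run_workflow("  /data/example/tas_day.nc ", WORKFLOW))
```

Because plugins are looked up by name, a different use case only needs a different workflow description, not changes to the runner.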

23 of 31

   

STAC Server

24 of 31

STAC Extensions

Allow you to write content and feature specifications to extend the basic STAC specification.

Some things we want:

  • Free-text search
  • Dynamic facet discovery

25 of 31

STAC Extensions

Two categories:

Content

    • Processing – (processing level and provenance)
    • Datacube – (variables & dimensions)

API

    • Filter
    • Context
    • Sort
    • Transaction

26 of 31

STAC Extensions

So far we haven’t looked at content extensions.

The flexible properties attribute can hold rich metadata.

On the feature side:

27 of 31

Free-text Search

https://github.com/cedadev/stac-freetext-search

Free-text Search any “property”

Free-text Search with wildcard

Free-text Search with logical operators

Free-text Search for specific fields
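The four query styles above might be issued as GET requests along the following lines. The `q` parameter name, the `field:term` syntax and the base URL are assumptions made for illustration; the linked stac-freetext-search repository is the authority on the actual syntax.

```python
from urllib.parse import urlencode

# Hypothetical STAC API search endpoint.
BASE_URL = "https://example.org/stac/search"


def freetext_url(query, base_url=BASE_URL):
    """Build a search URL carrying a free-text query (assumed `q` parameter)."""
    return f"{base_url}?{urlencode({'q': query})}"


# One example per query style listed above; the syntax is assumed.
EXAMPLE_QUERIES = {
    "any property": "temperature",
    "wildcard": "temp*",
    "logical operators": "temperature AND historical",
    "specific field": "variable:tas",
}

if __name__ == "__main__":
    for style, query in EXAMPLE_QUERIES.items():
        print(f"{style}: {freetext_url(query)}")
```

Using `urlencode` keeps spaces, wildcards and colons safely escaped in the query string.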

28 of 31

Clients - Web

Why make our own?

  • Existing tools are focussed on static catalogs
  • UIs that support the API are very map-centric (not relevant for global model datasets)
  • Want to be able to move quickly and try out new features (free-text search, faceted-search)
  • Loosely based on the STAC Browser

What's different?

Start, end, orbit number.

29 of 31

Clients - Python

Based upon stac.py.

Fully exposes all the capabilities provided by the STAC API server.

Retrieve collections and items as their own models, each with functions specific to that model.

The response models are no different in structure from their JSON counterparts!

30 of 31

Where next?

    • Improve faceted search https://github.com/radiantearth/stac-api-spec/issues/182
    • Improve indexing coverage of the CEDA Archive
    • Build out Python client, specifically with ESGF community in mind
    • Work with others to improve our approach

31 of 31

Thank you!

JASMIN: support@jasmin.ac.uk

CEDA: support@ceda.ac.uk

Twitter - @cedanews

Website - www.ceda.ac.uk

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement N°824084