1 of 11

Why and How of Data and Algorithm Standards

Craig Dsouza, WELL Labs, 8th July, 2024

WATER • ENVIRONMENT • LAND • LIVELIHOODS

2 of 11

Why Data and Algorithm Standards

  • To increase trust & reduce friction in data and algorithm sharing and hence accelerate development of better data and algorithms.
  • Domain-specific standards, especially in NRM, either don’t exist or aren’t widely adopted.
  • We plan to explore existing standards for our current datasets & algorithms.

3 of 11

Data Standards

Algorithm Standards

4 of 11

How do we compare the performance of Data Standards?

To increase trust & reduce friction in data and algorithm sharing and hence accelerate development of better data and algorithms.

reduce friction

  • time taken for dataset/model integration with existing open source tools
  • time taken for end user to create a new dummy datapoint
  • time taken for end user to run the model and make a first minor ‘fix’

accelerate development

  • Number of collaborators on datasets and algorithms vs time
  • Number of additions to dataset by 3rd largest collaborator
  • Increase in model performance vs time
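
As a rough illustration of how the first two "accelerate development" metrics could be computed, the sketch below works from a contribution log. The log format, the collaborator names, and the helper functions are all hypothetical; a real log might come from version control history or a data portal.

```python
from collections import Counter
from datetime import date

# Hypothetical contribution log: (date, collaborator, rows added to the dataset).
contributions = [
    (date(2024, 1, 10), "org_a", 500),
    (date(2024, 1, 25), "org_b", 120),
    (date(2024, 2, 5), "org_a", 300),
    (date(2024, 2, 18), "org_c", 40),
    (date(2024, 3, 2), "org_d", 15),
    (date(2024, 3, 20), "org_c", 60),
]


def cumulative_collaborators(log):
    """Return {(year, month): distinct collaborators seen up to that month}."""
    seen = set()
    result = {}
    for day, who, _ in sorted(log):
        seen.add(who)
        result[(day.year, day.month)] = len(seen)
    return result


def additions_by_rank(log, rank):
    """Return (collaborator, total additions) for the rank-th largest contributor."""
    totals = Counter()
    for _, who, rows in log:
        totals[who] += rows
    ranked = totals.most_common()
    return ranked[rank - 1] if len(ranked) >= rank else None


collaborators_over_time = cumulative_collaborators(contributions)
third_largest = additions_by_rank(contributions, 3)
print(collaborators_over_time)
print(third_largest)
```

Looking at the third largest collaborator rather than the largest gives a sense of whether contributions are spreading beyond the one or two organisations that started the dataset.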

5 of 11

An Example

6 of 11

The seasonal LULC product

Maps

Land Use Land Cover (LULC)

7 of 11

SUPERVISED

SUPERVISED

UNSUPERVISED

Models

Land Use Land Cover (LULC)

The seasonal LULC product

8 of 11

Sharing the Seasonal LULC product (Dataset)

An example of how we can share the LULC dataset using existing open geospatial data standards (STAC spec)

Data Provider

  • Upload data to Cloud storage
  • Create a STAC Catalog (a catalog.json file)
    • An example with pystac
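
A minimal sketch of what the catalog step produces, written with only the Python standard library so the file contents are visible; pystac builds the same structure through its `Catalog` and `Item` classes. The ids, bounding box, date, storage URL, and the assumption that the map is stored as a cloud-optimized GeoTIFF are all hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path

STAC_VERSION = "1.0.0"

# Hypothetical values: replace with the real extent, date, and storage URL.
ASSET_URL = "https://example.com/lulc/lulc_2023_season1.tif"
BBOX = [76.0, 14.0, 77.0, 15.0]  # west, south, east, north


def bbox_to_polygon(bbox):
    """Build a closed GeoJSON polygon ring from a bounding box."""
    west, south, east, north = bbox
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {"type": "Polygon", "coordinates": [ring]}


# One STAC Item describes one map (here, one season) and points at the file.
item = {
    "type": "Feature",
    "stac_version": STAC_VERSION,
    "id": "lulc-2023-season1",
    "geometry": bbox_to_polygon(BBOX),
    "bbox": BBOX,
    "properties": {"datetime": "2023-10-01T00:00:00Z"},
    "links": [
        {"rel": "root", "href": "../catalog.json", "type": "application/json"},
        {"rel": "parent", "href": "../catalog.json", "type": "application/json"},
    ],
    "assets": {
        "data": {
            "href": ASSET_URL,
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data"],
        }
    },
}

# The Catalog is the entry point; it links to every Item it contains.
catalog = {
    "type": "Catalog",
    "stac_version": STAC_VERSION,
    "id": "seasonal-lulc",
    "description": "Seasonal Land Use Land Cover (LULC) maps",
    "links": [
        {"rel": "root", "href": "./catalog.json", "type": "application/json"},
        {
            "rel": "item",
            "href": f"./{item['id']}/{item['id']}.json",
            "type": "application/geo+json",
        },
    ],
}

# Write to a temporary folder; in practice this sits next to the data in cloud storage.
out_dir = Path(tempfile.mkdtemp())
item_dir = out_dir / item["id"]
item_dir.mkdir()
(out_dir / "catalog.json").write_text(json.dumps(catalog, indent=2))
(item_dir / f"{item['id']}.json").write_text(json.dumps(item, indent=2))
print("Wrote catalog to", out_dir / "catalog.json")
```

Each new season would add one more Item file and one more `item` link in catalog.json, so users always find the latest data from the same entry point.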

9 of 11

Accessing the Seasonal LULC product (Dataset)

An example of how we can access the LULC dataset using existing open geospatial data standards

Data User

  • Accesses latest version of data via API or STAC browser
  • No need to download all the data, can query new data as needed and see changes
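
To illustrate "query new data as needed", the sketch below filters a small list of items by area and date, which is the kind of filtering a STAC API item search performs through its `bbox` and `datetime` parameters. The items and the `search` helper are hypothetical stand-ins for a real API or client library call.

```python
from datetime import datetime, timezone

# Hypothetical STAC items, reduced to the fields a search needs.
items = [
    {
        "id": "lulc-2022-season1",
        "bbox": [76.0, 14.0, 77.0, 15.0],
        "properties": {"datetime": "2022-10-01T00:00:00Z"},
    },
    {
        "id": "lulc-2023-season1",
        "bbox": [76.0, 14.0, 77.0, 15.0],
        "properties": {"datetime": "2023-10-01T00:00:00Z"},
    },
    {
        "id": "lulc-2023-other-region",
        "bbox": [80.0, 20.0, 81.0, 21.0],
        "properties": {"datetime": "2023-10-01T00:00:00Z"},
    },
]


def parse_time(text):
    """Parse an RFC 3339 UTC timestamp such as 2023-10-01T00:00:00Z."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def bboxes_intersect(a, b):
    """True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(items, bbox, start, end):
    """Return ids of items that overlap bbox and fall within [start, end]."""
    return [
        item["id"]
        for item in items
        if bboxes_intersect(item["bbox"], bbox)
        and start <= parse_time(item["properties"]["datetime"]) <= end
    ]


area_of_interest = [76.4, 14.2, 76.6, 14.4]
matches = search(
    items,
    area_of_interest,
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 12, 31, tzinfo=timezone.utc),
)
print(matches)
```

Only the matching item's asset then needs to be fetched, rather than the full archive.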

10 of 11

Join this afternoon’s group activity to discuss publishing your datasets with open standards

Thank You!

11 of 11

Group Activity

Broad questions

How to leverage data hosting efforts that are underway, e.g. Source Coop (for big data)

Build on top of metadata standards already in use, e.g. Open Imagery Network, ARIES for SEEA

Explore their extensibility to the kinds of data we will be dealing with: remotely sensed, secondary, and primary

Think through algorithm standards and how to track data flow chains, which seem to be largely missing in other efforts

Build processes for agreement on domain-specific standards for the various data products that we are producing, including primary data collection standards

Task

Take some datasets and algorithms as examples and publish them with existing specs.