Pangeo Data Working Group - Meeting Notes

Meeting 12 - May 6th, 2020

Agenda:

Participants:

   Norman Barker

Apologies:

Actions

Meeting 11 - April 8, 2020

Agenda:

Participants:

   Norman Barker

Apologies:

Actions

Meeting 10 - March 11, 2020

Agenda:

  1. Survey charts
  2. Cloud formats
  3. Available tooling

Participants:

   Norman Barker, Justin Minsk, Dave Bianco

Apologies:

Actions

  1. Justin to create charts, if not possible due to time constraints then we will delegate
  2. Recreate NASA benchmarks with Pangeo
  1. Highlight r/w+ issues with Zarr and others (Norman)

Notes

Discussed GDAL relevance. There is still no GDAL support for Zarr, but people are still using GDAL for regular work. Is there a gap in tooling?

Time series and columnar query support is missing with COGs, e.g. single pixel across multiple TIFFs.
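To make the gap concrete, here is a rough sketch that counts how many separate reads a single-pixel time series needs under two layouts. The scene count, chunk length and function names are illustrative assumptions, not measurements from any real dataset, and the COG count is a lower bound (header reads are ignored).

```python
import math


def reads_for_pixel_series_cogs(n_timesteps: int) -> int:
    """One COG per time step: the pixel lives in a different file for
    every step, so at least one ranged read per file is needed."""
    return n_timesteps


def reads_for_pixel_series_chunked(n_timesteps: int, time_chunk: int) -> int:
    """Chunked N-d array: the pixel sits in one spatial chunk column, so the
    number of reads is the number of chunks along the time axis."""
    return math.ceil(n_timesteps / time_chunk)


n = 365  # hypothetical: one year of daily scenes
print(reads_for_pixel_series_cogs(n))          # 365
print(reads_for_pixel_series_chunked(n, 30))   # 13
```

The chunked layout only wins for this access pattern if the array is chunked along time; a layout chunked one time step at a time degenerates to the same count as the per-file case.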

Even with Zarr there are some I/O gotchas (both read and write) at scale, although reads generally work fine. How do we go from GRIB and HDF to Zarr? Appending to Zarr is difficult.

Meeting 9 - February 26, 2020

Agenda:

  1. Survey charts
  2. STAC
  3. Cloud formats

Participants:

Norman Barker, Aimee Barciauskas, Matt Hanson

Apologies:

Actions

1. Review Cloud Data Formats document in more detail in the next meeting

2. Review survey charts

3. Discuss available tooling within the pangeo community

Notes

Meeting 9 - February 12, 2020

Agenda:

  1. ZARR, STAC, TileDB …
  2. Survey
  3. GDAL

Participants:

Justin Minsk, Norman Barker, Dave Bianco, Aimee Barciauskas

Apologies:

Actions

Notes

Survey is now closed. Justin has taken the action to graph the results and we will discuss further in the next meeting how to blog.

Zarr - lots of discussion around the use of Zarr. It seems great for cloud reads, but appending in parallel from dask/xarray in the cloud has been a little problematic.
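One plausible source of trouble with parallel appends is two writers touching the same chunk, which generally needs coordination between them. The sketch below is a simplified, library-free illustration (it does not call Zarr, dask or xarray); the chunk length and writer ranges are made-up values.

```python
def chunks_touched(start: int, stop: int, chunk_len: int) -> set[int]:
    """Indices of chunks overlapped by the half-open range [start, stop)."""
    if stop <= start:
        return set()
    return set(range(start // chunk_len, (stop - 1) // chunk_len + 1))


def shared_chunks(writers: list[tuple[int, int]], chunk_len: int) -> set[int]:
    """Chunks written by more than one writer (these need coordination)."""
    seen: set[int] = set()
    shared: set[int] = set()
    for start, stop in writers:
        touched = chunks_touched(start, stop, chunk_len)
        shared |= seen & touched
        seen |= touched
    return shared


# Hypothetical: chunks of 10 time steps along the append dimension.
aligned = [(0, 10), (10, 20), (20, 30)]
misaligned = [(0, 12), (12, 24), (24, 30)]
print(shared_chunks(aligned, 10))     # set()
print(shared_chunks(misaligned, 10))  # {1, 2}
```

When each writer's range lines up with chunk boundaries, no chunk is shared; when ranges straddle boundaries, neighbouring writers end up updating the same chunk.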

Meeting 8 - January 29, 2020

Agenda:

Participants:

Justin Minsk, Aimee Barciauskas, Dave Bianco

Actions

Notes

Meeting 7 - January 15, 2020

Agenda:

  1. Survey status
  2. Other
  1. TileDB and GDAL

Participants:

Justin Minsk, Norman Barker, Aimee Barciauskas, Dave Bianco

Apologies

Matt Hanson

Actions

Re-advertise survey.

Notes

Meeting 6 - December 18, 2019

Agenda:

  1. Review actions 12/4/2019
  2. Survey status
  3. Other

Participants:

Cancelled

Apologies

Norman Barker

Actions

Notes

Meeting 5 - December 4, 2019

Agenda:

  1. Review actions 11/20/2019
  2. Review survey
  3. Review subject header / tweet for survey

Participants:

Justin Minsk

Norman Barker

Dave Bianco

Kevin Paul

Matt Hanson

Actions

  1. [] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
  2. Draft blog post about why we are doing a survey (Norman)
  3. Publish survey 12/8

Notes

Carry over action to make STAC collections from existing datasets to the next meeting.

Reviewing survey.

Use “Help shape Pangeo’s catalog and data format trajectory!” as our survey heading.

When should we publish? Consensus: early in the week of 12/8.

Justin will contact Ryan to get survey link tweeted!

Meeting 4 - November 20, 2019

Agenda:

  1. Review actions 10/23/2019
  2. STAC sprint report

Participants:

Matthew Hanson

Justin Minsk

Dave Bianco @talldave

Rich Signell (USGS) @rsignell-usgs on github

Aimee Barciauskas

Norman Barker

Scott Henderson

Apologies:

Review Actions from last meeting

  1. [] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
  2. [] Define patterns for laying out variables: how do we want to do that? Maybe a one-page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.
  1. TileDB - NB
  2. COGs - MH, AB (may slip timeframe)
  3. Zarr - AB

Notes: Keep action 1 (STAC) open and revisit.

Survey questions

  1. What is your primary role? e.g. scientist, engineer, manager
  2. Which public datasets do you use in your work?
  3. What are the challenges you face when using these datasets, e.g. access, discovery, reliability
  4. Do you wish to use cloud native datasets and perform cloud based computing?
  5. Which data formats do you use or want to use?
  6. What catalogs do you use (if any)? Have you made any attempts to create your own catalogs?
  7. How are you doing data transfer, e.g. between multiple offices?
  8. Where is your data stored?
  9. Are you doing any data versioning / provenance marking, how do you handle this, how do you link back, which tools?
  10. What software do you use for data processing, is speed important?
  11. How much of your time is currently spent pre-processing data as opposed to processing?
  12. How important is compression to you, is lossy ok and if so at what threshold?
  13. Does your data benefit from a sparse representation?
  14. Is there anything else you would like to tell us?

New Actions:

  1. Justin will create google form with these survey questions
  1. (Recreating on Personal Gmail, seems like my work blocks anyone outside of the company from editing)
  2. New share link for editing: https://docs.google.com/forms/d/1CwfsKberK1x6IGMcJ7uicX0chjGbf4l6BQKdqY9q158/edit?usp=sharing 

Future discussion topic (revisit in a month):

  1. Better understand crosswalk between different catalog implementations, e.g. data.gov -> STAC, be aware of ISO/CSW activities

Notes:

CSW/Thredds and STAC? How to do this harmonization? STAC CMR proxy service.

Look at Parquet in addition to Zarr, TileDB, HDF and netCDF, and be able to explain why.

Survey notes:

Anonymous collection

What is your role, are you familiar with pangeo tools

Tweet from pangeo and link to google form - Ryan

Also send out to pangeo mailing list - Ryan

https://discourse.pangeo.io/ forum and Gitter channel

Create sentence as to why this survey is useful? Create intro in google form.

Launch before Dec 8th, provisionally aim for sending out on the 5th Dec.

Please put ideas below;

“Help shape Pangeo’s catalog and data format trajectory!”

Meeting 3 - October 23, 2019

Agenda:

  1. Review actions 10/9/2019
  2. Poll group for data survey questions
  3. AOB

Participants:

Norman Barker
Kevin Donkers

Justin Minsk

Aimee Barciauskas

Apologies:

None

Review actions

  1. [] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
  2. [x] Think of questions for data survey (use cases and specific datasets), add them to these notes and review in the next meeting
  3. [] Define patterns for laying out variables: how do we want to do that? Maybe a one-page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.
  1. TileDB - NB
  2. COGs - MH, AB (may slip timeframe)
  3. Zarr - AB

 revisit action (3) in mid-november

Actions carried over

  1. Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
  2. Define patterns for laying out variables: how do we want to do that? Maybe a one-page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.
  1. TileDB - NB
  2. COGs - MH, AB (may slip timeframe)
  3. Zarr - AB

 revisit action above in mid-november

New Actions

Notes:

Icesat-2: Been in discussions with UW, thinking about which format is best and how to surface metadata. ATL6 is a point cloud format; a query needs to specify a bounding box and time range, and filtering is done by point. There is no real granule-level metadata. Currently stored in HDF. In terms of STAC, this is a collection with an asset at the top level.
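As a rough illustration of the two ideas in this note, the sketch below builds a minimal STAC-style Collection dictionary with a collection-level asset, then filters points by bounding box and time range. All IDs, hrefs, coordinates, dates and field names are placeholders; the dictionary follows the general shape of a STAC Collection but has not been validated against the spec.

```python
from datetime import datetime, timezone

# Hypothetical collection: one asset attached at the collection level,
# with no per-granule items.
collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "icesat2-atl6-example",           # placeholder ID
    "description": "Placeholder description",
    "license": "proprietary",               # placeholder
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["2019-01-01T00:00:00Z", None]]},
    },
    "links": [],
    "assets": {
        "data": {"href": "s3://example-bucket/atl6.h5",   # placeholder href
                 "type": "application/x-hdf5"},
    },
}


def filter_points(points, bbox, start, end):
    """Keep points inside bbox (west, south, east, north) and [start, end]."""
    west, south, east, north = bbox
    return [
        p for p in points
        if west <= p["lon"] <= east
        and south <= p["lat"] <= north
        and start <= p["time"] <= end
    ]


utc = timezone.utc
points = [  # made-up sample points
    {"lon": -50.0, "lat": 70.0, "time": datetime(2019, 1, 5, tzinfo=utc)},
    {"lon": 10.0, "lat": 70.0, "time": datetime(2019, 1, 6, tzinfo=utc)},   # outside bbox
    {"lon": -49.0, "lat": 69.5, "time": datetime(2019, 3, 1, tzinfo=utc)},  # outside time range
]
selected = filter_points(
    points,
    bbox=(-55.0, 68.0, -45.0, 72.0),
    start=datetime(2019, 1, 1, tzinfo=utc),
    end=datetime(2019, 1, 31, tzinfo=utc),
)
print(len(selected))  # 1
```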

Earth System Model (ESM) metadata collection specification: https://github.com/NCAR/esm-collection-spec

ESM looks like STAC but isn’t. Good opportunity to transfer this to STAC. Will raise this at the STAC sprint.

Pangeo/STAC hackathon, 2020?

Data survey:

Target audience

The audience is primarily scientists, and includes agencies such as Met Office and NOAA.

We will reach out to;

  1. Everyone who is part of Pangeo as a start
  2. CSIRO climate forecasters (https://research.csiro.au/dfp/ | Thomas Moore + Dougie Squire)
  3. CEDA / RAL - Matt Pryor (https://github.com/mkjpryor-stfc | https://people.ncas.ac.uk/people/view/370)
  4. GeoScience Australia (OpenDataCube - Digital Earth Africa / Australia)

Questions;

  1. What datasets are they using
  2. What are the challenges (access, discovery, cloud native)? Do you use cloud-based computing?
  3. Formats (less important than the above for now)
  4. What catalogs do you use, if any? Any attempts to create your own?
  5. How are you doing data transfer, multiple offices, cloud
  6. Where is your data
  7. Data versioning / provenance, how do you handle this, how do you link back, which tools
  8. How are you processing, is speed important
  9. Compression, is lossy ok and if so at what threshold

Tools for survey;

Survey Monkey, Google Forms (Justin Minsk is willing to create a Google Form after questions are decided on)

Report;

Blog post (report style), let the data be available for others to use.

Meeting 3 - October 9, 2019

Agenda:

  1. Review actions 9/11/2019
  2. ICESAT 2
  3. Poll group for data survey questions
  4. EPIC / NOAA - Industry day
  5. AOB

Participants:

Norman Barker, Aimee Barciauskas, Matthew Hanson, Luke Madaus, Jeff Sadler, Charles Blackmon-Luca

Apologies:

None

Review actions

  1. [X] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
  2. [X] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)
  3. [X] Fix Zoom link for future meetings - Daniel Rothenberg

Actions carried over

None

MH - STAC proxy to NASA CMR (Common Metadata Repository). ICESAT 2 is one of the datasets; there is interest in making it cloud “friendly”, as currently most products are in HDF5. Add this new dataset back as a new collection into CMR (test, not prod). Will be using ATL6 (non-gridded product). IceSat-2 Products: https://icesat-2.gsfc.nasa.gov/science/data-products

Charles - ESM Collection Specs in progress, would like to merge with STAC extension development in time

MH - document benefits of different cloud formats (COGs, ZARR or TileDB); filtering is a good example of a typical action. How does a collection of assets fit into STAC?

Provide IceSat2 as multiple different formats so we can analyze and gauge performance.

Define patterns for laying out variables for a dataset in COG, ZARR or TileDB or another format.

STAC sprint - https://docs.google.com/forms/d/e/1FAIpQLSeGbt8sji0H9Kd0RMS1AMlk7KmWwIfRu-P3G3pehxPTGZh0rw/viewform

EPIC - https://www.fbo.gov/index?s=opportunity&mode=form&id=2de56675c6e0556d3e08d9e01b691bce&tab=core&_cview=1

Tentatively

New Actions

  1. Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
  2. Think of questions for data survey (use cases and specific datasets), add them to these notes and review in the next meeting
  3. Define patterns for laying out variables: how do we want to do that? Maybe a one-page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.
  1. TileDB - NB
  2. COGs - MH, AB (may slip timeframe)
  3. Zarr - AB

Meeting 2 - September 25, 2019

Agenda:

  1. Review actions 9/11/2019
     1. Note: little action due to absence of chairs (vacation)
  2. GDAL API
  3. Data survey
  4. Icesat2
     1. converting to cloud native and making STAC catalog
  5. Any other business (AOB)

Actions carried over;

  1. [ ] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
  2. [ ] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)

New Action

Fix Zoom link for future meetings - Daniel Rothenberg

GDAL API

Norman: Adding additional TileDB support to GDAL by extending the existing support for two dimensions to the new multi-dimensional API, https://gdal.org/development/rfc/rfc75_multidimensional_arrays.html 

This work is ongoing.

Daniel: XArray integration - open up issue on XArray package, open up a more general multi-dimensional api patch

Rasterio integration - this is ongoing, https://rasterio.groups.io/g/dev/topic/33040759#100

Data Survey

Norman to create strawman, to be reviewed by the attendees of this meeting.

Icesat 2 - reach out to Ryan A for more touch points.

AOB

NOAA Epic project, we should be aware of this. Weather research and improvement act, $50m 2020 gov budget. Improve numerical weather prediction. Earth Prediction Innovation Capability. Cloud Native Infrastructure for modeling suites. Industry and Vendor Work day Oct 8th, Silver Spring, MD.

 Rich Signell (missed the meeting, post facto comments)

With regard to EPIC,  Harry House (the Director of Cloud Services for USGS) and I traveled to NOAA in Silver Spring last week and met with DaNa Corliss (who will likely lead the EPIC effort, I gather), Bill Lapenta (Director of NCEP) and a bunch of other NOAA people.    We gave a Pangeo presentation, pitching it as a perfect community collaboration platform for NWS and EPIC data and they were interested enough to continue the conversation.  We were suggesting some easy win demonstrations, like getting a rolling archive of HRRR forecast data delivered as Zarr on S3 and showing some cool Pangeo notebooks that crunch and visualize that data.   Jebb Stewart said he was already working with Xarray, Dask and Zarr, and thought that should be pretty straightforward. I have shared with them this simple demo of some NWS HRRR data in Zarr that I created using the Unidata THREDDS dataserver "best time series" dataset:

https://nbviewer.jupyter.org/gist/rsignell-usgs/5618a19280448a9b2e76af08056ea1ca 

and this repo with a "run on binder" link:

 https://github.com/reproducible-notebooks/hrrr-zarr   (note that it can take a while for the cluster to spin up )

Regarding rasterio, I was excited to learn about rioxarray, which I didn't know about until recently.    It has recently added the ability to load COG overviews, like this:

https://nbviewer.jupyter.org/gist/rsignell-usgs/f4dd62ad1274c5b5ed69e5a6b81c1295 

Participants:

Daniel Rothenberg, Justin Minsk, Norman Barker, Charles Blackmon-Luca, Aimee Barciauskas

Apologies:

Matt Hanson

Meeting 1 - September 11, 2019

Participants: Daniel Rothenberg, Justin Minsk, Norman Barker, Matthew Hanson, Luke Madaus, Charles Blackmon-Luca

Action Items