Pangeo Data Working Group - Meeting Notes

Meeting 3 - October 9, 2019

Agenda:

  1. Review actions 9/11/2019
  2. ICESat-2
  3. Poll group for data survey questions
  4. EPIC / NOAA - Industry day
  5. AOB

Participants:

Norman Barker, Aimee Barciauskas, Matthew Hanson, Luke Madaus, Jeff Sadler, Charles Blackmon-Luca

Apologies:

None

Review actions

  1. [X] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
  2. [X] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)
  3. [X] Fix Zoom link for future meetings - Daniel Rothenberg

Actions carried over

None

MH - STAC proxy to NASA CMR (Common Metadata Repository). ICESat-2 is one of the datasets; there is interest in making it cloud “friendly”, since most products are currently in HDF5. The converted dataset will be added back into CMR as a new collection (test, not prod). Will be using ATL06 (non-gridded product). ICESat-2 products: https://icesat-2.gsfc.nasa.gov/science/data-products

Charles - ESM Collection Specs in progress, would like to merge with STAC extension development in time

MH - Document the benefits of the different cloud formats (COG, Zarr, or TileDB); filtering is a good example of a typical action. How does a collection of assets fit into STAC?

Provide ICESat-2 in multiple formats so we can analyze and gauge performance.

Define patterns for laying out the variables of a dataset in COG, Zarr, TileDB, or another format.
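
To make the filtering comparison concrete, here is a minimal sketch that counts how many chunks a windowed read would touch under different chunk layouts. The array shape, window, and chunk sizes are hypothetical and not taken from any real product; the only assumption is that a chunked format (Zarr chunks, TileDB tiles, COG internal tiles) fetches just the chunks that intersect the requested window.

```python
from itertools import product


def chunks_touched(window, chunk_shape):
    """Return the chunk indices that a half-open index window intersects.

    window: sequence of (start, stop) pairs, one per dimension (stop exclusive).
    chunk_shape: chunk length along each dimension.
    """
    ranges = []
    for (start, stop), size in zip(window, chunk_shape):
        if stop <= start:
            return []  # an empty selection touches no chunks
        first = start // size
        last = (stop - 1) // size
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Hypothetical (time, y, x) array of shape (365, 4000, 4000).
# The filter asks for the full time series over a small 100 x 100 area.
window = [(0, 365), (1000, 1100), (2000, 2100)]

# Hypothetical chunk layouts to compare.
layouts = {
    "one time step per chunk": (1, 4000, 4000),
    "small spatial tiles, full time": (365, 100, 100),
    "balanced": (30, 500, 500),
}

for name, chunk_shape in layouts.items():
    count = len(chunks_touched(window, chunk_shape))
    print(f"{name}: {count} chunk(s) read")
```

The same window can cost one read or hundreds depending on the layout, which is why the layout pattern should follow the expected access pattern rather than the format alone.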

STAC sprint - https://docs.google.com/forms/d/e/1FAIpQLSeGbt8sji0H9Kd0RMS1AMlk7KmWwIfRu-P3G3pehxPTGZh0rw/viewform

EPIC - https://www.fbo.gov/index?s=opportunity&mode=form&id=2de56675c6e0556d3e08d9e01b691bce&tab=core&_cview=1

Tentatively

New Actions

  1. Take existing Pangeo datasets and make STAC collections (master list is in pangeo-datastore)
  2. Think of questions for the data survey (use cases and specific datasets), add them to these notes, and review in the next meeting
  3. Define patterns for laying out variables; how do we want to do that? Maybe a one-page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t work particularly well.
     1. TileDB - NB
     2. COGs - MH, AB (may slip timeframe)
     3. Zarr - AB
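
As a starting point for the first action, here is a minimal sketch of a STAC Collection built as a plain Python dict. The field names assume the STAC 1.0.0 Collection spec, which may differ from the STAC version in use by the group; the dataset id, description, extent, and URL are hypothetical placeholders, not entries from pangeo-datastore.

```python
import json

# Fields required on a Collection (assumption: STAC 1.0.0 Collection spec).
REQUIRED_FIELDS = ("type", "stac_version", "id", "description",
                   "license", "extent", "links")


def make_collection(dataset_id, description, bbox, start, end=None,
                    license_id="proprietary", self_href=None):
    """Build a minimal STAC Collection as a plain dict.

    bbox is [west, south, east, north]; start and end are RFC 3339 strings,
    with end=None meaning an open-ended (ongoing) dataset.
    """
    links = []
    if self_href is not None:
        links.append({"rel": "self", "href": self_href})
    return {
        "type": "Collection",
        "stac_version": "1.0.0",
        "id": dataset_id,
        "description": description,
        "license": license_id,
        "extent": {
            "spatial": {"bbox": [list(bbox)]},
            "temporal": {"interval": [[start, end]]},
        },
        "links": links,
    }


def missing_fields(collection):
    """Return the required fields that are absent from a collection dict."""
    return [name for name in REQUIRED_FIELDS if name not in collection]


# Hypothetical entry; not a real pangeo-datastore dataset.
collection = make_collection(
    dataset_id="example-dataset",
    description="Placeholder description for an example dataset.",
    bbox=[-180.0, -90.0, 180.0, 90.0],
    start="2000-01-01T00:00:00Z",
    self_href="https://example.com/stac/example-dataset/collection.json",
)
print(json.dumps(collection, indent=2))
```

Keeping the collection as plain JSON makes it easy to generate one file per dataset from the master list and to validate it with a STAC tool later.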

Meeting 2 - September 25, 2019

Agenda:

  1. Review actions 9/11/2019
     1. Note: little action due to absence of chairs (vacation)
  2. GDAL API
  3. Data survey
  4. ICESat-2
     1. Converting to cloud native and making STAC catalog
  5. Any other business (AOB)

Actions carried over:

  1. [ ] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
  2. [ ] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)

New Action

Fix Zoom link for future meetings - Daniel Rothenberg

GDAL API

Norman: Adding TileDB support to GDAL's new multi-dimensional API (RFC 75), extending the existing two-dimensional support: https://gdal.org/development/rfc/rfc75_multidimensional_arrays.html

This work is ongoing.

Daniel: Xarray integration - open an issue on the Xarray package; open a more general multi-dimensional API patch

Rasterio integration - this is ongoing, https://rasterio.groups.io/g/dev/topic/33040759#100

Data Survey

Norman to create strawman, to be reviewed by the attendees of this meeting.

ICESat-2 - reach out to Ryan A for more touch points.

AOB

NOAA EPIC (Earth Prediction Innovation Capability) project; we should be aware of this. Weather research and improvement act, $50M in the 2020 government budget. Aims to improve numerical weather prediction, with cloud-native infrastructure for modeling suites. Industry and vendor work day Oct 8th, Silver Spring, MD.

Rich Signell (missed the meeting, post facto comments)

With regard to EPIC, Harry House (the Director of Cloud Services for USGS) and I traveled to NOAA in Silver Spring last week and met with DaNa Corliss (who will likely lead the EPIC effort, I gather), Bill Lapenta (Director of NCEP), and a number of other NOAA people. We gave a Pangeo presentation, pitching it as a perfect community collaboration platform for NWS and EPIC data, and they were interested enough to continue the conversation. We suggested some easy-win demonstrations, like getting a rolling archive of HRRR forecast data delivered as Zarr on S3 and showing some cool Pangeo notebooks that crunch and visualize that data. Jebb Stewart said he was already working with Xarray, Dask, and Zarr, and thought that should be pretty straightforward. I have shared with them this simple demo of some NWS HRRR data in Zarr that I created using the Unidata THREDDS Data Server "best time series" dataset:

https://nbviewer.jupyter.org/gist/rsignell-usgs/5618a19280448a9b2e76af08056ea1ca 

and this repo with a "run on binder" link:

https://github.com/reproducible-notebooks/hrrr-zarr (note that it can take a while for the cluster to spin up)

Regarding rasterio, I was excited to learn about rioxarray, which I only discovered recently. It has added the ability to load COG overviews, like this:

https://nbviewer.jupyter.org/gist/rsignell-usgs/f4dd62ad1274c5b5ed69e5a6b81c1295 
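
For context on why overviews matter, here is a rough sketch of how overview sizes shrink and how one might choose a level that fits a pixel budget. It assumes power-of-two decimation factors (2, 4, 8, ...) with level 0 as the first overview; a real file may hold fewer levels or different factors, and the function names and image size are hypothetical.

```python
def overview_shape(height, width, level):
    """Shape of overview `level` (0 = first overview), assuming factors 2, 4, 8, ..."""
    if level < 0:
        raise ValueError("level must be >= 0")
    factor = 2 ** (level + 1)
    # Ceiling division so a partial block at the edge still gets a pixel.
    return (-(-height // factor), -(-width // factor))


def pick_overview(height, width, max_pixels, max_level=10):
    """Return the finest overview level that fits in max_pixels.

    Returns None when the full-resolution image already fits.
    """
    if height * width <= max_pixels:
        return None
    for level in range(max_level + 1):
        h, w = overview_shape(height, width, level)
        if h * w <= max_pixels:
            return level
    return max_level


# Hypothetical 10980 x 10980 image with a budget of one million pixels.
level = pick_overview(10980, 10980, max_pixels=1_000_000)
print(level, overview_shape(10980, 10980, level))
```

Reading a coarse overview instead of the full-resolution band keeps quick-look plots cheap, since only a small fraction of the file has to be fetched.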

Participants:

Daniel Rothenberg, Justin Minsk, Norman Barker, Charles Blackmon-Luca, Aimee Barciauskas

Apologies:

Matt Hanson

Meeting 1 - September 11, 2019

Participants: Daniel Rothenberg, Justin Minsk, Norman Barker, Matthew Hanson, Luke Madaus, Charles Blackmon-Luca

Action Items