Pangeo Data Working Group

Pangeo Data Working Group - Meeting Notes

Meeting 12 - May 6th, 2020

Agenda:

Participants:

Norman Barker

Apologies:

Actions

Meeting 11 - April 8, 2020

Agenda:

Participants:

Norman Barker

Apologies:

Actions

Meeting 10 - March 11, 2020

Agenda:

Survey charts
Cloud formats
Available tooling

Participants:

Norman Barker, Justin Minsk, Dave Bianco

Apologies:

Actions

Justin to create charts; if time constraints make this impossible, we will delegate
Recreate NASA benchmarks with Pangeo

Highlight r/w+ issues with Zarr and others (Norman)

Notes

Discussed GDAL relevance: GDAL still has no Zarr support, but people are still using GDAL for regular work. Is there a gap in tooling?

Time series and columnar query support is missing with COGs, e.g. single pixel across multiple TIFFs.
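To make the gap concrete, here is a rough Python sketch (not from the meeting) that counts how many separate reads are needed to pull one pixel's time series under two layouts: one TIFF per time step versus a 3-D array chunked along time. The array shape, tile size, and chunk sizes are made-up numbers, and `chunks_touched` is a hypothetical helper.

```python
from math import ceil

def chunks_touched(selection_shape, chunk_shape):
    """Count chunks intersected by a selection that starts on a chunk boundary.

    Both arguments are (time, y, x) tuples. Assumes the selection is aligned
    to the chunk grid, so this is a lower bound for unaligned reads.
    """
    count = 1
    for sel, chunk in zip(selection_shape, chunk_shape):
        count *= ceil(sel / chunk)
    return count

# Hypothetical stack: 365 daily scenes of 10980 x 10980 pixels.
n_times, height, width = 365, 10980, 10980
pixel_series = (n_times, 1, 1)  # one pixel, every time step

# Layout 1: one TIFF per time step with 512 x 512 internal tiles.
per_file_layout = (1, 512, 512)
# Layout 2: a 3-D array chunked deep in time and small in space.
time_major_layout = (365, 64, 64)

tiff_reads = chunks_touched(pixel_series, per_file_layout)
cube_reads = chunks_touched(pixel_series, time_major_layout)
print(f"one file per step: {tiff_reads} tile reads across {n_times} files")
print(f"time-chunked cube: {cube_reads} chunk read(s)")
```

The trade-off runs the other way for a full-scene read at a single time step, which is why the preferred layout depends on the query pattern.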

Even with Zarr there are some I/O gotchas (both read and write) at scale. Reads mostly work fine; the open question is how we go from GRIB or HDF to Zarr. Appending to Zarr is difficult.

Meeting 9 - February 26, 2020

Agenda:

Survey charts
STAC
Cloud formats

Participants:

Norman Barker, Aimee Barciauskas, Matt Hanson

Apologies:

Actions

1. Review Cloud Data Formats document in more detail in the next meeting

2. Review survey charts

3. Discuss available tooling within the pangeo community

Notes

Matt - STAC release 0.9

Chris Holmes working on a blog post

Matt - converting Sentinel-2 L2 to COGs and creating STAC for both the input (JPEG2000) and the COG output, as different collections - DE Africa

Cost for processing is $3200, does not include importing data from AWS
Going to be mirrored in the new Cape Town region
Hoping to get funding from AWS to pay for entire globe

Joe and Matt also working on a blog post on intake and intake stac
Cloud Data Formats study

How can Pangeo group be involved?
Interested in how they developed the benchmarks
Glad TileDB was included, also interested that MRF was included
Recreate benchmarks within Pangeo?
Is gdal always a good fit? Not necessarily for Pangeo, Zarr

Everyone else not using lat / lon - you need something to do subsetting

Meeting 9 - February 12, 2020

Agenda:

ZARR, STAC, TileDB …
Survey
GDAL

Participants:

Justin Minsk, Norman Barker, Dave Bianco, Aimee Barciauskas

Apologies:

Actions

Notes

Survey is now closed. Justin has taken the action to graph the results and we will discuss further in the next meeting how to blog.
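For the graphing action, a minimal sketch of tallying answers from a CSV export of the form might look like the following. The column names and the sample rows are placeholders, not real survey responses; the role categories are the ones given as examples in the survey's first question.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for a CSV export of the form; NOT real responses.
SAMPLE_EXPORT = """\
role,formats
scientist,netCDF;Zarr
engineer,COG;Zarr
scientist,HDF5
manager,netCDF
"""

def tally(csv_text, column, separator=None):
    """Count answers in one column; split multi-choice cells on `separator`."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = row[column].strip()
        answers = cell.split(separator) if separator else [cell]
        counts.update(a.strip() for a in answers if a.strip())
    return counts

def text_bar_chart(counts):
    """Render counts as a simple text bar chart, most common first."""
    width = max(len(label) for label in counts)
    return "\n".join(
        f"{label.ljust(width)} | {'#' * n} {n}" for label, n in counts.most_common()
    )

role_counts = tally(SAMPLE_EXPORT, "role")
format_counts = tally(SAMPLE_EXPORT, "formats", separator=";")
print(text_bar_chart(role_counts))
print(text_bar_chart(format_counts))
```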

Zarr - lots of discussion around the use of Zarr. It seems great for cloud reads, but appending in parallel from dask/xarray in the cloud has been a little problematic.
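One plausible source of the append trouble, offered here as an assumption rather than something recorded in the meeting: appending grows the array, and in the Zarr v2 layout the array shape lives in a single `.zarray` metadata document, so uncoordinated writers can overwrite each other's updates. The toy below models that with a plain dict in place of an object store; `ToyStore`, `read_shape`, `write_shape`, and `append_rows` are invented names, not Zarr API.

```python
import json

class ToyStore(dict):
    """Stand-in for an object store: keys map to bytes, no locking."""

def read_shape(store):
    return json.loads(store[".zarray"])["shape"]

def write_shape(store, shape):
    store[".zarray"] = json.dumps({"shape": shape}).encode()

def append_rows(store, known_rows, n_new, writer):
    """Write n_new one-row chunks after `known_rows`, then publish the new length.

    `known_rows` is the length the writer read earlier. If another writer has
    appended since then, this call reuses the same chunk keys and resets the
    published shape to its own stale view.
    """
    for i in range(n_new):
        store[f"{known_rows + i}.0"] = f"data from {writer}".encode()
    write_shape(store, [known_rows + n_new, 10])

store = ToyStore()
write_shape(store, [0, 10])

# Both writers read the length before either one publishes its update.
seen_by_a = read_shape(store)[0]
seen_by_b = read_shape(store)[0]
append_rows(store, seen_by_a, 3, "A")
append_rows(store, seen_by_b, 2, "B")

final_rows = read_shape(store)[0]
print(f"rows appended: 5, rows recorded: {final_rows}")
print("chunk 0.0 holds:", store["0.0"].decode())
```

In this toy, coordinating the writers (for example, giving each one a disjoint, pre-allocated region of a fixed-size array) would avoid the shared update altogether.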

Meeting 8 - January 29, 2020

Agenda:

Participants:

Justin Minsk, Aimee Barciauskas, Dave Bianco

Actions

Notes

V1 Results for survey: https://docs.google.com/spreadsheets/d/1rehxEGnkVsxRAVmcKDXQV4nQKa5oyJRJowVuspyqW90/edit?usp=sharing
With only 3 attendees, postponed meatier topics to next session

Meeting 7 - January 15, 2020

Agenda:

Survey status
Other

TileDB and GDAL

Participants:

Justin Minsk, Norman Barker, Aimee Barciauskas, Dave Bianco

Apologies

Matt Hanson

Actions

Re-advertise survey;

Retweet through Pangeo - Justin
See if we can publish this on GDAL mailing list - Norman
Discourse, Pangeo, https://discourse.pangeo.io/ - Dave

Notes

Link to survey: https://forms.gle/pv3ywuuiWnr4MkSx8

Email me for results: justin.minsk@gmail.com

Aimee to discuss her project and work with zarr datasets (next meeting)
Climacell CBAM India: https://console.cloud.google.com/marketplace/details/climacell/cbam?filter=solution-type:dataset&filter=category:climate&id=2711b322-9612-4ebd-96b9-bb9dcf7e8239

Meeting 6 - December 18, 2019

Agenda:

Review actions 12/4/2019
Survey status
Other

Participants:

Cancelled

Apologies

Norman Barker

Actions

Notes

Meeting 5 - December 4, 2019

Agenda:

Review actions 11/20/2019
Review survey
Review subject header / tweet for survey

Participants:

Justin Minsk

Norman Barker

Dave Bianco

Kevin Paul

Matt Hanson

Actions

[] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
Draft blog post about why we are doing a survey (Norman)
Publish survey 12/8

Notes

Carry over action to make STAC collections from existing datasets to the next meeting.
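As a starting point for that action, a STAC Collection is a small JSON document. The sketch below builds one with the Python standard library. The field names follow the STAC Collection specification as of the 0.9 release discussed in a later meeting; the dataset id, description, bounding box, dates, license, and URL are placeholders rather than an entry from pangeo-datastore, and `make_collection` is a hypothetical helper.

```python
import json

# Fields the STAC 0.9 Collection spec lists as required.
REQUIRED_FIELDS = ("stac_version", "id", "description", "license", "extent", "links")

def make_collection(coll_id, description, bbox, start, end, license_id, self_href):
    """Build a minimal STAC Collection as a plain dict."""
    return {
        "stac_version": "0.9.0",
        "id": coll_id,
        "description": description,
        "license": license_id,
        "extent": {
            # bbox order is [west, south, east, north]
            "spatial": {"bbox": [list(bbox)]},
            # an end of None (JSON null) marks an open-ended time range
            "temporal": {"interval": [[start, end]]},
        },
        "links": [{"rel": "self", "href": self_href}],
    }

collection = make_collection(
    coll_id="example-gridded-dataset",  # placeholder id
    description="Placeholder description of a gridded dataset.",
    bbox=(-180.0, -90.0, 180.0, 90.0),
    start="1993-01-01T00:00:00Z",
    end=None,
    license_id="CC-BY-4.0",  # placeholder license
    self_href="https://example.com/stac/example-gridded-dataset/collection.json",
)

missing = [name for name in REQUIRED_FIELDS if name not in collection]
print(json.dumps(collection, indent=2))
print("missing required fields:", missing)
```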

Reviewing survey.

Use “Help shape Pangeo’s catalog and data format trajectory!” as our survey heading.

When should we publish? Consensus: early in the week of 12/8

Justin will contact Ryan to get survey link tweeted!

Meeting 4 - November 20, 2019

Agenda:

Review actions 10/23/2019
STAC sprint report

Participants:

Matthew Hanson

Justin Minsk

Dave Bianco @talldave

Rich Signell (USGS) @rsignell-usgs on github

Aimee Barciauskas

Norman Barker

Scott Henderson

Apologies:

Review Actions from last meeting

[] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
[] Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.

TileDB - NB
COGs - MH, AB (may slip timeframe)
Zarr - AB

Notes: Keep action 1 (STAC) open and revisit.

Survey questions

What is your primary role? e.g. scientist, engineer, manager
Which public datasets do you use in your work?
What are the challenges you face when using these datasets, e.g. access, discovery, reliability
Do you wish to use cloud native datasets and perform cloud based computing?
Which data formats do you use or want to use?
What catalogs do you use (if any)? Have you attempted to create your own catalogs?
How are you doing data transfer, e.g. between multiple offices?
Where is your data stored?
Are you doing any data versioning / provenance marking, how do you handle this, how do you link back, which tools?
What software do you use for data processing, is speed important?
How much of your time is currently spent pre-processing data as opposed to processing?
How important is compression to you, is lossy ok and if so at what threshold?
Does your data benefit from a sparse representation?
Is there anything else you would like to tell us?

New Actions:

Justin will create google form with these survey questions

(Recreating on Personal Gmail, seems like my work blocks anyone outside of the company from editing)
New share link for editing: https://docs.google.com/forms/d/1CwfsKberK1x6IGMcJ7uicX0chjGbf4l6BQKdqY9q158/edit?usp=sharing

Future discussion topic (revisit in a month):

Better understand crosswalk between different catalog implementations, e.g. data.gov -> STAC, be aware of ISO/CSW activities

Notes:

AGU Pangeo Tutorial happening Dec 8 w/ 65 participants, so could advertise/force people to fill out survey
https://nbviewer.jupyter.org/gist/rsignell-usgs/8628257cee49b9a999d333df2f7593b0 (example notebook that writes zarr from lots of netcdf files, using xarray append to write each time chunk)

CSW/Thredds and STAC? How to do this harmonization? STAC CMR proxy service.

Look at Parquet as well in addition to zarr, tiledb, hdf and netcdf and be able to explain why.

Survey notes:

Anonymous collection

What is your role, are you familiar with pangeo tools

Tweet from pangeo and link to google form - Ryan

Also send out to pangeo mailing list - Ryan

https://discourse.pangeo.io/ forum and Gitter channel

Create sentence as to why this survey is useful? Create intro in google form.

Launch before Dec 8th, provisionally aim for sending out on the 5th Dec.

Please put ideas below;

“Help shape Pangeo’s catalog and data format trajectory!”

Meeting 3 - October 23, 2019

Agenda:

Review actions 10/9/2019
Poll group for data survey questions
AOB

Participants:

Norman Barker
Kevin Donkers

Justin Minsk

Aimee Barciauskas

Apologies:

None

Review actions

[] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
[x] Think of questions for data survey (use cases and specific datasets), add them to these notes and review in the next meeting
[] Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.

TileDB - NB
COGs - MH, AB (may slip timeframe)
Zarr - AB

revisit action (3) in mid-november

Actions carried over

Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.

TileDB - NB
COGs - MH, AB (may slip timeframe)
Zarr - AB

revisit action above in mid-november

New Actions

Notes:

ICESat-2: Been in discussions with UW, thinking about which format is best and how to surface metadata. ATL6 is a point cloud format; a request needs to specify a bounding box and time range, and filtering is done by point. There is no real granule-level metadata. Currently stored in HDF. In terms of STAC, this is a collection with an asset at the top level.
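The access pattern described here, a bounding box plus a time range applied point by point, can be sketched in a few lines of Python. The `Point` record and the sample values are invented for illustration and are not the ATL6 schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Point:
    lon: float
    lat: float
    time: datetime
    height: float

def filter_points(points, bbox, start, end):
    """Keep points inside bbox = (west, south, east, north) with start <= time < end.

    Assumes the box does not cross the antimeridian.
    """
    west, south, east, north = bbox
    return [
        p for p in points
        if west <= p.lon <= east and south <= p.lat <= north and start <= p.time < end
    ]

# Made-up sample points, not real measurements.
points = [
    Point(-45.0, 70.0, datetime(2019, 1, 5), 1510.2),
    Point(-44.5, 70.1, datetime(2019, 3, 9), 1498.7),
    Point(10.0, 45.0, datetime(2019, 1, 6), 312.4),
]

selected = filter_points(
    points,
    bbox=(-50.0, 65.0, -40.0, 75.0),
    start=datetime(2019, 1, 1),
    end=datetime(2019, 2, 1),
)
print(selected)
```

Because every point is tested individually, the cost grows with the number of points scanned, which is one reason the choice of storage format and any spatial or temporal partitioning matters for this kind of product.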

Earth System Model (ESM) metadata collection specification: https://github.com/NCAR/esm-collection-spec

ESM looks like STAC but isn’t. Good opportunity to transfer this to STAC. Will raise this at the STAC sprint.

Pangeo/STAC hackathon, 2020?

Data survey:

Target audience

The audience is primarily scientists, and includes agencies such as Met Office and NOAA.

We will reach out to;

Everyone who is part of Pangeo as a start
CSIRO climate forecasters (https://research.csiro.au/dfp/ | Thomas Moore + Dougie Squire)
CEDA / RAL - Matt Pryor (https://github.com/mkjpryor-stfc | https://people.ncas.ac.uk/people/view/370)
GeoScience Australia (OpenDataCube - Digital Earth Africa / Australia)

Questions;

What datasets are they using
What are the challenges (access, discovery, cloud native, do you use cloud based computing)
Formats (less important than the above for now)
What catalogs do you use (if any), any attempts to create your own
How are you doing data transfer, multiple offices, cloud
Where is your data
Data versioning / provenance, how do you handle this, how do you link back, which tools
How are you processing, is speed important
Compression, is lossy ok and if so at what threshold

Tools for survey;

Survey Monkey, Google Forms (Justin Minsk is willing to create a Google Form after questions are decided on)

Report;

Blog post (report style), let the data be available for others to use.

Meeting 3 - October 9, 2019

Agenda:

Review actions 9/11/2019
ICESAT 2
Poll group for data survey questions
EPIC / NOAA - Industry day
AOB

Participants:

Norman Barker, Aimee Barciauskas, Matthew Hanson, Luke Madaus, Jeff Sadler, Charles Blackmon-Luca

Apologies:

None

Review actions

[X] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
[X] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)
[X] Fix Zoom link for future meetings - Daniel Rothenberg

Actions carried over

None

MH - STAC proxy to NASA CMR (Common Metadata Repository). ICESat-2 is one of the datasets; there is interest in making this cloud “friendly”, as most products are currently in HDF5. Add this new dataset back as a new collection into CMR (test, not prod). Will be using ATL6 (non-gridded product). ICESat-2 products: https://icesat-2.gsfc.nasa.gov/science/data-products

Charles - ESM Collection Specs in progress, would like to merge with STAC extension development in time

MH - document benefits of different cloud formats, COGs, ZARR or TileDB - filtering is a good example of a typical action. How does a collection of assets fit into STAC?

Provide ICESat-2 in multiple different formats so we can analyze and gauge performance.

Define patterns for laying out variables for a dataset in COG, ZARR or TileDB or another format.

STAC sprint - https://docs.google.com/forms/d/e/1FAIpQLSeGbt8sji0H9Kd0RMS1AMlk7KmWwIfRu-P3G3pehxPTGZh0rw/viewform

EPIC - https://www.fbo.gov/index?s=opportunity&mode=form&id=2de56675c6e0556d3e08d9e01b691bce&tab=core&_cview=1

Tentatively

Solicitation in January 2020
Proposal due March 2020
Award in September 2020

New Actions

Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
Think of questions for data survey (use cases and specific datasets), add them to these notes and review in the next meeting
Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t particularly work.

TileDB - NB
COGs - MH, AB (may slip timeframe)
Zarr - AB

Meeting 2 - September 25, 2019

Agenda:

Review actions 9/11/2019

Note: little action due to absence of chairs (vacation)

GDAL API
Data survey
Icesat2

converting to cloud native and making STAC catalog

Any other business (AOB)

Actions carried over;

[ ] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
[ ] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)

New Action

Fix Zoom link for future meetings - Daniel Rothenberg

GDAL API

Norman: Adding additional TileDB support to GDAL by extending the existing support for two dimensions to the new multi-dimensional API, https://gdal.org/development/rfc/rfc75_multidimensional_arrays.html

This work is ongoing.

Daniel: XArray integration - open up issue on XArray package, open up a more general multi-dimensional api patch

Rasterio integration - this is ongoing, https://rasterio.groups.io/g/dev/topic/33040759#100

Data Survey

Norman to create strawman, to be reviewed by the attendees of this meeting.

Daniel + Justin can help review

Icesat 2 - reach out to Ryan A for more touch points.

AOB

NOAA EPIC (Earth Prediction Innovation Capability) project: we should be aware of this. Weather research and improvement act; $50m in the 2020 government budget. Aims to improve numerical weather prediction. Cloud-native infrastructure for modeling suites. Industry and vendor work day Oct 8th, Silver Spring, MD.

Rich Signell (missed the meeting, post facto comments)

With regard to EPIC, Harry House (the Director of Cloud Services for USGS) and I traveled to NOAA in Silver Spring last week and met with DaNa Corliss (who will likely lead the EPIC effort, I gather), Bill Lapenta (Director of NCEP) and a bunch of other NOAA people. We gave a Pangeo presentation, pitching it as a perfect community collaboration platform for NWS and EPIC data and they were interested enough to continue the conversation. We were suggesting some easy win demonstrations, like getting a rolling archive of HRRR forecast data delivered as Zarr on S3 and showing some cool Pangeo notebooks that crunch and visualize that data. Jebb Stewart said he was already working with Xarray, Dask and Zarr, and thought that should be pretty straightforward. I have shared with them this simple demo of some NWS HRRR data in Zarr that I created using the Unidata THREDDS dataserver "best time series" dataset:

https://nbviewer.jupyter.org/gist/rsignell-usgs/5618a19280448a9b2e76af08056ea1ca

and this repo with a "run on binder" link:

https://github.com/reproducible-notebooks/hrrr-zarr (note that it can take a while for the cluster to spin up)

Regarding rasterio, I was excited to learn about rioxarray, which I didn't know about until recently. It has recently added the ability to load COG overviews, like this:

https://nbviewer.jupyter.org/gist/rsignell-usgs/f4dd62ad1274c5b5ed69e5a6b81c1295

Participants:

Daniel Rothenberg, Justin Minsk, Norman Barker, Charles Blackmon-Luca, Aimee Barciauskas

Apologies:

Matt Hanson

Meeting 1 - September 11, 2019

Participants: Daniel Rothenberg, Justin Minsk, Norman Barker, Matthew Hanson, Luke Madaus, Charles Blackmon-Luca

Leadership / Admin

Norman and Matt volunteered to start out as “co-chairs” in order to help boot-strap initial activities
… but if you couldn’t attend the call and would like to take a leadership position please let them know!
Preference for keeping bi-weekly meetings
Try to use the Pangeo “Appear” channel; if that doesn’t work for some reason, Daniel can volunteer a dedicated Zoom conference room

Potential Projects

1) Establishing use cases for different data types / applications

How are people actually using data on the cloud today?
Proposal: create a survey to distribute to the Pangeo community (and beyond?) to capture what people are doing; what their experiences are; what data is being used; how that data is being used

Solicit responses through end of Q4
Use results of the survey to prepare a report to the community and to scope/design performance testing aimed at best improving the experience for the community
Include a “willing to work” question
Identify datasets that people would be willing to dedicate time to working on for the community

2) New online data catalogs

Potentially initiate a move away from static list-based catalogs on GitHub towards an API or a more dynamic system
E.g. satellite data / catalogued and maintained by Element84; climate data from Pangeo Intake catalogs

3) Seek dedicated funding/procurement to support individual data management projects

Miscellaneous thoughts

Public/private breakdown - some folks can share data/knowledge in the public domain
Our main target is going to be open source / open data but still should solicit feedback from industry within the community, and appreciate any information they’re willing to share or contribute
Get a contact with ECMWF and see if we can incorporate some of their data
New GDAL API

Separates raster bands
Feedback about possibly incorporating this tooling
Currently no buy-in by rasterio package

Action Items

[ ] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
[ ] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)