Pangeo Data Working Group - Meeting Notes
Meeting 12 - May 6, 2020
Agenda:
Participants:
Norman Barker
Apologies:
Actions
Meeting 11 - April 8, 2020
Agenda:
Participants:
Norman Barker
Apologies:
Actions
Meeting 10 - March 11, 2020
Agenda:
- Survey charts
- Cloud formats
- Available tooling
Participants:
Norman Barker, Justin Minsk, Dave Bianco
Apologies:
Actions
- Justin to create charts; if time constraints prevent this, we will delegate
- Recreate NASA benchmarks with Pangeo
- Highlight r/w+ issues with Zarr and others (Norman)
Notes
Discussed GDAL relevance. There is still no GDAL support for Zarr, but people are still using GDAL for regular work. Is there a gap in tooling?
Time series and columnar query support is missing with COGs, e.g. single pixel across multiple TIFFs.
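A rough way to see the gap noted above is to count reads. The sketch below is illustrative arithmetic only, not a benchmark: it assumes one read request per file or chunk, and the archive size and chunk length are made-up numbers. A stack of per-date COGs behaves like a time chunk length of 1, so a single-pixel time series costs one read per date.

```python
import math


def reads_for_pixel_series(n_timesteps: int, time_chunk: int) -> int:
    """Reads needed to fetch one pixel at every timestep.

    time_chunk is how many timesteps are stored together in one chunk.
    One TIFF per date is equivalent to time_chunk == 1.
    """
    if n_timesteps < 0 or time_chunk < 1:
        raise ValueError("n_timesteps must be >= 0 and time_chunk >= 1")
    return math.ceil(n_timesteps / time_chunk)


# Hypothetical archive: one scene per day for a year.
n_days = 365
cog_reads = reads_for_pixel_series(n_days, time_chunk=1)       # one TIFF per date
chunked_reads = reads_for_pixel_series(n_days, time_chunk=73)  # 73 dates per chunk

print(cog_reads, chunked_reads)
```

Under these assumptions the per-date layout needs 365 reads and the time-chunked layout needs 5. Real costs also depend on chunk size in bytes, caching and request latency, which this sketch ignores.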
Even with Zarr there are some IO gotchas (both read and write) at scale, although reads work fine. How do we go from GRIB or HDF to Zarr? Appending to Zarr is difficult.
Meeting 9 - February 26, 2020
Agenda:
- Survey charts
- STAC
- Cloud formats
Participants:
Norman Barker, Aimee Barciauskas, Matt Hanson
Apologies:
Actions
1. Review Cloud Data Formats document in more detail in the next meeting
2. Review survey charts
3. Discuss available tooling within the pangeo community
Notes
- Chris Holmes working on a blog post
- Matt - converting Sentinel-2 L2 to COGs and creating STAC for both the input (JPEG2000) and the COG output as different collections - DE Africa
- Cost for processing is $3200, does not include importing data from AWS
- Going to be mirrored in the new Cape Town region
- Hoping to get funding from AWS to pay for entire globe
- Joe and Matt also working on a blog post on intake and intake stac
- Cloud Data Formats study
- How can Pangeo group be involved?
- Interested in how they developed the benchmarks
- Glad TileDB was included, also interested that MRF was included
- Recreate benchmarks within Pangeo?
- Is gdal always a good fit? Not necessarily for Pangeo, Zarr
- Everyone else not using lat / lon - you need something to do subsetting
Meeting 9 - February 12, 2020
Agenda:
- ZARR, STAC, TileDB …
- Survey
- GDAL
Participants:
Justin Minsk, Norman Barker, Dave Bianco, Aimee Barciauskas
Apologies:
Actions
Notes
Survey is now closed. Justin has taken the action to graph the results and we will discuss further in the next meeting how to blog.
Zarr - lots of discussion around the use of Zarr, seems great for cloud reads, appending in parallel from dask/xarray in the cloud has been a little problematic.
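One possible source of trouble with parallel appends (the notes do not say which issue the group hit, so this is an assumption) is chunk alignment. Zarr writes whole chunks, so two workers whose row ranges fall in the same chunk can overwrite each other unless writes are synchronized. The sketch below uses plain Python and hypothetical row ranges to show which chunks end up shared between writers.

```python
def chunks_touched(start: int, stop: int, chunk_len: int) -> range:
    """Chunk indices along the append dimension covered by rows [start, stop)."""
    if chunk_len < 1 or start < 0 or stop <= start:
        raise ValueError("need chunk_len >= 1 and 0 <= start < stop")
    return range(start // chunk_len, (stop - 1) // chunk_len + 1)


def shared_chunks(writes, chunk_len):
    """Return the chunk indices that more than one writer would modify."""
    owners = {}
    for writer, (start, stop) in enumerate(writes):
        for chunk in chunks_touched(start, stop, chunk_len):
            owners.setdefault(chunk, set()).add(writer)
    return sorted(chunk for chunk, who in owners.items() if len(who) > 1)


# Hypothetical layout: chunk length 100 along time, three workers appending.
unaligned = [(0, 150), (150, 300), (300, 450)]  # 150 rows each
aligned = [(0, 200), (200, 400), (400, 450)]    # boundaries on multiples of 100

conflicts_unaligned = shared_chunks(unaligned, chunk_len=100)
conflicts_aligned = shared_chunks(aligned, chunk_len=100)
print(conflicts_unaligned, conflicts_aligned)
```

With these numbers the unaligned split shares chunk 1 between the first two workers, while the aligned split shares none. Aligning each worker's range to chunk boundaries is one common way to avoid the conflict.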
Meeting 8 - January 29, 2020
Agenda:
Participants:
Justin Minsk, Aimee Barciauskas, Dave Bianco
Actions
Notes
Meeting 7 - January 15, 2020
Agenda:
- Survey status
- Other
- TileDB and GDAL
Participants:
Justin Minsk, Norman Barker, Aimee Barciauskas, Dave Bianco
Apologies
Matt Hanson
Actions
Re-advertise survey;
- Retweet through Pangeo - Justin
- See if we can publish this on GDAL mailing list - Norman
- Discourse, Pangeo, https://discourse.pangeo.io/ - Dave
Notes
Meeting 6 - December 18, 2019
Agenda:
- Review actions 12/4/2019
- Survey status
- Other
Participants:
Cancelled
Apologies
Norman Barker
Actions
Notes
Meeting 5 - December 4, 2019
Agenda:
- Review actions 11/20/2019
- Review survey
- Review subject header / tweet for survey
Participants:
Justin Minsk
Norman Barker
Dave Bianco
Kevin Paul
Matt Hanson
Actions
- [] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
- Draft blog post about why we are doing a survey (Norman)
- Publish survey 12/8
Notes
Carry over action to make STAC collections from existing datasets to the next meeting.
Reviewing survey.
Use “Help shape Pangeo’s catalog and data format trajectory!” as our survey heading.
When should we publish? Consensus: early in the week of 12/8.
Justin will contact Ryan to get survey link tweeted!
Meeting 4 - November 20, 2019
Agenda:
- Review actions 10/23/2019
- STAC sprint report
Participants:
Matthew Hanson
Justin Minsk
Dave Bianco @talldave
Rich Signell (USGS) @rsignell-usgs on github
Aimee Barciauskas
Norman Barker
Scott Henderson
Apologies:
Review Actions from last meeting
- [] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
- [] Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t work particularly well.
- TileDB - NB
- COGs - MH, AB (may slip timeframe)
- Zarr - AB
Notes: Keep action 1 (STAC) open and revisit.
Survey questions
- What is your primary role? e.g. scientist, engineer, manager
- Which public datasets do you use in your work?
- What are the challenges you face when using these datasets, e.g. access, discovery, reliability
- Do you wish to use cloud native datasets and perform cloud based computing?
- Which data formats do you use or want to use?
- What catalogs do you use (if any)? Have you made any attempts to create your own catalogs?
- How are you doing data transfer, e.g. between multiple offices?
- Where is your data stored?
- Are you doing any data versioning / provenance marking, how do you handle this, how do you link back, which tools?
- What software do you use for data processing, is speed important?
- How much of your time is currently spent pre-processing data as opposed to processing?
- How important is compression to you, is lossy ok and if so at what threshold?
- Does your data benefit from a sparse representation?
- Is there anything else you would like to tell us?
New Actions:
- Justin will create google form with these survey questions
- (Recreating on Personal Gmail, seems like my work blocks anyone outside of the company from editing)
- New share link for editing: https://docs.google.com/forms/d/1CwfsKberK1x6IGMcJ7uicX0chjGbf4l6BQKdqY9q158/edit?usp=sharing
Future discussion topic (revisit in a month):
- Better understand crosswalk between different catalog implementations, e.g. data.gov -> STAC, be aware of ISO/CSW activities
Notes:
CSW/THREDDS and STAC? How to do this harmonization? STAC CMR proxy service.
Look at Parquet in addition to Zarr, TileDB, HDF and netCDF, and be able to explain why.
Survey notes:
Anonymous collection
What is your role, are you familiar with pangeo tools
Tweet from pangeo and link to google form - Ryan
Also send out to pangeo mailing list - Ryan
https://discourse.pangeo.io/ forum and Gitter channel
Create sentence as to why this survey is useful? Create intro in google form.
Launch before Dec 8th, provisionally aim for sending out on the 5th Dec.
Please put ideas below;
“Help shape Pangeo’s catalog and data format trajectory!”
Meeting 3 - October 23, 2019
Agenda:
- Review actions 10/9/2019
- Poll group for data survey questions
- AOB
Participants:
Norman Barker
Kevin Donkers
Justin Minsk
Aimee Barciauskas
Apologies:
None
Review actions
- [] Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
- [x] Think of questions for data survey (use cases and specific datasets), add them to these notes and review in the next meeting
- [] Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t work particularly well.
- TileDB - NB
- COGs - MH, AB (may slip timeframe)
- Zarr - AB
revisit action (3) in mid-november
Actions carried over
- Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
- Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t work particularly well.
- TileDB - NB
- COGs - MH, AB (may slip timeframe)
- Zarr - AB
revisit action above in mid-november
New Actions
Notes:
ICESat-2: Have been in discussions with UW about which format is best and how to surface metadata. ATL06 is a point cloud product; queries need to specify a bounding box and time range, and filtering is done by point. There is no real granule level metadata. Currently stored in HDF. In terms of STAC, this is a collection with an asset at the top level.
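As a rough illustration of "a collection with an asset at the top level", the sketch below builds a STAC Collection as a plain Python dict. It assumes the STAC 1.0.0 field layout, which postdates this meeting, and every value in it (id, bounding box, dates, license, href) is a placeholder rather than real ICESat-2 metadata.

```python
import json


def make_collection(collection_id, description, bbox, start, end, asset_href):
    """Build a minimal STAC Collection dict with one collection-level asset.

    bbox is [west, south, east, north]; start and end are RFC 3339 strings,
    and end may be None for an open-ended time range.
    """
    return {
        "type": "Collection",
        "stac_version": "1.0.0",      # assumption: 1.0.0 layout
        "id": collection_id,
        "description": description,
        "license": "proprietary",     # placeholder, not the real license
        "extent": {
            "spatial": {"bbox": [list(bbox)]},
            "temporal": {"interval": [[start, end]]},
        },
        "links": [],
        "assets": {
            # The data file hangs off the collection itself, with no items.
            "data": {
                "href": asset_href,
                "type": "application/x-hdf5",
                "roles": ["data"],
            }
        },
    }


collection = make_collection(
    collection_id="example-point-product",                 # hypothetical id
    description="Placeholder collection for a point cloud product",
    bbox=[-180.0, -90.0, 180.0, 90.0],
    start="2019-01-01T00:00:00Z",                          # placeholder date
    end=None,
    asset_href="https://example.com/data/example.h5",      # placeholder URL
)
print(json.dumps(collection, indent=2))
```

Because there is no granule level metadata, the spatial and temporal extent on the collection is the only place a catalog client can filter on; finer filtering still has to happen inside the data file.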
Earth System Model (ESM) metadata collection specification: https://github.com/NCAR/esm-collection-spec
ESM looks like STAC but isn’t. Good opportunity to transfer this to STAC. Will raise this at the STAC sprint.
Pangeo/STAC hackathon, 2020?
Data survey:
Target audience
The audience is primarily scientists, and includes agencies such as Met Office and NOAA.
We will reach out to;
- Everyone who is part of Pangeo as a start
- CSIRO climate forecasters (https://research.csiro.au/dfp/ | Thomas Moore + Dougie Squire)
- CEDA / RAL - Matt Pryor (https://github.com/mkjpryor-stfc | https://people.ncas.ac.uk/people/view/370)
- GeoScience Australia (OpenDataCube - Digital Earth Africa / Australia)
Questions;
- What datasets are they using
- What are the challenges (access, discovery, cloud native, do they use cloud based computing)
- Formats (less important than the above for now)
- What catalogs do they use (if any), any attempts to create their own
- How are you doing data transfer, multiple offices, cloud
- Where is your data
- Data versioning / provenance, how do you handle this, how do you link back, which tools
- How are you processing, is speed important
- Compression, is lossy ok and if so at what threshold
Tools for survey;
Survey Monkey, Google Forms (Justin Minsk is willing to create a Google Form after questions are decided on)
Report;
Blog post (report style), let the data be available for others to use.
Meeting 3 - October 9, 2019
Agenda:
- Review actions 9/11/2019
- ICESAT 2
- Poll group for data survey questions
- EPIC / NOAA - Industry day
- AOB
Participants:
Norman Barker, Aimee Barciauskas, Matthew Hanson, Luke Madaus, Jeff Sadler, Charles Blackmon-Luca
Apologies:
None
Review actions
- [X] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
- [X] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)
- [X] Fix Zoom link for future meetings - Daniel Rothenberg
Actions carried over
None
MH - STAC proxy to NASA CMR (Common Metadata Repository). ICESat-2 is one of the datasets; there is interest in making this cloud “friendly”, as currently most products are in HDF5. Add this new dataset back as a new collection into CMR (test not prod). Will be using ATL06 (non-gridded product). ICESat-2 products: https://icesat-2.gsfc.nasa.gov/science/data-products
Charles - ESM Collection Specs in progress, would like to merge with STAC extension development in time
MH - document benefits of different cloud formats, COGs, ZARR or TileDB - filtering is a good example of a typical action. How does a collection of assets fit into STAC?
Provide IceSat2 as multiple different formats so we can analyze and gauge performance.
Define patterns for laying out variables for a dataset in COG, ZARR or TileDB or another format.
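The filtering action mentioned above, selecting points by bounding box and time range, can be sketched in plain Python as a reference against which format-specific implementations could be compared. The Point fields and the sample records are hypothetical and do not reflect the actual product schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Point:
    lon: float
    lat: float
    time: datetime
    height_m: float


def filter_points(points, bbox, start, end):
    """Keep points inside bbox (west, south, east, north) and within [start, end].

    Bounds are inclusive. Boxes that cross the antimeridian are not handled.
    """
    west, south, east, north = bbox
    return [
        p for p in points
        if west <= p.lon <= east and south <= p.lat <= north and start <= p.time <= end
    ]


def utc(year, month, day):
    """Shorthand for a timezone-aware UTC midnight timestamp."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up sample records.
points = [
    Point(-45.0, 70.0, utc(2019, 1, 10), 1200.0),  # inside box and time window
    Point(-45.0, 70.0, utc(2019, 6, 10), 1210.0),  # inside box, outside window
    Point(10.0, 50.0, utc(2019, 1, 15), 300.0),    # outside box
]

selected = filter_points(
    points,
    bbox=(-60.0, 60.0, -30.0, 80.0),
    start=utc(2019, 1, 1),
    end=utc(2019, 1, 31),
)
print(len(selected))
```

This scans every point, which is the baseline a format has to beat. A layout that sorts or tiles points by location and time lets a reader skip most of the data instead.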
STAC sprint - https://docs.google.com/forms/d/e/1FAIpQLSeGbt8sji0H9Kd0RMS1AMlk7KmWwIfRu-P3G3pehxPTGZh0rw/viewform
EPIC - https://www.fbo.gov/index?s=opportunity&mode=form&id=2de56675c6e0556d3e08d9e01b691bce&tab=core&_cview=1
Tentatively
- Solicitation in January 2020
- Proposal due March 2020
- Award in September 2020
New Actions
- Take existing pangeo dataset and make STAC collections (master list is in pangeo-datastore)
- Think of questions for data survey (use cases and specific datasets), add them to these notes and review in the next meeting
- Define patterns for laying out variables, how do we want to do that? Maybe a one page tutorial on the format and data. Which formats make sense for particular data types? Explain why HDF5 and netCDF don’t work particularly well.
- TileDB - NB
- COGs - MH, AB (may slip timeframe)
- Zarr - AB
Meeting 2 - September 25, 2019
Agenda:
- Review actions 9/11/2019
- Note: little action due to absence of chairs (vacation)
- GDAL API
- Data survey
- Icesat2
- converting to cloud native and making STAC catalog
- Any other business (AOB)
Actions carried over;
- [ ] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
- [ ] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)
New Action
Fix Zoom link for future meetings - Daniel Rothenberg
GDAL API
Norman: Adding additional TileDB support to GDAL by extending the existing support for two dimensions to the new multi-dimensional API, https://gdal.org/development/rfc/rfc75_multidimensional_arrays.html
This work is ongoing.
Daniel: XArray integration - open up issue on XArray package, open up a more general multi-dimensional api patch
Rasterio integration - this is ongoing, https://rasterio.groups.io/g/dev/topic/33040759#100
Data Survey
Norman to create strawman, to be reviewed by the attendees of this meeting.
- Daniel + Justin can help review
Icesat 2 - reach out to Ryan A for more touch points.
AOB
NOAA Epic project, we should be aware of this. Weather research and improvement act, $50m 2020 gov budget. Improve numerical weather prediction. Earth Prediction Innovation Capability. Cloud Native Infrastructure for modeling suites. Industry and Vendor Work day Oct 8th, Silver Spring, MD.
Rich Signell (missed the meeting, post-facto comments)
With regard to EPIC, Harry House (the Director of Cloud Services for USGS) and I traveled to NOAA in Silver Spring last week and met with DaNa Corliss (who will likely lead the EPIC effort, I gather), Bill Lapenta (Director of NCEP) and a bunch of other NOAA people. We gave a Pangeo presentation, pitching it as a perfect community collaboration platform for NWS and EPIC data and they were interested enough to continue the conversation. We were suggesting some easy win demonstrations, like getting a rolling archive of HRRR forecast data delivered as Zarr on S3 and showing some cool Pangeo notebooks that crunch and visualize that data. Jebb Stewart said he was already working with Xarray, Dask and Zarr, and thought that should be pretty straightforward. I have shared with them this simple demo of some NWS HRRR data in Zarr that I created using the Unidata THREDDS dataserver "best time series" dataset:
https://nbviewer.jupyter.org/gist/rsignell-usgs/5618a19280448a9b2e76af08056ea1ca
and this repo with a "run on binder" link:
https://github.com/reproducible-notebooks/hrrr-zarr (note that it can take a while for the cluster to spin up)
Regarding rasterio, I was excited to learn about rioxarray, which I didn't know about until recently. It has recently added the ability to load COG overviews, like this:
https://nbviewer.jupyter.org/gist/rsignell-usgs/f4dd62ad1274c5b5ed69e5a6b81c1295
Participants:
Daniel Rothenberg, Justin Minsk, Norman Barker, Charles Blackmon-Luca, Aimee Barciauskas
Apologies:
Matt Hanson
Meeting 1 - September 11, 2019
Participants: Daniel Rothenberg, Justin Minsk, Norman Barker, Matthew Hanson, Luke Madaus, Charles Blackmon-Luca
- Norman and Matt volunteered to start out as “co-chairs” in order to help boot-strap initial activities
- … but if you couldn’t attend the call and would like to take a leadership position please let them know!
- Preference for keeping bi-weekly meetings
- Try to use Pangeo “Appear” channel; if it doesn’t work for some reason Daniel can volunteer a dedicated Zoom conference room
- 1) Establishing use cases for different data types / applications
- How are people actually using data on the cloud today?
- Proposal: create a survey to distribute to the Pangeo community (and beyond?) to capture what people are doing; what their experiences are; what data is being used; how that data is being used
- Solicit responses through end of Q4
- Use results of survey to prepare a report to the community and to scope/design performance testing aimed at best improving the community’s experience
- Include a “willing to work” question
- Identify datasets that people would be willing to dedicate time to working on for the community
- 2) New online data catalogs
- Potentially initiate a move away from static list-based catalogs on GitHub towards an API or a more dynamic system
- E.g. satellite data / catalogued and maintained by Element84; climate data from Pangeo Intake catalogs
- 3) Seek dedicated funding/procurement to support individual data management projects
- Public/private breakdown - some folks can share data/knowledge in the public domain
- Our main target is going to be open source / open data but still should solicit feedback from industry within the community, and appreciate any information they’re willing to share or contribute
- Get a contact with ECMWF and see if we can incorporate some of their data
- New GDAL API
- Separates raster bands
- Feedback about possibly incorporating this tooling
- Currently no buy-in by rasterio package
Action Items
- [ ] (delegate) Add bi-weekly working group meeting to official Pangeo Google Calendar (reach out to Joe H or Ryan A?)
- [ ] Team members propose preferred communication method to chairs (Discord, e-mail list-serv, Google Doc, https://discourse.pangeo.io/, etc)