1 of 24

Pangeo: A community platform for open, reproducible and scalable geoscience

Rich Signell (Open Science Computing, LLC) … and the Pangeo Community!

North Carolina Institute for Climate Studies, March 5, 2024

2 of 24

3 of 24

Pangeo is a Community

4 of 24

Pangeo is a Platform

DATA

Cloud-friendly ndarray data

  • Dask deployment packages: dask.distributed, dask-jobqueue, dask-mpi, dask-kubernetes, dask-cloudprovider, dask-gateway
  • Cluster classes: LocalCluster(), SLURMCluster(), KubeCluster(), FargateCluster()
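As a minimal sketch of the cluster-then-client pattern those Dask packages share, the example below starts a LocalCluster on one machine and maps a function over a few chunks of numbers. The function `chunk_mean` and the sample data are made up for illustration. The other cluster classes come from their own packages and need site-specific configuration, so they are not shown here.

```python
from dask.distributed import Client, LocalCluster


def chunk_mean(chunk):
    """Reduce one chunk of numbers to its mean (illustrative workload)."""
    return sum(chunk) / len(chunk)


# Four small made-up "chunks"; real work would use array blocks instead.
chunks = [[float(i + 10 * k) for i in range(5)] for k in range(4)]

# Threads-only local cluster with the dashboard disabled, so the sketch
# runs anywhere dask.distributed is installed.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
)
client = Client(cluster)
try:
    futures = client.map(chunk_mean, chunks)  # one task per chunk
    means = client.gather(futures)            # collect results in order
finally:
    client.close()
    cluster.close()

print(means)
```

The code that submits work only talks to the `Client`, so in principle only the line that builds the cluster changes when moving from a laptop to an HPC queue or a Kubernetes deployment.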

5 of 24

Live Demo time!

6 of 24

Pangeo for numerical model output

7 of 24

Pangeo for numerical model output

8 of 24

Pangeo lives in a rich Python ecosystem

9 of 24

Pangeo is in production

10 of 24

Pangeo is in production

11 of 24

12 of 24

The High Speed Network (100GbE+)

13 of 24

14 of 24

Cost of Cloud Storage

No Egress Fees

Egress Fees

See this notebook for how this was calculated…
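The comparison between providers with and without egress fees comes down to simple arithmetic. The sketch below is not the calculation from the notebook; it only shows the general shape of such an estimate. The function name and every price in it are hypothetical placeholders, not real provider rates.

```python
def monthly_cost(stored_tb, egress_tb, storage_price_per_tb, egress_price_per_tb=0.0):
    """Estimate a monthly bill: storage charge plus data-transfer-out charge.

    All prices are supplied by the caller; none are real provider rates.
    """
    return stored_tb * storage_price_per_tb + egress_tb * egress_price_per_tb


# Placeholder numbers purely for illustration (NOT actual prices).
STORED_TB = 100.0
EGRESS_TB = 10.0
HYPOTHETICAL_STORAGE_PRICE = 20.0   # currency units per TB-month
HYPOTHETICAL_EGRESS_PRICE = 90.0    # currency units per TB transferred out

no_egress_fees = monthly_cost(STORED_TB, EGRESS_TB, HYPOTHETICAL_STORAGE_PRICE)
with_egress_fees = monthly_cost(
    STORED_TB, EGRESS_TB, HYPOTHETICAL_STORAGE_PRICE, HYPOTHETICAL_EGRESS_PRICE
)

print(no_egress_fees, with_egress_fees)
```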

15 of 24

Zarr format

  • Developed as a cloud-optimized version of NetCDF/HDF
  • Each chunk is stored as a separate file/object
  • Global and variable metadata stored as JSON
  • Groups, filters, compression
  • Free, open-source, community-driven software
  • Simple format, clear specification
  • Read/write using Xarray
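The "one object per chunk, metadata as JSON" layout can be illustrated with a toy store built from the standard library alone. This is a simplified sketch, not the Zarr library: the key names loosely follow the Zarr v2 convention (`.zarray`, `0.1`), a plain dict stands in for an object store, and compression, filters, groups and ragged edge chunks are all left out. The helper names `write_chunked` and `read_value` are made up for illustration.

```python
import json
import struct
from itertools import product


def write_chunked(store, name, data, chunks):
    """Store a 2-D list of floats as one object per chunk plus JSON metadata.

    Assumes the array shape divides evenly by the chunk shape.
    """
    nrows, ncols = len(data), len(data[0])
    crow, ccol = chunks
    # Array metadata is a small JSON object stored next to the chunks.
    meta = {"shape": [nrows, ncols], "chunks": [crow, ccol], "dtype": "<f8"}
    store[f"{name}/.zarray"] = json.dumps(meta).encode()
    for i, j in product(range(nrows // crow), range(ncols // ccol)):
        block = [
            data[r][c]
            for r in range(i * crow, (i + 1) * crow)
            for c in range(j * ccol, (j + 1) * ccol)
        ]
        # Each chunk becomes its own key, so it can be fetched independently.
        store[f"{name}/{i}.{j}"] = struct.pack(f"<{len(block)}d", *block)


def read_value(store, name, row, col):
    """Read one element by fetching only the chunk that contains it."""
    meta = json.loads(store[f"{name}/.zarray"])
    crow, ccol = meta["chunks"]
    raw = store[f"{name}/{row // crow}.{col // ccol}"]
    values = struct.unpack(f"<{crow * ccol}d", raw)
    return values[(row % crow) * ccol + (col % ccol)]


store = {}  # stand-in for a bucket or directory
data = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
write_chunked(store, "temp", data, chunks=(2, 3))

print(sorted(store))
print(read_value(store, "temp", 3, 4))
```

Because every chunk has its own key, a reader that needs one value touches only the metadata object and a single chunk, which is what makes the layout work well over object storage.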

16 of 24

Zarr format

17 of 24

Zarr format

18 of 24

Kerchunk

  • Use the Zarr library, but access chunks via byte-range requests to any scientific data format file
  • Create a sidecar file that contains ALL the metadata (global metadata, variable metadata, and the byte ranges of all chunks)
  • Anyone can create these “reference” sidecar files
  • Once created, the native format libraries are no longer needed, only the Zarr library
  • References for each individual file can be combined into references for a virtual dataset
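The idea behind the reference files can be sketched with the standard library alone. The example below does not use Kerchunk itself: it writes two toy binary files, builds a reference dict for each that is loosely modelled on the Kerchunk version 1 layout (chunk key mapped to `[path, offset, length]`), combines them into one virtual dataset, and reads a chunk with a byte-range read. The file format, the `temp` variable name and all helper functions are hypothetical.

```python
import json
import os
import struct
import tempfile

HEADER = b"HYPOTHETICAL-HDR"  # stand-in for a native format's header bytes


def write_file_and_reference(directory, name, values):
    """Write one toy binary file and return a reference dict describing it."""
    path = os.path.join(directory, name)
    payload = struct.pack(f"<{len(values)}d", *values)
    with open(path, "wb") as f:
        f.write(HEADER + payload)
    meta = {"shape": [len(values)], "chunks": [len(values)], "dtype": "<f8"}
    return {
        "version": 1,
        "refs": {
            "temp/.zarray": json.dumps(meta),                # metadata inline
            "temp/0": [path, len(HEADER), len(payload)],     # byte range
        },
    }


def combine(reference_list):
    """Concatenate per-file references along the single array axis.

    Assumes every file holds one chunk of the same size.
    """
    combined = {"version": 1, "refs": {}}
    total = 0
    for n, ref in enumerate(reference_list):
        meta = json.loads(ref["refs"]["temp/.zarray"])
        total += meta["shape"][0]
        combined["refs"][f"temp/{n}"] = ref["refs"]["temp/0"]
    meta["shape"] = [total]
    combined["refs"]["temp/.zarray"] = json.dumps(meta)
    return combined


def read_chunk(references, key):
    """Fetch one chunk using only the sidecar: seek to offset, read length."""
    path, offset, length = references["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    return struct.unpack(f"<{length // 8}d", raw)


workdir = tempfile.mkdtemp()
ref_a = write_file_and_reference(workdir, "day1.bin", [1.0, 2.0, 3.0])
ref_b = write_file_and_reference(workdir, "day2.bin", [4.0, 5.0, 6.0])

# Round-trip through JSON to mimic saving and loading the sidecar file.
virtual = json.loads(json.dumps(combine([ref_a, ref_b])))
second_chunk = read_chunk(virtual, "temp/1")

print(json.loads(virtual["refs"]["temp/.zarray"])["shape"], second_chunk)
```

The reader never parses the header of the original files. Everything it needs is in the sidecar, which is why the native libraries are only required when the references are first generated.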

19 of 24

Kerchunk

  • Large references can now be written as a collection of Parquet files instead of JSON
    • Much smaller reference file sizes
    • Faster data opening
  • Can now append to both JSON and Parquet references

20 of 24

Cloud-Optimized Data

21 of 24

Benefits of the Pangeo Framework:

  • Flexible: Select which tools you need and put them together to solve your problem
  • Scalable: Run on anything from a single-core laptop to a thousand-node cluster
  • Multi-architecture: Runs on your desktop and on Mac/Windows/Linux CPUs and GPUs
  • Efficient: Run at machine-code speeds using just-in-time compilation

  • Interactive: Support fully interactive exploration, not just rendering static images or text files
  • Scriptable: Run in batch mode for parameter searches and unattended operation
  • Visualizable: Support rendering even the largest datasets
  • Future-proof: Maintained, used, and tested by many people all across the world
  • Open: Free for research or commercial use, without restrictive licensing or extra costs
  • Cloud-friendly: Takes advantage of cloud-native technologies without retooling

22 of 24

Deploying Pangeo

  • 2i2c: paid service that provides and manages your JupyterHub with Dask running on Kubernetes
  • Coiled: paid service providing Dask clusters and notebooks on demand
  • Nebari: open-source project that allows simple deployment and management of JupyterHub with Dask on Kubernetes

23 of 24

Learning Pangeo

24 of 24

Benefits of Cloud Native for Science:

  • Computing becomes a commodity
  • Whatever you need, whenever you need it
  • Works great for small data, works great for big data
  • Pay only for what you use
  • Supports Open Science

  • Levels the playing field – not just privileged institutions
  • Open and robust: data can be accessed without additional data services
  • Encourages standards and best practices