1 of 24

Pangeo: A community platform for open, reproducible and scalable geoscience

Rich Signell (Open Science Computing, LLC) … and the Pangeo Community!

North Carolina Institute for Climate Studies, March 5, 2024

2 of 24

3 of 24

Pangeo is a Community

4 of 24

Pangeo is a Platform

DATA

Cloud-friendly ndarray data

Dask deployment packages: dask.distributed, dask-jobqueue, dask-mpi, dask-kubernetes, dask-cloudprovider, dask-gateway
Cluster classes: LocalCluster(), SLURMCluster(), KubeCluster(), FargateCluster()

https://pangeo.dev

https://medium.com/pangeo

5 of 24

Live Demo time!

6 of 24

Pangeo for numerical model output

7 of 24

Pangeo for numerical model output

8 of 24

Pangeo lives in a rich Python ecosystem

9 of 24

Pangeo is in production

10 of 24

Pangeo is in production

11 of 24

https://github.com/hytest-org/hytest

12 of 24

The High Speed Network (100GbE+)

14 of 24

Cost of Cloud Storage

No Egress Fees

Egress Fees

See this notebook for how this was calculated…
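
As a rough illustration of why egress fees matter in a comparison like this, the sketch below adds a storage charge and an optional egress charge. It is not the calculation from the notebook mentioned above; the function name `monthly_cost` and every rate in it are hypothetical placeholders, not real provider prices.

```python
def monthly_cost(stored_tb, egress_tb, storage_rate, egress_rate=0.0):
    """Return a monthly bill in dollars.

    stored_tb    -- terabytes kept in object storage
    egress_tb    -- terabytes downloaded out of the provider's network
    storage_rate -- dollars per TB-month (hypothetical placeholder)
    egress_rate  -- dollars per TB transferred out (0.0 means no egress fees)
    """
    return stored_tb * storage_rate + egress_tb * egress_rate


# Hypothetical rates chosen only to show the shape of the comparison.
with_egress = monthly_cost(stored_tb=100, egress_tb=10,
                           storage_rate=20.0, egress_rate=90.0)
without_egress = monthly_cost(stored_tb=100, egress_tb=10, storage_rate=20.0)

print(f"with egress fees:    ${with_egress:,.2f}")
print(f"without egress fees: ${without_egress:,.2f}")
```

With these placeholder inputs the egress term is a large share of the bill even though only a tenth of the stored volume is downloaded, which is the effect the two chart categories separate.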

15 of 24

Zarr format

Developed as a cloud-optimized alternative to NetCDF/HDF
Each chunk is stored as a separate file/object
Global and variable metadata stored as JSON
Groups, filters, compression
Free, open-source, community-driven software
Simple format, clear specification
Read/write using Xarray
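
The layout described in the bullets above can be illustrated with a toy store. This is a minimal sketch using only the Python standard library, not the Zarr library and not a spec-compliant writer: the key names are loosely modelled on Zarr v2 (`.zarray` metadata, dotted chunk keys), the metadata fields are simplified, and the helper names `write_chunked` and `read_row` are hypothetical. A plain dict stands in for an object store bucket.

```python
import json
import struct
import zlib


def write_chunked(store, name, data, chunk_rows):
    """Store a 2-D list of floats as JSON metadata plus one compressed object per row-block."""
    n_rows, n_cols = len(data), len(data[0])
    meta = {
        "shape": [n_rows, n_cols],
        "chunks": [chunk_rows, n_cols],
        "dtype": "<f8",
        "compressor": "zlib",  # simplified; real Zarr metadata is richer
    }
    store[f"{name}/.zarray"] = json.dumps(meta).encode()
    for i, start in enumerate(range(0, n_rows, chunk_rows)):
        block = data[start:start + chunk_rows]
        flat = [value for row in block for value in row]
        raw = struct.pack(f"<{len(flat)}d", *flat)
        # Each chunk becomes its own key, i.e. its own file or object.
        store[f"{name}/{i}.0"] = zlib.compress(raw)


def read_row(store, name, row):
    """Fetch one row by reading the metadata and only the chunk that holds that row."""
    meta = json.loads(store[f"{name}/.zarray"])
    chunk_rows, n_cols = meta["chunks"]
    chunk_index, row_in_chunk = divmod(row, chunk_rows)
    raw = zlib.decompress(store[f"{name}/{chunk_index}.0"])
    values = struct.unpack(f"<{len(raw) // 8}d", raw)
    return list(values[row_in_chunk * n_cols:(row_in_chunk + 1) * n_cols])


store = {}  # stand-in for a bucket or directory
data = [[float(r * 10 + c) for c in range(3)] for r in range(5)]
write_chunked(store, "temp", data, chunk_rows=2)

print(sorted(store))
print(read_row(store, "temp", 3))
```

Because every chunk is an independent object, a reader that needs one row touches the small JSON metadata and a single chunk, and many workers can fetch different chunks in parallel. In practice Xarray handles this through its `Dataset.to_zarr` and `open_zarr` functions rather than hand-written code like the above.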

16 of 24

Zarr format

17 of 24

Zarr format

18 of 24

Kerchunk

Use Zarr library, but access chunks via byte-range requests to any scientific data format file
Create a sidecar file that contains ALL the metadata (all global metadata, all variable metadata, and the byte ranges of all chunks)
Anyone can create these “reference” sidecar files
Once created, native libraries are no longer needed – only the Zarr library
References for each individual file can be combined into references for a virtual dataset
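
The reference idea above can be sketched without Kerchunk itself. The example below is a toy under stated assumptions: the "legacy" binary files, their 16-byte header, the file names, and the helpers `write_legacy_file` and `read_chunk` are all hypothetical stand-ins. The reference layout, a `refs` mapping from chunk key to `[path, offset, length]` with metadata stored inline as a JSON string, follows the general shape of Kerchunk's version 1 JSON references, but is simplified.

```python
import json
import os
import struct
import tempfile

HEADER = b"LEGACYFORMAT" + b"\x00" * 4  # 16-byte stand-in for a native file header


def write_legacy_file(path, values):
    """Write a header followed by float64 values; return the payload's (offset, length)."""
    payload = struct.pack(f"<{len(values)}d", *values)
    with open(path, "wb") as f:
        f.write(HEADER + payload)
    return len(HEADER), len(payload)


def read_chunk(references, key):
    """Read one chunk via a byte-range read; no knowledge of the file format is needed."""
    path, offset, length = references["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    return list(struct.unpack(f"<{length // 8}d", raw))


workdir = tempfile.mkdtemp()
files = {"day1.bin": [1.0, 2.0, 3.0], "day2.bin": [4.0, 5.0, 6.0]}

# Combine per-file references into one virtual array: one chunk per source file.
refs = {}
for i, (name, values) in enumerate(files.items()):
    path = os.path.join(workdir, name)
    offset, length = write_legacy_file(path, values)
    refs[f"temp/{i}"] = [path, offset, length]
refs["temp/.zarray"] = json.dumps({"shape": [6], "chunks": [3], "dtype": "<f8"})
references = {"version": 1, "refs": refs}

# The sidecar is plain JSON that anyone can create and share.
sidecar = os.path.join(workdir, "combined.json")
with open(sidecar, "w") as f:
    json.dump(references, f)

with open(sidecar) as f:
    loaded = json.load(f)
print(read_chunk(loaded, "temp/1"))
```

The original files are never rewritten or copied: the sidecar only records where each chunk lives. With the real library, the per-file scan is typically done by classes such as `kerchunk.hdf.SingleHdf5ToZarr` and the combination step by `kerchunk.combine.MultiZarrToZarr`, and the byte-range reads go through fsspec against local or cloud storage.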

19 of 24

Kerchunk

Large references can now be written as a collection of Parquet files instead of JSON
much smaller reference file sizes
faster data opening
Can now append to both JSON and Parquet references

20 of 24

Cloud-Optimized Data

Source: Cloud-Optimized Geospatial Formats Guide (cloudnativegeo.org)

21 of 24

Benefits of the Pangeo Framework:

Flexible: Select which tools you need and put them together to solve your problem
Scalable: Run on anything from a single-core laptop to a thousand-node cluster
Multi-architecture: Runs on your desktop and on Mac/Windows/Linux CPUs and GPUs
Efficient: Run at machine-code speeds using just-in-time compilation

Interactive: Support fully interactive exploration, not just rendering static images or text files
Scriptable: Run in batch mode for parameter searches and unattended operation
Visualizable: Support rendering even the largest datasets
Future-proof: Maintained, used, and tested by many people all across the world
Open: Free for research or commercial use, without restrictive licensing or extra costs
Cloud-friendly: Takes advantage of cloud-native technologies without retooling

22 of 24

Deploying Pangeo

2i2c: paid service that provides and manages your JupyterHub with Dask running on Kubernetes
Coiled: paid service providing Dask clusters and notebooks on demand
Nebari: open-source project that allows simple deployment and management of JupyterHub with Dask on Kubernetes

23 of 24

Learning Pangeo

24 of 24

Benefits of Cloud Native for Science:

Computing becomes a commodity
Whatever you need, whenever you need it
Works great for small data, works great for big data
Pay only for what you use
Supports Open Science

Levels the playing field – not just privileged institutions
Open and robust: data can be accessed without additional data services
Encourages standards and best practices
