1 of 32

Cloud Native Analysis of Earth Observation Satellite Data with Pangeo

Scott Henderson

scottyh@uw.edu

eScience Postdoctoral Fellow

University of Washington

ESIP Tech Dive

February 14, 2019

2 of 32

Community Tools for Analysis of NASA Earth Observation System Data in the Cloud

1st Technical Review

March 15, 2019

Proposal (project) Number: 17-ACCESS17-0003

Cooperative Agreement Number(s): 80NSSC18M0157, 80NSSC18M0158, 80NSSC18M0159

PIs: Anthony Arendt (1), Ethan Gutmann (2), Daniel Pilone (3)

Institutions: (1) University of Washington, Seattle, (2) University Corporation for Atmospheric Research, (3) Element 84, Inc.

3 of 32

What is Pangeo?


“A community platform for Big Data geoscience”

4 of 32

Pangeo Funding and Contributors


5 of 32


Big data in the geosciences

Increase in Model Data:

  • Higher resolution
  • More process representation
  • Larger ensembles

Increase in Earth Observations:

  • New sensors / platforms
  • Continuous observations
  • Multiple versions of derived datasets

[Chart: archive growth from 2006 to 2018, with data volumes spanning ~1 GB to 150 PB]

6 of 32

NASA ACCESS Project (2018-2020)


Main Goal:

Facilitate the geoscience community's transition to cloud computing by building on top of the growing Pangeo ecosystem.

7 of 32


NASA ACCESS Team

Joe Hamman
Landung Setiawan
Rob Fatland
Scott Henderson
Jonah Joughin
Ethan Gutmann
Anthony Arendt
Amanda Tan
Dan Pilone
Andrew Pawlowski
Matt Rocklin

8 of 32

NASA ACCESS Project Goals

  1. Deploy a scalable cloud-based JupyterHub on AWS for community use.
  2. Integrate existing NASA data discovery tools with cloud-based data access protocols.
  3. Create an advanced, cloud-optimized framework for custom analysis of large remote data archives.
  4. Demonstrate scientific use cases with GRACE, Sentinel-1, and hydrologic models.


9 of 32

More about the NASA ACCESS Project


10 of 32

InSAR Use Case


Goal: develop a database of Sentinel-1 interferograms over the Pacific Northwest for geohazard applications:

  • Slow-slip earthquake detection
  • Monitoring of slow-moving landslides
  • Volcano deformation monitoring

2014 → present (Sentinel-1)

500+ acquisitions

1000+ interferograms

25+ TB of storage!

* Sentinel-1 is the best proxy for NASA's NISAR mission, planned for launch in 2022
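Stacks at this scale are handled lazily; a minimal sketch, assuming co-registered unwrapped-phase GeoTIFFs (the file layout and chunk sizes are hypothetical):

```python
import glob
import xarray as xr

# Hypothetical file layout: co-registered unwrapped-phase GeoTIFFs.
paths = sorted(glob.glob("interferograms/*_unw.tif"))

# Open each interferogram lazily with dask chunks, so the 25+ TB stack
# is never pulled into memory at once.
stack = xr.concat(
    [xr.open_rasterio(p, chunks={"x": 2048, "y": 2048}) for p in paths],
    dim="pair",
)
mean_phase = stack.mean(dim="pair")  # lazy; reads happen on .compute()
```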

11 of 32

HiMAT Use Case

11

  • Output from a high-resolution (1 km) land surface model from the NASA Land Information System (LIS) is spatially aggregated onto 1°×1° equal-area GRACE mascons covering all of High Mountain Asia (HiMAT)
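A minimal sketch of that aggregation step in xarray; the file name, variable name, and region bounds are hypothetical, and true GRACE mascons are equal-area rather than the simple 1° lat/lon boxes binned here:

```python
import numpy as np
import xarray as xr

# File, variable, and bounds are placeholders; real mascons are equal-area,
# so this 1-degree lat/lon binning is only an approximation.
ds = xr.open_dataset("lis_output.nc")
tws = ds["terrestrial_water_storage"]

lat_bins = np.arange(25, 46)   # rough HiMAT latitude bounds
lon_bins = np.arange(65, 106)  # rough HiMAT longitude bounds
coarse = (
    tws.groupby_bins("lat", lat_bins).mean(dim="lat")
       .groupby_bins("lon", lon_bins).mean(dim="lon")
)
```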

12 of 32

Pangeo computing interface

  1. Deploy a scalable cloud-based JupyterHub on AWS for community use.
    • http://pangeo.esipfed.org/hub/login (under development; likely to change in the near future)
    • Upcoming improvements:
      • Deploy in additional regions
      • Pangeo Binder for examples and demos
      • Continuous integration and testing


13 of 32

Scientific interaction w/ NASA data


[Diagram: “existing model” vs. “future model” of scientific interaction with NASA data]

14 of 32


Pangeo JupyterLab Interface

15 of 32

Benefits for Cloud-Native Analysis


Instant Access with no need for complicated application procedures.

Centralized Repository where everyone can access shared data.

Secure Location only accessible to team members during preliminary data sharing phase.

Minimal Movement of Data since the computation is brought to the data in the cloud.

16 of 32

The Pangeo Architecture


  • Jupyter for interactive access to remote systems (Cloud/HPC)
  • Xarray provides data structures and an intuitive interface for interacting with datasets
  • Dask runs the parallel computing system, built on top of Kubernetes or HPC, and tells the nodes what to do
  • Analysis-ready data, stored and cataloged on globally available distributed storage (e.g., S3, GCS)
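A minimal sketch of how these layers meet in a notebook; the scheduler address, bucket path, and variable name are all placeholders:

```python
import s3fs
import xarray as xr
from dask.distributed import Client

# Connect the notebook to the Dask cluster (address is a placeholder).
client = Client("tcp://dask-scheduler:8786")

# Open an analysis-ready Zarr store directly from object storage.
fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(s3fs.S3Map("pangeo-data/example.zarr", s3=fs))

# Operations build a lazy task graph; workers fetch only needed chunks.
monthly = ds["precip"].groupby("time.month").mean("time").compute()
```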

17 of 32

The Pangeo Architecture


  • Persistent deployment
    • Good for domain-specific research teams
    • Persistent storage for users
    • NOTE: deployer pays the cloud costs
  • Binder deployment
    • Good for demos and sharing
    • Cached Docker image guarantees reproducibility
    • Grant credits foot the bill
    • NOTE: storage is not persistent

*NOTE: we are currently streamlining the deployment procedure with the Hubploy project
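On a Kubernetes-backed deployment, notebook users can request Dask workers directly; a minimal sketch using dask-kubernetes (the pod-template file name is hypothetical):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# worker-spec.yaml is a placeholder pod template shipped with the deployment.
cluster = KubeCluster.from_yaml("worker-spec.yaml")
cluster.scale(10)          # request ten worker pods
client = Client(cluster)   # route Dask computations to the new workers
```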

18 of 32

Data Discovery

  • Integrate existing NASA data discovery tools with cloud-based data access protocols.
    • Working with CMR + STAC for static and dynamic catalogs on cloud storage
    • Programmatic access with NASA URS authentication (see the sketch below)
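CMR granule search is a plain HTTP API; a minimal sketch (the collection concept-id is a placeholder, and downloading protected granules additionally requires Earthdata Login / URS credentials):

```python
import requests

# The collection concept-id below is a placeholder, not a real dataset.
resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/granules.json",
    params={
        "collection_concept_id": "C0000000000-PROVIDER",
        "temporal": "2019-01-01T00:00:00Z,2019-02-01T00:00:00Z",
        "page_size": 10,
    },
)
resp.raise_for_status()
for granule in resp.json()["feed"]["entry"]:
    print(granule["title"])
```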


19 of 32

Goals for “Cloud-Native”

  • Easily discover imagery
  • Access image subsets instead of entire files (see the windowed-read sketch after this list)
  • Run algorithms where the imagery is stored, and download only the results
  • Easy dissemination of results via URLs
  • Scalable analysis (global scope, high spatial and temporal resolution)
  • Workflows deployable on any cloud provider (AWS, GCP, Azure, …)
  • No need to worry about security (credentials are passed securely)
  • Costs comparable to, or better than, running locally
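The "subsets, not whole files" goal is what cloud-optimized formats deliver; a minimal sketch of a windowed read over HTTP (the URL is a placeholder):

```python
import rasterio
from rasterio.windows import Window

# URL is a placeholder; rasterio/GDAL translate this read into HTTP
# range requests for just the bytes covering the window.
url = "https://example-bucket.s3.amazonaws.com/scene_B4.tif"
with rasterio.open(url) as src:
    block = src.read(1, window=Window(col_off=1024, row_off=1024,
                                      width=512, height=512))
print(block.shape)  # (512, 512): a subset, not the whole file
```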


20 of 32

EO data moving to the Cloud


https://aws.amazon.com/earth/

21 of 32

Status of EO Datasets on AWS


22 of 32

Many formats, regions, providers


23 of 32

Cloud Native storage formats


COG (Cloud-Optimized GeoTIFF)

  • Allows HTTP GET range requests to retrieve portions of a file rather than the whole thing
  • Metadata and overviews are stored at the front of the file for speedy access
  • Works with legacy GIS programs (Arc, QGIS)!

Zarr

  • Straightforward conversion from netCDF and HDF (see the sketch below)
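The netCDF/HDF-to-Zarr path is nearly a one-liner in xarray; a minimal sketch with placeholder file names:

```python
import xarray as xr

# File names are placeholders. Each chunk of the Zarr store becomes a
# separate object, so clients can fetch only the pieces they need.
ds = xr.open_dataset("model_output.nc", chunks={"time": 100})
ds.to_zarr("model_output.zarr")
```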

24 of 32

Status of EO datasets on AWS

  • Currently, the archives of record are still at the DAACs
  • Many cloud datasets are currently managed by third parties, not by the data providers
  • The only approved archive-of-record formats are netCDF and HDF (perhaps GeoTIFF in the future): https://earthdata.nasa.gov/user-resources/standards-and-references
  • “Archives of convenience” are an option for staging temporary cloud-optimized copies (e.g., netCDF → COG)


25 of 32

How to find and discover data?


  • Catalogs can be “static” (plain JSON files on object storage) and also “dynamic”, in that search APIs are built on top

26 of 32

STAC + COGs enable Cloud-Native tools


STAC Browser generates HTML on the fly from static catalogs, with on-demand tiling of COGs

https://github.com/radiantearth/stac-browser

https://github.com/radiantearth/tiles.rdnt.io

example:

https://landsat.stac.cloud

Current work aims to make static catalogs & items discoverable via Google Search.
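A static catalog is just linked JSON, so discovery needs nothing beyond HTTP; a minimal sketch that walks one level of a catalog (the URL is a placeholder, and absolute child hrefs are assumed):

```python
import requests

# Catalog URL is a placeholder; absolute child hrefs are assumed
# (static catalogs often use relative links instead).
root = requests.get("https://example.com/catalog.json").json()
for link in root["links"]:
    if link["rel"] == "child":
        child = requests.get(link["href"]).json()
        print(child.get("id"), child.get("description"))
```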

27 of 32

Putting it all together


  • NASA’s CMR consolidates the DAAC archives
  • STAC metadata is especially useful for archives on the cloud, and can be layered on top of CMR
  • Intake is a Python library that facilitates loading n-D arrays from various catalogs (see the sketch below)
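A minimal Intake sketch; the catalog URL and entry name are placeholders, and the entry is assumed to use an intake-xarray driver, so to_dask() returns a lazy xarray object:

```python
import intake

# Catalog URL and entry name are placeholders; the entry is assumed to
# use an intake-xarray driver.
cat = intake.open_catalog("https://example.com/catalog.yaml")
ds = cat["sea_surface_temperature"].to_dask()
```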

28 of 32

Computational tools

  • Create an advanced, cloud-optimized framework for custom analysis of large remote data archives.
    • A combination of Python packages enables this:
      • Dask (distributed computing)
      • Xarray (n-D array interface)
      • Rasterio (raster processing utilities)
      • Intake (catalog and data ingest to python)
      • PyViz (interactive browser-centric visuals)


* This is an opinionated list, but you can add your Python package of choice.
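A minimal sketch tying several of these together for NDVI; the COG URLs are placeholders:

```python
import xarray as xr
import hvplot.xarray  # noqa: registers .hvplot (PyViz) on xarray objects

# COG URLs are placeholders for Landsat 8 red (B4) and near-infrared (B5).
chunks = {"x": 1024, "y": 1024}
red = xr.open_rasterio("https://example.com/LC08_B4.tif", chunks=chunks)
nir = xr.open_rasterio("https://example.com/LC08_B5.tif", chunks=chunks)

# Cast to float to avoid integer overflow, then build a lazy expression.
red, nir = red.astype("float32"), nir.astype("float32")
ndvi = (nir - red) / (nir + red)
ndvi.squeeze().hvplot.image(rasterize=True)  # renders tiles on demand
```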

29 of 32

Demos!


  1. Landsat NDVI (data on AWS, compute on Google Cloud)
    • https://github.com/scottyhq/esip-tech-dive
  2. AGU 2018 tutorial material (various examples)
  3. Getting ready for NISAR data
  4. STAC catalogs, Intake, mosaics
  5. Landsat NDVI (+ blog post for context)

30 of 32

Conclusions and Outlook


  • The Pangeo project offers a timely pathway for helping the Earth science community transition to cloud computing

  • Great progress on scalable computing when data are discoverable (CMR + STAC) and stored in tiled formats (e.g., COG, Zarr)

  • Lots of opportunities for improvement, driven by scientific use cases… so please get involved!

31 of 32

How to get involved?


pangeo.io

Open communication via GitHub

32 of 32

Hackweeks to support community transition
