1 of 32

Cloud Native Analysis of Earth Observation Satellite Data with Pangeo

Scott Henderson

scottyh@uw.edu

eScience Postdoctoral Fellow

University of Washington

ESIP Tech Dive

February 14, 2019

2 of 32

Community Tools for Analysis of NASA Earth Observation System Data in the Cloud

1st Technical Review

March 15, 2019

Proposal (project) Number: 17-ACCESS17-0003

Cooperative Agreement Number(s): 80NSSC18M0157, 80NSSC18M0158, 80NSSC18M0159

PIs: Anthony Arendt (1), Ethan Gutmann (2), Daniel Pilone (3)

Institutions: (1) University of Washington, Seattle, (2) University Corporation for Atmospheric Research, (3) Element 84, Inc.

3 of 32

What is Pangeo?


“A community platform for Big Data geoscience”

4 of 32

Pangeo Funding and Contributors


5 of 32


Big data in the geosciences

Increase in Model Data:

  • Higher resolution
  • More process representation
  • Larger ensembles

Increase in Earth Observations:

  • New sensors / platforms
  • Continuous observations
  • Multiple versions of derived datasets

[Chart: archive growth from 2006 to 2018, with data volumes spanning ~1 GB to 150 PB]

6 of 32

NASA ACCESS Project (2018-2020)


Main Goal:

Facilitate the geoscience community's transition to cloud computing by building on top of the growing Pangeo ecosystem.

7 of 32


NASA ACCESS Team

Joe Hamman
Landung Setiawan
Rob Fatland
Scott Henderson
Jonah Joughin
Ethan Gutmann
Anthony Arendt
Amanda Tan
Dan Pilone
Andrew Pawlowski
Matt Rocklin

8 of 32

NASA ACCESS Project Goals

  1. Deploy a scalable cloud-based JupyterHub on AWS for community use.
  2. Integrate existing NASA data discovery tools with cloud-based data access protocols.
  3. Create an advanced, cloud-optimized framework for custom analysis of large remote data archives.
  4. Demonstrate scientific use cases with GRACE, Sentinel-1, and hydrologic models.


9 of 32

More about the NASA ACCESS Project


10 of 32

InSAR Use Case


Goal: develop a database of Sentinel-1 interferograms over the Pacific Northwest for geohazard applications:

  • Slow-slip earthquake detection
  • Monitoring of slow-moving landslides
  • Volcano deformation monitoring

2014 → present (Sentinel-1)

500+ acquisitions

1000+ interferograms

25+ TB of storage!

* Sentinel-1 is the best proxy for NASA's NISAR mission, planned for launch in 2022
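Stacks at this scale are handled lazily; a minimal sketch, assuming co-registered unwrapped-phase GeoTIFFs (the file layout and chunk sizes are hypothetical):

```python
import glob
import xarray as xr

# Hypothetical file layout: co-registered unwrapped-phase GeoTIFFs.
paths = sorted(glob.glob("interferograms/*_unw.tif"))

# Open each interferogram lazily with dask chunks, so the 25+ TB stack
# is never pulled into memory at once.
stack = xr.concat(
    [xr.open_rasterio(p, chunks={"x": 2048, "y": 2048}) for p in paths],
    dim="pair",
)
mean_phase = stack.mean(dim="pair")  # lazy; reads happen on .compute()
```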

11 of 32

HiMAT Use Case

11

  • Output from a high-resolution (1 km) land surface model from the NASA Land Information System (LIS) is spatially aggregated onto 1°×1° equal-area GRACE mascons covering all of High Mountain Asia (HiMAT)
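A minimal sketch of that aggregation step in xarray; the file name, variable name, and region bounds are hypothetical, and true GRACE mascons are equal-area rather than the simple 1° lat/lon boxes binned here:

```python
import numpy as np
import xarray as xr

# File, variable, and bounds are placeholders; real mascons are equal-area,
# so this 1-degree lat/lon binning is only an approximation.
ds = xr.open_dataset("lis_output.nc")
tws = ds["terrestrial_water_storage"]

lat_bins = np.arange(25, 46)   # rough HiMAT latitude bounds
lon_bins = np.arange(65, 106)  # rough HiMAT longitude bounds
coarse = (
    tws.groupby_bins("lat", lat_bins).mean(dim="lat")
       .groupby_bins("lon", lon_bins).mean(dim="lon")
)
```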

12 of 32

Pangeo computing interface

  1. Deploy a scalable cloud-based JupyterHub on AWS for community use.
    • http://pangeo.esipfed.org/hub/login (under development; likely to change in the near future)
    • Upcoming improvements:
      • Deploy in additional regions
      • Pangeo Binder for examples and demos
      • Continuous integration and testing


13 of 32

Scientific interaction w/ NASA data


[Diagram: “existing model” vs. “future model” of scientific interaction with NASA data]

14 of 32


Pangeo JupyterLab Interface

15 of 32

Benefits for Cloud-Native Analysis


Instant Access with no need for complicated application procedures.

Centralized Repository where everyone can access shared data.

Secure Location only accessible to team members during preliminary data sharing phase.

Minimal Movement of Data since the computation is brought to the data in the cloud.

16 of 32

The Pangeo Architecture


  • Jupyter for interactive access to remote systems (Cloud/HPC)
  • Xarray provides data structures and an intuitive interface for interacting with datasets
  • Dask runs the parallel computing system, built on top of Kubernetes or HPC, and tells the nodes what to do
  • Analysis-ready data, stored and cataloged on globally available distributed storage (e.g., S3, GCS)
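A minimal sketch of how these layers meet in a notebook; the scheduler address, bucket path, and variable name are all placeholders:

```python
import s3fs
import xarray as xr
from dask.distributed import Client

# Connect the notebook to the Dask cluster (address is a placeholder).
client = Client("tcp://dask-scheduler:8786")

# Open an analysis-ready Zarr store directly from object storage.
fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(s3fs.S3Map("pangeo-data/example.zarr", s3=fs))

# Operations build a lazy task graph; workers fetch only needed chunks.
monthly = ds["precip"].groupby("time.month").mean("time").compute()
```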

17 of 32

The Pangeo Architecture


  • Persistent deployment
    • Good for domain-specific research teams
    • Persistent storage for users
    • NOTE: deployer pays the cloud costs
  • Binder deployment
    • Good for demos and sharing
    • Cached Docker image guarantees reproducibility
    • Grant credits foot the bill
    • NOTE: storage is not persistent

*NOTE: we are currently streamlining the deployment procedure with the Hubploy project
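On a Kubernetes-backed deployment, notebook users can request Dask workers directly; a minimal sketch using dask-kubernetes (the pod-template file name is hypothetical):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# worker-spec.yaml is a placeholder pod template shipped with the deployment.
cluster = KubeCluster.from_yaml("worker-spec.yaml")
cluster.scale(10)          # request ten worker pods
client = Client(cluster)   # route Dask computations to the new workers
```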

18 of 32

Data Discovery

  • Integrate existing NASA data discovery tools with cloud-based data access protocols.
    • Working with CMR + STAC for static and dynamic catalogs on cloud storage
    • Programmatic access with NASA URS authentication (see the sketch below)
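CMR granule search is a plain HTTP API; a minimal sketch (the collection concept-id is a placeholder, and downloading protected granules additionally requires Earthdata Login / URS credentials):

```python
import requests

# The collection concept-id below is a placeholder, not a real dataset.
resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/granules.json",
    params={
        "collection_concept_id": "C0000000000-PROVIDER",
        "temporal": "2019-01-01T00:00:00Z,2019-02-01T00:00:00Z",
        "page_size": 10,
    },
)
resp.raise_for_status()
for granule in resp.json()["feed"]["entry"]:
    print(granule["title"])
```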


19 of 32

Goals for “Cloud-Native”

  • Easily discover imagery
  • Access image subsets instead of entire files (see the windowed-read sketch after this list)
  • Run algorithms where the imagery is stored, and download only the results
  • Easy dissemination of results via URLs
  • Scalable analysis (global scope, high spatial and temporal resolution)
  • Workflows deployable on any cloud provider (AWS, GCP, Azure, …)
  • No need to worry about security (credentials are passed securely)
  • Costs comparable to, or better than, running locally
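The "subsets, not whole files" goal is what cloud-optimized formats deliver; a minimal sketch of a windowed read over HTTP (the URL is a placeholder):

```python
import rasterio
from rasterio.windows import Window

# URL is a placeholder; rasterio/GDAL translate this read into HTTP
# range requests for just the bytes covering the window.
url = "https://example-bucket.s3.amazonaws.com/scene_B4.tif"
with rasterio.open(url) as src:
    block = src.read(1, window=Window(col_off=1024, row_off=1024,
                                      width=512, height=512))
print(block.shape)  # (512, 512): a subset, not the whole file
```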


20 of 32

EO data moving to the Cloud


https://aws.amazon.com/earth/

21 of 32

Status of EO Datasets on AWS


22 of 32

Many formats, regions, providers


23 of 32

Cloud Native storage formats


COG (Cloud-Optimized GeoTIFF)

  • Allows HTTP GET range requests to retrieve portions of a file rather than the whole thing
  • Metadata and overviews are stored at the front of the file for speedy access
  • Works with legacy GIS programs (Arc, QGIS)!

Zarr

  • Straightforward conversion from netCDF and HDF (see the sketch below)
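The netCDF/HDF-to-Zarr path is nearly a one-liner in xarray; a minimal sketch with placeholder file names:

```python
import xarray as xr

# File names are placeholders. Each chunk of the Zarr store becomes a
# separate object, so clients can fetch only the pieces they need.
ds = xr.open_dataset("model_output.nc", chunks={"time": 100})
ds.to_zarr("model_output.zarr")
```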

24 of 32

Status of EO datasets on AWS

  • Currently, the archives of record are still at the DAACs
  • Many cloud datasets are currently managed by third parties, not by the data providers
  • The only approved archive-of-record formats are netCDF and HDF (perhaps GeoTIFF in the future): https://earthdata.nasa.gov/user-resources/standards-and-references
  • “Archives of convenience” are an option for staging temporary cloud-optimized copies (e.g., netCDF → COG)


25 of 32

How to find and discover data?


  • Catalogs can be “static” (plain JSON files on object storage) and also “dynamic”, in that search APIs are built on top

26 of 32

STAC + COGs enable Cloud-Native tools


STAC Browser generates HTML on the fly from static catalogs, with on-demand tiling of COGs

https://github.com/radiantearth/stac-browser

https://github.com/radiantearth/tiles.rdnt.io

example:

https://landsat.stac.cloud

Current work aims to make static catalogs & items discoverable via Google Search.
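A static catalog is just linked JSON, so discovery needs nothing beyond HTTP; a minimal sketch that walks one level of a catalog (the URL is a placeholder, and absolute child hrefs are assumed):

```python
import requests

# Catalog URL is a placeholder; absolute child hrefs are assumed
# (static catalogs often use relative links instead).
root = requests.get("https://example.com/catalog.json").json()
for link in root["links"]:
    if link["rel"] == "child":
        child = requests.get(link["href"]).json()
        print(child.get("id"), child.get("description"))
```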

27 of 32

Putting it all together


  • NASA’s CMR consolidates the DAAC archives
  • STAC metadata is especially useful for archives on the cloud, and can be layered on top of CMR
  • Intake is a Python library that facilitates loading n-D arrays from various catalogs (see the sketch below)
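A minimal Intake sketch; the catalog URL and entry name are placeholders, and the entry is assumed to use an intake-xarray driver, so to_dask() returns a lazy xarray object:

```python
import intake

# Catalog URL and entry name are placeholders; the entry is assumed to
# use an intake-xarray driver.
cat = intake.open_catalog("https://example.com/catalog.yaml")
ds = cat["sea_surface_temperature"].to_dask()
```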

28 of 32

Computational tools

  • Create an advanced, cloud-optimized framework for custom analysis of large remote data archives.
    • A combination of Python packages enables this:
      • Dask (distributed computing)
      • Xarray (n-D array interface)
      • Rasterio (raster processing utilities)
      • Intake (catalog and data ingest to python)
      • PyViz (interactive browser-centric visuals)


* This is an opinionated list, but you can add your Python package of choice.
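A minimal sketch tying several of these together for NDVI; the COG URLs are placeholders:

```python
import xarray as xr
import hvplot.xarray  # noqa: registers .hvplot (PyViz) on xarray objects

# COG URLs are placeholders for Landsat 8 red (B4) and near-infrared (B5).
chunks = {"x": 1024, "y": 1024}
red = xr.open_rasterio("https://example.com/LC08_B4.tif", chunks=chunks)
nir = xr.open_rasterio("https://example.com/LC08_B5.tif", chunks=chunks)

# Cast to float to avoid integer overflow, then build a lazy expression.
red, nir = red.astype("float32"), nir.astype("float32")
ndvi = (nir - red) / (nir + red)
ndvi.squeeze().hvplot.image(rasterize=True)  # renders tiles on demand
```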

29 of 32

Demos!


  1. Landsat NDVI (data on AWS, compute on Google Cloud)
    • https://github.com/scottyhq/esip-tech-dive
  2. AGU 2018 tutorial material (various examples)
  3. Getting ready for NISAR data
  4. STAC catalogs, Intake, mosaics
  5. Landsat NDVI (+ blog post for context)

30 of 32

Conclusions and Outlook


  • The Pangeo project offers a timely pathway for helping the Earth science community transition to cloud computing

  • Great progress on scalable computing when data are discoverable (CMR + STAC) and stored in tiled formats (e.g., COG, Zarr)

  • Lots of opportunities for improvement, driven by scientific use cases… so please get involved!

31 of 32

How to get involved?


pangeo.io

Open communication via GitHub

32 of 32

Hackweeks to support community transition
