Cloud Native Analysis of Earth Observation Satellite Data with Pangeo
February 14, 2019
Community Tools for Analysis of NASA Earth Observation System Data in the Cloud
1st Technical Review
March 15, 2019
Proposal (project) Number: 17-ACCESS17-0003
Co-Operative Agreement Number(s): 80NSSC18M0157, 80NSSC18M0158, 80NSSC18M0159
PIs: Anthony Arendt (1), Ethan Gutmann (2), Daniel Pilone (3)
Institutions: (1) University of Washington, Seattle, (2) University Corporation for Atmospheric Research, (3) Element 84, Inc.
What is Pangeo?
“A community platform for Big Data geoscience”
Pangeo Funding and Contributors
Big data in the geosciences
Increase in Model Data
Higher resolution
More process representation
Larger ensembles
Increase in Earth Observations
New sensors / platforms
Continuous observations
Multiple versions of derived datasets
[Figure: data volumes by year (2006, 2012, 2018); value labels: 2PB, 150PB, 40TB, 1GB, 500GB]
NASA ACCESS Project (2018-2020)
Main Goal:
Facilitate the geoscience community's transition to cloud computing by building on the growing Pangeo ecosystem.
Joe Hamman
Landung Setiawan
Rob Fatland
Scott Henderson
Jonah Joughin
Ethan Gutmann
Anthony Arendt
Amanda Tan
Dan Pilone
Andrew Pawlowski
NASA ACCESS Team
Matt Rocklin
NASA ACCESS Project Goals
More about the NASA ACCESS Project
InSAR Use Case
Goal: develop a database of Sentinel-1 InSAR in the Pacific Northwest for geohazard applications:
2014 -> Present (Sentinel-1)
500+ Acquisitions
1000+ Interferograms
25+ TB of storage!
* Sentinel-1 is the best proxy for NASA’s NISAR mission, planned for launch in 2022
HIMAT Use Case
Pangeo computing interface
Scientific interaction w/ NASA data
existing model
future model
Pangeo JupyterLab Interface
Benefits for Cloud-Native Analysis
Instant Access with no need for complicated application procedures.
Centralized Repository where everyone can access shared data.
Secure Location only accessible to team members during preliminary data sharing phase.
Minimal Shifting of Data since computing can be brought to the cloud.
The Pangeo Architecture
Jupyter for interactive access on remote systems
Cloud/HPC
Xarray provides data structures and an intuitive interface for interacting with datasets
Parallel computing system built on top of Kubernetes or HPC.
Dask tells the nodes what to do.
Distributed storage
Analysis Ready Data: stored and cataloged on globally available distributed storage (e.g. S3, GCS)
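The Xarray and Dask layers above can be illustrated with a minimal sketch. It assumes the `numpy`, `xarray`, and `dask` packages are installed, and it uses a small synthetic in-memory dataset in place of real cloud-hosted data; the variable name `temperature` and the grid sizes are hypothetical.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for an analysis-ready dataset.
# Variable name, dimension sizes, and coordinates are illustrative assumptions.
rng = np.random.default_rng(0)
data = rng.random((24, 18, 36))
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), data)},
    coords={
        "time": np.arange(24),
        "lat": np.linspace(-85.0, 85.0, 18),
        "lon": np.linspace(-175.0, 175.0, 36),
    },
)

# Chunking along time backs each variable with a Dask array (requires dask).
chunked = ds.chunk({"time": 6})

# This only builds a task graph; nothing is computed yet.
lazy_mean = chunked["temperature"].mean(dim="time")

# compute() hands the graph to the Dask scheduler and returns an in-memory result.
time_mean = lazy_mean.compute()
print(time_mean.shape)
```

On a Pangeo deployment the same code would typically run against a Dask cluster rather than the default local scheduler, with the dataset opened from object storage instead of built in memory.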
The Pangeo Architecture
Google Cloud:
*NOTE: we are currently working to simplify the deployment procedure with the Hubploy project
Data Discovery
Goals for “Cloud-Native”
EO data moving to the Cloud
https://aws.amazon.com/earth/
Status of EO Datasets on AWS
Many formats, regions, providers
Cloud Native storage formats
ZARR
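One reason chunked formats such as Zarr suit object storage is that a reader only needs to fetch the chunks overlapping its request. The sketch below is an illustration, not the Zarr library itself: it assumes the Zarr v2 convention of naming chunks by their grid indices joined with dots (for example `0.1`), and the helper `chunk_keys` and the array sizes are hypothetical.

```python
from itertools import product


def chunk_keys(shape, chunks, selection):
    """Return the chunk keys that overlap a rectangular selection.

    shape:     array shape, e.g. (1000, 1000)
    chunks:    chunk shape, e.g. (250, 250)
    selection: one (start, stop) pair per dimension, stop exclusive
    """
    index_ranges = []
    for (start, stop), size, dim_len in zip(selection, chunks, shape):
        stop = min(stop, dim_len)      # clip to the array bounds
        first = start // size          # first chunk index touched
        last = (stop - 1) // size      # last chunk index touched
        index_ranges.append(range(first, last + 1))
    # Keys follow the assumed "i.j" naming convention.
    return [".".join(str(i) for i in idx) for idx in product(*index_ranges)]


keys = chunk_keys(shape=(1000, 1000), chunks=(250, 250),
                  selection=[(200, 300), (0, 250)])
print(keys)  # → ['0.0', '1.0']
```

Here a 100 x 250 window touches only 2 of the 16 chunks, so only those two objects would need to be read from storage.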
Status of EO datasets on AWS
How to find and discover data?
https://github.com/radiantearth/stac-spec
“Static catalog” is standardized JSON metadata:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/catalog.json
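Because a static catalog is plain JSON, it can be walked with standard-library tools. The sketch below assumes the STAC convention that sub-catalogs are listed in a `links` array with `rel` set to `child`; the inline catalog body is a hypothetical, abbreviated stand-in so the example runs without network access, and it is not the actual content of the Landsat catalog.

```python
import json
from urllib.parse import urljoin

CATALOG_URL = "https://landsat-stac.s3.amazonaws.com/landsat-8-l1/catalog.json"

# Hypothetical, abbreviated catalog body used only for illustration.
SAMPLE_CATALOG = """
{
  "id": "example-catalog",
  "description": "Illustrative static catalog",
  "links": [
    {"rel": "self", "href": "catalog.json"},
    {"rel": "child", "href": "010/catalog.json"},
    {"rel": "child", "href": "011/catalog.json"}
  ]
}
"""


def child_urls(catalog, base_url):
    """Resolve the href of every 'child' link against the catalog's own URL."""
    return [
        urljoin(base_url, link["href"])
        for link in catalog.get("links", [])
        if link.get("rel") == "child"
    ]


catalog = json.loads(SAMPLE_CATALOG)
children = child_urls(catalog, CATALOG_URL)
for url in children:
    print(url)
```

A crawler would repeat the same step on each child URL until it reaches links with `rel` set to `item`, which point at individual scenes and their assets.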
STAC + COGs enable Cloud-Native tools
STAC Browser generates HTML on the fly from static catalogs, with on-demand tiling of COGs
https://github.com/radiantearth/stac-browser
https://github.com/radiantearth/tiles.rdnt.io
Current work: making static catalogs and items discoverable with Google Search
Putting it all together
Computational tools
* This is an opinionated list, but you can add your Python package of choice.
Demos!
Conclusions and Outlook
How to get involved?
pangeo.io
Open communication via GitHub
Hackweeks to support community transition