1 of 29

Efforts to support end users in the journey to the cloud

Open Source Science Data Repositories Workshop

Amy Steiker • Alexis Hunzinger • Luis Lopez • Catalina Oaida Taglialatela • Aaron Friesz

and the NASA Openscapes Mentors

September 27, 2022

NASA Award# 20-TWSC20-2-0003 • Leads: Julia Stewart Lowndes & Erin Robinson

Openscapes artwork by Allison Horst; @allison_horst

slides: https://nasa-openscapes.github.io/about

We believe Open Science can accelerate data-driven solutions and increase diversity, equity, inclusion, and belonging in research and beyond.

2 of 29

NASA Openscapes

We are a mentor community across NASA Earth science data centers (DAACs)

We are co-creating and teaching common tutorials to support researchers as they migrate analytical workflows to the Cloud

3 of 29

Agenda: short lightning talks

Perspectives from the NASA Openscapes DAAC mentors

Extra slides also showcase Earthdata Cloud Cookbook - Cheatsheets (Catalina Oaida Taglialatela & Cassandra Nickles, PO.DAAC); see https://nasa-openscapes.github.io/about.html#slides

Time   | Topic                      | Presenter
6 mins | DAAC internal training     | Alexis Hunzinger, Christine Smit (GES DISC)
6 mins | End-user training events   | Amy Steiker (NSIDC DAAC)
6 mins | earthaccess Python library | Luis Lopez (NSIDC DAAC)

What: Brief overview of some of the ways DAACs have started supporting end users' transition to the cloud paradigm

Why: Share and learn from each other, grow and improve DAAC support of cloud-archived science and applications users, while following open source, open science best practices.

4 of 29

DAAC Internal Training

5 of 29

Knowledge From Within

Giving DAAC Staff Hands-On Experience in the Cloud

Presenters:

Alexis Hunzinger, Chris Battisto

Helpers:

Allison Alcott, Binita KC, Christine Smit

Teach cloud basics and definitions

Grant access to cloud workspace

Interact with one cloud data access method: direct S3 access

Walk a mile in a user’s shoes

GES DISC

6 of 29

How did we do it?

1. Educate

Present and trial cloud user resources at weekly meetings ahead of workshop

2. Prepare

Split participants by skill and experience level with Python and cloud computing

3. Interact

Teach with bite-sized lessons using Jupyter Notebooks

Encourage type-along during interactive workshop

Tutorial

Template

7 of 29

How did it go?

Grant access to cloud workspace

[Figure: survey charts. Axes: Skill/Experience (NOVICE to EXPERT); Confidence after workshop (LEAST CONFIDENT to MOST CONFIDENT). Panels: Cloud Understanding; Python/Jupyter Notebook; Python; Jupyter Notebook]

Beginner: 12

Intermediate: 14

Advanced: 7

  • Window/tab management
  • Too much material
  • Live-coding typing was too fast
  • Continue to define jargon/terminology throughout workshop
  • “More simple & straightforward than I thought”
  • Breaking up code pieces helped
  • Pre-filled notebooks useful to refer to
  • Easy to follow
  • GitHub clone was easy!
  • Breakout rooms for troubleshooting

Total: 33

Prerequisite Knowledge?

  • Familiar with Earthdata tools
  • Basic Python
  • Basic cloud (AWS) understanding & terminology
  • Account set up for cloud access

8 of 29

What did we learn?

Learning curve is STEEP!

No one left an expert, and we continue to help staff who are experimenting in the cloud

Continued support and education are critical

Necessary to host refresher workshops to exercise the knowledge and introduce new tools and methods

Provide resources that are easy to revisit

Website, slides, instructions, and recordings are all useful for staff who spend more time with the material

Lay a foundation with cloud basics and terminology

Introduce terms and concepts early, perhaps in a separate meeting or clinic, and continue defining them throughout the workshop

9 of 29

End-user Training Events

Outcomes & Lessons Learned

10 of 29

Cloud Training Events

Openscapes Year 1

Event

Date

Focus Area / Goals

November 2021

Five-day collaborative open science learning experience aimed at exploring, creating, and promoting effective cloud-based science and applications workflows using NASA Earthdata Cloud data, tools, and services (among others).

December 2021

Half-day workshop focused on enabling Analysis in the Cloud using NASA Earth Science Data

March 2022

Preparing for Surface Water and Ocean Topography (SWOT) and enabling the (oceanography) science team to process and handle the large volumes of SWOT SSH data in the cloud.

April 2022

Exposing ECOSTRESS data users to ECOSTRESS version 2 (v2) data products in the cloud. Learning objectives focus on how to find and access ECOSTRESS v2 data from Earthdata Cloud, either by downloading the data or by accessing it directly in the cloud.

April, May 2022

A series of Jupyter Notebooks, written in Python, demonstrating how to get started with NASA Earthdata in the cloud. Topics include: Cloud Data Access in AWS, Cloud Optimized Data, Data Discovery using STAC via NASA’s CMR-STAC API, Working with Cloud Data

11 of 29

Outcome: These events markedly raise cloud comfort level

"[We need] Better documentation/tutorials for how to access data over the cloud. It would have been extremely difficult to do any of this without the help of the hackathon."

Before

After

12 of 29

Outcome: Understanding the why

“...It was really eye-opening to not be constrained by my local computer….”

"... More realistically, I will probably use many of these tools on my local machine unless I'm working with big datasets that really benefit from cloud computing.”

Learners’ takeaways predominantly centered around improved conceptual understanding of why and when to use, or not use, the cloud…

… While also recognizing that there is a significant learning curve and time investment required for adoption

Credit: Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science.

13 of 29

Common Pain Points

  • Inconsistent data availability and service offerings

→ Leads to difficulties reusing a given workflow

"Cloud-based tools are not mature enough to use in a research-focused applications. It seems like the API's are being developed to be extremely flexible and powerful, but the use-cases for any particular researcher are much more narrow. The API examples use an already complex toolchain (Jupyter, cloud, python3 stack, etc.) to call a complex API (harmony or cmr) and perform a simple task"

  • Lack of common and robust learning resources

  • Earthdata Cloud ecosystem is complex and overwhelming

→ Learners struggle to know when to use a given workflow or tool/service

14 of 29

Moving forward

"It would be great to see a tutorial or detailed example of how to set up our own jupyter environment. Is there a way we can track how much the work we're doing using this 2i2c environment costs, to give us a better idea of eventual charges for data processing?"

Open Science Enablement

Collaboration tools & methods, supporting interagency and intercloud workflows

Advanced Cloud Processing

Spinning up larger, parallel resources for big data analysis, optimizing & standardizing code

Spinning up a permanent cloud environment

Leveraging 2i2c environment, understanding cost, funding mechanisms

Continuing to support Openscapes 2i2c Hub

Reducing barriers to cloud entry; Meeting users where they are; power in a shared environment

  • Recognizing easy cloud access as a core service
  • Continuing to close the loop between the users we work with and our engineers to build solutions together

15 of 29

earthaccess

NASA Data Search and Access in Python

Luis López et al.

Software Engineer @ NSIDC

16 of 29

earthaccess

Overview

Reproducible workflows are extremely important in the age of cloud data access, cloud computing, and open science.

In this context, we are developing earthaccess, a Python library that aims to simplify data discovery and access for those using the PyData ecosystem (xarray, dask, numpy).

Using this library eliminates the need to know the intricacies of NASA’s Application Programming Interfaces (APIs) and cloud data storage systems.

17 of 29

The Problem: API Fragmentation

In order to programmatically access NASA datasets, users must be familiar with:

  • EDL (Earthdata Login)
    • How to use it with OAuth, curl, wget, etc.
    • .netrc
  • CMR (Common Metadata Repository)
    • How to query for what we want
    • How to read the metadata that CMR returns
  • Cloud
    • AWS
    • S3 buckets, S3 credentials
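
As a rough illustration of the pieces listed above, the sketch below uses only the Python standard library to read an EDL entry from a .netrc-style file and to assemble a CMR granule search URL. The helper names (read_edl_credentials, build_granule_query) and the example short name, dates, and bounding box are hypothetical choices for this sketch, and no request is actually sent.

```python
import netrc
from urllib.parse import urlencode

EDL_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"


def read_edl_credentials(netrc_path):
    """Return (login, password) for EDL from a .netrc-style file.

    Hypothetical helper for illustration only.
    """
    entry = netrc.netrc(str(netrc_path)).authenticators(EDL_HOST)
    if entry is None:
        raise LookupError(f"no entry for {EDL_HOST} in {netrc_path}")
    login, _account, password = entry
    return login, password


def build_granule_query(short_name, temporal, bounding_box, page_size=10):
    """Assemble a CMR granule search URL (the URL is built, not requested).

    temporal is (start, end) as ISO 8601 strings; bounding_box is
    (west, south, east, north) in degrees.
    """
    params = {
        "short_name": short_name,
        "temporal": ",".join(temporal),
        "bounding_box": ",".join(str(v) for v in bounding_box),
        "page_size": page_size,
    }
    return f"{CMR_GRANULE_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Example values are assumptions chosen for this sketch.
    print(build_granule_query(
        "ATL06",
        ("2020-01-01T00:00:00Z", "2020-01-31T23:59:59Z"),
        (-50, 60, -40, 70),
    ))
```

Even this small sketch touches two separate systems (EDL and CMR) before any data is read, and S3 credentials would be a third step on top.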

Software Engineer

Geo

Scientist

18 of 29

API fragmentation in the cloud

API fragmentation in the notebook

from * import *

Image credit: Patrick Quinn

19 of 29

earthaccess: when to use

  • We work at the granule level
    • If we already use OPeNDAP/Harmony effectively, we should keep doing so: earthaccess does not support subsetting or GIS operations (yet)
  • We don’t require granule pre-processing.
    • (see above)
  • We are not cloud experts and just want to access the data.

20 of 29

earthaccess: simplifying access
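
A minimal sketch of what the simplified access can look like, assuming a recent earthaccess release (the interface shown at the time of this talk may have differed). The wrapper names build_search_kwargs and search_and_download are hypothetical; the earthaccess calls used are login, search_data, and download. The function that talks to NASA services needs an Earthdata Login account and network access, so it is defined here but not called.

```python
def build_search_kwargs(short_name, start, end, bounding_box, count=5):
    """Collect search parameters in the shape earthaccess.search_data accepts.

    Hypothetical helper; bounding_box is (west, south, east, north).
    """
    return {
        "short_name": short_name,
        "temporal": (start, end),
        "bounding_box": tuple(bounding_box),
        "count": count,
    }


def search_and_download(local_path="./data", **search_kwargs):
    """Log in, search for granules, and download them to local_path.

    Requires the third-party earthaccess package, EDL credentials,
    and network access.
    """
    import earthaccess  # imported lazily so the helper above works without it

    earthaccess.login()  # looks for EDL credentials, prompting if none are found
    granules = earthaccess.search_data(**search_kwargs)
    return earthaccess.download(granules, local_path)


if __name__ == "__main__":
    # Example values are assumptions chosen for this sketch.
    kwargs = build_search_kwargs(
        "ATL06", "2020-01-01", "2020-01-31", (-50, 60, -40, 70)
    )
    print(kwargs)
    # files = search_and_download("./data", **kwargs)  # needs EDL credentials
```

The same three steps cover authentication, discovery, and access, without the user handling EDL tokens, CMR queries, or S3 credentials directly.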

21 of 29

Next Steps

  • Technical
    • Use asyncio for all bulk operations
    • Kerchunk / zarr-eosdis-store integration.
  • Programmatic integration tests
    • Make sure we can access granules from any collection.
  • EULA Integration
    • There is no programmatic way of knowing if a dataset requires an EULA and there is no programmatic way of accepting it.
  • Making the API even simpler.

from earthaccess.nsidc import atl06

  • Get more people involved!
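
To make the asyncio item above concrete, here is a small sketch of bounded concurrent bulk fetching. It is not earthaccess's implementation: fetch_granule is a stand-in that sleeps instead of transferring data, and the URLs, function names, and concurrency limit are assumptions for illustration.

```python
import asyncio


async def fetch_granule(url):
    """Stand-in for a real download: waits briefly, returns the file name."""
    await asyncio.sleep(0.01)
    return url.rsplit("/", 1)[-1]


async def fetch_all(urls, max_concurrent=4):
    """Fetch all URLs concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await fetch_granule(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


def download_all(urls, max_concurrent=4):
    """Synchronous entry point for scripts and notebooks without a running loop."""
    return asyncio.run(fetch_all(urls, max_concurrent))


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the sketch.
    demo_urls = [f"https://example.com/data/granule_{i}.h5" for i in range(10)]
    print(download_all(demo_urls))
```

The semaphore keeps the number of simultaneous transfers bounded, which matters when a search returns thousands of granules.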

22 of 29

Thanks!

Thanks to the people who made this possible!

Not pictured: more people!

23 of 29

Earthdata Cloud Cookbook:

Workflow and Vocab Cheatsheets

Catalina Oaida Taglialatela (PO.DAAC), Cassie Nickles (PO.DAAC), Julie Lowndes (Openscapes), Amy Steiker (NSIDC), Aaron Friesz (LP DAAC), Alexis Hunzinger (GES DISC)

24 of 29

Earthdata Cloud Cookbook

Cheatsheets & Guides

Tools & Services Roadmap

  • Practical guide for learning and selecting the right tool or service for a given use case

https://nasa-openscapes.github.io/earthdata-cloud-cookbook/cheatsheet.html#tools-services-roadmap

demo

25 of 29

Earthdata Cloud Cookbook

Cheatsheets & Guides

Workflow Cheatsheet

  • Practical reference guide for users as they begin taking the conceptual pieces and exploring and implementing them in their own workflows
  • Guide to selecting from available tools to enable and implement the Access Pathway(s)
  • Links concepts and tool resources (tutorials)

https://nasa-openscapes.github.io/earthdata-cloud-cookbook/cheatsheet.html#workflow-cheatsheet

demo

26 of 29

Earthdata Cloud Cookbook

new Cheatsheets & Guides

Why: Increase accessibility to data & resources; many tools, lots of (new) jargon

What: Conceptual, practical, or reference guides that help users find the paths and tools most useful for a given need. Users are at different points in the learning process, so we offer a range of guides & cheatsheets

How: Developed with NASA Openscapes and other DAAC mentors for consistency across DAACs in messaging, information, and user experience

Where: Implementation in:

27 of 29

Earthdata Cloud Cookbook

Supporting NASA Earth science research teams’ migration to the cloud

https://nasa-openscapes.github.io/earthdata-cloud-cookbook/

  • Curated collection of tutorials that we’ve iterated on and adapted over successive versions, based on feedback from live (virtual) training events
  • Focus on the common steps across DAACs/users
  • For self-paced learning
  • Links back to underlying GitHub repo
  • Under active, open development

A place to learn, share, and experiment with NASA Earthdata on the Cloud. We know this has a lot of moving parts; we are iterating as we go and welcome feedback and contributions.

28 of 29

Closing

29 of 29