1 of 44

Data Strategies in the Cloud

Tips for managing your data and adjusting for the cloud

2 of 44

What are Data Strategies?


Data Strategies enhance collaboration and reproducible science

  • Workflows;
  • Data management best practices;
  • Documentation;

Good to start from the beginning of a project, great to start from where you are now

Help you future-proof workflows: is your workflow reproducible?

3 of 44

Why do we care?


  1. Benefit current and future “us” - you, your team, the scientific community

  • Consider the impacts of computing on our planet, shared resources, and fellow humans

Considerations - When To Cloud

Environmental impacts of earth observation data in the constellation and cloud computing era

  • Starting point for adopting best practices for computing
    • In line with Open Science (check out NASA TOPS Open Science 101 online modules!)

  • Add to your toolkit! And know how to use those tools!

4 of 44

Anatomy of a data workflow

5 of 44

Anatomy of a data workflow

[Workflow diagram, annotated “data wrangling”]

6 of 44

Make sure data are Findable and Accessible

Does everyone on your team know where the data is?

Can they access it?

Helpful to document this somewhere.

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

7 of 44

Tips for Data Management

Keep raw data, raw!

Save intermediate data, not just final versions.

Use consistent and descriptive folder and file name patterns.

Enter: TIDY DATA!

8 of 44

Standard file formats make data Interoperable

  • GeoTIFF for imagery or 2D raster data
  • NetCDF for multi-dimensional data (3D+)
  • Shapefiles or GeoJSON for vector data
  • CSV for tabular data

Avoid Excel and other proprietary formats.

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

9 of 44

Metadata makes data Interoperable and Reusable

Metadata standards and conventions ensure that standard tools can read/interpret the data.

Standards also define the meaning of metadata attributes.

  • What is the Coordinate Reference System? (projection, grid mapping)
  • What are the units?
  • What is the variable name?
  • What is the source of the data?
  • What script produced the data?

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

10 of 44

Document the analysis

So, is it Reproducible?

Can you (or anyone else) easily reproduce your processing pipeline?

Document each step.

  • Where did you get the data, which files, which version?
  • Write it down. Anywhere is good, but using a script is better.

With GUI tools (e.g. ArcGIS, QGIS, Excel), use screenshots or keep a journal of the commands and steps.

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

11 of 44

Anatomy of a data workflow

[Workflow diagram: intermediate files produced at each step, leading to results/end product]

12 of 44

When and how to save and publish files


Save intermediate data, not just final versions.

  • Who needs to use these again?
  • Where can they be stored?
  • Be FAIR, even at intermediate steps

Share your shiny new (tidy) dataset with the world!

  • Submit to the most relevant NASA DAAC as a community product

  • Find more repositories via the DataONE federation

*repositories on a spectrum of FAIR compliance

13 of 44


THE CLOUD


15 of 44

When we store photos in iCloud or Google accounts instead of on our cell phones, we are using cloud storage.

When we watch movies and tv shows on streaming services like Netflix or Hulu, we are using the cloud.

In these cases, we are interacting with “the cloud” without knowing it, though we, the user, are not in “the cloud”.

At a basic level, “The Cloud” is somewhere that isn’t your computer.

We all interact with data and services that live in “The Cloud” in our daily lives.

"When you are frustrated, can you throw it out a window or can you not?"

- Yuvi Panda, Jupyter, 2i2c 😎

16 of 44

How do Data Strategies change when files can be accessed from or stored in The Cloud?

17 of 44

When and how to save files in The Cloud


Save intermediate data, not just final versions.

  • Who needs to use these again?
  • Where can they be stored?
  • Be FAIR, even at intermediate steps

18 of 44

Where is my data? Where is my compute?

[Quadrant diagram: Compute (Not Cloud / Cloud) crossed with Data storage (Not Cloud / Cloud)]

Courtesy of Alexis Hunzinger (NASA Openscapes)


21 of 44

How do Data Strategies change when files can be accessed from or stored in The Cloud?

Example:

Storing intermediate files in our JupyterHub

22 of 44

Demo: data storage in our JupyterHub


When you are working in the NASA Openscapes 2i2c JupyterHub, the default storage location is the HOME directory (/home/jovyan/).

HOME is really handy because it is persistent across server restarts and is a great place to store your code!

However, the HOME directory is not a great place to store data: it is expensive (you are charged for data sitting there) and can be quite slow to read from and write to.

Our approach for this:

1. $HOME -> for code

2. $SCRATCH_BUCKET -> for 'intermediate' data products

3. $PERSISTENT_BUCKET -> for 'permanent' data products

nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/using-s3-storage

($SCRATCH_BUCKET and $PERSISTENT_BUCKET are AWS “S3” buckets.)

23 of 44

Q&A / Discussion


What questions do you have?

24 of 44

Extra Slides

25 of 44

Resources for evaluating your workflows and best practices


26 of 44

How to future-proof workflows

If your goal is to make your workflows reproducible, try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

Applies to the future you and your team as well.

When and why we do, and when and why we don’t.

27 of 44

Anatomy of a data workflow

[Workflow diagram: intermediate files produced at each step, leading to results/end product]

28 of 44

Anatomy of a data workflow

29 of 44

Computing

Storage


30 of 44

[Chart: “Workflow Scales” — calculation time (CPU-hours, 1 to 10,000) vs. data size (0.1 to 1,000 TB), with regions for laptop/desktop, research group server, university shared compute cluster, and cloud computing]

For my own workflows:

Cloud computing is unlikely to replace my current computing environments (shared server or compute cluster).

What it will do instead is facilitate processing of large datasets, via NASA Earthdata Cloud.

Slide credit: Aronne Merrelli

Climate and Space Sciences and Engineering, U. Michigan

merrelli@umich.edu

31 of 44

When to Cloud?

Questions to ask yourself…


Is my task limited by:

  1. Network speed
  2. Disk speed
  3. CPU speed
  4. Disk size

(These could apply to your machine or to other machines, like “the cloud”.)

How do I address that rate-limiting process?

What tools and services are available? What do protocols allow?

Slide credit: Aronne Merrelli

Climate and Space Sciences and Engineering, U. Michigan

merrelli@umich.edu

32 of 44

Questions to ask yourself…


Consider cloud-optimized options: formats, readers, and storage.

33 of 44

What is the Cloud, anyway?


34 of 44


Accessing data from NASA Earthdata Cloud

35 of 44

36 of 44

Accessing data from NASA Earthdata Cloud

Access pathways (not exhaustive):

A. Download to a local computer, laptop, or server
  • command line
  • programmatic (Python, notebook, R, etc.)
  • via a graphical user interface, e.g. Earthdata Search

B. In-cloud access workflows
  • programmatic

See the Earthdata Cloud Cookbook – Glossary, Cheatsheets, & Guides section:

https://nasa-openscapes.github.io/earthdata-cloud-cookbook/glossary.html

37 of 44

Some definitions

download = transfer a file from a remote machine to your local machine (which may be a laptop, workstation, HPC system, etc.)

cloud = “someone else’s computer”, which means an internet connection is involved. This “other” computer can be a virtual machine within your institution’s computing services, or one of many AWS computers that make up The Cloud.

streaming = accessing data by transferring bytes into memory or some virtual file system

Analogy: 'Netflix streaming' vs downloading.

By "download", I mean downloading to my laptop's harddisk. That means the computer has a 'cold copy' of the data; I can switch off wifi, shut down my machine, turn it back on, the data is on my machine when I download it.

When we stream data, the hard disk doesn't get involved. Data travels from my network card to my RAM+CPU. We can compute things like means by streaming: you don't store all N numbers in the mean, you just keep track of the running sum and the number you have seen, mean = mean + new. If someone asks you for all the original values, you can't answer without re-streaming, you didn't store them. We can stream data in this way that is way bigger than RAM, bigger than harddisk, just like we can consume more Netflix that we have harddisk or RAM for.

Source: Carl Boettiger, Andy Barrett (via Slack), Catalina Taglialatela

Image source: https://www.linkedin.com/pulse/help-our-cloud-migration-mission-impossible-marco-rijkers

on-premises = “local” to the user; control of one’s own data server, OR a DAAC’s on-site servers

download = transfer a file from a remote machine to a local machine

stream = transfer bytes into memory or some virtual file system (no hard disk involved)

range request = ask a server to send only a portion of an HTTP message back to a client

38 of 44

39 of 44

40 of 44

NASA Earthdata Cloud – Solutions for Access


Managed Cloud Service

Do It Yourself:

  • Create your own AWS account
  • Connect to an EC2 instance
  • Manage setup and cost

41 of 44

[Diagram: download makes a data copy (cloud → local, or within the cloud); streaming within the cloud makes no data copy. Labels A and B match the access pathways above.]

42 of 44

Working modes


COMPUTE + DATA STORAGE

  • Local storage + Local compute: downloaded or created files
  • Local storage + Cloud compute: why?
  • Local storage + HPC compute: likely not possible
  • Cloud storage + Local compute: streaming possible but may be slow; works for small data
  • Cloud storage + Cloud compute: stream data, scalable compute (Hub, instance, or Coiled)
  • Cloud storage + HPC compute: likely not possible
  • HPC storage + Local compute: why?
  • HPC storage + Cloud compute: why?
  • HPC storage + HPC compute: CPU-intensive processing

HPC: high performance computing


44 of 44

Cloud working modes

  1. Accessing data from the cloud on my computer (not cloud)
  2. Accessing data from the cloud, working in the cloud

Consider:

  • Advantages of streaming + range requests vs. full downloads of datasets
  • Existing services that help with data access OR compute allocation
    • Data: OPeNDAP
    • Compute: Coiled, Google Colab, Binder
  • What opportunities do cloud storage+computing present?
    • e.g. NASA Earth Science data stored in the cloud

Download: transfer a file from a remote machine to a local machine

Stream: transfer bytes into memory or some virtual file system (no hard disk involved)

Range requests: ask the server to send only a portion of an HTTP message back to a client