Data Strategies in the Cloud
Tips for managing your data and adjusting for the cloud
What are Data Strategies?
Data Strategies enhance collaboration and reproducible science
Good to start from the beginning of a project, great to start from where you are now
Help you future-proof workflows: is your workflow reproducible?
Why do we care?
Considerations - When To Cloud
Environmental impacts of earth observation data in the constellation and cloud computing era
Anatomy of a data workflow
“data wrangling”
Make sure data are Findable and Accessible
Does everyone on your team know where the data is?
Can they access it?
Helpful to document this somewhere.
Tips for Data Management
Keep raw data, raw!
Save intermediate data, not just final versions.
Use consistent and descriptive folder and file name patterns.
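One way to keep folder and file names consistent is to build them from the same few pieces every time. The sketch below is only an illustration: the function name `build_path`, the folder layout (project/stage/file), and the example project and variable names are all assumptions, not a prescribed convention.

```python
from datetime import date
from pathlib import PurePosixPath


def build_path(project, stage, variable, day, version, ext="csv"):
    """Return a path like project/stage/variable_YYYYMMDD_v01.ext (hypothetical pattern)."""
    # Dates written as YYYYMMDD sort chronologically when sorted alphabetically.
    filename = f"{variable}_{day:%Y%m%d}_v{version:02d}.{ext}"
    return PurePosixPath(project) / stage / filename


# Made-up example values, for illustration only.
path = build_path("sst-analysis", "intermediate", "sst-monthly-mean", date(2024, 3, 1), 1)
print(path)
```

Generating names from one function means raw, intermediate, and final files all follow the same pattern, so teammates can guess where a file lives without asking.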
Enter: TIDY DATA!
Standard file formats make data Interoperable
Avoid Excel and other proprietary formats.
Metadata makes data Interoperable and Reusable
Metadata standards and conventions ensure that standard tools can read/interpret the data.
Standards also define the meaning of metadata attributes.
Document the analysis
So, is it Reproducible?
Can you (or anyone else) easily reproduce your processing pipeline?
Document each step.
With GUIs (e.g. ArcGIS, QGIS, Excel), take screenshots and keep a journal of the commands you run.
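For scripted workflows, "document each step" can be as simple as writing down what went in and what came out of every step. The sketch below is one possible approach, not an established tool: the helper `record_step`, the field names, and the file paths are all hypothetical.

```python
import json
from datetime import datetime, timezone


def record_step(log, name, inputs, outputs, note=""):
    """Append one processing step to an in-memory log and return the entry."""
    entry = {
        "step": len(log) + 1,
        "name": name,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "note": note,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    log.append(entry)
    return entry


# Made-up steps and paths, for illustration only.
log = []
record_step(log, "subset", ["raw/granule.nc"], ["intermediate/subset.nc"], "cropped to study area")
record_step(log, "monthly-mean", ["intermediate/subset.nc"], ["results/monthly_mean.csv"])

# One JSON object per line is easy to read, diff, and keep next to the data.
print("\n".join(json.dumps(entry) for entry in log))
```

Because each entry lists its inputs and outputs, the log doubles as a map from raw data through intermediate files to the end product.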
Anatomy of a data workflow
intermediate files
intermediate files
results/end product
When and how to save and publish files
Save intermediate data, not just final versions.
Share your shiny new (tidy) dataset with the world!
*repositories on a spectrum of FAIR compliance
THE CLOUD
When we store photos in iCloud or Google accounts instead of on our cell phones, we are using cloud storage.
When we watch movies and tv shows on streaming services like Netflix or Hulu, we are using the cloud.
In these cases, we are interacting with “the cloud” without knowing it, though we, the users, are not in “the cloud”.
At a basic level, “The Cloud” is somewhere that isn’t your computer.
We all interact with data and services that live in “The Cloud” in our daily lives.
"When you are frustrated, can you throw it out a window or can you not?"
- Yuvi Panda, Jupyter, 2i2c 😎
How do Data Strategies change when files can be accessed from or stored in The Cloud?
When and how to save files in The Cloud
Save intermediate data, not just final versions.
Where is my data? Where is my compute?
| Data storage | Compute: Not Cloud | Compute: Cloud |
| Not Cloud | | |
| Cloud | | |
Courtesy of Alexis Hunzinger (NASA Openscapes)
How do Data Strategies change when files can be accessed from or stored in The Cloud?
Example:
Storing intermediate files in our JupyterHub
Demo: data storage in our JupyterHub
When you are working in the NASA Openscapes 2i2c JupyterHub, the default storage location is the HOME directory (/home/jovyan/).
HOME is really handy because it is persistent across server restarts and is a great place to store your code!
However, the HOME directory is not a great place to store data: it is very expensive (you’re charged for data sitting there) and can be quite slow to read from and write to.
Our approach for this:
1. $HOME -> for code
2. $SCRATCH_BUCKET -> for 'intermediate' data products
3. $PERSISTENT_BUCKET -> for 'permanent' data products
nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/using-s3-storage
$SCRATCH_BUCKET and $PERSISTENT_BUCKET are AWS “S3” buckets.
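The three-location approach above can be sketched in Python by reading the same environment variables. This is a minimal illustration: the helper `storage_path`, the `LOCATION_VARS` mapping, and the example bucket URLs are assumptions made for the sketch; on the Hub the real values come from the environment, and the linked cookbook how-to covers actual reading and writing.

```python
import os

# Kind of file -> environment variable named in the list above.
LOCATION_VARS = {
    "code": "HOME",
    "intermediate": "SCRATCH_BUCKET",
    "permanent": "PERSISTENT_BUCKET",
}


def storage_path(kind, relative_name, env=None):
    """Join the base location for `kind` with a relative file name."""
    env = os.environ if env is None else env
    base = env.get(LOCATION_VARS[kind])
    if base is None:
        raise KeyError(f"{LOCATION_VARS[kind]} is not set in this environment")
    return f"{base.rstrip('/')}/{relative_name.lstrip('/')}"


# Made-up values for illustration; the real ones are set by the Hub.
example_env = {
    "HOME": "/home/jovyan",
    "SCRATCH_BUCKET": "s3://example-scratch/username",
    "PERSISTENT_BUCKET": "s3://example-persistent/username",
}
print(storage_path("intermediate", "subset/sst_202403.nc", env=example_env))
```

Routing every write through one helper keeps data out of HOME by default and makes it obvious which products are meant to be temporary.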
Q&A / Discussion
What questions do you have?
Extra Slides
Resources for evaluating your resources/ best practices?
How to future-proof workflows
If your goal is to make your workflows reproducible, try to be FAIR.
Applies to the future you and your team as well.
When and why we do and when and why we don’t
Anatomy of a data workflow
intermediate files
intermediate files
results/end product
Anatomy of a data workflow
Computing
Storage
[Figure: calculation time (CPU-hours, 1 to 10,000) versus data size (TB, 0.1 to 1000), with regions labeled Laptop/Desktop, Research Group Server, Univ. shared compute cluster, and Cloud computing]
For my own workflows:
Cloud computing is unlikely to replace any of my current computing environments (shared server or compute cluster).
Instead, it will facilitate processing of large datasets via NASA Earthdata Cloud.
Workflow Scales
Slide credit: Aronne Merrelli
Climate and Space Sciences and Engineering, U. Michigan
merrelli@umich.edu
When to Cloud?
Questions to ask yourself…
Is my task limited by:
How do I address that rate-limiting process?
What tools and services are available? What do protocols allow?
Slide credit: Aronne Merrelli
Climate and Space Sciences and Engineering, U. Michigan
merrelli@umich.edu
Questions to ask yourself…
(from Andy’s slides)
(from Carl’s comment in thread)
INCLUDE AND HIGHLIGHT - cloud-optimized options (formats, readers, storage)
What is the Cloud, anyway?
Accessing data from NASA Earthdata Cloud
Access pathways (not exhaustive)
See Earthdata Cloud Cookbook - Glossary, Cheatsheets, & Guides section
https://nasa-openscapes.github.io/earthdata-cloud-cookbook/glossary.html
B. In-cloud access workflows
Some definitions
download = transfer a file from a remote machine to your local machine (which may be a laptop, a workstation, an HPC system, etc.)
cloud = “someone else's computer”, which means a network connection is involved. This “other” computer can be a virtual machine within your institution's computing services, or one of the many AWS computers that make up The Cloud.
streaming = accessing data by transferring bytes into memory or some virtual file system
Analogy: 'Netflix streaming' vs downloading.
By "download", I mean downloading to my laptop's hard disk. That means the computer has a 'cold copy' of the data: I can switch off wifi, shut down my machine, and turn it back on, and the data I downloaded is still on my machine.
When we stream data, the hard disk doesn't get involved. Data travels from my network card to my RAM+CPU. We can compute things like means by streaming: you don't store all N numbers, you just keep track of the running sum and how many numbers you have seen (sum = sum + new, count = count + 1), and the mean is sum / count. If someone asks you for all the original values, you can't answer without re-streaming, because you didn't store them. In this way we can stream data that is far bigger than RAM or hard disk, just like we can consume more Netflix than we have hard disk or RAM for.
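The streaming mean described above can be sketched in a few lines of Python. The generator `fake_stream` is a hypothetical stand-in for bytes arriving over the network; only the running sum and the count are ever held in memory.

```python
def streaming_mean(values):
    """Mean of an iterable, keeping only a running sum and a count."""
    total = 0.0
    count = 0
    for value in values:
        total += value  # sum = sum + new
        count += 1      # count = count + 1
    if count == 0:
        raise ValueError("cannot take the mean of an empty stream")
    return total / count


def fake_stream(n):
    """Yield 1.0, 2.0, ..., n one value at a time, never building a list."""
    for i in range(1, n + 1):
        yield float(i)


print(streaming_mean(fake_stream(1_000_000)))
```

After the loop finishes the original values are gone: answering a different question about them (a median, say) would mean streaming the data again.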
Source: Carl Boettiger, Andy Barrett (via Slack), Catalina Taglialatela
Image source: https://www.linkedin.com/pulse/help-our-cloud-migration-mission-impossible-marco-rijkers
on-premises | “Local” to the user: control of one’s own data server, OR a DAAC’s on-site servers |
download | Transfer a file from a remote machine to local machine |
stream | Transferring bytes into memory or some virtual file system (no hard disk involved) |
range requests | Asks server to send only a portion of an HTTP message back to a client |
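To make the "range requests" row concrete, the sketch below builds (but does not send) an HTTP request that asks for only the first 1024 bytes of a file, using the standard `Range` header. The helper name `build_range_request` and the URL are hypothetical; whether a given server honors the header depends on that server.

```python
import urllib.request


def build_range_request(url, start, length):
    """Build an HTTP GET asking for `length` bytes starting at byte `start`."""
    end = start + length - 1  # the end of a byte range is inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Placeholder URL for illustration; nothing is sent over the network here.
request = build_range_request("https://example.com/data/granule.nc", 0, 1024)
print(request.get_header("Range"))
```

Passing the request to `urllib.request.urlopen` would actually send it; a server that supports range requests typically answers with status 206 (Partial Content) and only the requested bytes.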
NASA Earthdata Cloud – Solutions for Access
[Diagram: Managed Cloud Service and Do It Yourself. Access options shown: Cloud, download (data copy); Cloud, streaming (no data copy); Local, download (data copy). Pathways are labeled A and B.]
Working modes
COMPUTE + DATA STORAGE
| Data storage | Compute: Local | Compute: Cloud | Compute: HPC |
| Local | Downloaded or created files | Why? | Likely not possible |
| Cloud | Streaming possible but may be slow. Works for small data | Stream data, scalable compute. Hub, instance, or Coiled | Likely not possible |
| HPC | Why? | Why? | CPU-intensive processing |
HPC: high performance computing
Cloud working modes
Consider:
Download: transfer a file from a remote machine to a local machine
Stream: transfer bytes into memory or some virtual file system (no hard disk involved)
Range requests: ask the server to send only a portion of an HTTP message back to a client