1 of 44

Data Strategies in the Cloud

Tips for managing your data and adjusting for the cloud

2 of 44

What are Data Strategies?


Data Strategies enhance collaboration and reproducible science

  • Workflows;
  • Data management best practices;
  • Documentation;

Good to start from the beginning of a project, great to start from where you are now

Help you future-proof workflows: is your workflow reproducible?

3 of 44

Why do we care?


  1. Benefit current and future “us” - you, your team, the scientific community

  • Consider the impacts of computing on our planet, shared resources, and fellow humans

Considerations - When To Cloud

Environmental impacts of earth observation data in the constellation and cloud computing era

  • Starting point for adopting best practices for computing
    • In line with Open Science (check out NASA TOPS Open Science 101 online modules!)

  • Add to your toolkit! And know how to use those tools!

4 of 44

Anatomy of a data workflow

5 of 44

Anatomy of a data workflow

[Workflow diagram, annotated “data wrangling”]

6 of 44

Make sure data are Findable and Accessible

Does everyone on your team know where the data is?

Can they access it?

Helpful to document this somewhere.

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

7 of 44

Tips for Data Management

Keep raw data, raw!

Save intermediate data, not just final versions.

Use consistent and descriptive folder and file name patterns.

Enter: TIDY DATA!

8 of 44

Standard file formats make data Interoperable

  • GeoTIFF for imagery or 2D raster data
  • NetCDF for multi-dimensional data (3D+)
  • Shapefiles or GeoJSON for vector data
  • CSV for tabular data

Avoid Excel and other proprietary formats.

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

9 of 44

Metadata makes data Interoperable and Reusable

Metadata standards and conventions ensure that standard tools can read/interpret the data.

Standards also define the meaning of metadata attributes.

  • What is the Coordinate Reference System? (projection, grid mapping)
  • What are the units?
  • What is the variable name?
  • What is the source of the data?
  • What script produced the data?

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

10 of 44

Document the analysis

So, is it Reproducible?

Can you (or anyone else) easily reproduce your processing pipeline?

Document each step.

  • Where did you get the data, which files, which version?
  • Write it down. Anywhere is good, but using a script is better.

With GUI tools (e.g. ArcGIS, QGIS, Excel), use screenshots or keep a journal of the commands and steps.

Try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

FAIR Principles, GO FAIR Initiative

11 of 44

Anatomy of a data workflow

[Workflow diagram: intermediate files produced at each step, leading to results/end product]

12 of 44

When and how to save and publish files


Save intermediate data, not just final versions.

  • Who needs to use these again?
  • Where can they be stored?
  • Be FAIR, even at intermediate steps

Share your shiny new (tidy) dataset with the world!

  • Submit to the most relevant NASA DAAC as a community product

  • Find more repositories via the DataONE federation

*repositories on a spectrum of FAIR compliance

13 of 44


THE CLOUD


15 of 44

When we store photos in iCloud or Google accounts instead of on our cell phones, we are using cloud storage.

When we watch movies and tv shows on streaming services like Netflix or Hulu, we are using the cloud.

In these cases, we are interacting with “the cloud” without knowing it, though we, the user, are not in “the cloud”.

At a basic level, “The Cloud” is somewhere that isn’t your computer.

We all interact with data and services that live in “The Cloud” in our daily lives.

"When you are frustrated, can you throw it out a window or can you not?"

- Yuvi Panda, Jupyter, 2i2c 😎

16 of 44

How do Data Strategies change when files can be accessed from or stored in The Cloud?

17 of 44

When and how to save files in The Cloud


Save intermediate data, not just final versions.

  • Who needs to use these again?
  • Where can they be stored?
  • Be FAIR, even at intermediate steps

18 of 44

Where is my data? Where is my compute?

[Quadrant diagram: Compute (Not Cloud / Cloud) crossed with Data storage (Not Cloud / Cloud)]

Courtesy of Alexis Hunzinger (NASA Openscapes)


21 of 44

How do Data Strategies change when files can be accessed from or stored in The Cloud?

Example:

Storing intermediate files in our JupyterHub

22 of 44

Demo: data storage in our JupyterHub


When you are working in the NASA Openscapes 2i2c JupyterHub, the default storage location is the HOME directory (/home/jovyan/).

HOME is really handy because it is persistent across server restarts and is a great place to store your code!

However, the HOME directory is not a great place to store data: it is expensive (you are charged for data sitting there) and can be quite slow to read from and write to.

Our approach for this:

1. $HOME -> for code

2. $SCRATCH_BUCKET -> for 'intermediate' data products

3. $PERSISTENT_BUCKET -> for 'permanent' data products

nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/using-s3-storage

($SCRATCH_BUCKET and $PERSISTENT_BUCKET are AWS “S3” buckets.)

23 of 44

Q&A / Discussion


What questions do you have?

24 of 44

Extra Slides

25 of 44

Resources for evaluating your workflows and best practices


26 of 44

How to future-proof workflows

If your goal is to make your workflows reproducible, try to be FAIR.

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

Applies to the future you and your team as well.

When and why we do, and when and why we don’t.

27 of 44

Anatomy of a data workflow

[Workflow diagram: intermediate files produced at each step, leading to results/end product]

28 of 44

Anatomy of a data workflow

29 of 44

Computing

Storage


30 of 44

[Chart: “Workflow Scales” — calculation time (CPU-hours, 1 to 10,000) vs. data size (0.1 to 1,000 TB), with regions for laptop/desktop, research group server, university shared compute cluster, and cloud computing]

For my own workflows:

Cloud computing is unlikely to replace my current computing environments (shared server or compute cluster).

What it will do instead is facilitate processing of large datasets, via NASA Earthdata Cloud.

Slide credit: Aronne Merrelli

Climate and Space Sciences and Engineering, U. Michigan

merrelli@umich.edu

31 of 44

When to Cloud?

Questions to ask yourself…


Is my task limited by:

  1. Network speed
  2. Disk speed
  3. CPU speed
  4. Disk size

(These could apply to your machine or to other machines, like “the cloud”.)

How do I address that rate-limiting process?

What tools and services are available? What do protocols allow?

Slide credit: Aronne Merrelli

Climate and Space Sciences and Engineering, U. Michigan

merrelli@umich.edu

32 of 44

Questions to ask yourself…


Consider cloud-optimized options: formats, readers, and storage.

33 of 44

What is the Cloud, anyway?


34 of 44


Accessing data from NASA Earthdata Cloud

35 of 44

36 of 44

Accessing data from NASA Earthdata Cloud

Access pathways (not exhaustive):

A. Download to a local computer, laptop, or server
  • command line
  • programmatic (Python, notebook, R, etc.)
  • via a graphical user interface, e.g. Earthdata Search

B. In-cloud access workflows
  • programmatic

See the Earthdata Cloud Cookbook – Glossary, Cheatsheets, & Guides section:

https://nasa-openscapes.github.io/earthdata-cloud-cookbook/glossary.html

37 of 44

Some definitions

download = transfer a file from a remote machine to your local machine (which may be a laptop, workstation, HPC system, etc.)

cloud = “someone else’s computer”, which means an internet connection is involved. This “other” computer can be a virtual machine within your institution’s computing services, or one of many AWS computers that make up The Cloud.

streaming = accessing data by transferring bytes into memory or some virtual file system

Analogy: 'Netflix streaming' vs downloading.

By "download", I mean downloading to my laptop's harddisk. That means the computer has a 'cold copy' of the data; I can switch off wifi, shut down my machine, turn it back on, the data is on my machine when I download it.

When we stream data, the hard disk doesn't get involved. Data travels from my network card to my RAM+CPU. We can compute things like means by streaming: you don't store all N numbers in the mean, you just keep track of the running sum and the number you have seen, mean = mean + new. If someone asks you for all the original values, you can't answer without re-streaming, you didn't store them. We can stream data in this way that is way bigger than RAM, bigger than harddisk, just like we can consume more Netflix that we have harddisk or RAM for.

Source: Carl Boettiger, Andy Barrett (via Slack), Catalina Taglialatela

Image source: https://www.linkedin.com/pulse/help-our-cloud-migration-mission-impossible-marco-rijkers

on-premises = “local” to the user; control of one’s own data server, OR a DAAC’s on-site servers

download = transfer a file from a remote machine to a local machine

stream = transfer bytes into memory or some virtual file system (no hard disk involved)

range request = ask a server to send only a portion of an HTTP message back to a client

38 of 44

39 of 44

40 of 44

NASA Earthdata Cloud – Solutions for Access


Managed Cloud Service

Do It Yourself:

  • Create your own AWS account
  • Connect to an EC2 instance
  • Manage setup and cost

41 of 44

[Diagram: download makes a data copy (cloud → local, or within the cloud); streaming within the cloud makes no data copy. Labels A and B match the access pathways above.]

42 of 44

Working modes


COMPUTE + DATA STORAGE

  • Local storage + Local compute: downloaded or created files
  • Local storage + Cloud compute: why?
  • Local storage + HPC compute: likely not possible
  • Cloud storage + Local compute: streaming possible but may be slow; works for small data
  • Cloud storage + Cloud compute: stream data, scalable compute (Hub, instance, or Coiled)
  • Cloud storage + HPC compute: likely not possible
  • HPC storage + Local compute: why?
  • HPC storage + Cloud compute: why?
  • HPC storage + HPC compute: CPU-intensive processing

HPC: high performance computing


44 of 44

Cloud working modes

  1. Accessing data from the cloud on my computer (not cloud)
  2. Accessing data from the cloud, working in the cloud

Consider:

  • Advantages of streaming + range requests vs. full downloads of datasets
  • Existing services that help with data access OR compute allocation
    • Data: OPeNDAP
    • Compute: Coiled, Google Colab, Binder
  • What opportunities do cloud storage+computing present?
    • e.g. NASA Earth Science data stored in the cloud

Download: transfer a file from a remote machine to a local machine

Stream: transfer bytes into memory or some virtual file system (no hard disk involved)

Range requests: ask the server to send only a portion of an HTTP message back to a client