1 of 19

2 of 19

Data strategies for Future Us

Pathways to Open Science

Presented by Ileana Fenwick

CC BY Openscapes

Artwork by Allison Horst

Last updated 2021-06-03

3 of 19

Outline

  • Data management strategies
    • Data organization in spreadsheets
    • Spreadsheet drama
    • Good enough practices for scientific computing
  • Data analysis strategies
    • Tidy data

4 of 19

Data organization in spreadsheets

By Broman & Woo, 2018

In the “Practical Data Science for Stats” collection in PeerJ & The American Statistician

“Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades. … Amid this debate, spreadsheets have continued to play a significant role in researchers' workflows. The dangers of spreadsheets are real, however – so much so that the European Spreadsheet Risks Interest Group keeps a public archive of spreadsheet ‘horror stories’…”

Broman & Woo share practical tips to make spreadsheets less error-prone, easier for computers to process, and easier to share (reportedly the 3rd most downloaded stats paper, per Twitter).

5 of 19

Data organization in spreadsheets

  1. Be consistent
  2. Write dates like YYYY-MM-DD
  3. Don't leave any cells empty
  4. Put just one thing in a cell
  5. Organize the data as a single rectangle (“Tidy data”)
  6. Create a data dictionary
  7. Don't include calculations in the raw data files
  8. Don't use font color or highlighting as data
  9. Choose good names for things
  10. Make backups
  11. Use data validation to avoid data entry errors
  12. Save the data in plain text files
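A minimal sketch in R of several of these tips in practice (the column names and values are hypothetical): enter dates as YYYY-MM-DD, keep one value per cell, leave no cells empty, and save as plain text.

  # R sketch: a tiny made-up dataset entered following the tips above
  library(tibble)
  library(readr)

  surveys <- tibble(
    site_id = c("A-01", "A-01", "B-02"),                     # one identifier per cell
    date    = c("2021-05-01", "2021-05-08", "2021-05-08"),   # dates written as YYYY-MM-DD
    species = c("kelp", "kelp", "urchin"),
    count   = c(12, 9, 4)                                    # no empty cells
  )

  write_csv(surveys, "surveys_raw.csv")                      # save as a plain-text file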


6 of 19

Spreadsheet drama

“Excel is where I learned I loved data analysis” - Jenny Bryan

Discussion covers:

  • Real power and real danger of Excel
  • Separating data from analysis, incremental steps
  • Early development of tidyverse R packages and Shiny
  • Teaching data science with GitHub

Five years later, this is still a truly impactful conversation: hearing experts *talk* through how they approach data and analysis.

By Parker, Peng, & Bryan 2016

Not So Standard Deviations Podcast Episode 9, special guest Jenny Bryan

7 of 19

Good enough practices in scientific computing

Main data management recommendations:

  1. Work towards ready-to-analyze data incrementally, documenting both the intermediate data & the process.
  2. "Tidy data" can be a powerful accelerator for analysis.

Software: write, organize, and share scripts and programs used in an analysis.

Collaboration: make it easy for existing and new collaborators to understand & contribute to a project.

Project organization: organize the digital artifacts of a project to ease discovery & understanding.

Tracking changes: record how various components of your project change over time.

Manuscripts: write manuscripts in a way that leaves an audit trail & minimizes manual merging of conflicts.


8 of 19

Good enough practices in scientific computing

Box 1: Data management

  • Save the raw data.
  • Ensure that raw data are backed up in more than one location.
  • Create the data you wish to see in the world.
  • Create analysis-friendly data.
  • Record all the steps used to process data.
  • Anticipate the need to use multiple tables, & use a unique identifier for every record.
  • Submit data to a reputable DOI-issuing repository so that others can access & cite.
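To make a few of these recommendations concrete, here is a minimal R sketch (file paths and column names are hypothetical) in which the raw file is never edited by hand, every processing step is recorded in code, and an analysis-friendly copy is written out separately:

  # R sketch: keep raw data raw; record all processing steps in a script
  library(readr)
  library(dplyr)

  raw <- read_csv("data/raw/surveys_raw.csv")          # raw data, never edited by hand

  clean <- raw %>%
    filter(!is.na(count)) %>%                          # documented cleaning step
    mutate(count = as.integer(count))                  # documented type conversion

  write_csv(clean, "data/derived/surveys_clean.csv")   # analysis-friendly copy, kept separate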

9 of 19

Tidy data for efficiency, reproducibility, & collaboration

First, a Preamble:

  • Tidy data is a philosophy
  • Raw data may not be stored in a tidy way
  • “Wrangling” data into tidy structure should be done programmatically as part of the analytical process - keep the raw data raw
  • There are existing tools to help tidy data

By Lowndes & Horst 2020

An illustrated series to tell a story about the power of tidy data

Tidying data (“data wrangling”) can take up to 50–80% of a data scientist’s time (Lohr 2014, New York Times).

10–12 of 19

(Illustrated slides from the Lowndes & Horst tidy data series)

13 of 19

Tidy data for easier collaboration!

Whether collaborators are current teammates, Future You, or Future Us, organizing and sharing data in a consistent and predictable way means less adjustment, time, and effort for all.

Tidy data for reproducibility and reuse

By using tools that all expect tidy data as inputs, you can build and iterate powerful workflows that are easier to understand, update, and reuse.

14 of 19

Tools exist to help you tidy data

tidyr::separate()

tidyr::pivot_longer()

Also tidyr::pivot_wider(); pivot_longer() and pivot_wider() are the modern forms of gather() and spread().
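As a hedged sketch of how these functions are used (the dataset and column names are invented for illustration), a wide table with a combined site-plot column can be tidied like this:

  # R sketch: tidy a small made-up wide table with tidyr
  library(tibble)
  library(tidyr)

  wide <- tibble(
    site_plot = c("A-01", "B-02"),
    `2020`    = c(12, 4),
    `2021`    = c(15, 7)
  )

  separated <- separate(wide, site_plot, into = c("site", "plot"), sep = "-")  # one thing per cell
  tidy <- pivot_longer(separated, cols = c(`2020`, `2021`),
                       names_to = "year", values_to = "count")                 # one row per observation

  # pivot_wider() reverses this, spreading `year` back out into columns.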

15 of 19

Tidy data for the win!

Once you are empowered to work with tidy data, working with other datasets feels more approachable too. This transferable confidence and ability to collaborate might be the best thing about tidy data.

16 of 19

Learn more about tidy data:

Wickham (2014). Tidy Data. Journal of Statistical Software. http://jstatsoft.org/v59/i10

“Informal and code-heavy version”: https://tidyr.tidyverse.org/articles/tidy-data

Grolemund & Wickham (2016). R for Data Science, Chapter 12. https://r4ds.had.co.nz

17 of 19

Additional slide decks

Metadata

18 of 19

Questions?

19 of 19

Data. The good, the bad and the ugly. Creating our community!

Questions for Discussion

  • Has there been an opportunity you’ve been able to participate in that helped you grow your skills? If so, which one?
  • What has been your worst variable name? Or file name?
  • Do you have a data horror story? Spill it!
  • What skills do you want to learn? And why?
  • Is there somewhere or with someone you feel safe asking questions about your data? Why?
  • When do you feel most confident in handling data? Why?
  • What barriers do you see the most to open science?