1 of 19

2 of 19

Data strategies for Future Us

Pathways to Open Science

Presented by Ileana Fenwick

CC BY Openscapes

Artwork by Allison Horst

Last updated 2021-06-03

3 of 19

Outline

  • Data management strategies
    • Data organization in spreadsheets
    • Spreadsheet drama
    • Good enough practices for scientific computing
  • Data analysis strategies
    • Tidy data

4 of 19

Data organization in spreadsheets

By Broman & Woo, 2018

In the “Practical Data Science for Stats” collection in PeerJ & The American Statistician

“Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades. … Amid this debate, spreadsheets have continued to play a significant role in researchers' workflows. The dangers of spreadsheets are real, however – so much so that the European Spreadsheet Risks Interest Group keeps a public archive of spreadsheet ‘horror stories’…”

Broman & Woo share practical tips to make spreadsheets less error-prone, easier for computers to process, and easier to share (reportedly the 3rd most downloaded stats paper, per Twitter).

5 of 19

Data organization in spreadsheets

  1. Be consistent
  2. Write dates like YYYY-MM-DD
  3. Don't leave any cells empty
  4. Put just one thing in a cell
  5. Organize the data as a single rectangle (“Tidy data”)
  6. Create a data dictionary
  7. Don't include calculations in the raw data files
  8. Don't use font color or highlighting as data
  9. Choose good names for things
  10. Make backups
  11. Use data validation to avoid data entry errors
  12. Save the data in plain text files
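A minimal sketch in R of several of these tips in practice (the column names and values are hypothetical): enter dates as YYYY-MM-DD, keep one value per cell, leave no cells empty, and save as plain text.

  # R sketch: a tiny made-up dataset entered following the tips above
  library(tibble)
  library(readr)

  surveys <- tibble(
    site_id = c("A-01", "A-01", "B-02"),                     # one identifier per cell
    date    = c("2021-05-01", "2021-05-08", "2021-05-08"),   # dates written as YYYY-MM-DD
    species = c("kelp", "kelp", "urchin"),
    count   = c(12, 9, 4)                                    # no empty cells
  )

  write_csv(surveys, "surveys_raw.csv")                      # save as a plain-text file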


6 of 19

Spreadsheet drama

“Excel is where I learned I loved data analysis” - Jenny Bryan

Discussion covers:

  • Real power and real danger of Excel
  • Separating data from analysis, incremental steps
  • Early development of tidyverse R packages and Shiny
  • Teaching data science with GitHub

Five years later, this is still a truly impactful conversation: hearing experts *talk* through how they approach data and analysis.

By Parker, Peng, & Bryan 2016

Not So Standard Deviations Podcast Episode 9, special guest Jenny Bryan

7 of 19

Good enough practices in scientific computing

Main data management recommendations:

  1. Work towards ready-to-analyze data incrementally, documenting both the intermediate data & the process.
  2. "Tidy data" can be a powerful accelerator for analysis.

Software: write, organize, and share scripts and programs used in an analysis.

Collaboration: make it easy for existing and new collaborators to understand & contribute to a project.

Project organization: organize the digital artifacts of a project to ease discovery & understanding.

Tracking changes: record how various components of your project change over time.

Manuscripts: write manuscripts in a way that leaves an audit trail & minimizes manual merging of conflicts.


8 of 19

Good enough practices in scientific computing

Box 1: Data management

  • Save the raw data.
  • Ensure that raw data are backed up in more than one location.
  • Create the data you wish to see in the world.
  • Create analysis-friendly data.
  • Record all the steps used to process data.
  • Anticipate the need to use multiple tables, & use a unique identifier for every record.
  • Submit data to a reputable DOI-issuing repository so that others can access & cite.
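To make a few of these recommendations concrete, here is a minimal R sketch (file paths and column names are hypothetical) in which the raw file is never edited by hand, every processing step is recorded in code, and an analysis-friendly copy is written out separately:

  # R sketch: keep raw data raw; record all processing steps in a script
  library(readr)
  library(dplyr)

  raw <- read_csv("data/raw/surveys_raw.csv")          # raw data, never edited by hand

  clean <- raw %>%
    filter(!is.na(count)) %>%                          # documented cleaning step
    mutate(count = as.integer(count))                  # documented type conversion

  write_csv(clean, "data/derived/surveys_clean.csv")   # analysis-friendly copy, kept separate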

9 of 19

Tidy data for efficiency, reproducibility, & collaboration

First, a Preamble:

  • Tidy data is a philosophy
  • Raw data may not be stored in a tidy way
  • “Wrangling” data into tidy structure should be done programmatically as part of the analytical process - keep the raw data raw
  • There are existing tools to help tidy data

By Lowndes & Horst 2020

An illustrated series to tell a story about the power of tidy data

Tidying data (“data wrangling”) can take up to 50–80% of a data scientist’s time (Lohr 2014, New York Times).

10–12 of 19

(Illustrated slides from the Lowndes & Horst tidy data series)

13 of 19

Tidy data for easier collaboration!

Whether collaborators are current teammates, Future You, or Future Us, organizing and sharing data in a consistent and predictable way means less adjustment, time, and effort for all.

Tidy data for reproducibility and reuse

By using tools that all expect tidy data as inputs, you can build and iterate powerful workflows that are easier to understand, update, and reuse.

14 of 19

Tools exist to help you tidy data

tidyr::separate()

tidyr::pivot_longer()

Also tidyr::pivot_wider(); pivot_longer() and pivot_wider() are the modern forms of gather() and spread().
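As a hedged sketch of how these functions are used (the dataset and column names are invented for illustration), a wide table with a combined site-plot column can be tidied like this:

  # R sketch: tidy a small made-up wide table with tidyr
  library(tibble)
  library(tidyr)

  wide <- tibble(
    site_plot = c("A-01", "B-02"),
    `2020`    = c(12, 4),
    `2021`    = c(15, 7)
  )

  separated <- separate(wide, site_plot, into = c("site", "plot"), sep = "-")  # one thing per cell
  tidy <- pivot_longer(separated, cols = c(`2020`, `2021`),
                       names_to = "year", values_to = "count")                 # one row per observation

  # pivot_wider() reverses this, spreading `year` back out into columns.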

15 of 19

Tidy data for the win!

Once you are empowered to work with tidy data, working with other datasets feels more approachable too. This transferable confidence and ability to collaborate might be the best thing about tidy data.

16 of 19

Learn more about tidy data:

Wickham (2014). Tidy Data. Journal of Statistical Software. http://jstatsoft.org/v59/i10

“Informal and code-heavy version”: https://tidyr.tidyverse.org/articles/tidy-data

Grolemund & Wickham (2016). R for Data Science, Chapter 12. https://r4ds.had.co.nz

17 of 19

Additional slide decks

Metadata

18 of 19

Questions?

19 of 19

Data. The good, the bad and the ugly. Creating our community!

Questions for Discussion

  • Has there been an opportunity you’ve been able to participate in that helped you grow your skills? If so, which one?
  • What has been your worst variable name? Or file name?
  • Do you have a data horror story? Spill it!
  • What skills do you want to learn? And why?
  • Is there somewhere or with someone you feel safe asking questions about your data? Why?
  • When do you feel most confident in handling data? Why?
  • What barriers do you see the most to open science?