Data strategies for Future Us
Pathways to Open Science
Presented by Ileana Fenwick
CC BY Openscapes
Artwork by Allison Horst
Last updated 2021-06-03
Outline
Data organization in spreadsheets
In “Practical Data Science for Stats” collection in PeerJ & American Statistician
“Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades. …
Amid this debate, spreadsheets have continued to play a significant role in researchers' workflows.
The dangers of spreadsheets are real, however – so much so that the European Spreadsheet Risks Interest Group keeps a public archive of spreadsheet ‘horror stories’ …”
Broman & Woo share practical tips to make spreadsheets less error-prone, easier for computers to process, and easier to share. (3rd most downloaded stats paper, per Twitter)
Data organization in spreadsheets
Basic Principles
Spreadsheet drama
“Excel is where I learned I loved data analysis” - Jenny Bryan
Discussion covers:
5 years later, still a truly impactful conversation: to hear experts *talk* about data/analysis
Not So Standard Deviations Podcast Episode 9, special guest Jenny Bryan
Good enough practices in scientific computing
Data management main recs:
In PLoS Computational Biology, following Wilson et al. 2014: Best practices for scientific computing
Software: write, organize, and share scripts and programs used in an analysis.
Collaboration: make it easy for existing and new collaborators to understand & contribute to a project.
Project organization: organize the digital artifacts of a project to ease discovery & understanding.
Tracking changes: record how various components of your project change over time.
Manuscripts: write manuscripts in a way that leaves an audit trail & minimizes manual merging of conflicts.
Also:
Good enough practices in scientific computing
Box 1: Data management
Tidy data for efficiency, reproducibility, & collaboration
First, a Preamble:
Tidying data (“data wrangling”) can take up to 50–80% of a data scientist’s time (Lohr 2014, New York Times)
Lowndes & Horst 2020, Tidy data for reproducibility, efficiency, and collaboration
Tidy data for easier collaboration!
Whether collaborators are current teammates, Future You, or Future Us, organizing and sharing data in a consistent and predictable way means less adjustment, time, and effort for all.
Tidy data for reproducibility and reuse
By using tools that all expect tidy data as inputs, you can build and iterate powerful workflows that are easier to understand, update, and reuse.
Tools exist to help you tidy data
tidyr::separate()
tidyr::pivot_longer()
Also tidyr::pivot_wider(). pivot_longer() and pivot_wider() are the modern forms of gather() and spread()
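A minimal sketch of how these three functions might fit together. The data frame `wide` and its columns (`site`, `lobster_2019`, `lobster_2020`) are made up for illustration and do not come from the talk; it assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical untidy table: one row per site, counts spread across columns
wide <- data.frame(
  site = c("A", "B"),
  lobster_2019 = c(4, 7),
  lobster_2020 = c(5, 9),
  stringsAsFactors = FALSE
)

# pivot_longer(): stack the count columns into name/value pairs
long <- pivot_longer(
  wide,
  cols = -site,
  names_to = "species_year",
  values_to = "count"
)

# separate(): split the combined column into two variables
tidy <- separate(
  long,
  col = species_year,
  into = c("species", "year"),
  sep = "_"
)

# pivot_wider(): reshape the other way, one column per year
wide_again <- pivot_wider(tidy, names_from = year, values_from = count)
```

Here `tidy` has one row per site, species, and year, with each variable in its own column. Note that `year` stays a character column unless it is converted afterwards.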
Tidy data for the win!
Once empowered to work with tidy data generally, working with other datasets feels more approachable too. This transferable confidence and ability to collaborate might be the best thing about tidy data.
Learn more about tidy data:
Wickham (2014). Tidy Data. Journal of Statistical Software. http://jstatsoft.org/v59/i10
“Informal and code-heavy version”: https://tidyr.tidyverse.org/articles/tidy-data
Wickham & Grolemund (2016). R for Data Science: Ch 12 https://r4ds.had.co.nz
Questions?
Data. The good, the bad and the ugly. Creating our community!
Questions for Discussion