Data Organisation:
Making Your Research Life Easier
Peter Hickey
@PeteHaitch
Single Cell Open Research Endeavour (SCORE),
Walter and Eliza Hall Institute of Medical Research
A bit about me
Biology
Statistics
Bioinformatics
A bit about me
12 life-changing tips to up your analysis game (Buzzfeed version)
12 practical recommendations for organising spreadsheet data in a way that both humans and computer programs can read (academic version)
Overview
Broman, K. W. & Woo, K. H. Data Organization in Spreadsheets. Am. Stat. 72, 2–10 (2018).
Why spreadsheets are great
Why spreadsheets are terrible
The dangers of spreadsheets are real
Panko, R. (2008), “What We Know About Spreadsheet Errors,” available at http://panko.shidler.hawaii.edu/SSR/Mypapers/whatknow.htm
1. Be consistent
The first rule of data organization is be consistent. Whatever you
do, do it consistently.
Use consistent codes for categorical variables
Bad
Good
Subject | Batting |
Ricky Ponting | Right-handed |
Don Bradman | Right handed |
Brian Lara | Left |
Sachin Tendulkar | R |
Subject | Batting |
Ricky Ponting | Right |
Don Bradman | Right |
Brian Lara | Left |
Sachin Tendulkar | Right |
Use a consistent fixed code for any missing values
Bad
Good
Subject | Test stumpings |
Jason Dunstall | - |
Adam Gilchrist | 37 |
Ricard Chee Quee | 0 |
Ozzy Osbourne | NA (not a cricketer) |
Subject | Test stumpings | Note |
Jason Dunstall | NA | Not a cricketer |
Adam Gilchrist | 37 | NA |
Richard Chee Quee | NA | No test matches |
Ozzy Osbourne | NA | Not a cricketer |
Use consistent variable names
Bad
Sheet 1
Sheet 2
Subject | blood_pressure_week1 |
Jason Dunstall | 120/80 |
Adam Gilchrist | 139/89 |
Subject | BP_week1 |
Michael Di Venuto | 140/90 |
Jamie Cox | 90/60 |
Use consistent subject identifiers
Sheet 1
Sheet 2
Bad
Subject | blood_pressure_week1 |
Jason Dunstall | 120/80 |
Adam Gilchrist | 139/89 |
Subject | blood_pressure_week10 |
The Chief | 140/90 |
Gilly | 90/60 |
Use a consistent data layout in multiple files
Bad
Subject | Test stumpings | Bats eaten |
Jason Dunstall | NA | 0 |
Adam Gilchrist | 37 | 0 |
Richard Chee Quee | NA | 0 |
Subject | Bats eaten | Test stumpings |
Ozzy Osbourne | 1 | NA |
Bert Oldfield | 0 | 52 |
Michelle Payne | 0 | NA |
Use consistent file names
Bad
Serum_batch1_2015-01-30.csv
batch2_serum_52915.csv
Good
Serum_batch1_2015-01-30.csv
Serum_batch2_2015-05-29.csv
Be careful about extra spaces within cells
"male" is different from "male "
These can be really difficult to spot
2. Choose Good Names for Things
Aim for short, but meaningful
Avoid spaces in file names and variable names!
Use underscores (or perhaps hyphens)
Bad
fascicle length
Good
fascicle_length
But remember rule 1
Bad
date_of-injury
Good
date_of_injury
General tips about naming things
General tips about naming things
3. Write Dates as YYYY-MM-DD
E.g., 2019-02-08
This is actually the international standard.
Also, never work with Americans.
4. No Empty Cells
Fill in all cells.
Use some common code for missing data.
Make it clear that the data are known to be missing rather than unintentionally left blank
Bad
Good
2015-06-14
2015-06-18
2015-06-18
2015-06-20
5. Put Just One Thing in a Cell
The cells in your spreadsheet should each contain one piece of
data. Do not put more than one thing in a cell.
Put just one thing in a cell
Bad
Good
Subject | Leg_injured |
Sally Pearson | right_injured |
Sally Pearson | left_uninjured |
Subject | Leg | Injured |
Sally Pearson | right | TRUE |
Sally Pearson | left | FALSE |
Put the units in the column name (or in a data dictionary)
Bad
Good
Ball | Weight |
Golf ball | 45 g |
Bowling ball | 7.2 kg |
Ball | Weight_g |
Golf_ball | 45 |
Sally Pearson | 7200 |
6. Make it a Rectangle
The best layout for your data within a spreadsheet is as a single
big rectangle with rows corresponding to subjects and columns
corresponding to variables.
Make it a Rectangle
Bad
Good
7. Create a Data Dictionary
A data dictionary is a separate file that explains what all the variables are.
Things you might put in a data dictionary
8. No Calculations in the Raw Data Files
Your primary data file should contain just the data and nothing else: no calculations, no graphs
No Calculations in the Raw Data Files
9. Do Not Use Font Color or Highlighting as Data
Instead, just add an extra column
Bad
Good
Why?
10. Make Backups
Make regular backups of your data.
In multiple locations.
Making backups
Keeping your raw data safe
11. Use Data Validation to Avoid Errors
And make data entry as error-free and repetitive-stress-injury-free as possible
Using data validation
12. Save the data in Plain Text Files
Excel is great for data entry, but .xls and .xlsx files aren't so analysis software-friendly.
Prefer comma-delimited (CSV) files, instead.
Bonus: What if my data already exist and aren't in this tidy format?
Don't worry! Focus on the future.
Focus on the future
Summary
Links
Slides
Article
Broman, K. W. & Woo, K. H. Data Organization in Spreadsheets. Am. Stat. 72, 2–10 (2018)
https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375989