1 of 51

Data Organisation:

Making Your Research Life Easier

Peter Hickey

@PeteHaitch

Single Cell Open Research Endeavour (SCORE),

Walter and Eliza Hall Institute of Medical Research

2 of 51

A bit about me

  • I'm a statistician, mostly working in bioinformatics

3 of 51

Biology

Statistics

Bioinformatics

4 of 51

A bit about me

  • I'm a statistician, mostly working in bioinformatics
  • I love analysing data
  • BSc (Honours) and PhD in Statistics
  • Currently working at the Walter and Eliza Hall Institute of Medical Research
  • One of the things I do there is teach R and statistics

5 of 51

6 of 51

12 life-changing tips to up your analysis game (Buzzfeed version)

12 practical recommendations for organising spreadsheet data in a way that both humans and computer programs can read (academic version)

7 of 51

Overview

  • This talk is based on Broman and Woo (2018)
  • By following this advice, you will create spreadsheets that are:
    • Less error-prone
    • Easier for computers to process
    • Easier to share with collaborators and the public

Broman, K. W. & Woo, K. H. Data Organization in Spreadsheets. Am. Stat. 72, 2–10 (2018).

8 of 51

Why spreadsheets are great

  • Anyone with a computer can open them
  • Everyone can use Excel a bit
  • Spreadsheets can be used for data entry, storage, analysis, and visualization

9 of 51

Why spreadsheets are terrible

  • Anyone with a computer can open them
  • Everyone can use Excel a bit
  • Spreadsheets can be used for data entry, storage, analysis, and visualization

10 of 51

The dangers of spreadsheets are real

  • Popular spreadsheet programs (e.g., Excel) make certain types of errors easy to commit and difficult to rectify
  • In 13 audits of real-world spreadsheets, an average of 88% contained errors (Panko 2008)
  • Risk reduction beats abstinence

Panko, R. (2008), “What We Know About Spreadsheet Errors,” available at http://panko.shidler.hawaii.edu/SSR/Mypapers/whatknow.htm

11 of 51

1. Be consistent

The first rule of data organization is be consistent. Whatever you

do, do it consistently.

12 of 51

Use consistent codes for categorical variables

Bad

Good

Subject

Batting

Ricky Ponting

Right-handed

Don Bradman

Right handed

Brian Lara

Left

Sachin Tendulkar

R

Subject

Batting

Ricky Ponting

Right

Don Bradman

Right

Brian Lara

Left

Sachin Tendulkar

Right

13 of 51

Use a consistent fixed code for any missing values

Bad

Good

Subject

Test stumpings

Jason Dunstall

-

Adam Gilchrist

37

Ricard Chee Quee

0

Ozzy Osbourne

NA (not a cricketer)

Subject

Test stumpings

Note

Jason Dunstall

NA

Not a cricketer

Adam Gilchrist

37

NA

Richard Chee Quee

NA

No test matches

Ozzy Osbourne

NA

Not a cricketer

14 of 51

Use consistent variable names

Bad

Sheet 1

Sheet 2

Subject

blood_pressure_week1

Jason Dunstall

120/80

Adam Gilchrist

139/89

Subject

BP_week1

Michael Di Venuto

140/90

Jamie Cox

90/60

15 of 51

Use consistent subject identifiers

Sheet 1

Sheet 2

Bad

Subject

blood_pressure_week1

Jason Dunstall

120/80

Adam Gilchrist

139/89

Subject

blood_pressure_week10

The Chief

140/90

Gilly

90/60

16 of 51

Use a consistent data layout in multiple files

Bad

Subject

Test stumpings

Bats eaten

Jason Dunstall

NA

0

Adam Gilchrist

37

0

Richard Chee Quee

NA

0

Subject

Bats eaten

Test stumpings

Ozzy Osbourne

1

NA

Bert Oldfield

0

52

Michelle Payne

0

NA

17 of 51

Use consistent file names

Bad

Serum_batch1_2015-01-30.csv

batch2_serum_52915.csv

Good

Serum_batch1_2015-01-30.csv

Serum_batch2_2015-05-29.csv

18 of 51

Be careful about extra spaces within cells

"male" is different from "male "

These can be really difficult to spot

19 of 51

2. Choose Good Names for Things

Aim for short, but meaningful

20 of 51

Avoid spaces in file names and variable names!

Use underscores (or perhaps hyphens)

Bad

fascicle length

Good

fascicle_length

21 of 51

But remember rule 1

Bad

date_of-injury

Good

date_of_injury

22 of 51

General tips about naming things

23 of 51

General tips about naming things

24 of 51

3. Write Dates as YYYY-MM-DD

E.g., 2019-02-08

This is actually the international standard.

Also, never work with Americans.

25 of 51

4. No Empty Cells

Fill in all cells.

Use some common code for missing data.

Make it clear that the data are known to be missing rather than unintentionally left blank

26 of 51

Bad

27 of 51

Good

2015-06-14

2015-06-18

2015-06-18

2015-06-20

28 of 51

5. Put Just One Thing in a Cell

The cells in your spreadsheet should each contain one piece of

data. Do not put more than one thing in a cell.

29 of 51

Put just one thing in a cell

Bad

Good

Subject

Leg_injured

Sally Pearson

right_injured

Sally Pearson

left_uninjured

Subject

Leg

Injured

Sally Pearson

right

TRUE

Sally Pearson

left

FALSE

30 of 51

Put the units in the column name (or in a data dictionary)

Bad

Good

Ball

Weight

Golf ball

45 g

Bowling ball

7.2 kg

Ball

Weight_g

Golf_ball

45

Sally Pearson

7200

31 of 51

6. Make it a Rectangle

The best layout for your data within a spreadsheet is as a single

big rectangle with rows corresponding to subjects and columns

corresponding to variables.

32 of 51

Make it a Rectangle

  • First row should contain variable names
    • Avoid using more than one row for the variable names
    • Start in cell A1 not the first place you happen to plonk on down
  • Data sets that don't fit nicely into a single rectangle will usually fit into a set of rectangles
    • Personal preference: one file per rectangle (but you could use multiple sheets within a workbook)
  • If it really doesn't fit into a rectangle or series of rectangles, a spreadsheet probably isn't the best format!

33 of 51

Bad

Good

34 of 51

7. Create a Data Dictionary

A data dictionary is a separate file that explains what all the variables are.

35 of 51

Things you might put in a data dictionary

  • The exact variable name as in the data file
  • A version of the variable name that might be used in data visualisations
  • A longer explanation of what the variable means
  • The measurement units
  • Expected minimum and maximum values

36 of 51

8. No Calculations in the Raw Data Files

Your primary data file should contain just the data and nothing else: no calculations, no graphs

37 of 51

No Calculations in the Raw Data Files

  • The raw data is precious
    • You spent weeks, perhaps even months or years, collecting it
  • Doing calculations in the data file means you are regularly opening and typing into it
    • Each time you risk accidentally modifying or destroying your data
  • Your primary data file should be a pristine store of data
    • Write-protect it, back it up, and do not touch it.
  • If you want to do some analyses in Excel, make a copy of the file and do your calculations and graphs in the copy.

38 of 51

9. Do Not Use Font Color or Highlighting as Data

39 of 51

40 of 51

Instead, just add an extra column

Bad

Good

41 of 51

Why?

  • The highlighting is nice visually, but it is hard to extract that information for use in the later analysis
  • You might forget what the colour represents
  • Analysis software can't read colour highlights

42 of 51

10. Make Backups

Make regular backups of your data.

In multiple locations.

43 of 51

Making backups

  • Keep all versions of the data files, so that if something gets corrupted (e.g., you accidentally type over some of the data and do not notice it until much later), you will be able to go back and fix it.
    • Before you start inserting more data, make a copy of the file with a new version number: file_v1.xlsx, file_v2.xlsx, ...
  • Where to keep backups
    • On an external hard drive (that you don't keep in the same location as your computer!)
    • Online (e.g., Dropbox)
    • (Be sure to check your ethics/IRB guidelines)

44 of 51

Keeping your raw data safe

  • Learn how to write-protect your files so you can't accidentally edit them
    • On a Mac, right-click on the file in Finder and select “Get Info.” In the menu that opens, there is a section at the bottom on “Sharing & Permissions.” Click on “Privilege” for yourself and select “Read only.”
    • In Windows, right-click on the file in Windows Explorer and select “Properties.” In the “General” tab, there is a section at the bottom with “Attributes.” Select the box for “Read-only” and click the “OK” button.

45 of 51

11. Use Data Validation to Avoid Errors

And make data entry as error-free and repetitive-stress-injury-free as possible

46 of 51

Using data validation

  • With data validation, you can control what can be entered into a cell
    • E.g., ensure that a subject's weight is entered is always entered as a number (45) and not use a number with a unit (45 kg)
  • Tutorial at http://bit.ly/excel_dataval

47 of 51

12. Save the data in Plain Text Files

Excel is great for data entry, but .xls and .xlsx files aren't so analysis software-friendly.

Prefer comma-delimited (CSV) files, instead.

48 of 51

Bonus: What if my data already exist and aren't in this tidy format?

Don't worry! Focus on the future.

49 of 51

Focus on the future

  • Focus primarily on adopting these principles for future projects.
  • It's best not to try use copy-and-paste to rearrange existing files
    • You run the very real risk of introducing errors!
  • Data rearrangement is best accomplished via code (such as with R or Python) so you never lose the record of what you did to the data
    • But that's a whole nother workshop

50 of 51

Summary

  • Spreadsheets are valuable tools for entering, organising, and storing data.
  • They can also be used for calculations, analysis, and visualisation
    • That's a good thing!
    • But separate this out from the raw data.
  • By following these tips you will:
    • Keep your primary data files pristine and data-only.
    • Reduce errors during data entry.
    • Simplify later analysis by creating tidy spreadsheets with clear and consistent names and formatting.

51 of 51

Links

Slides

www.peterhickey.org/talk

Article

Broman, K. W. & Woo, K. H. Data Organization in Spreadsheets. Am. Stat. 72, 2–10 (2018)

https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375989