1 of 60

EEMB Data Management Workshop

Spring 2024

2 of 60

Slides: bit.ly/4aRWcbo

3 of 60

From our survey, topics in order of interest are:

  1. Data processing
  2. Data storage
  3. Data reproducibility
  4. Data redundancy
  5. Field data collection

4 of 60

Workshop format

Sections:

Data storage and redundancy

Data processing

Data reproducibility

For each section:

A mini-lecture

Q&A

Plan your own!

At the end of this workshop, you will come away with a plan to get from data collection to processing to publishing!

5 of 60

Panel

Zoe: Nearly graduated ancient PhD candidate studying food webs and nutrient subsidies in California intertidal habitats.

Caitlin: PhD student in the Briggs Lab studying frog disease in California

An: thinking about communities on land and in the ocean

6 of 60

Data storage and redundancy

7 of 60

If your data exist in multiple forms, that’s redundancy!

Copies

Physical storage

Cloud storage

Primary copy and back-ups

8 of 60

Data copies - Caitlin

  • Primary (field or lab data sheets)
    • Paper! Old school! Make sure you can read it :)
    • Tablet or phone for Survey123
  • Backups (scans or photos - or both)

Tips for not losing the primary data sheets:

    • Have a designated spot for the primary data sheets in the field & at field station
      • Ex: always in this pocket of this backpack
    • Before we leave the site, check it is in the right spot!
      • Consider putting one person in charge of checking this if part of a team

Source: Tails from the Field

Caitlin’s field notebook scan, covered in mud

9 of 60

Data copies - lessons learned

  • Take photos of your field notebook throughout the day
    • Recommended: make photographing your notebook part of your end-of-site routine

  • If location services are on for your camera (and you have service), your photos will be geotagged
    • Story time! I wrote the same site code twice, but because I photographed my notebook at each site, I could pinpoint where I was when I wrote each one
    • I also use the photo time stamps if I forget to record the time

  • At the end of each trip: I back up all images to Google Photos in my UCSB account
    • In addition to field notebook pdf scans

10 of 60

Copies

Physical storage

Cloud storage

Primary copy and back-ups

On your computer, on an external drive

11 of 60

Multiple types of physical storage: if you lose one, you will always have another!

  • Physical data sheets in a binder in lab
    • Wherever they are, keep it consistent!
  • Scans of data sheets on computer and external drive
    • Scan with flatbed scanner connected to computer
    • Another option: scan to flash drive using the Noble Hall scanner!
  • Entered data in .csv on computer
    • Technically these .csv files are downloaded from Google Sheets… which we will talk about in a bit!
  • Save all files to external hard drive that lives at my house

12 of 60

Copies

Physical storage

Cloud storage

Primary copy and back-ups

On your computer, on an external drive

Dropbox, Google Drive, Box, etc.

13 of 60

Platforms and benefits

  • Dropbox: easiest transition from working at UCSB to elsewhere
  • Google Drive: easiest for collaborative data collection
  • Box: unlimited storage with UCSB

Other options: OneDrive, WeTransfer, etc.

14 of 60

Copies

Physical storage

Cloud storage

Primary copy and back-ups

On your computer, on an external drive

Dropbox, Google Drive, Box, etc.

15 of 60

Questions about data storage and redundancy?

16 of 60

Plan your own!

17 of 60

Data processing

18 of 60

19 of 60

Data processing and reproducibility

  • How can others reproduce my cleaning and processing steps?
  • Assumption: you have entered your data!

20 of 60

Design your data entry sheet to mimic your data sheet as closely as possible to minimize errors

21 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it
  • Clean it up
  • Visualize and summarize it

22 of 60

Read it into RStudio (from your computer)

If your data is in a .csv: read_csv from readr in the tidyverse

  • Package info

If your data is in a .xlsx: read_xlsx from readxl in the tidyverse

  • Package info
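A minimal sketch of reading a local file with readr; the tiny example file, its path, and its column names are invented for illustration:

```r
library(readr)  # part of the tidyverse

# Write a tiny example file so this sketch is self-contained;
# in practice you would point at your own entered-data file
path <- tempfile(fileext = ".csv")
writeLines(c("site,count", "north,3", "south,5"), path)

surveys <- read_csv(path, show_col_types = FALSE)
```

For .xlsx files, the analogous call is `readxl::read_xlsx("your_file.xlsx")` (path hypothetical).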

23 of 60

Read it into RStudio (from the cloud)

If you have your data in Google Sheets: googlesheets4

  • Reads data directly from Google Sheets to RStudio
  • Package info
  • Cheatsheet

If you have your data in Dropbox or Box: read in using repmis and the share link

  • Package info (and if using Dropbox, change a little bit of the url - directions to do that)
  • Very useful if you have one data file that you’re using for different projects!
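A sketch of both cloud options, assuming you substitute your own share links (the `<sheet-id>` and Dropbox URL below are placeholders, not real files):

```r
library(googlesheets4)

# For a public (link-shared) Google Sheet, skip authentication:
gs4_deauth()
surveys <- read_sheet("https://docs.google.com/spreadsheets/d/<sheet-id>")

# From Dropbox, repmis can read a .csv via the share link; change the
# link's trailing dl=0 to dl=1 so it points at the raw file
surveys <- repmis::source_data("https://www.dropbox.com/s/<id>/surveys.csv?dl=1")
```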

24 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it: dplyr, base R, skimr
  • Clean it up
  • Visualize and summarize it

25 of 60

Look at your data: what are the columns called?

26 of 60

Look at your data: what’s in the columns?

`str()` is another option!

27 of 60

Look at your data: what do the rows have in them?

28 of 60

Look at your data: what are some basic summary stats for my data?

29 of 60

Look at your data: what are some basic summary stats for my data?
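These first-look checks can be sketched in base R, using the built-in iris data as a stand-in for your entered data:

```r
# Using the built-in iris data as a stand-in for your entered data
names(iris)    # what are the columns called?
str(iris)      # what's in the columns? (types and first few values)
head(iris)     # what do the first few rows have in them?
summary(iris)  # basic summary stats for each column
# skimr::skim(iris) gives a richer summary, including missing values
```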

30 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it: dplyr, base R, skimr
  • Clean it up: dplyr, janitor
  • Visualize and summarize it

31 of 60

Useful functions to keep in mind

  • clean_names(): cleans up column names
  • mutate(): creates new columns, changes columns (very powerful when used with case_when())
  • select(): selects columns from a data frame
  • pivot_longer(): puts the data frame in “long format” (each row is an observation)
  • rename(): renames columns
  • filter(): filters data frame
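A sketch of these functions chained together on a hypothetical entry sheet (column names and values invented for illustration):

```r
library(dplyr)
library(tidyr)
library(janitor)

# A hypothetical wide entry sheet with messy column names
raw <- tibble(
  `Site Name` = c("north", "south"),
  `Count 2023` = c(10, 0),
  `Count 2024` = c(12, 7)
)

clean <- raw |>
  clean_names() |>                    # Site Name -> site_name, etc.
  rename(site = site_name) |>
  pivot_longer(starts_with("count"),  # one row per site-year observation
               names_to = "year", names_prefix = "count_",
               values_to = "count") |>
  mutate(year = as.integer(year),
         level = case_when(count > 5 ~ "high",
                           TRUE ~ "low")) |>
  filter(count > 0) |>                # drop zero counts
  select(site, year, count, level)
```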

32 of 60

33 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it: dplyr, base R, skimr
  • Clean it up: dplyr, janitor
  • Visualize and summarize it: ggplot2

34 of 60

Visualize it
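A minimal ggplot2 sketch, again using the built-in iris data as a stand-in:

```r
library(ggplot2)

# Quick first look at relationships and groups in the data
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)")

p  # printing the object draws the plot
```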

35 of 60

Questions about data processing?

36 of 60

Plan your own!

37 of 60

Data reproducibility

38 of 60

Data reproducibility

How can you make sure that someone (who isn’t you)...

can find your data and code?

will know what they’re looking at when they see it?

can reproduce your processing, visualizing, and analysis steps?

39 of 60

How can you make sure that someone can find your data and code?

Solution: use data-sharing platforms

40 of 60

Things to consider when choosing a platform:

Do you want to connect to a GitHub repository? Choose Zenodo or Figshare

Do you want to not pay money? Choose Zenodo, Figshare, or Environmental Data Initiative

Do you want someone to curate your submission? Choose Environmental Data Initiative

41 of 60

Resources for exploring data sharing platforms

Mislan et al. Elevating the status of code in ecology. https://doi.org/10.1016/j.tree.2015.11.006

More repository comparisons: https://zenodo.org/records/3946720

User comparison: https://evodify.com/free-research-repository/

42 of 60

Data reproducibility

How can you make sure that someone (who isn’t you)...

can find your data and code?

will know what they’re looking at when they see it?

can reproduce your processing, visualizing, and analysis steps?

43 of 60

How can you make sure that someone will know what they’re looking at when they see your data and code?

Solution: write a README file!

44 of 60

README files

  • Describe what is in a repository (data, files, etc.)
  • Usually in plain text format
  • On GitHub, in .md format
  • Useful sections:
    • General overview
    • Data and file overview
    • Sharing and accessing information
    • Methodological information
    • Data-specific information

Resource from Research Data Services at the library: README, please!

45 of 60

General information

  • What’s the project?
  • Who’s working on it?
  • What’s it about?

46 of 60

Data and file overview

  • What files are there?
  • Create a file tree of the repository using `tree` in the Terminal (in RStudio), then just copy/paste the output
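For example, in the Terminal (assuming `tree` is installed on your system; it isn't part of RStudio itself):

```shell
# From the repository root:
tree        # prints the file tree; copy/paste the output into the README
tree -L 2   # or limit the depth if the repository is deep
```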

47 of 60

Sharing and accessing information

Types of licenses:

  • Public domain
  • Attribution
  • Open database license

See more here

You decide what license you want to give your work: these are community standards, meaning that they are not necessarily legally binding, but everyone follows them.

48 of 60

Methodological information

  • Where did the data come from?
  • How did you collect the data?

49 of 60

Data-specific information

What types of data do you have?

Can get unwieldy if you have a lot of data files! You decide the resolution of describing this information (at the level of the columns/rows, or at the level of the file)

50 of 60

Data reproducibility

How can you make sure that someone (who isn’t you)...

can find your data and code?

will know what they’re looking at when they see it?

can reproduce your processing, visualizing, and analysis steps?

51 of 60

How can you make sure that someone can reproduce your processing, visualizing, and analysis steps?

Solution: review your code!

52 of 60

The 4 Rs of reproducible code standards

Is the code as Reported?

Do the methods and code match up?

Does the code Run?

Code should run without any modifications

Is the code Reliable?

Code actually does what you think it does

Are the results Reproducible?

Could someone get the same results you do?

53 of 60

Is the code as Reported?

Keep your thoughts organized

    • Outline your methods as you are working on your data cleaning and analysis!!
    • Keep a “journal” of data cleaning steps: packages, functions, etc.

Keep your code organized

    • Organize your code following the order of your methods (though this can change, so give yourself some flexibility to change too!)

54 of 60

Does the code Run?

  • Make sure the data is available
    • Solution: share data and code on data sharing platform
  • Make sure the packages/functions are available
    • `renv`: records your package versions in a lockfile (renv.lock) that others can use to recreate the same package environment
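A sketch of the typical renv workflow (these are run interactively in your project, not in an analysis script):

```r
# One-time setup in a project: renv starts tracking packages
# and writes their versions to a lockfile (renv.lock)
renv::init()

# After installing or updating packages, refresh the lockfile
renv::snapshot()

# A collaborator (or future you) reinstalls those exact versions
renv::restore()
```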

55 of 60

Is the code Reliable?

  • Double check that your transformation, summarizing, etc. steps are doing what you think they’re doing
    • Solution: run your processing steps on a small subset of your data that would be easy to double check on your own
    • An’s example: calculating each species’ proportion of total community biomass. This is hard to check across 11 years of data from 5 sites sampled 4 times a year; it is easier to pick out two sampling points, do the calculation, then double-check the numbers by hand
  • If you are developing a package of functions (or writing a bunch of your own functions): `testthat`
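A sketch of that subset check, with invented biomass numbers small enough to verify by hand:

```r
library(dplyr)

# Hypothetical biomass data, small enough to check by hand
biomass <- tibble(
  site    = c("A", "A", "B"),
  species = c("kelp", "urchin", "kelp"),
  mass_g  = c(30, 10, 50)
)

props <- biomass |>
  group_by(site) |>
  mutate(prop = mass_g / sum(mass_g)) |>  # proportion of site biomass
  ungroup()

# Site A totals 40 g, so kelp should be 30/40 = 0.75
stopifnot(isTRUE(all.equal(props$prop[1], 0.75)))
```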

56 of 60

Are the results Reproducible?

Make sure what you write matches up with the output of your code

    • One way: if the data have changed, re-run all analyses and double-check the statistics at the end of every draft version
    • Better solution: write papers in RStudio using Quarto or R Markdown, where you can insert values from your output directly into the text
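A sketch of the idea (the survey counts are invented; a code chunk computes the number once, and inline code drops it into the prose):

```r
# In a Quarto (.qmd) or R Markdown (.Rmd) document, compute results
# in a code chunk...
mean_count <- mean(c(10, 4, 12, 7))

# ...then reference them in the prose with inline code, `r mean_count`,
# so reported numbers always match the current data
```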

57 of 60

More resources:

58 of 60

Questions about data reproducibility?

59 of 60

Plan your own!

60 of 60

Thanks for coming!

Please fill out the survey for us to learn what we can improve on and what future iterations of this workshop should be like!

Survey link: https://bit.ly/4bM5Bmk