1 of 60

EEMB Data Management Workshop

Spring 2024

2 of 60

Slides: bit.ly/4aRWcbo

3 of 60

From our survey, topics in order of interest are:

  1. Data processing
  2. Data storage
  3. Data reproducibility
  4. Data redundancy
  5. Field data collection

4 of 60

Workshop format

Sections:

Data storage and redundancy

Data processing

Data reproducibility

For each section:

A mini-lecture

Q&A

Plan your own!

At the end of this workshop, you will come away with a plan to get from data collection to processing to publishing!

5 of 60

Panel

Zoe: Nearly graduated ancient PhD candidate studying food webs and nutrient subsidies in California intertidal habitats.

Caitlin: PhD student in the Briggs Lab studying frog disease in California

An: thinking about communities on land and in the ocean

6 of 60

Data storage and redundancy

7 of 60

If your data exist in multiple forms, that’s redundancy!

Copies

Physical storage

Cloud storage

Primary copy and back-ups

8 of 60

Data copies - Caitlin

  • Primary (field or lab data sheets)
    • Paper! Old school! Make sure you can read it :)
    • Tablet or phone for Survey123
  • Backups (scans or photos - or both)

Tips for not losing the primary data sheets:

    • Have a designated spot for the primary data sheets in the field & at field station
      • Ex: always in this pocket of this backpack
    • Before we leave the site, check it is in the right spot!
      • Consider putting one person in charge of checking this if part of a team

Source: Tails from the Field

Caitlin’s field notebook scan, covered in mud

9 of 60

Data copies - lessons learned

  • Take photos of your field notebook throughout the day
    • Recommended: make photographing your notebook part of your end-of-site routine

  • If location services are on for your camera (and you have service), your photos will be geotagged
    • Story time! I wrote the same site code twice, but because I photographed my notebook at each site, I could pinpoint where I was when I wrote each one
    • I also use the photo time stamps if I forget to record the time

  • At the end of each trip: I back up all images to Google Photos in my UCSB account
    • In addition to field notebook pdf scans

10 of 60

Copies

Physical storage

Cloud storage

Primary copy and back-ups

On your computer, on an external drive

11 of 60

Multiple types of physical storage: if you lose one, you will always have another!

  • Physical data sheets in a binder in lab
    • Wherever they are, keep it consistent!
  • Scans of data sheets on computer and external drive
    • Scan with flatbed scanner connected to computer
    • Another option: scan to flash drive using the Noble Hall scanner!
  • Entered data in .csv on computer
    • Technically these .csv files are downloaded from Google Sheets… which we will talk about in a bit!
  • Save all files to external hard drive that lives at my house

12 of 60

Copies

Physical storage

Cloud storage

Primary copy and back-ups

On your computer, on an external drive

Dropbox, Google Drive, Box, etc.

13 of 60

Platforms and benefits

  • Dropbox: easiest transition from working at UCSB to elsewhere
  • Google Drive: easiest for collaborative data collection
  • Box: unlimited storage with UCSB

Other options: OneDrive, WeTransfer, etc.

14 of 60

Copies

Physical storage

Cloud storage

Primary copy and back-ups

On your computer, on an external drive

Dropbox, Google Drive, Box, etc.

15 of 60

Questions about data storage and redundancy?

16 of 60

Plan your own!

17 of 60

Data processing

18 of 60

19 of 60

Data processing and reproducibility

  • How can others reproduce my cleaning and processing steps?
  • Assumption: you have entered your data!

20 of 60

Design your data entry sheet to mimic your data sheet as closely as possible to minimize errors

21 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it
  • Clean it up
  • Visualize and summarize it

22 of 60

Read it into RStudio (from your computer)

If your data is in a .csv: read_csv from readr in the tidyverse

  • Package info

If your data is in a .xlsx: read_xlsx from readxl in the tidyverse

  • Package info
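A minimal sketch of reading a local file with readr; the tiny example file, its path, and its column names are invented for illustration:

```r
library(readr)  # part of the tidyverse

# Write a tiny example file so this sketch is self-contained;
# in practice you would point at your own entered-data file
path <- tempfile(fileext = ".csv")
writeLines(c("site,count", "north,3", "south,5"), path)

surveys <- read_csv(path, show_col_types = FALSE)
```

For .xlsx files, the analogous call is `readxl::read_xlsx("your_file.xlsx")` (path hypothetical).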

23 of 60

Read it into RStudio (from the cloud)

If you have your data in Google Sheets: googlesheets4

  • Reads data directly from Google Sheets to RStudio
  • Package info
  • Cheatsheet

If you have your data in Dropbox or Box: read in using repmis and the share link

  • Package info (and if using Dropbox, change a little bit of the url - directions to do that)
  • Very useful if you have one data file that you’re using for different projects!
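A sketch of both cloud options, assuming you substitute your own share links (the `<sheet-id>` and Dropbox URL below are placeholders, not real files):

```r
library(googlesheets4)

# For a public (link-shared) Google Sheet, skip authentication:
gs4_deauth()
surveys <- read_sheet("https://docs.google.com/spreadsheets/d/<sheet-id>")

# From Dropbox, repmis can read a .csv via the share link; change the
# link's trailing dl=0 to dl=1 so it points at the raw file
surveys <- repmis::source_data("https://www.dropbox.com/s/<id>/surveys.csv?dl=1")
```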

24 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it: dplyr, base R, skimr
  • Clean it up
  • Visualize and summarize it

25 of 60

Look at your data: what are the columns called?

26 of 60

Look at your data: what’s in the columns?

`str()` is another option!

27 of 60

Look at your data: what do the rows have in them?

28 of 60

Look at your data: what are some basic summary stats for my data?

29 of 60

Look at your data: what are some basic summary stats for my data?
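These first-look checks can be sketched in base R, using the built-in iris data as a stand-in for your entered data:

```r
# Using the built-in iris data as a stand-in for your entered data
names(iris)    # what are the columns called?
str(iris)      # what's in the columns? (types and first few values)
head(iris)     # what do the first few rows have in them?
summary(iris)  # basic summary stats for each column
# skimr::skim(iris) gives a richer summary, including missing values
```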

30 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it: dplyr, base R, skimr
  • Clean it up: dplyr, janitor
  • Visualize and summarize it

31 of 60

Useful functions to keep in mind

  • clean_names(): cleans up column names
  • mutate(): creates new columns, changes columns (very powerful when used with case_when())
  • select(): selects columns from a data frame
  • pivot_longer(): puts the data frame in “long format” (each row is an observation)
  • rename(): renames columns
  • filter(): filters data frame
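A sketch of these functions chained together on a hypothetical entry sheet (column names and values invented for illustration):

```r
library(dplyr)
library(tidyr)
library(janitor)

# A hypothetical wide entry sheet with messy column names
raw <- tibble(
  `Site Name` = c("north", "south"),
  `Count 2023` = c(10, 0),
  `Count 2024` = c(12, 7)
)

clean <- raw |>
  clean_names() |>                    # Site Name -> site_name, etc.
  rename(site = site_name) |>
  pivot_longer(starts_with("count"),  # one row per site-year observation
               names_to = "year", names_prefix = "count_",
               values_to = "count") |>
  mutate(year = as.integer(year),
         level = case_when(count > 5 ~ "high",
                           TRUE ~ "low")) |>
  filter(count > 0) |>                # drop zero counts
  select(site, year, count, level)
```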

32 of 60

33 of 60

Operations you might have to do once you’ve entered your data… and the packages to use!

  • Read it into RStudio: tidyverse, googlesheets4, repmis
  • Look at it: dplyr, base R, skimr
  • Clean it up: dplyr, janitor
  • Visualize and summarize it: ggplot2

34 of 60

Visualize it
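A minimal ggplot2 sketch, again using the built-in iris data as a stand-in:

```r
library(ggplot2)

# Quick first look at relationships and groups in the data
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)")

p  # printing the object draws the plot
```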

35 of 60

Questions about data processing?

36 of 60

Plan your own!

37 of 60

Data reproducibility

38 of 60

Data reproducibility

How can you make sure that someone (who isn’t you)...

can find your data and code?

will know what they’re looking at when they see it?

can reproduce your processing, visualizing, and analysis steps?

39 of 60

How can you make sure that someone can find your data and code?

Solution: use data-sharing platforms

40 of 60

Things to consider when choosing a platform:

Do you want to connect to a GitHub repository? Choose Zenodo or Figshare

Do you want to not pay money? Choose Zenodo, Figshare, or Environmental Data Initiative

Do you want someone to curate your submission? Choose Environmental Data Initiative

41 of 60

Resources for exploring data sharing platforms

Mislan et al. Elevating the status of code in ecology. https://doi.org/10.1016/j.tree.2015.11.006

More repository comparisons: https://zenodo.org/records/3946720

User comparison: https://evodify.com/free-research-repository/

42 of 60

Data reproducibility

How can you make sure that someone (who isn’t you)...

can find your data and code?

will know what they’re looking at when they see it?

can reproduce your processing, visualizing, and analysis steps?

43 of 60

How can you make sure that someone will know what they’re looking at when they see your data and code?

Solution: write a README file!

44 of 60

README files

  • Describe what is in a repository (data, files, etc.)
  • Usually in plain text format
  • On GitHub, in .md format
  • Useful sections:
    • General overview
    • Data and file overview
    • Sharing and accessing information
    • Methodological information
    • Data-specific information

Resource from Research Data Services at the library: README, please!

45 of 60

General information

  • What’s the project?
  • Who’s working on it?
  • What’s it about?

46 of 60

Data and file overview

  • What files are there?
  • Create a file tree of the repository using `tree` in the Terminal (in RStudio), then just copy/paste the output
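For example, in the Terminal (assuming `tree` is installed on your system; it isn't part of RStudio itself):

```shell
# From the repository root:
tree        # prints the file tree; copy/paste the output into the README
tree -L 2   # or limit the depth if the repository is deep
```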

47 of 60

Sharing and accessing information

Types of licenses:

  • Public domain
  • Attribution
  • Open database license

See more here

You decide what license you want to give your work: these are community standards, meaning that they are not necessarily legally binding, but everyone follows them.

48 of 60

Methodological information

  • Where did the data come from?
  • How did you collect the data?

49 of 60

Data-specific information

What types of data do you have?

Can get unwieldy if you have a lot of data files! You decide the resolution of describing this information (at the level of the columns/rows, or at the level of the file)

50 of 60

Data reproducibility

How can you make sure that someone (who isn’t you)...

can find your data and code?

will know what they’re looking at when they see it?

can reproduce your processing, visualizing, and analysis steps?

51 of 60

How can you make sure that someone can reproduce your processing, visualizing, and analysis steps?

Solution: review your code!

52 of 60

The 4 Rs of reproducible code standards

Is the code as Reported?

Do the methods and code match up?

Does the code Run?

Code should run without any modifications

Is the code Reliable?

Code actually does what you think it does

Are the results Reproducible?

Could someone get the same results you do?

53 of 60

Is the code as Reported?

Keep your thoughts organized

    • Outline your methods as you are working on your data cleaning and analysis!!
    • Keep a “journal” of data cleaning steps: packages, functions, etc.

Keep your code organized

    • Organize your code following the order of your methods (though this can change, so give yourself some flexibility to change too!)

54 of 60

Does the code Run?

  • Make sure the data is available
    • Solution: share data and code on data sharing platform
  • Make sure the packages/functions are available
    • `renv`: records your package versions in a lockfile (renv.lock) that others can use to recreate the same package environment
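A sketch of the typical renv workflow (these are run interactively in your project, not in an analysis script):

```r
# One-time setup in a project: renv starts tracking packages
# and writes their versions to a lockfile (renv.lock)
renv::init()

# After installing or updating packages, refresh the lockfile
renv::snapshot()

# A collaborator (or future you) reinstalls those exact versions
renv::restore()
```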

55 of 60

Is the code Reliable?

  • Double check that your transformation, summarizing, etc. steps are doing what you think they’re doing
    • Solution: run your processing steps on a small subset of your data that would be easy to double check on your own
    • An’s example: calculating each species’ proportion of total community biomass. This is hard to check across 11 years of data from 5 sites sampled 4 times a year; it is easier to pick out two sampling points, do the calculation, then double-check the numbers by hand
  • If you are developing a package of functions (or writing a bunch of your own functions): `testthat`
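A sketch of that subset check, with invented biomass numbers small enough to verify by hand:

```r
library(dplyr)

# Hypothetical biomass data, small enough to check by hand
biomass <- tibble(
  site    = c("A", "A", "B"),
  species = c("kelp", "urchin", "kelp"),
  mass_g  = c(30, 10, 50)
)

props <- biomass |>
  group_by(site) |>
  mutate(prop = mass_g / sum(mass_g)) |>  # proportion of site biomass
  ungroup()

# Site A totals 40 g, so kelp should be 30/40 = 0.75
stopifnot(isTRUE(all.equal(props$prop[1], 0.75)))
```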

56 of 60

Are the results Reproducible?

Make sure what you write matches up with the output of your code

    • One way: if the data have changed, re-run all analyses and double-check the statistics at the end of every draft version
    • Better solution: write papers in RStudio using Quarto or R Markdown, where you can insert values from your output directly into the text
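A sketch of the idea (the survey counts are invented; a code chunk computes the number once, and inline code drops it into the prose):

```r
# In a Quarto (.qmd) or R Markdown (.Rmd) document, compute results
# in a code chunk...
mean_count <- mean(c(10, 4, 12, 7))

# ...then reference them in the prose with inline code, `r mean_count`,
# so reported numbers always match the current data
```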

57 of 60

More resources:

58 of 60

Questions about data reproducibility?

59 of 60

Plan your own!

60 of 60

Thanks for coming!

Please fill out the survey for us to learn what we can improve on and what future iterations of this workshop should be like!

Survey link: https://bit.ly/4bM5Bmk