EEMB Data Management Workshop
Spring 2024
Slides: bit.ly/4aRWcbo
From our survey, topics in order of interest are:
Workshop format
Sections:
Data storage and redundancy
Data processing
Data reproducibility
For each section:
A mini-lecture
Q&A
Plan your own!
At the end of this workshop, you will come away with a plan to get from data collection to processing to publishing!
Panel
Zoe: Nearly graduated ancient PhD candidate studying food webs and nutrient subsidies in California intertidal habitats.
Caitlin: PhD student in the Briggs Lab studying frog disease in California.
An: thinking about communities on land and in the ocean
Data storage and redundancy
If your data exist in multiple forms, that’s redundancy!
Copies
Physical storage
Cloud storage
Primary copy and back-ups
Data copies - Caitlin
Tips for not losing the primary data sheets:
Source: Tails from the Field
Caitlin’s field notebook scan, covered in mud
Data copies - lessons learned
Copies
Physical storage
Cloud storage
Primary copy and back-ups
On your computer, on an external drive
Multiple types of physical storage: if you lose one, you will always have another!
Physical data sheets in a binder in lab
Scans of data sheets on computer and external drive
Entered data in .csv on computer
Wherever your copies are, keep the system consistent!
Scan with flatbed scanner connected to computer
Save all files to external hard drive that lives at my house
Another option: scan to flash drive using Noble Hall scanner!
Technically these .csv files are downloaded from Google Sheets… which we will talk about in a bit!
Copies
Physical storage
Cloud storage
Primary copy and back-ups
On your computer, on an external drive
Dropbox, Google Drive, Box, etc.
Platforms and benefits
Dropbox
Google Drive
Box
Other options: OneDrive, WeTransfer, etc.
Easiest transition from working at UCSB to elsewhere
Easiest for collaborative data collection
Unlimited storage with UCSB
Questions about data storage and redundancy?
Plan your own!
Data processing
Resource to follow along:
Data processing and reproducibility
Design your data entry sheet to mimic your data sheet as closely as possible to minimize errors
Operations you might have to do once you’ve entered your data, and the packages to use:
Read it into RStudio: tidyverse, googlesheets4, repmis
Look at it
Clean it up
Visualize and summarize it
Read it into RStudio (from your computer)
If your data are in a .csv: read_csv from readr in the tidyverse
If your data are in a .xlsx: read_xlsx from readxl (installed with the tidyverse, but loaded separately with library(readxl))
Read it into RStudio (from the cloud)
If you have your data in Google Sheets: googlesheets4
If you have your data in Dropbox or Box: read in using repmis and the share link
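As a rough illustration of the reading options above, here is a minimal sketch. The data frame, file paths, and share links are hypothetical placeholders, not files from this workshop. Only the local .csv part runs as written; the other calls are left as comments because they need your own file or link.

```r
library(readr)  # part of the core tidyverse

# Hypothetical example data, written to a temporary file so the sketch runs anywhere
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(site = c("A", "B"), frog_count = c(3, 7)),
  csv_path
)

# Read a local .csv
frogs <- read_csv(csv_path, show_col_types = FALSE)

# Read a local .xlsx (placeholder path)
# library(readxl)
# frogs_xlsx <- read_xlsx("path/to/your_file.xlsx")

# Read from Google Sheets (asks you to sign in to Google on first use)
# library(googlesheets4)
# frogs_sheet <- read_sheet("your-sheet-share-link")

# Read a .csv from a Dropbox or Box link that points directly at the file
# library(repmis)
# frogs_cloud <- source_data("your-direct-download-link")
```

Whichever route you use, the result is a data frame you can inspect in the next step.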
Operations you might have to do once you’ve entered your data, and the packages to use:
Read it into RStudio: tidyverse, googlesheets4, repmis
Look at it: dplyr, base R, skimr
Clean it up
Visualize and summarize it
Look at your data: what are the columns called?
Look at your data: what’s in the columns?
`str()` is another option!
Look at your data: what do the rows have in them?
Look at your data: what are some basic summary stats for my data?
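The "look at your data" questions above could be answered with calls like the ones below. This is a sketch on a small made-up data frame (`frogs` is hypothetical), and the choice of functions is one reasonable option rather than the only one.

```r
library(dplyr)

# Hypothetical example data standing in for your own
frogs <- data.frame(
  site = c("A", "A", "B", "B"),
  frog_count = c(3, 5, 7, NA)
)

colnames(frogs)  # what are the columns called?
str(frogs)       # what's in the columns? (base R)
glimpse(frogs)   # same idea, from dplyr
head(frogs)      # what do the rows have in them?
summary(frogs)   # basic summary stats (base R)

# skimr gives a fuller summary, if you have it installed:
# skimr::skim(frogs)
```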
Operations you might have to do once you’ve entered your data, and the packages to use:
Read it into RStudio: tidyverse, googlesheets4, repmis
Look at it: dplyr, base R, skimr
Clean it up: dplyr, janitor
Visualize and summarize it
Useful functions to keep in mind
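The slide does not list specific functions here, so the selection below is an assumption: a minimal sketch of a few dplyr verbs that often come up when cleaning, applied to a made-up data frame (`raw` is hypothetical).

```r
library(dplyr)

# Hypothetical messy data: awkward column names, inconsistent labels, a missing value
raw <- data.frame(
  `Site Name` = c("a", "A", "b"),
  `Frog Count` = c(3, NA, 7),
  check.names = FALSE
)

clean <- raw %>%
  rename(site = `Site Name`, frog_count = `Frog Count`) %>%  # tidy column names
  mutate(site = toupper(site)) %>%                            # standardize labels
  filter(!is.na(frog_count))                                  # drop rows with missing counts

# janitor::clean_names(raw) would convert all column names to snake_case in one step
```

Whether to drop rows with missing values or keep them depends on your analysis; the filter step is only an example.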
Operations you might have to do once you’ve entered your data, and the packages to use:
Read it into RStudio: tidyverse, googlesheets4, repmis
Look at it: dplyr, base R, skimr
Clean it up: dplyr, janitor
Visualize and summarize it: ggplot2
Visualize it
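A minimal ggplot2 sketch of a first look at the data. The data frame and axis labels are made up for illustration; swap in your own columns.

```r
library(ggplot2)

# Hypothetical example data
frogs <- data.frame(
  site = c("A", "A", "B", "B"),
  frog_count = c(3, 5, 7, 6)
)

# One point per observation, grouped by site
p <- ggplot(frogs, aes(x = site, y = frog_count)) +
  geom_point() +
  labs(x = "Site", y = "Frog count")

print(p)
```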
Questions about data processing?
Plan your own!
Data reproducibility
How can you make sure that someone (who isn’t you)...
can find your data and code?
will know what they’re looking at when they see it?
can reproduce your processing, visualizing, and analysis steps?
How can you make sure that someone can find your data and code?
Solution: use data-sharing platforms
Things to consider when choosing a platform:
Do you want to connect to a GitHub repository? Choose Zenodo or Figshare
Do you want a free option? Choose Zenodo, Figshare, or Environmental Data Initiative
Do you want someone to curate your submission? Choose Environmental Data Initiative
Resources for exploring data sharing platforms
Mislan et al. Elevating the status of code in ecology. https://doi.org/10.1016/j.tree.2015.11.006
More repository comparisons: https://zenodo.org/records/3946720
User comparison: https://evodify.com/free-research-repository/
How can you make sure that someone will know what they’re looking at when they see your data and code?
Solution: write a README file!
README files
Resource from Research Data Services at the library: README, please!
General information
Data and file overview
Sharing and accessing information
Types of licenses:
You decide which license to give your work. These are community standards: they are not necessarily legally binding, but they are widely followed.
Methodological information
Data-specific information
What types of data do you have?
This can get unwieldy if you have a lot of data files! You decide the resolution at which to describe this information (at the level of the columns/rows, or at the level of the file).
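If you describe your data at the level of the columns, one possible way to start the data-specific section of a README is to build a small data dictionary in R. This is a sketch under assumptions: the data frame and the descriptions are invented, and the layout is just one option.

```r
# Hypothetical example data
frogs <- data.frame(
  site = c("A", "B"),
  frog_count = c(3, 7)
)

# One row per column: its name, its type, and a description you write by hand
dictionary <- data.frame(
  column = names(frogs),
  type = vapply(frogs, function(x) class(x)[1], character(1)),
  description = c("Site label", "Number of frogs counted per survey"),
  row.names = NULL
)

dictionary

# Save it next to your README (placeholder file name)
# write.csv(dictionary, "data_dictionary.csv", row.names = FALSE)
```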
How can you make sure that someone can reproduce your processing, visualizing, and analysis steps?
Solution: review your code!
The 4 Rs of reproducible code standards
Is the code as Reported?
Do the methods and code match up?
Does the code Run?
Code should run without any modifications
Is the code Reliable?
Code actually does what you think it does
Are the results Reproducible?
Could someone get the same results you do?
Is the code as Reported?
Keep your thoughts organized
Keep your code organized
Does the code Run?
Is the code Reliable?
Are the results Reproducible?
Make sure what you write matches up with the output of your code
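A few small habits can help with the Run and Reproducible checks. The sketch below shows three of them, chosen as examples rather than taken from the slides: fixing the random seed, using relative paths, and recording package versions. The file path is a placeholder.

```r
# Fix the random seed so steps that use random numbers give the same result every run
set.seed(2024)
sample_a <- sample(1:100, 5)

set.seed(2024)
sample_b <- sample(1:100, 5)

identical(sample_a, sample_b)  # TRUE: same seed, same draw

# Build paths relative to the project folder instead of hard-coding your own computer's path
data_path <- file.path("data", "frogs.csv")  # placeholder path

# Record the R and package versions used, so others can match them
sessionInfo()
```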
More resources:
Questions about data reproducibility?
Plan your own!
Thanks for coming!
Please fill out the survey for us to learn what we can improve on and what future iterations of this workshop should be like!
Survey link: https://bit.ly/4bM5Bmk