1 of 50

Data, data everywhere! Managing & organizing data

July 13, 2023

Sara Samuel & Kate Saylor

2 of 50

Technology Overview

  • Live transcription is available
  • Recording is on
  • Remain muted during presentation
  • Use chat to share questions and comments during the presentation
  • Evaluation at the end
  • Email follow-up with slides & recording

3 of 50

Learning Objectives

  • Recognize why organizing your data is important for enabling high-quality research.
  • Understand recommended practices for keeping your data organized.
  • Identify ways to describe your data.
  • Be able to experiment with practical strategies to start organizing your data.

4 of 50

Have you ever lost some data or notes because of lack of organization?

Poll

5 of 50

What is data?

  • Text
  • Numerical
  • Multimedia
  • Models
  • Software

Raw Data

Analyzable Data

Shareable Data

6 of 50

Categories of data

  • Observational: sensor readings, telemetry, survey results, videos, images
  • Experimental: gene sequences, chromatograms, magnetic field readings, and spectroscopy
  • Simulation: climate models, economic models, and systems engineering
  • Compiled: text and data mining, compiled database, systems engineering, and 3D models
  • Reference: gene sequence databanks, census data, chemical structures

7 of 50

Why is organizing and managing data important?

8 of 50

1 rat heart

  • 100s of slices
  • 100s of slides
  • 1000s of image files (TIF)
  • 100s of huge images
  • 3 postdocs
  • 5-7 experiments a week…

9 of 50

Data management for yourself

  • Efficiently find your files
  • Track your methods for reproducibility
  • Better version control of data
  • Quality control
  • Avoid data loss
  • Document your data for your own recollection, accountability, and re-use
  • Gain credibility and recognition for your science efforts through data sharing!

10 of 50

Data management for science

  • Data is a valuable asset. It is expensive and time-consuming to collect, so take good care of it!
  • Well-managed data:
    • improves quality, accuracy, and integrity of your research
    • maximizes the effective use of data
    • ensures appropriate use of data and information
    • strengthens the reliability of the research - promotes transparency, encourages accountability, reduces bias and errors
    • ensures sustainability and accessibility allowing others to reproduce your findings

11 of 50

Research reproducibility

  • Reproducibility is the ability of other researchers to reach similar results when using the same methods and data.
    • Replicability is achieving similar results by conducting a new study with different methods or approaches.
  • Accurate, comprehensive, and transparent reporting allows for reproducibility.

12 of 50

Data management challenges

  • Good data management takes time and planning
  • Researchers may lack knowledge about best practices in handling data
  • General lack of incentives for doing good data management
  • There can be a cost to managing data (human, technology, etc.)

13 of 50

Poor data management affects everyone

“MEDICARE PAYMENT ERRORS NEAR $20B” (CNN) December 2004. Miscoding and billing errors from doctors and hospitals totaled $20 billion in FY 2003 (9.3% error rate). The error rate measured claims that were paid despite being medically unnecessary, inadequately documented, or improperly coded. This error rate actually was an improvement over the previous fiscal year (9.8% error rate).

“SOCIAL SECURITY DATA CAN TURN PEOPLE INTO THE LIVING DEAD” (NPR) August 2016. In 2011, an audit found that about 1,000 people a month in the U.S. were marked deceased when they were very much alive. Rona Lawson, who works in the Office of the Inspector General at the Social Security Administration, says that number has gone down. It's now around 500 people a month. Lawson says 90 percent of the time, the cascade of misinformation starts with an input error by Social Security staff — a regular mistake on a regular office day that just happens to kill a person off, at least on paper.

14 of 50

The climate scientists at the centre of a media storm over leaked emails were cleared of accusations that they manipulated their results and silenced critics, but a review found they had failed to be open enough about their work.

15 of 50

“had published duplicate pictures in several cases and had repeatedly failed to exert due diligence in organising her area of study over a long period of time.”

16 of 50

Don't end up here!

  • Multiple errors in table
  • Did not alter conclusions in article
  • BUT, could not locate primary data

17 of 50

Data management and documentation could have prevented these problems!

18 of 50

How can I manage & organize my data?

19 of 50

Keys to success

  • Maintain good documentation
    • Readme files
    • Data management plan (DMP)
  • Communicate thoroughly
    • Provide training
    • Make documentation findable
    • Use a variety of media (email, Slack, lab manual)
    • Regularly check in with and/or remind all contributors

If you can, assign one person to be responsible

20 of 50

Organizing tactics

  1. Identify storage solution(s)
  2. Establish directory/folder structures
  3. Develop file naming convention(s)
  4. Decide on file formats

21 of 50

1. Identify storage solution(s)

Document!

22 of 50

2. Establish directory/folder structures

  • Organize directories hierarchically
  • Group files of similar information together in a single directory
  • Name directories after aspects of the project rather than individual researchers
  • Separate ongoing and completed work
  • Once you have decided on a directory structure, follow it consistently and audit it periodically

Document!
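
The practices above can be sketched as a small script that creates the same folder hierarchy for every new project. This is a minimal illustration, not a prescribed layout: the folder names in `PROJECT_LAYOUT` and the function name `create_project_skeleton` are hypothetical and should be replaced with names that fit your own project.

```python
from pathlib import Path

# Hypothetical layout: folders are named after aspects of the project
# (not after people), and ongoing work is kept apart from completed work.
PROJECT_LAYOUT = [
    "data/raw",
    "data/processed",
    "docs",
    "analysis/ongoing",
    "analysis/completed",
]


def create_project_skeleton(root, layout=PROJECT_LAYOUT):
    """Create every folder in `layout` under `root` and return the paths."""
    root = Path(root)
    created = []
    for relative in layout:
        folder = root / relative
        # parents=True builds the hierarchy; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_project_skeleton("my_project")` builds the folders under `my_project`. Keeping the layout in one list also gives you something concrete to record in your documentation.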

23 of 50

24 of 50

3. Develop file naming convention(s)

File names should:

  • embody the content of the file
  • have intuitive (non-cryptic) names where possible
  • be extensible
  • be unique, where possible and practical
  • not use special characters – restrict file names to numbers, letters, and underscores
  • be named using consistent, documentable rules

Document!

May need to include:

  • versioning
  • multiple conventions
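
One way to keep a naming rule "consistent and documentable" is to write it down as a check that can be run on file names. The sketch below covers only the character rule from this slide (numbers, letters, and underscores, plus an optional extension). The pattern and the function name `is_safe_filename` are assumptions for illustration. Note that under this strict rule hyphens are rejected, so a date would need the compact YYYYMMDD form.

```python
import re

# Letters, digits, and underscores, optionally followed by one ".extension".
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9]+)?")


def is_safe_filename(name):
    """Return True if `name` uses only letters, digits, and underscores."""
    return SAFE_NAME.fullmatch(name) is not None
```

For example, `is_safe_filename("AtherRat_012_056_mb_0423_raw.csv")` passes, while a name containing spaces or parentheses does not.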

25 of 50

Implement version control (versioning)

26 of 50

ISO 8601 - Formatting Dates

Standard way to format date and time - extremely helpful for file naming conventions

YYYY-MM-DD

YYYYMMDD
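
As a small sketch of why these formats help with file naming, the Python standard library can produce both forms, and names that start with the date then sort chronologically when sorted alphabetically. The helper name `dated_name` and the choice to put the date first are assumptions for illustration.

```python
from datetime import date


def dated_name(stem, day, extension, compact=True):
    """Build a file name that starts with an ISO 8601 date."""
    # compact=True gives YYYYMMDD; compact=False gives YYYY-MM-DD.
    stamp = day.strftime("%Y%m%d") if compact else day.isoformat()
    return f"{stamp}_{stem}.{extension}"


names = [
    dated_name("survey", date(2023, 12, 1), "csv"),
    dated_name("survey", date(2023, 2, 15), "csv"),
    dated_name("survey", date(2022, 7, 13), "csv"),
]
# Because year comes first, then month, then day, plain alphabetical
# sorting puts the files in date order.
chronological = sorted(names)
```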

27 of 50

Examples

AtherRat_012_056_mb_0423_raw.csv

  • AtherRat = experiment name
  • 012 = experiment number
  • 056 = sample number
  • mb = stain used, methylene blue
  • 0423 = 2-digit coordinates of image (4 across, 23 down)
  • raw = data stage
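
A documented convention like this one can also be read back by a script. The sketch below splits the example file name into its six parts. The field labels and the function name `parse_image_filename` are hypothetical, and it assumes every file follows the convention exactly, with underscores used only as separators.

```python
from pathlib import Path

# One label per underscore-separated part of the file name, in order.
FIELDS = (
    "experiment_name",
    "experiment_number",
    "sample_number",
    "stain",
    "image_coordinates",
    "data_stage",
)


def parse_image_filename(filename):
    """Split a file name that follows the example convention into fields."""
    parts = Path(filename).stem.split("_")  # stem drops the ".csv" extension
    if len(parts) != len(FIELDS):
        raise ValueError(f"{filename!r} does not follow the naming convention")
    record = dict(zip(FIELDS, parts))
    # The coordinates are two 2-digit numbers: columns across, then rows down.
    coordinates = record["image_coordinates"]
    record["across"] = int(coordinates[:2])
    record["down"] = int(coordinates[2:])
    return record
```

A file name that does not have exactly six parts raises an error, which is a quick way to spot files that drifted from the convention.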

28 of 50

4. Decide on file formats

Whenever possible, select file formats that are:

  • non-proprietary
  • unencrypted
  • uncompressed
  • in common usage by the research community
  • adherent to an open, documented standard
  • interoperable among diverse platforms and applications
  • royalty-free and without intellectual property restrictions
  • developed and maintained by an established open standards organization

Consider instrument/device settings

Document!

29 of 50

Recommended file formats

Audio: WAVE, AIFF, MP3, MXF

Containers: TAR, GZIP, ZIP

Databases: XML, CSV

Statistics: ASCII, DTA, POR, SAS, SAV

Still images: TIFF, JPEG 2000, PNG, GIF

Tabular data: CSV

Text (documentation, scripts): XML, PDF/A, Plain Text (ASCII, UTF-8)

Video: MOV, MPEG, AVI, MXF

30 of 50

How should I describe my data?

31 of 50

Documentation

Good data documentation helps ensure accurate reporting of data and methods:

  • Project level - describes the procedures and methods used for data collection and analysis (workflow, protocols, instruments, etc.)
  • Metadata - data about your data
  • Data dictionaries - describe the variables in a dataset
  • README files - dataset description, guide to files and technology

32 of 50

Workflows

The meaning of data depends on the context of how it was collected.

33 of 50

Internal workflow

  • How and when will the work be done?
  • Will data be reviewed for quality?
  • Who manages the entire process?

34 of 50

What is metadata?

It's the “data about data”: structured information (which may follow documentation standards) that makes it easier to retrieve, use, or manage data. It can include things like:

  • Dates, times, locations
  • Hardware/software information and parameters
  • Methodology
  • Creators
  • Copyright
  • Formats

35 of 50

Perkel, J. M. (2023). How to make your scientific data accessible, discoverable and useful. Nature, 618(7967), 1098–1099. https://doi.org/10.1038/d41586-023-01929-7

36 of 50

Share in chat

What are we looking at in this table?

What questions or concerns arise?

37 of 50

Document your variable names

  • Intuitive / meaningful variable names e.g. study_id
  • What do variable names mean?
  • What does each variable contain?
  • Is there a limited set of possible values?

Name          | Field Type | Description                     | Possible values                                            | Units
study_id      | text       | Unique ID of study              | 8-digit number                                             |
date_enrolled | date       | Initial subject enrollment date | Date in format YYYY-MM-DD; all dates later than 2011-09-01 |
weight        | integer    | Weight of subject               |                                                            | lbs
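
Once variables are documented this way, the documented rules can double as quality checks. The following is a rough sketch that tests one record against the three variables in the table above. The function name `validate_record` and the wording of the messages are assumptions, and a real project would likely need more rules than these.

```python
import re
from datetime import date

# From the table: all enrollment dates are later than 2011-09-01.
EARLIEST_EXCLUSIVE = date(2011, 9, 1)


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []

    # study_id: text field holding an 8-digit number.
    if not re.fullmatch(r"\d{8}", str(record.get("study_id", ""))):
        problems.append("study_id must be an 8-digit number")

    # date_enrolled: YYYY-MM-DD, later than 2011-09-01.
    raw_date = str(record.get("date_enrolled", ""))
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw_date):
        problems.append("date_enrolled must use the format YYYY-MM-DD")
    else:
        try:
            enrolled = date.fromisoformat(raw_date)
        except ValueError:
            problems.append("date_enrolled is not a real calendar date")
        else:
            if enrolled <= EARLIEST_EXCLUSIVE:
                problems.append("date_enrolled must be later than 2011-09-01")

    # weight: integer number of pounds (bool is excluded on purpose).
    weight = record.get("weight")
    if isinstance(weight, bool) or not isinstance(weight, int):
        problems.append("weight must be an integer (lbs)")

    return problems
```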

38 of 50

Data Dictionaries

Data dictionaries provide detailed descriptions of the data in a dataset. They help provide context about the structure and content of data. They can also guide the process for collection and use of the data.

  • List of all data objects
  • Description of data elements (size, type, classification, etc.)
  • Relationships to other data
  • Variables and coding information
  • Creation date
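
If a dataset already exists as a CSV file, a starter data dictionary can be generated from its header row and then filled in by hand. This is only a sketch: it reuses the column headings from the variable table earlier in the deck, the function name `skeleton_data_dictionary` is hypothetical, and it assumes the first row of the CSV holds the variable names.

```python
import csv
import io

# Column headings taken from the variable documentation table.
COLUMNS = ["Name", "Field Type", "Description", "Possible values", "Units"]


def skeleton_data_dictionary(csv_text):
    """Return CSV text with one blank dictionary row per variable."""
    header = next(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for name in header:
        # Only the name is known; the other cells are left for a person.
        writer.writerow([name.strip(), "", "", "", ""])
    return out.getvalue()
```

The output is deliberately incomplete. The descriptions, possible values, and units still have to come from someone who knows the data.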

39 of 50

40 of 50

41 of 50

README files

What do we need to know to use your data?

  • Where to find it
  • How to access it
  • What can it be used for?
  • Known problems, inconsistencies, limitations
  • Collection methods, units of measure, variable names
  • Data integrity
  • Ethical/privacy restrictions
  • Licensing
  • Who to cite

42 of 50

README: Recommended practices

  • Create 1 readme file for each data file/dataset
  • Name the readme so that it’s easily associated with the datafile(s) it describes
  • Write your readme document as a plain text file
  • Format multiple readme files identically (Tip: use a template - create or find one!)
  • Follow the conventions for your discipline
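
The template tip can be sketched in a few lines: derive the readme name from the data file it describes, and produce the same plain-text headings every time. The headings below come from the "What do we need to know to use your data?" list earlier in the deck. The `_README.txt` suffix and the function names are assumptions, so adjust them to the conventions of your discipline.

```python
from pathlib import Path

# Prompts from the "What do we need to know to use your data?" slide.
README_PROMPTS = [
    "Where to find it",
    "How to access it",
    "What can it be used for?",
    "Known problems, inconsistencies, limitations",
    "Collection methods, units of measure, variable names",
    "Data integrity",
    "Ethical/privacy restrictions",
    "Licensing",
    "Who to cite",
]


def readme_name(data_file):
    """Name the readme so it is easily associated with its data file."""
    return f"{Path(data_file).stem}_README.txt"


def readme_template(data_file):
    """Return identical plain-text headings for every readme."""
    lines = [f"README for {Path(data_file).name}", ""]
    for prompt in README_PROMPTS:
        lines.extend([prompt.upper(), "(fill in)", ""])
    return "\n".join(lines)
```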

43 of 50

44 of 50

Ways to get started

  • Pick 1 project, implement 1 new tactic
  • Choose 1 aspect of a file naming convention and apply it to your files
  • Plan out folder names and hierarchy before starting a new project
  • Do a scan to figure out where all your data files are & document everything in a readme file
  • Review current resources/tools for where you can gather metadata and documentation without changing your workflow
  • Weave in recommended practices where you can; practice leads to improvement!
  • Get an organization accountability buddy & check in occasionally
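
For the "do a scan" suggestion, a short script can give a first overview of what is stored where. The sketch below counts the files under a folder by extension. The function names are hypothetical, and it only looks at names, not at file contents.

```python
from collections import Counter
from pathlib import Path


def inventory_by_extension(root):
    """Count the files under `root` (including subfolders) by extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Lowercase so ".CSV" and ".csv" are counted together.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


def format_inventory(counts):
    """Return one 'extension: count' line per extension, sorted by name."""
    return [f"{extension}: {count}" for extension, count in sorted(counts.items())]
```

The lines returned by `format_inventory` can be pasted into a readme file as a starting point for documenting where your data lives.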

Consider your situation and goals!

45 of 50

Check yourself!

  1. Can you easily locate & understand the raw data?
  2. Can you connect different types of related data you collected?
  3. If a file gets misplaced, can you put it back in the correct folder?
  4. Are your naming conventions consistent with others on your team?
  5. If another researcher were to ask you for a copy of your data 5 years after the close of your project, would you be able to easily find it and send it to them? Could all the members of your research team find it?
  6. If a researcher were to receive a copy of your data, would they be able to use it without asking you too many questions?

46 of 50

What’s worked for you?

(Tools, systems, getting started, etc.)

Share in chat

47 of 50

Resources

48 of 50

49 of 50

50 of 50

Thanks! Questions?

Sara Samuel, henrysm@umich.edu

Kate Saylor, kmacdoug@umich.edu

Parts of this presentation were adapted from presentations by:

  • NYU Langone Health RDM Teaching Toolkit slides by Kevin Read & Alisa Surkis.
  • Marisa Conte, from when she was a THSL informationist.
  • DataONE Education Module: Data Management.