Data, data everywhere! Managing & organizing data
July 13, 2023
Sara Samuel & Kate Saylor
Technology Overview
Learning Objectives
Have you ever lost some data or notes because of lack of organization?
Poll
What is data?
Raw Data
Analyzable Data
Shareable Data
Categories of data
Why is organizing and managing data important?
1 rat heart
100s of slices
100s of slides
1000s of image files (TIF)
100s of huge images
3 postdocs
5-7 experiments a week…
Data management for yourself
Data management for science
Research reproducibility
Data management challenges
Poor data management affects everyone
“MEDICARE PAYMENT ERRORS NEAR $20B” (CNN) December 2004. Miscoding and billing errors from doctors and hospitals totaled $20 billion in FY 2003 (9.3% error rate). The error rate measured claims that were paid despite being medically unnecessary, inadequately documented, or improperly coded. This error rate actually was an improvement over the previous fiscal year (9.8% error rate).
“SOCIAL SECURITY DATA CAN TURN PEOPLE INTO THE LIVING DEAD” (NPR) August 2016. In 2011, an audit found that about 1,000 people a month in the U.S. were marked deceased when they were very much alive. Rona Lawson, who works in the Office of the Inspector General at the Social Security Administration, says that number has gone down. It's now around 500 people a month. Lawson says 90 percent of the time, the cascade of misinformation starts with an input error by Social Security staff — a regular mistake on a regular office day that just happens to kill a person off, at least on paper.
The climate scientists at the centre of a media storm over leaked emails were cleared of accusations that they manipulated their results and silenced critics, but a review found they had failed to be open enough about their work.
http://retractionwatch.com/2016/11/02/leading-diabetes-researcher-acted-negligently-probe-concludes/
“had published duplicate pictures in several cases and had repeatedly failed to exert due diligence in organising her area of study over a long period of time.”
Don't end up here!
Data management and documentation could have prevented these problems!
How can I manage & organize my data?
Keys to success
If you can, assign one person to be responsible
Organizing tactics
Document!
2. Establish directory/folder structures
Document!
3. Develop file naming convention(s)
File names should:
Document!
May need to include:
Implement version control (versioning)
ISO 8601 - Formatting Dates
Standard way to format date and time - extremely helpful for file naming conventions
YYYY-MM-DD
YYYYMMDD
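A minimal Python sketch of producing both ISO 8601 forms for use in file names. The helper names (`iso_extended`, `iso_basic`) and the sample file names are hypothetical and chosen only for illustration. The last lines show why this format helps: names that start with an ISO 8601 date sort chronologically when sorted as plain text.

```python
from datetime import date


def iso_extended(d: date) -> str:
    """Return the date as YYYY-MM-DD (ISO 8601 extended format)."""
    return d.isoformat()


def iso_basic(d: date) -> str:
    """Return the date as YYYYMMDD (ISO 8601 basic format)."""
    return d.strftime("%Y%m%d")


example_day = date(2023, 7, 13)
print(iso_extended(example_day))  # 2023-07-13
print(iso_basic(example_day))     # 20230713

# Hypothetical file names: plain alphabetical sorting is also date order.
names = ["2023-07-13_notes.txt", "2022-12-01_notes.txt", "2023-01-05_notes.txt"]
sorted_names = sorted(names)
print(sorted_names)
```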
Examples
AtherRat_012_056_mb_0423_raw.csv
AtherRat = experiment name
012 = experiment number
056 = sample number
mb = stain used, methylene blue
0423 = 2-digit coordinates of image (4 across, 23 down)
raw = data stage
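A consistent naming convention can also be read back by a script. The sketch below parses the example file name into its parts. The pattern is an assumption inferred from this single example (3-digit experiment and sample numbers, a lowercase stain code, two 2-digit coordinates), and `parse_name` is a hypothetical helper, so adjust it to whatever convention your group documents.

```python
import re
from typing import Optional

# Assumed pattern, inferred from the one example file name above.
PATTERN = re.compile(
    r"^(?P<experiment>[A-Za-z]+)_"
    r"(?P<experiment_number>[0-9]{3})_"
    r"(?P<sample_number>[0-9]{3})_"
    r"(?P<stain>[a-z]+)_"
    r"(?P<across>[0-9]{2})(?P<down>[0-9]{2})_"
    r"(?P<stage>[a-z]+)\.(?P<extension>[a-z0-9]+)$"
)


def parse_name(filename: str) -> Optional[dict]:
    """Split a file name into its labeled parts, or return None if it does not fit."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    # Coordinates are stored as 2-digit text; convert them to numbers.
    parts["across"] = int(parts["across"])
    parts["down"] = int(parts["down"])
    return parts


parsed = parse_name("AtherRat_012_056_mb_0423_raw.csv")
print(parsed)

# A name that ignores the convention is flagged rather than guessed at.
print(parse_name("final_FINAL_v2.csv"))  # None
```

Keeping the experiment and sample numbers as text preserves their leading zeros, which is what keeps the file names sorting correctly.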
4. Decide on file formats
Whenever possible, select file formats that are:
Consider instrument/device settings
Document!
Recommended file formats
Audio: WAVE, AIFF, MP3, MXF
Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Statistics: ASCII, DTA, POR, SAS, SAV
Still images: TIFF, JPEG 2000, PNG, GIF
Tabular data: CSV
Text (documentation, scripts): XML, PDF/A, Plain Text (ASCII, UTF-8)
Video: MOV, MPEG, AVI, MXF
How should I describe my data?
Documentation
Good data documentation helps ensure accurate reporting of data and methods:
Workflows
The meaning of data depends on the context of how it was collected.
Internal workflow
What is metadata?
It's the “data about data”: structured information (which may follow documentation standards) that makes it easier to retrieve, use, or manage data. It can include things like:
Perkel, J. M. (2023). How to make your scientific data accessible, discoverable and useful. Nature, 618(7967), 1098–1099. https://doi.org/10.1038/d41586-023-01929-7
Share in chat
What are we looking at in this table?
What questions or concerns arise?
Document your variable names
Name | Field Type | Description | Possible values | Units |
study_id | text | Unique ID of study | 8-digit number | |
date_enrolled | date | Initial subject enrollment date | Date in format YYYY-MM-DD; All dates later than 2011-09-01 | |
weight | integer | Weight of subject | | lbs |
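Once variables are documented like this, the same rules can be checked by a script. The sketch below tests one record against the three rows of the table above. `validate_record` and the sample records are hypothetical, and the checks cover only what the table states (8-digit text ID, ISO 8601 date later than 2011-09-01, integer weight in lbs).

```python
import re
from datetime import date


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty list means it passes)."""
    problems = []

    # study_id: text field holding an 8-digit number.
    if not re.fullmatch(r"[0-9]{8}", str(record.get("study_id", ""))):
        problems.append("study_id must be an 8-digit number stored as text")

    # date_enrolled: YYYY-MM-DD, and later than 2011-09-01.
    raw_date = str(record.get("date_enrolled", ""))
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", raw_date):
        problems.append("date_enrolled must be formatted YYYY-MM-DD")
    else:
        try:
            if date.fromisoformat(raw_date) <= date(2011, 9, 1):
                problems.append("date_enrolled must be later than 2011-09-01")
        except ValueError:
            problems.append("date_enrolled is not a real calendar date")

    # weight: integer number of pounds (booleans are excluded on purpose).
    weight = record.get("weight")
    if not isinstance(weight, int) or isinstance(weight, bool):
        problems.append("weight must be an integer (lbs)")

    return problems


good = {"study_id": "12345678", "date_enrolled": "2012-03-15", "weight": 150}
bad = {"study_id": "1234", "date_enrolled": "2011-09-01", "weight": "150"}

print(validate_record(good))  # []
print(validate_record(bad))   # three problems, one per field
```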
Data Dictionaries
Data dictionaries provide detailed descriptions of the data in a dataset. They help provide context about the structure and content of data. They can also guide the process for collection and use of the data.
Public Use Microdata Sample Documentation (US Census)
README files
What do we need to know to use your data?
Source: Quick & Dirty Data Management
README: Recommended practices
Source: Quick & Dirty Data Management
Ways to get started
Consider your situation and goals!
Check yourself!
What’s worked for you? (Tools, systems, getting started, etc.)
Share in chat
Resources
Thanks! Questions?
Parts of this presentation were adapted from presentations by: