Research Data Management:

A Primer

State University of New York

College of Environmental Science and Forestry

F. Franklin Moon Library

Alison Bressler, Master of Wetland and Water Resource Studies

Zachary Lafaver, Master of Professional Studies in Ecology

Jessica Clemons, Interim Director of College Libraries


About the authors:

Alison Bressler is studying the impact of shrub willow on ecosystem services in the Lake Ontario watershed in SUNY-ESF’s Graduate Program in Environmental Science. Alison’s research focuses on biogeochemical reactions that impact greenhouse gas emissions at the soil/atmosphere interface in managed agricultural ecosystems, and she is passionate about minimizing the negative impacts of conventional agriculture on global water quality and on climate change. Alison became interested in data management while learning to manage extensive data collection during her undergraduate research with the River Basin’s Research Initiative in South Carolina.

Zachary Lafaver is pursuing a Master of Professional Studies in Ecology.  Zach completed his B.S. in Biology at Nazareth College. His scientific interests lie in aquatic habitats where he can observe and study the interactions and characteristics among these diverse areas. Zach became interested in data management after realizing how much work it takes to effectively organize and manage the data that goes along with research.

Jessica Clemons is the Interim Director of College Libraries at SUNY ESF and has led the team that created this data management handbook.  She has a strong background in digital collections in a variety of settings from oral histories to datasets. She is an open access advocate and teaches classes on research data management and information literacy.  Jessica enjoys the challenge of thinking about the future of libraries and how we can balance traditional services with cutting-edge opportunities.

This project was funded by an I2NY Library as Publisher Grant in 2015. F. Franklin Moon Library is grateful for the support of that innovative grant as we look into sustainable methods of publishing and scholarly communication.

Table of Contents

I. Managing and Sharing Data
II. Documenting and Formatting Your Data
        A. File Naming Conventions
        B. Spreadsheets
        C. Organizing Fields
        D. Data Dictionary
        E. Quality Assurance/Quality Control and Project Management
        F. Databases and Queries
III. Backing Up Data
        A. 3-2-1 Rule
        B. Choosing a Backup System
        C. Data Backup Locations
        D. Recommended Data Backup Services
IV. Metadata
        A. Metadata Standards
        B. Controlled Vocabularies
        C. Metadata Schema
        D. Best Practices
        E. Existing Metadata Standards
                1. Dublin Core
                2. Darwin Core
                3. EML - Ecological Metadata Language
                4. Geospatial Metadata Standards
                5. Morpho
V. Data Management Plan (DMP)
VI. Citation
VII. Ownership
        A. ORP Policy 10 - Ownership of Data
        B. Definitions Set by SUNY-ESF
VIII. Data Repositories
        A. What is a data repository?
        B. How do they work?
        C. Examples

I. Managing and Sharing Data

“Data that is loved tends to survive.”

Kurt Bollacker, Data Scientist

Research data management and data management planning have received a great deal of attention in the past decade, and rightfully so. The National Institutes of Health (NIH) has had data management requirements in place since 2003, and the National Science Foundation (NSF) followed with its own requirements in 2011. These two agencies distribute a large portion of available federal research funds. Simply put, researchers who want funding from these agencies must comply with the application requirements.

Research data management (RDM) is the process of collecting, describing, documenting, sharing, and preserving data over the course of a project; a data management plan (DMP) is the document that records how this will be done. DMPs may be created for personal use, for laboratory groups with multiple users, or to meet the federal mandates that funding agencies attach to their awards. From file naming to sharing datasets via data repositories, this multi-faceted process is an essential piece of modern research.

Even setting aside federal funding requirements, data management is an important practice. Without it, data that are not represented in the literature are essentially lost. Future research may depend entirely on the data management practices of today.

It is important to note that this primer only scratches the surface of what can be learned about managing data. The authors have included links and references throughout to point readers who want to learn more toward resources we have found helpful.

You are part of a network of researchers and data managers. We hope that this resource will help you manage your data more confidently. Future researchers will thank you.


II. Documenting and Formatting Your Data

Keeping track of thousands of data points requires organization and planning. It is best to establish good data management habits at the beginning of data collection rather than trying to clean up and organize your data set at the end, when the task can seem insurmountable. This section provides a framework for file naming and tips for creating spreadsheets that can be analyzed across multiple platforms, so that you can easily reference individual data points and fields and so that other researchers can understand your data once it is published in a data repository.

A. File Naming Conventions

It is important to name your files using a consistent system. This helps you find files later and keeps them organized in chronological order. Before collecting data, design a template to use for all of your files, and stick to it throughout the data collection process. Consistency is key: starting each name with an ISO 8601 date (YYYY-MM-DD) means an alphabetical sort is also a chronological one.

Examples:

        For data:       2015-04-12-Tulley-234-698
                        2015-01-04-Vincent-1-320

        For documents:  Smith_2014_Survey
                        Jones_2015_DataDictionary
                        Mattson_2013_Thesis

This will help you find specific data and documents later on without having to search through each file. This standardized method will also allow others to easily interpret your data.
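Where file names are generated or renamed programmatically, a short helper can enforce the convention. Below is a minimal sketch in Python, assuming a hypothetical convention of date, collector name, and first and last sample numbers:

    import datetime

    def data_file_name(collector, first_sample, last_sample, date=None):
        # Build a name of the form YYYY-MM-DD-Collector-FirstSample-LastSample
        date = date or datetime.date.today()
        return "{}-{}-{}-{}".format(date.isoformat(), collector,
                                    first_sample, last_sample)

    # Reproduces the first data file name above
    print(data_file_name("Tulley", 234, 698, datetime.date(2015, 4, 12)))
    # prints: 2015-04-12-Tulley-234-698

Generating names through a single function like this prevents the small inconsistencies (swapped fields, two-digit years) that creep in when names are typed by hand.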

B. Spreadsheets

A spreadsheet is most useful when it is simple enough to be read directly into an analysis package and easily manipulated. The main goal is to remove all extraneous information, leaving only the necessary values and labels.

        C.   Organizing Fields

Example Excel Spreadsheet:

[Screenshot of an example spreadsheet not reproduced.]

Leaving a cell blank or writing N/A signifies that data were not available. You can put a note in your data dictionary (explained in the next section) indicating why data are missing. A minimal sketch of a well-organized sheet follows.
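As a sketch (the site and variable names are hypothetical), a tidy layout has a single header row of variable labels and one observation per row, and can be saved in a plain format such as CSV so that any analysis package can read it:

    site_id,date,temp_c,soil_moisture_pct
    TULLEY-01,2015-04-12,18.2,23.5
    TULLEY-02,2015-04-12,17.9,NA
    VINCENT-01,2015-04-13,16.4,21.0

Here NA marks a missing soil moisture reading; the reason for the gap belongs in the data dictionary rather than in the spreadsheet itself.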

        D.    Data Dictionary 

A data dictionary should provide a context within which anyone could interpret the variables in your data set. Instead of including notes and explanations in your data sheet, attach a separate file in the data repository with detailed information. Think of this as an expansion of the metadata attached to your data.

The following are examples of elements to include in your data dictionary: the name of each variable, a short plain-language description, units of measurement, the data type, allowed values or ranges, and the codes used for missing data.

Include anything else that someone looking at your data for the first time would need to know to understand the raw data. For instance, if data from one of your field sites is missing due to instrument failure or extreme weather conditions, add a note in the data dictionary explaining the anomaly. If you use acronyms in your spreadsheets, provide the full term and a brief explanation. If you use a coding system, explain how the system works.

[Screenshot of an example data dictionary not reproduced.]
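Continuing the hypothetical spreadsheet sketched in the previous section, a minimal data dictionary might look like this:

    variable            description                          units            type     missing code
    site_id             Unique identifier for a field site   none             text     none
    date                Sampling date (ISO 8601)             none             date     none
    temp_c              Soil temperature                     degrees Celsius  numeric  NA
    soil_moisture_pct   Volumetric soil moisture             percent          numeric  NA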

E.  Quality Assurance/Quality Control and Project Management

 

Step 1: Plan - define start and end dates and establish line item resource budgets

Step 2: Organize - specify roles and responsibilities for project personnel

Step 3: Control - keep project personnel organized, focused, and motivated

F.   Databases and Queries

Databases can be used to organize, store, manage, and retrieve information. A database is an organized collection of facts and information that can come in many forms, from simple text or numbers to images, sounds, and videos. You can import your data from spreadsheets and other formats into a database to make the data easier to find later. Relational database systems are the kind you are most likely to use in your research: they store data in one or more tables of labeled rows and columns. A database is searchable through a query, meaning that a specific term or number can be looked up to find information quickly within the database.

Many databases use Structured Query Language (SQL), the American National Standards Institute’s standard language for relational database management systems. SQL is used to communicate with a database. Standard SQL commands include “select,” “insert,” “update,” “delete,” “create,” and “drop.” For instance, our course management system provides an interface that issues commands to the course database when students add and drop classes. For more information on how to use SQL, visit http://www.sqlcourse.com/ for a tutorial. A minimal sketch of these commands in action follows.
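The sketch below uses Python’s built-in sqlite3 module, with hypothetical table and column names:

    import sqlite3

    # An in-memory database for demonstration; pass a file path to persist it
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # CREATE defines a table; INSERT adds rows to it
    cur.execute("CREATE TABLE samples (site TEXT, date TEXT, temp_c REAL)")
    cur.execute("INSERT INTO samples VALUES ('TULLEY-01', '2015-04-12', 18.2)")
    cur.execute("INSERT INTO samples VALUES ('VINCENT-01', '2015-04-13', 16.4)")

    # SELECT retrieves the rows that match a condition
    for row in cur.execute("SELECT site, temp_c FROM samples WHERE temp_c > 17"):
        print(row)  # prints: ('TULLEY-01', 18.2)

    conn.close()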

        

Sources and further materials:

Data Management Video Series. File Naming Conventions. University of Wisconsin Data Services. <https://www.youtube.com/watch?v=4NH1KIXV6qI> Accessed April 13, 2015.

Data Management Video Series. Data Dictionaries. University of Wisconsin Data Services. <https://www.youtube.com/watch?v=Fe3i9qyqPjo> Accessed April 13, 2015.

Data Management Video Series. Spreadsheet Best Practices. University of Wisconsin Data Services. <https://www.youtube.com/watch?v=f11-0Ce1i3I> Accessed April 13, 2015.

Dripdot. How to Create a Data Dictionary. <https://www.youtube.com/watch?v=AeVJy-ow2b0> Accessed April 14, 2015.

File Naming Conventions. Data Management: Data. University of Nebraska-Lincoln. <http://unl.libguides.com/c.php?g=51638&p=333885> Accessed April 15, 2015.

ISO 8601:2004(en). Data Elements and Interchange Formats - Information Interchange - Representation of Dates and Times. <https://www.iso.org/obp/ui/#iso:std:iso:8601:ed-3:v1:en> Accessed April 13, 2015.

Portny, Stanley E. and Jim Austin. 2002. Project Management for Scientists. Science. <http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2002_07_12/nodoi.11589789757837229753> Accessed April 23, 2015.

Project Management Advisor. <http://www.pma.doit.wisc.edu/> Accessed April 23, 2015.

SQL Course Curriculum. Interactive Online SQL Training. <http://www.sqlcourse.com/> Accessed April 15, 2015.

III.    Backing Up Data

In addition to depositing your data in a repository for the use of future researchers, it is important to back up your data throughout the research process. Properly storing data in multiple locations helps prevent your data from being lost if something happens to your computer. Backing up data may have associated costs, but these must be weighed against the cost of losing your data.

A. 3-2-1 Rule: keep 3 backup copies, on 2 different storage media types, 1 of which is offsite. A minimal sketch of the first two copies follows.
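As a sketch (the directory paths are hypothetical; one of the services listed under D below could hold the third, offsite copy), the following Python script mirrors a data directory to an external drive and verifies each file with a checksum:

    import hashlib
    import shutil
    from pathlib import Path

    src = Path("field_data")                   # working copy (copy 1)
    dest = Path("/Volumes/backup/field_data")  # external drive (copy 2)

    # Mirror the directory (the destination must not already exist),
    # then verify that every file copied intact
    shutil.copytree(src, dest)
    for f in src.rglob("*"):
        if f.is_file():
            copy = dest / f.relative_to(src)
            match = (hashlib.md5(f.read_bytes()).hexdigest()
                     == hashlib.md5(copy.read_bytes()).hexdigest())
            print(f, "OK" if match else "MISMATCH")

A backup that has never been verified is a hope, not a backup; checking checksums, or simply reopening a few files from the copy, catches transfer errors early.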

        B. Choosing a Backup System

C.  Data Backup Locations

D. Recommended Data Backup Services

        Carbonite http://www.carbonite.com/ 

        IDrive https://www.idrive.com/ 

        SugarSync  https://www.sugarsync.com/ 

Sources and further materials:

Behind the Scenes of SSL Cryptography: Everything You Want to Know About the Cryptography Behind SSL Encryption. DigiCert. <https://www.digicert.com/ssl-cryptography.htm> Accessed April 30, 2015.

Carbonite. Personal Plans. <http://www.carbonite.com/online-backup/personal/how-it-works> Accessed April 30, 2015.

2015 Best Online Storage Services Review: Reviews and Comparisons. <http://online-storage-service-review.toptenreviews.com/> Accessed April 30, 2015.

University of Oregon Libraries. “Data Storage and Backup.” Research Data Management. <https://library.uoregon.edu/datamanagement/storage.html> Accessed April 30, 2015.

Levkina, Maria. 2014. How to Follow the 3-2-1 Backup Rule with Veeam Backup & Replication. <http://www.veeam.com/blog/how-to-follow-the-3-2-1-backup-rule-with-veeam-backup-replication.html> Accessed May 4, 2015.

IV.    Metadata

Metadata is defined as “data about data” or, more specifically, “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.”

Zeng & Qin (2008)

A. Metadata Standards

  1. Facilitate the exchange of information in an accurate manner, maintaining the quality of the data so others can easily understand it.
  2. Guide the “design, creation, and implementation of data structure, values, content, and exchange in an efficient and consistent manner” (Zeng & Qin, 2008).
  3. Are created by an organization or community with a focus on making metadata work smoothly by fitting specific criteria.

Metadata standards work in concert with controlled vocabularies, metadata schemas, and best practices.

B. Controlled Vocabularies

Controlled vocabularies are used to achieve consistency in the description of information resources.

  1. They range in complexity from a short list to a thesaurus with thousands of terms and associated relationships among terms.
  2. Terms in a controlled vocabulary must have unambiguous, unique definitions.
  3. They ease the translation of records into other languages.
  4. They provide consistency and uniformity when describing a resource, and an easy way to label, browse, and retrieve content.
  5. One of the most commonly used controlled vocabularies is the Library of Congress Subject Headings: http://id.loc.gov/authorities/subjects.html
  6. Each discipline usually has its own specific vocabulary; the key is to be consistent and use the official terms.

C. Metadata Schema

  1. A method for encoding metadata into XML or other machine-readable languages, guided by the use of an element set.
  2. Element set - defines each field of information that is required, repeatable, or optional in a metadata record. It serves as a dictionary and guide for the schema, with examples of how to encode the information, and its specific elements vary from schema to schema.
  3. For example, the Dublin Core Metadata Element Set defines fifteen core elements (all optional and repeatable): contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type. (A minimal encoding sketch follows.)
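As a sketch of what an encoded record looks like, the following Python snippet uses the standard library’s ElementTree to build a minimal Dublin Core record in XML; the dataset title, creator, and date are hypothetical:

    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
    ET.register_namespace("dc", DC)

    record = ET.Element("record")
    for element, value in [
        ("title", "Shrub Willow Soil Flux Measurements"),  # hypothetical dataset
        ("creator", "Bressler, Alison"),
        ("date", "2015-04-12"),
        ("type", "Dataset"),
    ]:
        ET.SubElement(record, "{%s}%s" % (DC, element)).text = value

    print(ET.tostring(record, encoding="unicode"))
    # <record xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:title>...</dc:title>...

In practice a repository or a tool such as Morpho (described below) generates this encoding for you; the point is that each element of the set becomes a machine-readable field.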

D.  Best Practices

  1. Researchers, both faculty and students, should add metadata as completely as possible.
  2. Ideally, the metadata schema is determined in the project planning stages.
  3. Consult with librarians and/or information professionals early in the process to determine what language or schema will be used, which data can be collected or automated, and to streamline the process.
  4. Existing guides can help inform you of exactly how to create your metadata.  Most metadata standards have active communities that constantly add to and amend best practices.

Scientists build on each other’s work and constantly repeat experiments to check the validity of results. Metadata ensures that the information you create and record can be distributed properly, accurately, and effectively to members of your specific field of study.

References:

Zeng, M., & Qin, J. 2008. Metadata. New York: Neal-Schuman.

E. Existing Metadata Standards

1. Dublin Core (the element set described in the previous section)

2. Darwin Core: http://tools.gbif.org/dwca-assistant/

3. EML - Ecological Metadata Language: http://www.dcc.ac.uk/resources/metadata-standards/eml-ecological-metadata-language

4. Geospatial metadata standards (FGDC and ISO): http://www.fgdc.gov/metadata/geospatial-metadata-standards

5. Morpho, a data management tool for earth, environmental, and ecological scientists: https://knb.ecoinformatics.org/#about

V. Data Management Plan (DMP)

A DMP is a brief document that describes how the researcher will create, manage, and share their research data and results. DMPs are usually created to comply with federal funding requirements, but they are also excellent tools for lab groups to facilitate and coordinate data gathering. According to NSF policies, a DMP considers five main points:

1.  The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;

2.  The standards to be used for data format, metadata format, and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);

3.  Policies for access and sharing, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;

4.  Policies and provisions for re-use, re-distribution, and the production of derivatives;

5.  Plans for archiving data, samples, and other research products, and for preservation of access to them.

Note: the exact DMP requirements depend on the funding agency, but most agencies’ requirements are very similar.

The Data Management Planning Tool (https://dmptool.org/) uses standards and templates drawn from the requirements of various funding agencies to help you create a DMP. The tool prompts you to create a login and pick a funding agency; once you fill in the sections and answer the required questions, it generates the document for you.

Here is an example taken from DMP tool: (https://dmptool.org/plans/8278.pdf)

VI. Citation

Citing datasets is just as important as citing journal articles: give credit to the creator of the data set the way you would give credit to an author. Citation styles for datasets and metadata differ from traditional citations of the written word. Although there are multiple citation standards (MLA, APA, Chicago, and so on) to choose from for traditional citations, there are limited standards for citing data. The DataCite initiative (https://www.datacite.org/) is developing a set of standards, outlined below:

DataCite Recommended format:

  1. Creator (PublicationYear): Title. Publisher. Identifier
  2. Creator (PublicationYear): Title. Version. Publisher. ResourceType. Identifier

        Examples:

Note: DataCite recommends that DOI names be displayed as linkable, permanent URLs, which makes finding the data later much easier. For easy reference, here is a link to a citation formatter that takes a DOI as input: http://crosscite.org/citeproc/ A scripted version of the same idea is sketched below.
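If the dataset already has a DOI, the same formatting can be scripted through DOI content negotiation. A sketch in Python, assuming the third-party requests library is installed (the DOI below is a placeholder to substitute):

    import requests

    doi = "10.xxxx/xxxxx"  # placeholder: substitute the dataset's actual DOI
    response = requests.get(
        "https://doi.org/" + doi,
        headers={"Accept": "text/x-bibliography; style=apa"},
    )
    print(response.text)  # a formatted citation for the dataset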

If the journal article that describes the dataset is available, cite the article along with the data set.

Citation Managers:

Research tools like Zotero and Mendeley can make citing your data much easier. These programs are citation managers: they consolidate your references and make writing the citation section of a paper much faster. For a comparison, see Zotero Versus Mendeley: https://www.youtube.com/watch?v=53QRA_SNSzE

VII. Ownership

SUNY TITLE J – Patents, inventions, and copyright policy:  http://www.suny.edu/sunypp/documents.cfm?doc_id=88

A. ORP Policy 10 - Ownership of Data 

Summary - Sections Pulled Directly from Original Document

This policy was developed by SUNY-ESF in order to provide the basis for identification of ownership of information. Data ownership can be complicated when a team of researchers collectively acquires, stores, analyzes, and publishes data. Regulations and standards associated with different funding sources may further complicate things. There is a need for clear policies on the rights and responsibilities of data ownership in the University setting. These policies may be specific to each university (ORP Policy 10). For example:

Eastern Michigan University’s Policy:

 “Raw data (including laboratory notebooks) and processed data are generally owned by the University.  University ownership is subject to conditions established by granting agencies or contracts with sponsors. Management of research data according to these conditions is implicitly delegated to the Principal Investigator and the Administrator of the unit in which he/she works.” (2009 Policy)

 

SUNY-ESF adopted Eastern Michigan’s principles of Data Ownership (ORP Policy 10).

B.  Definitions Set by SUNY-ESF

Data: The NIH Grants Policy Statement (2001) defines data as:

“Recorded information…including writings, films, sound recordings, pictorial reproductions, drawings, designs or other graphic representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, data processing or computer programs (software), statistical records and other research” materials.

We clarify further that other research materials include field notebooks, physical collections, biological specimens, cell lines, digital database files and other tangible artifacts.

 

Ownership: Termed a “bundle of rights” by Shores (2001-2). We define data ownership as:

         The right to possess, use, manage, gain income from, copyright, patent, register, destroy, control indefinitely, and transmit data; and the duty to refrain from its use in ways harming others.  (adapted from A.M. Honore 1961 and Washington State interagency agreement)

 

Usual University resources: Resources commonly provided or made available to faculty (University of Michigan 2007).  These include, but are not limited to, libraries and other buildings, computers, secretarial and administrative support staff, and supplies.

 

Independent work: This is work separate from extramurally-funded grants or contracts. It may be carried out utilizing University space, facilities, or usual University resources, by faculty or by students, but not as a directive from University administration or faculty. (ORP Policy 10)

A full list of policies including this one can be found at the link below:

        http://www.esf.edu/research/resources/

VIII. Data Repositories

A.  What is a data repository?

A data repository is a location for collecting, archiving, and sharing data sets and their metadata. It allows others to view and search for data they may be interested in, and it preserves data for future use.

B.  How do they work?

First, choose the repository that you would like to use. Each repository has guidelines for the types of data and files it accepts, so review those guidelines before depositing. Once you meet them, you can share your own data as well as search and view others’. It is useful to include a “read-me” file with metadata so that others can understand the data sets they are viewing; a sketch of such a file follows.
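As a sketch, reusing the hypothetical dataset from earlier sections, a minimal read-me might contain:

    README for: Shrub Willow Soil Flux Measurements (hypothetical example)
    Creator: A. Bressler, SUNY-ESF
    Coverage: April-October 2015, Lake Ontario watershed field sites
    Files: samples.csv (observations); data_dictionary.csv (variable definitions)
    Format: CSV, UTF-8; missing values coded NA
    See data_dictionary.csv for units and variable descriptions.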

        C. Examples

There are many data repositories covering all types of data. Whether you work in ecology, geographic information systems, chemistry, or another field, there is a repository you can use to search for, archive, or share data.

Example of the types of data stored on Digital Commons @ ESF

[Screenshot not reproduced.]