Presented by Todd Vision, Department of Biology
With Barrie Hayes, Bioinformatics Librarian, UNC Health Science Library, and
Ruth Merinshaw, AVC for Research Computing at UNC
The third in a series of short courses from the Odum Institute and the UNC Libraries on data management plans and resources. This course focuses on guidelines and resources appropriate to life scientists.
When: Monday Jan 31, 2011, 12pm - 1pm
Where: 107 Wilson Hall (or online)
Abstract: The NSF has instituted a new requirement for all grant proposals starting January 18, 2011 to include a Data Management Plan. I will explore what exactly this means, and discuss some of the resources available to researchers faced with this new requirement. I will then present Dryad, a repository for digital data associated with published articles in the biosciences. I will use Dryad to illustrate the different decision points researchers face in formulating a data management plan. There will be ample time for questions and discussion with the presenters.
URL for this document: http://bit.ly/hQscUP (temporary) or https://docs.google.com/document/d/1JuGwQ93uAAlGYy_KOXB7UzKttooIvHK46cUp0AzooCo/edit?hl=en#
Data collection represents a significant fraction of the research enterprise from the perspective of a funding agency. It is generally in the interests of science, and of individual researchers, that data be made broadly available in order to validate previous findings, explore new analysis methodologies, compare findings across studies, and repurpose data for research questions unanticipated by the original authors. The essence of the case was nicely put in graphic form in this slideshow from Heather Piwowar.
Accordingly, NSF is following many years of recommendations from the National Academy of Sciences, the White House, the National Science Board, and others, in instituting a policy encouraging the sharing of scientific data and materials. As of January 18th, 2011, all grant proposals to the National Science Foundation (NSF) are required to include a data management plan that describes how the proposal will conform to NSF policy on the dissemination and sharing of research results.
This document can serve as a reference to the pertinent policies and help resources as of January 2011, as I have made an attempt to copy the relevant information here. Note that many institutions, like UNC Libraries, are actively assembling online resources on how to develop DMPs, and are likely to stay more current over time than this document.
You may be aware that the National Institutes of Health have had a requirement for data management plans since 2003 for extramural projects with annual budgets greater than $500K. By contrast, the new NSF mandate means that grant applicants will need to consider a DMP no matter what the scale of funding.
At the same time, a growing number of journals and publishers are requiring archiving of data associated with publications, including Nature, PLoS, BMC, and (as of January) many leading journals in ecology, evolution, and genetics.
Also worth noting is that the Office of Management and Budget has maintained since 1999 that data from federally funded research is subject to the Freedom of Information Act. And in the UK, at least, filing a data management plan that specifies how and when data will become available has been deemed to provide sufficient grounds for denying the equivalent of a Freedom-of-Information request. This may be of importance to those doing research on matters of relevance to public policy. (Caveat: I am not, thank goodness, a lawyer).
2. What exactly is required?
According to the Grant Proposal Guide, the new requirement is the following: Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include:
Some administrative details:
3. NSF Policy on Dissemination and Sharing of Research Results
Data management requirements and plans specific to the Directorate, Office, Division, Program, or other NSF unit relevant to a proposal are available. If guidance specific to the program is not available, then the requirements established [above] apply.
The specific guidelines are intended to reflect the “community norm” in each field. They will surely evolve as researchers get more used to preparing and reviewing DMPs, and best practices emerge. They generally include such things as
Unlike most other NSF Directorates, the Biology Directorate (BIO), as of January 2011, has not issued more specific guidelines, though it may still do so in the future. As an example of how this mandate is being implemented within BIO, the Division of Environmental Biology has this somewhat generic statement on its homepage:
Proposals submitted to all programs in DEB must adhere to the general NSF policy on data sharing as described in the Grant Proposal Guide... Thus, proposals should describe plans for specimen and information management and sharing, including where data and metadata will be stored and maintained, and the likely schedule for release. These plans will be considered as part of the review process.
As an example of how much the specific guidance can vary among the units within NSF, consider the time at which data should be made available:
Thus, it is important to be aware of the specific guidance relevant to your program, as it can vary considerably.
Of course, the expectations for what goes in a DMP will also vary across agencies (NOAA, NIH, etc) should you be lucky enough to have funding from different sources.
NSF provides answers to the common questions it receives from the community on its website. As of the last update in November 2010, there were 17 FAQs (all copied here, for the sake of completeness).
1. What constitutes “data” covered by a Data Management Plan?
What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models.
2. Is a plan for Data Management required if my project is not expected to generate data or samples?
Yes. It is acceptable to state in the Data Management Plan that the project is not anticipated to generate data or samples that require management and/or sharing. PIs should note that the statement will be subject to peer review.
3. Am I required to deposit my data in a public database?
What constitutes reasonable data management and access will be determined by the community of interest through the process of peer review and program management. In many cases, these standards already exist, but are likely to evolve as new technologies and resources become available.
4. There is no public database for my type of data. What can I do to provide data access?
Contact the cognizant NSF Program Officer for assistance in this situation.
5. Should the budget and its justification specifically address the costs of implementing the Data Management Plan?
Yes. As long as the costs are allowable in accordance with the applicable cost principles, and necessary to implement the Data Management Plan, such costs may be included [...]
6. My institution's policy is that the data and all supporting materials from all research are owned and must remain with the institution if I leave. How does this policy affect what I can say about data management and access?
Data maintenance and archiving by an institution is one avenue by which data preservation and access can be achieved. However, the data access plan must address the institutional strategy for providing access to relevant data and supporting materials.
7. Does data management and access include supporting documentation and metadata, such as validation protocols, field notebooks, etc.?
All researchers are expected to be able to explain and defend their results. Doing so usually entails maintaining complete records of how data were collected. The manner in which one maintains such records and makes them available to others will vary from project to project. What constitutes reasonable procedures will be determined by the community of interest through the process of peer review and program management. These standards are likely to evolve as new technologies and resources become available.
8. How long should data be archived and made accessible?
What constitutes reasonable procedures will be determined by the community of interest through the process of peer review and program management.
9. Does this policy mean that I must make my data available immediately, even before publication?
Not necessarily. The expectation is that all data will be made available after a reasonable length of time. However, what constitutes a reasonable length of time will be determined by the community of interest through the process of peer review and program management.
10. What are NSF’s expectations regarding the release of data that include sensitive information (e.g., information about individuals or locations of endangered species)?
Such data must be maintained and released in accordance with appropriate standards for protecting privacy rights and maintaining the confidentiality of respondents. Within legal constraints, what constitutes reasonable data access will be determined by the community of interest through the process of peer review and program management. [I would add that if your research involves handling sensitive information, it should be addressed through a filing with the relevant Institutional Review Board, and you should look carefully at the infrastructure for keeping that data private. We have had a recent experience at UNC with a security breach of identifiable medical records, and it did not go down well for the researcher involved].
11. My data include information of potential commercial value. Am I required to make that information available?
Not necessarily. It is NSF’s strong expectation that investigators will share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. However, it is also necessary to protect intellectual property rights and potential commercial value. The Data Management Plan should describe the proposed approach, which will then be subject to peer review and program management. (For example, research use of sensitive data is often allowed through reasonable binding agreements that contain confidentiality provisions.)
12. Does NSF have particular requirements for archiving and accessibility of samples, physical collections and so forth?
No. If appropriate, your Data Management Plan should describe the types of samples, and/or collections, etc., that you will use, as well as personal, institutional or other repositories for archiving and providing access to others. What constitutes reasonable archiving and accessibility will be determined by the community of interest through the process of peer review and program management.
13. Does NSF have particular requirements for what types of samples, physical collections, and so forth should be saved?
No. What constitutes reasonable requirements will be determined by the community of interest through the process of peer review and program management. These standards are likely to evolve as new technologies and resources become available.
14. If data or samples are requested before I have completed all analyses on them, must I share them?
No. The expectation is that all data will be made available after a reasonable length of time. One standard of timeliness is to make the data or samples accessible immediately after publication. However, what constitutes a reasonable length of time will be determined by the community of interest through the process of peer review and program management.
15. How does this policy relate to the issue of open access publishing?
Open-access publishing (making all published articles freely available) is a separate issue that is not addressed in the implementation of the data management plan requirement.
16. If I participate in a collaborative international research project, do I need to be concerned with data management policies established by institutions outside the United States?
Yes. There may be cases where data management plans are affected by formal data protocols established by large international research consortia or set forth in formal science and technology agreements signed by the United States Government and foreign counterparts. Be sure to discuss this issue with your sponsored projects office (or equivalent) and your international research partner when first planning your collaboration.
17. My proposal is interdisciplinary and there are multiple sets of guidance to follow on NSF's website (http://www.nsf.gov/bfa/dias/policy/dmp.jsp); which one do I follow?
All proposals are submitted to a lead program, with the option to specify other programs that the PI would like to consider the project. If the guidance appropriate to the lead program applies, it should be followed. Otherwise, provide a clear explanation of what you would do if the project were funded. Bear in mind that the merit review is conducted by colleagues from the communities of relevance and that your plan should be convincing to them. It should also be commensurate with the level of effort requested and required, and appropriate to the long-term value and benefit to your colleagues of any generated data products.
For many projects, the data of value will eventually be tied to one or more publications. Many major journals now encourage or require the data associated with the published article to be made available to other researchers. In fact, most major scientific, technical and medical (STM) publishers are signatories to the 2007 Brussels Declaration on STM Publishing [note: PDF link], which states:
Raw research data should be made freely available to all researchers. Publishers encourage the public posting of the raw data outputs of research. Sets or sub-sets of data that are submitted with a paper to a journal should wherever possible be made freely accessible to other scholars.
While this statement is generally interpreted as a stand against open access publishing models, and has engendered controversy for that reason, it is at the same time an endorsement of the principles of open data.
There are several mechanisms for sharing data associated with publications:
A public archive is generally the preferred way to provide the custodianship needed to ensure future access. The Joint Data Archiving Policy (which came into force this month at the American Naturalist, Evolution, Heredity, Journal of Evolutionary Biology, Molecular Biology and Evolution, Molecular Ecology, and Systematic Biology, among others), for instance, gives preference to a public archive. It reads:
Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. As a condition for publication, data supporting the results in the paper should be archived in an appropriate public archive, such as << list of approved archives here >>. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
Dryad is a repository for digital data underlying publications in the biosciences. It is my close involvement with Dryad that makes me remotely qualified to lead this short course. Dryad is an important resource for investigators looking to comply with new funder and journal mandates for the sharing of data, because it does not specialize in one data type or another -- it can provide a home for any data that a researcher may need to archive.
Some key features of Dryad include:
The Wellcome Trust offers this set of questions to ask yourself when developing a DMP:
You can easily imagine filling this out in the form of a table, one row for each distinct collection of data or materials, and then simply using this table directly, or putting it into prose, for the plan itself. If you expand this to include software and materials, then you have a good template for the NSF plan. More detailed checklists are available from other sources.
For instance, here are some examples from a DMP that I submitted to NSF only last week as part of a grant proposal with Jason Reed and Esther Van der Knaap. One experiment concerns data for which no specialized repository exists, and so the data will be archived with Dryad.
Time course of fruit development and auxin levels. Measurements of seed number, fruit size/weight, endosperm development, and auxin levels will be made in mutant genotypes and introgression lines as well as recurrent parent genotypes (MicroTom, Ailsa Craig), through the stages of fertilization, endosperm proliferation, and fruit growth. Data will be archived and disseminated upon publication of the results in the form of self-documenting spreadsheets through the Dryad Digital Data Repository (http://datadryad.org). Publications expected in year 2.
The following text was used to describe the timing, terms of reuse, and other policies governing Dryad submissions.
Data submitted to Dryad will be made publicly available … upon online publication of the associated article (i.e. without the optional 1-year embargo). It is required that all data in Dryad be released to the public domain without legal restrictions on reuse through a Creative Commons Zero waiver. There is a (legally non-binding) expectation of attribution of the Dryad data record and associated article. A one-time data deposit charge is paid by the authors or the associated journals, which allows Dryad data to be available for download without cost to users.
But not all data from the proposal were designated to go to Dryad.
Expression data. RNA-Seq data from tomato ovules/seeds and ovaries at multiple time points during fertilization in multiple mutant and wild-type genotypes will be archived and made freely available for reuse through the Gene Expression Omnibus, following MIAME standards, prior to publication … [expected in Yr 3].
And, in some cases, we needed to consider the sharing of materials rather than data, and respect the policies of collaborating organizations:
TILLing mutants for ARF and PRC2 genes. Seeds from the originally screened tomato mutant lines will be available through the INRA Fruit Biology Laboratory or the UC Davis TILLing facility, according to the terms of each facility. Policies are under development at both institutions, but likely will be similar to the current UC Davis policy for rice TILLing, which honors a one year non-disclosure agreement on the identity of the mutation to the original investigator, and subsequently makes seed available to researchers in the community upon request. Late-stage introgression lines derived from these materials, including possible multiple-mutation introgression lines, in MicroTom and other genetic backgrounds, will be offered for gratis distribution through the C.M. Rick Tomato Genetics Resource Center ... [expected in Yr 3].
The policy is new, so there are not many available examples of NSF DMPs, but ICPSR (a social sciences data archive) maintains a long list of example DMPs from different disciplines (though these are mostly for relatively “big science” projects).
See the UNC Data Management Toolkit
11. For more information
I highly recommend Whitlock (2010) as a starting place to explore the issues, and in particular best practices, in data sharing.
For updates on Dryad, such as what new journals are coming online, we encourage you to subscribe to our low-volume mailing list, or follow our blog and tweets, which cover the world of data archiving more broadly.