Data Curation Network: A Cross-Institutional Staffing Model for Curating Research Data
Release date: 7-27-2017
This report represents the primary outcome of the project “Planning the Data Curation Network” funded 2016-2017 by the Alfred P. Sloan Foundation grant G-2016-7044.
Authors: Lisa R Johnston, University of Minnesota
Jake Carlson, University of Michigan
Cynthia Hudson-Vitale, Washington University in St. Louis
Heidi Imker, University of Illinois at Urbana--Champaign
Wendy Kozlowski, Cornell University
Robert Olendorf, Pennsylvania State University
Claire Stewart, University of Minnesota
This work is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation: Lisa R Johnston, Jake Carlson, Cynthia Hudson-Vitale, Heidi Imker, Wendy Kozlowski, Robert Olendorf, and Claire Stewart. (2017). Data Curation Network: A Cross-Institutional Staffing Model for Curating Research Data. http://hdl.handle.net/11299/188654.
Executive Summary
1.0 Introduction
2.0 Literature Review
2.1 History of Library Collaborative Staffing
2.2 Current Support for Data Curation in Academic Libraries
3.0 Methodology
3.1 Baseline Assessment of Local Curation Services
3.2 Researcher Engagement Sessions
3.3 Data Curation Pilots
3.4 Surveying the Data Curation Community
3.5 Developing a Financial Model
3.6 Local Metrics Tracking
4.0 A Cross-Institutional Staffing Model for Curating Research Data
4.1 Benefits of Using the Network
4.2 Roles and Responsibilities
4.3 Tiers of Participation
4.4 Criteria for New Partners
4.5 DCN Submission Workflow
5.0 Implementing the Data Curation Network
5.1 Phased Implementation Plan
5.2 Sustainability Plan
5.3 Assessment Plan
6.0 Acknowledgements
Author Bios
Appendix A: Roles and Responsibilities of Key DCN Staff
Appendix B: Draft Memorandum of Understanding for Institutional Partners
Appendix C: Draft DCN Workflows for DCN Curators
Appendix D: Functional Requirements for the DCN Tracking Form
Funders increasingly require that data sets arising from sponsored research be preserved and shared, and many publishers either require or encourage that data sets accompanying articles be made available through a publicly accessible repository. Additionally, many researchers wish to make their data available regardless of funder requirements, both to enhance their impact and to advance open science. However, the data curation activities that support preservation and sharing are costly, requiring training in advanced curation practices, specific technical competencies, and relevant subject expertise. Few colleges or universities will be able to hire and sustain locally all of the data curation expertise their researchers will require, and even those with the means to do more will benefit from a collective approach that allows them to supplement staff at peak times, access specialized capacity when infrequently curated data types arise, and stabilize service levels during local staff transitions. The Data Curation Network (DCN) provides a solution for partners of all sizes to develop or supplement local curation expertise with the expertise of a resilient, distributed network, and it creates a funding stream to both sustain central services and support expansion of distributed expertise over time. Our model will accelerate local capacity, strengthen collaboration between libraries and disciplinary projects, and significantly enhance libraries’ collective voice in conversations about the future of research data.
The Data Curation Network will serve as the “human layer” in a local data repository stack that provides expert services, incentives for collaboration, normalized curation practices, and professional development training for an emerging data curator community. Data curation enables data discovery and retrieval, maintains data quality, adds value, and provides for re-use over time through activities including authentication, archiving, management, preservation, and representation. Data curation requires a specialized skill set that spans a wide variety of data types (e.g., spatial/GIS, tabular, database, etc.) and discipline-specific data formats (e.g., chemical spectra, 3D images, genomic sequence, etc.). The Data Curation Network addresses this need by creating a cross-institutional staffing model that seamlessly connects expert data curators to local datasets.
The Data Curation Network model, and the research findings supporting it, are presented in this report as the primary outcome of the Alfred P. Sloan funded grant titled “Planning the Data Curation Network” that ran from May 2016–June 2017. To implement the Data Curation Network we propose:
Next, the Data Curation Network will be implemented to accomplish our mission to better support researchers who face a growing number of requirements to ethically share their research data. Our vision for the Data Curation Network is to:
Research data have value beyond their original purpose. They can be used to demonstrate findings, enable new discoveries, reproduce and validate results, and be repurposed in surprising new ways that their creators may never have imagined. Yet data, captured in a multitude of digital file formats through an ever-increasing number of techniques, are constantly at risk of falling short of their long-term reuse potential. Data can be messy and incomprehensible. They often lack important documentation, metadata, and other characteristics that might otherwise secure their long-term usefulness. In addition, their fragile digital shells are not resilient enough to protect them from format obsolescence and other ill effects of digital deterioration, such as bit rot. Finally, the reality for most data created today is that they never leave the local environment in which they were first stored and, therefore, as time goes by, become victims of benign neglect.
On the other hand, well-curated data are valued by the scholarly communities that produce them. Professionally curated data are easier for fellow scholars and future collaborators to understand, are more likely to be trusted, and the research they represent is more likely to be reproducible (Roche, Kruuk, Lanfear, & Binning, 2015; McNutt et al., 2016; Smith & Roberts, 2016; Beagrie & Houghton, 2014). As a consequence, and to counteract their ephemeral and swiftly eroding nature, requirements for digital research data to be managed, shared, and preserved have emerged. Researchers worldwide face emerging mandates and altruistic pressures to share their research data in ways that make them findable, accessible, interoperable, and reusable, or FAIR (Wilkinson et al., 2016). For example, in the United States, many recipients of federal research funding must address how their research will be “publicly accessible to search, retrieve, and analyze” in a written Data Management Plan appended to their grant applications (Holdren, 2013). Policies from funders such as the National Science Foundation (https://www.nsf.gov/bfa/dias/policy/dmp.jsp) and the Bill and Melinda Gates Foundation (http://www.gatesfoundation.org/How-We-Work/General-Information/Open-Access-Policy) serve as examples of these requirements.
But funders are not the only drivers for researchers to share their data. Researchers also face a growing number of publisher expectations to include digital data in the peer review process and share them alongside resulting publications. Journal data sharing policies, such as those held by PLoS ONE (http://journals.plos.org/plosone/s/data-availability) and Nature Publishing Group (http://www.nature.com/authors/policies/availability.html#data), require all data underlying research results to be made openly available for sharing and reuse (Stodden, 2012). Often, reproducibility is a driving factor for these policies. Some disciplines have embraced the open data movement as a positive development that will foster expanded practices in validation and replication (Munafò et al., 2017), and may even safeguard against scientific fraud or the dissemination of erroneous results (Fecher, Friesike, Hebing & Linek, 2017).
Academic and research libraries have followed research and scholarly communications trends related to research data with great interest. Libraries are experts in identifying, selecting, organizing, describing, preserving, and providing access to information materials in print and digital formats. And as a critical agency of their parent institutions, academic libraries are persistent, with demonstrated and sustainable models for providing services such as collection management, preservation, and access to a broad variety of information. Librarians and archivists understand the value, and the challenges, of creating and preserving information for future generations, and recognize that specialized curatorial actions must be taken to preserve data and other materials for reuse. This curation enables discovery and retrieval, maintains data quality, adds value, and provides for reuse over time through activities including authentication, archiving, management, preservation, and representation (Cragin, Heidorn, Palmer, & Smith, 2007). Thus data curation, the active and ongoing management of data through its lifecycle of interest and usefulness, is central to our mission and has become an important role for academic research libraries as we transform our workforce to assume greater digital stewardship responsibilities in the academy (National Research Council, 2015). For example, over the last decade institutional repositories (IRs) that were originally launched to support open access to traditional scholarship, such as articles and theses, have risen to the challenge of providing access to the many types of digital data, in a variety of formats, that these overwhelmingly multi-disciplinary institutions generate.
Based on well-established archival models, such as standardized OAIS-compliant software architectures (Consultative Committee for Space Data Systems, 2011), IRs provide the technical infrastructure to make digital research data accessible, retrievable, and reliably persistent in all the ways to which a trusted digital repository might aspire.
Yet, among the advances in the technical aspects underlying a digital repository (those that provide storage, ingest, description, access, and preservation), one challenge looms large: the expertise of a data curator. The curation staff, or the “human layer” in the repository stack, bring the disciplinary knowledge and software expertise necessary for reviewing and curating data deposits to ensure that the data are reusable. Due to the heterogeneous and multidisciplinary nature of research data generated in our academic institutions, the skills and expertise required to curate data (to prepare, arrange, describe, and test data for optimal reuse) cannot be fully automated, nor can they reasonably be provided by a few experts siloed at a single institution. Multiple data curation experts are needed to effectively curate the diverse data types an IR typically receives (Bloom et al., 2016; Johnston, 2014). Yet, given limited resources, it is unrealistic to expect that every academic library can hire a data curator for every data type (e.g., GIS, tabular spreadsheets, statistical survey, video and audio, computer code) or discipline-specific data set (e.g., genomic sequence, chemical spectra, biological image) an IR might encounter. Similarly, each type of data curation expertise might only be utilized intermittently, depending on the disciplinary focus at each institution.
The Data Curation Network (DCN) addresses the challenge of scaling domain-specific data curation services collaboratively across a network of multiple institutions and digital repositories in order to provide expert data curation services in disciplines and domains beyond what any single institution might offer alone. The planning phase project called “Planning the Data Curation Network” ran from May 2016–June 2017 with support from the Alfred P. Sloan foundation. The project team for the DCN planning phase brought together research data librarians, data curation experts, and academic library administrators from six academic institutions that each, separately, provide repository and curation services to their campuses: the University of Minnesota, Cornell University, Penn State University, the University of Illinois, the University of Michigan, and Washington University in St. Louis.
Over the course of the year-long planning phase, our team sought opportunities to broadly present our work and discuss our ideas with colleagues. For example, we were featured at several conferences, including the 2016 SHARE Users Meeting sponsored by the Association of Research Libraries and the Center for Open Science held in Charlottesville, VA, the Joint 8th Research Data Alliance Plenary and SciDataCon 2016 conference held in Denver, CO, the 2016 Digital Library Federation Forum held in Milwaukee, WI, the winter 2016 Coalition for Networked Information meeting in Washington, DC, the 2017 International Digital Curation Conference held in Edinburgh, UK, the 2017 Research Data Access and Preservation summit held in Seattle, WA, the IMLS-funded Preservation Quality Tool (PresQT) Workshop held in Notre Dame, IN, the 2017 Big Ten Academic Alliance Library conference held in West Lafayette, IN, and the 2017 International Association for Social Science Information Services and Technology (IASSIST) meeting held in Lawrence, KS. As a result of these conversations it became clear that although our planning phase work was focused on the needs of US academic research institutions similar to the six represented by the project team, this model would scale to a wider range of organizational make-ups and affiliations, such as federal government agencies, international academic institutions, and small- and mid-sized liberal arts colleges. We very much welcome the opportunity to explore these and other avenues for broader interpretation of the DCN model.
We preface our model for implementing the Data Curation Network with a literature review and a summary of the practical lessons learned from our interviews with peer service programs and successful collaborative networks. Next, a methods section summarizes the research activities that informed development of the DCN model, including holding focus groups with researchers, running controlled data curation pilots, and surveying the library community about their existing support and plans for future services in these areas. Finally, we present our DCN model with a summary of the staffing roles, participation levels, criteria for bringing on new partner institutions, a proposed path to financial sustainability, and an implementation and assessment plan, all grounded in the measurable metrics and observed demand for data curation services across our six planning phase institutions.
The Data Curation Network model builds on a rich history of well-established collaborative service models in libraries. Not unlike our vast interlibrary loan networks that deliver books, articles, and other library collections across networked libraries, or the collective contributions of catalogers adding unique and specialized MARC records to national and international cataloging databases, or the more recent response to on-demand web-based user needs with the successful implementation of 24-7 library reference chat services, the DCN will build from our common need to provide scaled services in a shared way.
According to Weber (1976), collaborative work between libraries began in the latter half of the nineteenth century. One of the first areas of librarianship to be tackled in a networked manner was indexing and cataloging. By coming together to standardize the cataloging of materials, libraries of this period felt that expertise across institutions could be leveraged to create higher quality records.
The end of the nineteenth century also saw an increasing interest in libraries lending materials to one another. By lending materials, libraries could meet, according to a quote in the 1898 The Library Journal, “the growing demands of scholars, incapable of satisfaction by any one library, and the economical management of library finances” (Stuart-Stubbs, 1975). Interlibrary loan grew out of this grassroots and informal movement into a network of material exchange that has been successful throughout the United States and abroad.
Collective collection development was also identified in the mid to late nineteenth century as a necessity for libraries, and as a mechanism to fill patron needs and address limited budgets and limited space. This solution has come to encompass projects focused on centralized infrastructure for all types of collections and stewardship responsibilities, including digital services, print storage, preservation, and discovery, among others. For initiatives specifically around collective digital preservation and digital collection development and discovery, community-focused solutions have helped solve collective issues. “LOCKSS and HathiTrust represent community-sourced solutions that have enabled academic libraries to externalize stewardship functions that were previously organized locally at a much higher cost” (Dempsey, Malpas, & Lavoie, 2014, p. 30). Similar projects that have been built as community-supported solutions to digital collections include the Digital Public Library of America, the Digital Preservation Network, and DuraSpace.
Recently there has been momentum around managing shared print materials, that is, collectively sharing the management and preservation of print literature while decreasing local holdings. Dempsey et al. (2014) find that the development of “shared print management schemes represent a cost-effective alternative to institution-scale solutions, redistributing the costs of library stewardship across a broader pool of participants” (p. 30). Rather than being driven by community-focused solutions like those mentioned above, Dempsey et al. find that consortia are playing a strong role in organizing libraries for shared print services based on geography. Examples include the 2CUL project, the Cooperative Cataloging Partnership, the Association of Southeastern Research Libraries, the Statewide California Electronic Library Consortium, and the Western Regional Storage Trust.
In a similar vein, the appeal of a network-of-expertise model for delivering unique library services has been expressed in recent research (Kirchner et al., 2015). The authors recommend “...a pilot project in which experts at multiple institutions consciously create a shared approach to address specialized information needs or to solve a common problem” (p. 17). Additionally, Erway (2012) calls for a collaborative expert network for handling the variety of born-digital media held in the nation's libraries.
Data curation is a subset of a broader suite of research data management services (figure 1). A number of studies and surveys have explored the extent of research data services provided by academic libraries and found that support for research data management, including data curation, has increased steadily over time (Soehner, Steeves, & Ward, 2010; Tenopir et al., 2011; Tenopir et al., 2015). More recent explorations by Lee & Stvilia (2017) found that support for data curation in libraries is mainly built upon existing local IRs. IRs account for only a small percentage of the data repositories available to researchers, while discipline-specific data repositories (e.g., ICPSR, GenBank) and general-purpose repositories for data (e.g., figshare, Zenodo) enjoy growing use (Kindling et al., 2017).
Figure 1: Data curation as a component of research data services
Collaborative projects related to research data management (though not specifically focused on data curation services) are also underway. The Research Data Alliance (RDA, https://www.rd-alliance.org) launched as a community-driven international organization in 2013, and its special interest groups provide a venue for developing and establishing standards for data curation, such as those by the Publishing Data Workflows group (Bloom et al., 2015) and the newly formed Assessment of Data Fitness for Use working group. The Stewardship Gap project is an 18-month Sloan-funded project (http://www.colorado.edu/ibs/cupc/stewardship_gap) that reports gaps in how sponsored research data are preserved for future generations (York, Gutmann, & Berman, 2016). Educational preparation for data curation services, like the DigCCurr Professional Institute (https://ils.unc.edu/digccurr/institute.html) and the CLIR data curation post-doctoral fellowship program (https://www.clir.org/fellowships/postdoc), as well as information-sharing networks such as the Digital Liberal Arts Exchange (https://dlaexchange.wordpress.com) and the DataQ Project (http://researchdataq.org), are leading the way in training data curators on relevant best practices in the field as well as providing valuable forums for community building and networking. Finally, academic research library initiatives focused on data curation issues provide a platform for peer groups to share experiences and best practices.
Groups such as the SHARE Curation Associates program (http://www.share-research.org/about/our-team/associates-program), which is highly focused on computational-thinking competencies and technical skill development of repository staff in the United States, the UK-based Jisc Research Data Shared Service Project (https://www.jisc.ac.uk/rd/projects/research-data-shared-service), which seeks to build shared software and repository infrastructure for higher education institutions in the UK, and other emerging collaborative efforts such as the Curating for Reproducibility Consortium (http://cure.web.unc.edu) combine staff and best practices for furthering data curation service offerings in libraries.
Our cross-institutional team held regular discussions via bi-weekly conference calls and two in-person meetings over the course of the one-year planning grant to develop the Data Curation Network. The project kicked off with a two-day meeting held June 30–July 1, 2016, in Minneapolis, Minnesota, facilitated by Santiago Fernandez-Gimenez, a team collaboration expert based at the University of Minnesota. Key outcomes from this meeting allowed us to focus our efforts around a shared vision for the DCN in the next 3–5 years, acknowledge the potential barriers, and prioritize strategic directions for our team in the planning phase. Discussed in greater detail in this section, the following research activities were performed to inform and develop the DCN:
To understand the existing levels of support for data curation across our six planning institutions, we ran a baseline assessment in May 2016 that compared our services, repository technologies, local policies, and staffing and organizational structures. Our results indicated strong alignment: we all provide data curation services aimed at institutionally affiliated users, each operate a data repository using one of three widely used repository technologies (DSpace, Hydra/Fedora, and bepress), and all are committed to providing data curation services with similar levels of staffing and well-aligned policies in place (figure 2).
Figure 2: Baseline comparison of six data repository and curation services

Institution | Local Data Repository | Staffing Full (FT) and Part Time (PT)
University of Minnesota | Launched Nov 2014 | 0 FT / 7 PT (director, coordinator, and 5 data curators at 10% FTE each)
Cornell University | Launched Fall 2012 | 0 FT / 2 PT
University of Illinois | Launched May 2016 | 1 FT (developer) / 8 PT (director, curation, subject and functional specialists at 5-30% FTE each)
University of Michigan | Sufia 7.x (Hydra and Fedora), launched Feb 2016 | 1 FT / 5 PT (RDS manager, project manager, developers, subject specialists at 5-20% FTE each)
Penn State University | Sufia 7.x (Hydra and Fedora), launched Fall 2012 | 0 FT / 5 PT (product owner 40%, scrum master 10%, project manager and developers 75%)
Washington University in St. Louis | Digital Commons (bepress), launched Jan 2015 | 0 FT / 5 PT (coordinator, repository manager, copyright specialist, subject specialists)
The baseline assessment greatly influenced how we envisioned the DCN submission workflow. For example, in our cohort, most data curation services for local deposits occurred post-ingest, meaning that the dataset was first self-deposited to the local system by a researcher and either automatically accepted or accepted following appraisal (e.g., meeting local policy). In other words, the data went “live” for public viewing and access before curation staff took further action (figure 3). Our results suggested that the DCN should use a similar post-ingest curatorial review workflow to alleviate any concern about gaining access to datasets that are not publicly available (e.g., behind password protection) or interacting with unfamiliar repository technologies.
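To make the post-ingest sequence concrete, it can be sketched as a simple ordered state machine, where a deposit goes live before a curator acts on it. This is our own minimal illustration; the state names below are not drawn from any partner's repository software:

```python
from enum import Enum, auto

class DepositState(Enum):
    SELF_DEPOSITED = auto()  # researcher uploads the dataset to the local repository
    APPRAISED = auto()       # accepted automatically or after appraisal against local policy
    LIVE = auto()            # publicly viewable; in a post-ingest model this precedes curation
    CURATED = auto()         # a curator has reviewed the metadata and files

# Post-ingest workflow: note that LIVE comes before CURATED.
POST_INGEST_ORDER = [
    DepositState.SELF_DEPOSITED,
    DepositState.APPRAISED,
    DepositState.LIVE,
    DepositState.CURATED,
]

def next_state(state: DepositState) -> DepositState:
    """Advance a deposit one step through the post-ingest workflow."""
    i = POST_INGEST_ORDER.index(state)
    if i == len(POST_INGEST_ORDER) - 1:
        raise ValueError("deposit is already fully curated")
    return POST_INGEST_ORDER[i + 1]
```

Because curation happens after the "go live" step, a DCN curator only ever needs access to the public record, which is what makes the model independent of each partner's authentication and repository stack.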
Similarly, a common limitation found in our baseline assessment was an inability to host or publish large and active data sets. Acceptable deposit sizes ranged from 500 MB to 15 GB per file (larger ingests were mediated), and no institution offered repository services for active databases. Anticipating innovations in this area, we intentionally developed the DCN model to be independent of local repository infrastructure.
Figure 3: Comparison of workflows for data curation at six institutions
[The figure compares, for each institution, whether deposit is mediated or self-deposit, and orders the workflow steps: staging area for deposit; approval to accept or reject; review of metadata and files (at some institutions, on request); and the point at which data go live.]
Building on other user-needs assessments of researchers, performed via survey (Tenopir et al., 2011) and focus groups (Bardyn, Resnick, & Camina, 2012), the DCN team engaged researchers on the importance and utilization of data curation activities. Between October 21, 2016 and November 18, 2016 the team held six focus groups, one at each of the planning institutions, aimed at identifying the data curation areas where the DCN should place its focus. Using a mixed-methods approach (discussed in detail in our full report), we identified the data curation activities most important to our researchers, identified which activities were currently happening for their data, and, finally, asked how those activities were happening and how satisfied researchers were with the results.
In total we engaged with 91 researchers representing a broad mix of experience levels (e.g., faculty, graduate students, post-docs) and disciplines, and their input directly informed the DCN model. We found that most of the data curation activities presented to researchers were viewed as important or as having value to themselves or to their communities of practice. The activities that ranked most highly across two or more groups were:
Only four activities presented to researchers were ranked below a “3” on a 5-point scale and these were: Emulation, Restricted Access, Correspondence (with data author), and Full-Text Indexing. Our focus groups also revealed that while researchers were actively engaged in a variety of data curation activities for their data, no activity was happening in a satisfactory way for a majority of respondents. The activity that came the closest was Secure Storage, which was happening for 75% of our researchers and in ways that satisfied 38% of our researchers (figure 4).
Figure 4: Levels of satisfaction with the top 12 data curation activities* for researchers
[For each activity (e.g., Secure Storage, File Format Transformations), the figure charts responses to “Does this activity happen for your data?” and, if yes, “Are you satisfied with the results?”**]
* Based on several dictionary and glossary sources we defined 35 “Data Curation Activities” used in our researcher engagement sessions. Full list and definitions at http://hdl.handle.net/11299/188638.
** Other responses included “I Don't Know,” “Sometimes,” “Not applicable,” and not answered.
Our focus groups included discussions that revealed the various ways in which researchers engage in data curation activities, as well as the barriers preventing them. This gave our project key issues to address and specific areas of curation for the DCN model to focus on. For example, we identified “gaps”: highly valued data curation activities that either did not happen for a majority (>50%) of researchers or happened in an unsatisfactory way (figure 5). The DCN model will benefit most from emphasizing, investing in, and/or heavily promoting these highly valued services that may not be available to many researchers, including: minting and managing persistent identifiers, maintaining a software registry, providing tools and support for auditing file integrity, creating and managing metadata that place data within the context of related publication sources, and providing code review services.
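As a concrete illustration of one of these gap services, “auditing file integrity” in practice usually means periodically recomputing a file's checksum and comparing it against the value recorded at deposit time. A minimal sketch in Python follows; the function names are our own, not part of any DCN specification:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so large data files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit_fixity(path: str, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded at deposit."""
    return sha256_of(path) == recorded_checksum
```

A repository running such an audit on a schedule can detect bit rot or accidental modification long before a reuser encounters a corrupted file.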
Figure 5: Gaps and areas of opportunity for Data Curation Network services
Top Rated Data Curation Activities
Not Happening for Most Researchers
Happening, but Not Satisfactorily
Similarly, the DCN might support better tools and/or best practices to raise satisfaction with commonly occurring data curation activities that are falling short of expectations: maintaining up-to-date data documentation templates usable by a variety of researchers; providing best practices for secure storage; creating quality assurance checklists and review procedures for a variety of data formats and types; recommending best practices or tools for data visualization; promoting better adoption of metadata standards across disciplines; recommending tools and file-naming schemas for versioning datasets; and being more transparent about the conditions and procedures that call for file format transformations.
From September 2016–November 2016 our team conducted data curation pilots with 17 curation staff members in order to identify and compare the actual, individual curation practices taking place at our partner institutions. The results allowed us to identify issues, misaligned expectations, and/or conflicts prior to implementation of the Network. Namely, the pilots gave us a real-world glimpse of the DCN in practice and informed how the DCN model should function, including:
In order to design a network of data curators, we began with a survey to better understand existing data curation services in academic libraries. Our team partnered with the Association of Research Libraries (ARL) to develop SPEC Kit #354 on data curation. We surveyed the 124 ARL institutions (mainly academic libraries based in the US and Canada) in January 2017 to understand current data curation practices and to highlight examples and best practices for other libraries to build from. Of the 80 ARL libraries that responded (a 65% response rate), 51 institutions are providing data curation services and another 13 indicated that they are developing these services. Only 20% of the sample, or 16 libraries, indicated that they neither provide nor are actively developing data curation services. Of particular note to the DCN, our survey respondents ranked having “expertise in curating certain domain data” as their greatest challenge.
Levels of staffing for data curation services were a key consideration of our survey; indeed, the lack of skilled data curators is one of the challenges the DCN aims to address. Our results showed that the majority of institutions place responsibility for data curation services on individuals who have other duties to carry out (partial or part-time staff). The number of partial staff ranges from one to 15 per library. The percentage of time they spend varies widely by institution, with some reporting 5–10% of time and others indicating it may be as high as 40–50% (figure 6). Twenty-eight institutions have only staff devoting part of their time (a total of 143 individuals). Seventeen institutions have both partial-focus and exclusive-focus staff (88 partial and 39 exclusive). Three libraries have one person who spends all of their time on data curation. Interestingly, there appeared to be little relationship between the number of data sets curated on a monthly basis and the level of staffing. One institution reported approximately 16 data sets curated a month; however, from their comments and other data, it appears they may have been counting all deposits, not just curated deposits. An outlier reported 20 staff devoted exclusively to these activities.
Figure 6: Heat map displaying the reported staffing levels for data curation (part time or exclusive full time) vs the number of monthly data sets curated in ARL Institutions (blue scale with no response indicated in grey). Most provide data curation services with partial staff.
There is a growing body of literature comparing models for supporting sustainable data curation and repository services (Kitchin, Collins, & Frost, 2015; Ember et al., 2013; Nilsen, 2017). We evaluated several financial models in order to develop a sustainable plan for supporting the DCN after the grant-funded phase. In particular, we found the 2016 Ithaka S+R report, “A Guide to the Best Revenue Models and Funding Sources for Your Digital Resources,” useful in identifying an approach that will best support the financial needs of the DCN for the next six years.
Successful models in the library and information science discipline provide exemplars for collaborative sustainability. The DCN planning team engaged with several peer groups providing shared data services to learn from their experiences. For example, we interviewed Anne Kenney, former University Librarian at Cornell and lead PI on the 2CUL project (http://www.2cul.org), which jointly supports shared collection development and cataloging services at Cornell University and Columbia University, and Jonathan Markow, lead technologist at DuraSpace (http://www.duraspace.org), an organization supporting the distributed open source digital repository platforms Fedora and DSpace, whose code bases and service models are sustained by a global community. Their experiences taught us to emphasize the community-building aspects of the DCN over the economic or cost-savings benefits, and to build the collaboration on trust with those who staff the project (e.g., data curators across institutions).
Additionally, we held information exchanges with representative staff, and (when possible) reviewed MOU agreements, from the NSF DataNet SEAD project (http://sead-data.net), Canada’s emerging shared data service the Portage Network (https://portagenetwork.ca), the Texas Digital Library consortial data repository (https://www.tdl.org/texas-data-repository), the Food and Drug Administration (FDA) open data policy (https://open.fda.gov), the Data Conservancy based at Johns Hopkins University Library (http://dataconservancy.org), the California Digital Library’s UC3 project (http://www.cdlib.org/uc3), and the statistical data focused Curate Research Data for Reproducibility (CuRe) project.
Based on our research, the anticipated costs of DCN central services, and a benefit analysis for various stakeholders, we drafted several scenarios to sustain the DCN: tiered membership, fee-for-service, and in-kind (all effort donated by institutions) models. The details of these models (e.g., estimated membership fees per institution, tiers of participation, benefits) were vetted with library administrators at our institutions (e.g., library deans).
One theme that emerged from our research was “membership fatigue,” the result of repeated requests to support numerous collective projects. Our resulting six-year financial plan to sustain the DCN (presented in the Sustainability Plan section of this report) will therefore transition the Network beyond grant support to a model that sustains operations through fee-for-service curation, expanding both the funding base for the Network and the opportunities for unaffiliated researchers and strategic partners (publishers, new disciplinary projects, etc.) to consume services on a pay-as-you-go basis.
From May 1, 2016 through April 30, 2017, the planning phase team tracked key metrics on the demand for curation services across the six institutions. Using a shared Google form, we recorded the frequency, file types, disciplines, and levels of curation needed for all datasets curated by our individual institutions in order to better anticipate future staffing needs and demand. The results are summarized in figure 7.
Figure 7: Curation levels needed vs. taken for six institutions (n=175). Curation levels ranged from major edits to the metadata and/or major changes to the files (new or missing), to edits to the metadata and/or basic changes to the files, to small edits to the metadata only.
The Data Curation Network model that we propose harnesses the expertise of well-aligned institutions that collectively provide data curation services to researchers in a multitude of disciplines, ensuring that valuable scholarly datasets are findable, accessible, interoperable, and reusable (FAIR; Wilkinson et al., 2016). Offered through a unique collaboration between academic libraries and disciplinary projects, DCN curators at distributed sites are matched with data sets according to their technical and disciplinary expertise, and conduct a rigorous review of the data using an established set of protocols that fits seamlessly within any local institutional workflow (figure 8).
Users of the Network will be able to more efficiently work with investigators to capture as much context and description of the data as possible, expertly review data quality and validate code, assess risks and verify file integrity, and validate and transform files. DCN curators also provide guidance around secure storage, citation and persistent identification strategies, and curated data may be deposited into the repository of the researcher’s choice for ongoing stewardship.
Implementing the DCN will support and expand the data curation community. Our model will bring together staff with diverse expertise (e.g., domain-specific data curators, informaticians, digital records archivists, preservation specialists, data librarians, etc.) currently siloed in single institutions into a shared network that will collectively, and more effectively, develop standards-driven data curation techniques for all types of data housed in any repository infrastructure (e.g., Fedora/Hydra, DSpace, custom-built, etc.). By expanding local curation expertise through structured, regular training and hosting community-wide educational opportunities, the DCN will build an innovative community that enriches capacities for data curation writ large.
Figure 8: Curation workflow for the Data Curation Network
Academic libraries with existing data curation services:
Academic libraries with limited to no resources for data services:
Disciplinary and subject data repositories:
The DCN will function through supportive organizational layers and dedicated staff contributing to a shared governance system, to be determined in the implementation phase. An important consideration of the DCN staffing model is maintaining and strengthening local relationships with researchers. Therefore, to provide opportunities for future engagement, our model incorporates the following DCN staff roles and the local, institutional resources those staff will interact with. Roles in the DCN include (responsibilities for each role are detailed in Appendix A):
The DCN will operate as an alliance of partner institutions (e.g., academic libraries or disciplinary data repositories) that contribute staffing and funds to sustain central services offered to potential users (e.g., academic libraries, publishers, or individual researchers). The proposed levels of participation will include, but are not limited to:
Applications for new partner institutions will be considered on a rolling basis. A Memorandum of Understanding (draft presented in Appendix B) will outline the functional aspects of the model, roles and responsibilities of the staff involved, and other normative practices and expectations. Institutions interested in joining the Network will review the MOU and provide an expression of interest via an online DCN application form (to be created).
Draft DCN partnership criteria may include, but are not limited to:
The DCN model is intended to accommodate a wide variety of local curation workflows while remaining repository-technology agnostic. The submission workflow assumes that all technical functionality (ingest, storage, access, dissemination, and preservation) is the responsibility of the local institution. Local researchers therefore submit data to their local curation service as usual; the Local Curator then determines whether the dataset should be submitted to the DCN for expert curation and review. Figure 9 briefly describes this process; more detailed workflows and curator checklists are presented in Appendix C.
Datasets received by the Network will be managed via a submission tracking tool (functional requirements listed in Appendix D) that records where a dataset is in the DCN workflow and how long it spends in each step.
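To illustrate the kind of record such a tool might keep, here is a minimal sketch in Python. The step names and field names are hypothetical stand-ins; the actual tool's fields would be defined by the functional requirements in Appendix D.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical workflow steps, in order; the real DCN step names may differ.
STEPS = ["submitted", "preliminary_check", "assigned", "curated", "returned"]

@dataclass
class Submission:
    """One dataset moving through the DCN workflow."""
    dataset_id: str
    timestamps: dict = field(default_factory=dict)  # step name -> datetime

    def mark(self, step: str, when: datetime) -> None:
        """Record when the dataset entered a workflow step."""
        self.timestamps[step] = when

    def durations(self) -> dict:
        """Time spent between consecutive recorded steps."""
        done = [s for s in STEPS if s in self.timestamps]
        return {
            f"{a}->{b}": self.timestamps[b] - self.timestamps[a]
            for a, b in zip(done, done[1:])
        }

# Example: a dataset assigned four hours after submission,
# then curated one day later.
sub = Submission("dcn-0001")
sub.mark("submitted", datetime(2017, 5, 1, 9, 0))
sub.mark("assigned", datetime(2017, 5, 1, 13, 0))
sub.mark("curated", datetime(2017, 5, 2, 13, 0))
print(sub.durations())
```

Keeping a timestamp per step, rather than a single status field, is what makes the per-stage duration reporting described in the Assessment Plan possible.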
DCN submissions receive a preliminary check from the DCN Coordinator before being assigned to an appropriate DCN Curator (based on expertise match and availability). Once assigned a dataset, the DCN Curator is responsible for reporting any questions, changes, augmentations, and corrections back to the Local Curator. Because researchers may choose not to take recommended actions, the last step in the DCN workflow is for the DCN Curator to assess the final result and determine whether it meets standards for FAIRness (Dunning, de Smaele & Böhmer, 2017).
Any issues (e.g., problems with a particular dataset) can be discussed at the regular curator virtual meetings where all DCN curators may participate. Here peers may recommend additional actions be taken or collaborate on resolutions for copyright issues, documentation, etc.
Figure 9: DCN Workflow Steps
DCN Curators take standardized, file-type-specific actions when reviewing data for fitness for reuse, applying their expert skills and domain-specific knowledge.
Specifically, curators will take the CURATE steps (detailed in Appendix C) for each dataset:
C – Check data files and read documentation
U – Understand the data (or try to)
R – Request missing information or changes
A – Augment the submission with metadata for findability
T – Transform file formats for reuse and long-term preservation
E – Evaluate and rate the overall submission for FAIRness
Next, our team will launch a valuable new service benefiting researchers, their disciplines, and the end users of research data worldwide. The implementation phase of the Data Curation Network will put the model presented here into action by incrementally adding new partners from academic institutions and disciplinary organizations (figure 10). Our proposed curation-as-a-service model will allow the DCN to grow sustainably through controlled, member-driven expansion into new service areas in the years to come. Finally, a two-pronged assessment approach will track DCN success and aim to demonstrate that data curated by the Network are more valuable to users than non-curated data. Along the way, the project team will develop and share standards-driven data curation techniques, measure the impact of data curation services, and provide essential training to a cohort of data curators.
Figure 10: Six year plan for implementing the Data Curation Network
Grant funded (Y1–Y2), starting with 6 academic institutions and growing to 8 academic institutions and 2 disciplinary partners; transition to partnership model (Y3), recruiting new partners as use and demand dictate.
For the first two years of a three-year implementation phase, the DCN will be supported by startup grant funding and the contributed effort of the six planning phase institutions (Minnesota, Cornell, Illinois, Michigan, Penn State, and Washington University in St. Louis), plus two additional academic partners and two disciplinary partners.
Each of the 8–10 partners will contribute a minimum of 5% of a DCN Representative’s time, plus 5–10% FTE from 1–2 additional data curation specialists. The lead institution (currently the University of Minnesota) will also contribute 15% of the DCN Lead’s time (Lisa Johnston) to provide overall direction and to supervise a full-time, grant-funded DCN Coordinator. Disciplinary partners may commit either 5–10% of a specialist’s time or another in-kind service that adds value to the Network; depending on the partner, this could be submission access to their repository, reduced or waived fees for partners, or some other benefit.
During the implementation phase, several activities will take place. DCN staff will establish communication channels (e.g., a shared listserv, Slack, regular video conferencing, etc.) and set up the submission tracking form. An in-person DCN meeting will bring DCN Curators together for training and networking. Another key activity will be to establish and maintain an up-to-date skills inventory of DCN Curators to document available curation expertise and identify gaps for future recruitment.
The implementation phase of the DCN will continue to track trends in the domains and file types that come to the Network and will recruit new institutions to fill any gaps in expertise. Capacity for curating data in the Network will grow as new partners join. For example, our year of metrics tracking found that curators spend an average of two hours curating a dataset (ranging from under one hour to more than eight). In year 3, if each institution contributes 10% of a DCN Curator’s time (assuming 10% FTE = 16 hours/month), then with 10 institutions the DCN will have roughly 160 curation hours per month, or the capacity to curate an average of 80 datasets each month. Finally, the DCN will establish a public-facing directory of datasets successfully curated by the Network. This web resource will be directional, linking to the distributed, locally housed datasets. The technical mechanism for bringing together the DCN-approved datasets will aim to use the OAI-standardized metadata from each institution’s open API feeds so the directory can function autonomously.
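The capacity arithmetic above can be sketched directly; the figures are the ones reported in this section (16 hours/month per 10% FTE curator, two hours per dataset, 10 institutions), while the constant names are ours.

```python
# Back-of-the-envelope capacity estimate using the figures reported above.
HOURS_PER_CURATOR_PER_MONTH = 16   # 10% FTE, as assumed in the text
HOURS_PER_DATASET = 2              # planning-phase average curation time
institutions = 10                  # year-3 partner count

# Total monthly curation hours across the Network.
total_hours = institutions * HOURS_PER_CURATOR_PER_MONTH   # 160

# Average number of datasets that capacity can curate per month.
datasets_per_month = total_hours // HOURS_PER_DATASET      # 80

print(total_hours, datasets_per_month)  # 160 80
```

Because actual curation time ranged from under one hour to more than eight, the 80-datasets-per-month figure is an average, not a guaranteed throughput.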
In the third year of the implementation phase, the DCN will transition to a self-sustaining service in which institutional and disciplinary partners contribute data curation staff and share the central operations costs.
The core partner institutions will share central costs so that the Data Curation Network continues beyond the implementation phase without additional grant support. Financial support contributed by partner institutions (along with in-kind curator staffing) will sustain a number of potential centralized services, including hiring one full-time DCN Coordinator and holding annual DCN Curator training events (figure 11). Costs may be offset by potential revenue streams (figure 11) as fee-for-service use increases, and/or if the DCN affiliates with a parent association that acts as fiscal agent and covers some of the overhead burden.
The DCN planning phase team reviewed several governance documents of peer organizations, including the 2CUL Project, arXiv, DataOne, HathiTrust, Portage, and the Texas Digital Library, in order to draft a Memorandum of Understanding for partner institutions. Our DCN draft MOU anticipates the need for a governance body that advises on any major issues encountered by the Network staff. However, details for the makeup and responsibilities of this governing board will be determined in the Implementation phase of the DCN. An updated MOU will reflect any changes to the Network based on lessons learned from the Implementation phase and will be used to normalize and sustain operations of the DCN moving forward.
Figure 11: Central costs and potential revenue streams for the Data Curation Network
Potential Central Costs
Events, Travel, Training
Potential Revenue Streams (future)
The planning phase enabled our team to identify the metrics that will be important for tracking the impact and success of the Data Curation Network over time. Our two-pronged assessment plan therefore identifies several key metrics to be tracked from the start of the implementation phase. First, we will closely monitor the number of datasets curated by the Network, the frequency of submission (high-volume time periods, etc.), and the variety and types of data (e.g., unique file formats and the range of disciplines that use DCN services). An important factor in our assessment will be the effectiveness of data curation across the Network, measured by the time a dataset spends at each stage of our workflow (e.g., time from ingest to assignment, time with the DCN Curator, time with the Local Curator before finalization, etc.). Building on our metrics tracking during the planning phase, the DCN will also track the overall level of data curation actions taken on each dataset, documenting the level of curation needed versus the level of curation taken and how well the finalized data scored against FAIR standards. Figure 12 details the draft DCN “CURATE” procedures, which include the steps Check, Understand, Request, Augment, Transform, and Evaluate.
Second, in addition to the above metrics, we plan to monitor overall service impact in the following ways:
Figure 12: Draft procedures checklist of DCN CURATE steps and FAIRness scorecard
Check data files and read documentation
Understand the data (or try to)
Varies based on file formats and subject domain. For example….
Tabular Data Questions (Microsoft Excel)
Request missing information or changes
Narrative describing the concerns, issues, and needed improvements to the data submission
Augment the submission
Transform file formats
Evaluate and rate the overall data record for FAIRness.*
* The rubric evaluating the FAIR principles is based on the scoring matrix by Dunning, de Smaele, & Böhmer (2017).
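As a purely illustrative sketch, a scorecard of this kind could be reduced to a per-principle rating averaged into a single FAIRness score. The 0–2 scale and criterion names below are hypothetical stand-ins, not the actual Dunning, de Smaele & Böhmer matrix.

```python
# Illustrative only: a simplified FAIRness scorecard. The real DCN rubric
# follows the scoring matrix of Dunning, de Smaele & Böhmer (2017); the
# 0-2 scale and criterion names here are hypothetical.
CRITERIA = ["findable", "accessible", "interoperable", "reusable"]

def fair_score(ratings: dict) -> float:
    """Average the per-principle ratings (0 = unmet, 2 = fully met)."""
    return sum(ratings[c] for c in CRITERIA) / len(CRITERIA)

# Example: a dataset that is fully findable and accessible but only
# partially interoperable and reusable.
ratings = {"findable": 2, "accessible": 2, "interoperable": 1, "reusable": 1}
print(fair_score(ratings))  # → 1.5
```

A numeric summary like this would let the Network compare FAIRness scores before and after curation, supporting the assessment goals described above.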
The Data Curation Network project was supported by the Alfred P. Sloan Foundation grant “Planning the Data Curation Network,” which ran from May 2016–June 2017. We invite feedback and suggestions for improvement to make this approach as useful to the community as possible. Please send comments to the authors at firstname.lastname@example.org.
Research Releases from the DCN Planning Phase:
Bardyn, T. P., Resnick, T., & Camina, S. K. (2012). Translational researchers’ perceptions of data management practices and data curation needs: findings from a focus group in an academic health sciences library. Journal of Web Librarianship, 6(4), 274-287. DOI:10.1080/19322909.2012.730375.
Beagrie, N., & Houghton J.W. (2014) The Value and Impact of Data Sharing and Curation: A synthesis of three recent studies of UK research data centres, Jisc. http://repository.jisc.ac.uk/5568/1/iDF308_-_Digital_Infrastructure_Directions_Report%2C_Jan14_v1-04.pdf.
Bloom, T., Dallmeier-Tiessen, S., Murphy, F., Austin, C. C., Whyte, A., Tedds, J., Nurnberger, A., Raymond, L., Stockhause, M., Vardigan, M., & Clarke, T. (2015). Workflows for Research Data Publishing: Models and Key Components. International Journal on Digital Libraries-Research Data Publishing Special, (27). https://www.rd-alliance.org/system/files/Workflows_for_Research_Data_Publishing-_Models_and_Key_Components_submitted.pdf.
Consultative Committee for Space Data Systems. (2011). Audit and Certification of Trustworthy Digital Repositories, Recommended Practice, CCSDS 652.0-M-1, Magenta Book, Issue 1 Washington, DC: CCSDS Secretariat. http://public.ccsds.org/publications/archive/652x0m1.pdf.
Cragin, M., Heidorn, P. B., Palmer, C. L., Smith, L. C. (2007). An Educational Program on Data Curation. ALA Science & Technology Section Conference. http://hdl.handle.net/2142/3493.
Dempsey, L., Malpas, C., & Lavoie, B. (2014). Collection directions: the evolution of library collections and collecting. portal: Libraries and the Academy, 14(3), 393-423. DOI:10.1353/pla.2014.0013.
Dunning, A., de Smaele, M., and Böhmer, J. (2017, January 31). Are the FAIR Data Principles fair?. Zenodo. DOI:10.5281/zenodo.321423.
Ember, C., Hanisch, R., Alter, G., Berman, F., Hedstrom, M., & Vardigan, M. (2013). Sustaining domain repositories for digital data: A white paper. http://datacommunity.icpsr.umich.edu/sites/default/files/WhitePaper_ICPSR_SDRDD_121113.pdf.
Erway, R. (2012). Swatting the Long Tail of Digital Media: A Call for Collaboration. Dublin, Ohio: OCLC Research. http://www.oclc.org/research/publications/library/2012/2012-08.pdf.
Fecher, B., Friesike, S., Hebing, M., & Linek, S. (2017). A reputation economy: how individual reward considerations trump systemic arguments for open access to data. Palgrave Communications. 3 (17051). doi:10.1057/palcomms.2017.51.
Holdren, J. P. (2013). Increasing access to the results of federally funded scientific research. Office of Science and Technology Policy, Executive Office of the President. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf.
Ithaka S+R. (2016). A Guide to the Best Revenue Models and Funding Sources for Your Digital Resources. http://www.sr.ithaka.org/publications/a-guide-to-the-best-revenue-models-and-funding-sources-for-your-digital-resources/.
Johnston, L. R. (2014). A Workflow Model for Curating Research Data in the University of Minnesota Libraries: Report from the 2013 Data Curation Pilot. University of Minnesota Digital Conservancy. http://hdl.handle.net/11299/162338.
Johnston, L. R., Carlson, J., Hswe, P., Hudson-Vitale, C., Imker, H., Kozlowski, W.,. Olendorf, R. K., & Stewart, C. (2017). Data Curation Network: How Do We Compare? A Snapshot of Six Academic Library Institutions’ Data Repository and Curation Services. Journal of eScience Librarianship 6(1): e1102. DOI:10.7191/jeslib.2017.1102.
Kindling, M., Pampel, H., van de Sandt, S., Rücknagel, J., Vierkant, P., Kloska, G., … Scholze, F. (2017). The Landscape of Research Data Repositories in 2015: A re3data Analysis. D-Lib Magazine, 23, 3-4. DOI:10.1045/march2017-kindling.
Kirchner, J., Diaz, J., Henry, G., Fliss, S., Culshaw, J., Gendron, H., and Cawthorne, J. E. (2015). The Center of Excellence Model for Information Services. Retrieved from the Council on Library and Information Resources, http://www.clir.org/pubs/reports/pub163.
Kitchin, R., Collins, S., & Frost, D. (2015). Funding models for Open Access Repositories. Maynooth: Maynooth University; Dublin: the Royal Irish Academy and Trinity College Dublin. DOI:10.3318/DRI.2015.4.
Kouper, I., Kollen, C., Ishida, M., Williams, S., & Fear, K. (2017). Research Data Services Maturity in Academic Libraries. In L. R. Johnston (Ed.), Curating Research Data Volume One: Practical Strategies for Your Digital Repository (33-60). Chicago, IL: Association of College & Research Libraries (ACRL), American Library Association. http://hdl.handle.net/10150/622168.
Lee, D. J., & Stvilia, B. (2017). Practices of research data curation in institutional repositories: A qualitative view from repository staff. PloS one, 12(3), e0173987. DOI:10.1371/journal.pone.0173987.
McNutt, M., Lehnert, K., Hanson, B., Nosek, B. A., Ellison, A. M., & King, J. L. (2016). Liberating field science samples and data. Science, 351(6277), 1024-1026, DOI:10.1126/science.aad7048.
Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., du Sert, N. P., ... Ioannidis, J. P. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. DOI:10.1038/s41562-016-0021.
National Research Council, Committee on Future Career Opportunities and Educational Requirements for Digital Curation, Board on Research Data and Information, Policy and Global Affairs. (2015) Preparing the Workforce for Digital Curation. Washington, DC: National Academies Press. http://www.nap.edu/catalog.php?record_id=18590.
Nilsen, K. (2017). Beyond Cost Recovery: Revenue Models and Practices for Data Repositories in Academia. In L. R. Johnston (Ed.), Curating Research Data Volume One: Practical Strategies for Your Digital Repository (pp. 193-211). Chicago, IL: ACRL. http://www.ala.org/acrl/sites/ala.org.acrl/files/content/publications/booksanddigitalresources/digital/9780838988596_crd_v1_OA.pdf.
Roche, D. G., Kruuk, L. E., Lanfear, R., & Binning, S. A. (2015). Public data archiving in ecology and evolution: how well are we doing?. PLoS Biol, 13(11), e1002295. DOI:10.1371/journal.pbio.1002295.
Smith, R., & Roberts, I. (2016). Time for sharing data to become routine: the seven excuses for not doing so are all invalid. F1000Research, 5. DOI:10.12688/f1000research.8422.1.
Soehner, C., Steeves, C., & Ward, J. (2010). E-Science and Data Support Services: A Study of ARL Member Institutions. Association of Research Libraries. http://www.arl.org/storage/documents/publications/escience-report-2010.pdf.
Stodden, V., Guo, P., & Ma, Z. (2012, September). How journals are adopting open data and code policies. In The First Global Thematic IASC Conference on the Knowledge Commons: Governing Pooled Knowledge Resources. https://pdfs.semanticscholar.org/fde2/8f99bc049044c8191abdbcead9d396668028.pdf.
Stuart-Stubbs, B. (1975). An Historical Look at Resource Sharing. Library Trends, 23, 4, p. 649-64. http://hdl.handle.net/2142/6812.
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., Wu, L., Read, E., ... Frame, M. (2011). Data sharing by scientists: practices and perceptions. PloS one, 6(6), e21101. DOI:10.1371/journal.pone.0021101.
Tenopir, C., Dalton, E. D., Allard, S., Frame, M., Pjesivac, I., Birch, B., ... Dorsett, K. (2015). Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PLoS One, 10(8), e0134826. DOI:10.1371/journal.pone.0134826.
Weber, D. (1976). A Century of Cooperative Programs Among Academic Libraries. College & Research Libraries, 37(3), 205-221. DOI:10.5860/crl_37_03_205.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... Bouwman, J. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3. DOI:10.1038/sdata.2016.18.
York, J., Gutmann, M., & Berman, F. (2016). What Do We Know About The Stewardship Gap? University of Michigan Deep Blue. http://hdl.handle.net/2027.42/122726.
Lisa R. Johnston is the Research Data Management/Curation Lead at the University of Minnesota Twin Cities Libraries. Johnston coordinates the library's efforts around research data management and leads a team of five data curation experts for curating research data in the Data Repository for the University of Minnesota (DRUM). Since 2012, Johnston has also served as the co-director of the University Digital Conservancy, the University of Minnesota’s institutional repository for research and publications. Johnston has authored numerous publications, and most recently edited and authored the two-volume Curating Research Data: Practical Strategies for Your Digital Repository (2017) published by the Association of College & Research Libraries (ACRL) Press.
Jake Carlson is the Research Data Services Manager at the University of Michigan Library. Carlson oversees the creation, implementation and operation of Research Data Services (RDS) at the Library, which includes the development of the library-based data repository, Deep Blue Data, launched in 2016. Carlson is a primary architect of the Data Curation Profile Toolkit (http://datacurationprofiles.org) and the PI of the Data Information Literacy project (http://datainfolit.org). He is the co-editor of Data Information Literacy: Librarians, Data, and the Education of a New Generation of Researchers (2015, Purdue University Press) and the author of numerous articles on roles for librarians in managing and curating research data.
Cynthia Hudson-Vitale is the Data Services Coordinator at Washington University in St. Louis Libraries. In this position, Hudson-Vitale leads research data services and curation efforts for the Libraries. Since coming into this role in 2012, Hudson-Vitale has worked on several funded faculty projects to facilitate data sharing and interoperability, while also providing scalable curation services for the entire University population. Hudson-Vitale currently serves as the Visiting Program Officer for SHARE with the Association of Research Libraries.
Heidi Imker is the director of the Research Data Service (RDS) at the University of Illinois at Urbana-Champaign. Imker came to the University Library in 2014 to lead the Illinois RDS, a campus-wide initiative that provides the Illinois research community with the expertise, tools, and infrastructure necessary to manage and steward research data. Prior to this position, Imker was the Executive Director of a large scale collaborative grant funded by NIH, called the Enzyme Function Initiative. There Imker was the co-director of the Data Core which aimed to manage, disseminate, and integrate research data produced by 15 different research groups across the disciplines of microbiology, metabolomics, molecular biology, structural biology, enzymology, and computational biology.
Wendy Kozlowski is the Data Curation Specialist at Cornell University. Kozlowski is coordinator of the Cornell Research Data Management Services Group, a cross-campus, collaborative organization that provides data management services to faculty, staff and students throughout the entire research process. Operating within Cornell University Library’s Metadata Services group and as part of the library’s institutional repository (eCommons) administrative team, Kozlowski is the point person for both repository-wide and scientific metadata, and works with subject liaisons to curate data sets deposited into eCommons. Kozlowski has a B.A. in biology and a M.S. in ecology, and spent 19 years in biology and oceanography research, working on multidisciplinary data sets and with teams from numerous institutions both in and outside the United States.
Robert Olendorf is the Research Data Librarian at Penn State University. Olendorf chairs the committee developing comprehensive data services at the library and building collaborative relationships with other data service providers at the university. Since 2015, Olendorf has worked primarily with science faculty and students to help them better manage and curate their data, often in collaboration with other campus partners. Olendorf is also the product owner of ScholarSphere, the Penn State institutional repository. Prior to Penn State, Olendorf worked at Los Alamos National Laboratory and the University of New Mexico. Before his career as a librarian, Olendorf was an evolutionary biologist focused on the evolution of cooperative behavior and sexual selection, usually working in cross-disciplinary groups spanning multiple institutions. This work incorporated a variety of data, including field experiments and observations, high-performance computing simulations, and large molecular genetic data sets.
Claire Stewart is the Associate University Librarian for Research and Learning at the University of Minnesota. Prior to arriving at Minnesota in 2015, Stewart held several positions at Northwestern University over a 21-year period, including directing the Center for Scholarly Communication and Digital Curation and serving as Head of Digital Collections. At Northwestern, Stewart served as campus lead for repository services and e-science, directing the creation of an E-Science Working Group and data management services as a collaboration between the office for research, information technology, and the library. At the University of Minnesota, Stewart is a member of the Libraries senior leadership team and co-sponsor of the Data Management and Curation Initiative. She directs the Libraries’ education and research support programs, leading staff who provide general and specialized support, including GIS, digital humanities, and data management and curation services.
Each operational role in the Data Curation Network will have key responsibilities.
DCN Coordinator: This individual, centrally funded through the DCN, oversees the daily operations of the Network, tracks and monitors all datasets that flow through the Network, and assigns incoming data sets to the appropriate DCN Curator. DCN Coordinator responsibilities include:
Local Curator: Each DCN user will designate a staff member who submits a dataset from their home institution to the Network. Local Curator responsibilities include:
DCN Curators: Each partner institution will contribute 1-2 data curation staff (at 5%-10% FTE) to provide expert curatorial services for the Network. DCN Curator responsibilities include:
DCN Representatives: Each partner institution will select one DCN Representative to participate in the Network as the institutional representative. DCN Representatives are also the DCN planning phase collaborators and authors of this report. Responsibilities include:
DCN Lead Representative: A DCN Lead Representative, based at the lead institution (currently the University of Minnesota), will provide overall direction, outreach, and marketing for the Network. DCN Lead Representative responsibilities include:
Background: The DCN planning phase team reviewed several governance documents of peer organizations, including the MOUs from the 2CUL Project, arXiv, DataONE, HathiTrust, Portage, and the Texas Digital Library. Our draft DCN MOU (figure 13) anticipates the need for a governance body that advises on any major issues encountered by the Network staff. However, details on the makeup and responsibilities of this governing board will be determined in the implementation phase of the DCN.
Figure 13: Draft MOU for the Data Curation Network partner institutions
Subject to discussion and change during the first two years of the implementation phase.
This section first provides an overview (figure 14) and then details the workflow steps (figure 15) drafted for the various roles in the DCN. Once implemented, DCN curators and representatives will be expected to communicate on a regular, ongoing basis (e.g., bi-weekly conference calls) in order to report on curation assignments and to adjust the workflow as new situations arise.
Figure 14: Swimlane diagram of the roles and steps involved with the DCN workflow
Figure 15: Detailed workflow steps in the DCN Model
Post-Ingest Curation Scenario: Data Intended for an Open Access Institutional Repository

Each step below lists the responsible role, the action taken, and, where applicable, the associated data curation activities.

Step 1
Role: Local Researcher
Action: Self-deposits the dataset to the local open access data repository.

Step 2
Role: Local Curator
Action: Appraises the data submission and determines whether the data should be submitted to the DCN for curation.
Data curation activities: Chain of custody; Arrangement and description; Transfer to DCN.

Step 3: Review + Assign
Role: DCN Coordinator
Action: Reviews the submission and assigns it to the appropriate DCN Curator based on file format, disciplinary expertise, and other factors such as availability.

Step 4: CURATE
Role: DCN Curators
Action: Perform a timely review of the data and deliver a report of the actions recommended for the data to become DCN-approved.
Responsibilities: Perform and document each C-U-R-A-T-E step:
Check files and read documentation.
Understand the data (or try to); if not...
Request missing information or changes.
Augment metadata for findability.
Transform file formats for reuse.
Evaluate for FAIRness.
Data curation activity: File format transformations.

Step 5: Communications with Local Curator
Role: DCN Coordinator
Action: Mediates the recommendations identified by the DCN Curator to the Local Curator.

Step 6: Communications with Author
Role: Local Curator
Action: Works with the researcher to address any changes, augmentations, or corrections to the data (in person, via email, etc.).
Responsibilities: The level of local support will vary.

Step 7
Role: Data Author
Action: Responds to any curatorial issues and submits any files or changes to the Local Curator as needed.
Data curation activity: File format transformations.

Step 8
Role: Local Curator
Action: Finalizes the data submission.
Data curation activity: Technology monitoring and refresh.

Step 9: DCN Stamp of Approval
Role: DCN Coordinator
Action: Reviews the final data publication to determine whether the necessary actions were taken; if so, grants "DCN Approval."
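The scenario above is, in effect, a linear hand-off between roles: each dataset moves through the same ordered sequence of steps until it earns DCN approval. As a rough illustration only (the DCN prescribes no such software, and the step names below are paraphrased from the workflow, not official labels), the sequence could be modeled as a simple state machine:

```python
from enum import Enum

class Step(Enum):
    """Workflow steps in the post-ingest curation scenario (paraphrased)."""
    DEPOSITED = "Researcher self-deposits to local repository"
    APPRAISED = "Local Curator appraises and transfers to DCN"
    ASSIGNED = "DCN Coordinator assigns a DCN Curator"
    CURATED = "DCN Curator performs the CURATE review"
    MEDIATED = "Coordinator relays recommendations to Local Curator"
    REVISED = "Author responds; Local Curator finalizes submission"
    APPROVED = "Coordinator grants DCN Approval"

# Enum members iterate in definition order, which here is workflow order.
ORDER = list(Step)

def advance(current: Step) -> Step:
    """Move a dataset to the next workflow step; error once approved."""
    i = ORDER.index(current)
    if i + 1 >= len(ORDER):
        raise ValueError("Dataset already has DCN approval")
    return ORDER[i + 1]

# Walk one dataset through the full workflow.
state = Step.DEPOSITED
while state is not Step.APPROVED:
    state = advance(state)
print(state.name)  # APPROVED
```

Because the flow is strictly linear, any tracking tool the Network adopts mainly needs to record which step a dataset is on and who is responsible for advancing it.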
The Data Curation Network will operate primarily through a tool or application that fulfills the requirements shown in figure 16.
Figure 16: Functional requirements for a DCN tracking form
Project and project component features
Templates for Curation Checklists
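One way to picture the "templates for curation checklists" requirement is a tracking record that pairs each submitted dataset with its own copy of the C-U-R-A-T-E checklist. The sketch below is hypothetical, not a specification of the DCN tool; the class and field names are invented for illustration:

```python
from dataclasses import dataclass, field

# The C-U-R-A-T-E steps from the DCN workflow, used as a checklist template.
CURATE_TEMPLATE = [
    "Check files and read documentation",
    "Understand the data",
    "Request missing information or changes",
    "Augment metadata for findability",
    "Transform file formats for reuse",
    "Evaluate for FAIRness",
]

@dataclass
class CurationTicket:
    """A minimal, illustrative tracking record for one dataset."""
    dataset_id: str
    submitting_institution: str
    assigned_curator: str = ""
    # Each ticket gets its own fresh copy of the checklist.
    checklist: dict = field(
        default_factory=lambda: {step: False for step in CURATE_TEMPLATE})

    def complete(self, step: str) -> None:
        self.checklist[step] = True

    def ready_for_approval(self) -> bool:
        # A "DCN Stamp of Approval" would require every step documented.
        return all(self.checklist.values())

ticket = CurationTicket("dcn-0001", "Example University")
for step in CURATE_TEMPLATE:
    ticket.complete(step)
print(ticket.ready_for_approval())  # True
```

A real implementation would also need the email integration and cross-institutional access discussed in the tool evaluation below, which is why off-the-shelf tracking systems were considered rather than custom software.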
Analysis of Available Options
Numerous tracking systems are available on the market, each with different strengths and weaknesses. Focusing specifically on workflow-tracking software, rather than project management or IT service management software, our initial evaluation shows that JIRA is a promising option for accomplishing many of our needed tasks. However, a combination of platforms, allowing for both issue tracking and email integration, may also need to be considered. A full analysis of the software under consideration is shown in figure 17.
Figure 17: Workflow and Issue Tracking Software and Tools Evaluation
Option 1
Details: Web-based. Free version supports just 15 users; $10/user/mo billed annually.
Pros/Cons: Does track workflow, but is project-management focused. Possible to set up with little or no IT expertise needed on our part.

Option 2
Details: Web-based. $100/mo or $1,000/year.
Pros/Cons: Built for project management. Looks good in many ways, but no outside email integration.

Option 3
Details: All about "customer support." Hosted or server based. Not free.
Pros/Cons: Super powerful, but has a bit of a learning curve. Works best when connected to Confluence, but that would incur additional costs. Has outgoing email integration/notification. Atlassian has cloud-hosting options that we could use.

Option 4
Details: Free, open, web-based.
Pros/Cons: No email connectivity, but would potentially facilitate sharing of materials from existing storage locations.

Option 5
Details: Web-based, open source.
Pros/Cons: Full-service ITSM product, like Remedy. Probably more than we need.

Option 6
Details: Focus is on bug tracking in website development.
Pros/Cons: Generates tickets and URLs for tracking.

Option 7
Details: Open source. Requires local install on any OS with Ruby on Rails.
Pros/Cons: Simple but complete features. Email integration. More focused on project management than customer service.

Option 8
Details: Local install required.
Pros/Cons: Not recommended for cross-institutional work; steep user learning curve, especially for non-full-time users.

Option 9
Details: Open source but not web-based.
Pros/Cons: Will require a Linux box and an underlying relational database to run our own installation. Does track workflow. Very much designed to support IT service needs.

Option 10
Details: Web-based. Boards, lists, cards.
Pros/Cons: People seem to love or hate it. May not do everything we need. Has email integration.

Option 11
Details: GitHub-powered. Free.
Pros/Cons: Will really work only if we decide to keep everything else in GitHub as well.

Option 12
Details: Web-based; free version available.
Pros/Cons: Has great app integration (e.g., an email in Gmail can trigger a file to be moved to Box and a message sent to Slack). Free version will not be adequate; not sure it can do all our other required tasks.