1 of 16

Budgeting for Data Management and Open Science

Fernando Rios

Research Data Management Specialist, UA Libraries

Tina Lee

User Engagement Officer, CyVerse

2 of 16

Budgeting assumptions

  • The resources needed for open science data management include hardware, software, and personnel with the expertise to execute all aspects of data management (full path data management) as described in the proposal.
  • Your team includes experienced personnel who are allocated sufficient time to execute data management activities throughout the research data life cycle.
  • Grant funder/funding allows purchase of resources necessary for open science data management.
  • Open source software is used whenever possible.

3 of 16

Data Management activities that need to be budgeted for:

Active Research

  • Storage for active work (cloud, local)
  • Compute resources (cloud, HPC, local)
  • Data management
    • Implementing backup strategy
    • Lab management, ELNs
  • Training (time mainly)
    • Could involve travel, e.g., this workshop
  • Documentation
    • File, folder structure
    • Data
    • Code comments, code cleaning
  • DM Planning itself!

Sharing outputs openly and reproducibly

  • Data curation
    • Documentation, organization
    • Format conversions
    • Packaging for reproducibility (Docker, VMs, Singularity, Binder, Jupyter, etc)
    • Metadata
    • Working with a curator
  • Long-term archiving
    • Trusted repository
    • Some are free, some are not
  • Open access publishing costs (if applicable)

Mostly, you need to budget someone’s time to take ownership of implementing data operating procedures, curation, documentation, and sharing.
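Among the activities above, “implementing a backup strategy” is the one most readily automated. A minimal sketch in Python (the paths in the example are hypothetical placeholders; a real strategy would add off-site copies, retention rules, and verification):

```python
# Minimal timestamped-backup sketch. Paths are hypothetical placeholders;
# a production strategy would add off-site copies, retention, and checksums.
import shutil
from datetime import datetime
from pathlib import Path

def backup(src: str, dest_root: str) -> Path:
    """Copy the project directory to a timestamped folder under dest_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"{Path(src).name}-{stamp}"
    shutil.copytree(src, dest)  # fails if dest already exists, which is intended
    return dest

# Example: backup("/data/project", "/backups") creates /backups/project-<timestamp>
```

Scheduling this with cron or a workflow tool is what turns a script into a strategy; the personnel time to set that up and monitor it is the budget line.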

4 of 16

Under-budgeting of time...

80% of time is spent finding and cleaning data

5 of 16

Ballpark amounts

  • The EU Open Science Cloud recommends 5% of allocated funds be dedicated to data stewardship
  • NOAA Climate Observation Division: projects spend 2-21% on data stewardship
  • Bottom line: there is no target percentage, only ballpark guidance; so when budgeting for data management, “it depends”!
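The guidance above can be turned into a quick range calculation. A small sketch (the 5% and 2-21% figures come from the sources cited on this slide; the $500,000 project total is a made-up example):

```python
def stewardship_budget(total_funds, low=0.02, high=0.21, eosc=0.05):
    """Return (EOSC 5% figure, (low, high) range from NOAA's observed 2-21%)."""
    return total_funds * eosc, (total_funds * low, total_funds * high)

# Hypothetical $500,000 project:
eosc_amount, (lo, hi) = stewardship_budget(500_000)
print(f"EOSC guidance (5%): ${eosc_amount:,.0f}")
print(f"NOAA observed range (2-21%): ${lo:,.0f} - ${hi:,.0f}")
```

Even this crude model makes the point: the plausible range spans an order of magnitude, so the real budgeting work is deciding where your project falls within it.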

European Commission High Level Expert Group on the European Open Science Cloud. “Realising the European Open Science Cloud: First Report and Recommendations of the Commission High Level Expert Group on the European Open Science Cloud”. 2016.

Saleem Arriago. “Facilitating and Ensuring Data Stewardship: Data Challenges of NOAA’s Climate Observation Division”. 2016.

6 of 16

Costs at different points in the research lifecycle

Credit: Digital Curation Centre

Figure: cost categories across the lifecycle include time/personnel (experience vs. cost), data licensing costs, storage, compute, documentation, long-term storage, and data curation & archiving costs.

7 of 16

Active-phase research

8 of 16

Personnel budgets should...

...consider:

  • Costs at all phases of data life cycle
  • Qualifications/experience of personnel [data science, stats, standards, version control, identifiers, FAIR data, data management, tools for reproducible research (e.g., OSF, containers)]
  • Cost vs. experience trade-off (more experience = time savings)

...include time and experience for:

  • Necessary training or certifications
  • Researching and implementing standards that will be used for all phases
  • Collecting and formatting metadata
  • Quality assurance and control (before, during and after data collection)
  • Data cleaning, organization, and integration (AKA data munging)

9 of 16

Hardware and software budgets should include:

  • Hardware (e.g., laptop[s] with adequate storage and compute power for your data generation and potentially analysis needs)
  • Connection/access costs for Ethernet or Wi-Fi of sufficient bandwidth for uploading data
  • Free Open Source Software (FOSS) costs nothing (but this often requires personnel with sufficient knowledge/expertise to use it and/or customize it to your needs)
  • Any enterprise-level software or services you consider essential to the project operations (e.g., G-suite, Dropbox, Slack, Zoom, AWS, Azure, etc.)
  • Long-term storage and archival fees for trustworthy repositories
  • Data publication fees

10 of 16

Sample Storage Costs

Assume 100 TB storage, 10 TB egress per year.

Think about the storage time frame and purpose.

| Storage Type | Storage Cost | Egress (10 TB/yr) | Total (1 yr, 100 TB) | Note |
| --- | --- | --- | --- | --- |
| Institutional research computing | $3900 - $4500 | $0 | $3900 - $4500 | Meant for active storage |
| Amazon S3 | ~$1500 | ~$900 | ~$2500 | Glacier Deep Archive (infrequent access) |
| LTO-8 tape | ~$1500 + $7500 | $0 | ~$9000 | 10 x 12 TB tapes + drive; long-term storage; does not include labor or storage |
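One way to compare options like these is a small first-year cost model. The sketch below mirrors the table’s numbers (approximations from this slide, not current vendor pricing):

```python
# Illustrative first-year cost model for the storage options in the table above.
# Rates are the slide's approximations, not current vendor quotes.
def yearly_cost(storage_cost, egress_cost=0.0, one_time=0.0):
    """Total first-year cost: recurring storage + egress + one-time hardware."""
    return storage_cost + egress_cost + one_time

options = {
    "Institutional research computing": yearly_cost(4200),  # midpoint of $3900-$4500
    "Amazon S3 Glacier Deep Archive":   yearly_cost(1500, egress_cost=900),
    "LTO-8 tape":                       yearly_cost(1500, one_time=7500),  # tapes + drive
}
for name, cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name}: ~${cost:,.0f}")
```

Note that the ranking flips over a multi-year horizon: the tape drive is a one-time cost, while cloud egress and storage fees recur, so extending the model to `n` years changes the cheapest option.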

11 of 16

Wrapping up research

12 of 16

What needs to happen?

  • Publishing, and the steps that need to happen before it
    • Understanding and meeting requirements for repository storage
      • Metadata
      • Permanent Identifiers (PIDs)
      • Curating your data (quality control again!)
      • Submitting data into repos (archiving)
      • Documenting analytical methods and parameters for reproducibility
    • Submitting to journal(s)
      • Publication fees
      • Data Accessibility Statement requirements by publishers (paywall or open access)

13 of 16

Data Archiving Costs

Easier to budget for than other costs

Public vs. non-public

Non-public: see previous slide on storage costs

  • Choice of archive matters
    • Disciplinary repositories may be a better fit but less sustainable due to their smaller size; check first
  • Some cost money, some are free
  • More at osf.io/jdzgc/

14 of 16

Data Curation Costs

Highly variable.

  • Depends on the type of data, complexity of the project, experience of the researcher with open science, metadata, etc.

Example 1: inexperienced student in the discipline who requires training on data sharing and publishing

  • $5000 + travel/training/stipend for summer student for “just” preparing a data package for publication

TERRA-REF project job posting for student at UA

A lot of the work is documentation

15 of 16

Data Curation Costs

Example 2: Experienced researcher in the ecological sciences with experience in open science

  • 35h for a relatively self-contained paper + dataset (stat. analysis in R)
  • $690 for archiving and open access costs

http://brunalab.org/blog/2014/09/04/the-opportunity-cost-of-my-openscience-was-35-hours-690/

| Item | Cost | Note |
| --- | --- | --- |
| Double-checking the main dataset and doing some reformatting to prepare it for submission | 5 h | Had already spent a fair amount of time reformatting it to best practices |
| Creating missing supplementary datafile and metadata | 3 h | Missing datafile may not have been needed after checking again for errors |
| Submission to Dryad | 0.75 h, $90 | |
| Preparing a geographic map of the locations in the Dryad submission | 1 h | Realized not everyone is familiar with locations in the dataset |
| Submission of map to Figshare | 0.25 h | |
| Revising and cleaning up code, uploading it to GitHub | 25 h | Much work needed to clean it up |
| Archiving code and making it citable with a DOI in Zenodo | 0.5 h | |
| Editing bibliography in paper to follow best practices for data and code citation | 0.5 h | |
| Open access costs | $600 | Article Processing Charges (APCs) |

16 of 16

How do I find what I need to budget for in my project?

Data Stewardship Wizard

  • Step-by-step questionnaire on all steps of data collection and sharing