1 of 27

Dataset Curation & Publication 2:

Reality vs Theory

PI: Maryann Martone

PRECISION Human Pain Network

2 of 27

Overview of process

Register study metadata with HEAL

Upload data + experimental metadata to SPARC

Publish data

View and search data through HEAL

human whole genome PHI?

BAM or fastq

Yes

No

Work on metadata w/DCIC Team and provide accession #

Public DOI

3 of 27

First datasets are now available through HEAL via SPARC

https://healdata.org/portal/discovery

https://doi.org/10.26275/ZQCB-QH3L

4 of 27

Journal requirements vs HEAL requirements

  • Some journals have requirements about where data sets can be deposited

Recent example:

    • A Nature journal required that transcriptomics data be deposited in dbGAP, not SPARC

  • Our current policy:
    • Primary (raw) data is deposited in dbGAP
    • Derived (non-identifiable) data is deposited in SPARC
    • Subject and sample identifiers must be synced across platforms (work with DCIC before submission)
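
A minimal sketch of the kind of identifier check that can be run before contacting the DCIC. The function name and the identifiers are hypothetical and for illustration only; the actual sync procedure is defined by the DCIC.

```python
def compare_identifiers(sparc_ids, dbgap_ids):
    """Return identifiers that appear on one platform but not the other."""
    sparc_set, dbgap_set = set(sparc_ids), set(dbgap_ids)
    return {
        "only_in_sparc": sorted(sparc_set - dbgap_set),
        "only_in_dbgap": sorted(dbgap_set - sparc_set),
    }


# Hypothetical subject identifiers, not taken from any real dataset
sparc_subjects = ["sub-001", "sub-002", "sub-003"]
dbgap_subjects = ["sub-001", "sub-003", "sub-004"]

report = compare_identifiers(sparc_subjects, dbgap_subjects)
print(report)
```

Any identifier listed in the report should be resolved before either submission goes forward.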

5 of 27

Actions

  • If you experience this issue, please contact mmartone@ucsd.edu
  • We will bring the issue to the HEAL Data Collective
  • Start data submission early!!!

6 of 27

Metadata

Let’s talk about it…

7 of 27

PRECISION Metadata Standard

  • Metadata standard V2.0 was approved in January

  • Documentation is now available via the PRECISION home page on SPARC

8 of 27

Data dictionary

Download both the metadata specification (data dictionary) and the template

Metadata template

9 of 27

  • Between HEAL and PRECISION metadata requirements for surgical subjects, there are:
    • Required: 99 fields
    • Recommended: 4 fields
    • If relevant: 2 fields
  • There are 10 mandatory fields for post mortem subjects (cadavers)

The Curation Team checks each field and flags any that are missing

Data dictionary

Metadata template

10 of 27

HEAL CDE requirements

  • HEAL requires some CDEs to be collected that should not be submitted for publication in SPARC
  • Collection instrument may not be the same as data submission template
  • It is the responsibility of the investigator to ensure that PHI is not shared with SPARC

11 of 27

Required ≠ optional

  • PRECISION investigators decided on the metadata standard; data curators will enforce it

  • Understand the metadata standard BEFORE you start collecting data

  • “I didn’t collect it” is not a valid excuse - we will provide codes for legitimately missing values

  • Full metadata must be provided for each dataset even if you received a sample from someone else

  • If the same sample or subject is used in more than one dataset, the same unique identifier should be supplied each time (critical for dbGAP submissions)

Subject 1

Sample 1

Dataset 1

Sample 2

Dataset 2
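
A small sketch of how identifier reuse across datasets could be checked, assuming each dataset lists a sample ID and a subject ID per record. The record layout, the identifiers, and "Dataset 3" are hypothetical.

```python
from collections import defaultdict


def find_conflicting_samples(records):
    """Return sample IDs that are linked to more than one subject ID."""
    subjects_by_sample = defaultdict(set)
    for record in records:
        subjects_by_sample[record["sample_id"]].add(record["subject_id"])
    return {
        sample: sorted(subjects)
        for sample, subjects in subjects_by_sample.items()
        if len(subjects) > 1
    }


# Hypothetical records: one subject contributes two samples to two datasets,
# and a third dataset reuses a sample ID under a different subject ID.
records = [
    {"dataset": "Dataset 1", "sample_id": "sam-1", "subject_id": "sub-1"},
    {"dataset": "Dataset 2", "sample_id": "sam-2", "subject_id": "sub-1"},
    {"dataset": "Dataset 3", "sample_id": "sam-1", "subject_id": "sub-9"},
]

conflicts = find_conflicting_samples(records)
print(conflicts)
```

A non-empty result means the same sample has been given inconsistent subject identifiers and should be corrected before submission.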

12 of 27

Complete Metadata: Required for Acceptance

  • The SPARC Curation Team reviews all datasets for completeness
  • The SPARC Curation Team flags and returns any dataset missing required fields
  • No exceptions - all required fields must be included
  • Submissions missing required fields will need revision before acceptance
  • Each returned submission adds a review cycle, and each review cycle adds 2-4 weeks to your publication timeline
  • Collecting missing data after the fact can be challenging
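
A rough sketch of a pre-submission completeness check, assuming the metadata is held in a CSV file. The field names below are placeholders; the real list of required fields comes from the PRECISION metadata standard.

```python
import csv
import io

# Placeholder field names; substitute the required fields from the standard
REQUIRED_FIELDS = ["subject_id", "age", "sex"]


def missing_required(rows, required=REQUIRED_FIELDS):
    """Map each 1-based row number to the required fields that are absent or blank."""
    problems = {}
    for number, row in enumerate(rows, start=1):
        missing = [field for field in required if not (row.get(field) or "").strip()]
        if missing:
            problems[number] = missing
    return problems


# Hypothetical metadata with one blank required value
sample_csv = "subject_id,age,sex\nsub-001,54,F\nsub-002,,M\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

problems = missing_required(rows)
print(problems)
```

Running a check like this before submission can catch gaps that would otherwise cost a review cycle.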

Metadata 3.0?

  1. Can it be automatically extracted, e.g., imaging metadata?
  2. Is it essential for search or comparison? I.e., does it have to be structured or can it be provided in a protocol?

13 of 27

Best practices: HEAL & PRECISION Metadata Standards

Current Standard: Version 2.0

  • Always consult the PRECISION Consortium metadata standard
  • Requirements vary by collection type and specimen source
  • Fields are categorized as Required, Recommended, or "If Relevant"
  • Identify your subject type before starting collection (samples collected from patients, versus samples collected postmortem)
  • Plan data capture to include all required fields
  • Collect recommended fields whenever possible
  • Document study-specific deviations from the standard

Comprehensive metadata collection at the point of acquisition prevents costly gaps that cannot be filled retrospectively.

Metadata 3.0?

  • Can it be automatically extracted, e.g., imaging metadata?
  • Is it essential for search or comparison? I.e., does it have to be structured or can it be provided in a protocol?

14 of 27

Metadata Header Standardization: Critical for Data Usability

Why Standardization Matters

  • Accurate interpretation: "Body Mass" ≠ "BMI" ≠ "Weight"
  • Interoperability: Systems can only map fields they recognize
  • Discoverability: Standard terms improve search functionality
  • Analysis integrity: Misidentified data leads to invalid conclusions

Best Practices

  • Always use exact header names specified in the standard
  • Maintain consistent capitalization and formatting
  • Adhere to controlled vocabularies for field values
  • Validate your metadata before submission

Common Pitfalls

  • Creating "similar" but non-standard headers
  • Using abbreviations not specified in the standard
  • Entering values that are not permitted
  • Assuming semantic equivalence between related terms
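
One way to catch "similar" but non-standard headers is an exact-match check with a closest-match hint. This is only a sketch: the header names are hypothetical, and the standard's data dictionary is the authority on exact spelling and capitalization.

```python
import difflib

# Hypothetical header names; the real ones come from the metadata standard
STANDARD_HEADERS = ["subject_id", "body_mass", "bmi", "weight"]


def check_headers(headers, standard=STANDARD_HEADERS):
    """Return {header: closest standard header or None} for every non-exact match."""
    issues = {}
    for header in headers:
        if header in standard:  # exact, case-sensitive match is the only pass
            continue
        close = difflib.get_close_matches(header.lower(), standard, n=1, cutoff=0.6)
        issues[header] = close[0] if close else None
    return issues


issues = check_headers(["subject_id", "BMI", "Body Mass", "notes"])
print(issues)
```

The suggestion is only a spelling hint. Whether two fields actually mean the same thing still has to be confirmed against the data dictionary.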

"In metadata, precision isn't just a virtue—it's a requirement for scientific reproducibility."

15 of 27

Best practice: Consult the data dictionary

16 of 27

HEAL requirement: Data dictionary

  • A data dictionary provides definitions and acceptable values for all variables in a dataset
  • HEAL requires a data dictionary for each dataset so that the HEAL platform can query variable-level metadata
  • We can use the PRECISION data dictionary across all PRECISION data provided that you conform to the metadata standard
  • Any additional variables will have to be defined
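
A minimal sketch of checking a dataset against a data dictionary, assuming the dictionary maps each variable to its permitted values. The variables and values shown are hypothetical, not taken from the PRECISION data dictionary.

```python
# Hypothetical dictionary: "allowed" is a set of permitted values,
# or None when the variable has no enumerated values.
DATA_DICTIONARY = {
    "sex": {"allowed": {"Male", "Female", "Unknown"}},
    "age": {"allowed": None},
}


def check_against_dictionary(rows, dictionary=DATA_DICTIONARY):
    """Return (undefined variables, list of (row, variable, value) not permitted)."""
    undefined = set()
    invalid = []
    for number, row in enumerate(rows, start=1):
        for variable, value in row.items():
            entry = dictionary.get(variable)
            if entry is None:
                undefined.add(variable)  # additional variable that needs a definition
            elif entry["allowed"] is not None and value not in entry["allowed"]:
                invalid.append((number, variable, value))
    return sorted(undefined), invalid


rows = [
    {"sex": "Female", "age": "54"},
    {"sex": "F", "age": "61", "pain_score": "7"},
]

undefined, invalid = check_against_dictionary(rows)
print(undefined, invalid)
```

Variables reported as undefined are the ones that would need their own definitions added before submission.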

17 of 27

Dataset publishing expectations

The Curation Team works on multiple submissions simultaneously

  • Once submitted, they will usually get initial feedback to you in ~7 to 10 days
  • At this point you will iterate on corrections and improvements. How long this takes depends on how well the data and protocol are prepared and on how quickly investigators respond
  • Curation team tries to give a rapid turnaround on feedback to modifications

18 of 27

Contributors

  • Create dataset on Pennsieve
  • Reserve DOI for manuscript publication (optional, DOI is non-functional at this point)
  • Convert images
  • Upload data and metadata files
  • Create protocol on protocols.io
  • Share dataset and protocol with Curation Team
  • Submit dataset for review and lock the dataset

Curation Team

ONCE PROTOCOL HAS BEEN SUBMITTED

  • Reviews protocol
  • ~7-10 days

  • ~1 week

  • Receives corrections
  • Performs final checks for alignment with SDS, HEAL, and Consortium requirements
  • Proofs dataset for publication
  • Releases dataset to SPARC portal (releases happen on FRIDAYS)
  • Respond to all curation requests
  • Make corrections
  • Provide missing information
  • Publish protocol and send DOI

It can take up to 3 days for DOI to resolve

  • Approve changes made by Curation Team
  • Authorize public data release

ONCE ALL DATA AND METADATA HAVE BEEN SUBMITTED

  • Inspects all data and metadata files

Provides INITIAL feedback to contributors

  • Up to 1 week AFTER PI responds

  • Highly variable, depends on quality of submission

19 of 27

Dataset publishing expectations

  • The first time you prepare a dataset, it can take a while; over time it gets faster and easier

  • Think of this as a process similar to publishing a manuscript; preparing the manuscript, submitting it for review and revising can take many months

  • The curation team is your reviewer; the platform is your publisher. Just as with an article, there are standards, quality checks and processes that must be performed

  • When you start preparing your manuscript, start preparing your dataset.

20 of 27

Summary

  • SPARC is the official HEAL-approved repository for PRECISION data, with the exception of raw human transcriptomic data, which goes to dbGAP
    • Consult with the DCIC before submitting to dbGAP to make sure the records in SPARC and dbGAP are in sync

  • All datasets submitted must conform to HEAL and PRECISION standards
    • Consult the PRECISION data dictionary BEFORE you start collecting data
    • Do NOT change variable names
    • Collect ALL required variables
    • The DCIC will provide a list of codes for missing values

21 of 27

Extra slides

22 of 27

Dataset publishing expectations

  • Once the Curation Team has performed final checks and proofed the dataset for publication, the Investigator must sign off on the changes before publication

  • Datasets are published on Fridays

  • It can take up to 24 h for dataset DOI to resolve
    • It will be on SPARC Portal - URL works, but search may not work for a few days
    • Some metadata in the Abstract section might be initially missing (e.g., subject information or experimental approach), but should appear within a week from the publication date

What happens when the dataset is not public

23 of 27

Dataset publishing tips

  • Consider submitting your dataset under embargo (for up to a year) before you start writing your manuscript to ensure it’s ready to be published once your manuscript is ready for review/publishing. Your dataset won’t be published until you are ready.

  • You can work on formatting your dataset and protocol in parallel and can reuse parts of protocols.

  • Cite the DOI in your paper, not the SPARC URL

https://doi.org/10.26275/pxwy-sric

24 of 27

Dataset publishing tips

  • Consult the Curation Team before submitting for the first time or submitting a new type of data. This will save you from extensive dataset restructuring

  • Explain your timeline and needs to prevent future delays

  • To save you time down the road the Curation Team will be happy to preview your dataset before you formally submit it

  • Don’t forget to share your dataset with the curation team or they won’t see it

25 of 27

Publish data through SPARC Portal

Upload data + experimental metadata to SPARC

View and search data through HEAL

Metadata will be findable on the HEAL platform, but published on the SPARC Portal (https://sparc.science/)

Data on Pennsieve

Protocols on protocols.io

Use SODA to prepare data for upload

26 of 27

SPARC Data Submission Steps

27 of 27

Dataset publishing tips

    • Give the Curation Team access to your dataset - they will not see it otherwise
    • Request to publish - this will lock the dataset and initiate formal review