1 of 27

Dataset Curation & Publication 2:

Reality vs Theory

PI: Maryann Martone

PRECISION Human Pain Network

2 of 27

Overview of process

Upload data + experimental metadata to SPARC

Publish data

View and search data through HEAL

human whole genome PHI?

BAM or fastq

Yes

Work on metadata w/DCIC Team

and provide accession #

Public DOI

3 of 27

First datasets are now available through HEAL via SPARC

https://healdata.org/portal/discovery

https://doi.org/10.26275/ZQCB-QH3L

4 of 27

Journal requirements vs HEAL requirements

Some journals have requirements about where data sets can be deposited

Recent example:

A Nature journal said transcriptomics data must be deposited in dbGAP, not SPARC

Our current policy:

Primary (raw) data is deposited in dbGAP
Derived (non-identifiable) data is deposited in SPARC
Subject and sample identifiers must be synced across platforms (work with DCIC before submission)
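
One way to check that identifiers are in sync before submission is to compare the two ID lists directly. The sketch below is only an illustration: the function name `compare_ids` and the identifiers are hypothetical, and the real lists would come from your SPARC metadata files and your dbGAP submission.

```python
def compare_ids(sparc_ids, dbgap_ids):
    """Return the identifiers present on one platform but not the other."""
    sparc, dbgap = set(sparc_ids), set(dbgap_ids)
    return {
        "only_in_sparc": sorted(sparc - dbgap),
        "only_in_dbgap": sorted(dbgap - sparc),
    }


# Hypothetical identifiers, for illustration only.
sparc_subjects = ["sub-001", "sub-002", "sub-003"]
dbgap_subjects = ["sub-001", "sub-003", "sub-004"]

report = compare_ids(sparc_subjects, dbgap_subjects)
print(report)
```

Two empty lists in the report would mean the identifier sets match; anything else is worth raising with the DCIC before submitting.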

https://www.nature.com/nature-portfolio/editorial-policies/reporting-standards#availability-of-data

5 of 27

Actions

If you experience this issue, please contact mmartone@ucsd.edu
Will bring the issue to the HEAL Data Collective
Start data submission early!!!

6 of 27

Metadata

Let’s talk about it…

7 of 27

PRECISION Metadata Standard

Metadata standard V2.0 was approved in January

Documentation is now available via the PRECISION home page on SPARC

8 of 27

Data dictionary

Download both the metadata specification (data dictionary) and the template

Metadata template

9 of 27

Between HEAL and PRECISION metadata requirements for surgical subjects, there are:

Required: 99 fields
Recommended: 4 fields
If relevant: 2 fields

There are 10 mandatory fields for post mortem subjects (cadavers)

Curation Team checks it and marks it off if missing

Data dictionary

Metadata template

10 of 27

HEAL CDE requirements

HEAL requires some CDEs to be collected that should not be submitted for publication in SPARC
Collection instrument may not be the same as data submission template
It is the responsibility of the investigator to ensure that PHI is not shared with SPARC

https://docs.sparc.science/docs/is-sparc-hipaa-compliant

11 of 27

Required ≠ optional

PRECISION investigators decided on the metadata standard; data curators will enforce it

Understand the metadata standard BEFORE you start collecting data

“I didn’t collect it” is not a valid excuse - we will provide codes for legitimately missing values

Full metadata must be provided for each dataset even if you received a sample from someone else

If the same sample or subject is used, a unique identifier should be supplied (critical for dbGAP submissions)

Subject 1

Sample 1

Dataset 1

Sample 2

Dataset 2
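
The subject, sample, and dataset relationship above can be sketched as a small index keyed on the subject identifier. This is an illustration under assumptions: the record layout and the identifiers are hypothetical and are not taken from the PRECISION standard.

```python
from collections import defaultdict

# Hypothetical records: (dataset, subject_id, sample_id).
# The same subject contributes one sample to each of two datasets.
records = [
    ("dataset-1", "sub-001", "sam-001"),
    ("dataset-2", "sub-001", "sam-002"),
]


def datasets_by_subject(records):
    """Group dataset names by the subject identifier they share."""
    index = defaultdict(set)
    for dataset, subject_id, _sample_id in records:
        index[subject_id].add(dataset)
    return {subject: sorted(names) for subject, names in index.items()}


shared = datasets_by_subject(records)
print(shared)
```

The grouping only works if the subject keeps the same identifier in every dataset, which is the point of the requirement.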

12 of 27

Complete Metadata: Required for Acceptance

The SPARC Curation Team reviews all datasets for completeness
The SPARC Curation Team flags and returns any dataset missing required fields
No exceptions - all required fields must be included
Submissions missing required fields will need revision before acceptance
Returning submissions extends timelines by 2-4 weeks per review cycle
Each review cycle adds time to your publication timeline
Collecting missing data after the fact can be challenging
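
A quick self-check before submission can catch blank required fields early. The sketch below is a minimal example, not the Curation Team's actual procedure: the field names in `REQUIRED_FIELDS` are hypothetical placeholders, and the real list comes from the PRECISION data dictionary.

```python
import csv
import io

# Hypothetical required headers; replace with those in the data dictionary.
REQUIRED_FIELDS = ["subject_id", "sex", "age"]


def find_missing(rows, required=REQUIRED_FIELDS):
    """Return (row_number, field) pairs where a required field is absent or blank."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for field in required:
            value = row.get(field)
            if value is None or not value.strip():
                problems.append((number, field))
    return problems


# Made-up metadata table with one blank required value.
sample = "subject_id,sex,age\nsub-001,F,54\nsub-002,,61\n"
rows = list(csv.DictReader(io.StringIO(sample)))
problems = find_missing(rows)
print(problems)
```

An empty result means every required field has some value; it does not confirm that the values themselves are permitted.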

Metadata 3.0?

Can it be automatically extracted, e.g., imaging metadata?
Is it essential for search or comparison? I.e., does it have to be structured or can it be provided in a protocol?

13 of 27

Best practices: HEAL & PRECISION Metadata Standards

Current Standard: Version 2.0

Always consult the PRECISION Consortium metadata standard
Requirements vary by collection type and specimen source
Fields are categorized as Required, Recommended, or "If Relevant"
Identify your subject type before starting collection (samples collected from patients, versus samples collected postmortem)
Plan data capture to include all required fields
Collect recommended fields whenever possible
Document study-specific deviations from the standard

Comprehensive metadata collection at the point of acquisition prevents costly gaps that cannot be filled retrospectively.

Metadata 3.0?

Can it be automatically extracted, e.g., imaging metadata?
Is it essential for search or comparison? I.e., does it have to be structured or can it be provided in a protocol?

14 of 27

Metadata Header Standardization: Critical for Data Usability

Why Standardization Matters

Accurate interpretation: "Body Mass" ≠ "BMI" ≠ "Weight"
Interoperability: Systems can only map fields they recognize
Discoverability: Standard terms improve search functionality
Analysis integrity: Misidentified data leads to invalid conclusions

Best Practices

Always use exact header names specified in the standard
Maintain consistent capitalization and formatting
Adhere to controlled vocabularies for field values
Validate your metadata before submission

Common Pitfalls

Creating "similar" but non-standard headers
Using abbreviations not specified in the standard
Entering values that are not permitted
Assuming semantic equivalence between related terms
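
Header checks of this kind are easy to automate. The following sketch flags any header that is not an exact match and offers the closest standard name as a hint only. The entries in `STANDARD_HEADERS` are hypothetical; the real header names are defined in the PRECISION metadata standard.

```python
import difflib

# Hypothetical standard headers, for illustration only.
STANDARD_HEADERS = ["subject_id", "body_mass", "bmi", "weight"]


def check_headers(headers, standard=STANDARD_HEADERS):
    """Map each non-standard header to a list with its closest standard name, if any."""
    issues = {}
    for header in headers:
        if header not in standard:  # exact, case-sensitive comparison
            issues[header] = difflib.get_close_matches(
                header, standard, n=1, cutoff=0.6
            )
    return issues


issues = check_headers(["subject_id", "Body_Mass", "BMI", "wt"])
for header, suggestions in issues.items():
    print(header, "->", suggestions or "no close match")
```

The suggestion is a prompt for a human to check the data dictionary, not a safe automatic rename: a header that looks similar to a standard one may still describe a different measurement.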

"In metadata, precision isn't just a virtue—it's a requirement for scientific reproducibility."

15 of 27

Best practice: Consult the data dictionary

16 of 27

HEAL requirement: Data dictionary

A data dictionary provides definitions and acceptable values for all variables in a dataset
HEAL requires a data dictionary for each dataset so that the HEAL platform can query variable-level metadata
We can use the PRECISION data dictionary across all PRECISION data provided that you conform to the metadata standard
Any additional variables will have to be defined
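
As a rough illustration of what a data dictionary makes possible, the sketch below stores a definition and the acceptable values for each variable, then reports variables that still need defining and values that fall outside the permitted set. The entries are hypothetical and are not taken from the PRECISION data dictionary.

```python
# Hypothetical dictionary entries; "allowed" is None for free-form values.
DATA_DICTIONARY = {
    "sex": {"definition": "Sex of the subject", "allowed": {"F", "M"}},
    "age": {"definition": "Age in years", "allowed": None},
}


def undefined_variables(headers, dictionary=DATA_DICTIONARY):
    """Return the headers that have no entry in the data dictionary."""
    return [header for header in headers if header not in dictionary]


def invalid_values(row, dictionary=DATA_DICTIONARY):
    """Return the variables in a row whose value is not in the permitted set."""
    bad = {}
    for name, value in row.items():
        entry = dictionary.get(name)
        if entry and entry["allowed"] is not None and value not in entry["allowed"]:
            bad[name] = value
    return bad


extra = undefined_variables(["sex", "age", "pain_score"])
bad = invalid_values({"sex": "female", "age": "54"})
print(extra, bad)
```

Here `pain_score` would need its own definition, and `female` would need to be replaced by a permitted value.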

https://zenodo.org/records/14725464

17 of 27

Dataset publishing expectations

The Curation Team works on multiple submissions simultaneously

Once you submit, the team will usually send you initial feedback in ~7 to 10 days
At this point you will iterate on corrections and improvements. How long this takes depends on how well the data and protocol are prepared and on the responsiveness of the investigators
The Curation Team tries to turn around feedback on modifications rapidly

18 of 27

Contributors

Create dataset on Pennsieve
Reserve DOI for manuscript publication (optional, DOI is non-functional at this point)
Convert images
Upload data and metadata files
Create protocol on protocols.io
Share dataset and protocol with Curation Team
Submit dataset for review and lock the dataset

Curation Team

ONCE PROTOCOL HAS BEEN SUBMITTED

Reviews protocol

~7-10 days

~1 week

Receives corrections
Performs final checks for alignment with SDS, HEAL, and Consortium requirements
Proofing dataset for publication

Releases dataset to SPARC portal (releases happen on FRIDAYS)

Respond to all curation requests
Make corrections
Provide missing information
Publish protocol and send DOI*

*It can take up to 3 days for the DOI to resolve

Approve changes made by Curation Team
Authorize public data release

ONCE ALL DATA, METADATA HAS BEEN SUBMITTED

Inspects all data and metadata files

Provide INITIAL feedback to contributors

Up to 1 week AFTER

PI responds

Highly variable, depends on quality of submission

19 of 27

Dataset publishing expectations

The first time you prepare a dataset, it can take a while; over time it gets faster and easier

Think of this as a process similar to publishing a manuscript; preparing the manuscript, submitting it for review and revising can take many months

The curation team is your reviewer; the platform is your publisher. Just as with an article, there are standards, quality checks and processes that must be performed

When you start preparing your manuscript, start preparing your dataset.

20 of 27

Summary

SPARC is the official HEAL-approved repository for PRECISION data, with the exception of raw human transcriptomic data, which goes to dbGAP

Consult with the DCIC before submitting to dbGAP to make sure the records in SPARC and dbGAP are in sync

All datasets submitted must conform to HEAL and PRECISION standards

Consult the PRECISION data dictionary BEFORE you start collecting data
Do NOT change variable names
Collect ALL required variables
The DCIC will provide a list of codes for missing values

21 of 27

Extra slides

22 of 27

Dataset publishing expectations

Once the Curation Team has performed final checks and proofed the dataset for publication, the Investigator must sign off on the changes before publication

Datasets are published on Fridays

It can take up to 24 h for dataset DOI to resolve

It will be on SPARC Portal - URL works, but search may not work for a few days
Some metadata in the Abstract section might be initially missing (e.g., subject information or experimental approach), but should appear within a week from the publication date
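
For planning purposes, the earliest release date after sign-off can be estimated with a little date arithmetic. This is a sketch under an assumption that is not stated in the slides: that a dataset signed off on a Friday could be released the same day. The function name `next_friday` is hypothetical.

```python
from datetime import date, timedelta


def next_friday(signed_off_on):
    """Return the first Friday on or after the given date."""
    # date.weekday() numbers Monday as 0, so Friday is 4.
    days_ahead = (4 - signed_off_on.weekday()) % 7
    return signed_off_on + timedelta(days=days_ahead)


release = next_friday(date(2025, 1, 1))  # 1 January 2025 was a Wednesday
print(release.isoformat())
```

Add the DOI and search delays listed above to this date when telling a journal when the dataset will be fully reachable.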

What happens when the dataset is not public

23 of 27

Dataset publishing tips

Consider submitting your dataset under embargo (for up to a year) before you start writing your manuscript to ensure it’s ready to be published once your manuscript is ready for review/publishing. Your dataset won’t be published until you are ready.

You can work on formatting your dataset and protocol in parallel and can reuse parts of protocols.

Cite the DOI in your paper, not the SPARC URL

https://doi.org/10.26275/pxwy-sric

24 of 27

Dataset publishing tips

Consult the Curation Team before submitting for the first time or when submitting a new type of data. This will save you from extensive dataset restructuring

Explain your timeline and needs to prevent future delays

To save you time down the road the Curation Team will be happy to preview your dataset before you formally submit it

Don’t forget to share your dataset with the curation team or they won’t see it

25 of 27

Publish data through

SPARC Portal

Upload data + experimental metadata to SPARC

View and search data through HEAL

https://docs.sparc.science/docs/data-submission-walkthrough

metadata will be findable on the

HEAL platform,

but published on the

SPARC Portal (https://sparc.science/)

Data on

Pennsieve

Protocols

on protocols.io

Use SODA to prepare data for upload
