1 of 27

Intro to the new SPARC dataset structure (SDS) 2.0

June 28th, 2021

Dr. Anita Bandrowski, University of California San Diego

Dr. Bhavesh Patel, California Medical Innovations Institute

Dr. Anna Pilko, University of California San Diego

Facilitating: Dr. Jyl Boline, K-Core Project Manager


2 of 27


SPARC consortium mandates a data sharing policy ensuring FAIR data sharing principles:

Findable, Accessible, Interoperable and Reusable

Image: https://book.fosteropenscience.eu/

Has a persistent identifier
Has rich metadata
Is searchable and discoverable online

Is retrievable online using standardized protocols
Stores the data safely
Describes the data appropriately (metadata)

Common formats and standards
Controlled vocabularies

Is well-documented
Has clear license and provenance information

3 of 27

Motivation of Organizational Structure of SPARC datasets

~50 awards

>80

institutions

Complex and diverse data

20 organs

6 species

Over 10K files in some datasets

Various types (electrophysiology & physiology, microscopy, morphology & histology, and transcriptomics)

4 of 27

SPARC investigators taught us

Experiments are complicated and can be organized in many different ways
Many files/file types per subject
Members of the same lab may use different ways to arrange data
Difficulty sharing data within and across large scale projects

Our templates are still insufficient to handle all cases
Our templates have ambiguity

5 of 27

SPARC Dataset Structure 1.2.3

Structure is currently driven by data obtained from experimentation
Provides method to organize and name dataset files
Enforces mandatory dataset and file descriptions

Contains a single “dataset” object

Required Collection

Data goes into the Primary folder
“Primary” folder is required

Additional Collections

5 optional folders to organize additional information

Standard SDS Descriptor Files

4 required files

Additional Descriptor Files

2 optional files

Provided as a downloadable, versioned template; described in a white paper

6 of 27

What is coming in September? Release Notes V 2.0

Major changes

New requirements for imaging metadata for microscopy data
Introduction of unique subject and sample identifiers
Modified SDS structure for computational data

Minor changes based on improvements of the current 1.2.3 version.

Making currently optional files mandatory (manifests)
Renaming columns to normalize names, remove confusion, and improve clarity

Aligning the optional subjects and samples columns with openMinds, Dandi, NEMO, and BCDC. Other additional columns were selected based on the most common columns appearing in existing subjects and samples sheets.
Removing description and example rows from templates

Documentation available: https://github.com/SciCrunch/sparc-curation/releases/tag/dataset-template-2.0.0

Dataset templates that support this release:

https://github.com/SciCrunch/sparc-curation/releases/download/dataset-template-2.0.0/DatasetTemplate.zip

7 of 27

SDS 1.2.3 vs. SDS 2.0

Figure: side-by-side folder trees for SDS 1.2.3 and SDS 2.0. Both trees show the top-level folders primary, code, docs, source, protocol, and derivative, and the top-level files dataset_description.xlsx, subjects.xlsx, samples.xlsx, submission.xlsx, CHANGES.txt, and README.txt. Inside primary, subject folders (sub-1, sub-2) and sample folders (sam-1, sam-2, and so on) hold the data files alongside manifest files. The diagram also lists code_description.xlsx and code_parameters.xlsx.

Bold = required

Blue = conditional (if applicable)

Green = optional

New metadata files specific to code based and computational datasets

resources.xlsx

performances.xlsx

New optional metadata files

8 of 27

Top Level Folders (Dataset)


primary: Required folder that contains the primary data to be used for further analysis after any initial transformations from raw source data. The structure of this folder must match the subjects, samples, and pools described in the top-level metadata files (e.g., subjects.xlsx and samples.xlsx).

protocol: Optional folder that contains supporting documentation for the protocol already submitted to protocols.io.

docs: Optional folder that contains other supporting documents, including the representative picture for the dataset and any figures the researchers wish to include.

code: Optional folder that contains all the source code used in the study, such as text and source code files (e.g., MATLAB), and any supporting code.

derivative: Optional folder that contains data derived from the data in the primary data folder. For example, image stacks that are annotated via the MBF tools, or smoothed overlays of current and voltage that demonstrate a particular effect.

source: Optional folder that contains raw data prior to conversion to the format contained in the primary data folder.

Top level folder structure is exactly the same!
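As a rough illustration of the folder rules on this slide, the sketch below classifies a list of top-level folder names. The function name and return shape are hypothetical and are not part of SODA or the SPARC validator; only the folder names and their required/optional status come from the slide.

```python
# Folder names and their status are taken from the slide above.
REQUIRED_FOLDERS = {"primary"}
OPTIONAL_FOLDERS = {"code", "derivative", "source", "protocol", "docs"}


def check_top_level_folders(folder_names):
    """Return (missing_required, unrecognized) as two sorted lists.

    Hypothetical helper: `folder_names` is any iterable of the folder
    names found directly under the dataset root.
    """
    present = set(folder_names)
    missing_required = sorted(REQUIRED_FOLDERS - present)
    unrecognized = sorted(present - REQUIRED_FOLDERS - OPTIONAL_FOLDERS)
    return missing_required, unrecognized


# Example: "primary" is present, "scratch" is not an SDS folder.
print(check_top_level_folders(["primary", "docs", "scratch"]))
```

In practice the names could be collected with `pathlib.Path(root).iterdir()`; the check itself is kept free of file-system access so it is easy to try out.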

9 of 27

Top Level Files (Dataset)

subjects.xlsx (Subjects): Excel file that contains information about the subjects involved in the data collection.

samples.xlsx (Samples): Excel file that contains information about the samples involved in the data collection. Each sample should reference a subject from the subjects file.

dataset_description.xlsx (Dataset Description): Excel file that contains information to describe the dataset and make it citable. The link to your protocol file needs to be included.

submission.xlsx (Submission Information): Excel file that contains information to describe the submission and related milestone information.

code_description.xlsx (Code Description): Excel file that contains information to describe the code in terms of its quality. Code RRIDs and ontologies are required.

code_parameters.xlsx (Code Parameters): Excel file that contains information describing the specific parameters needed to run the code.

performances.xlsx (Performance): Excel file that contains information about performances or experimental conditions for subjects contained in perf- folders.

resources.xlsx (Resources): Excel file that contains information to describe the resources used in the experiments: RRID, URL, vendor, version, and additional metadata.


10 of 27

Dataset_Description Changes

Not all fields apply to code-related submissions.

New study information section

Template 1.2.3

Template 2.0

11 of 27

Template 2.0

SODA will surface these automatically!

sparc.science

12 of 27

Dataset_Description Excel File

Everything in green is required; yellow is optional

Visual cues:

fill / not fill

experimental by default; can be computational

DOI for protocol required

Fields that only accept a single value have had additional cells grayed out to indicate that additional values should not be provided.

Add gray coloring to mark cases where only a single value is allowed.
Remove the non-default width that was set on all columns. It caused errors when opening the file in LibreOffice Calc because 1024 columns were referenced and Calc doesn't support that many.
Remove incorrect embedded authoring metadata.
Rename Metadata Version DO NOT REMOVE -> Metadata Version
Move Metadata Version to second row and mark it blue.
Add Type row which currently only accepts experimental or computational as a value. Set the default to experimental since most computational datasets are likely to generate their metadata.
Rename Name -> Title to make it consistent with internal naming.
Move Funding to follow Keywords
Change Funding to be optional.
Rename Acknowledgements -> Acknowledgments.
Move Acknowledgments to follow Funding
Change Acknowledgments to be optional.
Add Study purpose.
Add Study data collection.
Add Study primary conclusion.
Add Study organ system.
Add Study approach.
Add Study technique.
Rename Title for complete data set -> Study collection title.
Rename Contributors -> Contributor Name
Rename Contributor ORCID ID -> Contributor ORCiD
Change Contributor role value ContactPerson -> CorrespondingAuthor.
Change Contributor role cells for Value 1 and Value 2 to be PrincipalInvestigator and CorrespondingAuthor. Need to consider whether to make DataManager required as well. Will need to note to wranglers that the ordering of contributors is the order they will appear in on the dataset publication, so it is OK to move the PI and CA records to their appropriate place.
Delete Is contact person. Redundant with CorrespondingAuthor role.
Delete Originating Article DOI.
Delete Protocol URL or DOI.
Delete Additional Links.
Delete Link Description.
Add Identifier replacing Originating Article DOI, Protocol URL or DOI, and Additional Links.
Add Identifier type.
Change Identifier type cells for Value 1 and Value 2 to be HasProtocol and IsDescribedBy, replacing Protocol URL or DOI and Originating Article DOI.
Add Relation type matching the DataCite Relation type from this dataset to the related identifier.
Add Identifier description replacing Link Description. This ensures that there is no ambiguity about which link the description applies to; in the previous template the description could refer to the originating article DOI, the protocol DOI, or the first additional link.
Delete Completeness of data set.
Delete Parent dataset ID. This is replaced by querying for datasets that use the same protocol. Other relations can be added via related identifiers.

13 of 27

Code_Description Excel File

New Metadata fields, to describe the code submitted in terms of quality.
Ten Simple Rules guidelines are included: https://www.imagwiki.nibib.nih.gov/content/10-simple-rules-conformance-rubric
Code RRIDs and ontologies are required
Metadata format is based on the dataset_submission.xlsx format
This solution encourages creating validated, documented, and version-controlled code without restricting submitters

14 of 27

Code_Parameters Excel File

This will contain specific parameter information needed to run the code including:

Ontologies
Data types
Data units

15 of 27

Submission Excel File

The dataset needs to become public 1 year after the milestone completion date
Only the owner of the dataset is able to publish it
Only a request to publish will initiate curation

If the dataset is not part of a milestone, use N/A
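The one-year rule above is simple date arithmetic. The sketch below is only an illustration: the function name is hypothetical, and the handling of a February 29 completion date (falling back to February 28) is an assumption, not something stated in the SPARC policy.

```python
from datetime import date


def latest_public_date(milestone_completion):
    """Return the date one year after the milestone completion date."""
    next_year = milestone_completion.year + 1
    try:
        return milestone_completion.replace(year=next_year)
    except ValueError:
        # February 29 has no counterpart in a non-leap year.
        # Assumption: fall back to February 28 of the following year.
        return milestone_completion.replace(year=next_year, day=28)


# Example with the date of this webinar as a stand-in milestone date.
print(latest_public_date(date(2021, 6, 28)))
```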

16 of 27

Is subject 1 = subject 1?

17 of 27

Subjects and Samples Excel Files

Subjects:

Samples:

Removed description and example rows
The headings of columns are all normalized
Cells in blue and green are required; yellow is optional. Please provide as much information as you can, and feel free to add additional yellow columns as needed
The IDs in the subjects and samples files should match the folder names in the primary and derivative folders
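One way to picture the last point is as a comparison between the IDs listed in a metadata file and the folder names found on disk. The sketch below assumes both have already been read into plain lists; the function name and the result keys are hypothetical, and whether every listed subject or sample must have its own folder depends on the study design.

```python
def compare_ids_to_folders(metadata_ids, folder_names):
    """Report IDs without a folder and folders without an ID.

    Hypothetical helper: `metadata_ids` would come from subjects.xlsx or
    samples.xlsx, `folder_names` from the primary or derivative folder.
    """
    ids = set(metadata_ids)
    folders = set(folder_names)
    return {
        "ids_without_folder": sorted(ids - folders),
        "folders_without_id": sorted(folders - ids),
    }


# Example: sub-2 is described but has no folder; sub-3 has a folder
# but is not described in the subjects file.
print(compare_ids_to_folders(["sub-1", "sub-2"], ["sub-1", "sub-3"]))
```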

18 of 27

Subjects Changes

Add “member of” for cases where we need to include a specimen in a population.

Add “also in dataset” for including the Pennsieve ID(s) of other datasets that have data about the same specimen.

Template 1.2.3

Template 2.0

19 of 27

Samples Changes

Add “member of” for cases where we need to include a specimen in a population.

Add “laboratory internal id” to provide a mapping for groups that have incompatible internal identifier conventions.

Add “also in dataset” for including the Pennsieve ID(s) of other datasets that have data about the same specimen.

Template 1.2.3

Template 2.0

20 of 27

Folder naming and requirements for subject and sample identifiers

Figure: example primary folder containing the subject folders sub-pig1, sub-pig2, and sub-pig3, the sample folders sam-3444 through sam-3450, and a manifest file.

Sample IDs must be unique within the dataset
IDs cannot include special characters or spaces; only 0-9, A-Z, a-z, and - (hyphen) characters are allowed
IDs must have the prefix sub- (subjects) or sam- (samples)
Each folder name must exactly match the corresponding subject or sample ID
Sample folders must be inside the corresponding subject folder
Each data file must be listed in the main manifest with an adequate description
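The character, prefix, and uniqueness rules above translate fairly directly into a regular-expression check. This is a minimal sketch rather than the Curation Team's validator: the function name and message wording are hypothetical, and it assumes at least one character must follow the prefix.

```python
import re


def validate_ids(ids, prefix):
    """Return a list of problems found in `ids` for the given prefix.

    `prefix` is "sub-" for subjects or "sam-" for samples. After the
    prefix, only 0-9, A-Z, a-z, and hyphen are accepted.
    """
    pattern = re.compile(re.escape(prefix) + r"[0-9A-Za-z-]+")
    problems = []
    seen = set()
    for identifier in ids:
        if not pattern.fullmatch(identifier):
            problems.append(f"{identifier!r}: bad prefix or disallowed character")
        if identifier in seen:
            problems.append(f"{identifier!r}: duplicate ID")
        seen.add(identifier)
    return problems


# Example: one ID contains a space, one is repeated, one uses an underscore.
for problem in validate_ids(["sam-3444", "sam 3445", "sam-3444", "sub_pig1"], "sam-"):
    print(problem)
```

The folder-nesting and manifest rules need the actual directory tree and manifest contents, so they are left out of this sketch.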

21 of 27

SPARC Data Structure (SDS)

		primary	Contains folders named to match the identifiers for subjects and/or samples depending on the study design
		dataset_description.xlsx	Contains the study metadata to describe the dataset, including but not limited to, a short description of the study, contributors, associated journal articles and a protocols.io URL
		samples.xlsx	Contains information about samples involved in the data collection.
		subjects.xlsx	Contains information on every subject involved in the data collection.
		submission.xlsx	Contains information to describe the submission and related milestone information.
		performances.xlsx	Contains information about performances and experimental parameters for perf- folders.
		resources.xlsx	Contains information about resources used in the experiments.
		CHANGES	Contains information about the history of the dataset or changes since initial publication.
		README (required)	Contains additional documentation about dataset.
		code	Contains all the source code used in the study, such as text and source code files.
		derivative	Contains derived data files, e.g., image stacks that are annotated via the MBF tools or smoothed overlays of current and voltage that demonstrate a particular effect.
		docs	Contains all the supporting documents for the dataset, including but not limited to, a representative image.
		protocol	Contains supporting documentation of the protocol submitted to protocols.io.
		source	Contains data prior to any conversion (“truly” raw k-space data from MRI, raw images from microscopy dataset, etc.).

My SPARC Formatted Experimental Dataset

Required
Required Structured Metadata Files
Optional
Optional Descriptive Files (Must contain Manifest file, Readme file optional)

22 of 27

Introduction to SODA

SODA (Software to Organize Data Automatically)

Goal: Simplify data curation and sharing for SPARC researchers

Open-source desktop computer software

Available for Windows, macOS, and Linux

Easily accessible: https://github.com/bvhpatel/SODA

Centralize all resources and information into single interface

Break curation process into logical and easy to perform steps

Include automation to reduce effort and errors

23 of 27

Overview of the software

Prepare dataset on Pennsieve

(banner image, license, subtitle, etc.)

Prepare metadata files

(submission, dataset_description, etc.)

Organize data files according to SDS

and upload them on Pennsieve

Share with the Curation Team

for review

1

2

3

4

SciCrunch

NCBI

Protocols.io

Pennsieve

Back-end: Connected to SPARC resources

Front-end: Intuitive user interface

24 of 27

Quick showcase of the actual interface

25 of 27

What’s next for SODA?

Transition to SDS 2.0
Integrate the validator developed by the Curation Team
Make the user interface even more intuitive
Etc.

We are reaching out to each SPARC group to learn how we can improve

26 of 27

Resources

Everything about SODA: https://github.com/bvhpatel/SODA
Everything curation: https://github.com/SciCrunch/sparc-curation

SDS 2.0 release notes, dataset-template-2.0.0

SPARC Support Center: https://sparc.science/help

Contact:

Anna Pilko: apilkocuration@gmail.com
Bhavesh Patel: bpatel@calmi2.org

Curation office hours held by Anna on Zoom:

Tuesday 9-10 AM PST
Thursday 1-2 PM PST

Curation Team


Monique Surles Zeigler

SPARC Terminology

Tom Gillespie

MIS, curation pipeline

Anita Bandrowski

Curation lead

Anna Pilko

SPARC curator

SODA Team


Bhavesh Patel

SODA lead

Tram Ngo

Developer

Sanjay Soundarajan

Developer

27 of 27

Learn about future webinars and tutorials at sparc.science/news-and-events

Thank You