Intro to the new SPARC dataset structure (SDS) 2.0
June 28th, 2021
Dr. Anita Bandrowski, University of California San Diego
Dr. Bhavesh Patel, California Medical Innovations Institute
Dr. Anna Pilko, University of California San Diego
Facilitating: Dr. Jyl Boline, K-Core Project Manager
The SPARC consortium mandates a data sharing policy that follows the FAIR data sharing principles:
Findable, Accessible, Interoperable, and Reusable
Image: https://book.fosteropenscience.eu/
Motivation of Organizational Structure of SPARC datasets
~50 awards
>80 institutions
Complex and diverse data
20 organs
6 species
Over 10K files in some datasets
Various types (electrophysiology & physiology, microscopy, morphology & histology, and transcriptomics)
SPARC investigators taught us
SPARC Dataset Structure 1.2.3
Contains a single "dataset" object
Required Collection
Additional Collections
Standard SDS Descriptor Files
Additional Descriptor Files
Provided as a downloadable, versioned template; described in a white paper
What is coming in September? Release Notes V 2.0
Major changes
Minor changes based on improvements of the current 1.2.3 version.
Documentation available: https://github.com/SciCrunch/sparc-curation/releases/tag/dataset-template-2.0.0
Dataset templates that support this release:
https://github.com/SciCrunch/sparc-curation/releases/download/dataset-template-2.0.0/DatasetTemplate.zip
[Figure: folder trees for SDS 1.2.3 and SDS 2.0, side by side. Both list the top-level folders primary, code, docs, source, protocol, and derivative, and the top-level files dataset_description.xlsx, subjects.xlsx, samples.xlsx, submission.xlsx, CHANGES.txt, and README.txt. One tree also lists code_description.xlsx and code_parameters.xlsx. Under primary, the trees show subject folders (sub-1, sub-2) and sample folders (sam-1 to sam-4) holding data files, with manifest files throughout.]
Bold = required
Blue = conditional (if applicable)
Green = optional
New metadata files specific to code-based and computational datasets: code_description.xlsx, code_parameters.xlsx
New optional metadata files: resources.xlsx, performances.xlsx
Top Level Folders (Dataset)
primary: Required folder that contains the primary data to be used for further analysis after any initial transformations from raw source data. The structure of this folder must match the subjects, samples, and pools described in the top-level metadata files (e.g., subjects.xlsx and samples.xlsx).
code: Optional folder that contains all the source code used in the study, such as text and source code (e.g., MATLAB), and any supporting code.
derivative: Optional folder that contains data derived from the data in the primary folder, for example image stacks annotated via the MBF tools, or smoothed overlays of current and voltage that demonstrate a particular effect.
source: Optional folder that contains raw data prior to conversion to the format contained in the primary folder.
protocol: Optional folder that contains supporting documentation for the protocol already submitted to protocols.io.
docs: Optional folder that contains other supporting documents, including a representative image for the dataset and any figures the researchers wish to include.
Top level folder structure is exactly the same!
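A minimal sketch of how the folder rules above could be checked on a local copy of a dataset. The folder names and their required/optional status come from the slide; the function name `check_top_level_folders` and the idea of flagging unrecognized folders are illustrative assumptions, not part of SDS or of any SPARC tool.

```python
from pathlib import Path
import tempfile

# Folder names and required/optional status as listed on the slide.
REQUIRED_FOLDERS = {"primary"}
OPTIONAL_FOLDERS = {"code", "derivative", "docs", "protocol", "source"}


def check_top_level_folders(dataset_root):
    """Return (missing_required, unrecognized) top-level folder names.

    Hypothetical helper for illustration only.
    """
    present = {p.name for p in Path(dataset_root).iterdir() if p.is_dir()}
    missing = sorted(REQUIRED_FOLDERS - present)
    unrecognized = sorted(present - REQUIRED_FOLDERS - OPTIONAL_FOLDERS)
    return missing, unrecognized


if __name__ == "__main__":
    # Build a throwaway dataset that lacks "primary" and has a stray folder.
    with tempfile.TemporaryDirectory() as root:
        for name in ("code", "docs", "scratch"):
            (Path(root) / name).mkdir()
        missing, unrecognized = check_top_level_folders(root)
        print("missing required:", missing)      # ['primary']
        print("unrecognized:", unrecognized)     # ['scratch']
```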
Top Level Files (Dataset)
subjects.xlsx (Subjects): Excel file that contains information about the subjects involved in the data collection.
samples.xlsx (Samples): Excel file that contains information about the samples involved in the data collection. Each sample should reference a subject from the subjects file.
dataset_description.xlsx (Dataset Description): Excel file that contains information to describe the dataset and make it citable. The link to your protocol file needs to be included.
submission.xlsx (Submission Information): Excel file that contains information to describe the submission and related milestone information.
code_description.xlsx (Code Description): Excel file that contains information to describe the code in terms of its quality. Code RRIDs and ontologies are required.
code_parameters.xlsx (Code Parameters): Excel file that contains information describing the specific parameters needed to run the code.
performances.xlsx (Performance): Excel file that contains information about performance or experimental conditions for subjects contained in perf- folders.
resources.xlsx (Resources): Excel file that contains information to describe the resources used in the experiments (RRID, URL, vendor, version, additional metadata).
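The rule that each sample should reference a subject from the subjects file can be sketched as a simple cross-check. This example works on rows already loaded as dictionaries rather than reading the Excel files; the column names `subject id` and `sample id` and the function name `find_orphan_samples` are assumptions made for illustration and should be adjusted to the template in use.

```python
def find_orphan_samples(subjects, samples,
                        subject_key="subject id", sample_key="sample id"):
    """Return the ids of samples whose subject is not in the subjects rows.

    `subjects` and `samples` are lists of dicts, one per spreadsheet row.
    The default column names are assumptions for this sketch.
    """
    known_subjects = {row[subject_key] for row in subjects}
    return [row[sample_key] for row in samples
            if row[subject_key] not in known_subjects]


if __name__ == "__main__":
    subjects = [{"subject id": "sub-1"}, {"subject id": "sub-2"}]
    samples = [
        {"sample id": "sam-1", "subject id": "sub-1"},
        {"sample id": "sam-2", "subject id": "sub-3"},  # no such subject
    ]
    print(find_orphan_samples(subjects, samples))  # ['sam-2']
```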
This Photo by Unknown Author is licensed under CC BY-SA
Dataset_Description Changes
Not all fields apply to code-related submissions.
New study information section
Template 1.2.3
Template 2.0
SODA will surface these automatically!
sparc.science
Dataset_Description Excel File
Everything in green is required; yellow is optional
Visual cues:
fill / not fill
experimental by default; can be computational
DOI for protocol required
Code_Description Excel File
Code_Parameters Excel File
This will contain specific parameter information needed to run the code including:
Submission Excel File
If the dataset is not part of a milestone, use N/A
Is subject 1 = subject 1?
Subjects and Samples Excel Files
Subjects:
Samples:
Subjects Changes
Added "member of" for cases where a specimen needs to be included in a population.
Added "also in dataset" to record the Pennsieve ID(s) of other datasets that have data about the same specimen.
Template 1.2.3
Template 2.0
Samples Changes
Added "member of" for cases where a specimen needs to be included in a population.
Added "laboratory internal id" to provide a mapping for groups that have incompatible internal identifier conventions.
Added "also in dataset" to record the Pennsieve ID(s) of other datasets that have data about the same specimen.
Template 1.2.3
Template 2.0
[Figure: a primary folder with subject folders sub-pig1, sub-pig2, and sub-pig3, sample folders sam-3444 through sam-3450, and a manifest file.]
Folder naming and requirements for subject and sample identifiers
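As a rough illustration of the naming requirement, the sketch below compares the folder names found under primary with the identifiers listed in the metadata files. The `sub-` and `sam-` prefixes follow the examples on the slides; treating them as the only prefixes, the function name `compare_folders_to_ids`, and the nesting used in the demo are assumptions of this example.

```python
from pathlib import Path
import tempfile

# Prefixes taken from the slide examples (sub-pig1, sam-3444); treating them
# as the only identifier prefixes is an assumption of this sketch.
ID_PREFIXES = ("sub-", "sam-")


def compare_folders_to_ids(primary_dir, subject_ids, sample_ids):
    """Return (folders_without_id, ids_without_folder) as sorted lists."""
    expected = set(subject_ids) | set(sample_ids)
    found = {
        p.name
        for p in Path(primary_dir).rglob("*")
        if p.is_dir() and p.name.startswith(ID_PREFIXES)
    }
    return sorted(found - expected), sorted(expected - found)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: one sample folder nested in a subject folder.
        primary = Path(root) / "primary"
        (primary / "sub-pig1" / "sam-3444").mkdir(parents=True)
        (primary / "sub-pig2").mkdir()
        extra, missing = compare_folders_to_ids(
            primary, ["sub-pig1", "sub-pig3"], ["sam-3444"]
        )
        print("folders without a metadata row:", extra)   # ['sub-pig2']
        print("identifiers without a folder:", missing)   # ['sub-pig3']
```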
SPARC Dataset Structure (SDS)
| Folder or file | Contents |
|---|---|
| primary | Contains folders named to match the identifiers for subjects and/or samples, depending on the study design. |
| dataset_description.xlsx | Contains the study metadata that describes the dataset, including but not limited to a short description of the study, contributors, associated journal articles, and a protocols.io URL. |
| samples.xlsx | Contains information about samples involved in the data collection. |
| subjects.xlsx | Contains information on every subject involved in the data collection. |
| submission.xlsx | Contains information to describe the submission and related milestone information. |
| performances.xlsx | Contains information about performances and experimental parameters for perf- folders. |
| resources.xlsx | Contains information about resources used in the experiments. |
| CHANGES | Contains information about the history of the dataset or changes since initial publication. |
| README (required) | Contains additional documentation about the dataset. |
| code | Contains all the source code used in the study, such as text and source code. |
| derivative | Contains derived data files, e.g., image stacks that are annotated via the MBF tools or smoothed overlays of current and voltage that demonstrate a particular effect. |
| docs | Contains all the supporting documents for the dataset, including but not limited to a representative image. |
| protocol | Contains supporting documentation of the protocol submitted to protocols.io. |
| source | Contains data prior to any conversion ("truly" raw k-space data from MRI, raw images from a microscopy dataset, etc.). |
[Figure: "My SPARC Formatted Experimental Dataset". Row groups: Required; Required Structured Metadata Files; Optional; Optional Descriptive Files (must contain a manifest file; README file optional). The diagram, titled "Top Level Folders Content", shows subject folders sub-1 and sub-2, sample folders sam-1, sam-2, and sam-3, and boxes labeled Data, Code, Processed Data, Raw Data, and Files, with Manifest.xlsx files throughout. One element is marked "Top Level Only".]
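The manifest requirement can be spot-checked with a short script. This is a sketch under stated assumptions: it treats any file whose name (without extension, case-insensitive) is "manifest" as a manifest, and it only inspects folders that directly contain at least one file. Which folders must carry a manifest is defined by the SDS documentation, not by this example, and `folders_missing_manifest` is a hypothetical helper name.

```python
from pathlib import Path
import tempfile


def folders_missing_manifest(dataset_root):
    """List subfolders that hold files but no manifest file.

    Paths are returned relative to dataset_root, in sorted order.
    """
    missing = []
    folders = sorted(p for p in Path(dataset_root).rglob("*") if p.is_dir())
    for folder in folders:
        files = [f for f in folder.iterdir() if f.is_file()]
        has_manifest = any(f.stem.lower() == "manifest" for f in files)
        if files and not has_manifest:
            missing.append(folder.relative_to(dataset_root).as_posix())
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        sam1 = Path(root) / "primary" / "sub-1" / "sam-1"
        sam2 = Path(root) / "primary" / "sub-1" / "sam-2"
        for folder in (sam1, sam2):
            folder.mkdir(parents=True)
            (folder / "data.csv").write_text("placeholder")
        (sam1 / "manifest.xlsx").write_text("placeholder")
        print(folders_missing_manifest(root))  # ['primary/sub-1/sam-2']
```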
Introduction to SODA
SODA (Software to Organize Data Automatically)
Goal: Simplify data curation and sharing for SPARC researchers
Open-source desktop computer software
Available for Windows, macOS, and Linux
Easily accessible: https://github.com/bvhpatel/SODA
Centralize all resources and information into a single interface
Break the curation process into logical, easy-to-perform steps
Include automation to reduce effort and errors
Overview of the software
1. Prepare dataset on Pennsieve (banner image, license, subtitle, etc.)
2. Prepare metadata files (submission, dataset_description, etc.)
3. Organize data files according to SDS and upload them to Pennsieve
4. Share with the Curation Team for review
Back-end: Connected to SPARC resources (SciCrunch, NCBI, Protocols.io, Pennsieve)
Front-end: Intuitive user interface
Quick showcase of the actual interface
What’s next for SODA?
We are reaching out to each SPARC group to learn how we can improve
Resources
Contact:
Curation Team
Monique Surles Zeigler: SPARC terminology
Tom Gillespie: MIS, curation pipeline
Anita Bandrowski: curation lead
Anna Pilko: SPARC curator
SODA Team
Bhavesh Patel: SODA lead
Tram Ngo: Developer
Sanjay Soundarajan: Developer
Learn about future webinars and tutorials at sparc.science/news-and-events
Thank You