1 of 27

Intro to the new SPARC dataset structure (SDS) 2.0

June 28th, 2021

Dr. Anita Bandrowski, University of California San Diego

Dr. Bhavesh Patel, California Medical Innovations Institute

Dr. Anna Pilko, University of California San Diego

Facilitating: Dr. Jyl Boline, K-Core Project Manager

2 of 27

The SPARC consortium mandates a data sharing policy that follows the FAIR data sharing principles:

Findable, Accessible, Interoperable and Reusable

Image: https://book.fosteropenscience.eu/

  • Has a persistent identifier
  • Has rich metadata
  • Is searchable and discoverable online

  • Is retrievable online using standardized protocols
  • Stores the data safely
  • Describes the data appropriately (metadata)

  • Common formats and standards
  • Controlled vocabularies
  • Is well-documented
  • Has clear license and provenance information

3 of 27

Motivation of Organizational Structure of SPARC datasets

~50 awards

>80 institutions

Complex and diverse data

20 organs

6 species

Over 10K files in some datasets

Various types (electrophysiology & physiology, microscopy, morphology & histology, and transcriptomics)

4 of 27

SPARC investigators taught us

  • Experiments are complicated and can be organized in many different ways
  • Many files/file types per subject
  • Members of the same lab may use different ways to arrange data
  • Difficulty sharing data within and across large scale projects

  • Our templates are still insufficient to handle all cases
  • Our templates have ambiguity

5 of 27

SPARC Dataset Structure 1.2.3

  • Structure is currently driven by data obtained from experimentation
  • Provides a method to organize and name dataset files
  • Enforces mandatory dataset and file descriptions

Contains a single “dataset” object

Required Collection

  • Data goes into the Primary folder
  • “Primary” folder is required

Additional Collections

  • 5 optional folders to organize additional information

Standard SDS Descriptor Files

  • 4 required files

Additional Descriptor Files

  • 2 optional files

Provided as a downloadable versioned template; Described in a white paper

6 of 27

What is coming in September? Release Notes V 2.0

Major changes

  • New requirements for imaging metadata for microscopy data
  • Introduction of unique subject and sample identifiers
  • Modified SDS structure for computational data

Minor changes based on improvements of the current 1.2.3 version.

  • Make currently optional files (manifests) mandatory
  • Rename columns to normalize names, remove confusion, and enhance clarity
    • Optional columns in the subjects and samples sheets are aligned with openMINDS, DANDI, NEMO, and BCDC. Other additional columns were selected from the columns that appear most often in existing subjects and samples sheets.
    • Remove description and example rows from templates

Documentation available: https://github.com/SciCrunch/sparc-curation/releases/tag/dataset-template-2.0.0

Dataset templates that support this release: 

https://github.com/SciCrunch/sparc-curation/releases/download/dataset-template-2.0.0/DatasetTemplate.zip

7 of 27

SDS 1.2.3 vs. SDS 2.0

[Figure: side-by-side folder trees for SDS 1.2.3 and SDS 2.0. Both trees show the top-level folders primary, code, docs, source, protocol, and derivative, and the top-level files dataset_description.xlsx, subjects.xlsx, samples.xlsx, submission.xlsx, CHANGES.txt, and README.txt. Under primary, subject folders (sub-1, sub-2) contain sample folders with the data files, and manifest files appear at each folder level. In one tree the sample folders sam-1 and sam-2 repeat under each subject; in the other the sample folders are numbered sam-1 to sam-4 across subjects.]

Bold = required

Blue = conditional (if applicable)

Green = optional

New metadata files specific to code-based and computational datasets: code_description.xlsx, code_parameters.xlsx

New optional metadata files: resources.xlsx, performances.xlsx

8 of 27

Top Level Folders (Dataset)

primary: Required folder that contains the primary data to be used for further analysis after any initial transformations from raw source data. The structure of this folder must match the subjects, samples, and pools described in the top level metadata files (e.g., subjects.xlsx and samples.xlsx).

protocol: Optional folder that contains supporting documentation of the protocol already submitted to protocols.io.

docs: Optional folder that contains other supporting documents, including the representative picture for the dataset, any figures the researchers wish to include, etc.

code: Optional folder that contains all the source code used in the study (e.g., MATLAB), as well as any supporting code.

derivative: Optional folder that contains data derived from the data in the primary data folder. For example, image stacks that are annotated via the MBF tools, or smoothed overlays of current and voltage that demonstrate a particular effect.

source: Optional folder that contains raw data prior to conversion to the format contained in the primary data folder.

Top level folder structure is exactly the same!
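
The folder rules on this slide can be checked mechanically. The following is a minimal sketch, not part of SODA or the curation validator: the function name and the idea of reporting unexpected folders are assumptions made for illustration, while the folder names come from the slide.

```python
from pathlib import Path

# Folder names taken from the slide; only "primary" is required.
REQUIRED_FOLDERS = {"primary"}
OPTIONAL_FOLDERS = {"code", "derivative", "source", "protocol", "docs"}


def check_top_level_folders(dataset_root):
    """Return (missing_required, unexpected) folder names under dataset_root.

    Hypothetical helper: it only looks at directory names directly under
    dataset_root and does not inspect their contents.
    """
    present = {p.name for p in Path(dataset_root).iterdir() if p.is_dir()}
    missing = sorted(REQUIRED_FOLDERS - present)
    unexpected = sorted(present - REQUIRED_FOLDERS - OPTIONAL_FOLDERS)
    return missing, unexpected
```

Calling check_top_level_folders on a dataset folder returns two lists; both are empty when the required folder is present and no other folder names are used.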

9 of 27

Top Level Files (Dataset)

subjects.xlsx (Subjects): Excel file that contains information about the subjects involved in the data collection

samples.xlsx (Samples): Excel file that contains information about the samples involved in the data collection (each sample should reference a subject from the subjects file)

dataset_description.xlsx (Dataset Description): Excel file that contains information to describe the dataset and make it citable. The link to your protocol file needs to be included.

submission.xlsx (Submission Information): Excel file that contains information to describe the submission and related milestone information

code_description.xlsx (Code Description): Excel file that contains information to describe the code in terms of its quality. Code RRIDs and ontologies are required.

code_parameters.xlsx (Code Parameters): Excel file containing information describing the specific parameters needed to run the code

performances.xlsx (Performance): Excel file containing information about performance or experimental conditions for subjects contained in perf- folders

resources.xlsx (Resources): Excel file containing information to describe resources used in the experiments: RRID, URL, vendor, version, additional metadata
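
The rule that each sample should reference a subject from the subjects file can be expressed as a simple cross-check. This is a sketch under assumptions: the rows are shown as plain dictionaries rather than read from the Excel files, and the column headers "subject id" and "sample id" are assumed here, so check them against the 2.0 template before reusing this.

```python
def find_orphan_samples(subject_rows, sample_rows):
    """Return sample ids whose subject id is not listed in the subjects sheet.

    Both arguments are lists of dicts, one dict per spreadsheet row.
    The keys "subject id" and "sample id" are assumed column headers.
    """
    known_subjects = {row["subject id"] for row in subject_rows}
    return [
        row["sample id"]
        for row in sample_rows
        if row["subject id"] not in known_subjects
    ]


subjects = [{"subject id": "sub-1"}, {"subject id": "sub-2"}]
samples = [
    {"sample id": "sam-1", "subject id": "sub-1"},
    {"sample id": "sam-2", "subject id": "sub-3"},  # sub-3 is not in subjects
]
print(find_orphan_samples(subjects, samples))  # prints ['sam-2']
```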

10 of 27

Dataset_Description Changes

Not all fields apply to code-related submissions.

New study information section

Template 1.2.3

Template 2.0

11 of 27

Template 2.0

SODA will surface these automatically!

sparc.science

12 of 27

Dataset_Description Excel File

Everything in green is required; yellow is optional

Visual cues:

fill / not fill

experimental by default; can be computational

DOI for protocol required

13 of 27

Code_Description Excel File

  • New metadata fields describe the submitted code in terms of quality.
  • Ten Simple Rules guidelines are included: https://www.imagwiki.nibib.nih.gov/content/10-simple-rules-conformance-rubric
  • Code RRIDs and ontologies are required
  • Metadata format is based on the dataset_submission.xlsx format
  • This solution encourages creating validated, documented, and version-controlled code without restricting submitters

14 of 27

Code_Parameters Excel File

This will contain specific parameter information needed to run the code including:

    • Ontologies
    • Data types
    • Data units

15 of 27

Submission Excel File

  • The dataset needs to become public 1 year after the milestone completion date
  • The only person able to publish the dataset is the owner of the dataset
  • Only a request to publish will initiate curation

If the dataset is not part of a milestone, use N/A
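
As a worked example of the one-year rule, the sketch below computes the latest public release date from a milestone completion date. The function name is hypothetical, and mapping February 29 to February 28 of the following year is an assumption, not something the policy states.

```python
from datetime import date


def public_release_deadline(milestone_completion):
    """Return the date one year after the milestone completion date."""
    next_year = milestone_completion.year + 1
    try:
        return milestone_completion.replace(year=next_year)
    except ValueError:
        # Only reached for Feb 29; fall back to Feb 28 (assumed convention).
        return milestone_completion.replace(year=next_year, day=28)


print(public_release_deadline(date(2021, 6, 28)))  # prints 2022-06-28
```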

16 of 27

Is subject 1 = subject 1?

17 of 27

Subjects and Samples Excel Files

Subjects:

Samples:

  • Removed description and example rows
  • The headings of columns are all normalized
  • Cells in blue and green are required; yellow cells are optional, but please provide as much information as you can. Feel free to add additional yellow columns as needed
  • The names in the subjects and samples files should match the folders in the primary and derivative folders
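
The matching rule in the last bullet can be sketched as a comparison between the identifiers listed in the sheets and the folder names on disk. This is an illustrative helper with a hypothetical name; it assumes the identifiers have already been read from subjects.xlsx and samples.xlsx, and it treats any sub- or sam- folder at any depth as relevant.

```python
from pathlib import Path


def compare_ids_to_folders(primary_dir, subject_ids, sample_ids):
    """Compare sheet identifiers with sub-/sam- folder names under primary_dir."""
    folder_names = {
        p.name
        for p in Path(primary_dir).rglob("*")
        if p.is_dir() and p.name.startswith(("sub-", "sam-"))
    }
    sheet_ids = set(subject_ids) | set(sample_ids)
    return {
        # Listed in the sheets but no folder carries that name.
        "ids_without_folder": sorted(sheet_ids - folder_names),
        # Folder exists but the name is not listed in either sheet.
        "folders_without_id": sorted(folder_names - sheet_ids),
    }
```

The same function can be pointed at the derivative folder. Whether every identifier needs its own folder depends on the study design, so the two lists are better read as items to review than as errors.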

18 of 27

Subjects Changes

Added a "member of" column for cases where a specimen needs to be included in a population.

Added an "also in dataset" column for the Pennsieve ID(s) of other datasets that have data about the same specimen.

Template 1.2.3

Template 2.0

19 of 27

Samples Changes

Added a "member of" column for cases where a specimen needs to be included in a population.

Added a "laboratory internal id" column to provide a mapping for groups that have incompatible internal identifier conventions.

Added an "also in dataset" column for the Pennsieve ID(s) of other datasets that have data about the same specimen.

Template 1.2.3

Template 2.0

20 of 27

[Figure: example primary folder containing the subject folders sub-pig1, sub-pig2, and sub-pig3, the sample folders sam-3444 through sam-3450 nested inside them, and a manifest file.]

Folder naming and requirements for subject and sample identifiers

  • Sample ids must be unique for the dataset
  • IDs cannot include special characters or spaces; only 0-9, A-Z, a-z, and - (hyphen) characters are allowed
  • IDs must have the prefixes sub- (subjects) and sam- (samples)
  • Folder names must exactly match the subject/sample ID
  • Sample folders must be inside the corresponding subject folder
  • Each data file must be listed in the main manifest with an adequate description
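
The identifier rules above (allowed characters, required prefix, uniqueness) can be sketched as a small check. This is an illustration only, not the Curation Team's validator; the function name and the message wording are assumptions.

```python
import re
from collections import Counter

# Characters allowed after the prefix: 0-9, A-Z, a-z, and hyphen.
ALLOWED = re.compile(r"[0-9A-Za-z-]+")


def validate_ids(ids, prefix):
    """Return a list of problems for ids that must all start with prefix.

    prefix is "sub-" for subject ids or "sam-" for sample ids.
    """
    problems = []
    for identifier in ids:
        if not identifier.startswith(prefix):
            problems.append(f"{identifier!r}: missing prefix {prefix!r}")
        elif not ALLOWED.fullmatch(identifier[len(prefix):]):
            problems.append(f"{identifier!r}: only 0-9, A-Z, a-z and '-' are allowed")
    for identifier, count in Counter(ids).items():
        if count > 1:
            problems.append(f"{identifier!r}: appears {count} times, must be unique")
    return problems
```

For example, validate_ids(["sam-3444", "sam-34_46", "3447", "sam-3444"], "sam-") reports three problems: an underscore, a missing prefix, and a duplicate.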

21 of 27

SPARC Data Structure (SDS)

primary

Contains folders named to match the identifiers for subjects and/or samples depending on the study design

dataset_description.xlsx

Contains the study metadata to describe the dataset, including but not limited to a short description of the study, contributors, associated journal articles, and a protocols.io URL

samples.xlsx

Contains information about samples involved in the data collection.

subjects.xlsx

Contains information on every subject involved in the data collection.

submission.xlsx

Contains information to describe the submission and related milestone information.

performances.xlsx

Contains information about performances and experimental parameters for perf- folders.

resources.xlsx

Contains information about resources used in the experiments.

CHANGES

Contains information about the history of the dataset or changes since initial publication.

README (required)

Contains additional documentation about the dataset.

code

Contains all the source code used in the study.

derivative

Contains derived data files, e.g., image stacks that are annotated via the MBF tools or smoothed overlays of current and voltage that demonstrate a particular effect.

docs

Contains all the supporting documents for the dataset, including but not limited to, a representative image.

protocol

Contains supporting documentation of the protocol submitted to protocols.io.

source

Contains data prior to any conversion (“truly” raw k-space data from MRI, raw images from microscopy dataset, etc.).

[Figure: "My SPARC Formatted Experimental Dataset". Legend: Required; Required Structured Metadata Files; Optional; Optional Descriptive Files. The panel "Top Level Folders Content" (each folder must contain a manifest file; a README file is optional) expands the folders to show data organized as sub-1 (sam-1, sam-2) and sub-2 (sam-3), plus code, processed data, and raw data files, each level with its own manifest.xlsx.]

22 of 27

Introduction to SODA

SODA (Software to Organize Data Automatically)

Goal: Simplify data curation and sharing for SPARC researchers

Open-source desktop computer software

Available for Windows, macOS, and Linux

Centralize all resources and information into single interface

Break curation process into logical and easy to perform steps

Include automation to reduce effort and errors

23 of 27

Overview of the software

1. Prepare dataset on Pennsieve (banner image, license, subtitle, etc.)

2. Prepare metadata files (submission, dataset_description, etc.)

3. Organize data files according to SDS and upload them on Pennsieve

4. Share with the Curation Team for review

Back-end: Connected to SPARC resources (SciCrunch, NCBI, Protocols.io, Pennsieve)

Front-end: Intuitive user interface

24 of 27

Quick showcase of the actual interface

25 of 27

What’s next for SODA?

  • Transition to SDS 2.0
  • Integrate the validator developed by the Curation Team
  • Make the user interface even more intuitive
  • Etc.

We are reaching out to each SPARC group to learn how we can improve

26 of 27

Resources

Contact:

  • Curation office hours held by Anna on Zoom:
    • Tuesday 9-10 AM PST
    • Thursday 1-2 PM PST

Curation Team

  • Monique Surles Zeigler: SPARC terminology
  • Tom Gillespie: MIS, curation pipeline
  • Anita Bandrowski: curation lead
  • Anna Pilko: SPARC curator

SODA Team

  • Bhavesh Patel: SODA lead
  • Tram Ngo: developer
  • Sanjay Soundarajan: developer

27 of 27

Learn about future webinars and tutorials at sparc.science/news-and-events

Thank You