1 of 127

Venkata Chandrasekhar Nainala (Chandu)

Friedrich Schiller University Jena

Update: January 2025

2 of 127

3 of 127

40 Projects ~ 500 Compounds ~ 5500 Spectra

PUBLIC

PRIVATE / EMBARGO

4 of 127

nmrXiv

CURATION WORKFLOW

5 of 127

Curation workflow - Semi automatic

Human in the loop approach

RAW FILES

PROCESSED FILES ~ STANDARD FORMATS

LONG TERM ARCHIVAL

Version 3.0

Version 2.0

Version 1.0

WEB

API

AI/ML tools

CASE

SEARCH ENGINES

STEP 1: FILE UPLOAD

ARCHIVAL / PUBLISHING

STEP 2: ASSIGNMENTS & META-DATA

STEP 3: VALIDATION

6 of 127

Data ingestion & staging

STEP 1: FILE UPLOAD

7 of 127

Metadata Normalization

STEP 2: ASSIGNMENTS & META-DATA

8 of 127

Data Validation

Failing any one of the requirements counts as a validation failure and requires further human attention.

STEP 3: VALIDATION

MIChI: Minimum Information about Chemical Investigation

Missing minimum info

File Integrity Checks

Checksums match

Missing files

Meta-Data Checks

Citation

Author

License
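The checks on this slide can be sketched as a single validation pass that collects every failure for the human curator instead of stopping at the first one. This is a minimal sketch, not the nmrXiv implementation: the function name, the report class, and the choice of MD5 are assumptions, and the required fields are simply the three metadata checks listed above.

```python
import hashlib
from dataclasses import dataclass, field

# Assumed required fields: the three metadata checks named on the slide.
REQUIRED_METADATA = ("citation", "author", "license")


@dataclass
class ValidationReport:
    failures: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures


def validate_submission(files: dict[str, bytes],
                        expected_md5: dict[str, str],
                        metadata: dict[str, str]) -> ValidationReport:
    """Run file integrity and metadata checks, collecting all failures."""
    report = ValidationReport()
    # File integrity: every expected file must exist and match its checksum.
    for name, digest in expected_md5.items():
        if name not in files:
            report.failures.append(f"missing file: {name}")
        elif hashlib.md5(files[name]).hexdigest() != digest:
            report.failures.append(f"checksum mismatch: {name}")
    # Metadata: every required field must be present and non-empty.
    for key in REQUIRED_METADATA:
        if not metadata.get(key, "").strip():
            report.failures.append(f"missing metadata: {key}")
    return report
```

A report with a non-empty `failures` list corresponds to the "further human attention is required" branch of the workflow.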

9 of 127

nmrXiv

COMPOUND

10 of 127

Caffeine

11 of 127

Caffeine - Spectra

12 of 127

Caffeine - Metadata

13 of 127

nmrXiv

UPDATES

14 of 127

15 of 127

nmrXiv

NEW DATA SUBMISSION

16 of 127

17 of 127

Onboarding Screen (Primer)

18 of 127

Upload (Drag & Drop or File Browser)

19 of 127

Parallel uploads (strict validations & error tracking)

20 of 127

Parallel uploads (strict validations & error tracking)

21 of 127

Missing & Corrupt file checks

22 of 127

Auto-Processing Spectra

23 of 127

Auto-Import metadata

(compound information: .mol, .sdf, .nmredata.sdf)
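Importing metadata from an .sdf or .nmredata.sdf file comes down to reading the data items that follow the molfile block. The sketch below is a small standard-library reader for one SDF record, not the nmrXiv importer. The function name is hypothetical, and the handling of a trailing backslash on value lines assumes the NMReDATA convention.

```python
import re

# A data item header looks like "> <TAG>"; the tag sits between angle brackets.
_TAG = re.compile(r"^>.*<([^>]+)>")


def read_sdf_data_items(record: str) -> dict[str, str]:
    """Return the data items (the '> <TAG>' blocks) of one SDF record."""
    items: dict[str, str] = {}
    tag = None
    values: list[str] = []
    for line in record.splitlines():
        if line.startswith("$$$$"):      # end-of-record marker
            break
        match = _TAG.match(line)
        if match:
            tag, values = match.group(1), []
        elif tag is not None:
            if line.strip():
                # Assumption: NMReDATA value lines end with a backslash.
                values.append(line.rstrip().rstrip("\\"))
            else:                        # a blank line closes the item
                items[tag] = "\n".join(values)
                tag = None
    if tag is not None:                  # last item without a closing blank line
        items[tag] = "\n".join(values)
    return items
```

Lines of the molfile block are skipped because no tag is open while they are read, so the same function works for plain .sdf files.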

24 of 127

Auto-Import metadata

(compound information: .mol, .sdf, .nmredata.sdf)

25 of 127

Validation Report

26 of 127

Validation Report

27 of 127

Multi-Spectra Views

28 of 127

29 of 127

30 of 127

Project or Independent Sample submissions

31 of 127

Ontology-driven organism details (including organism part)

32 of 127

Samples Overview

33 of 127

Embargo mode - Team Sharing

34 of 127

35 of 127

Optimised processing - notifications

36 of 127

37 of 127

nmrXiv

SEARCH

38 of 127

MIChI Recommendations (Draft Version)

https://docs.google.com/spreadsheets/d/1MxCceGO3UUAvIn-GWxxgeOUFnR34A3ileqIW3sYgZNU/edit#gid=0

39 of 127

Advanced Search

Project

Sample

Assay(s)

Molecule(s)

  • Protocol description
  • Measurement parameters
    • Solvent
    • Temperature
    • Field Strength
    • Instrument
    • Ionization / Collection method
  • Molecular Formula
  • InChI
  • SMILES
    • Similarity
    • Substructure
    • Exact
  • Molecular Weight
  • Name / Synonyms
  • HOSE Code
  • Descriptors*
  • CAS
  • Spectrum Type
  • Spectrum category
  • Spectra Processing parameters
  • Raw Data types
  • Processed Data types

Sample study

Spectral Dataset(s)

  • Keywords
  • Description
  • Organism
  • Organism part
  • Submitter
  • Author
  • Citation
  • License Type
  • Quality Score
  • Timestamps

  • Keywords
  • Description
  • Organism
  • Organism part
  • Submitter
  • Author
  • Citation
  • ID
  • License Type
  • Timestamps

Collection

  • Keywords
  • Description
  • ID
  • Name
  • ID

40 of 127

Structure

Search

RDKit Based

~ Exact

~ Substructure

~ Similarity
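Of the three search modes, similarity search is the one that needs a scoring function. Fingerprint similarity is commonly scored with the Tanimoto coefficient, and the sketch below shows that ranking step on fingerprints given as sets of on-bits. It is a stand-in with no RDKit dependency: in practice the fingerprints would come from RDKit, and both the function names and the 0.5 threshold are assumptions.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_search(query: set[int],
                      library: dict[str, set[int]],
                      threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)
```

An exact match is the special case where the score is 1.0. Substructure search needs graph matching and is not covered by this sketch.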

41 of 127

Structure Search

Results

Browse all samples

reporting the compound

in one view

~ Compare spectra from

different samples

(UI updates pending)

42 of 127

nmrXiv

BIOSCHEMAS METADATA STRUCTURE

43 of 127

Bioschemas Metadata Structure

Project

Repo

Study/

Sample

Dataset/

Spectrum

hasPart (1 => n)

hasPart (1 => n)

isPartOf (1)

isPartOf (1)

includedInDataCatalog (1)

Study

Study

ISA

ISA

Dataset

ISA

DataCatalog

ISA

Organization

CreativeWork

Person

publisher

author

citation

ChemicalSubstance: sample

about

MolecularEntity: molecules

hasBioChemEntityPart

'NMR solvent': 'NMR:1000330'

CDCl3

'acquisition nucleus': 'NMR:1400083'

13C

Other than the measurementTechnique (url), everything is a PropertyValue.

Dimension, probe, Temperature, frequency, field strength, number of scans, experiment…

variableMeasured

Properties: inChI, inChIKey, iupacName, molecularFormula, molecularWeight, smiles, mol (hasRepresentation), Percentage composition (description)

nmrXiv Object

Bioschemas Type

Bioschemas Type
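The mapping on this slide can be made concrete with a small JSON-LD document for one spectrum, built here as a Python dictionary. It is a sketch under assumptions: the identifiers and the measurementTechnique URL are placeholders. The solvent and nucleus entries reuse the ontology IDs shown above, and the remaining keys are ordinary schema.org property names.

```python
import json


def property_value(name: str, property_id: str, value: str) -> dict:
    """One variableMeasured entry, expressed as a schema.org PropertyValue."""
    return {"@type": "PropertyValue", "name": name,
            "propertyID": property_id, "value": value}


def spectrum_dataset(identifier: str, study_id: str) -> dict:
    """JSON-LD for a Dataset/Spectrum that is part of a Study."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": identifier,                      # placeholder ID
        "isPartOf": {"@type": "Study", "identifier": study_id},
        # Placeholder: the slide only states that this property is a URL.
        "measurementTechnique": "https://example.org/nmr-technique",
        "variableMeasured": [
            property_value("NMR solvent", "NMR:1000330", "CDCl3"),
            property_value("acquisition nucleus", "NMR:1400083", "13C"),
        ],
    }


document = json.dumps(spectrum_dataset("D0001", "S0001"), indent=2)
```

Further parameters such as temperature, field strength, or number of scans would be added as more `PropertyValue` entries in the same list.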

44 of 127

ARCHITECTURE OVERVIEW

45 of 127

nmrXiv Core

integrations

Web Application

Application Database

------

NMR Database

Search

Importers /

Exporters

Cache

SPA

Front end

File format converters

AI/ML

Tools

API

Workflows

Plugins

OAuth ~ SSO / AAI

Schemas

46 of 127

File format converters

AI/ML

Tools

Workflows

Plugins

Schemas

47 of 127

File format conversions

File format converters

48 of 127

Prediction Service

AI/ML

HOSE Codes

Lookup tables

Prediction / CASE

Assignments

CASE

Software /

Prediction
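The "HOSE Codes, Lookup tables, Prediction" path can be illustrated with a toy lookup. This is a simplified sketch and not the prediction service: the atom environment is represented as a tuple of opaque sphere strings rather than real HOSE code syntax, the table layout is hypothetical, and falling back to fewer spheres when the full environment is unknown is an assumption about the general approach.

```python
from statistics import mean
from typing import Optional

# Hypothetical table: (nucleus, environment spheres) -> observed shifts in ppm.
ShiftTable = dict[tuple[str, tuple[str, ...]], list[float]]


def predict_shift(table: ShiftTable, nucleus: str, spheres: tuple[str, ...],
                  min_spheres: int = 1) -> Optional[tuple[float, int]]:
    """Average the shifts stored for the most specific known environment.

    Returns (predicted shift, number of spheres matched), or None when
    not even the smallest allowed environment is in the table.
    """
    for depth in range(len(spheres), min_spheres - 1, -1):
        shifts = table.get((nucleus, tuple(spheres[:depth])))
        if shifts:
            return mean(shifts), depth
    return None
```

The number of matched spheres doubles as a rough confidence indicator, since a match on more spheres means a more specific environment.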

49 of 127

nmrXiv

Repository

Private

Inhouse data

Prediction

Service

Assignments

AI/ML

Prediction

Model

AI/ML

Prediction

Model*

Assignments

50 of 127

nmrXiv Core

integrations

Web Application

Application Database

------

NMR Database

Search

Importers /

Exporters

Cache

SPA

Front end

File format converters

AI/ML

Tools

API

Workflows

Plugins

OAuth ~ SSO / AAI

ELN

51 of 127

A collection of microservices designed to simplify NMR data processing and analysis. nmrKit offers NMR prediction, validation, and depiction via the nmrium library, and format conversion using the nmr-load-save package, all accessible through a unified API.

52 of 127

FastAPI

CDK

RDKit

HOSE CODE

ALATIS NS

nmr-load-save

lwreg

NMR

Processing/Format conversions

NMR Prediction / Training

Spectral Assignments Validation

Search / Depiction

PostgreSQL

Redis

MinIO

Grafana

Prometheus

NN-Models (TensorFlow)

Deep Learning Models (TensorFlow/PyTorch)

53 of 127

Data - So far…

54 of 127

ELN Integration

Chemotion - nmrXiv

55 of 127

ELN (Chemotion) - nmrXiv

Workflow

  • Chemotion - Requests temporary Upload URL
  • Uploads - RO-Crate based (BagIt) ZIP file
  • nmrXiv jobs unzip the archive and process the metadata
  • Validation/Publishing API
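The third step of the workflow above, unzipping the upload and reading its metadata, can be sketched with the standard library. This is a minimal sketch, not the nmrXiv job code. The function names are hypothetical. `ro-crate-metadata.json` is the file name defined by RO-Crate, and looking for it under a BagIt payload folder is an assumption about how Chemotion packages the crate.

```python
import io
import json
import zipfile

METADATA_NAME = "ro-crate-metadata.json"   # file name defined by RO-Crate


def read_crate_metadata(zip_bytes: bytes) -> dict:
    """Find and parse the RO-Crate metadata file inside an uploaded ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # In a BagIt-wrapped crate the file sits inside the payload folder,
        # so match on the base name instead of a fixed path.
        candidates = [name for name in archive.namelist()
                      if name.rsplit("/", 1)[-1] == METADATA_NAME]
        if not candidates:
            raise ValueError(f"{METADATA_NAME} not found in archive")
        # Prefer the shortest path, i.e. the outermost metadata file.
        return json.loads(archive.read(min(candidates, key=len)))


def entities_by_type(crate: dict, wanted: str) -> list[dict]:
    """Return the @graph entities whose @type equals or contains `wanted`."""
    found = []
    for entity in crate.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if wanted in types:
            found.append(entity)
    return found
```

The entities returned by `entities_by_type` would then feed the validation and publishing step.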

56 of 127

NMR Platform

57 of 127

New Roles

  • Lab Operator

Manage Samples

Statistics

  • Facility Manager

Add or update lab

Operators

Manage instruments

Statistics

Announcements

Any other roles?

  • nmrXiv new roles to grant access to NMR platform views

58 of 127

Admin console access

  • Facility Manager & Lab Operator

Can access the admin console to manage the NMR platform from the nmrXiv interface when they log in

Users with admin console access (NMR Platform) will have additional links in the dropdown in the top right corner.

59 of 127

Admin console options

Options to access NMR Platform in the User Admin Console

60 of 127

NMR Platform Management

NMR Platform Dashboard

  • Current Workload (Samples being processed)
  • Instruments Active
  • Settings etc.

61 of 127

Sample Management > Submission

nmrXiv University Private Page

Options to submit orders / Search etc.

62 of 127

Sample Management > Backend

Sample Details View

Options to select/upload spectra, assignments and other metadata

Samples overview

63 of 127

NMR Platform > Settings > Device Management

Manage devices (Add, Edit or Remove)

Quick links to other settings on the platform

This page evolves as we improve the platform integration

64 of 127

Metrics

Displays all the metrics you would like to access; the date range selection will be user controlled

65 of 127

  • NMRium - metadata storage (rewritten)
  • Downloads - Pre-generated URLs (faster and more reliable)
  • AutoImport - API driven (much more stable)
  • Extensive Validations (needs more)
  • File integrity Checks implemented

  • PostgreSQL (migrated to v15.3 - support until 2027)
  • RDKit cartridge enabled (Search)
  • Simplified UX/UI (less maintenance)
  • Bug fixes
  • Autobackups

Highlights - Recent release

66 of 127

  • NMR Prediction (NMRShiftDB)
  • Spectra Validation Service

  • Format conversions
  • NN training

  • Spectra - Quality control (scoring)
  • Auto-Peak picking / Report export etc.
  • Community curation

Next steps (nmrKit)

67 of 127

Thank you

68 of 127

DATA REPOSITORIES

Electronic Lab

Notebooks (ELNs)

NFDI4Chem

TERMINOLOGY

SERVICES

TS4NFDI Terminology Service Suite

69 of 127

nmrXiv

DATA STRUCTURE

70 of 127

Data life cycle / Versioning

https://github.com/ScienceObjectsDB/Documentation

Project 1

Project 2

Project 3

Study

Dataset

Dataset

Dataset

Dataset

Dataset

Dataset

Dataset

Sample

Assay

Spectra

M

71 of 127

Data Structure

Sample Study

Sample Study

https://github.com/ScienceObjectsDB/Documentation

Project 1

Project 2

Project 3

Sample Study

Sample

Assay(s)

Spectral Dataset(s)

Molecule(s)

Sample Study

Sample Study

Sample Study

Sample Study

Sample Study

Sample Study

Sample Study

Sample Study

Sample Study

72 of 127

  • More Validation Checkpoints
  • Auto-imported metadata - validation
  • Group samples into project

  • Assignments auto processing
  • Stereochemistry checks (missing centers)
  • MD5 checksum (in archive files - downloadable)

  • Bug fixes / Refactoring

Next steps

73 of 127

Release

nmrkit.nmrxiv.org

ONGOING DEVELOPMENT

Prediction service will be going live soon*

Subsequent releases should have file format conversions and other APIs

74 of 127

SERVER SIDE BACKEND

75 of 127

Web application framework

Database

Cache

Search

Backend

Testing

CI/CD

Deployment / Testing (Maintenance)

76 of 127

Web application framework - PHP (MIT)

Database

Cache

Search

Testing

Selenium ~ Chrome Driver

Deployment & DevOps

Backend

77 of 127

SPECTRA VIEWER

78 of 127

  • Spectral Assignments Processing
  • Efficient Versioning of changes
  • Automatic Spectra Snapshots

  • Releases / Maintenance
  • Bug fixes

79 of 127

DOCUMENTATION

80 of 127

81 of 127

DATA IMPORT & EXPORT

82 of 127

I/O: Meta Data Model - RO-Crate

83 of 127

Chemotion Integration - todo

Workflow

  • Chemotion - Requests temporary Upload URL
  • Uploads - RO-Crate based (BagIt) ZIP file
  • nmrXiv jobs unzip the archive and process the metadata
  • Validation/Publishing API

84 of 127

Bioschemas / DataCite

  • Integrated into NFDI4Chem Search service

  • Fine tuned metadata mapping
  • Bug fixes / Exception handling
  • Publish Package

85 of 127

THANK YOU

86 of 127

nmrXiv

DATA SUBMISSION

(mockup based on feedback)

87 of 127

Next

Drag and Drop

Sample - 1

Sample - 2

Sample - 3

Sample - 4

Project > Study

User Authentication

STEP 1: FILE UPLOAD

nmrXiv Data Submission

Sample 1

Sample 3

Sample 2

88 of 127

DATA UPLOADED IN PREVIOUS STEP

STEP 2: SIMILARITY SEARCH AND SPECTRA - ATOM ASSIGNMENTS

nmrXiv Data Submission

ADD MOLECULE(S)

Instrument Data

Sample 1

PREV

NEXT

NMRShiftDB

SHERLOCK

NMRium AUTO Assignments

Mol 1 95%

Mol 2 03%

Mol 3 01%

Mol 4 -

Cancel

Next

89 of 127

STEP 3: META DATA (Minimum information requirements / validations will be implemented at this stage)

nmrXiv Data Submission

Next

SAMPLE DETAILS PROVIDED

Sample Preparation Protocol

Assay Protocol

Provenance

License Information*

Cancel

90 of 127

STEP 4: COMPLETE

nmrXiv Data Submission

CLOSE

SAMPLE DETAILS PROVIDED

ID0001

ID0002

ID0003

ID0004

Data Set IDs

Release Date

Citation

Private

Public

Visibility

Author, 1., & Author, 2. (2022). FAIR, consensus-driven NMR data repository and computational platform. The ultimate goal is to accelerate broader coordination and data sharing among natural product (NP) researchers by enabling storage, management, sharing and analysis of NMR data.

Download Zip

MD5 hashmap

Embed </>

Share

91 of 127

TEST SITE

92 of 127

Currently in pre-beta development stage

93 of 127

98 of 127

Meta Data Model

https://isa-tools.org/

ISA Limitations

(Repository perspective)

  • Templates
  • Redundant data
  • Captures rich description of the experimental metadata but not of the repository

We need to extend beyond the ISA models and give end users total flexibility to define their own templates while still being compliant with the ISA specifications.

99 of 127

Data/File Types

- CeNAPT data of 42 IMPS, some fully interpreted and with HiFSA profiles: https://dataverse.harvard.edu/dataverse/cenapt

- Data from all our publications with NMR data (39 papers, have not counted but should be 100-200 cpds) since 2015: https://dataverse.harvard.edu/dataverse/gfpuic

1H, HSQC, HMBC plus COSY, NOESY, 13C/APT

Any minimum requirements?

Any format conditions (zipped, un/processed, TopSpin/Xwinnmr, size)?

100 of 127

Data Conversions

Redis / RabbitMQ Queues (Jobs)

Python Job - Dispatcher REST

Web Service

Interacts with the Repository core for authentication/authorization, projects ~ storage details

nmrml2ISA

mzml2ISA

Bruker Converter

ML models

Analysis modules

NMR Workflow

MS Workflow

Raman Workflow

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).

- REST API

- S3 support

- Open source

- GUI

Interacts with python dispatcher via REST API

Every cluster job is run as a workflow. Some workflows have a single node (for example, converters).

Conversion from Argo YAML to CWL, and vice versa, is under development.
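A single-node workflow of the kind mentioned above (a converter) is small enough to show in full. The sketch below builds an Argo Workflow manifest as a Python dictionary that a dispatcher could serialise and submit. The helper name, the workflow name, the container image, and the command are hypothetical. The top-level keys follow the Argo Workflows resource format.

```python
import json


def single_step_workflow(name: str, image: str, command: list[str]) -> dict:
    """Build an Argo Workflow manifest with exactly one container template."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName lets the cluster append a unique suffix per run.
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": name,          # the template that runs first
            "templates": [{
                "name": name,
                "container": {"image": image, "command": command},
            }],
        },
    }


# Hypothetical converter job; the image and paths are placeholders.
manifest = single_step_workflow(
    "bruker-convert",
    "example.org/bruker-converter:latest",
    ["convert", "/input", "/output"],
)
manifest_json = json.dumps(manifest, indent=2)
```

Multi-step workflows would add further templates and a `steps` or `dag` template as the entrypoint. That is outside this sketch.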

101 of 127

Data Conversions

https://github.com/NFDI4Chem/formaTAPIRest

102 of 127

Core Trust Seal

https://zenodo.org/record/3638211#.YOauSxNKhCU; Remark by Ti: Repos with CTS are explicitly recommended in the author guidelines of some journals, e.g. Open Chemistry (De Gruyter).

103 of 127

ISO 16290:2013 (Technology Readiness Levels)

https://www.iso.org/standard/56064.html

104 of 127

DOI

https://support.datacite.org/docs/api-create-dois

DOI Registration Agencies: https://www.doi.org/RA_Coverage.html
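To make the DOI step concrete, the sketch below assembles the JSON body for a DOI creation request in the JSON:API shape used by the DataCite REST API. It only builds the payload and sends nothing. The helper name and all example values are hypothetical, and the attribute names should be checked against the DataCite documentation linked above before use.

```python
def datacite_payload(prefix: str, title: str, creators: list[str],
                     publisher: str, year: int, url: str,
                     publish: bool = False) -> dict:
    """Build the request body for creating a DOI (draft unless publish=True)."""
    attributes = {
        "prefix": prefix,                       # suffix is left to the registry
        "titles": [{"title": title}],
        "creators": [{"name": name} for name in creators],
        "publisher": publisher,
        "publicationYear": year,
        "types": {"resourceTypeGeneral": "Dataset"},
        "url": url,                             # landing page of the dataset
    }
    if publish:
        # Assumption: the "publish" event makes the DOI findable.
        attributes["event"] = "publish"
    return {"data": {"type": "dois", "attributes": attributes}}
```

Keeping `publish=False` as the default matches an embargo-first workflow, where a DOI is reserved as a draft and only made public on release.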

105 of 127

Backend

  • Ruby (Ruby on Rails)
  • Python (Django, Flask, Pylons)
  • PHP (Laravel)
  • Java (Spring)
  • Scala (Play)
  • Node.js (Express)

Framework

  • MySQL (relational)
  • PostgreSQL (relational)
  • MongoDB (non-relational, document)

Database

  • Apache
  • Nginx

Server

106 of 127

nmrXiv

TECHNOLOGY STACK

107 of 127

Backend

Ruby (Ruby on Rails)

Vs

Python (Django)

Vs

PHP (Laravel)

108 of 127

Backend

Factors considered (one value per candidate framework)

  • REST API: EASY | NOT STRAIGHTFORWARD | YES
  • Scalability: VERY SCALABLE | VERY SCALABLE | SLOW
  • Documentation: Very good | Good | Good
  • Security: Good | Very good | Good
  • Learning resources: Very good | Good | Good
  • Speed: Concurrent requests | One request at a time | Yes with multi threading
  • Ecosystem: Gaining popularity | Okay | Declining
  • Team Experience: YES | YES | NO

109 of 127

Backend - Trends

110 of 127

Backend

https://laravel.com/

Source: https://laravel.com and Wikipedia

111 of 127

MinIO is an Amazon S3 compatible server-side storage stack. It can handle unstructured data such as photos, videos, log files, backups, and container images, with a current maximum supported object size of 5 TB.

Source: https://min.io and Wikipedia

112 of 127

Cache

Search

113 of 127

Status Page - Example Dropbox

Easily communicate real-time status to our users

Source: https://www.atlassian.com/software/statuspage and Wikipedia

114 of 127

Frontend

  • Angular
  • React
  • Vue
  • Ember
  • Svelte

Framework

  • WebPack

Bundler

  • Yarn / NPM

Package manager

115 of 127

Frontend

https://vuejs.org/

Source: https://vuejs.org and Wikipedia

116 of 127

Mandate

Acquisition

Deposition

Processing

Distribution

Discovery

Archiving

Repurposing

Analysis

Color

Key

Journal, Institution, Funder

User

nmrXiv Platform

117 of 127

Authentication and Authorisation

118 of 127

Data life cycle / Versioning

https://github.com/ScienceObjectsDB/Documentation

Project 1

Project 2

Project 3

Study

Dataset

Dataset

Dataset

Dataset

Dataset

Dataset

Dataset

Sample

Assay

Spectra

M

119 of 127

Directory structure

BagIt is a set of hierarchical file layout conventions designed to support storage and transfer of digital content.
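The BagIt layout is easiest to understand by building one. The sketch below writes a minimal bag (a `bagit.txt` declaration, a `data/` payload folder, and a checksum manifest) and then verifies it. It is a sketch under assumptions: the function names and example file names are hypothetical, and SHA-256 is used for the manifest here although the slides elsewhere mention MD5.

```python
import hashlib
import tempfile
from pathlib import Path


def write_bag(root: Path, payload: dict[str, bytes]) -> None:
    """Write bagit.txt, the data/ payload and a SHA-256 payload manifest."""
    root.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = root / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8")
    (root / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8")


def verify_bag(root: Path) -> list[str]:
    """Return the payload paths that are missing or fail their checksum."""
    problems = []
    manifest = (root / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        target = root / rel_path
        if (not target.is_file()
                or hashlib.sha256(target.read_bytes()).hexdigest() != digest):
            problems.append(rel_path)
    return problems


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "sample-bag"
    write_bag(bag, {"1/fid": b"raw", "1/acqus": b"params"})
    intact = verify_bag(bag)                 # nothing to report yet
    (bag / "data" / "1" / "fid").write_bytes(b"tampered")
    damaged = verify_bag(bag)                # the modified file is reported
```

The same manifest check can back the "Checksums match" and "Missing files" items of the validation step.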

120 of 127

ISA formats

ISA-Tab

ISA-JSON

121 of 127

DataCite

OpenAIRE

Bioschemas

IUPAC FAIRSpec

I/O: Data Schemas

122 of 127

Data Formats

Support all major instrument raw output file formats and open data formats.

123 of 127

Data Versioning

Versioning is natively built around data models at the repository level. In addition to that we support DOI Versioning: https://support.datacite.org/docs/versioning

The version number follows semantic versioning principles. Versions can carry additional tags such as "stable", "current" or "dev" that point to a specific version, can be updated, and can be queried separately.
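The combination of fixed version numbers and movable tags can be sketched as a small registry. This is an illustration only, not the repository's data model: the class and method names are hypothetical, and versions are assumed to be plain MAJOR.MINOR.PATCH strings without pre-release suffixes.

```python
from dataclasses import dataclass, field


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple that sorts numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


@dataclass
class VersionRegistry:
    versions: set[str] = field(default_factory=set)
    tags: dict[str, str] = field(default_factory=dict)

    def publish(self, version: str) -> None:
        parse_version(version)               # raises ValueError if malformed
        self.versions.add(version)

    def tag(self, name: str, version: str) -> None:
        """Point a movable tag such as 'stable' at an existing version."""
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        self.tags[name] = version

    def resolve(self, ref: str) -> str:
        """Accept either a tag name or a version number."""
        if ref in self.tags:
            return self.tags[ref]
        if ref in self.versions:
            return ref
        raise KeyError(f"unknown version or tag: {ref}")

    def latest(self) -> str:
        # Numeric comparison, so 1.10.0 ranks above 1.2.0.
        return max(self.versions, key=parse_version)
```

Published versions never change, while a tag can be re-pointed, which is what allows "stable" or "current" to be queried separately from the version number.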

124 of 127

Ontologies

Giving (meta)data meaning with ontologies

Ontology-driven input fields and text areas not only provide a rich user experience but also capture rich metadata, ensuring machine readability

125 of 127

nmrXiv - Ontology component

Smart compose - Ontologies / Controlled Vocabulary driven

126 of 127

Compatible with

(use React components in Vue app)

Source: https://vuejs.org and https://reactjs.org

127 of 127

Deployments