1 of 127

Venkata Chandrasekhar Nainala (Chandu)

Friedrich Schiller University Jena

Update: January 2025

2 of 127

https://nmrxiv.org

3 of 127

87 Projects ~ 914 Compounds ~ 4587 Spectra

40 Projects ~ 500 Compounds ~ 5500 Spectra

PUBLIC

PRIVATE / EMBARGO

4 of 127

nmrXiv

CURATION WORKFLOW

5 of 127

Curation workflow - Semi automatic

Human in the loop approach

RAW FILES

PROCESSED FILES ~ STANDARD FORMATS

LONG TERM ARCHIVAL

Version 3.0

Version 2.0

Version 1.0

WEB

API

AI/ML tools

CASE

SEARCH ENGINES

STEP 1: FILE UPLOAD

ARCHIVAL / PUBLISHING

STEP 2: ASSIGNMENTS & META-DATA

STEP 3: VALIDATION

6 of 127

Data ingestion & staging

STEP 1: FILE UPLOAD

7 of 127

Metadata Normalization

STEP 2: ASSIGNMENTS & META-DATA

Step 2. Metadata Normalization

After ingestion, extracted metadata are mapped to the nmrXiv schema covering samples, molecules, experiments, and spectra. Automated routines interpret instrument files to detect nucleus type (^1H, ^13C, ^15N, etc.), experiment dimension (1D/2D), field strength, temperature, and solvent. Structural data (SMILES, InChI, or .mol/.sdf files) are linked if available.

Folder and file names are also parsed at this stage, providing contextual hints (e.g., Sample1_HSQC_600MHz) to standardize experiment labeling. Supporting metadata files like NMReDATA are prioritized, since they carry structured assignments and acquisition conditions.
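
The name-parsing idea can be illustrated with a minimal sketch. The token vocabulary, the function name `parse_name_hints`, and the returned keys are assumptions made for this example; they are not taken from the nmrXiv code base.

```python
import re

# Hypothetical vocabulary of experiment labels; not nmrXiv's actual list.
KNOWN_EXPERIMENTS = {"HSQC", "HMBC", "COSY", "NOESY", "APT", "1H", "13C"}
FIELD_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)MHz$", re.IGNORECASE)


def parse_name_hints(name: str) -> dict:
    """Split a folder or file name on '_' and '-' and collect contextual hints."""
    hints = {"sample": None, "experiment": None, "frequency_mhz": None}
    for token in re.split(r"[_\-]+", name):
        field = FIELD_PATTERN.match(token)
        if field:
            # Token looks like a spectrometer frequency, e.g. "600MHz".
            hints["frequency_mhz"] = float(field.group(1))
        elif token.upper() in KNOWN_EXPERIMENTS:
            hints["experiment"] = token.upper()
        elif hints["sample"] is None and token:
            # First unrecognized token is treated as the sample label.
            hints["sample"] = token
    return hints


print(parse_name_hints("Sample1_HSQC_600MHz"))
# {'sample': 'Sample1', 'experiment': 'HSQC', 'frequency_mhz': 600.0}
```

Hints obtained this way are only suggestions; values read from instrument files or NMReDATA would normally take precedence.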

At this point, submitters can intervene to verify consistency between the submitted sample description, structure, and metadata. Ambiguous or missing fields (e.g., incomplete solvent information, mislabeled nuclei) are flagged for human resolution before moving forward.

8 of 127

Data Validation

Failure to meet any requirement is a validation failure and requires further human attention

STEP 3: VALIDATION

MIChI: Minimum Information about Chemical Investigation

Missing minimum info

File Integrity Checks

Checksums match

Missing files

Meta-Data Checks

Citation

Author

License

Step 3. Data Validation

Once metadata have been normalized, the system runs a series of automated quality control (QC) checks and applies community metadata standards, so that every dataset is technically sound, FAIR-compliant, and consistent with nmrXiv standards.

File integrity checks: Confirm that raw instrument files are complete (e.g., Bruker fid, acqus, pdata/, JCAMP .dx) and parseable.
MIChI compliance: Verify that the minimum required metadata are present (sample ID, chemical structure, acquisition parameters such as nucleus, solvent, temperature, frequency). Missing or inconsistent fields are flagged for curator review.
Duplicate detection: Compare structures (InChI, SMILES) and spectra to identify redundant submissions.
Bioschemas preparation: Map metadata into schema.org-compatible fields (Dataset, MolecularEntity, PropertyValue) to ensure web interoperability.
DataCite readiness: Collect required citation metadata (title, authors/submitters, license, year) so a DOI can be minted when the dataset is published.
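
As a rough illustration of the MIChI and DataCite readiness checks listed above, the sketch below reports missing fields for one record. The field names and the `validate` helper are assumptions for this example and do not reflect the official MIChI draft or the nmrXiv schema.

```python
# Field names are illustrative assumptions, not the official MIChI or nmrXiv schema.
MICHI_REQUIRED = ("sample_id", "structure", "nucleus", "solvent", "temperature", "frequency")
DATACITE_REQUIRED = ("title", "authors", "license", "year")


def missing_fields(record: dict, required: tuple) -> list:
    """Return the required keys that are absent or empty in the record."""
    return [key for key in required if record.get(key) in (None, "", [])]


def validate(record: dict) -> dict:
    """Collect missing minimum information and flag the record for curator review."""
    report = {
        "michi_missing": missing_fields(record, MICHI_REQUIRED),
        "datacite_missing": missing_fields(record, DATACITE_REQUIRED),
    }
    report["needs_curator_review"] = bool(
        report["michi_missing"] or report["datacite_missing"]
    )
    return report


# Made-up record: temperature and license were not provided.
record = {
    "sample_id": "S1",
    "structure": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "nucleus": "13C",
    "solvent": "CDCl3",
    "frequency": 600.0,
    "title": "Caffeine 13C",
    "authors": ["A. Submitter"],
    "year": 2025,
}
print(validate(record))
# {'michi_missing': ['temperature'], 'datacite_missing': ['license'], 'needs_curator_review': True}
```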

9 of 127

nmrXiv

COMPOUND

10 of 127

Caffeine

11 of 127

Caffeine - Spectra

12 of 127

Caffeine - Metadata

13 of 127

nmrXiv

U P D A T E S

14 of 127

15 of 127

nmrXiv

N E W D A T A S U B M I S S I O N

16 of 127

17 of 127

Onboarding Screen (Primer)

18 of 127

Upload (Drag & Drop or File Browser)

19 of 127

Parallel uploads (Strict validations & error tracking)

20 of 127

Parallel uploads (Strict validations & error tracking)

21 of 127

Missing & Corrupt file checks

22 of 127

Auto-Processing Spectra

23 of 127

Auto-Import metadata

(compound information - .mol,.sdf,.nmredata.sdf)

24 of 127

Auto-Import metadata

(compound information - .mol,.sdf,.nmredata.sdf)

25 of 127

Validation Report

26 of 127

Validation Report

27 of 127

Multi-Spectra Views

28 of 127

29 of 127

30 of 127

Project or Independent Sample submissions

31 of 127

Ontology driven - Organism details (Part as well)

32 of 127

Samples Overview

33 of 127

Embargo mode - Team Sharing

34 of 127

35 of 127

Optimised processing - notifications

36 of 127

37 of 127

nmrXiv

S E A R C H

38 of 127

MIChI Recommendations (Draft Version)

https://docs.google.com/spreadsheets/d/1MxCceGO3UUAvIn-GWxxgeOUFnR34A3ileqIW3sYgZNU/edit#gid=0

39 of 127

Advanced Search

Project

Sample

Assay(s)

Molecule(s)

Protocol description
Measurement parameters

Solvent
Temperature
Field Strength
Instrument
Ionization / Collection method

Molecular Formula
InChI
SMILES

Similarity
Substructure
Exact

Molecular Weight
Name / Synonyms
HOSE Code
Descriptors*
CAS

Spectrum Type
Spectrum category
Spectra Processing parameters
Raw Data types
Processed Data types

Sample study

Spectral Dataset(s)

Keywords
Description
Organism
Organism part
Submitter
Author
Citation
License Type
Quality Score
Timestamps

Keywords
Description
Organism
Organism part
Submitter
Author
Citation
ID
License Type
Timestamps

Collection

Keywords
Description
ID

Name
ID

40 of 127

Structure

Search

RDKit Based

~ Exact

~ Substructure

~ Similarity

41 of 127

Structure Search

Results

Browse all samples

reporting the compound

in one view

~ Compare spectra from

different samples

(UI updates pending)

42 of 127

nmrXiv

B I O S C H E M A M E T A D A T A S T R U C T U R E

43 of 127

Bioschemas Metadata Structure

Project

Repo

Study/

Sample

Dataset/

Spectrum

hasPart (1 => n)

isPartOf (1)

includedInDataCatalog (1)

Study

ISA

Dataset

ISA

DataCatalog

ISA

Organization

CreativeWork

Person

publisher

author

citation

ChemicalSubstance: sample

about

MolecularEntity: molecules

hasBioChemEntityPart

'NMR solvent': 'NMR:1000330'	CDCl3
'acquisition nucleus': 'NMR:1400083'	13C
Other than measurementTechnique (a URL), everything is a PropertyValue: dimension, probe, temperature, frequency, field strength, number of scans, experiment…

variableMeasured

Properties: inChI, inChIKey, iupacName, molecularFormula, molecularWeight, smiles, mol (hasRepresentation), Percentage composition (description)
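
The mapping described on this slide could look roughly like the JSON-LD below. This is a sketch under assumptions: the dataset title and the measurementTechnique URL are placeholders, and the exact nesting used by nmrXiv may differ. The two ontology term IDs are the ones shown on the slide.

```python
import json


def property_value(name, value, property_id=None):
    """Build a schema.org PropertyValue node, optionally tagged with an ontology term ID."""
    node = {"@type": "PropertyValue", "name": name, "value": value}
    if property_id is not None:
        node["propertyID"] = property_id
    return node


# MolecularEntity properties follow the names listed on the slide.
molecule = {
    "@type": "MolecularEntity",
    "name": "Caffeine",
    "molecularFormula": "C8H10N4O2",
    "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "13C NMR spectrum of caffeine in CDCl3",  # illustrative title
    "measurementTechnique": "https://example.org/technique/nmr",  # placeholder URL
    "about": {"@type": "ChemicalSubstance", "hasBioChemEntityPart": [molecule]},
    "variableMeasured": [
        property_value("NMR solvent", "CDCl3", "NMR:1000330"),
        property_value("acquisition nucleus", "13C", "NMR:1400083"),
    ],
}

print(json.dumps(dataset, indent=2))
```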

nmrXiv Object

Bioschemas Type

44 of 127

A R C H I T E C T U R E

O V E R V I E W

45 of 127

nmrXiv Core

integrations

Web Application

Application Database

------

NMR Database

Search

Importers /

Exporters

Cache

SPA

Front end

File format converters

AI/ML

Tools

API

Workflows

Plugins

OAuth ~ SSO / AAI

Schemas

46 of 127

File format converters

AI/ML

Tools

Workflows

Plugins

Schemas

47 of 127

File format conversions

https://nmrxiv.org

https://chemotion.net/

https://www.mdpi.com/1420-3049/28/3/1448

File format converters

48 of 127

Prediction Service

AI/ML

HOSE Codes

Lookup tables

Prediction / CASE

Assignments

CASE

Software /

Prediction

49 of 127

nmrXiv

Repository

Private

Inhouse data

Prediction

Service

Assignments

AI/ML

Prediction

Model

AI/ML

Prediction

Model*

Assignments

50 of 127

nmrXiv Core

integrations

Web Application

Application Database

------

NMR Database

Search

Importers /

Exporters

Cache

SPA

Front end

File format converters

AI/ML

Tools

API

Workflows

Plugins

OAuth ~ SSO / AAI

ELN

51 of 127

nmrKit is a collection of microservices designed to simplify NMR data processing and analysis. It offers NMR prediction, validation, and depiction via the nmrium library, and format conversion using the nmr-load-save package, all accessible through a unified API.

52 of 127

FastAPI

CDK

RDKit

HOSE CODE

ALATIS NS

nmr-load-save

lwreg

NMR

Processing/Format conversions

NMR Prediction / Training

Spectral Assignments Validation

Search / Depiction

PostgreSQL

Redis

MinIO

Grafana

Prometheus

NN-Models (TensorFlow)

Deep Learning Models (TensorFlow/PyTorch)

53 of 127

Data - So far…

54 of 127

ELN Integration

Chemotion - nmrXiv

55 of 127

ELN (Chemotion) - nmrXiv

Workflow

Chemotion requests a temporary upload URL
Chemotion uploads an RO-Crate based (BagIt) ZIP file
nmrXiv jobs unzip the archive and process the metadata
Validation/Publishing API
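
The "unzip and process metadata" step could be sketched as follows. This is a simplified illustration, not the nmrXiv job code: the archive is built in memory with made-up contents, the metadata file sits at the archive root (a BagIt-wrapped crate would place it under the payload directory), and the helper names are assumptions.

```python
import io
import json
import zipfile


def build_demo_archive() -> bytes:
    """Create a tiny in-memory ZIP standing in for an ELN export (contents are made up)."""
    metadata = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
            {"@id": "./", "@type": "Dataset", "name": "Demo sample",
             "hasPart": [{"@id": "data/1/fid"}]},
            {"@id": "data/1/fid", "@type": "File"},
        ],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ro-crate-metadata.json", json.dumps(metadata))
        archive.writestr("data/1/fid", b"\x00\x01")
    return buffer.getvalue()


def read_crate(zip_bytes: bytes) -> dict:
    """Load the RO-Crate metadata and report listed files that are missing from the ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())
        graph = json.loads(archive.read("ro-crate-metadata.json"))["@graph"]
    root = next(node for node in graph if node["@id"] == "./")
    listed = [part["@id"] for part in root.get("hasPart", [])]
    return {
        "name": root.get("name"),
        "files": listed,
        "missing": [path for path in listed if path not in names],
    }


print(read_crate(build_demo_archive()))
# {'name': 'Demo sample', 'files': ['data/1/fid'], 'missing': []}
```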

56 of 127

NMR Platform

57 of 127

New Roles

Lab Operator

Manage Samples

Statistics

Facility Manager

Add or update lab

Operators

Manage instruments

Statistics

Announcements

Any other roles??

nmrXiv new roles to grant access to NMR platform views

58 of 127

Admin console access

Facility Manager & Lab Operator

Can access admin console to

Manage NMR platform from the

nmrXiv interface when they login

Users with the admin console access (NMR Platform) will have additional links in the dropdown on the top right corner.

59 of 127

Admin console options

Options to access NMR Platform in the User Admin Console

60 of 127

NMR Platform Management

NMR Platform Dashboard

Current Workload (Samples being processed)
Instruments Active
Settings etc.

61 of 127

Sample Management > Submission

nmrXiv University Private Page

Options to submit orders / Search etc.

62 of 127

Sample Management > Backend

Sample Details View

Options to select/upload spectra. Assignments and other meta-data

Samples overview

63 of 127

NMR Platform > Settings > Device Management

Manage devices (Add, Edit or Remove)

Quick links to other settings on the platform

This page evolves as we improve the platform integration

64 of 127

Metrics

Displays all metrics you would like to access - date range selection will be user controlled

65 of 127

NMRium - metadata storage (rewritten)
Downloads - Pre-generated URLs (faster and more reliable)
AutoImport - API driven (much more stable)
Extensive Validations (needs more)
File integrity Checks implemented

PostgreSQL (migrated to v15.3 - support until 2027)
RDKit cartridge enabled (Search)
Simplified UX/UI (less maintenance)
Bug fixes
Autobackups

Highlights - Recent release

66 of 127

NMR Prediction (NMRShiftDB)
Spectra Validation Service

Format conversions
NN training

Spectra - Quality control (scoring)
Auto-Peak picking / Report export etc.
Community curation

Next steps (nmrKit)

67 of 127

Thank you

Documentation

GitHub

API

68 of 127

Facilitating Comprehensive Metadata Capture and Validation in Data Repositories (nmrXiv) through Terminologies and Terminology Service Suite Widgets

DATA REPOSITORIES

Electronic Lab

Notebooks (ELNs)

NFDI4Chem

TERMINOLOGY

SERVICES

TS4NFDI Terminology Service Suite

69 of 127

nmrXiv

D A T A S T R U C T U R E

70 of 127

Data life cycle / Versioning

https://github.com/ScienceObjectsDB/Documentation

Project 1

Project 2

Project 3

Study

Dataset

Sample

Assay

Spectra

M

71 of 127

Data Structure

Sample Study

https://github.com/ScienceObjectsDB/Documentation

Project 1

Project 2

Project 3

Sample Study

Sample

Assay(s)

Spectral Dataset(s)

Molecule(s)

Sample Study

72 of 127

More Validation Checkpoints
Auto-imported metadata - validation
Group samples into project

Assignments auto processing
Stereochemistry checks (missing centers)
MD5 checksum (in archive files - downloadable)

Bug fixes / Refactoring

Next steps

73 of 127

Release

nmrkit.nmrxiv.org

O N G O I N G D E V E L O P M E N T

Prediction service will be going live soon*

Subsequent releases should have file format conversions and other API

74 of 127

S E R V E R S I D E

B A C K E N D

75 of 127

Web application framework

Database

Cache

Search

Backend

Testing

CI/CD

Deployment / Testing (Maintenance)

76 of 127

Web application framework - PHP (MIT)

Database

Cache

Search

Testing

Selenium ~ Chrome Driver

Deployment & DevOps

Backend

77 of 127

S P E C T R A V I E W E R

78 of 127

https://github.com/NFDI4Chem/nmrium-react-wrapper

Spectral Assignments Processing
Efficient Versioning of changes
Automatic Spectra Snapshots

Releases / Maintenance
Bug fixes

79 of 127

D O C U M E N T A T I O N

80 of 127

https://docs.nmrxiv.org

81 of 127

D A T A I M P O R T & E X P O R T

82 of 127

I/O: Meta Data Model - RO-Crate

83 of 127

Chemotion Integration - todo

Workflow

Chemotion requests a temporary upload URL
Chemotion uploads an RO-Crate based (BagIt) ZIP file
nmrXiv jobs unzip the archive and process the metadata
Validation/Publishing API

84 of 127

Bioschemas / DataCite

Integrated into NFDI4Chem Search service

Fine tuned metadata mapping
Bug fixes / Exception handling
Publish Package

85 of 127

T H A N K Y O U

86 of 127

nmrXiv

D A T A S U B M I S S I O N

(mockup based on feedback)

87 of 127

Sample - 1

Sample - 2

Sample - 3

Sample - 4

Project > Study

User Authentication

STEP 1: FILE UPLOAD

nmrXiv Data Submission

Sample 1

Sample 3

Sample 2

88 of 127

DATA UPLOADED IN PREVIOUS STEP

STEP 2: SIMILARITY SEARCH AND SPECTRA - ATOM ASSIGNMENTS

nmrXiv Data Submission

ADD MOLECULE(S)

Instrument Data

Sample 1

NMRShiftDB

SHERLOCK

NMRium AUTO Assignments

Mol 1 95%

Mol 2 03%

Mol 3 01%

Mol 4 -

Cancel

STEP 3: META DATA (Minimum information requirements / validations will be implemented at this stage)

nmrXiv Data Submission

Sample Preparation Protocol

Assay Protocol

Provenance

License Information*

Cancel

90 of 127

STEP 4: COMPLETE

nmrXiv Data Submission

CLOSE

SAMPLE DETAILS PROVIDED

ID0001	ID0002	ID0003	ID0004

Data Set IDs

Release Date

Citation

Private

Public

Visibility

Author, 1., & Author, 2. (2022). FAIR, consensus-driven NMR data repository and computational platform. The ultimate goal is to accelerate broader coordination and data sharing among natural product (NP) researchers by enabling storage, management, sharing and analysis of NMR data.

Download Zip

MD5 hashmap

Embed </>

Share

91 of 127

T E S T S I T E

92 of 127

Currently in pre-beta development stage

Test site: https://dev.nmrxiv.org

Source code: https://github.com/NFDI4Chem/nmrxiv

98 of 127

Meta Data Model

https://isa-tools.org/

ISA Limitations

(Repository perspective)

Templates
Redundant data
Captures rich description of the experimental metadata but not of the repository

We need to extend beyond the ISA models and give end users total flexibility to define their own templates while remaining compliant with the ISA specifications.

99 of 127

Data/File Types

- CeNAPT data of 42 IMPS, some fully interpreted and with HiFSA profiles: https://dataverse.harvard.edu/dataverse/cenapt

- Data from all our publications with NMR data (39 papers, have not counted but should be 100-200 cpds) since 2015: https://dataverse.harvard.edu/dataverse/gfpuic

1H, HSQC, HMBC plus COSY, NOESY, 13C/APT

Any minimum requirements?

Any format conditions (zipped, un/processed, TopSpin/Xwinnmr, size)?

100 of 127

Data Conversions

Redis / RabbitMQ Queues (Jobs)

Python Job - Dispatcher REST

Web Service

Interacts with the Repository core for authentication/authorization, projects ~ storage details

nmrml2ISA

mzml2ISA

Bruker Converter

ML models

Analysis modules

NMR Workflow

MS Workflow

Raman Workflow

Argo Workflows - https://argoproj.github.io/argo-workflows

GKE: https://medium.com/sysmap-labs/how-to-install-and-configure-argo-workflows-on-gke-9dde654c145e

Python SDK: https://github.com/argoproj/argo-workflows/tree/master/sdks/python

Considerations before choosing Argo: https://medium.com/datamindedbe/what-to-consider-before-choosing-argo-workflow-54f6067307a8

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).

- REST API

- S3 support

- Open source

- GUI

Interacts with python dispatcher via REST API

Every cluster job runs as a workflow. Some workflows have a single node (for example, converters).

Conversion between Argo YAML and CWL (in both directions) is under development.
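
A single-node workflow of the kind mentioned above could be described by a manifest like the one this sketch generates. The workflow name, container image, and command are hypothetical placeholders; only the overall manifest layout (apiVersion, kind, entrypoint, templates) follows the Argo Workflows resource format. The manifest is emitted as JSON, which Kubernetes accepts alongside YAML.

```python
import json


def single_step_workflow(name: str, image: str, command: list, args: list) -> dict:
    """Build an Argo Workflow manifest with one container template as its entrypoint."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": name,
            "templates": [
                {
                    "name": name,
                    "container": {"image": image, "command": command, "args": args},
                }
            ],
        },
    }


# Hypothetical converter job; the image and script do not refer to a real release.
manifest = single_step_workflow(
    name="bruker-converter",
    image="example.org/bruker-converter:latest",
    command=["python", "convert.py"],
    args=["--input", "/data/in", "--output", "/data/out"],
)
print(json.dumps(manifest, indent=2))
```

In the setup described on this slide, the Python dispatcher would submit such a manifest through the Argo REST API.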

101 of 127

Data Conversions

https://github.com/NFDI4Chem/formaTAPIRest

102 of 127

Core Trust Seal

https://zenodo.org/record/3638211#.YOauSxNKhCU; Remark by Ti: Repositories with CTS are explicitly recommended in the author guidelines of some journals, e.g. Open Chemistry (De Gruyter).

103 of 127

ISO 16290:2013 (Technology Readiness Levels)

https://www.iso.org/standard/56064.html

104 of 127

DOI

https://support.datacite.org/docs/api-create-dois

DOI Registration Agencies: https://www.doi.org/RA_Coverage.html
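
For orientation, the request body for creating a DOI through the DataCite REST API could be assembled as in the sketch below. All values (prefix, title, creator, landing page URL) are made-up placeholders, and the helper name `build_doi_payload` is an assumption; the linked DataCite guide remains the authoritative description of the attributes.

```python
import json


def build_doi_payload(prefix, title, creators, publisher, year, url, publish=False):
    """Assemble a JSON:API body for a DataCite DOI; nothing is sent over the network."""
    attributes = {
        "prefix": prefix,
        "titles": [{"title": title}],
        "creators": [{"name": name} for name in creators],
        "publisher": publisher,
        "publicationYear": year,
        "types": {"resourceTypeGeneral": "Dataset"},
        "url": url,
    }
    if publish:
        # Without an event the DOI stays in draft state.
        attributes["event"] = "publish"
    return {"data": {"type": "dois", "attributes": attributes}}


# Placeholder values for illustration only.
payload = build_doi_payload(
    prefix="10.12345",
    title="Demo NMR dataset",
    creators=["Author, 1.", "Author, 2."],
    publisher="nmrXiv",
    year=2022,
    url="https://example.org/datasets/demo",
)
print(json.dumps(payload, indent=2))
```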

105 of 127

Backend

Ruby (Ruby on Rails)
Python (Django, Flask, Pylons)
PHP (Laravel)
Java (Spring)
Scala (Play)
Node.js (Express)

Framework

MySQL (relational)
PostgreSQL (relational)
MongoDB (non-relational, document)

Database

Apache
Nginx

Server

106 of 127

nmrXiv

T E C H N O L O G Y S T A C K

107 of 127

Backend

Ruby (Ruby on Rails)

Vs

Python (Django)

Vs

PHP (Laravel)

108 of 127

Backend

Factors considered

REST API	EASY	NOT STRAIGHTFORWARD	YES
Scalability	VERY SCALABLE	VERY SCALABLE	SLOW
Documentation	Very good	Good	Good
Security	Good	Very good	Good
Learning resources	Very good	Good	Good
Speed	Concurrent requests	One request at a time	Yes with multi threading
Ecosystem	Gaining popularity	Okay	Declining
Team Experience	YES	YES	NO

109 of 127

https://trends.builtwith.com/framework/Laravel

https://trends.builtwith.com/framework/Ruby-on-Rails

https://trends.builtwith.com/framework/Django-Language

Backend - Trends

110 of 127

Backend

https://laravel.com/

Source: https://laravel.com and Wikipedia

111 of 127

MinIO is an Amazon S3-compatible server-side storage stack. It can handle unstructured data such as photos, videos, log files, backups, and container images, with a current maximum supported object size of 5 TB.

Source: https://min.io and Wikipedia

112 of 127

Cache

Search

Instant search: https://www.meilisearch.com/

Big Data: ELK Stack: Elasticsearch, Logstash, Kibana

Cache: https://redis.io/

113 of 127

Status Page - Example Dropbox

Easily communicate real-time status to our users

Source: https://www.atlassian.com/software/statuspage and Wikipedia

114 of 127

Frontend

Angular
React
Vue
Ember
Svelte

Framework

WebPack

Bundler

Yarn / NPM

Package manager

115 of 127

Frontend

https://vuejs.org/

Source: https://vuejs.org and Wikipedia

116 of 127

Mandate

Acquisition

Deposition

Processing

Distribution

Discovery

Archiving

Repurposing

Analysis

Color

Key

Journal, Institution, Funder

User

nmrXiv Platform

117 of 127

Authentication and Authorisation

119 of 127

Directory structure

https://de.wikipedia.org/wiki/BagIt

https://github.com/whikloj/BagItTools

BagIt is a set of hierarchical file layout conventions designed to support storage and transfer of digital content.
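
To make the layout concrete, the sketch below writes a minimal bag (a bagit.txt declaration, a data/ payload directory, and an MD5 manifest) and then re-checks the payload against the manifest. It uses only the Python standard library rather than the BagItTools package linked above; the helper names and the sample payload are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write payload files under data/ plus bagit.txt and manifest-md5.txt."""
    manifest_lines = []
    for relative_name, content in sorted(payload.items()):
        target = bag_dir / "data" / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{relative_name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> list:
    """Return payload paths that are missing or whose MD5 no longer matches the manifest."""
    problems = []
    manifest = (bag_dir / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, _, relative_path = line.partition("  ")
        target = bag_dir / relative_path
        if not target.is_file() or hashlib.md5(target.read_bytes()).hexdigest() != digest:
            problems.append(relative_path)
    return problems


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    make_bag(bag, {"1/fid": b"raw signal", "1/acqus": b"##TITLE= demo"})
    print(verify_bag(bag))  # [] while the payload is intact
    (bag / "data" / "1" / "fid").write_bytes(b"corrupted")
    print(verify_bag(bag))  # ['data/1/fid'] after the file was altered
```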

120 of 127

ISA formats

https://isa-tools.org/

https://www.ebi.ac.uk/metabolights/MTBLS1/assays

ISA-Tab

ISA-JSON

121 of 127

DataCite

OpenAIRE

Bioschemas

IUPAC FAIRSpec

I/O: Data Schemas

https://schema.datacite.org/meta/kernel-4.4

https://guidelines.openaire.eu/en/latest/data/index.html

https://bioschemas.org/

122 of 127

Data Formats

Support all major instrument raw output file formats and open data formats.

123 of 127

Data Versioning

Versioning is natively built around data models at the repository level. In addition to that we support DOI Versioning: https://support.datacite.org/docs/versioning

The version number follows semantic versioning principles. A dataset can also carry tags such as "stable", "current" or "dev" that point to a specific version; tags can be updated and queried separately.
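
A minimal sketch of version numbers with movable tags is shown below. The class `VersionedDataset` and its methods are hypothetical and only illustrate the idea; they say nothing about how nmrXiv or DataCite store versions.

```python
from dataclasses import dataclass, field


def parse_version(text: str) -> tuple:
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints; raises ValueError if malformed."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


@dataclass
class VersionedDataset:
    versions: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)

    def release(self, version: str) -> None:
        parse_version(version)  # validate before storing
        self.versions.append(version)
        self.versions.sort(key=parse_version)  # numeric order, so 1.10.0 follows 1.2.0

    def tag(self, name: str, version: str) -> None:
        """Point a tag such as 'stable' at an existing version; re-tagging moves it."""
        if version not in self.versions:
            raise KeyError(version)
        self.tags[name] = version

    def resolve(self, ref: str) -> str:
        """Resolve either a tag name or an explicit version number."""
        if ref in self.tags:
            return self.tags[ref]
        if ref in self.versions:
            return ref
        raise KeyError(ref)


dataset = VersionedDataset()
for number in ("1.0.0", "1.10.0", "1.2.0"):
    dataset.release(number)
dataset.tag("stable", "1.2.0")
dataset.tag("current", dataset.versions[-1])
print(dataset.versions)           # ['1.0.0', '1.2.0', '1.10.0']
print(dataset.resolve("stable"))  # 1.2.0
print(dataset.resolve("current")) # 1.10.0
```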

124 of 127

Ontologies

Giving (meta)data meaning with ontologies

Ontology-driven input fields and text areas not only provide a rich user experience but also capture rich metadata, ensuring machine readability.

125 of 127

nmrXiv - Ontology component

Smart compose - Ontologies / Controlled Vocabulary driven

126 of 127

Compatible with

(use React components in Vue app)

Source: https://vuejs.org and https://reactjs.org

127 of 127

Deployments

https://github.com/NFDI4Chem/repo-helm-charts

https://docs.nmrxiv.org/docs/developer-guides/deployment/helm