1 of 37

1) WorldFAIR+ and CDIF: vision, progress, and next steps: https://bit.ly/wfpluscdif

Simon Hodson, Steve Richard

2 of 37

Making Data Work – WorldFAIR – WorldFAIR+

Making Data Work

(2018-2022)

WorldFAIR

(2022-2024)

WorldFAIR+

(2024+)

3 of 37

  • See WorldFAIR Final Message from Coordinator https://bit.ly/WorldFAIR-Final-Coordinator-Message and https://bit.ly/worldfair-plus
    • Summary of project, links to key outputs and next steps.
  • Reports, recommendations, guidelines, implementation examples and training materials from 11 Case Studies: https://bit.ly/WorldFAIR-Case-Study-Outputs
    • So many useful materials for all the subjects covered by WorldFAIR!
  • Experience with FAIR Implementation Profiles (FIPs) https://doi.org/10.5281/zenodo.11236094 and Recommendations for FAIR Assessment https://doi.org/10.5281/zenodo.11242737
    • Use and further develop FIPs; put the FAIR Implementation Profiles horse in front of the FAIR assessment cart!
  • Policy Recommendations: https://doi.org/10.5281/zenodo.11242702 :
    • We urgently need to shift from a bibliographic to an engineering approach to data stewardship. We need metadata uplift.
  • Cross-Domain Interoperability Framework: https://doi.org/10.5281/zenodo.11236871 (original report) and now CDIF Book https://cdif.codata.org
    • Considerable interest shown (over 4000 downloads)
    • A practical guide to FAIR implementation! Adopt widely used web standards and use them in line with good practice.

WorldFAIR Outputs

4 of 37

Enabling Global FAIR Data: https://doi.org/10.5281/zenodo.11242702

  • Addressing major, global challenges (DRR, climate adaptation) requires FAIR-enabled research and data infrastructures.
  • Data platforms and systems need to aggregate and integrate data, and provide data products for research, policy and action.
  • Requires a shift from a ‘bibliographic’ data stewardship practice to a data engineering practice.
  • Requires implementing the FAIR principles and significant metadata and semantic uplift.
  • Other benefits include:
    • Reproducible research and transparent data use
    • Increasingly automated use of data, including sensitive data
    • Responsible and effective use of AI
  • Key enablers are the FAIR principles and the Cross-Domain Interoperability Framework (CDIF).

CODATA-WorldFAIR Policy Recommendations: Enabling Global FAIR Data

5 of 37

What is CDIF?

  • The Cross Domain Interoperability Framework (CDIF) is a set of practical, implementation-level principles designed to improve data management practices within any community and lower the barriers to cross-domain data reuse. CDIF offers standards and methodologies for achieving different levels of interoperability necessary for reusing data across diverse domains. It is built around five core profiles that address the essential functions for implementing cross-domain FAIR principles.
  • CDIF was first released in May 2024 as an output of the WorldFAIR project: https://doi.org/10.5281/zenodo.11236871
  • The point of reference for CDIF and its component profiles is now the CDIF Book: https://cdif.codata.org
  • CDIF has attracted a lot of interest and has led directly to a set of additional projects and collaborations.

6 of 37

Discovery Profile

  • Discovery profile: https://bit.ly/cdif-discovery
  • Variable description in the discovery metadata
    • Name of the variable as it appears in the dataset.
      • Uses schema.org variableMeasured.
    • Text description.
    • propertyID with URI for the represented concept.
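
A minimal sketch of how the three variable-description elements above might appear in a discovery record, built here as a Python dictionary and serialised as JSON-LD. The dataset name, variable name and concept URI are placeholders invented for illustration; only the use of schema.org variableMeasured, a text description and propertyID comes from the profile summary above.

```python
import json


def describe_variable(name, description, concept_uri):
    """Build a schema.org PropertyValue describing one dataset variable."""
    return {
        "@type": "PropertyValue",
        "name": name,                # name as it appears in the dataset
        "description": description,  # free-text description
        "propertyID": concept_uri,   # URI of the concept the variable represents
    }


# Hypothetical dataset record; all values are illustrative only.
dataset = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "Example air temperature series",
    "variableMeasured": [
        describe_variable(
            "air_temp",
            "Air temperature at 2 m above ground",
            "https://example.org/concepts/air-temperature",  # placeholder URI
        )
    ],
}

print(json.dumps(dataset, indent=2))
```

Keeping the variable name, description and concept URI side by side lets a harvester match variables across datasets by concept even when the column names differ.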

7 of 37

Description Profile: DDI CDI for Data Structure, Variable Cascade, Provenance…

  • Important to think about how we combine data for cross-domain research.
  • Data Documentation Initiative (DDI) Cross-Domain Integration (CDI) specification contains three modules to assist with this:
    • Structural Description: assists processing of data structure transformations across four data structures.
    • Data Description / Variable Cascade describes data at an atomic level, describes relationships between concepts, representations and instances (assists with combining data and documenting information loss).
    • Provenance and Processing: module uses PROV-O and SDTL to provide and relay provenance and processing information.
    • Now officially released: https://ddialliance.org/ddi-cdi

8 of 37

CDIF, Next Steps

9 of 37

CDIF as a ‘curated collection of “FAIR-enabling resources” ’

  • Discovery:
    • Use JSON-LD serialization
    • Use Schema.org / DCAT
    • Use robots.txt or FAIR Signposting
    • Include Schema.org variableMeasured
  • Access:
    • Use ODRL
      • Use ODRL, DPV, DUO (work in progress)
  • Vocabularies (semantic artefacts…):
    • Use SKOS
      • Further work on ontologies, infrastructures, mappings (SSSOM, AI)

  • Data / variable description
    • Use DDI-CDI for description of data structure and variable cascade.
      • Further work on variable description, aligning with O&M, I-ADOPT.
  • Universals
    • Time and place
    • Units of measurement
      • Further work on this guidance, updating developments in units of measurement.
  • Context, provenance and quality
    • Priority area of work, including at this year’s Dagstuhl workshop

10 of 37

WorldFAIR+

Vision:

  • Federation of case studies (existing and new), with parallel funding and supported by a coordinating mechanism with technical expertise. Focus on projects and collaborations to implement, test, refine and extend CDIF
  • Seven confirmed projects!
  • Exploring collaborations with existing (IUPAC, Embrapa, AuScope, OneGeochemistry…) and new (ARDC, DataObservatory…) partners.

Potential Case Studies and partnerships:

  • Do you have a potential (project, initiative) case study needing FAIRification, data engineering, metadata uplift? Would you be interested in CDIF implementation?
  • Keen to discuss potential case studies!

WorldFAIR+ how to get involved?

  • ISC has approved WorldFAIR+ as part of its portfolio of activities: https://bit.ly/ISC-WorldFAIR-PLUS ; vision and approach for WorldFAIR+: https://bit.ly/worldfair-plus
  • Will put in place lightweight MoU / LoA for case studies and partner projects.
  • Contact simon@codata.org

11 of 37

‘WorldFAIR+’, CDIF Implementation Projects

  1. “Data Science Without Borders”: Wellcome-funded project. Population health, building on WorldFAIR WP07. CDIF implementation. Combining population health / statistical data, clinical outcome data, phylogenetic data, environmental data. Also privacy management, ML/AI to enable federated analysis across four African health research centres (Kenya, Ethiopia, Senegal, Cameroon). Three years. Underway. Africa.
  2. “FAIR Data and Emergencies”: ISC-funded. Disaster Risk Reduction (DRR) research. Applying the WorldFAIR methodology, implementing CDIF components in case studies on earthquake data (Turkey) and flooding / cholera (Malawi). 18 months. Started 1 September 2024. Africa and Turkey.
  3. “CDIF-4-XAS”: OSCARS cascading grant (EC). Describing X-ray absorption spectroscopy (XAS) data with CDIF to enhance interoperability and enable interdisciplinary reuse. Two years. Started 1 October 2024. Europe (Germany and UK, but with global relevance and partners).
  4. “CLIMATE-ADAPT4EOSC”: Major EC-funded project, FAIR data and innovative services for climate adaptation. CDIF implementation for legal, organisational, semantic and technical (LOST) interoperability. Three case studies: urban heat (Greece); oceans / coastal management (Portugal); clay soils / hydrology / built environment / insurance (France). Four years. Started 1 Jan 2025. Europe.
  5. “JUST SAFE”: EC-funded, linked to CLIMATE-ADAPT. CDIF implementation, particular emphasis on climate adaptation and citizen science data, legal and policy recommendations. Four years. Started 1 May 2025. Europe.
  6. “TOGETHER”: EC-funded, linked to CLIMATE-ADAPT. CDIF implementation and data integration for disaster risk management. Case studies in Norway (mud-slides), Greece (wildfires), and Spain (flooding). Three years. Starts 1 Nov 2025. Europe.
  7. “FAIR Principles implementation for DDE”: Implementation of FAIR principles, alignment of IUGS CGI standards with CDIF, for cross-domain research topics and data reuse in geology. Three years. Started August 2025. Funding from DDE. Global.

12 of 37

2) Progress with funded projects (CDIF4XAS, Climate-Adapt4EOSC, FAIR4DDE)

Simon Hodson, Steve Richard

13 of 37

CDIF-4-XAS: Overview

  • Overview of standards, vocabularies (and ontologies), data formats and practices within the XAS area (landscape analysis): https://doi.org/10.5281/zenodo.14920226
    • Survey of XAS database schemas and software-dependent XAS schemas.
    • Survey of XAS community standardisation effort: observe an alignment around NXxas for multi-spectra raw and processed data and XDI for single spectra data (reinforced by an IUCr recommendation).

Products

  • Mapping and description of NXxas and XDI data using CDIF.
    • using the CDIF Discovery profile for metadata;
    • characterise XDI (table) and NXxas (HDF5) data structures using DDI-CDI;
    • Example metadata instances;
    • Vocabulary for key concepts;
    • will be published next week… !

14 of 37

X-Ray Absorption Challenge

  • How to create metadata records describing data in dramatically different formats to enable frictionless discovery and integration of data
    • XDI – spectra in columns in a text file, with metadata in comments in a header.
    • NEXUS – spectra as a set of arrays in a binary file with embedded metadata.
  • The information represented in these formats is closely aligned, but arranged differently
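
To make the first of these formats concrete, here is a rough sketch of reading a text file with metadata in header comments and spectra in columns. It assumes a simplified XDI-style layout (header lines begin with '#', metadata lines look like '# Namespace.tag: value', data rows are whitespace-separated numbers); the sample content and its values are made up, and this is not a complete XDI reader.

```python
def parse_xdi_like(text):
    """Split an XDI-style text file into header metadata and data columns.

    Assumes a simplified layout: header lines start with '#', metadata
    lines look like '# Namespace.tag: value', and data rows are
    whitespace-separated numbers.
    """
    metadata, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            if ":" in body:
                key, value = body.split(":", 1)
                metadata[key.strip()] = value.strip()
            continue
        rows.append([float(x) for x in line.split()])

    # Column labels come from 'Column.N' metadata entries, ordered by N.
    column_keys = sorted(
        (k for k in metadata if k.startswith("Column.")),
        key=lambda k: int(k.split(".")[1]),
    )
    labels = [metadata[k].split()[0] for k in column_keys]
    columns = {label: [row[i] for row in rows] for i, label in enumerate(labels)}
    return metadata, columns


# Made-up sample content for illustration.
SAMPLE = """\
# XDI/1.0
# Element.symbol: Se
# Element.edge: K
# Column.1: energy eV
# Column.2: mutrans
#----
12640.0 0.512
12645.0 0.734
12650.0 1.201
"""

metadata, columns = parse_xdi_like(SAMPLE)
print(metadata["Element.symbol"], columns["energy"])
```

The header values are the kind of information that is constant over the dataset, while the columns carry the spectra themselves.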

From Matthew Newville, 2023

15 of 37

Proposed solution:

  • Put data describing the experiment context in the metadata record
    • Information that is constant over the dataset
    • Context and configuration important for discovery and initial assessment
  • Metadata includes sufficient description of data structure to enable software to extract what is needed for analysis (DDI-CDI role)
  • To start, limit to files containing spectra from a single sample and experiment.

16 of 37

Spectra Data

  • Maps incident energy to any of several results: transmitted intensity, fluorescent intensity, absorption coefficient…
    • Can be represented in tabular form, or as a set of one dimensional arrays with the same dimension.
  • A raw data array might have multiple values for each incident energy level.
  • For XDI use DDI-CDI WideDataStructure description
  • For NEXUS use DDI-CDI DimensionalDataStructure description
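
The equivalence between the tabular form and the set of equal-length one-dimensional arrays can be sketched as a round trip. This only illustrates the two layouts; it does not use or model the DDI-CDI classes themselves, and the column names and values are invented.

```python
# Wide (tabular) layout: one row per incident energy, one column per measure.
# Values are invented for illustration.
wide_rows = [
    {"energy": 12640.0, "i0": 1000.0, "itrans": 600.0},
    {"energy": 12645.0, "i0": 1002.0, "itrans": 480.0},
    {"energy": 12650.0, "i0": 998.0, "itrans": 300.0},
]


def wide_to_arrays(rows):
    """Turn row-oriented records into equal-length one-dimensional arrays."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def arrays_to_wide(arrays):
    """Inverse: zip equal-length arrays back into row-oriented records."""
    lengths = {len(values) for values in arrays.values()}
    if len(lengths) != 1:
        raise ValueError("all arrays must share the same dimension")
    names = list(arrays)
    return [dict(zip(names, values)) for values in zip(*arrays.values())]


arrays = wide_to_arrays(wide_rows)
print(arrays["energy"])
print(arrays_to_wide(arrays) == wide_rows)
```

Because the round trip loses nothing, a structure description that says which layout a file uses is enough for software to present either view.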

17 of 37

Implementation

18 of 37

"schema:about": {"@id": "xas:485749"},
"schema:description": "metadata about documentation for se_na2so4",
"dcterms:conformsTo": [
  {"@id": "cdif:profile_basic_1.0"},
  {"@id": "cdif:profile_xasCDIF"}
]

Self describing modularization
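
One way software might use the dcterms:conformsTo declaration on this slide is to pick which checks to run from the profiles a record claims. The sketch below reuses the record fragment from the slide; the validator registry and the rules inside it are hypothetical placeholders, not the actual CDIF profile requirements.

```python
record = {
    "schema:about": {"@id": "xas:485749"},
    "schema:description": "metadata about documentation for se_na2so4",
    "dcterms:conformsTo": [
        {"@id": "cdif:profile_basic_1.0"},
        {"@id": "cdif:profile_xasCDIF"},
    ],
}


def declared_profiles(rec):
    """Return the set of profile identifiers a record says it conforms to."""
    value = rec.get("dcterms:conformsTo", [])
    if isinstance(value, dict):  # tolerate a single object instead of a list
        value = [value]
    return {item["@id"] for item in value}


# Hypothetical registry: the rules are placeholders, not real profile checks.
VALIDATORS = {
    "cdif:profile_basic_1.0": lambda rec: "schema:about" in rec,
    "cdif:profile_xasCDIF": lambda rec: "schema:description" in rec,
}


def validate(rec):
    """Run only the validators for profiles the record declares."""
    return {
        pid: VALIDATORS[pid](rec)
        for pid in sorted(declared_profiles(rec))
        if pid in VALIDATORS
    }


print(validate(record))
```

A record that declares an additional profile simply triggers an additional check, which is what makes the modular approach self-describing.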

19 of 37

  • Cluster of EC-funded projects, including Climate-Adapt4EOSC, looking at various case studies, including urban heat, coastal management and shrink-swell of soils.
    • CDIF for semantic and technical interoperability.
    • Incorporation and mapping of key semantics.
    • RO-Crates for packaging and orchestration.
  • Exploring how to maximise and automate solutions for Legal and Organisational Interoperability.
    • Identify commonly encountered LOI obstacles.
    • Identify corresponding LOI enablers (agreements/contracts, licences, conditions) and test them with case studies.
    • Express them in code (building on DPV, ODRL, DUO standards).
    • Report on landscape and proposed solutions, Dec 2025.
    • Legal and Organisational IFs, late 2027.
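
As a rough illustration of expressing an access condition in code, the sketch below writes a small ODRL-style policy as a Python dictionary and evaluates it with a toy function. The policy terms (permission, target, action, constraint, leftOperand, operator, rightOperand) follow the ODRL vocabulary, but the URIs are placeholders and the evaluator is a deliberately naive assumption, not an ODRL engine and not the project's solution.

```python
# Placeholder identifiers, invented for illustration.
DATASET = "https://example.org/dataset/urban-heat"
RESEARCH = "https://example.org/purpose/research"
MARKETING = "https://example.org/purpose/marketing"

policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "https://example.org/policy/1",
    "permission": [
        {
            "target": DATASET,
            "action": "use",
            "constraint": [
                {
                    "leftOperand": "purpose",
                    "operator": "eq",
                    "rightOperand": {"@id": RESEARCH},
                }
            ],
        }
    ],
}


def is_permitted(policy, target, action, purpose):
    """Toy check: handles only 'purpose eq <IRI>' constraints on permissions."""
    for rule in policy.get("permission", []):
        if rule["target"] != target or rule["action"] != action:
            continue
        constraints = rule.get("constraint", [])
        if all(
            c["leftOperand"] == "purpose"
            and c["operator"] == "eq"
            and c["rightOperand"]["@id"] == purpose
            for c in constraints
        ):
            return True
    return False


print(is_permitted(policy, DATASET, "use", RESEARCH))
print(is_permitted(policy, DATASET, "use", MARKETING))
```

A machine-readable condition of this kind is what would let an access decision be automated rather than negotiated case by case.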

CDIF in Climate-Adapt4EOSC

20 of 37

CDIF-4-XAS: Next Steps

  • Use cases: CDIF-4-XAS contends that increased standardisation of metadata, following CDIF recommendations, will increase the reuse potential of XAS data outside the original experiment.
  • Concrete use cases need to be identified to demonstrate that this is indeed the case.

21 of 37

  • “FAIR Principles implementation for DDE”: Implementation of FAIR principles, alignment of IUGS CGI standards with CDIF, for cross-domain research topics and data reuse in geology. Three years. From August 2025. Global. Funding from IUGS.
    • Enabling the alignment of IUGS CGI and other geology standards with CDIF.
    • Envisage a similar methodology to the OSCARS project with XAS data.
    • Map and implement CDIF discovery profile.
    • Implement CDIF data description profile.
    • Highlight importance of authoritative and FAIR concept schemes.

CDIF in DDE

22 of 37

  • Good correspondence
  • Some DDE elements are not in CDIF profile
  • Create DDE/CDIF profile with those elements

Mapping CDIF discovery -- DDE Metadata

23 of 37

3) Updates on recent funding proposals and implications

Simon Hodson

24 of 37

Updates on recent funding proposals

  • CDIF centred proposal: CDIF4EOSC (FAIR Integration)
    • Will create a CDIF playbook for EOSC including new profiles (including
  • Plays an important role in AI4SocialPlus (AI Readiness)
  • Contributory roles in two Climate-Adapt related proposals (HeatProof4All and JUST-COAST), a Data Stewardship proposal (StarData), a reproducibility proposal (RE3FAIR), and two EU-AU collaborations (CoClimate and BAHATI).

  • CDIF forms an important part of an ARDC proposal (A Connected, Ethical and AI-Ready Data Ecosystem for Australian Research Excellence).

  • Keen to explore partnerships in various countries, geographies for CDIF-related projects.
  • Encourage partners to prepare CDIF-related proposals for funding, to discuss possible collaborations on case studies, implementations, work to extend profiles.

25 of 37

CDIF4EOSC

  • Will hear about funding in Jan…
  • Will create a CDIF playbook for EOSC and beyond.
  • New profiles:
    • Machine-actionable navigation (FAIR signposting)
    • AI readiness (Croissant)
    • FAIR software repositories (CodeMeta)
    • Machine-actionable mediation of access (DPV, ODRL, DUO)
    • Interoperability of complex variables (I-ADOPT)
    • Ontological mappings (SSSOM)
    • Metadata for semantic artefacts (MOD)
    • AI-powered ontology alignment (OCET and more)
    • Packaging of FAIR digital objects (RO-Crate)
  • Suite of AI-assisted tools for each of these requirements.
  • Three use cases (ocean sciences; social sciences and climate; SSbD materials

26 of 37

4) Plans for upcoming Dagstuhl workshop

Simon Hodson

27 of 37

Dagstuhl Workshop: the Provenance Chain

  • Upcoming Dagstuhl workshop ‘The Provenance Chain: Connecting and Reusing Data, Models, and Experiments’ will look at:
    • Provenance, context and quality
    • Access (implementation of ODRL, DPV, DUO to manage responsibilities)
    • Metadata mapping and data integration
    • CDIF for XAS

28 of 37

5) Priority topics: CDIF, Croissant and AI; dealing with binary data formats; context, provenance and data quality.

Steve Richard, Slava Tykhonov, Simon Hodson

29 of 37

Croissant for Machine Learning

30 of 37

Responsible AI: CroissantML and DDI

(Data Documentation Initiative)

Responsible AI

“As AI advances at rapid speed there is increased recognition among researchers, practitioners and policy makers that we need to explore, understand, manage, and assess its economic, social, and environmental impacts. One of the main instruments to operationalise responsible AI (RAI) is dataset documentation.

This is how Croissant helps address RAI:

  1. It proposes a machine-readable way to capture and publish metadata about ML datasets – this makes existing documentation solutions like Data Cards easier to publish, share, discover, and reuse;
  2. It records at a granular level how a dataset was created, processed and enriched throughout its lifecycle – this process is meant to be automated as much as possible by integrating Croissant with popular ML frameworks. By allowing the metadata to be loaded automatically, Croissant also enables developers to compute RAI metrics automatically and systematically, identifying potential data quality issues to be fixed.

Croissant is designed to be modular and extensible. One such extension is the Croissant RAI vocabulary, which addresses 7 specific use cases, starting with the data life cycle, data labeling, and participatory scenarios to AI safety and fairness evaluation, traceability, regulatory compliance and inclusion. More details are available in the . We welcome additional extensions from the community to meet the needs of specific data modalities (e.g. audio or video) and domains (e.g. geospatial, life sciences, cultural heritage).”

Croissant spec v1.0

31 of 37

CDIF-driven DDI variable cascade integrated into CroissantML

Note: CroissantML defines an AI-ready metadata layer; CDIF is the graph representation of expert knowledge

32 of 37

Multilingual properties in Semantic Croissant: “energy”

Short Description

Energy is the capacity to do work or perform tasks. It is a fundamental concept in physics and is often measured in units such as joules or kilowatt‑hours. Energy can be transferred from one object to another, and can be transformed from one form to another. It is essential for powering machines, lighting homes, and powering transportation systems.

AI-generated concept description powered by CDIF and based on factual data (MCP)

33 of 37

  • Identify variables represented
  • Challenge is how to describe structure
    • typically bespoke tools are necessary to access data in binary formats
    • For FAIR data, these tools must be easily accessible, and free
    • With HDF5/NetCDF we can use the path structure
  • We need more example data for testing
  • What formats should be supported?
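
The idea of using the path structure can be sketched without any binary-format library by letting a nested dictionary stand in for a hierarchical file: metadata records the path of each variable, and software resolves the path to the array. The group names, paths and values below are hypothetical; with a real HDF5 file the same walk would typically be done with a library such as h5py (for example its Group.visititems method).

```python
def list_paths(group, prefix=""):
    """Yield (path, array) for every leaf array in a nested-group hierarchy."""
    for name, node in group.items():
        path = f"{prefix}/{name}"
        if isinstance(node, dict):
            yield from list_paths(node, path)
        else:
            yield path, node


# A nested dict standing in for the group hierarchy of a binary file.
FILE_TREE = {
    "entry": {
        "instrument": {"monochromator": {"energy": [12640.0, 12645.0, 12650.0]}},
        "data": {"intensity": [600.0, 480.0, 300.0]},
    }
}

# What a metadata record might carry: variable name -> path inside the file.
VARIABLE_PATHS = {
    "energy": "/entry/instrument/monochromator/energy",
    "intensity": "/entry/data/intensity",
}

index = dict(list_paths(FILE_TREE))
resolved = {var: index[path] for var, path in VARIABLE_PATHS.items()}
print(sorted(index))
print(resolved["energy"])
```

With paths recorded in the metadata, a generic reader can extract the variables it needs without format-specific knowledge beyond the container itself.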

Binary Formats

34 of 37

Context, Provenance and Quality

  • What do we need to know about the context of data for reuse, particularly outside the original domain?
    • ‘Object of interest’ and variables
    • Methodology and technique > Minimal information standards
  • How can this information be relayed, particularly using PROV and extensions?
  • What quality criteria need to be communicated to the user for them to make an informed decision?
    • Accuracy/uncertainty, completeness, consistency, reliability, relevance, reproducibility / replicability.

35 of 37

6) Governance, release management and licensing

Discussion

36 of 37

Governance, release management and licensing

  • Reinitiate the CDIF WGs from later this month.
  • Need to develop an editorial group to coordinate / align WGs, releases etc.
  • Licensing: CC-BY? CC-BY-NC? Do we need a means of licensing for commercial use?
  • Mechanisms for overall governance?

37 of 37

7) Suggestions, recommendations, opportunities for collaboration, AOB?

Discussion