1 of 30

Decentralized identifiers (DIDs) for sustainable AI in the Dataverse data network

Semantic Croissant regular meeting

Slava Tykhonov

Harvard Dataverse Ambassador

DANS-KNAW, the Netherlands

June 23rd, 2025

DANS is an institute of KNAW and NWO. Dutch national centre of expertise and repository for research data

www.dans.knaw.nl

2 of 30

Croissant 1.0 for Machine Learning

Kaggle data

OpenML data platform

3 of 30

Scan to access slides and links

4 of 30

Croissant ML export in Dataverse

Croissant exporter

Code

Mappings

5 of 30

Knowledge graph for AI: metadata across all Data Stations queried with SPARQL

Qlever-based MuseIT triple store

6 of 30

Dataverse AI Guide

7 of 30

Model Context Protocol (MCP) for Dataverse

8 of 30

MCP protocol for Dataverse powered by Croissant 1.0 ML

9 of 30

Common questions for Dataverse MCP

Common Questions

  • Do onboarding and give me an overview of all Dataverses
  • List Dataverses from the US
  • How many datasets in dataverse.nl?
  • How many datasets on {query} exist in the whole Dataverse network?
  • How many Dataverse installations were created over the last 10 years, by country?
  • How many datasets exist in France?
  • How many datasets on economics are in dataverse.nl?
  • Which countries have added the most new Dataverse nodes since 2015?
  • What kinds of files are in energy consumption datasets from dataverse.nl?
  • How many datasets were published in France in 2024?
  • Compare number of datasets between Johns Hopkins and Harvard Dataverse.
  • I'm studying gender inequality in education. What datasets could help?
  • Give me an overview of dataset doi:10.17026/dans-x8n-hfvr
    1. Where was this coin found?
    2. What is the age of the coin?

🇳🇱 Summary: Dutch Dataverses (June 2025)

  • DataverseNL: 8,161 datasets
  • DANS Data Station Archaeology: 162,435 datasets
  • DANS SSH: 7,932 datasets
  • DANS Life Sciences: 997 datasets
  • DANS Physical & Technical Sciences: 845 datasets
  • IISH Dataverse: 358 datasets
  • ODISSEI Portal: 10,163 datasets

These are the major Dutch Dataverse nodes covering a wide range of research areas.

10 of 30

MCP powered by Croissant ML 2.0

MCP primitives are modular, dynamic “moving pieces” designed to evolve over time:

  • LLM prompts can be created by both humans and machines
  • Resources (data points, metadata) can be dynamic, for example streamed
  • MCP tools can be connected and disconnected (offline)

MCP is about structuring the context in which machine learning models, especially LLMs, operate. This includes everything that informs, constrains, or modifies model behavior.

Open question: How to make workflows persistent, sustainable and FAIR?

11 of 30

Decentralized Resource Identity & Trust Framework (DRIFT)

DRIFT is a decentralized identity and trust layer that propagates traceable context — such as sessions, prompts, user metadata, and spans — across AI tools and services in event-driven environments like the Model Context Protocol (MCP).

In software engineering, data drift is a key concern that can affect system reliability and trust in data processing pipelines. DRIFT helps mitigate these risks by making identity and changes traceable.

Three common types of drift include:

  • Infrastructure Drift – Changes in the software environment that can break or invalidate infrastructure configurations.
  • Structural Drift – Changes to the data schema that may invalidate databases or data contracts.
  • Semantic Drift – Changes in the meaning of data without altering its structure, often caused by multiple developers independently modifying system components.

By assigning decentralized identifiers (DIDs) and enforcing consistent, signed metadata, DRIFT supports systems in managing and auditing these types of drift.

DRIFT enables trusted interoperability between distributed AI components by providing unique, verifiable identities for all parts of an AI pipeline, such as sessions, prompts, users and tools.
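
As a minimal sketch of how drift could be made auditable, the Python below fingerprints a record's structure separately from its content. A structural change is flagged directly, while a content change under an unchanged structure is only a candidate for semantic-drift review. The function names and record fields are hypothetical illustrations, not part of DRIFT itself.

```python
import hashlib
import json


def fingerprint(value) -> str:
    """Return a SHA-256 hex digest of a JSON-serialisable value."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def schema_of(record: dict) -> dict:
    """Describe a record's structure as field name -> type name."""
    return {key: type(val).__name__ for key, val in record.items()}


def classify_drift(old: dict, new: dict) -> str:
    """Classify the change between two versions of a record."""
    if fingerprint(schema_of(old)) != fingerprint(schema_of(new)):
        return "structural"  # fields or types changed
    if fingerprint(old) != fingerprint(new):
        return "content"     # same structure, different values
    return "none"
```

Storing both fingerprints alongside an identifier would let a later audit tell which kind of change happened between two versions.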

“In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.”

Source: Wikiland

12 of 30

Why is it important? DRIFT has another nature

DRIFT is in fact designed for machine learning on streaming data, where knowledge graphs are continuously evolving.

We need to change the current AI paradigm to give all rights back to content creators and publishers:

  • Identifiers assigned to MCP primitives should guarantee reproducibility
  • A reliable mechanism is needed to filter out low-quality (“poison”) material
  • Users should get answers based on trustworthy and verifiable information
  • The new generation of AI models should be fully decentralized

13 of 30

Decentralized FAIR data network for MCP

Source: Wikipedia

We are considering an experimental implementation of decentralized identifiers as an extension for archiving various types of content.

DIDs can be assigned to any artefact (not only MCP primitives), including images, audio and video, for example to store and link metadata records and provenance information together with the digitized content.

A DID can be private (invisible and not resolvable by the public) yet accessible with a cryptographic key.

14 of 30

FAIR decentralized identifiers for AI

We envision a near future in which it will be possible to create a decentralized system that does not depend on any specific registry, single provider, or single authority: all connections are established in a peer-to-peer network, yet remain persistent. This solution should support AI workflows and infrastructure in line with the FAIR principles of findability, accessibility, interoperability, and reusability.

This can be achieved with a global decentralized identifier (DID).

Resolving a decentralized identifier (DID) is cryptographically verifiable, proving the identity and ownership of that identifier. It can give the Model Context Protocol a sustainable infrastructure for keeping the provenance and origins of prompts, resources and tools.

Core DID features are listed below:

  1. A permanent (persistent) identifier (never changes)
  2. A resolvable identifier (you can look it up to discover metadata)
  3. A cryptographically-verifiable identifier (with private and public keys)
  4. A decentralized identifier (no centralized authority)

DIDs should bring control of all provenance and metadata back to their owners instead of giving it away. At the same time, the public part need not differ much from other persistent identifiers such as DOIs, and could even replace them for specific use cases such as sharing sensitive data.
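
To make the "resolvable" and "decentralized" features concrete, here is a small sketch of splitting a DID into its method name and method-specific identifier, which is the information a resolver needs to pick the right method driver. The general `did:<method>:<method-specific-id>` shape comes from the W3C DID syntax; the `parse_did` helper is a hypothetical illustration and its validation is deliberately simplified.

```python
from typing import NamedTuple


class ParsedDID(NamedTuple):
    method: str
    method_specific_id: str


def parse_did(did: str) -> ParsedDID:
    """Split 'did:<method>:<method-specific-id>' into its two parts."""
    scheme, _, rest = did.partition(":")
    method, _, specific_id = rest.partition(":")
    if scheme != "did" or not method or not specific_id:
        raise ValueError(f"not a DID: {did!r}")
    return ParsedDID(method, specific_id)


def is_did(identifier: str) -> bool:
    """Return True when the identifier has the basic DID shape."""
    try:
        parse_did(identifier)
    except ValueError:
        return False
    return True
```

Other persistent identifiers, such as a DOI written as `doi:10.17026/dans-x8n-hfvr`, are rejected by this check because the scheme is not `did`.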

15 of 30

The place of DID as unified resource

Source: “Self-Sovereign Identity” by Alex Preukschat and Drummond Reed

A DID can be considered a “replacement” for domain names and DNS in the “centralized” network

16 of 30

The role of private and public key, and service endpoints in DID

Service endpoints describe how to interact with the subject: which protocols and which network endpoints are available, for example to connect to an agent that represents the data subject and then exchange credentials or other messages.

17 of 30

Universal Resolver for AI primitives

Registrar output:

Decentralized identifier (DID): did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G

curl https://dev.uniresolver.io/1.0/identifiers/did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G

"service": [
  {
    "id": "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G#payload",
    "type": "Custom",
    "serviceEndpoint": "https://oydid.ownyourdata.eu",
    "payload": {
      "prompt": "hello MCP"
    }
  }
]
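
The resolver output above can be consumed programmatically. The sketch below builds the resolver URL used in the curl call and pulls the prompt payload out of the `service` array. It assumes the object holding `service` has already been extracted from the resolver response; `resolver_url` and `find_payload` are hypothetical helper names, and no network request is made.

```python
import json

# Base URL taken from the curl call on the slide.
RESOLVER_BASE = "https://dev.uniresolver.io/1.0/identifiers/"

DID = "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"

# The fragment shown on the slide, wrapped in an object so it is valid JSON.
DOCUMENT = json.loads("""
{
  "service": [
    {
      "id": "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G#payload",
      "type": "Custom",
      "serviceEndpoint": "https://oydid.ownyourdata.eu",
      "payload": {"prompt": "hello MCP"}
    }
  ]
}
""")


def resolver_url(did: str) -> str:
    """Return the Universal Resolver URL for a DID."""
    return RESOLVER_BASE + did


def find_payload(document: dict, did: str):
    """Return the payload of the '<did>#payload' service, or None."""
    wanted = f"{did}#payload"
    for service in document.get("service", []):
        if service.get("id") == wanted:
            return service.get("payload")
    return None


if __name__ == "__main__":
    print(resolver_url(DID))
    print(find_payload(DOCUMENT, DID))
```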

18 of 30

“Fixing” ownership issues in the online world

A proposal for “fixing the Internet” with a decentralized, cryptographically secure protocol for web identity, trust, and content provenance:

DID for Every Web Page

  • Every individual web page (not just the domain) receives a unique DID (e.g., did:web:example.com:page.html).
  • These DIDs act as digital passports for web resources, enabling:
    • Verifiable identity of creators/publishers
    • Trustable metadata (authorship, versioning, license, etc.) with signed and timestamped assertions
  • Private keys are stored in “digital wallets” or registries hosted by large vendors such as Microsoft, Amazon, or Cloudflare.
  • Public keys are published via a DID Document, retrievable through well-known URIs or a global ledger.
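
As a sketch of how a per-page DID could lead to its DID Document, the function below follows the mapping described in the did:web method specification: colons in the identifier become path separators, a bare domain resolves under `/.well-known/`, and a percent-encoded colon marks an optional port. The function name is a hypothetical choice and the example identifiers are placeholders.

```python
from urllib.parse import unquote


def did_web_to_url(did: str) -> str:
    """Map a did:web identifier to the HTTPS URL of its DID Document."""
    prefix = "did:web:"
    if not did.startswith(prefix):
        raise ValueError("not a did:web identifier")
    parts = did[len(prefix):].split(":")
    host = unquote(parts[0])  # "%3A" encodes an optional port
    path = "/".join(parts[1:])
    if not path:
        return f"https://{host}/.well-known/did.json"
    return f"https://{host}/{path}/did.json"
```

For example, `did_web_to_url("did:web:example.com:page.html")` yields `https://example.com/page.html/did.json`, so each page can publish its own document next to its content.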

“Ownership” can be transferred with did:web identifiers by handing over the private keys for the DIDs of all of a website’s pages.

Web content (HTML, RDFa, JSON-LD, etc.) is signed using the private key of its DID.

Signed assertions can include copyright and licensing, chain of modifications, author/organization DID, time of creation, expiry, or deprecation. This enables cryptographic verification of who said what, when, and where.
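
A minimal sketch of a signed, verifiable assertion follows. Python's standard library has no asymmetric signatures, so HMAC-SHA256 with a shared secret stands in here for the signature made with the DID's private key; a real deployment would use a public-key scheme from a dedicated cryptography library. The field names, the key, and the example DID are hypothetical.

```python
import hashlib
import hmac
import json


def canonical(assertion: dict) -> bytes:
    """Serialise deterministically so signer and verifier hash the same bytes."""
    return json.dumps(assertion, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_assertion(content: str, author_did: str, license_id: str, created: str) -> dict:
    """Bind a content digest to author, license, and creation time."""
    return {
        "contentDigest": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "author": author_did,
        "license": license_id,
        "created": created,
    }


def sign(assertion: dict, key: bytes) -> str:
    # Stand-in for a private-key signature (assumption, see the note above).
    return hmac.new(key, canonical(assertion), hashlib.sha256).hexdigest()


def verify(assertion: dict, signature: str, key: bytes) -> bool:
    """Return True only if the assertion is unchanged since signing."""
    return hmac.compare_digest(sign(assertion, key), signature)
```

Changing any field after signing, for instance the license, makes `verify` return `False`, which is the property that lets a reader check who said what and when.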

19 of 30

Integration of AI (LLM) with traditional search

Graph transition to SOLR query:

type:"dataset" AND (keyword:"panel" OR title:"data" OR description:"data" OR keyword:"statistics" OR title:"research" OR keyword:"data" OR title:"statistics" OR keyword:"study" OR keyword:"survey" OR keyword:"research" OR description:"research" OR description:"study" OR keyword:"datasets" OR title:"study" OR title:"survey" OR description:"statistics" OR title:"datasets" OR description:"survey" OR description:"panel" OR description:"datasets" OR title:"panel") AND (title:"Bavaria" OR keyword:"Bavaria" OR description:"Bavaria") AND (description:"2013" OR keyword:"2013" OR title:"2013")

Question: Can you find a 2013 study on environmental policies in Bavaria?
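
The shape of the query above can be reproduced with a short sketch: each group of related terms is expanded into an OR over the three searched fields, and the groups are joined with AND. The `title`, `keyword`, and `description` fields and the `type:"dataset"` filter come from the query on the slide; the function names are hypothetical, the ordering of clauses inside a group differs from the slide (which does not change the meaning), and terms are assumed to contain no double quotes.

```python
FIELDS = ("title", "keyword", "description")


def or_group(terms, fields=FIELDS) -> str:
    """Expand terms into (field:"term" OR ...) over all fields."""
    clauses = [f'{field}:"{term}"' for term in terms for field in fields]
    return "(" + " OR ".join(clauses) + ")"


def build_query(term_groups, doc_type="dataset") -> str:
    """AND together a type filter and one OR group per concept."""
    parts = [f'type:"{doc_type}"']
    parts.extend(or_group(group) for group in term_groups)
    return " AND ".join(parts)


if __name__ == "__main__":
    print(build_query([["study", "survey"], ["Bavaria"], ["2013"]]))
```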

20 of 30

More challenges for communication in MCP

The Model Context Protocol (MCP) is built on ASGI (Asynchronous Server Gateway Interface) frameworks and toolkits, so communication is event-driven.

Typical challenges:

  • Tracing data across concurrent requests (e.g., asyncio tasks or threads) without leaks or collisions
  • Middleware that handles startup/shutdown events vs. request-response events needs separate trace handling
  • Tracing full communication, especially for WebSockets or long-lived connections, is harder than for HTTP (receive/send events should be in sync with trace/session)
  • MCP tools can introduce latency (especially with LLM models) or I/O blocking in an async environment
  • Not all ASGI-compatible frameworks (e.g., FastAPI, Starlette, Django Channels) expose low-level hooks for complete traceability
  • Tracing context must flow across function calls, tasks, background jobs, and database queries

Main question:

How can sessions, trace IDs, spans, prompts and user metadata be propagated consistently across tools and services?
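
One building block for this in Python is the standard `contextvars` module: every asyncio task receives a copy of the current context, so a trace or session ID set while handling one request flows into the tool calls it spawns without leaking into concurrent requests. The sketch below is an illustration under that assumption; the tool names and session values are hypothetical.

```python
import asyncio
import contextvars

# Holds the trace/session ID of the request currently being handled.
trace_id = contextvars.ContextVar("trace_id", default="unset")


async def call_tool(name: str) -> str:
    """Simulate a tool call that reads the ambient trace ID."""
    await asyncio.sleep(0)  # yield so concurrent requests interleave
    return f"{name}:{trace_id.get()}"


async def handle_request(session: str) -> list:
    """Set the trace ID once, then fan out to tools in child tasks."""
    trace_id.set(session)
    return await asyncio.gather(call_tool("search"), call_tool("export"))


async def main() -> list:
    # Two concurrent requests, each running in its own task and context.
    return await asyncio.gather(
        handle_request("session-a"),
        handle_request("session-b"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each tool call reports the ID of the request that started it, even though both requests run concurrently on the same event loop.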

21 of 30

Extending MCP with Semantic Croissant

Semantic Croissant (Croissant 2.0) is more a data infrastructure of a fundamentally new type than a new standard, as its knowledge graph is continuously evolving:

  • New data field names are regularly introduced for new schemas (schema.org, Croissant, DCAT, DCAT-AP, CodeMeta, etc.) as part of the ongoing work of the scientific community
  • We can only capture and archive “snapshots” of concepts with their descriptions and relationships, but the concepts keep changing over time (“concept drift”)
  • AI models are trained on “snapshots” that capture knowledge at one point in time, which makes them unsuitable for real-time processing
  • Existing controlled vocabularies are mostly not FAIR, even though they were created by humans: their concepts lack persistent identifiers, revision information, and provenance
  • There are no reliable metrics for checking AI-powered ontology creation

22 of 30

Semantic Croissant for controlled vocabularies

Prompts, resources and tools are evolving:

  • Label is modified
  • Relationships are changing
  • Descriptions are added
  • More datasets linked to concepts can be considered as “feedback” (human in the loop)
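
The kinds of change listed above can be detected by comparing two snapshots of the same concept. This is a small sketch only: the `label`, `related`, `descriptions`, and `datasets` keys are hypothetical field names chosen for the example, not a Croissant or SKOS schema.

```python
def diff_concept(old: dict, new: dict) -> list:
    """List the kinds of change between two snapshots of one concept."""
    changes = []
    if old.get("label") != new.get("label"):
        changes.append("label modified")
    if set(old.get("related", [])) != set(new.get("related", [])):
        changes.append("relationships changed")
    if len(new.get("descriptions", [])) > len(old.get("descriptions", [])):
        changes.append("descriptions added")
    if len(new.get("datasets", [])) > len(old.get("datasets", [])):
        changes.append("datasets linked")
    return changes
```

Recording the result of such a comparison with each new snapshot would give a concept the revision information that a controlled vocabulary needs to be traceable.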

23 of 30

Communication channels in MCP: events

curl "http://localhost:8077/sse/messages?session_id=zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"

event: message

data: {"prompt": "hello MCP"}

curl -X POST "http://localhost:8077/sse/messages?session_id=did:oydid:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G" -d '{"prompt": "hello MCP"}'

{"status":"sent","to":"did:oydid:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"}

What if we use DID as session ID?
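
A sketch of what using a DID as the session ID could look like on the client side: one helper builds the messages URL with the DID as `session_id`, and another parses a server-sent event frame like the one shown above. The endpoint is the local one from the curl calls; the helper names are hypothetical, and the DID is percent-encoded here for safety although the curl calls pass it as-is.

```python
import json
from urllib.parse import urlencode

# Local endpoint taken from the curl calls on the slide.
BASE = "http://localhost:8077/sse/messages"


def messages_url(session_did: str) -> str:
    """Build the messages URL with a DID as the session ID."""
    query = urlencode({"session_id": session_did})
    return f"{BASE}?{query}"


def parse_sse(frame: str) -> dict:
    """Parse one server-sent event frame into its event name and data."""
    event = "message"  # default event type when none is given
    data_lines = []
    for line in frame.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    return {"event": event, "data": "\n".join(data_lines)}


if __name__ == "__main__":
    frame = 'event: message\ndata: {"prompt": "hello MCP"}\n\n'
    print(messages_url("did:oydid:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"))
    print(json.loads(parse_sse(frame)["data"]))
```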

24 of 30

Semantic Croissant “in action”

Croissant relationships, SKOS and Wikidata are linked by RAG (LLM+KG)

25 of 30

Disambiguation in Semantic Croissant

The context and the user’s input define which concepts will be linked in the Semantic Croissant

Figure labels: Inspired, Related

26 of 30

Attributes in DID document

27 of 30

Machine-readable elements in Croissant 2.0

owl:OntologyLexicalConcept is being used to keep context as “lexical chains” ready to recompute relationships

28 of 30

DID in n8n AI “orchestrator”

n8n is a fair-code licensed workflow automation tool that combines AI capabilities with business process automation. It covers everything from setup to usage and development. It's a work in progress and all contributions are welcome.

29 of 30

Try Dataverse MCP now in any popular IDE

https://mcp.dataverse.org

30 of 30

More information

Visit our website www.dans.knaw.nl

And follow us online

LinkedIn @DANS

Mastodon @DANS_knaw_nwo

X @DANS_knaw_nwo
