Decentralized identifiers (DIDs) for sustainable AI in the Dataverse data network
Semantic Croissant regular meeting
Slava Tykhonov
Harvard Dataverse Ambassador
DANS-KNAW, the Netherlands
June 23rd, 2025
DANS is an institute of KNAW and NWO
Dutch national centre of expertise and repository for research data
www.dans.knaw.nl
Croissant 1.0 for Machine Learning
Kaggle data
OpenML data platform
Croissant ML export in Dataverse
Croissant
exporter
Code
Mappings
Knowledge graph for AI: metadata across all Data Stations queried with SPARQL
Qlever-based MuseIT triple store
Dataverse AI Guide
Model Context Protocol (MCP) for Dataverse
Install it and try in your IDE! https://mcp.dataverse.org
MCP protocol for Dataverse powered by Croissant 1.0 ML
Common questions for Dataverse MCP
Common Questions
🇳🇱 Summary: Dutch Dataverses (June 2025)
These are the major Dutch Dataverse nodes covering a wide range of research areas.
MCP powered by Croissant ML 2.0
MCP primitives are “moving pieces”: modular, dynamic, and designed to evolve over time.
MCP is about structuring the context in which machine learning models, especially LLMs, operate. This includes everything that informs, constrains, or modifies model behavior.
Open question: How to make workflows persistent, sustainable and FAIR?
Decentralized Resource Identity & Trust Framework (DRIFT)
DRIFT is a decentralized identity and trust layer that propagates traceable context — such as sessions, prompts, user metadata, and spans — across AI tools and services in event-driven environments like the Model Context Protocol (MCP).
In software engineering, data drift is a key concern that can affect system reliability and trust in data processing pipelines. DRIFT helps mitigate these risks by making identity and changes traceable.
Three common types of drift include:
By assigning decentralized identifiers (DIDs) and enforcing consistent, signed metadata, DRIFT supports systems in managing and auditing these types of drift.
DRIFT enables trusted interoperability between distributed AI components by providing unique, verifiable identities for all parts of an AI pipeline, such as sessions, prompts, users and tools.
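A minimal sketch of how consistent metadata could be audited for change, assuming each record is a flat JSON object. The field names (`did`, `prompt`, `tool`) and the helper names are hypothetical and not part of DRIFT itself; the sketch only fingerprints metadata and does not sign it.

```python
import hashlib
import json

_MISSING = object()  # sentinel: distinguishes "key absent" from "value is None"


def fingerprint(metadata: dict) -> str:
    """Return a SHA-256 hex digest of the metadata in a canonical JSON form."""
    # Sorted keys and fixed separators make the digest independent of key order.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_fields(old: dict, new: dict) -> list:
    """List the top-level keys whose values differ between two versions."""
    keys = sorted(set(old) | set(new))
    return [k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING)]


# Hypothetical metadata records attached to the same identifier.
v1 = {
    "did": "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G",
    "prompt": "hello MCP",
    "tool": "search",
}
v1_reordered = {"tool": "search", "prompt": "hello MCP", "did": v1["did"]}
v2 = dict(v1, prompt="hello Dataverse")

same = fingerprint(v1) == fingerprint(v1_reordered)   # key order does not matter
drifted = fingerprint(v1) != fingerprint(v2)          # a changed prompt is detected
```

Comparing fingerprints shows that something changed; `changed_fields(v1, v2)` then narrows it down to the affected keys.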
“In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.”
Source: Wikiland
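To make the quoted distinction concrete, here is a small sketch of an online update next to its batch counterpart. It assumes the simplest possible model, a constant predictor under squared error, for which the best predictor is the mean; the class name is illustrative only.

```python
class OnlineMean:
    """Keeps the best constant predictor (the mean) up to date, one sample at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> float:
        # Incremental update: only the current estimate and the count are stored.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


stream = [2.0, 4.0, 6.0, 8.0]

# Online: the predictor is refreshed after every new observation.
model = OnlineMean()
online_estimates = [model.update(x) for x in stream]

# Batch: the predictor is computed once from the entire data set.
batch_estimate = sum(stream) / len(stream)
```

After the last sample the online estimate matches the batch one, but the online learner never needed the full data set in memory.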
Why is it important? DRIFT has another nature
DRIFT is in fact dedicated to machine learning on streaming data, where knowledge graphs are continuously evolving.
We need to change the current AI paradigm to give all rights back to content creators and publishers:
Decentralized FAIR data network for MCP
Source: Wikipedia
We are considering an experimental implementation of decentralized identifiers as an extension for archiving various types of content.
DIDs can be assigned to any artefact (not only MCP primitives), including images, audio and video, for example to store and link metadata records and provenance information together with their digitized content.
A DID can be private (invisible and not resolvable by the public) but still accessible with a cryptographic key.
FAIR decentralized identifiers for AI
We envision a near future in which it will be possible to create a decentralized system that does not depend on any specific registry, single provider or single authority, so that all connections are established in a peer-to-peer network yet remain persistent. This solution should support AI workflows and infrastructure in line with the FAIR principles of findability, accessibility, interoperability, and reusability.
This can be achieved with a global decentralized identifier (DID).
The resolution of the decentralized identifier (DID) is cryptographically verifiable to prove the identity and the ownership of that identifier and can support Model Context Protocol with sustainable infrastructure to keep provenance and origins of prompts, resources and tools.
Core DID features are listed below:
DIDs should bring control of all provenance and metadata back to their owners instead of giving it away. At the same time, the public part of a DID need not differ much from other persistent identifiers such as DOIs, and could even replace them for specific use cases like sharing sensitive data.
The place of DID as unified resource
Source: “Self-Sovereign Identity” by Alex Preukschat and Drummond Reed
A DID can be considered a “replacement” for domain names and DNS in the “centralized” network
The role of private and public key, and service endpoints in DID
Service endpoints describe exactly how to interact with the subject: which protocols and which network endpoints are available, for example to connect to an agent that represents the data subject so that you can then exchange credentials or other messages.
Universal Resolver for AI primitives
Try this! https://dev.uniresolver.io
Registrar output:
Decentralized identifier (DID): did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G
curl https://dev.uniresolver.io/1.0/identifiers/did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G
"service": [
  {
    "id": "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G#payload",
    "type": "Custom",
    "serviceEndpoint": "https://oydid.ownyourdata.eu",
    "payload": {
      "prompt": "hello MCP"
    }
  }
]
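A short sketch of how a client might read such a resolved document. It assumes the `service` fragment above is wrapped in an enclosing JSON object, as it would be inside a DID document; `parse_did` and `find_service` are hypothetical helper names, not part of the Universal Resolver API, and no network call is made.

```python
import json
from typing import Optional, Tuple


def parse_did(did: str) -> Tuple[str, str]:
    """Split 'did:<method>:<method-specific-id>' into (method, identifier)."""
    scheme, method, identifier = did.split(":", 2)
    if scheme != "did" or not method or not identifier:
        raise ValueError(f"not a DID: {did!r}")
    return method, identifier


def find_service(did_document: dict, fragment: str) -> Optional[dict]:
    """Return the first service entry whose id ends with '#<fragment>'."""
    for service in did_document.get("service", []):
        if service.get("id", "").endswith("#" + fragment):
            return service
    return None


DID = "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"

# The service section from the slide, wrapped in an object so it is valid JSON.
document = json.loads("""
{
  "service": [
    {
      "id": "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G#payload",
      "type": "Custom",
      "serviceEndpoint": "https://oydid.ownyourdata.eu",
      "payload": {"prompt": "hello MCP"}
    }
  ]
}
""")

method, identifier = parse_did(DID)
service = find_service(document, "payload")
prompt = service["payload"]["prompt"] if service else None
```

Here the prompt travels inside the DID document itself, so whoever resolves the identifier can recover both the endpoint and the payload.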
“Fixing” ownership issues in the online world
Proposal of “Internet fixing” with decentralized, cryptographically secure protocol for web identity, trust, and content provenance:
DID for Every Web Page
“Ownership” transfer can be done with did:web identifiers by handing over the private keys of all of a website’s pages together with their DIDs.
Web content (HTML, RDFa, JSON-LD, etc.) is signed using the private key of its DID.
Signed assertions can include copyright and licensing, chain of modifications, author/organization DID, time of creation, expiry, or deprecation. This enables cryptographic verification of who said what, when, and where.
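The shape of such a signed assertion can be sketched with the standard library alone. This is a rough illustration under a loud assumption: a shared-secret HMAC stands in for the asymmetric signature made with the DID's private key, so it shows the flow (hash the content, bind the claims, verify) rather than real DID cryptography. All field names (`contentHash`, `proof`, `author`, `license`) are hypothetical.

```python
import hashlib
import hmac
import json


def _canonical(body: dict) -> bytes:
    # Canonical JSON so signer and verifier serialize the claims identically.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_assertion(content: bytes, claims: dict, key: bytes) -> dict:
    """Bind claims to the content hash and attach an authentication tag."""
    body = dict(claims, contentHash=hashlib.sha256(content).hexdigest())
    # ASSUMPTION: HMAC-SHA256 replaces a real private-key signature here.
    tag = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {"assertion": body, "proof": tag}


def verify_assertion(content: bytes, signed: dict, key: bytes) -> bool:
    """Check that the content is unmodified and the claims were not altered."""
    body = signed["assertion"]
    if body.get("contentHash") != hashlib.sha256(content).hexdigest():
        return False
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["proof"])


page = b"<html><body>Example page</body></html>"
key = b"demo-key"  # placeholder secret for the sketch
signed = sign_assertion(
    page,
    {"author": "did:web:example.org", "license": "CC-BY-4.0"},
    key,
)
is_valid = verify_assertion(page, signed, key)
is_tampered_valid = verify_assertion(page + b"<!-- edited -->", signed, key)
```

In a real deployment the verifier would resolve the author's DID, take the public key from the DID document and check an asymmetric signature instead of sharing a secret.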
Integration of AI (LLM) with traditional search
Graph transition to SOLR query:
type:"dataset" AND (keyword:"panel" OR title:"data" OR description:"data" OR keyword:"statistics" OR title:"research" OR keyword:"data" OR title:"statistics" OR keyword:"study" OR keyword:"survey" OR keyword:"research" OR description:"research" OR description:"study" OR keyword:"datasets" OR title:"study" OR title:"survey" OR description:"statistics" OR title:"datasets" OR description:"survey" OR description:"panel" OR description:"datasets" OR title:"panel") AND (title:"Bavaria" OR keyword:"Bavaria" OR description:"Bavaria") AND (description:"2013" OR keyword:"2013" OR title:"2013")
Question: Can you find a 2013 study on environmental policies in Bavaria?
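The query above follows a regular pattern: each group of related terms is matched against the title, keyword and description fields, and the groups are joined with AND. A minimal sketch of that pattern, assuming the term groups have already been derived from the question (that derivation step is not shown, and the function names are illustrative):

```python
FIELDS = ("title", "keyword", "description")


def _quote(term: str) -> str:
    # Escape backslashes and double quotes so the term stays a single phrase.
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def any_field(terms, fields=FIELDS) -> str:
    """Match any of the terms in any of the fields (one OR group)."""
    clauses = [f"{field}:{_quote(term)}" for term in terms for field in fields]
    return "(" + " OR ".join(clauses) + ")"


def build_query(doc_type: str, term_groups) -> str:
    """AND together a type filter and one OR group per list of related terms."""
    parts = [f"type:{_quote(doc_type)}"]
    parts += [any_field(group) for group in term_groups]
    return " AND ".join(parts)


query = build_query("dataset", [["study", "survey"], ["Bavaria"], ["2013"]])
```

The clause order inside each group differs from the slide, which does not affect the result since OR is order-independent.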
More challenges for communication in MCP
Model Context Protocol (MCP) is built on ASGI (Asynchronous Server Gateway Interface) frameworks and toolkits and relies on event-driven communication.
Typical challenges:
Main question:
How to consistently propagate sessions, trace IDs, spans, prompts and user metadata across tools and services?
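Within a single asynchronous Python process, one possible building block is the standard `contextvars` module, which asyncio copies into every task it creates. The sketch below assumes a session ID and trace ID are set once per request; the tool names and the DID-like session values are hypothetical, and propagation across separate services would still need the IDs to travel in the messages themselves.

```python
import asyncio
import contextvars

# Holds the per-request context; each asyncio task gets its own copy.
request_context = contextvars.ContextVar("request_context", default=None)


async def tool_call(name: str) -> dict:
    """A stand-in for a tool that needs to know which session it serves."""
    ctx = request_context.get()
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"tool": name, "session_id": ctx["session_id"], "trace_id": ctx["trace_id"]}


async def handle_request(session_id: str, trace_id: str) -> list:
    # Set once; every task spawned from here inherits this context.
    request_context.set({"session_id": session_id, "trace_id": trace_id})
    return await asyncio.gather(tool_call("search"), tool_call("export"))


async def main() -> list:
    # Two concurrent requests must not see each other's context.
    return await asyncio.gather(
        handle_request("did:oyd:session-a", "trace-1"),
        handle_request("did:oyd:session-b", "trace-2"),
    )


results = asyncio.run(main())
```

Each tool call reports the session and trace of the request that spawned it, even though both requests run interleaved on the same event loop.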
Extending MCP with Semantic Croissant
Semantic Croissant (Croissant 2.0) is more a data infrastructure of a fundamentally new type than a new standard, as its knowledge graph is continuously evolving:
Semantic Croissant for controlled vocabularies
Prompts, resources and tools are evolving:
Communication channels in MCP: events
curl "http://localhost:8077/sse/messages?session_id=zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"
event: message
data: {"prompt": "hello MCP"}
curl -X POST "http://localhost:8077/sse/messages?session_id=did:oydid:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G" -d '{"prompt": "hello MCP"}'
{"status":"sent","to":"did:oydid:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"}
What if we use DID as session ID?
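A client-side sketch of the exchange above: building the subscription URL with a DID as the session ID, and parsing the server-sent event that comes back. The base URL is the local example from the slide; the function names are illustrative, and the parser covers only the `event` and `data` fields of the event stream format.

```python
import json
from urllib.parse import urlencode


def session_url(base: str, did: str) -> str:
    """Build the SSE URL; urlencode percent-encodes the colons in the DID."""
    return base + "?" + urlencode({"session_id": did})


def parse_sse(stream: str) -> list:
    """Parse a text/event-stream body into a list of {event, data} dicts."""
    events = []
    event_type, data_lines = "message", []
    # The extra "" also flushes a final event that lacks a trailing blank line.
    for line in stream.splitlines() + [""]:
        if line == "":
            # A blank line dispatches the buffered event, if it has any data.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is stripped
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
    return events


did = "did:oyd:zQmZNDWhhr3cHaL2pfLCpwXXbK5WVKY5wJHtC3RvrgQiG5G"
url = session_url("http://localhost:8077/sse/messages", did)

body = 'event: message\ndata: {"prompt": "hello MCP"}\n\n'
events = parse_sse(body)
prompt = json.loads(events[0]["data"])["prompt"]
```

Because the session ID is itself resolvable, anyone receiving the event can look up who owns the session instead of trusting an opaque token.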
Semantic Croissant “in action”
Croissant relationships, SKOS and Wikidata are linked by RAG (LLM+KG)
Disambiguation in Semantic Croissant
Context and the user’s input define which concepts will be linked in Semantic Croissant
Inspired
Related
Attributes in DID document
Machine-readable elements in Croissant 2.0
owl:OntologyLexicalConcept is being used to keep context as “lexical chains” ready to recompute relationships
DID in n8n AI “orchestrator”
n8n is a fair-code licensed workflow automation tool that combines AI capabilities with business process automation. The n8n documentation covers everything from setup to usage and development; it is a work in progress and all contributions are welcome.
Try Dataverse MCP now in any popular IDE
https://mcp.dataverse.org
More information
Visit our website www.dans.knaw.nl
And follow us online
LinkedIn @DANS
Mastodon @DANS_knaw_nwo
X @DANS_knaw_nwo