Ask Dataverse and
Large Language Models in the RAG framework
Slava Tykhonov
DANS-KNAW R&D
February 4th, 2025 - Harvard Dataverse Community call
DANS is an institute of KNAW and NWO, the Dutch national centre of expertise and repository for research data
www.dans.knaw.nl
Data Stations - Future Data Services
Dataverse is our basic technology and a key framework for Open Innovation!
ODISSEI portal built on various Dataverse instances
Near future: search across all Data Stations with SPARQL
Qlever-based MuseIT triple store
DANS is producing Machine Learning/AI-ready data
Kaggle data
OpenML data platform
Croissant ML export in Dataverse
Croissant
exporter
Code
Mappings
Retrieval Augmented Generation (RAG)
Terminology in terms of Dataverse:
The graph is the metadata layer, integrated with various ontologies and controlled vocabularies (including responsible AI), and expressed in the Croissant ML standard.
The vectors are the content of data files encoded as embeddings, together with properties and their attributes. Embeddings are computed by various ML algorithms and use different models.
LLM vision: get both the graph and the vectors into the same model.
Vision of Distributed AI: the LLM is the “interface” between human and AI and works as a “reasoning engine”.
Implementation: the LLM is connected to the “RAG library” (the graph), navigates through datasets, and consumes embeddings (the vectors) as context.
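The retrieval step can be sketched with toy vectors (a minimal, self-contained illustration; in the real pipeline the embeddings come from an ML model and the dataset descriptions come from the metadata graph):

```python
from math import sqrt

# Toy "vector" store: dataset descriptions with hand-made 3-d embeddings.
# Everything here is illustrative, not the production index.
DOCS = {
    "survey_2013": ([0.9, 0.1, 0.0], "2013 panel survey on environmental policies"),
    "census_1990": ([0.1, 0.9, 0.0], "1990 census microdata"),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(query_vec, docs, k=1):
    """Return the texts of the k documents closest to the query vector."""
    ranked = sorted(docs.values(), key=lambda dv: cosine(query_vec, dv[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, query_vec):
    """Hand the retrieved texts to the LLM as context for the question."""
    context = "\n".join(retrieve(query_vec, DOCS))
    return f"Context:\n{context}\n\nQuestion: {question}"
```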
Annotations as “Human in the loop”
General blueprint for a human-in-the-loop interactive AI system. Credits: Stanford University HAI
The question shifts from “how do we build a smarter system?” to “how do we incorporate useful, meaningful human interaction into the system?”
MuseIT annotation service with metadata
Reference: MuseIT annotations
Dataverse Knowledge Graph
Go to https://graph.muse-it.eu - it’s 3D and interactive!
Use case: automated translation of vocabularies
LLM API response:
Use case: automated vocabularies enrichment
We use an LLM to translate Dataverse keywords and link them to SKOSMOS concepts within a specific “context”
Original vocabulary
Enriched vocabulary
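The linking step can be sketched as building a Skosmos REST concept-search call for each translated keyword (a hedged sketch: the base URL and the vocabulary id are hypothetical placeholders; Skosmos exposes concept search at /rest/v1/search):

```python
from urllib.parse import urlencode

def skosmos_search_url(base, term, lang="en", vocab=None):
    """Build a Skosmos REST concept-search URL for one translated keyword.
    `base` and `vocab` are deployment-specific placeholders."""
    params = {"query": term, "lang": lang}
    if vocab:
        params["vocab"] = vocab
    return f"{base}/rest/v1/search?{urlencode(params)}"
```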
Use case: AI powered taxonomy creator
Input: Wikipedia or any page
Output: ontology “from scratch”
LLM workflow to convert Web pages into SKOS vocabularies integrated with Dataverse
Use case: AI powered knowledge graph “miner”
Input: any sentence from text
Output: concepts with their relationships
The LLM builds the graph using relationships from known vocabularies
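The assembly step can be sketched as collecting the extracted triples into an adjacency structure, assuming the LLM returns (subject, relation, object) tuples with relation labels drawn from known vocabularies:

```python
def build_graph(triples):
    """Assemble LLM-extracted (subject, relation, object) triples into an
    adjacency list keyed by subject."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph
```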
Example: Question conversion to directed graph
Graph translated into a Solr query:
type:"dataset" AND (keyword:"panel" OR title:"data" OR description:"data" OR keyword:"statistics" OR title:"research" OR keyword:"data" OR title:"statistics" OR keyword:"study" OR keyword:"survey" OR keyword:"research" OR description:"research" OR description:"study" OR keyword:"datasets" OR title:"study" OR title:"survey" OR description:"statistics" OR title:"datasets" OR description:"survey" OR description:"panel" OR description:"datasets" OR title:"panel") AND (title:"Bavaria" OR keyword:"Bavaria" OR description:"Bavaria") AND (description:"2013" OR keyword:"2013" OR title:"2013")
Question: Can you find a 2013 study on environmental policies in Bavaria?
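The graph-to-query step can be sketched as expanding each extracted term over the title, keyword and description fields, mirroring the query above (a simplified illustration, not the production code):

```python
def to_solr_query(concepts):
    """Expand extracted terms into a Solr boolean query over the Dataverse
    search fields, one AND group per concept category."""
    fields = ["title", "keyword", "description"]
    groups = []
    for terms in concepts.values():
        clause = " OR ".join(f'{field}:"{term}"' for term in terms for field in fields)
        groups.append(f"({clause})")
    return 'type:"dataset" AND ' + " AND ".join(groups)
```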
DataChat - Croissant powered local AI to “chat” with Dataverse
More questions about metadata in DataChat
No hallucinations: when information is not present in the metadata, DataChat says so instead of inventing an answer
Want to try? MuseIT AI https://ai.muse-it.eu
Github: https://github.com/gdcc/datachat
Welcome Ghostwriter - KG+LLM+semantics+crawler!
Ghostwriter - ask data about something
Test Ask Dataverse on guides, manuals, github issues, google threads: https://ask.dataverse.org
Ghostwriter - Talking with the dead
Ghostwriter - greatest researcher of its time
The two faces of AI: “AI Agents”
Low-Rank Adaptation of Large Language Models (LoRA)
Graph Neural Networks (GNN)
Structured data
Multidimensional Space
modelling new knowledge
KG+LLM integrator: automated concept prediction/1
Goal: find the Wikidata concept for the song Hotel California
Query: Hotel California Context: Song by Eagles
{
  "title": "Q780394",
  "label": "Hotel California",
  "description": "song written and composed by Don Felder, Don Henley and Glenn Frey; originally recorded by Eagles and released 1976",
  "url": "https://www.wikidata.org/wiki/Q780394"
}
KG+LLM integrator: automated concept prediction/2
Goal: find the Wikidata concept for the movie Hotel California about a criminal
Query: Hotel California Context: criminal movie
{
  "title": "Q4509386",
  "label": "Hotel California",
  "description": "2008 film by Geo Santini",
  "url": "https://www.wikidata.org/wiki/Q4509386"
}
KG+LLM integrator: automated concept prediction/3
Goal: find the Wikidata concept for the 2013 movie Hotel California
Query: Hotel California Context: movie from 2013
{
  "title": "Q5911306",
  "label": "Hotel California",
  "description": "2013 film by Aji John",
  "url": "https://www.wikidata.org/wiki/Q5911306"
}
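A simple offline stand-in for this disambiguation step: the candidates below are the three results shown above (in the live integrator they would come from the Wikidata wbsearchentities API), ranked here by word overlap between the context and each description rather than by an LLM:

```python
# Candidates for the label "Hotel California", as shown in the slides.
CANDIDATES = [
    {"title": "Q780394", "description": "song written and composed by Don Felder, "
     "Don Henley and Glenn Frey; originally recorded by Eagles and released 1976"},
    {"title": "Q4509386", "description": "2008 film by Geo Santini"},
    {"title": "Q5911306", "description": "2013 film by Aji John"},
]

def pick_concept(candidates, context):
    """Rank candidates by word overlap between the free-text context and each
    description -- a crude stand-in for the LLM's contextual reasoning."""
    ctx = set(context.lower().split())
    return max(candidates, key=lambda c: len(ctx & set(c["description"].lower().split())))
```

This overlap heuristic handles the "Song by Eagles" and "movie from 2013" contexts, but unlike the LLM it cannot relate "criminal movie" to a film description that never uses those words.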
Agents vs Models
Credits: Julia Wiesinger, Patrick Marlow and Vladimir Vuskovic, "Agents".
HuggingFace Smolagents - building AI agents
Source: https://github.com/huggingface/smolagents
from smolagents.agents import ToolCallingAgent
from smolagents import tool, LiteLLMModel
from typing import Optional

model = LiteLLMModel(
    model_id="ollama/llama3",
    api_base="http://ollama:11435",
    api_key="your-api-key",  # replace with an API key if necessary
)

@tool
def get_weather(location: str, celsius: Optional[bool] = False) -> str:
    """
    Get weather in the next days at given location.
    Secretly this tool does not care about the location, it hates the weather everywhere.

    Args:
        location: the location
        celsius: whether to return the temperature in Celsius
    """
    return "city,temperature\nAmsterdam,3\nParis,-1\nLondon,5\n"

agent = ToolCallingAgent(tools=[get_weather], model=model)
print(agent.run("What's the weather like in Amsterdam?"))
Microsoft Markitdown tool for all doc types
Reference: https://github.com/microsoft/markitdown
Collecting feedback from humans
Example: Evaluating transcript agent - improving YouTube transcript
“all right zalinski just brought the fire all right zalinski just brought the fire all right zalinski just brought the fire to one of the largest podcasters in to one of the largest podcasters in to one of the largest podcasters in America who is pretty pro- Russia America who is pretty pro- Russia America who is pretty pro- Russia podcaster Lex fredman secured an podcaster Lex fredman secured an podcaster Lex fredman secured an interview with zilinski in Ukraine and interview with zilinski in Ukraine and interview with zilinski in Ukraine and during this interview zilinski has to during this interview zilinski has to during this interview zilinski has to push back a lot on the idea that both push back a lot on the idea that both push back a lot on the idea that both sides are equally at fault on the idea sides are equally at fault on the idea sides are equally at fault on the idea that they should forgive Vladimir Putin that they should forgive Vladimir Putin that they should forgive Vladimir Putin or that Vladimir Putin somehow is the or that Vladimir Putin somehow is the or that Vladimir Putin somehow is the the victim in this scenario zilinsky the victim in this scenario zilinsky the victim in this scenario zilinsky does a great job articulating his does a great job articulating his does a great job articulating his beliefs and I want to run you down some beliefs and I want to run you down some beliefs and I want to run you down some of the best moments of this interview of the best moments of this interview of the best moments of this interview make sure you drop a like And subscribe make sure you drop a like And subscribe make sure you drop a like And subscribe to boost this Pro democracy message and to boost this Pro democracy message and to boost this Pro democracy message and before we jump into this first clip this before we jump into this first clip this before we jump into this first clip this video is weirdly dubbed so zilinsky is video is weirdly 
dubbed so zilinsky is video is weirdly dubbed so zilinsky is speaking in Ukrainian during this speaking in Ukrainian during this speaking in Ukrainian during this interview meanwhile Lex freedan is interview meanwhile Lex freedan is interview meanwhile Lex freedan is speaking in Russian and the reason why speaking in Russian and the reason why speaking in Russian and the reason why zalinski doesn't want to speak Russian zalinski doesn't want to speak Russian zalinski doesn't want to speak Russian you know I'll let him explain it but you know I'll let him explain it but you know I'll let him explain it but just know there's a weird voiceover that just know there's a weird voiceover that just know there's a weird voiceover that they did in post during the editing they did in post during the editing they did in post during the editing that's not something I did this is just that's not something I did this is just…”
Typical problems here:
- everything on one line, no punctuation marks
- inconsistent spellings of the same person’s name, mistakes in words
Watch video on YouTube
Evaluating transcript agent - graph with named entities
{
  "context": "names",
  "inputName": "zelenskyy",
  "correctedName": "Zelenskyy"
},
{
  "context": "names",
  "inputName": "lex fridman",
  "correctedName": "Lex Fridman"
},
{
  "context": "names",
  "inputName": "putin",
  "correctedName": "Putin"
}
Part of the prompt:
You are provided with a dataset of known name corrections in JSON-LD format. This dataset contains the following fields: 'context', 'inputName', and 'correctedName'.
Your task:
1. Correct all names in the provided text by referencing the dataset. Specifically:
- Search the '@graph' in the dataset for entries with 'inputName' and their corresponding 'correctedName'.
- Replace names in the text that match or are slightly different (but within the same context) with the corresponding 'correctedName', in all cases.
2. If a name in the text appears incorrect but is not listed in the dataset, make a reasonable correction based on the context and an educated guess.
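The prompt's two rules can be sketched deterministically: exact corrections from the dataset first, then a fuzzy guess for near-misses. Here difflib stands in for the LLM's "educated guess", the corrections dict is a flattened view of the JSON-LD graph above, and the similarity cutoff is an arbitrary assumption:

```python
import re
from difflib import get_close_matches

# Flattened view of the JSON-LD corrections graph shown above.
CORRECTIONS = {
    "zelenskyy": "Zelenskyy",
    "lex fridman": "Lex Fridman",
    "putin": "Putin",
}

def correct_names(text, corrections=CORRECTIONS, cutoff=0.55):
    # Rule 1: exact (case-insensitive) replacement of known inputNames.
    for wrong, right in corrections.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    # Rule 2: educated guess for near-misses, single-word names only.
    # Note: rejoining with single spaces flattens the original whitespace.
    single = {k: v for k, v in corrections.items() if " " not in k}
    out = []
    for word in text.split():
        hit = get_close_matches(word.lower(), list(single), n=1, cutoff=cutoff)
        out.append(single[hit[0]] if hit else word)
    return " ".join(out)
```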
Evaluating transcript agent - context window
{
  "@type": "contextualInfo",
  "context": [
    "Russia",
    "Ukraine"
  ],
  "inputName": "Zelenskyy"
},
{
  "@type": "contextualInfo",
  "context": [
    "Russia",
    "Ukraine"
  ],
  "inputName": "Lex Fridman"
}
Model output:
First, I'll replace all names with their corrected forms from the name correction dataset:
1. "Zalinski" -> "Zelenskyy"
2. "Lex Fredman" -> "Lex Fridman"
Now, let's apply the corrections and added context based on the contextual dataset:
* Since both Zelenskyy and Putin are mentioned in the text, I'll add the context that they are both from Russia/Ukraine.
* For Lex Fridman, I'll keep his role as a podcaster and maintain the context of America.
Evaluating transcript agent - Llama3.2 output with datasets on context and spelling names
Here is the corrected text:
{{
All right, Zelenskyy just brought the fire to one of the largest podcasters in America who is pretty pro-Russia podcaster Lex Fridman secured an interview with Zelenskyy in Ukraine, and during this interview, Zelenskyy has to push back a lot on the idea that both sides are equally at fault on the idea that they should forgive Vladimir Putin or that Vladimir Putin somehow is the victim in this scenario. Zelenskyy does a great job articulating his beliefs, and I want to run you down some of the best moments of this interview. Make sure you drop a like and subscribe to boost this Pro-democracy message, and before we jump into this first clip, this video is weirdly dubbed, so Zelenskyy is speaking in Ukrainian during this interview, meanwhile Lex Fridman is speaking in Russian, and the reason why Zelinski doesn't want to speak Russian, you know I'll let him explain it, but just know there's a weird voiceover that they did in post during the editing, that's not something I did; this is just}}.
Note: I've corrected "Zalinski" to "Zelenskyy", "Fredman" to "Fridman", and "Vladimir Putin" remained as it was. I've also re-formatted some sentences for better readability and added commas, periods, and quotation marks where needed. The text's intent and meaning remain intact.
Evaluating transcript agent - ChatGPT corrections
Watch and read more…
“Ask Dataverse and
Large Language Models in the RAG framework”
Recording of the talk on Dataverse TV (Harvard IQSS)
Try “Ask Dataverse” now on https://ask.dataverse.org
“Leveraging Artificial Intelligence to enable interoperability in EOSC”
https://zenodo.org/records/14729680
“FAIR vocabularies in DANS Data Stations”
https://zenodo.org/records/14671310
“The Next Generation of Data Management with Artificial Intelligence”
https://zenodo.org/records/14507120
Croissant: A Metadata Format for ML-Ready Datasets
https://arxiv.org/abs/2403.19546
More information
Visit our website www.dans.knaw.nl
And follow us online
LinkedIn @DANS
Mastodon @DANS_knaw_nwo
X @DANS_knaw_nwo