1 of 37

Ask Dataverse and Large Language Models in the RAG framework

Slava Tykhonov

DANS-KNAW R&D

February 4th, 2025 - Harvard Dataverse Community call

DANS is an institute of KNAW and NWO, and the Dutch national centre of expertise and repository for research data

www.dans.knaw.nl

2 of 37

Data Stations - Future Data Services

Dataverse is our basic technology and a key framework for Open Innovation!

3 of 37

ODISSEI portal built on various Dataverse instances

4 of 37

Near future: search across all Data Stations with SPARQL

QLever-based MuseIT triple store

5 of 37

DANS is producing Machine Learning/AI-ready data

https://mlcommons.github.io/croissant/docs/croissant-spec.html

Kaggle data

OpenML data platform

6 of 37

Create

Editor

https://huggingface.co/spaces/MLCommons/croissant-editor

Platform auto generate

Discover and find

Google Dataset Search

https://datasetsearch.research.google.com/

Crawler in repo

https://github.com/mlcommons/croissant/blob/main/health/visualizer/report_huggingface.ipynb

Use it

Colab: https://githubtocolab.com/mlcommons/croissant/blob/main/python/mlcroissant/recipes/tfds_croissant_builder.ipynb

Original: https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/recipes/tfds_croissant_builder.ipynb

7 of 37

Croissant ML export in Dataverse

Croissant

exporter

Code

Mappings

8 of 37

Retrieval Augmented Generation (RAG)

Terminology in terms of Dataverse:

Graph is the metadata layer, integrated with various ontologies and controlled vocabularies, including those for responsible AI. The graph is expressed in the Croissant ML standard.

Vector is the content of data files encoded as embeddings, with properties and their attributes. Embeddings are computed by various ML algorithms and different models.

LLM vision: let’s get both graph and vectors in the same model.

Vision of Distributed AI: the LLM is the “interface” between human and AI, and works as a “reasoning engine”.

Implementation: the LLM is connected to the “RAG library” (graph), navigates through datasets, and consumes embeddings (vectors) as context.
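The graph-plus-vector idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the toy `DATASETS` list, the three-dimensional embeddings, and the helper names `retrieve` and `build_prompt` are hypothetical and are not taken from DataChat or Dataverse. The graph step filters datasets by metadata keyword, the vector step ranks the survivors by cosine similarity, and the result is packed into a prompt as context for an LLM.

```python
import math

# Hypothetical toy catalogue: metadata (graph side) plus an embedding (vector side).
DATASETS = [
    {"id": "ds1", "title": "Survey on environmental policy",
     "keywords": ["environment", "policy"], "embedding": [0.9, 0.1, 0.0]},
    {"id": "ds2", "title": "Medieval music recordings",
     "keywords": ["music", "heritage"], "embedding": [0.0, 0.2, 0.9]},
    {"id": "ds3", "title": "Climate panel study",
     "keywords": ["environment", "panel"], "embedding": [0.8, 0.3, 0.1]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(keyword, query_embedding, top_k=2):
    """Graph step: filter by metadata keyword. Vector step: rank by similarity."""
    candidates = [d for d in DATASETS if keyword in d["keywords"]]
    ranked = sorted(candidates,
                    key=lambda d: cosine(d["embedding"], query_embedding),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(question, hits):
    """Pack the retrieved datasets into the context part of an LLM prompt."""
    context = "\n".join(f"- {d['id']}: {d['title']}" for d in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


hits = retrieve("environment", [1.0, 0.0, 0.0])
print(build_prompt("Which datasets cover environmental policy?", hits))
```

In a real setup the keyword filter would presumably be a query against the metadata graph and the embeddings would come from an embedding model; the sketch only shows how the two retrieval steps combine before the LLM is called.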

9 of 37

Annotations as “Human in the loop”

General blueprint for a human-in-the-loop interactive AI system. Credits: Stanford University HAI

From “how do we build a smarter system?” to “how do we incorporate useful, meaningful human interaction into the system?”

10 of 37

MuseIT annotation service with metadata

Reference: MuseIT annotations

11 of 37

Dataverse Knowledge Graph

Go to https://graph.muse-it.eu - it’s 3D and interactive!

12 of 37

Use case: automated translation of vocabularies

LLM API response:

13 of 37

Use case: automated vocabularies enrichment

We use an LLM to translate Dataverse keywords and link them to SKOSMOS concepts in a specific “context”

Original vocabulary

Enriched vocabulary

14 of 37

Use case: AI powered taxonomy creator

Input: Wikipedia or any page

Output: ontology “from scratch”

LLM workflow to convert Web pages into SKOS vocabularies integrated with Dataverse
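The last step of such a workflow, turning extracted terms into a SKOS vocabulary, might look like the sketch below. It assumes the LLM has already returned a list of (label, broader label) pairs. The function names, the example scheme URI, and the sample concepts are hypothetical, and labels are assumed to contain no double quotes.

```python
import re


def slug(label):
    """Turn a label into a URI-safe identifier, e.g. 'Open data' -> 'open-data'."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def to_skos_turtle(scheme_uri, concepts):
    """Serialize (label, broader_label_or_None) pairs as SKOS in Turtle syntax."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{scheme_uri}> a skos:ConceptScheme .",
        "",
    ]
    for label, broader in concepts:
        lines.append(f"<{scheme_uri}/{slug(label)}> a skos:Concept ;")
        lines.append(f'    skos:prefLabel "{label}"@en ;')
        if broader:
            lines.append(f"    skos:broader <{scheme_uri}/{slug(broader)}> ;")
        lines.append(f"    skos:inScheme <{scheme_uri}> .")
        lines.append("")
    return "\n".join(lines)


# Hypothetical concepts, as an LLM might extract them from a web page.
concepts = [("Research data", None), ("Open data", "Research data")]
print(to_skos_turtle("https://example.org/vocab", concepts))
```

The resulting Turtle could then be loaded into a vocabulary service such as SKOSMOS, which is the integration point the slide refers to.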

15 of 37

Use case: AI powered knowledge graph “miner”

Input: any sentence from text

Output: concepts with their relationships

The LLM builds a graph using relationships from known vocabularies

16 of 37

Example: Question conversion to directed graph

Question: Can you find a 2013 study on environmental policies in Bavaria?

Graph translated to a Solr query:

type:"dataset" AND (keyword:"panel" OR title:"data" OR description:"data" OR keyword:"statistics" OR title:"research" OR keyword:"data" OR title:"statistics" OR keyword:"study" OR keyword:"survey" OR keyword:"research" OR description:"research" OR description:"study" OR keyword:"datasets" OR title:"study" OR title:"survey" OR description:"statistics" OR title:"datasets" OR description:"survey" OR description:"panel" OR description:"datasets" OR title:"panel") AND (title:"Bavaria" OR keyword:"Bavaria" OR description:"Bavaria") AND (description:"2013" OR keyword:"2013" OR title:"2013")
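The structure of this query (one OR group per concept, each term tried against title, keyword and description, groups joined with AND) can be reproduced with a small sketch. It assumes the graph step has already produced groups of related terms. The helper names and the way terms are grouped are hypothetical, not the actual Ask Dataverse code.

```python
FIELDS = ("title", "keyword", "description")


def escape(term):
    """Escape backslashes and double quotes inside a quoted Solr phrase."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def any_field(terms, fields=FIELDS):
    """One OR group: every term is tried against every field."""
    clauses = [f'{field}:"{escape(term)}"' for term in terms for field in fields]
    return "(" + " OR ".join(clauses) + ")"


def build_solr_query(term_groups, doc_type="dataset"):
    """AND together the type filter and one OR group per concept."""
    parts = [f'type:"{doc_type}"'] + [any_field(group) for group in term_groups]
    return " AND ".join(parts)


# Hypothetical term groups for the example question:
# synonyms for "study", the place, and the year.
groups = [["study", "survey", "panel"], ["Bavaria"], ["2013"]]
print(build_solr_query(groups))
```

Keeping each concept in its own AND-ed group means a dataset must match the place and the year, while any of the synonyms for "study" is enough to satisfy the first group.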

17 of 37

DataChat - Croissant powered local AI to “chat” with Dataverse

18 of 37

More questions about metadata in DataChat

No hallucinations: the information is not present in the metadata

Want to try? MuseIT AI https://ai.muse-it.eu

GitHub: https://github.com/gdcc/datachat

19 of 37

Welcome Ghostwriter - KG+LLM+semantics+crawler!

20 of 37

Ghostwriter - ask data about something

Test Ask Dataverse on guides, manuals, GitHub issues, and Google threads: https://ask.dataverse.org

21 of 37

Ghostwriter - Talking with the dead

22 of 37

Ghostwriter - greatest researcher of its time

23 of 37

Twofold faces of AI - “AI Agents”

LLMs are centralized and static, fully understood by machines
Vector spaces carry little meaning for humans
Humans need special tools to visualize vector spaces in projections
The “time dimension” is missing in the LLM world, which leads to model “snapshots”

Graphs are dynamic and can be understood and curated by humans
Graphs can be distributed in the Semantic Web
Graphs can add semantics to the vector space
AI (such as GNNs) can restore or predict new relationships

Low-Rank Adaptation of Large Language Models (LoRA)

Graph Neural Networks (GNN)

Structured data

Multidimensional Space

modelling new knowledge

24 of 37

KG+LLM integrator: automated concept prediction/1

Goal: find the Wikidata concept for the song Hotel California

Query: Hotel California Context: Song by Eagles

{
  "title": "Q780394",
  "label": "Hotel California",
  "description": "song written and composed by Don Felder, Don Henley and Glenn Frey; originally recorded by Eagles and released 1976",
  "url": "https://www.wikidata.org/wiki/Q780394"
}

25 of 37

KG+LLM integrator: automated concept prediction/2

Goal: find the Wikidata concept for the crime movie Hotel California

Query: Hotel California Context: criminal movie

{
  "title": "Q4509386",
  "label": "Hotel California",
  "description": "2008 film by Geo Santini",
  "url": "https://www.wikidata.org/wiki/Q4509386"
}

26 of 37

KG+LLM integrator: automated concept prediction/3

Goal: find the Wikidata concept for the 2013 movie Hotel California

Query: Hotel California Context: movie from 2013

{
  "title": "Q5911306",
  "label": "Hotel California",
  "description": "2013 film by Aji John",
  "url": "https://www.wikidata.org/wiki/Q5911306"
}
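To show why the context matters, here is a deliberately simple baseline that picks among the three "Hotel California" candidates from these slides by counting words shared between the context and each Wikidata description. This is not the KG+LLM integrator itself: the function names and the word-overlap scoring are assumptions made for illustration, and no Wikidata API is called.

```python
import re

# The three candidates shown on the slides, all labelled "Hotel California".
CANDIDATES = [
    {"title": "Q780394",
     "description": "song written and composed by Don Felder, Don Henley and "
                    "Glenn Frey; originally recorded by Eagles and released 1976"},
    {"title": "Q4509386", "description": "2008 film by Geo Santini"},
    {"title": "Q5911306", "description": "2013 film by Aji John"},
]


def tokens(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def pick_concept(context, candidates=CANDIDATES):
    """Return the candidate whose description shares most words with the context.

    Returns None when no candidate shares any word with the context.
    """
    scored = [(len(tokens(context) & tokens(c["description"])), c)
              for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > 0 else None


for context in ("Song by Eagles", "movie from 2013", "criminal movie"):
    match = pick_concept(context)
    print(context, "->", match["title"] if match else None)
```

The baseline resolves "Song by Eagles" and "movie from 2013", but returns nothing for "criminal movie", because no description contains either word. Bridging that gap (knowing that a movie is a film, and that the 2008 film is a crime story) is the kind of reasoning the LLM contributes.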

27 of 37

Agents vs Models

Credits: Julia Wiesinger, Patrick Marlow and Vladimir Vuskovic, "Agents".

28 of 37

HuggingFace Smolagents - building AI agents

Source: https://github.com/huggingface/smolagents

from smolagents.agents import ToolCallingAgent
from smolagents import tool, LiteLLMModel
from typing import Optional

model = LiteLLMModel(
    model_id="ollama/llama3",
    api_base="http://ollama:11435",
    api_key="your-api-key"  # replace with API key if necessary
)

@tool
def get_weather(location: str, celsius: Optional[bool] = False) -> str:
    """
    Get weather in the next days at given location.
    Secretly this tool does not care about the location, it hates the weather everywhere.

    Args:
        location: the location
        celsius: the temperature
    """
    return "city,temperature\nAmsterdam,3\nParis,-1\nLondon,5\n"

agent = ToolCallingAgent(tools=[get_weather], model=model)

print(agent.run("What's the weather like in Amsterdam?"))

29 of 37

Microsoft Markitdown tool for all doc types

Reference: https://github.com/microsoft/markitdown

30 of 37

Collecting feedback from humans

31 of 37

Example: Evaluating transcript agent - improving YouTube transcript

“all right zalinski just brought the fire all right zalinski just brought the fire all right zalinski just brought the fire to one of the largest podcasters in to one of the largest podcasters in to one of the largest podcasters in America who is pretty pro- Russia America who is pretty pro- Russia America who is pretty pro- Russia podcaster Lex fredman secured an podcaster Lex fredman secured an podcaster Lex fredman secured an interview with zilinski in Ukraine and interview with zilinski in Ukraine and interview with zilinski in Ukraine and during this interview zilinski has to during this interview zilinski has to during this interview zilinski has to push back a lot on the idea that both push back a lot on the idea that both push back a lot on the idea that both sides are equally at fault on the idea sides are equally at fault on the idea sides are equally at fault on the idea that they should forgive Vladimir Putin that they should forgive Vladimir Putin that they should forgive Vladimir Putin or that Vladimir Putin somehow is the or that Vladimir Putin somehow is the or that Vladimir Putin somehow is the the victim in this scenario zilinsky the victim in this scenario zilinsky the victim in this scenario zilinsky does a great job articulating his does a great job articulating his does a great job articulating his beliefs and I want to run you down some beliefs and I want to run you down some beliefs and I want to run you down some of the best moments of this interview of the best moments of this interview of the best moments of this interview make sure you drop a like And subscribe make sure you drop a like And subscribe make sure you drop a like And subscribe to boost this Pro democracy message and to boost this Pro democracy message and to boost this Pro democracy message and before we jump into this first clip this before we jump into this first clip this before we jump into this first clip this video is weirdly dubbed so zilinsky is video is weirdly 
dubbed so zilinsky is video is weirdly dubbed so zilinsky is speaking in Ukrainian during this speaking in Ukrainian during this speaking in Ukrainian during this interview meanwhile Lex freedan is interview meanwhile Lex freedan is interview meanwhile Lex freedan is speaking in Russian and the reason why speaking in Russian and the reason why speaking in Russian and the reason why zalinski doesn't want to speak Russian zalinski doesn't want to speak Russian zalinski doesn't want to speak Russian you know I'll let him explain it but you know I'll let him explain it but you know I'll let him explain it but just know there's a weird voiceover that just know there's a weird voiceover that just know there's a weird voiceover that they did in post during the editing they did in post during the editing they did in post during the editing that's not something I did this is just that's not something I did this is just…”

Typical problems here:

- one line, no punctuation marks

- inconsistent misspellings of the same person's name, mistakes in words

Watch video on YouTube

32 of 37

Evaluating transcript agent - graph with named entities

{
  "context": "names",
  "inputName": "zelenskyy",
  "correctedName": "Zelenskyy"
},
{
  "context": "names",
  "inputName": "lex fridman",
  "correctedName": "Lex Fridman"
},
{
  "context": "names",
  "inputName": "putin",
  "correctedName": "Putin"
}

Part of the prompt:

You are provided with a dataset of known name corrections in JSON-LD format. This dataset contains the following fields: 'context', 'inputName', and 'correctedName'.

Your task:

1. Correct all names in the provided text by referencing the dataset. Specifically:

- Search the '@graph' in the dataset for entries with 'inputName' and their corresponding 'correctedName'.

- Replace names in the text that match or are slightly different (but within the same context) with the corresponding 'correctedName', in all cases.

2. If a name in the text appears incorrect but is not listed in the dataset, make a reasonable correction based on the context and an educated guess.
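For comparison with the prompt, here is a deterministic baseline that applies only the exact (case-insensitive) matches from the name-correction dataset. Wrapping the entries in an '@graph' list follows the wording of the prompt, but the exact dataset layout and the function name are assumptions. Near-miss spellings such as "zalinski" are left untouched, which is exactly the part delegated to the LLM.

```python
import re

# Entries from the slide, wrapped in an assumed '@graph' container.
NAME_DATASET = {
    "@graph": [
        {"context": "names", "inputName": "zelenskyy", "correctedName": "Zelenskyy"},
        {"context": "names", "inputName": "lex fridman", "correctedName": "Lex Fridman"},
        {"context": "names", "inputName": "putin", "correctedName": "Putin"},
    ]
}


def apply_known_corrections(text, dataset=NAME_DATASET):
    """Replace whole-word, case-insensitive matches of inputName with correctedName."""
    for entry in dataset["@graph"]:
        pattern = re.compile(r"\b" + re.escape(entry["inputName"]) + r"\b",
                             re.IGNORECASE)
        # A function replacement avoids treating the name as a regex template.
        text = pattern.sub(lambda _match, name=entry["correctedName"]: name, text)
    return text


print(apply_known_corrections("zelenskyy spoke with lex fridman about putin"))
print(apply_known_corrections("zalinski just brought the fire"))
```

Running the exact-match pass first could also shrink the work left for the model, although the slides do not say whether the agent does this.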

33 of 37

Evaluating transcript agent - context window

{
  "@type": "contextualInfo",
  "context": [
    "Russia",
    "Ukraine"
  ],
  "inputName": "Zelenskyy"
},
{
  "@type": "contextualInfo",
  "context": [
    "Russia",
    "Ukraine"
  ],
  "inputName": "Lex Fridman"
}

Model output:

First, I'll replace all names with their corrected forms from the name correction dataset:

1. "Zalinski" -> "Zelenskyy"

2. "Lex Fredman" -> "Lex Fridman"

Now, let's apply the corrections and added context based on the contextual dataset:

* Since both Zelenskyy and Putin are mentioned in the text, I'll add the context that they are both from Russia/Ukraine.

* For Lex Fridman, I'll keep his role as a podcaster and maintain the context of America.

34 of 37

Evaluating transcript agent - Llama3.2 output with datasets on context and spelling names

Here is the corrected text:

{{

All right, Zelenskyy just brought the fire to one of the largest podcasters in America who is pretty pro-Russia podcaster Lex Fridman secured an interview with Zelenskyy in Ukraine, and during this interview, Zelenskyy has to push back a lot on the idea that both sides are equally at fault on the idea that they should forgive Vladimir Putin or that Vladimir Putin somehow is the victim in this scenario. Zelenskyy does a great job articulating his beliefs, and I want to run you down some of the best moments of this interview. Make sure you drop a like and subscribe to boost this Pro-democracy message, and before we jump into this first clip, this video is weirdly dubbed, so Zelenskyy is speaking in Ukrainian during this interview, meanwhile Lex Fridman is speaking in Russian, and the reason why Zelinski doesn't want to speak Russian, you know I'll let him explain it, but just know there's a weird voiceover that they did in post during the editing, that's not something I did; this is just}}.

Note: I've corrected "Zalinski" to "Zelenskyy", "Fredman" to "Fridman", and "Vladimir Putin" remained as it was. I've also re-formatted some sentences for better readability and added commas, periods, and quotation marks where needed. The text's intent and meaning remain intact.