1 of 45

LINKED OPEN DATA

DANIELE FUSI - TIZIANA MANCINELLI

VENICE CENTRE FOR DIGITAL AND PUBLIC HUMANITIES

JULY 8, 2020

Strand #1 - Part 1

2 of 45

Introduction

From the Web of Hypertexts to the Web of Data

3 of 45

Towards a Global Data Representation

CSS

JS

HTML

XML

XSLT

representation

transformation

presentation

DB

XML

...

software

web 3.0 (data): global data representation

HTML, CSS and JS are the pillars of the web.

HTML is a presentational markup, designed to mark the structure of hypertexts.

Hypertexts target human readers, and "talk" to them about any content: web 1.0, a web of hypertexts for humans.

similar separation in wider context: content in a separate backend, whatever its storage techs.

semantic markup: content independent and separate from presentation, which is generated by software.

web 2.0 (applications)

presentation generated from content

web 1.0 (hypertexts): global data presentation

  • content is mingled with presentation: there is only the hypertext talking about some content, not the content itself, available independently from any specific presentation.
  • presentation is static, always equal for all the users loading the same page.

web of applications, where only the topmost layer of the systems targets humans. Much smarter: everything responds to specific user requests in real time, with presentations tailored to their interactions and preferences, like using a desktop app. Yet, it's smarter by virtue of these apps and their data, rather than of the infrastructure.

bring the separation of content and presentation found in web applications into the web infrastructure itself, by providing a global data representation, rather than a presentation through hypertexts; i.e. directly publish data, with globally standard models. A web of data, targeting machines: a sort of world-wide, uniformly modeled database, where anyone can publish their own data at any time.

software systems interact with users to get data on their behalf, and present it in real time according to user interaction in a GUI

4 of 45

Mining Data from its Text Presentation

  • search engine: like an index to a huge book (the web of hypertexts)
  • each hypertext essentially has:
    • text in a natural language, intended for human readers
    • links to other hypertexts
  • a search engine indexes normalized text, and takes into account a number of related factors, like incoming links

the human language is ambiguous

human languages are different

Homer

data in traditional pages is informal and lacks structure: we just have a name ("Homer") in a text. The machine knows nothing about what a poet or a cartoon is; it just relies on words, even if in a very smart and powerful way.
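A minimal sketch of the ambiguity problem, assuming two hypothetical pages (their text is invented for illustration): a word index finds "Homer" in both, but has no way of telling the poet from the cartoon.

```python
from collections import defaultdict

# Two hypothetical pages that both mention "Homer" (illustrative text only).
pages = {
    "page1": "Homer is the poet of the Iliad",
    "page2": "Homer is a cartoon character",
}


def build_index(docs):
    """Map each normalized (lowercased) word to the set of pages containing it."""
    index = defaultdict(set)
    for page_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


index = build_index(pages)
# Both pages match: the index only knows words, not what they refer to.
print(sorted(index["homer"]))
```

The index can rank and filter, but telling the two Homers apart requires data that says what each name identifies.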

5 of 45

Representation: Structured Data

  • 1 star: available on the web (e.g. image of a printed table)
    • data locked in the document.
  • 2 stars: machine-readable structured format (e.g. XLS)
    • data locked in the proprietary format.
  • 3 stars: standard format (e.g. CSV)
    • data unlocked, but on the web rather than in the web.
  • 4 stars: global identifiers (URI like URL)
    • data in the web.
  • 5 stars: linked data
    • data in the web and linked to other data.

URI

URI

URI

Deployment requirements for more structured data

Tim Berners-Lee, 2010 - https://www.w3.org/2011/gld/wiki/5_Star_Linked_Data

6 of 45

Data Distribution: Tabular Data Sample

list of persons:

ID    name          birthDate  birthPlace
156   Marco Polo    1254       Venice
1201  Niccolò Polo  1230       Venice
368   Maffeo Polo   1230       Venice

columns (metadata): meaning of each cell in a row of data

rows: data records, each row is a person

7 of 45

Distributing by Rows

[Figure: the persons table split by rows between server A and server B; each server repeats the column headers ID, name, birthDate, birthPlace]

duplicate metadata: to define the meaning of each column in rows

8 of 45

Distributing by Columns

[Figure: the persons table split by columns between server A and server B; each server repeats the ID column (156, 1201, 368)]

duplicate IDs: to link columns to the same row (person)

9 of 45

Distributing by Columns and Rows (Cells)

[Figure: the persons table split into single cells between server A and server B; each cell carries both its row ID and its column header]

duplicate both IDs and metadata

this is the most atomic data distribution, and is the approach taken by the Semantic Web, using statement-like constructs known as triples
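A minimal sketch of distributing a table by cells: every cell becomes a (row ID, column, value) statement. The row data comes from the sample table; pairing ID 1201 with Niccolò Polo is assumed from the order on the slide, and the function name is hypothetical.

```python
# Rows from the sample table (the pairing of ID 1201 is an assumption).
rows = [
    {"ID": "156", "name": "Marco Polo", "birthDate": "1254", "birthPlace": "Venice"},
    {"ID": "1201", "name": "Niccolò Polo", "birthDate": "1230", "birthPlace": "Venice"},
]


def rows_to_triples(table, id_column="ID"):
    """Turn each non-ID cell into a (subject, predicate, object) tuple."""
    triples = []
    for row in table:
        subject = row[id_column]
        for column, value in row.items():
            if column != id_column:
                triples.append((subject, column, value))
    return triples


triples = rows_to_triples(rows)
for triple in triples:
    print(triple)
```

Each tuple repeats both the ID and the column name, which is the duplication that makes a single cell self-describing.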

10 of 45

Semantic Web Data Modeling: RDF

Resource Description Framework

https://www.w3.org/RDF/

http://www.w3.org/1999/02/22-rdf-syntax-ns#

11 of 45

Modeling Data in RDF: Triples

  • linguistic metaphor (statement):
    1. Subject (156) = row
    2. Predicate (has-name) = col
    3. Object = val (Marco Polo)
  • = "the person identified by 156 has name Marco Polo"
  • each S, P, or O is globally identified by its own URI (it's just an ID: it doesn't have to point to an existing page!), unless it's a literal
  • usually literals have their data type (string, number, date...)

triple

Marco Polo

Marco Polo @it

http://www.w3.org/2000/01/rdf-schema#label

http://dbpedia.org/resource/Marco_Polo

type = string

language = Italian

name

Marco Polo

ID

156

S

P

O

subject

predicate

object

any data is expressed by triples: e.g. Marco-Polo...

  • has-father Niccolò-Polo;
  • has-birth-date 1254;
  • has-birth-place Venice ...

this implies having vocabularies defining all the resources used as S,P, or O, each with its own URI
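A minimal sketch of the triple model, assuming two small hypothetical helper types (URI and Literal are not from any library): it writes one triple as an N-Triples line, the most verbose serialization, where every URI is spelled out in full.

```python
from typing import NamedTuple, Optional


class URI(NamedTuple):
    """A resource identified by a URI (hypothetical helper type)."""
    value: str


class Literal(NamedTuple):
    """A literal value with an optional language tag or datatype URI."""
    value: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def term_to_nt(term):
    """Serialize one term using N-Triples syntax."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    escaped = term.value.replace("\\", "\\\\").replace('"', '\\"')
    if term.lang:
        return f'"{escaped}"@{term.lang}'
    if term.datatype:
        return f'"{escaped}"^^<{term.datatype}>'
    return f'"{escaped}"'


def triple_to_nt(subject, predicate, obj):
    """Serialize a whole triple as one N-Triples line."""
    return f"{term_to_nt(subject)} {term_to_nt(predicate)} {term_to_nt(obj)} ."


line = triple_to_nt(
    URI("http://dbpedia.org/resource/Marco_Polo"),
    URI("http://www.w3.org/2000/01/rdf-schema#label"),
    Literal("Marco Polo", lang="it"),
)
print(line)
```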

12 of 45

A World of Vocabularies

  • several vocabularies are used to build ontologies (set of concepts and their relations):
    • general-purpose (e.g. Dublin Core for generic metadata: title, description, creator, date...)
    • concept-specific, e.g.:
      • FOAF (friend of a friend): persons, their activities and relations
      • DBPedia from Wikipedia infoboxes
      • Schema.org by major search engines (general-purpose metadata for web sites)
  • anyone can create vocabularies, yet trying to reuse concepts from others

13 of 45

Publishing Data on the Web

publishing data presentations (however generated) as hypertexts/GUIs for humans

publishing data representations as triples for machines

vocabulary A

vocabulary B

triples (URI)

triples from other ontologies

merge

link

web site (URL)

web 1.0/2.0

web 3.0

anyone can publish pages in a site, which gets hyperlinked to other sites. Pages are identified by URLs.

anyone can publish data as triples, which get merged into the global data graph. Concepts are identified by URIs.

14 of 45

Assumptions for a Global World

  • Anyone can say Anything about Any topic (AAA): information is born distributed, and can be nonsensical or inconsistent
  • Open World Assumption: information in any given moment is just a snapshot of what is available at that moment; so we can't rely on assuming that our information is complete
  • Nonunique Naming Assumption: given that there is no central authority to manage the IDs assigned to things, the same thing can get any number of IDs (IDs are URIs, and everyone with a domain can mint them)

layer 1

layer 1

layer 2

layer 1

Marco Polo

Niccolò Polo

has father

Marco Polo

Maffeo Polo

has father

Marco Polo

Niccolò Polo

has father

server of layer 2 down

Marco Polo

layer 2

Marco Polo

abc.com/mp

def.edu/h/156
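A minimal sketch of these assumptions, loosely following the layers pictured above: each source is a set of triples, merging is a set union, and what we know depends on which sources are reachable. The plain-string names stand in for URIs, and the function names are hypothetical.

```python
# Two independent sources ("layers") making statements about the same subject.
layer1 = {("Marco Polo", "has father", "Niccolò Polo")}
layer2 = {("Marco Polo", "has father", "Maffeo Polo")}  # AAA: possibly wrong


def merge(*graphs):
    """Merge any number of graphs: triples are just added to one set."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def fathers(graph):
    """List every object stated as father of Marco Polo in the graph."""
    return sorted(o for s, p, o in graph
                  if s == "Marco Polo" and p == "has father")


# All sources available: the merged graph holds both (inconsistent) claims.
print(fathers(merge(layer1, layer2)))
# Server of layer 2 down: we only see a snapshot, which may be incomplete.
print(fathers(merge(layer1)))
```

Nothing in the merge decides which claim is right; that judgment is left to whoever consumes the data.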

15 of 45

Sample Graph

Marco Polo

person

explorer of Asia

Marco Polo @it

Поло, Марко @ru

http://www.w3.org/2000/01/rdf-schema#label

Venice

Venezia @it

1254

http://xmlns.com/foaf/0.1/Person

http://dbpedia.org/resource/Marco_Polo

http://dbpedia.org/resource/Venice

http://dbpedia.org/resource/Category:Explorers_of_Asia

Venedig @de

http://dbpedia.org/ontology/birthDate

http://dbpedia.org/ontology/birthPlace

http://www.w3.org/1999/02/22-rdf-syntax-ns#type

http://www.w3.org/1999/02/22-rdf-syntax-ns#type

http://www.w3.org/2000/01/rdf-schema#label

http://www.w3.org/2000/01/rdf-schema#label

X is a person;

... URIs are verbose!

X was born in 1254;

X is an explorer of Asia.

this place has name "Venezia" (Italian), "Venedig" (German);

X has name "Marco Polo" (Italian), "Поло, Марко" (Russian);

X was born in a place;

everything is a triple, and each part in it has its own URI (except for literals)

16 of 45

Sample Graph

Marco Polo

person

explorer of Asia

Marco Polo @it

Поло, Марко @ru

rdfs:label

Venice

Venezia @it

1254

foaf:Person

dbr:Marco_Polo

dbr:Venice

dbc:Explorers_of_Asia

Venedig @de

rdfs:label

rdfs:label

dbo:birthDate

dbo:birthPlace

rdf:type

rdf:type

foaf: http://xmlns.com/foaf/0.1/

rdfs: http://www.w3.org/2000/01/rdf-schema#

dbo: http://dbpedia.org/ontology/

dbc: http://dbpedia.org/resource/Category:

rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#

dbr: http://dbpedia.org/resource/

shorten URIs by replacing their first portion with an arbitrarily chosen prefix (qnames)
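A minimal sketch of how qnames work, using the prefixes listed above; the function names expand and compact are hypothetical. When several namespaces match a URI, the longest one wins, so category URIs get dbc: rather than dbr:.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
    "dbc": "http://dbpedia.org/resource/Category:",
}


def expand(qname, prefixes=PREFIXES):
    """Replace the prefix of a qname with its full namespace URI."""
    prefix, _, local = qname.partition(":")
    return prefixes[prefix] + local


def compact(uri, prefixes=PREFIXES):
    """Shorten a URI to a qname, preferring the longest matching namespace."""
    best = None
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            if best is None or len(namespace) > len(prefixes[best]):
                best = prefix
    if best is None:
        return f"<{uri}>"  # no known prefix: keep the full URI
    return f"{best}:{uri[len(prefixes[best]):]}"


print(expand("dbr:Marco_Polo"))
print(compact("http://dbpedia.org/resource/Category:Explorers_of_Asia"))
```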

17 of 45

More Nodes

Marco Polo

person

explorer of Asia

Marco Polo @it

Поло, Марко @ru

rdfs:label

Venice

Venezia @it

1254

rdf:type

foaf:Person

dbr:Marco_Polo

dbr:Venice

dbc:Explorers_of_Asia

Venedig @de

rdfs:label

Venetian lagoon

dbr:Venetian_Lagoon

dbr:Ferdinand_Magellan

Magellan

rdf:type

rdf:type

Giulia Lama

dbr:Giulia_Lama

Alpi Eagles

dbr:Alpi_Eagles

dbo:hubAirport

Anthony Quinn

dbr:Anthony_Quinn

rdf:type

rdfs:label

rdfs:label

dbo:birthDate

dbo:birthPlace

dbo:nearestCity

dbo:deathPlace

The power of Linking Data: the value of the network is greater than the sum of its parts

jumping across nodes up to even remotely connected things, derived from different ontologies, all merged into the same, global data graph
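A minimal sketch of jumping across nodes: a breadth-first search over triples, ignoring edge direction. The three triples are a toy subset chosen for illustration; which nodes each edge connects, and in which direction, is an assumption, since the figure's arrows are not recoverable here.

```python
from collections import deque

# Toy subset of the graph (edge endpoints and directions are assumptions).
triples = [
    ("dbr:Marco_Polo", "dbo:birthPlace", "dbr:Venice"),
    ("dbr:Giulia_Lama", "dbo:birthPlace", "dbr:Venice"),
    ("dbr:Venetian_Lagoon", "dbo:nearestCity", "dbr:Venice"),
]


def find_path(graph, start, goal):
    """Return the shortest chain of nodes linking start to goal, or None."""
    adjacency = {}
    for s, _p, o in graph:
        adjacency.setdefault(s, set()).add(o)
        adjacency.setdefault(o, set()).add(s)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = find_path(triples, "dbr:Marco_Polo", "dbr:Giulia_Lama")
print(path)
```

Two persons described by unrelated statements become connected simply because both statements use the same URI for Venice.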

18 of 45

Serializing Data

  • RDF is an abstract data model; so we need some serialization format to store its triples
  • there are several formats, targeted to different typical usages, like:
    • RDF/XML (machine to machine)
    • JSON-LD (machine to machine)
    • N-Triples (verbose)
    • Turtle (and TriG for multiple graphs)

19 of 45

Serialization: Turtle - Tokens

  • URIs in <>
  • qnames: prefix:name where prefix is an initial portion of the URI
  • prefixes are defined in a preamble:
    • @prefix PREFIX: <URI>.
    • @prefix : <URI>. default (empty) prefix (e.g. :name)
    • @base <URI>. base URI used to resolve relative URIs
  • literals wrapped in ""
  • literal type (optional): ^^type (usually an XSD datatype)
  • abbreviation a = rdf:type

"123"^^xsd:integer

"1254-06-28"^^xsd:date

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix dbr: <http://dbpedia.org/resource/>.

dbr:Marco_Polo rdf:type foaf:Person.

Literals

QNames

dbr:Marco_Polo a foaf:Person.

Abbreviations

20 of 45

Serialization: Turtle - Statements

  • statements ended by .
  • multiple statements about the same subject: all but last ended by ;
  • multiple statements about the same subject and predicate: all but last ended by ,

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix dbr: <http://dbpedia.org/resource/>.

dbr:Marco_Polo rdfs:label "Marco Polo"@it;
    rdfs:label "Поло, Марко"@ru;
    a foaf:Person.

3 triples, shared subject
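A minimal sketch of the punctuation rules above, assuming terms are already written as Turtle tokens; the function to_turtle is hypothetical. Objects sharing subject and predicate are joined by a comma, predicates sharing the subject by a semicolon, and the whole group ends with a period.

```python
def to_turtle(subject, predicate_objects):
    """Write all statements about one subject as a single Turtle group.

    predicate_objects is a list of (predicate, [object, ...]) pairs.
    """
    lines = []
    for i, (predicate, objects) in enumerate(predicate_objects):
        head = subject if i == 0 else "   "  # indent continuation lines
        lines.append(f"{head} {predicate} {', '.join(objects)}")
    return " ;\n".join(lines) + " ."


turtle = to_turtle(
    "dbr:Marco_Polo",
    [
        ("rdfs:label", ['"Marco Polo"@it', '"Поло, Марко"@ru']),
        ("a", ["foaf:Person"]),
    ],
)
print(turtle)
```

This still encodes 3 triples; only the repeated subject and predicate are written once.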

21 of 45

Serialization: Turtle - BNodes

  • RDF blank nodes: nodes without an identity (=without URI), used to express the fact that we know some things about something, but not its identity: e.g. "someone wrote this post".
  • in Turtle a bnode is in [], either as S or O

[] a foaf:Person;

foaf:name "Ted".

there is an otherwise unknown person named Ted

dbr:Marco_Polo x:hasMistress [
    a foaf:Person
].

bnode as S

bnode as O

Marco Polo had a mistress, and we don't know anything else about her

22 of 45

Turtle Sample

Marco Polo

person

explorer of Asia

Marco Polo @it

Поло, Марко @ru

rdfs:label

Venice

Venezia @it

1254

foaf:Person

dbr:Marco_Polo

dbr:Venice

dbc:Explorers_of_Asia

Venedig @de

rdfs:label

rdfs:label

dbo:birthDate

dbo:birthPlace

rdf:type

rdf:type

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix dbo: <http://dbpedia.org/ontology/>.
@prefix dbr: <http://dbpedia.org/resource/>.
@prefix dbc: <http://dbpedia.org/resource/Category:>.

dbr:Marco_Polo rdfs:label "Marco Polo"@it;
    rdfs:label "Поло, Марко"@ru;
    a foaf:Person;
    dbo:birthDate "1254-01-01"^^xsd:date;
    dbo:birthPlace dbr:Venice;
    a dbc:Explorers_of_Asia.

dbr:Venice rdfs:label "Venezia"@it,
    "Venedig"@de.

preamble

Subject Predicate Object

1

2

3

4

5

6

7

8

23 of 45

LOD Lab

24 of 45

LAB: Adding and Presenting Data

CSS

JS

HTML

XML

XSLT

representation

presentation

transformation

semantically marked text

1. add RDF data to XML TEI about persons and places

2. apply XSLT transformation

3. examine the result

1A. manually write Turtle triples from realia.csv

1B. convert Turtle into RDF/XML and paste it

25 of 45

Materials

  • TEI document from the previous LAB (TEI-original.xml; its LOD version is TEI.xml)
  • CSV table with mappings between TEI IDs and DBPedia URIs (realia.csv)
  • XSLT stylesheet to transform TEI into HTML (LOD.xsl)
  • HTML dependencies:
    • CSS stylesheets (LOD.css, indigo-pink.css)
    • JS custom web element for LOD-enabled web app (realia-list.js)

realia.csv

TEI.xml

RDF/XML

turtle

LOD.xsl

TEI.html

LOD.css

indigo-pink.css

realia-list.js

manual input

convert

paste

transform

result

26 of 45

LAB Steps

27 of 45

1. Adding Other Namespaces

  • add some namespaces at the TEI root (listed below)
  • this is required to have them at hand during XSLT transformations

<TEI xmlns="http://www.tei-c.org/ns/1.0"

     xmlns:tei="http://www.tei-c.org/ns/1.0"

     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"

     xmlns:foaf="http://xmlns.com/foaf/0.1/"

     xmlns:dbo="http://dbpedia.org/ontology/">

28 of 45

2. Adding RDF/XML to TEI Header

  • in our example, we need some triples about persons and places to feed a LOD-based web app
  • xenoData: "a container element into which metadata in non-TEI formats may be placed" (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-xenoData.html)
  • directly include RDF/XML as its content (rdf:RDF element)

<xenoData>
    <rdf:RDF>
        <rdf:Description tei:ref="#MP" rdf:about="http://dbpedia.org/resource/Marco_Polo">
            <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
            <rdfs:label xml:lang="en">Marco Polo</rdfs:label>
        </rdf:Description>
        <rdf:Description tei:ref="#NP" rdf:about="http://dbpedia.org/resource/Niccolò_and_Maffeo_Polo">
            <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
            <rdfs:label xml:lang="en">Niccolò Polo</rdfs:label>
        </rdf:Description>
    </rdf:RDF>
</xenoData>

RDF container for all persons and places

each person/place is a rdf:Description
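A minimal sketch of generating such rdf:Description elements instead of typing them, using Python's standard xml.etree.ElementTree. The helper names q and describe are hypothetical, and the DBPedia resource URI is the one the web app uses later in this deck.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "tei": "http://www.tei-c.org/ns/1.0",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # keep readable prefixes in output

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, local):
    """Build a namespaced name in ElementTree's {namespace}local notation."""
    return f"{{{NS[prefix]}}}{local}"


def describe(parent, tei_ref, uri, type_uri, label, lang="en"):
    """Append one rdf:Description with its type and label to parent."""
    description = ET.SubElement(
        parent, q("rdf", "Description"),
        {q("tei", "ref"): tei_ref, q("rdf", "about"): uri},
    )
    ET.SubElement(description, q("rdf", "type"), {q("rdf", "resource"): type_uri})
    label_element = ET.SubElement(description, q("rdfs", "label"), {XML_LANG: lang})
    label_element.text = label
    return description


rdf_root = ET.Element(q("rdf", "RDF"))
describe(rdf_root, "#MP", "http://dbpedia.org/resource/Marco_Polo",
         "http://xmlns.com/foaf/0.1/Person", "Marco Polo")
xml_text = ET.tostring(rdf_root, encoding="unicode")
print(xml_text)
```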

29 of 45

Finding DBPedia Resources

  1. TEI-scoped realia are already marked in teiHeader/profileDesc.
  2. find them (e.g. "Marco Polo") in Wikipedia: if found, you get to the Wikipedia page, whose address is like en.wikipedia.org/wiki/Marco_Polo. The final part of the URI (Marco_Polo) is used in DBPedia as the resource name.
  3. browse to the corresponding DBPedia resource page using that name: www.dbpedia.org/page/Marco_Polo.

see triples in various formats

30 of 45

Some Person Resources (realia.csv)

  • dbr prefix = http://dbpedia.org/resource/
  • foaf prefix = http://xmlns.com/foaf/0.1/
  • #Adam = Adam
  • #Christ = Jesus
  • #GrandChan = Kublai_Khan
  • #MP = Marco_Polo
  • #NP = Niccolò_Polo

TEI ID as found in header's profileDesc

the corresponding DBPedia resource URI (minus the prefix)
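A minimal sketch of reading such mappings. The actual layout of realia.csv is not shown here, so the two column names (tei_id, dbpedia_name) and the sample text are assumptions; only the ID-to-resource pairs come from the list above.

```python
import csv
import io

# Hypothetical layout: one header row, then TEI ID and DBPedia resource name.
SAMPLE_CSV = """tei_id,dbpedia_name
#MP,Marco_Polo
#GrandChan,Kublai_Khan
#Christ,Jesus
"""

DBR = "http://dbpedia.org/resource/"


def load_mappings(csv_text):
    """Return a dict from TEI ID to the full DBPedia resource URI."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["tei_id"]: DBR + row["dbpedia_name"] for row in reader}


mappings = load_mappings(SAMPLE_CSV)
print(mappings["#MP"])
```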

31 of 45

3. Adding Persons: Turtle

  • write in Turtle 2 triples:
  • Marco Polo is a person;
  • he has label "Marco Polo".

http://dbpedia.org/resource/Marco_Polo

http://xmlns.com/foaf/0.1/Person

http://www.w3.org/1999/02/22-rdf-syntax-ns#type

http://www.w3.org/2000/01/rdf-schema#label

dbr:

a

foaf:

rdfs:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix dbr: <http://dbpedia.org/resource/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

dbr:Marco_Polo a foaf:Person;
    rdfs:label "Marco Polo"@en.

32 of 45

Converting Formats

www.easyrdf.org/converter

  1. paste Turtle triples
  2. select Turtle as input format
  3. select RDF/XML as output format
  4. click Submit
  5. copy the generated output (element inside rdf:RDF)

33 of 45

Adding Persons – RDF/XML Result

  • rdf:Description contains the person
    • @tei:ref is the reference to the internal TEI ID for that person
    • @rdf:about is the subject ID
  • rdf:type is the rdf:type predicate, whose object is @rdf:resource
  • rdfs:label is the RDFS label predicate, whose object (a literal) is the element's content. Its language is @xml:lang.

<rdf:Description

tei:ref="#MP"

rdf:about="http://dbpedia.org/resource/Marco_Polo">

<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>

<rdfs:label xml:lang="en">Marco Polo</rdfs:label>

</rdf:Description>

34 of 45

Some Places Data (realia.csv)

  • dbr prefix = http://dbpedia.org/resource/
  • foaf prefix = http://xmlns.com/foaf/0.1/
  • #Constantinopel = Constantinople
  • #Great_Armenia = Kingdom_of_Armenia_(antiquity)
  • #India = India
  • #Persia = Persian_Empire
  • #Soldania = Sudak
  • #Venice = Venice

35 of 45

4. Adding Places

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix dbo: <http://dbpedia.org/ontology/>.
@prefix dbr: <http://dbpedia.org/resource/>.

dbr:Venice a dbo:Place;
    rdfs:label "Venice"@en.

<dbo:Place

 tei:ref="#Venice"

 rdf:about="http://dbpedia.org/resource/Venice">

  <rdfs:label xml:lang="en">Venice</rdfs:label>

</dbo:Place>

as for persons, the only difference being the resource type, a dbo:Place rather than a foaf:Person

36 of 45

5. Transforming into Presentation

  1. get the XSLT script (LOD.xsl) and the target HTML dependencies (LOD.css, indigo-pink.css, realia-list.js)
  2. run the script on your edited TEI document to get the HTML output
  3. open it with your browser (you must be online to use the web app inside it)

37 of 45

Presentation Result

Quick Glance

38 of 45

Final Result: HTML, CSS, JS from XML

each text color corresponds to a different TEI element

line and column numbers in gray

drop-letters

names and places connected to RDF have tips

fully working web apps for persons and places

39 of 45

Web app: Persons and Places From DBPedia

  • DBPedia: result from scraping Wikipedia info boxes (www.dbpedia.org)
  • app-realia-list is a web component, which gets a list of URIs and labels in its json-list attribute, and queries DBPedia to get and present information about persons and places

40 of 45

Web app: Feeding with Document's Data

<app-realia-list type="person" json-list='[{ "uri": "http://dbpedia.org/resource/Marco_Polo", "label": "Marco Polo" }, { "uri": "http://dbpedia.org/resource/Kublai_Khan", "label": "Kublai Khan" }]'></app-realia-list>

persons URIs and names

contains persons or places
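A minimal sketch of producing that element from a list of URIs and labels; the function realia_list_tag is hypothetical, and this is not necessarily how LOD.xsl builds it. The JSON is escaped so that quotes cannot break the HTML attribute.

```python
import json
from html import escape, unescape


def realia_list_tag(kind, items):
    """Build an app-realia-list element carrying items in its json-list attribute."""
    payload = json.dumps(items, ensure_ascii=False)
    # escape() turns quotes into entities; the browser decodes them back.
    return (f'<app-realia-list type="{kind}" '
            f'json-list="{escape(payload, quote=True)}"></app-realia-list>')


persons = [
    {"uri": "http://dbpedia.org/resource/Marco_Polo", "label": "Marco Polo"},
    {"uri": "http://dbpedia.org/resource/Kublai_Khan", "label": "Kublai Khan"},
]
tag = realia_list_tag("person", persons)
print(tag)
```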

41 of 45

Querying LOD Data

label(s)

birth and death

calculated age

abstract in selected language

link to Wikipedia

linked image

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT DISTINCT (dbr:Marco_Polo AS ?person) ?name
    ?birth_date ?birth_place ?birth_place_label
    ?death_date ?death_place ?death_place_label
    ?topic ?depiction ?abstract
WHERE {
    # required: the resource is a person with a name
    dbr:Marco_Polo a foaf:Person;
        foaf:name ?name.
    OPTIONAL {
        dbr:Marco_Polo dbo:birthDate ?birth_date;
            dbo:deathDate ?death_date;
            foaf:isPrimaryTopicOf ?topic;
            foaf:depiction ?depiction;
            dbo:abstract ?abstract.
        dbr:Marco_Polo dbo:birthPlace ?birth_place.
        ?birth_place rdfs:label ?birth_place_label.
        dbr:Marco_Polo dbo:deathPlace ?death_place.
        ?death_place rdfs:label ?death_place_label.
        # filters kept inside OPTIONAL, so that missing data does not discard the person
        FILTER(lang(?birth_place_label)="en")
        FILTER(lang(?death_place_label)="en")
    }
}

SPARQL Query
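A minimal sketch of how a client could address such a query to DBPedia. The endpoint address https://dbpedia.org/sparql and the format parameter are assumptions about the public service; the query parameter is the one defined by the SPARQL protocol. The code only builds the request URL and does not touch the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint

# A reduced query, just to keep the example short.
QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?name WHERE { dbr:Marco_Polo foaf:name ?name }"""


def build_query_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    params = {
        "query": query,
        "format": "application/sparql-results+json",  # assumed parameter
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_query_url(QUERY)
print(url)
```

Fetching this URL (for instance with urllib.request.urlopen) would return the result table as JSON, provided the endpoint is online.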

42 of 45

Querying LOD Data

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT DISTINCT (dbr:Marco_Polo AS ?person) ?name
    ?birth_date ?birth_place ?birth_place_label
    ?death_date ?death_place ?death_place_label
    ?topic ?depiction ?abstract
WHERE {
    # required: the resource is a person with a name
    dbr:Marco_Polo a foaf:Person;
        foaf:name ?name.
    OPTIONAL {
        dbr:Marco_Polo dbo:birthDate ?birth_date;
            dbo:deathDate ?death_date;
            foaf:isPrimaryTopicOf ?topic;
            foaf:depiction ?depiction;
            dbo:abstract ?abstract.
        dbr:Marco_Polo dbo:birthPlace ?birth_place.
        ?birth_place rdfs:label ?birth_place_label.
        dbr:Marco_Polo dbo:deathPlace ?death_place.
        ?death_place rdfs:label ?death_place_label.
        # filters kept inside OPTIONAL, so that missing data does not discard the person
        FILTER(lang(?birth_place_label)="en")
        FILTER(lang(?death_place_label)="en")
    }
}

Marco Polo

foaf:Person

rdf:type

foaf:name

"Marco Polo"@en

dbo:birthDate

1254-1-1

dbo:deathDate

1324-1-1

foaf:isPrimaryTopicOf

foaf:depiction

dbo:abstract

...

...

dbr:Republic_of_Venice

dbo:birthPlace

rdfs:label

"Republic of Venice"@en

dbo:deathPlace

English only

must be a person

having a name

if present, we also want: birth/death date, wiki page, default picture, abstract...

... birth place and its label...

... death place and its label...

only English labels for places

pick these data, as specified below

en.Wikipedia.org...

commons.wikimedia...

not filtered by language, get all the abstracts

43 of 45

Inspecting the Query

SPARQL

results in a table

execute
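A minimal sketch of turning a SPARQL result into table rows and deriving the calculated age. The JSON follows the standard SPARQL JSON results layout (head.vars, results.bindings), but this sample is hand-written for illustration, not real endpoint output; the dates are the ones shown in the graph above, and the function names are hypothetical.

```python
import json

# Hand-written sample in the SPARQL JSON results layout (not real output).
SAMPLE = """{
  "head": {"vars": ["name", "birth_date", "death_date", "depiction"]},
  "results": {"bindings": [{
    "name": {"type": "literal", "xml:lang": "en", "value": "Marco Polo"},
    "birth_date": {"type": "literal", "value": "1254-01-01"},
    "death_date": {"type": "literal", "value": "1324-01-01"}
  }]}
}"""


def bindings_to_rows(results):
    """Flatten bindings into dicts; unbound (optional) variables become None."""
    variables = results["head"]["vars"]
    return [
        {v: (binding[v]["value"] if v in binding else None) for v in variables}
        for binding in results["results"]["bindings"]
    ]


def age_in_years(row):
    """Approximate age as the difference between death and birth years."""
    birth_year = int(row["birth_date"].split("-")[0])
    death_year = int(row["death_date"].split("-")[0])
    return death_year - birth_year


rows = bindings_to_rows(json.loads(SAMPLE))
print(rows[0]["name"], age_in_years(rows[0]))
```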

44 of 45

Data Presentation Logic

  • selected data are fetched and filtered from DBPedia via SPARQL, and resources are identified by URIs, not by language-dependent, ambiguous names
  • data are presented in a totally customized fashion

45 of 45

Data and Services Aggregation

  • for places, we fetch also latitude and longitude
  • this allows us to use another external service to present them on a map
  • presentation based on aggregating data and services, both in the web infrastructure, rather than in the app
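A minimal sketch of handing fetched coordinates to an external map service. The deck does not name the service it uses, so OpenStreetMap and its marker URL pattern are assumptions here, as are the function name and the approximate coordinates of Venice.

```python
def osm_map_url(lat, lon, zoom=12):
    """Build an OpenStreetMap link with a marker at the given position."""
    return (f"https://www.openstreetmap.org/?mlat={lat}&mlon={lon}"
            f"#map={zoom}/{lat}/{lon}")


# Approximate coordinates of Venice, used only as sample input.
url = osm_map_url(45.4375, 12.3358)
print(url)
```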