Let's dive into EntitySchemas
Slides for Data Modelling Days event
Wikidata - 30/Nov/2023
Janeiro Digital
Agenda
Preliminaries: RDF, ShEx, Entity schemas, …
Tools for entity schemas
Exercise creating an entity schema
Some applications of entity schemas
Discussion
RDF graphs
RDF = W3C recommendation (since 1999)
Lingua franca of Semantic Web
Based on triples
(subject, predicate, object)
Most nodes are URIs
Interoperability
[Figure: example RDF graph. Nodes: :TimBL, :London, :CERN, :UK, :Spain, :Human, :City, :Metropolis, :Organization, plus the literals "Tim Berners-Lee" and "1955-06-08"^^xsd:date. Edge labels: rdf:type, :birthPlace, :employer, rdfs:label, :birthDate, :country, :knows.]
RDF ecosystem
One data model, several syntaxes: Turtle, N-Triples, JSON-LD
Vocabularies: RDF Schema, OWL, SKOS, etc.
prefix : <http://example.org/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
:timbl rdf:type :Human ;
:birthPlace :london ;
rdfs:label "Tim Berners-Lee" ;
:birthDate "1955-06-08"^^xsd:date ;
:employer :CERN ;
:knows _:1 .
:london rdf:type :City, :Metropolis ;
:country :UK .
:CERN rdf:type :Organization .
_:1 :birthPlace :Spain .
Turtle
RDF ecosystem: SPARQL
SPARQL is an RDF query language and protocol
It enables the creation of SPARQL endpoints
select ?person ?date ?country where {
?person :birthDate ?date .
?person :birthPlace ?p .
?p :country ?country
}
?person | ?date | ?country |
:timbl | 1955-06-08 | :UK |
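The query above is a basic graph pattern: three triple patterns joined on shared variables. A minimal Python sketch of that matching process follows. It is not how a SPARQL engine is implemented; the tuple encoding of the triples and the helper names `match_pattern` and `evaluate` are assumptions made for illustration, and literals are simplified to plain strings.

```python
# Triples from the Turtle example, as (subject, predicate, object) tuples.
# Simplification: IRIs and literals are both plain strings here.
TRIPLES = [
    (":timbl", "rdf:type", ":Human"),
    (":timbl", ":birthPlace", ":london"),
    (":timbl", "rdfs:label", "Tim Berners-Lee"),
    (":timbl", ":birthDate", "1955-06-08"),
    (":timbl", ":employer", ":CERN"),
    (":timbl", ":knows", "_:1"),
    (":london", "rdf:type", ":City"),
    (":london", "rdf:type", ":Metropolis"),
    (":london", ":country", ":UK"),
    (":CERN", "rdf:type", ":Organization"),
    ("_:1", ":birthPlace", ":Spain"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Join the triple patterns one at a time, keeping compatible bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


QUERY = [
    ("?person", ":birthDate", "?date"),
    ("?person", ":birthPlace", "?p"),
    ("?p", ":country", "?country"),
]

rows = [(b["?person"], b["?date"], b["?country"]) for b in evaluate(QUERY, TRIPLES)]
print(rows)
```

The blank node `_:1` has a birth place but no birth date, so it never binds `?person`, which is why only the `:timbl` row appears in the result table.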
Wikibase graphs
Popularized by Wikidata
Wikibase = software supporting Wikidata
The values can be nodes in the graph
Example:
Tim Berners-Lee
http://www.wikidata.org/entity/Q80
[Figure: Wikibase graph around timBl (Q80). Node labels: timBl (Q80), vintCerf, Human, CERN, PA, London, UK, Spain, NewHaven. Property labels: birthDate (1955), instanceOf, employer, awarded, birthPlace, country. Qualifier labels: pointTime (2002 and 2013), togetherWith, start/end (1984/1994 and 1980/1980).]
Wikibase graphs and RDF
Wikibase graphs generate RDF serializations for each item
SPARQL endpoint and Query service available
RDF dumps
RDF
serialization
SPARQL
Query service
Wikibase
select ?name ?date ?country where {
wd:Q80 wdt:P1559 ?name .
wd:Q80 wdt:P569 ?date .
wd:Q80 wdt:P19 ?place .
?place wdt:P17 ?country
}
?name | ?date | ?country |
Tim Berners-Lee | 1955-06-08 | :UK |
Try it: https://w.wiki/5yGu
Wikibase RDF representation
(P166)
(Q3320352)
(P585)
(P1706)
(Q92743)
:Q80 :P166 :Q3320352 {|
:P585 2002;
:P1706 :Q62843, :Q92935, :Q92743
|}
(Q92935)
(Q62843)
wd:Q80 wdt:P166 wd:Q3320352 ;
p:P166 s:PQ80-494FA .
s:PQ80-494FA ps:P166 wd:Q3320352 ;
pq:P585 2002 ;
pq:P1706 wd:Q62843, wd:Q92935, wd:Q92743.
RDF
Syntax inspired by RDF-Star annotated triples syntax
Could be represented as
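To make the relationship between the compact annotated form and the RDF form concrete, here is a small Python sketch that expands one statement with qualifiers into the triples shown on this slide. The function name `statement_to_triples` and the tuple encoding are assumptions for illustration. The prefixes (`wdt:`, `p:`, `ps:`, `pq:`, `s:`) and the statement identifier are copied from the slide; details such as ranks, references and value nodes are deliberately left out.

```python
def statement_to_triples(subject, prop, value, qualifiers, statement_id):
    """Expand one Wikibase statement into the triples of its RDF form.

    Simplified sketch: ranks, references and value nodes are not modelled.
    """
    node = f"s:{statement_id}"
    triples = [
        (f"wd:{subject}", f"wdt:{prop}", value),  # direct triple
        (f"wd:{subject}", f"p:{prop}", node),     # link to the statement node
        (node, f"ps:{prop}", value),              # main value of the statement
    ]
    # Each qualifier value becomes one pq: triple on the statement node.
    for q_prop, q_values in qualifiers.items():
        for q_value in q_values:
            triples.append((node, f"pq:{q_prop}", q_value))
    return triples


triples = statement_to_triples(
    subject="Q80",
    prop="P166",
    value="wd:Q3320352",
    qualifiers={
        "P585": ["2002"],
        "P1706": ["wd:Q62843", "wd:Q92935", "wd:Q92743"],
    },
    statement_id="PQ80-494FA",  # identifier as written on the slide
)

for triple in triples:
    print(" ".join(triple), ".")
```

The direct `wdt:` triple is what simple queries use; the `p:`/`ps:`/`pq:` path is needed as soon as a query or a schema has to look at qualifiers.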
RDF, the good parts...
RDF as an integration language
RDF as a lingua franca for semantic web and linked data
Basis for knowledge representation
RDF flexibility
Data can be adapted to multiple environments
Reusable data by default
RDF tools
RDF data stores & SPARQL
Several serializations: Turtle, JSON-LD, RDF/XML...
Can be embedded in HTML (Microdata/RDFa)
RDF, the other parts
Consuming & producing RDF
Describing and validating RDF content
SPARQL endpoints are not well documented
Typical documentation = set of SPARQL queries
Difficult to know where to start doing queries
Producer
Consumer
?
SPARQL endpoint
Why describe & validate RDF?
For producers
Understand the contents they are going to produce
Ensure they produce the expected structure
Advertise and document the structure
Generate interfaces
For consumers
Understand the contents
Verify the structure before processing it
Query generation & optimization
Producer
Consumer
Shapes
SPARQL endpoint
Similar technologies
Technology | Schema |
Relational Databases | DDL |
XML | DTD, XML Schema, RelaxNG, Schematron |
JSON | JSON Schema |
RDF | ? |
Fill that gap
Schemas for RDF?
RDF is flexible and does not impose a schema, but...
In practice, there are implicit schemas
Assumed by producers and consumers
Shapes can make schemas explicit
Handle malformed/incomplete data
Avoid defensive programming
Producer
Consumer
Shapes
SPARQL endpoint
Shapes
Shapes for consensus building
Initial motivation: clinical data models (FHIR)
Distributed, extensible content models
Distributed by location and authority
Extensible content models
Shared schemas
Understandable by domain experts
...and machine processable
Example of a shape
A shape describes
The form of a node (node constraint)
Incoming/outgoing arcs from a node
Possible values associated with those arcs
:timbl rdfs:label "Tim Berners-Lee" ;
:birthPlace :london ;
:birthDate "1955-06-08"^^xsd:date ;
:employer :CERN .
RDF Node
<Researcher> {
rdfs:label xsd:string ;
:birthPlace @<Place> ? ;
:birthDate xsd:date ? ;
:employer @<Organization> * ;
}
ShEx
ShEx evolution
2013 RDF Validation Workshop
Conclusions of the workshop:
There is a need for a higher-level, concise language for RDF validation
ShEx initially proposed (v 1.0)
2014 W3C Data Shapes WG chartered
2017 SHACL accepted as W3C recommendation
2017 ShEx 2.0 released as W3C Community group draft
2019 ShEx adopted by Wikidata
2022 ShEx + extends
2023 IEEE ShEx (current work in progress)
RDF description - validation - constraints
Description
Tell what something is about
What data we want/expect
Constraint
Rule that something has to obey
What data we don’t want/expect
ShEx
SHACL
Validation
Check that something conforms
Short intro to ShEx
ShEx (Shape Expressions Language)
Concise and human-readable
Syntax similar to SPARQL, Turtle
Semantics inspired by regular expressions & RelaxNG
2 syntaxes: Compact and RDF/JSON-LD
Official info: http://shex.io
Semantics: http://shex.io/shex-semantics/, primer: http://shex.io/shex-primer
ShEx implementations and playgrounds
Simple example
Nodes conforming to <Researcher> must:
prefix schema: <http://schema.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
<Researcher> {
rdfs:label xsd:string ;
schema:knows @<Researcher> * ;
}
Prefix declarations
are similar to Turtle/SPARQL
RDF Validation using ShEx
:alice rdfs:label "Alice" ;
schema:knows :alice .
:bob schema:knows :alice ;
rdfs:label "Robert".
:carol rdfs:label "Carol", "Carole" ;
schema:knows "bob" .
:dave rdfs:label 234 ;
schema:knows :bob .
:emily schema:name "Emily" .
:frank rdfs:label "Frank" ;
schema:email <mailto:frank@example.org> ;
schema:knows :alice, :bob .
Try it (RDFShape): https://rdfshape.weso.es/link/16685282129
Schema
Data
<Researcher> {
rdfs:label xsd:string ;
schema:knows @<Researcher> * ;
}
Shape map, with the validation result for each node:
:alice@<Researcher> ✔
:bob@<Researcher> ✔
:carol@<Researcher> 🗶
:dave@<Researcher> 🗶
:emily@<Researcher> 🗶
:frank@<Researcher> ✔
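The results for the six nodes can be reproduced with a hand-written check of the `<Researcher>` shape. The Python sketch below is not a ShEx validator: the graph encoding, the helpers `iri` and `lit`, and the function `is_researcher` are assumptions made for this example. Recursion through `schema:knows` is handled by assuming that a node already under examination conforms.

```python
XSD_STRING = "xsd:string"


def iri(name):
    return ("iri", name)


def lit(value, datatype=XSD_STRING):
    return ("literal", value, datatype)


# The six example nodes: node -> list of (predicate, object) arcs.
GRAPH = {
    ":alice": [("rdfs:label", lit("Alice")), ("schema:knows", iri(":alice"))],
    ":bob": [("schema:knows", iri(":alice")), ("rdfs:label", lit("Robert"))],
    ":carol": [("rdfs:label", lit("Carol")), ("rdfs:label", lit("Carole")),
               ("schema:knows", lit("bob"))],
    ":dave": [("rdfs:label", lit(234, "xsd:integer")),
              ("schema:knows", iri(":bob"))],
    ":emily": [("schema:name", lit("Emily"))],
    ":frank": [("rdfs:label", lit("Frank")),
               ("schema:email", iri("mailto:frank@example.org")),
               ("schema:knows", iri(":alice")), ("schema:knows", iri(":bob"))],
}


def is_researcher(node, assumed=frozenset()):
    """Exactly one xsd:string rdfs:label; every schema:knows value is a Researcher."""
    if node in assumed:
        # Recursive reference to a node being checked: assume it conforms.
        return True
    assumed = assumed | {node}
    arcs = GRAPH.get(node, [])
    labels = [obj for pred, obj in arcs if pred == "rdfs:label"]
    if len(labels) != 1:
        return False
    if labels[0][0] != "literal" or labels[0][2] != XSD_STRING:
        return False
    for pred, obj in arcs:
        if pred == "schema:knows":
            if obj[0] != "iri" or not is_researcher(obj[1], assumed):
                return False
    # The shape is open: other predicates such as schema:email are ignored.
    return True


results = {node: is_researcher(node) for node in GRAPH}
for node, ok in results.items():
    print(node, "conforms" if ok else "does not conform")
```

Here `:carol` fails on the two labels, `:dave` on the integer label, and `:emily` on the missing label, while `:frank` passes because the extra `schema:email` arc is allowed in an open shape.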
Validation process
:bob@:Person, :carol@:Person
ShEx
Validator
Result shape map
:Person IRI and {
:name xsd:string ;
:knows @:Person *
}
ShEx Schema
:alice :name "Alice" ;
:knows :alice .
:bob :knows :alice ;
:name "Robert".
:carol :name "Carol", "Carole" .
RDF data
Shape map
:alice@:Person,
:bob@:Person,
:carol@!:Person
Input: RDF data, ShEx schema, Shape map
Output: Result shape map
Try it (RDFShape)
Node constraints
Describe the shape of a node
:Book {
:name xsd:string ;
:datePublished xsd:date ;
:numberOfPages MinInclusive 1 ;
:author @:Person ;
:genre [ :Action :Comedy :NonFiction ] ;
:isbn /isbn:[0-9X]{10}/ ;
:publisher IRI ;
:audio . ;
:maintainer @:Person OR @:Organization
}
:Person {}
:Organization {}
:item23
:name "Weaving the Web" ;
:datePublished "2012-03-05"^^xsd:date ;
:numberOfPages 272 ;
:author :timbl ;
:genre :NonFiction ;
:isbn "isbn:006251587X" ;
:publisher <http://www.harpercollins.com/> ;
:audio <http://audio.com/item23> ;
:maintainer :alice .
Try it: (RDFShape)
Cardinalities
Inspired by regular expressions: +, ?, *, {m,n}
By default {1,1}
:Book {
:name xsd:string ;
:numberOfPages xsd:integer ? ;
:author @:Person + ;
:publisher IRI ? ;
:maintainer @:Person {1,3} ;
:related @:Book *
}
:Person {}
:Organization {}
:item23
:name "Weaving the Web" ;
:numberOfPages 272 ;
:author :timbl, :markFischetti ;
:maintainer :alice,:bob .
Try it: RDFShape
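As an illustration of how the cardinality markers translate into counting, the Python sketch below maps each marker to a (min, max) pair and counts the arcs of `:item23`. The names `parse_cardinality` and `cardinality_violations` are hypothetical, only the `{m,n}` form of the braces syntax is handled, and value constraints such as `xsd:string` or `@:Person` are ignored.

```python
import math
import re
from collections import Counter


def parse_cardinality(symbol):
    """Map a cardinality marker to (min, max); no marker means {1,1}."""
    fixed = {"": (1, 1), "?": (0, 1), "*": (0, math.inf), "+": (1, math.inf)}
    if symbol in fixed:
        return fixed[symbol]
    m = re.fullmatch(r"\{(\d+),(\d+)\}", symbol)
    if m is None:
        raise ValueError(f"unsupported cardinality: {symbol!r}")
    return int(m.group(1)), int(m.group(2))


# Cardinalities of the :Book shape above (value constraints omitted).
BOOK_CARDINALITIES = {
    ":name": "", ":numberOfPages": "?", ":author": "+",
    ":publisher": "?", ":maintainer": "{1,3}", ":related": "*",
}

# Predicates of the arcs leaving :item23, one entry per triple.
ITEM23_PREDICATES = [":name", ":numberOfPages", ":author", ":author",
                     ":maintainer", ":maintainer"]


def cardinality_violations(predicates, cardinalities):
    """Return the predicates whose number of arcs falls outside the range."""
    counts = Counter(predicates)
    violations = []
    for predicate, symbol in cardinalities.items():
        low, high = parse_cardinality(symbol)
        if not low <= counts[predicate] <= high:
            violations.append(predicate)
    return violations


print(cardinality_violations(ITEM23_PREDICATES, BOOK_CARDINALITIES))
```

For `:item23` no predicate is out of range: two authors satisfy `+`, two maintainers satisfy `{1,3}`, and the absent `:publisher` and `:related` arcs satisfy `?` and `*`.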
Recursive schemas
Well-defined support for cyclic (recursive) schemas
:Book {
:title xsd:string ;
:author @:Person * ;
:related @:Book * ;
}
:Person {
:name xsd:string ? ;
:birthDate xsd:date ? ;
:birthPlace @:Place ? ;
:knows @:Person * ;
:worksFor @:Company * ;
}
:Place {
:name xsd:string ;
:country @:Country ;
}
:Country {
:name xsd:string
}
:Company {
:name xsd:string ;
:employee @:Person
}
Try it: RDFShape
Open/Closed content models
:Person {
:name xsd:string ;
:knows . *
}
:frank :name "Frank" ;
:knows :alice, :bob ;
:email <mailto:frank@e.com> .
✔
:Person CLOSED {
:name xsd:string ;
:knows . *
}
🗶
Try it: RDFShape
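The difference between the open and the CLOSED version of `:Person` comes down to what happens with predicates the shape does not mention. The following Python sketch shows that single rule; the function `conforms_to_person` and the list encoding of `:frank` are assumptions for illustration, and datatypes are not modelled.

```python
# Arcs leaving :frank, as (predicate, object) pairs.
FRANK = [
    (":name", "Frank"),
    (":knows", ":alice"),
    (":knows", ":bob"),
    (":email", "mailto:frank@e.com"),
]


def conforms_to_person(arcs, closed=False):
    """Check `:Person { :name xsd:string ; :knows . * }`, optionally CLOSED."""
    mentioned = {":name", ":knows"}
    # :name has the default cardinality {1,1}; :knows . * accepts anything.
    if sum(1 for pred, _ in arcs if pred == ":name") != 1:
        return False
    # A CLOSED shape rejects any predicate it does not mention.
    if closed and any(pred not in mentioned for pred, _ in arcs):
        return False
    return True


print(conforms_to_person(FRANK))               # open shape
print(conforms_to_person(FRANK, closed=True))  # CLOSED shape
```

The `:email` arc is ignored by the open shape and is the reason the CLOSED shape rejects `:frank`.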
Open/Closed properties
Property values are closed by default (closed properties)
Try it: RDFShape
:Book {
:code /isbn:[0-9X]{10}/ ;
}
:item23 :code "isbn:006251587X" .
✔
:item23 :code 23 .
🗶
:Book {
:code /isbn:[0-9X]{10}/ ;
:code /isbn:[0-9]{13}/
}
:item23 :code "isbn:006251587X" ;
:code "isbn:9780062515872" .
Properties can be repeated
✔
:Book EXTRA :code {
:code /isbn:[0-9X]{10}/ ;
}
:item23 :code "isbn:006251587X" ;
:code 23 .
EXTRA declares properties as open
✔
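A rough Python sketch of the EXTRA rule for the single-constraint `:Book` shape follows. It only covers the first and last examples on this slide (one `:code` constraint with the default cardinality). The function name `conforms_to_book` is hypothetical, and the use of `re.fullmatch`, which treats the pattern as anchored, is an assumption of this sketch rather than a statement about ShEx pattern semantics.

```python
import re

# Assumption: the pattern is applied to the whole value (anchored match).
ISBN10 = re.compile(r"isbn:[0-9X]{10}")


def matches_isbn10(value):
    return isinstance(value, str) and ISBN10.fullmatch(value) is not None


def conforms_to_book(code_values, extra=False):
    """Check `:code /isbn:[0-9X]{10}/` with default cardinality {1,1}.

    Without EXTRA, every :code value must be consumed by the constraint.
    With EXTRA :code, values that do not match the constraint are tolerated.
    """
    matching = [v for v in code_values if matches_isbn10(v)]
    others = [v for v in code_values if not matches_isbn10(v)]
    if len(matching) != 1:
        return False
    return extra or not others


print(conforms_to_book(["isbn:006251587X"]))
print(conforms_to_book(["isbn:006251587X", 23]))
print(conforms_to_book(["isbn:006251587X", 23], extra=True))
```

Note that EXTRA does not relax the constraint itself: a node whose only `:code` value is `23` still fails, because no value satisfies the required pattern.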
Triple expressions
:Person {
(:name xsd:string |
:firstName xsd:string + ;
:lastName xsd:string
) ;
}
:alice :name "Alice Cooper" .
:bob :firstName "Robert" ;
:lastName "Smith" .
:carol :firstName "Carol" ;
:lastName "King" ;
:firstName "Carole" .
“Unordered” regular expressions: Regular bag expressions
:dave :firstName "Dave" ;
:name "Dave Navarro" .
Try it: RDFShape
name |
firstName + ; lastName
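Because the order of triples does not matter, the expression `name | (firstName+ ; lastName)` is matched against the bag (multiset) of predicates of a node. The Python sketch below hard-codes that one expression using `collections.Counter`; it is an illustration under that assumption, not a general regular bag expression matcher, and it ignores the `xsd:string` value constraints.

```python
from collections import Counter


def conforms_to_person(predicates):
    """Match the bag of predicates against `:name | (:firstName+ ; :lastName)`."""
    bag = Counter(predicates)
    name, first, last = bag[":name"], bag[":firstName"], bag[":lastName"]
    # Alternative 1: exactly one :name and nothing from the other branch.
    only_name = name == 1 and first == 0 and last == 0
    # Alternative 2: one or more :firstName, exactly one :lastName, no :name.
    first_and_last = name == 0 and first >= 1 and last == 1
    return only_name or first_and_last


# Predicates of each example node, one entry per triple.
NODES = {
    ":alice": [":name"],
    ":bob": [":firstName", ":lastName"],
    ":carol": [":firstName", ":lastName", ":firstName"],
    ":dave": [":firstName", ":name"],
}

results = {node: conforms_to_person(preds) for node, preds in NODES.items()}
print(results)
```

`:carol` conforms even though her `:firstName` triples are interleaved with `:lastName`, while `:dave` mixes both alternatives and matches neither.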
Logical operators
Shape Expressions can be combined with AND, OR, NOT
Some restrictions on the use of NOT combined with recursion
:Book {
:name xsd:string ;
:author @:Person OR @:Organization ;
}
:AudioBook @:Book AND {
:name MaxLength 20 ;
:readBy @:Person ;
} AND NOT {
:numberOfPages . +
}
:Person {}
:Organization {}
:item24 :name "Weaving the Web" ;
:author :timbl ;
:readBy :timbl .
:item23 :name "Weaving the Web" ;
:author :timbl ;
:numberOfPages 272 ;
:readBy :timbl .
Try it: RDFShape
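The `:AudioBook` shape is a conjunction of three parts: the `:Book` shape, an extra shape, and the negation of a third shape. A minimal Python sketch of that combination is given below. All function names are hypothetical, and because `:Person {}` and `:Organization {}` accept any node, only the number of arcs is checked for `:author` and `:readBy`.

```python
def has_exactly_one(arcs, predicate):
    return sum(1 for pred, _ in arcs if pred == predicate) == 1


def is_book(arcs):
    # :Book { :name xsd:string ; :author @:Person OR @:Organization }
    # Datatypes are not modelled; the empty shapes accept any author node.
    return has_exactly_one(arcs, ":name") and has_exactly_one(arcs, ":author")


def is_audio_book(arcs):
    names = [obj for pred, obj in arcs if pred == ":name"]
    short_name = len(names) == 1 and len(names[0]) <= 20     # :name MaxLength 20
    read_by = has_exactly_one(arcs, ":readBy")               # :readBy @:Person
    has_pages = any(pred == ":numberOfPages" for pred, _ in arcs)  # . +
    # @:Book AND { ... } AND NOT { :numberOfPages . + }
    return is_book(arcs) and short_name and read_by and not has_pages


ITEM24 = [(":name", "Weaving the Web"), (":author", ":timbl"),
          (":readBy", ":timbl")]
ITEM23 = [(":name", "Weaving the Web"), (":author", ":timbl"),
          (":numberOfPages", 272), (":readBy", ":timbl")]

print(is_audio_book(ITEM24))
print(is_audio_book(ITEM23))
```

`:item23` is still a valid `:Book`; it is only the `NOT { :numberOfPages . + }` part that excludes it from `:AudioBook`.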
Importing schemas
import statement can be used to import schemas
import <https://www.validatingrdf.com/examples/book.shex>
:AudioBook @:Book AND {
:title MaxLength 20 ;
:readBy @:Person ;
}
:Book {
:title xsd:string ;
}
:Person {
:name xsd:string ? ;
}
http://validatingrdf.com/tutorial/examples/book.shex
:item24 :name "Weaving the Web" ;
:author :timbl ;
:readBy :timbl .
Try it: RDFShape
Inheritance model for ShEx
extends allows reusing existing shapes while adding new content
Handles closed properties and shapes
Other features
Multiple inheritance
Abstract shapes
:Book {
:name xsd:string ;
:author @:Person ;
:code /isbn:[0-9]{13}/ ;
:code /isbn:[0-9X]{10}/
}
:LibraryBook extends @:Book {
:code /internal:[0-9]*/ ;
}
:item23 :name "Weaving the Web" ;
:author :timbl ;
:code "isbn:006251587X" ;
:code "isbn:9780062515872" ;
:code "internal:234" .
Try it: RDFShape
Machine processable annotations
They look like comments but are machine processable
:Book {
:name xsd:string // rdfs:label "Name"@en
// rdfs:label "Nombre"@es
// rdfs:comment "Name of book" ;
:author @:Person // rdfs:label "author"@en
// rdfs:label "autor"@es
// rdfs:comment "Book author" ;
}
Other ShEx features
Machine processable annotations
Value set ranges
Language tagged values
Semantic actions
Named expressions
Nested shapes
Shape maps
. . .
Jose E. Labra Gayo, Eric Prud’hommeaux, Iovka Boneva, Dimitris Kontokostas, Validating RDF Data, Synthesis Lectures on the Semantic Web, Vol. 7, No. 1, 1-328, DOI: 10.2200/S00786ED1V01Y201707WBE016, Morgan & Claypool (2018)
Online version: http://book.validatingrdf.com/
Example with more ShEx features
:AdultPerson EXTRA rdf:type {
rdf:type [ schema:Person ] ;
:name xsd:string ;
:age MinInclusive 18 ;
:gender [:Male :Female] OR xsd:string ;
:address @:Address ? ;
:worksFor @:Company + ;
}
:Address CLOSED {
:addressLine xsd:string {1,3} ;
:postalCode /[0-9]{5}/ ;
:state @:State ;
:city xsd:string
}
:Company {
:name xsd:string ;
:state @:State ;
:employee @:AdultPerson * ;
}
:State /[A-Z]{2}/
:alice rdf:type :Student, schema:Person ;
:name "Alice" ;
:age 20 ;
:gender :Male ;
:address [
:addressLine "Bancroft Way" ;
:city "Berkeley" ;
:postalCode "55123" ;
:state "CA"
] ;
:worksFor [
:name "Company" ;
:state "CA" ;
:employee :alice
] .
ShEx and Wikidata
Entity schemas namespace
WShEx = ShEx variant for Wikibase
WShEx
Entity schemas
ShEx
describe
describe
JSON dumps
RDF dumps
RDF
serialization
SPARQL
Query service
Wikibase
More info: https://www.weso.es/WShEx/
ShEx vs WShEx
PREFIX pq: <.../prop/qualifier/>
PREFIX ps: <.../prop/statement/>
PREFIX p: <.../prop/>
PREFIX wdt: <.../prop/direct/>
PREFIX wd: <.../entity/>
PREFIX xsd: <...XMLSchema#>
<Researcher> {
wdt:P31 [ wd:Q5 ] ;
wdt:P166 @<Award> * ;
p:P166 { ps:P166 @<Award> ;
pq:P585 xsd:dateTime ? ;
pq:P1706 @<Researcher> *
} *
}
<Award> { wdt:P17 @<Country> ? }
<Country> {}
Entity-schema - ShEx
PREFIX : <.../entity/>
<Researcher> {
:P31 [ :Q5 ] ;
:P166 @<Award> {| :P585 Time ? ,
:P1706 @<Researcher> ?
|} *
}
<Award> { :P17 @<Country> }
<Country> {}
WShEx
More info: https://www.weso.es/WShEx/
Tools for entity schemas
Wikidata
ShEx-simple tool
YaSHE
Wikishape
Command line tools
Wikidata
ShEx-simple tool
Available at: https://shex-simple.toolforge.org
Or at rawgit
YaSHE
It is deployed here: https://www.weso.es/YASHE/
Note: JetBrains seems to have another nice plugin: https://plugins.jetbrains.com/plugin/13838-rdf-and-sparql
(I haven't tried it yet)
Wikishape
Online tool for Wikidata, available here: https://wikishape.weso.es/
Command line tools
ShEx validators: shex.js, PyShEx
wb
ShEx-rs
Create an entity schema
Link: https://www.wikidata.org/wiki/Special:NewEntitySchema
See existing ones: https://www.wikidata.org/wiki/EntitySchema:E187
Tips to create entity schemas
Understand the difference between the
Select some prototypical data
Start from examples and generalize
Some tips:
Possibility: Infer from some existing data? (sheXer)
Other possibility: Reuse and import other schemas
Applications of entity schemas
Using entity schemas for Wikidata subsetting
Entity schemas can be the input of Wikidata subsetting tools
They describe the content of the subset
More info:
Wikidata subsetting: approaches, tools and evaluation, S. Hosseini, J. Labra, A. Waagmeester, A. Ammar, C. González, D. Slenter, S. Ul-Hasan, E. Willighagen, F. McNeill, A. Gray, accepted at Semantic Web Journal
Problem statement
[Figure: Wikidata subsetting pipeline. Labels: Wikidata database; JSON dumps; RDF dumps; Blazegraph; SPARQL endpoint; API (JSON view); API (RDF view); subset creation tool; subset description (ShEx, WShEx); Wikidata subset.]
GeneWiki
project
Data model
GeneWiki subset
start= @:active_site OR
@:anatomical_structure OR
. . .
@:gene OR
. . .
:active_site EXTRA wdt:P31 {
rdfs:label [ @en ] ;
wdt:P31 [ wd:Q423026 ] ;
wdt:P361 @:protein_family * ;
wdt:P527 @:protein_family * ;
}
. . .
:gene EXTRA wdt:P31 {
rdfs:label [ @en ] ;
wdt:P31 [ wd:Q7187 ];
wdt:P684 @:gene * ; # ortholog (P684)
wdt:P2293 @:disease * ; # genetic association (P2293)
wdt:P703 @:taxon * ; # found in taxon (P703)
wdt:P1057 @:chromosome * ; # chromosome (P1057)
wdt:P682 @:biological_process * ; # biological process (P682)
wdt:P688 @:protein * ; # encodes (P688)
}
. . .
Results about GeneWiki experiment
Generating forms from ShEx (entity schemas)
Prototypes based on UI ontology: ShEx-forms (Eric), shapeForms (by WESO)
Topics for discussion
End of presentation