Tutorial: Query Federation with SWObjects

Presenters: Eric Prud'hommeaux (W3C), Helena Deus (University of Texas) and M. Scott Marshall (HCLS IG co-chair). Originally presented at http://www.swat4ls.org/workshops/berlin2010/

Query federation is a way to make use of distributed data sources on the Web, including SPARQL endpoints as well as relational databases. SWObjects uses mapping rules, written as SPARQL CONSTRUCT queries, to dynamically rewrite the terms and predicates of a SPARQL query into the corresponding terms of another vocabulary and to direct the resulting query to the service that can answer it. SWObjects can also rewrite SPARQL queries into SQL, which lets relational databases take part in a linked data federation alongside SPARQL endpoints. The author of SWObjects, Eric Prud'hommeaux, is the W3C staff contact for the W3C Health Care and Life Sciences Interest Group (HCLS IG) and will give the tutorial via a teleconference connection, with assistance from Helena Deus (University of Texas) and M. Scott Marshall (HCLS IG co-chair).

Outline:

Resources

http://people.csail.mit.edu/pcm/tempISWC/workshops/SWPM2010/InvitedPaper_6.pdf 

http://purl.org/net/biordfmicroarray/demo 

Setup

Cookbook  

(Note: all resources for Eric’s slides are in the Sample Queries archive: http://www.w3.org/2010/Talks/1208-egp-swobjects/goProt.zip . The cookbook queries below do not correspond to the queries in that zip file; they were originally on the paper handout. File names in this document have been changed to avoid clashing with the names in the zip file.)

Sample SQL queries for UCSC data

Courtesy Nigam Shah (NCBO), excerpted from e-mail communications

To connect to UCSC database(s) from the command line:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu

Add a parameter to select the database and execute a query from the command line:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A <dbname> -e '<myselectstatement>'

For human P53 (UniProt ID P04637), run at the mysql prompt:

mysql> select * from uniProt.gene where acc = 'P04637';

or, equivalently, from the command line:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A uniProt -e 'select * from gene where acc ="P04637"'

You will get two rows back. Then run:

mysql> select * from go.gene_product where symbol in ('TP53', 'P53');

You will get three rows back. For each of the ids (column 1), look up the GO annotation:

mysql> use go

Database changed

mysql> select * from association where gene_product_id in (17440, 3431471, 3586773);

Putting it all together in one query:

select uniProt.gene.val  from uniProt.gene, go.gene_product, go.association

where uniProt.gene.acc ='P04637'

and go.gene_product.symbol = uniProt.gene.val

and go.gene_product.id = go.association.gene_product_id

Or get more human readable output:

select uniProt.gene.val, go.association.term_id, go.term.name  from uniProt.gene, go.gene_product, go.association, go.term

where uniProt.gene.acc ='P04637'

and go.gene_product.symbol = uniProt.gene.val

and go.gene_product.id = go.association.gene_product_id

and go.association.term_id = go.term.id

This looks up the annotations for all genes (in multiple species) whose symbol matches a gene name of the human protein identified by the UniProt ID P04637. You will get 158 results. The output can be made more informative by adding columns such as the species name, which requires the relevant joins to the species table in the "go" database.
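The shape of the combined join can also be tried offline. The following is a minimal sketch in Python using the standard-library sqlite3 module instead of the UCSC MySQL server. The table and column names follow the queries above, but every row is made-up toy data (the ids and term names are illustrative assumptions, not UCSC content), and the database qualifiers (uniProt., go.) are dropped because all tables live in one in-memory database.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with toy stand-ins for the UCSC tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (acc TEXT, val TEXT);
        CREATE TABLE gene_product (id INTEGER, symbol TEXT);
        CREATE TABLE association (gene_product_id INTEGER, term_id INTEGER);
        CREATE TABLE term (id INTEGER, name TEXT);
        -- Toy rows only; these are NOT real UCSC data.
        INSERT INTO gene VALUES ('P04637', 'TP53'), ('P04637', 'P53'), ('Q00000', 'OTHER');
        INSERT INTO gene_product VALUES (1, 'TP53'), (2, 'P53'), (3, 'OTHER');
        INSERT INTO association VALUES (1, 10), (2, 11), (3, 12);
        INSERT INTO term VALUES (10, 'toy term A'), (11, 'toy term B'), (12, 'toy term C');
    """)
    return conn


def annotations_for(conn, accession):
    """Join gene -> gene_product -> association -> term for one accession."""
    return conn.execute("""
        SELECT gene.val, association.term_id, term.name
        FROM gene, gene_product, association, term
        WHERE gene.acc = ?
          AND gene_product.symbol = gene.val
          AND gene_product.id = association.gene_product_id
          AND association.term_id = term.id
        ORDER BY association.term_id
    """, (accession,)).fetchall()


if __name__ == "__main__":
    for row in annotations_for(build_toy_db(), "P04637"):
        print(row)
```

Only the rows whose symbol equals one of the gene names recorded for the accession survive the join, which is the same filtering the MySQL query performs across the two UCSC databases.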

UCSC federated query with output

Setting up services for the federated query that crosses two relational sources:

cmd> sparql --debug 1 --stem http://ucsc.example/uniProt/ -S mysql://genome@genome-mysql.cse.ucsc.edu/uniProt --serve http://localhost:8001/uniProt

cmd> sparql --debug 1 --stem http://ucsc.example/go/ -S mysql://genome@genome-mysql.cse.ucsc.edu/go --serve http://localhost:8003/go

Create goProt3.rq and goProt3.map with the following contents:


goProt3.rq:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX uniprot: <http://purl.uniprot.org/core/>

PREFIX go: <http://www.geneontology.org/dtd/go.dtd>

PREFIX gene: <http://yetanothergenevocabulary.org#>

SELECT ?gene_symbol ?goterm

{

  _:gene uniprot:id 'P04637' ;

         skos:prefLabel ?gene_symbol .

  ?go_product gene:symbol ?gene_symbol .

  ?go_id gene:product ?go_product .

  ?go_id go:term ?goterm_id .

  ?goterm_id rdfs:label ?goterm .

}


goProt3.map:

# Common RDF vocabularies:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Uniprot and GO:

PREFIX uniprot: <http://purl.uniprot.org/core/>

PREFIX go: <http://www.geneontology.org/dtd/go.dtd>

PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

PREFIX gene: <http://yetanothergenevocabulary.org#>

# Direct graph tables:

PREFIX Ugene: <http://ucsc.example/uniProt/gene#>

PREFIX gene_product: <http://ucsc.example/go/gene_product#>

PREFIX association: <http://ucsc.example/go/association#>

PREFIX term: <http://ucsc.example/go/term#>

<uniProt> CONSTRUCT

  {

    _:gene uniprot:id ?id ; skos:prefLabel ?gene_symbol

  }

  WHERE

  {

    SELECT (fn:lower-case(?u_gene_symbol) AS ?gene_symbol)

    {

      SERVICE <http://localhost:8001/uniProt>

      {

        _:gene Ugene:acc ?id ; Ugene:val ?u_gene_symbol

      }

    }

  }

<go> CONSTRUCT

  {

    ?go_product gene:symbol ?gene_symbol .

    ?go_id gene:product ?go_product .

    ?go_id go:term ?goterm_id .

    ?goterm_id rdfs:label ?goterm .

  }

  WHERE

  {

    SERVICE <http://localhost:8003/go>

    {

      ?gp gene_product:Symbol ?gene_symbol .

      ?association association:gene_product_id ?gp .

      ?association association:term_id ?t .

      ?t term:name ?goterm

    }

  }


Create the above files and execute:

sparql -m goProt3.map goProt3.rq


@@ introduction to shared names @@

Now use the tool to create *computable* shared names. Create goProt4.rq and goProt4.map with the following contents:

goProt4.rq:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX uniprot: <http://purl.uniprot.org/core/>

PREFIX go: <http://www.geneontology.org/dtd/go.dtd>

PREFIX gene: <http://yetanothergenevocabulary.org#>

SELECT ?symbol ?label

{

 <http://www.uniprot.org/uniprot/P04637>

          skos:prefLabel ?symbol .

 ?product gene:symbol    ?symbol .

 ?id      gene:product   ?product .

 ?id      go:term        ?goterm .

 ?goterm  rdfs:label     ?label .

}


goProt4.map:

# Common RDF vocabularies:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Uniprot and GO:

PREFIX uniprot: <http://purl.uniprot.org/core/>

PREFIX unigene: <http://www.uniprot.org/uniprot/>

PREFIX go: <http://www.geneontology.org/dtd/go.dtd>

PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

PREFIX gene: <http://yetanothergenevocabulary.org#>

# Direct graph tables:

PREFIX Ugene: <http://ucsc.example/uniProt/gene#>

PREFIX gene_product: <http://ucsc.example/go/gene_product#>

PREFIX association: <http://ucsc.example/go/association#>

PREFIX term: <http://ucsc.example/go/term#>

<uniProt> CONSTRUCT

 {

   ?gene uniprot:id ?id ; skos:prefLabel ?gene_symbol

 }

 WHERE

 {

   SELECT (fn:concat(unigene:, ?id) AS ?gene)

          (fn:lower-case(?u_gene_symbol) AS ?gene_symbol)

   {

     SERVICE <http://localhost:8001/uniProt>

     {

       _:gene Ugene:acc ?id ; Ugene:val ?u_gene_symbol

     }

   }

 }

<go> CONSTRUCT

 {

   ?go_product gene:symbol ?gene_symbol .

   ?go_id gene:product ?go_product .

   ?go_id go:term ?goterm_id .

   ?goterm_id rdfs:label ?goterm .

 }

 WHERE

 {

   SERVICE <http://localhost:8003/go>

   {

     ?gp gene_product:Symbol ?gene_symbol .

     ?association association:gene_product_id ?gp .

     ?association association:term_id ?t .

     ?t term:name ?goterm

   }

 }


Create the above files and execute:

sparql -m goProt4.map goProt4.rq
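The idea behind a computable shared name is that the identifier for a gene is derived from data already in the table, here by appending the UniProt accession to a fixed namespace, so that independent parties arrive at the same IRI. A minimal Python sketch of that derivation, mirroring the fn:concat and fn:lower-case expressions in goProt4.map (the function names are hypothetical and are not part of SWObjects):

```python
# Namespace bound to the unigene: prefix in goProt4.map.
UNIGENE_NS = "http://www.uniprot.org/uniprot/"


def shared_name(accession):
    """Build the shared IRI for a UniProt accession, like fn:concat(unigene:, ?id)."""
    return UNIGENE_NS + accession


def normalized_symbol(symbol):
    """Lower-case a gene symbol, like fn:lower-case(?u_gene_symbol)."""
    return symbol.lower()


if __name__ == "__main__":
    print(shared_name("P04637"))
    print(normalized_symbol("TP53"))
```

The IRI produced for P04637 is the one used as the subject in goProt4.rq, which is why the query can name the protein directly instead of matching on a literal accession.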


Federated Microarray Data (see Helena’s slides)

The full query

PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX stat: <http://purl.org/net/biordfmicroarray/stat#>

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX void: <http://rdfs.org/ns/void#>

PREFIX dct: <http://purl.org/dc/terms/>

PREFIX biordf: <http://purl.org/net/biordfmicroarray/ns#>

PREFIX neurolex: <http://neurolex.org/wiki/Category:>

PREFIX doid: <http://purl.org/obo/owl/DOID#>

SELECT DISTINCT  ?diseaseName ?geneLabel ?geneName WHERE {

        #Retrieve a list of overexpressed genes in the entorhinal cortex of AD patients

        {

                ?sampleList        biordf:patients_have_disease        ?alzheimers .

                                FILTER (?alzheimers = doid:DOID_10652 )

                ?experimentSet        dct:isPartOf        ?microarray_experiment ;

                                biordf:has_input_value        ?sampleList ;

                                biordf:differentially_expressed_gene         ?gene ;

                                biordf:has_ouput_value ?foldChange .

                        ?gene        rdfs:label ?geneLabel ;

                        biordf:name           ?geneName .

                ?foldChange        rdf:value        ?foldChangeValue ;

                        stat:p_value ?pval .

                #Apply filters to constrain the amount of results

                FILTER (xsd:float(?foldChangeValue) > 0)

                FILTER (xsd:float(?pval) < 0.001 )                

        }

        #Find most recently updated SPARQL endpoint that contains information about genes and diseases.

        

        {

                ?source        rdf:type        void:Dataset ;

                void:sparqlEndpoint        ?srvc ;

                dct:issued                ?issued  ;

                dct:subject                diseasome:diseases ;

                dct:subject                diseasome:genes .

                OPTIONAL {

                ?source1        rdf:type        void:Dataset ;

                void:sparqlEndpoint        ?srvc2 ;

                dct:issued        ?issued2 ;

                dct:subject        diseasome:diseases ;

                dct:subject        diseasome:genes .

                FILTER (?issued2 > ?issued)

                }

                FILTER (!BOUND(?srvc2))

        }

        #Get associated diseases from most recently updated Diseasome server.

        SERVICE ?srvc {

                ?diseasomeGene rdfs:label ?geneLabel .

                ?disease diseasome:associatedGene ?diseasomeGene.

                ?disease rdfs:label ?diseaseName .

        }

}
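The second block of the full query uses a common idiom for picking a maximum without aggregates: match a dataset, optionally match any other dataset with a later dct:issued date, and keep the first one only when no later dataset was found (FILTER (!BOUND(?srvc2))). A minimal Python sketch of the same logic over a hand-written list of dataset descriptions; the endpoint URLs and dates are invented for illustration:

```python
from datetime import date


def most_recent_endpoints(datasets):
    """Return the endpoints of datasets for which no other dataset was issued later."""
    winners = []
    for source in datasets:
        # OPTIONAL { ... FILTER (?issued2 > ?issued) }: look for a strictly newer dataset.
        newer = [d for d in datasets if d["issued"] > source["issued"]]
        # FILTER (!BOUND(?srvc2)): keep the source only if the OPTIONAL matched nothing.
        if not newer:
            winners.append(source["endpoint"])
    return winners


# Hypothetical VoID descriptions; not real endpoints.
DATASETS = [
    {"endpoint": "http://example.org/diseasome-old/sparql", "issued": date(2009, 5, 1)},
    {"endpoint": "http://example.org/diseasome-new/sparql", "issued": date(2010, 11, 1)},
]

if __name__ == "__main__":
    print(most_recent_endpoints(DATASETS))
```

As in the SPARQL version, two datasets that share the latest date would both be returned, so the SERVICE ?srvc clause may be evaluated against more than one endpoint.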


The easy query, with SWObjects mapping:

mArray.rq:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX : <http://easytorememberpredicates.com/>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?gene ?geneLabel ?geneDescription ?otherDiseases

{

 ?gene :expressed_id ?disease   .

 ?gene :fold_change ?fold_change .

 ?gene rdfs:label ?geneLabel .

 

 ?gene rdfs:comment ?geneDescription .

        FILTER regex(?disease, "Alzheimer")

        FILTER (xsd:float(?fold_change) > 0)

 

 ?diseasomeGene rdfs:label ?geneLabel .

 ?diseasomeGene :also_involved_in ?otherDiseases .

}


mArray.map:

# Common RDF vocabularies:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX void: <http://rdfs.org/ns/void#>

PREFIX dct: <http://purl.org/dc/terms/>

PREFIX : <http://easytorememberpredicates.com/>

# contextual:

PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>

PREFIX stat: <http://purl.org/net/biordfmicroarray/stat#>

PREFIX biordf: <http://purl.org/net/biordfmicroarray/ns#>

PREFIX neurolex: <http://neurolex.org/wiki/Category:>

PREFIX doid: <http://purl.org/obo/owl/DOID#>

<mArray> CONSTRUCT

 {

   ?gene :expressed_id ?disease .

   ?gene :fold_change ?fold_change .

   ?gene rdfs:label ?geneLabel .

   ?gene rdfs:comment ?geneDescription .

   ?diseasomeGene rdfs:label ?geneLabel .

   ?diseasomeGene :also_involved_in ?otherDiseases .

 }

 WHERE

 {

   SERVICE <http://hcls.deri.org/sparql>

     {

    ?experimentSet        dct:isPartOf        ?microarray_experiment        ;

                        biordf:has_input_value        ?sampleList ;

                        biordf:differentially_expressed_gene         ?gene ;

                        biordf:has_ouput_value ?foldChange .

   ?sampleList        biordf:patients_have_disease        ?alzheimers .

   ?gene        rdfs:label ?geneLabel ;

                biordf:name           ?geneDescription .

   ?foldChange        rdf:value        ?fold_change ;

                stat:p_value ?pval .

        FILTER (xsd:float(?pval) < 0.001 )

        

        ?alzheimers rdfs:label ?disease .

   

        ?diseasomeGene rdfs:label ?geneLabel .

        ?diseasomeDisease diseasome:associatedGene ?diseasomeGene .

        ?diseasomeDisease rdfs:label ?otherDiseases .        

     }

 }

<diseasome> CONSTRUCT

 {

        ?diseasomeGene :also_involved_in ?otherDiseases .

 }

 WHERE {

        SERVICE <http://hcls.deri.org/sparql> {

        ?diseasomeGene rdfs:label ?geneLabel .

        ?diseasomeDisease diseasome:associatedGene ?diseasomeGene .

        ?diseasomeDisease rdfs:label ?otherDiseases .        

        }

 }


Appendix

Suppose someone wants to ask a question which spans a bunch of query services, but doesn't (care to) enumerate those resources. The ChainingMapper takes rules (currently written as SPARQL CONSTRUCTs) and maps a query over the consequents of those rules to a query over the antecedents of those rules. This can be used to partition queries into SERVICE graphs.

For a terse example, imagine service S12 includes triples with predicates <p1> and <p2>, and S3 includes triples with predicate <p3>. Voilà, a SPARQL invocation and the transformed query:

Query and mapping rules:

SPARQL -npq \

 -M 'CONSTRUCT { ?s <p1> ?o1 ; <p2> ?o2 } WHERE { SERVICE <S12> { ?s <p1> ?o1 ; <p2> ?o2 } }' \

 -M 'CONSTRUCT { ?s <p3> ?o } WHERE { SERVICE <S3> { ?s <p3> ?o } }' \

 -e 'SELECT * WHERE { ?x1 <p1> ?n1 ; <p2> ?n2 ; <p3> ?n3 }'

Transformed query:

SELECT * WHERE {

 SERVICE <S12> { ?x1 <p1> ?n1 . ?x1 <p2> ?n2 }

 SERVICE <S3> { ?x1 <p3> ?n3 }

}

If the system has a list of SERVICE graphs, then the user just has to ask the base question and the system handles query federation. What if some predicates can be obtained from multiple endpoints?

SPARQL -npq \

 -M 'CONSTRUCT { ?s <p1> ?o1 ; <p2> ?o2 } WHERE { SERVICE <S12> { ?s <p1> ?o1 ; <p2> ?o2 } }' \

 -M 'CONSTRUCT { ?s <p2> ?o2 ; <p3> ?o3 } WHERE { SERVICE <S23> { ?s <p2> ?o2 ; <p3> ?o3 } }' \

 -e 'SELECT * WHERE { ?x1 <p1> ?n1 ; <p2> ?n2 ; <p3> ?n3 }'

Now we expect <p2> to come from both S12 and S23:

SELECT * WHERE { {

   SERVICE <S12> { ?x1 <p1> ?n1 . ?x1 <p2> ?n2 }

   SERVICE <S23> { ?x1 <p2> ?_0x8752540_0_o2 . ?x1 <p3> ?n3 }

 } UNION {

   SERVICE <S12> { ?x1 <p1> ?n1 . ?x1 <p2> ?_0x874df10_1_o2 }

   SERVICE <S23> { ?x1 <p2> ?n2 . ?x1 <p3> ?n3 }

 } }

So our query for predicates <p1>, <p2> and <p3> can be answered with <S12> providing <p1> and <p2> and <S23> providing <p3>, or with <p2> coming from <S23> instead (hence the UNION).

What are those odd variables like _0x8752540_0_o2? In order to be sound with respect to the rules, the ChainingMapper has to ensure that the complete antecedent of each rule is satisfied. Looking at the first side of the UNION, <S23> is expected to match { ?s <p2> ?o2 ; <p3> ?o3 }, so the generated query must make sure that for any { ?x1 <p3> ?n3 }, ?x1 also has a property <p2> (though the value is unimportant).
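One way to picture how such placeholder variables arise: when a query pattern matches part of a rule's consequent, the rule variables bound by that match are replaced with the query's terms, and every remaining variable of the antecedent gets a fresh name. The following Python sketch illustrates that substitution only. It is not SWObjects code; the function name and the `_fresh` naming scheme are assumptions made for the example (the real tool generates names such as _0x8752540_0_o2).

```python
import itertools


def instantiate_antecedent(antecedent, bindings, counter):
    """Substitute bound rule variables; give every unbound variable a fresh name.

    antecedent: list of (subject, predicate, object) strings from a rule's WHERE part.
    bindings:   rule variable -> query term, obtained by matching the consequent.
    counter:    iterator yielding integers used to make fresh names unique.
    """
    fresh = {}
    result = []
    for triple in antecedent:
        terms = []
        for term in triple:
            if not term.startswith("?"):
                terms.append(term)  # constants (IRIs) pass through unchanged
            elif term in bindings:
                terms.append(bindings[term])  # variable fixed by the query pattern
            else:
                if term not in fresh:
                    fresh[term] = "?_fresh{}_{}".format(next(counter), term[1:])
                terms.append(fresh[term])  # placeholder: must exist, value unused
        result.append(tuple(terms))
    return result


if __name__ == "__main__":
    # Rule for S23: { ?s <p2> ?o2 ; <p3> ?o3 }; the query asked S23 only for <p3>.
    rule = [("?s", "<p2>", "?o2"), ("?s", "<p3>", "?o3")]
    print(instantiate_antecedent(rule, {"?s": "?x1", "?o3": "?n3"}, itertools.count()))
```

The result keeps the <p2> pattern in the SERVICE block with a throwaway object variable, which is the effect visible in the first branch of the UNION.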

Since those rules are too coarse, we can write more, simpler rules:

SPARQL -npq \

 -M 'CONSTRUCT { ?s <p1> ?o1 } WHERE { SERVICE <S12> { ?s <p1> ?o1 } }' \

 -M 'CONSTRUCT { ?s <p2> ?o2 } WHERE { SERVICE <S12> { ?s <p2> ?o2 } }' \

 -M 'CONSTRUCT { ?s <p2> ?o2 } WHERE { SERVICE <S23> { ?s <p2> ?o2 } }' \

 -M 'CONSTRUCT { ?s <p3> ?o3 } WHERE { SERVICE <S23> { ?s <p3> ?o3 } }' \

 -e 'SELECT * WHERE { ?x1 <p1> ?n1 ; <p2> ?n2 ; <p3> ?n3 }'

and see a simpler federated query:

SELECT * WHERE { {

   SERVICE <S12> { ?x1 <p1> ?n1 }

   SERVICE <S12> { ?x1 <p2> ?n2 }

   SERVICE <S23> { ?x1 <p3> ?n3 }

 } UNION {

   SERVICE <S12> { ?x1 <p1> ?n1 }

   SERVICE <S23> { ?x1 <p3> ?n3 }

   SERVICE <S23> { ?x1 <p2> ?n2 }

 } }
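With one rule per predicate and service, the alternatives can be enumerated as a cartesian product: each triple pattern picks one of the services that offer its predicate, and each combination becomes one branch of the UNION. A rough Python sketch of that enumeration follows. It illustrates the idea and is not the ChainingMapper's actual algorithm; the data structures and function names are assumptions.

```python
from itertools import product

# One (predicate, service) pair per simple rule from the invocation above.
RULES = [("<p1>", "<S12>"), ("<p2>", "<S12>"), ("<p2>", "<S23>"), ("<p3>", "<S23>")]

QUERY = [("?x1", "<p1>", "?n1"), ("?x1", "<p2>", "?n2"), ("?x1", "<p3>", "?n3")]


def union_branches(patterns, rules):
    """Return one list of (service, triple) pairs per way of assigning services."""
    choices = []
    for s, p, o in patterns:
        services = [svc for pred, svc in rules if pred == p]
        choices.append([(svc, (s, p, o)) for svc in services])
    # A pattern with no providing service yields no branches at all.
    return [list(combo) for combo in product(*choices)]


def to_sparql(branches):
    """Render the branches as a SELECT with a UNION of SERVICE groups."""
    parts = []
    for branch in branches:
        body = " ".join(
            "SERVICE {} {{ {} {} {} }}".format(svc, s, p, o) for svc, (s, p, o) in branch
        )
        parts.append("{ " + body + " }")
    return "SELECT * WHERE { " + " UNION ".join(parts) + " }"


if __name__ == "__main__":
    print(to_sparql(union_branches(QUERY, RULES)))
```

Only <p2> has two providers here, so the product has two combinations, matching the two UNION branches shown above (the order of SERVICE blocks within a branch may differ from the tool's output).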


I got this related Q:

> Let's say we have two endpoints - S1 and S2 - registered with the federator.

>

> Triples in S1:

>

> ?x rdf:type ?y

>

> Triples in S2:

>

> ?a rdf:type foaf:person

> ?b foaf:mbox ?c

>

> Now, I get a query:

>

> Select ?x1

> Where {

> ?x1 rdf:type foaf:person

> ?x1 foaf:mbox ?y1

> }

>

> What does the mapper do in this instance? Is there a way to tell the

> mapper to look only at S2 (or only S1 to get bindings for the first

> triple in the where clause)? Or, is the federator expected to return

> the most complete answer by doing a union across both endpoints?

We can test this:

Query:

SPARQL -npq \

 -M 'CONSTRUCT { ?x a ?y } WHERE { SERVICE <S1> { ?x a ?y } }' \

 -M 'CONSTRUCT { ?a a <person> . ?b <mbox> ?c } WHERE { SERVICE <S2> { ?a a <person> . ?b <mbox> ?c } }' \

 -e 'SELECT ?x1 WHERE { ?x1 a <person> . ?x1 <mbox> ?y1 }'

and get the expected UNION because two services can produce rdf:type

arcs:

Result (query):

SELECT ?x1

WHERE

{

 {

   SERVICE <S1>

     {

       ?x1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <person> .

     }

   SERVICE <S2>

     {

       ?_0x8580540_0_a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <person> .

       ?x1 <mbox> ?y1 .

     }

 }

UNION

 SERVICE <S2>

   {

     ?x1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <person> .

     ?x1 <mbox> ?y1 .

   }

}