Semantic Web Crawling: a Sitemap Extension

Version 0.91.1

Editors:

Giovanni Tummarello (DERI Galway - giovanni.tummarello@deri.org)

Status:

This is a working draft; feedback is requested.

Motivation:

There are many ways by which semantically structured data on the Semantic Web (SW) can be made available and consumed.
An online SW database can serve descriptions of its URIs/URLs, as in the Linked Data paradigm [1]. It can offer dumps of the entire database for download. It can offer a SPARQL access point, or it can embed information in XML dialects such as RDFa. While the data might be the same in each case, the implications of accessing it in one way rather than another can be significant.

For example, if a client wants to execute a relatively simple query over the DBpedia database, it should probably use the available SPARQL service. On the other hand, if it wants to execute a great number of queries, it should probably download the RDF dump and run them locally. Equally, to learn about updates to a specific concept, it should access DBpedia’s specific URLs rather than download the entire database too often. Similar issues arise with Semantic Web crawlers, which can easily cause a denial of service when they reach a database with a great number of linked data URIs.

To address these issues we introduce the Semantic Crawler extension to the Sitemap protocol [2]. By using this extension, a host can both avoid denial of service caused by compliant robots and help compliant clients find alternative, possibly better, ways to access the host's data.

Note: The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Description:

The Sitemap protocol defines a file by which automatic agents can obtain a list of URLs which they should index, along with meta information such as the expected rate of change of each individual URL [2]. The protocol also defines a way to extend robots.txt, so that a robot can find the location of the website's sitemap [3].

The Semantic Crawler extension is composed of special tags to use within a sitemap. The main idea is to describe equivalences and alternative ways for a Semantic Robot (client or crawler) to access Semantic Web data offered by a host. One such equivalence is declared by the <sc:dataset> tag, which is used at the same level as <url> tags in the sitemap.

(Note that for the examples in this document we assume that the sc namespace is mapped to the location of this extension, that is http://sindice.com/swcrawling/schema/0.1 . This is usually done in the opening <urlset> tag of the sitemap, as shown in the Appendix A examples.)

A dataset can have many representations, or ways to be accessed. These are described by tags to be used within a dataset:


The semantics of the <sc:dataset> tag is that the data underlying each of the representations is the same.

This means, for example, that the merge of the RDF descriptions of all the linked data URLs served by the server MUST at least be contained, in RDF terms [4], in the model pointed to by the <sc:dataDumpLocation> tag or in the database powering the SPARQL endpoint pointed to by the <sc:sparqlEndpointLocation> tag.

Cases where this is not true SHOULD be limited to technical delays of relatively minor importance introduced, e.g., by the creation of the RDF dump happening only at certain times of the day or by server-side caching of the RDF representation of popular URLs/URIs.

If a sitemap contains more than one dataset definition, these are treated independently.

Within a dataset tag the following optional tags can also be used:

Constraints and values

<sc:linkedDataPrefix>


There can be any number of <sc:linkedDataPrefix> tags in a dataset: if there are several, the dataset is said to contain the union of all the linked data served under the different prefixes.
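As a minimal sketch of the union semantics, a client could decide whether a URI belongs to a dataset by testing it against every declared prefix. The function name and the second prefix below are hypothetical, introduced only for illustration.

```python
def dataset_covers(uri, linked_data_prefixes):
    """Return True if `uri` is served under any of the dataset's prefixes.

    The dataset is the union of the linked data under all prefixes,
    so matching a single prefix is enough.
    """
    return any(uri.startswith(prefix) for prefix in linked_data_prefixes)


# Values of the <sc:linkedDataPrefix> tags of one dataset.
# The second prefix is a made-up example.
prefixes = [
    "http://example.org/products/",
    "http://example.org/vendors/",
]

print(dataset_covers("http://example.org/products/widget42", prefixes))
print(dataset_covers("http://example.org/about.html", prefixes))
```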


<sc:dataDumpLocation> and <sc:sparqlEndpointLocation>

There can be any number of <sc:dataDumpLocation> and <sc:sparqlEndpointLocation> tags in a dataset: the interpretation is that of "mirror locations", both for dumps and for SPARQL endpoints.

Using a different host for a mirror is possible (e.g. a backup SPARQL server on a different host or a mirror RDF dump). A client, however, should believe such statements only if the same statements are also given in the sitemap on the second host.
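The cross-host rule above could be checked roughly as follows. This is a simplified sketch, not a normative algorithm: it assumes the client has already fetched and parsed the sitemap of the second host, and it reduces "the same statements are also given" to "the same location is also declared there". The function names and the `declared_by_host` structure are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL, e.g. 'backup.example.org'."""
    return urlsplit(url).hostname


def is_trusted_location(sitemap_url, location, declared_by_host):
    """Decide whether a dump or endpoint location can be believed.

    declared_by_host maps a host name to the set of locations declared
    in that host's own sitemap (assumed to be collected beforehand).
    """
    if host_of(location) == host_of(sitemap_url):
        # Same host as the sitemap: no confirmation needed.
        return True
    # Different host: believe it only if that host's sitemap agrees.
    return location in declared_by_host.get(host_of(location), set())


declared = {
    "backup.example.org": {"http://backup.example.org/cataloguedump.rdf"},
}
print(is_trusted_location(
    "http://example.org/sitemap.xml",
    "http://backup.example.org/cataloguedump.rdf",
    declared,
))
```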

In case of multiple such tags, the priority integer attribute can be specified in each tag to indicate the preference of the source:
 

<sc:dataFragmentDump>

There should be either zero or at least two <sc:dataFragmentDump> tags in a dataset: if an RDF dataset is fragmented, there should be at least 2 fragments. Fragments can also be on a different host, following the same rule as above.

<sc:datasetURI>

There can be zero or one <sc:datasetURI>, and its value should not be the same as that of any other dataset. It is STRONGLY recommended to set a URI for the dataset. A URI for the dataset is usually minted in the domain name of the host but does not need to point to an actual web resource (though the linked data paradigm suggests that it point, over HTTP, to a resource which provides a description of the dataset itself; see the examples).

<sc:datasetLabel>, <changefreq>

There can be zero or one of each of <sc:datasetLabel> and <changefreq>.
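The cardinality constraints of this section can be checked mechanically. The following is a sketch only, using Python's standard XML parser; the tag names follow the spelling used in this section, the function names are hypothetical, and the sample sitemap is deliberately invalid to show what gets reported.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
SC = "http://sindice.com/swcrawling/schema/0.1"


def check_dataset(dataset):
    """Return a list of constraint violations for one <sc:dataset> element."""
    def count(ns, name):
        return len(dataset.findall(f"{{{ns}}}{name}"))

    problems = []
    # Zero or one of each of these tags is allowed.
    for ns, name in ((SC, "datasetURI"), (SC, "datasetLabel"), (SM, "changefreq")):
        if count(ns, name) > 1:
            problems.append(f"more than one <{name}>")
    # A fragmented dump needs zero or at least two fragments.
    if count(SC, "dataFragmentDump") == 1:
        problems.append("a fragmented dump needs at least two fragments")
    return problems


def check_sitemap(xml_text):
    """Check every dataset, plus uniqueness of dataset URIs across datasets."""
    root = ET.fromstring(xml_text)
    problems, seen_uris = [], set()
    for dataset in root.findall(f"{{{SC}}}dataset"):
        problems.extend(check_dataset(dataset))
        uri = dataset.findtext(f"{{{SC}}}datasetURI")
        if uri is not None:
            uri = uri.strip()
            if uri in seen_uris:
                problems.append(f"datasetURI used by more than one dataset: {uri}")
            seen_uris.add(uri)
    return problems


# Deliberately invalid: two <changefreq> tags and a single fragment.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sindice.com/swcrawling/schema/0.1">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
    <sc:datasetURI>http://example.org/aboutcatalog.rdf#catalog</sc:datasetURI>
    <sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
    <sc:dataFragmentDump>http://example.org/cataloguedump_part1.rdf</sc:dataFragmentDump>
    <changefreq>monthly</changefreq>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>"""

for problem in check_sitemap(SAMPLE):
    print(problem)
```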

Behaviour from a compliant client

To be considered compliant with this extension, a Semantic Web spider or client MUST check and interpret dataset tags. This means checking robots.txt, looking for the sitemap location, retrieving the sitemap and interpreting it to choose the most appropriate, in general the least intrusive, way to access the data.
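The first step, finding the sitemap location in robots.txt, might look like the sketch below. It works on the text of an already retrieved robots.txt and relies on the "Sitemap:" line defined by the Sitemap protocol [3]; the function name and the robots.txt content are illustrative assumptions.

```python
def sitemap_locations(robots_txt):
    """Return the sitemap URLs announced in the text of a robots.txt file."""
    locations = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # whole-line comment
        # Split on the first colon only, so the URL keeps its own colons.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            locations.append(value.strip())
    return locations


# Illustrative robots.txt content, not taken from a real site.
ROBOTS = """\
# robots.txt for example.org
User-agent: *
Disallow: /private/
Sitemap: http://example.org/sitemap.xml
"""

print(sitemap_locations(ROBOTS))
```

A client would then retrieve each returned location and look for dataset tags in it.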

For example, it MUST download a full data dump rather than crawl the entire resolvable URL space, while on the other hand it SHOULD NOT redownload the dump (but rather access the URL/URI directly) if it is looking for a specific update. In agreement with the Sitemap protocol itself, a spider might however decide on a recrawling schedule different from the one indicated in the <changefreq> tag.
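One possible reading of these rules, together with the DBpedia scenarios in the Motivation section, is sketched below. The task names ("bulk", "lookup", "query"), the dictionary layout and the function name are assumptions made for illustration; the extension itself does not define them.

```python
def choose_access(task, dataset):
    """Pick the least intrusive access method for a task.

    task    -- "bulk" (needs all the data), "lookup" (update about one
               specific URI) or "query" (a single, fairly simple query)
    dataset -- dict with optional "dumps" and "endpoints" lists, as
               extracted from one <sc:dataset> element
    Returns a (method, location) pair; location is None when the method
    works on individual URIs rather than on a declared location.
    """
    dumps = dataset.get("dumps", [])
    endpoints = dataset.get("endpoints", [])

    if task == "bulk":
        # A full dump is preferred over crawling the whole URL space.
        return ("dump", dumps[0]) if dumps else ("crawl", None)
    if task == "lookup":
        # Do not redownload the dump for a specific update.
        return ("dereference", None)
    if task == "query":
        if endpoints:
            return ("sparql", endpoints[0])
        return ("dump", dumps[0]) if dumps else ("crawl", None)
    raise ValueError(f"unknown task: {task!r}")


catalog = {
    "dumps": ["http://example.org/cataloguedump.rdf"],
    "endpoints": ["http://example.org/queryengine/sparql"],
}
for task in ("bulk", "lookup", "query"):
    print(task, choose_access(task, catalog))
```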

Different dump formats

This extension does not specify in which RDF flavour the data dumps or the linked data URLs should be made available. While the format is generally expected to be RDF/XML, HTTP content negotiation can be used to negotiate different formats. Similarly, a dump might be provided in compressed format (e.g. gzipped). In such a case a client might want to rely on the MIME type or, at its own risk, on other rules (e.g. data analysis or file extension matching).
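A client-side sketch of these rules for gzipped dumps is given below. The order of the checks (MIME type first, then the gzip magic bytes as a simple form of data analysis, then the file extension) and the accepted MIME type strings are assumptions, not something this extension prescribes; the function names are hypothetical.

```python
import gzip

GZIP_TYPES = ("application/gzip", "application/x-gzip")
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def detect_gzip(content_type, url, raw):
    """Return the rule that identified the dump as gzipped, or None."""
    if content_type in GZIP_TYPES:
        return "mimetype"
    if raw[:2] == GZIP_MAGIC:
        return "data analysis"
    if url.endswith(".gz"):
        return "file extension"
    return None


def read_dump(content_type, url, raw):
    """Return the dump bytes, decompressed if they appear to be gzipped.

    Relying on anything but the MIME type is at the client's own risk:
    gzip.decompress raises an error if the guess turns out to be wrong.
    """
    if detect_gzip(content_type, url, raw):
        return gzip.decompress(raw)
    return raw


payload = b"<rdf:RDF/>"  # placeholder bytes standing in for a real dump
packed = gzip.compress(payload)
print(detect_gzip("application/rdf+xml", "http://example.org/cataloguedump.rdf", packed))
print(read_dump("application/gzip", "http://example.org/cataloguedump.rdf", packed) == payload)
```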

Security Issues

In adopting dataset definitions on a website which serves RDF, adopters must be aware that, just like robots.txt or sitemaps, the mechanism works on a voluntary basis. It is up to Semantic Crawler writers to keep the most respectful behavior toward the resources offered by the server by reading the sitemap and properly interpreting it.

APPENDIX A: Examples

The following examples illustrate sitemaps using this extension. Feel free to copy and modify these examples for your own purposes.

The URLSET preamble

To use the extended terms defined here, the usual sitemap urlset element needs to include the proper namespace definitions. In practice, rather than the usual:


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
.....
.....

you should use:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
         http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">

which points the sc namespace to this extension, as defined by the schema available online at
http://sindice.com/swcrawling/schema/0.1

Let's now look at some concrete examples.

1: A basic example

The following example states that a dataset with label "Product Catalog for Example.org" is available at http://example.org/cataloguedump.rdf. Furthermore, this data is what powers the resolution of URIs/URLs in the space http://example.org/products/ . Finally, this data is said to change with a monthly frequency.
<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
         http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">

<sc:dataset>
<sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
<sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:linkedDataPrefix>http://example.org/products/</sc:linkedDataPrefix>
 <changefreq>monthly</changefreq>
</sc:dataset>
   <url>...
</url>
<url>...
</url>
.....

</urlset>

2: A complete example

The following extends the previous example with statements saying that the same dataset is available in fragmented form. A URI is also defined for the dataset.
This URI is a URL minted in the same host space ( http://example.org/aboutcatalog.rdf#catalog ). While there is no need for such a URI to be a URL or to point to an HTTP-retrievable document (in this case http://example.org/aboutcatalog.rdf), the Semantic Web linked data paradigm strongly advises doing so. This way, the URI can be resolved into a description of the concept itself (in this case resolving the URI would give the aboutcatalog.rdf file, which could and in fact should contain statements about http://example.org/aboutcatalog.rdf#catalog).
This example also indicates the address of a SPARQL endpoint for the dataset.

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
         http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">

<sc:dataset>
<sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
<sc:datasetURI>http://example.org/aboutcatalog.rdf#catalog</sc:datasetURI>
<sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:dataFragmentDumpLocation>http://example.org/cataloguedump_part1.rdf</sc:dataFragmentDumpLocation>
<sc:dataFragmentDumpLocation>http://example.org/cataloguedump_part2.rdf</sc:dataFragmentDumpLocation>
<sc:linkedDataPrefix>http://example.org/products/</sc:linkedDataPrefix>
<sc:sparqlEndpointLocation>http://example.org/queryengine/sparql</sc:sparqlEndpointLocation>
 <changefreq>monthly</changefreq>
</sc:dataset>

</urlset>

3: Data on different Hosts and priorities

In this example the website wishes to state that it has data served as an RDF dump and by 2 SPARQL endpoints, one on the same website and one on another. Please note that for the statement involving an external site, the same statement MUST also be placed in the sitemap of the other site for it to be trusted by a compliant crawler.

The priority values in the mirrored data dumps and SPARQL endpoints are set so that the backup dump location will not be used unless the primary location is not responding.
On the other hand, the host wishes the secondary SPARQL service to receive a share of 1/(10+default) = 1/11 of the requests, with the main one receiving the rest.
<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
         http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">

<sc:dataset>
<sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
<sc:datasetURI>http://example.org/aboutcatalog.rdf#catalog</sc:datasetURI>
<sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:dataDumpLocation priority="0">http://backup.example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:sparqlEndpointLocation priority="10">http://example.org/queryengine/sparql</sc:sparqlEndpointLocation>
<sc:sparqlEndpointLocation>http://secondary.example.org/queryengine/sparql</sc:sparqlEndpointLocation>
 <changefreq>monthly</changefreq>
</sc:dataset>

</urlset>
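The arithmetic behind the priorities in example 3 can be sketched as follows. The interpretation is an assumption inferred from the figures given for this example: a missing priority attribute counts as 1, each location is used in proportion to its priority, and a priority of 0 means "fallback only". The function name is hypothetical.

```python
from fractions import Fraction

DEFAULT_PRIORITY = 1  # assumed, inferred from "1/(10+default) = 1/11"


def usage_shares(locations):
    """Map each location to the share of requests it should receive.

    locations maps a location to its priority attribute, or to None when
    the attribute is absent. A share of 0 marks a fallback location, to
    be used only when the other locations do not respond.
    """
    weights = {
        loc: DEFAULT_PRIORITY if priority is None else priority
        for loc, priority in locations.items()
    }
    total = sum(weights.values())
    if total == 0:
        # Every location is a fallback; no preference can be computed.
        return {loc: Fraction(0) for loc in weights}
    return {loc: Fraction(weight, total) for loc, weight in weights.items()}


endpoints = {
    "http://example.org/queryengine/sparql": 10,
    "http://secondary.example.org/queryengine/sparql": None,
}
dumps = {
    "http://example.org/cataloguedump.rdf": None,
    "http://backup.example.org/cataloguedump.rdf": 0,
}
for location, share in {**usage_shares(endpoints), **usage_shares(dumps)}.items():
    print(share, location)
```

Under these assumptions the main endpoint gets 10/11 of the requests and the secondary one 1/11, while the backup dump gets a share of 0.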

Appendix B: Schemas

This datamap extension schema is available at

It is reported here for convenience:

Appendix C: Additional Issues and Ideas

To be discussed: is it valid for a Sitemap to refer to subdomains, e.g. for a sitemap posted on http://uniprot.ch to say that their dumps are on ftp://ftp.uniprot.ch without having an FTP sitemap?
To be discussed: provide an entry point in the linked dataset (e.g. an example URI). Pro: seems useful. Con: once you have the datasets, one could easily find many URIs with a regex.
To be discussed: allow one to specify which form of description algorithm the site is using to serve linked data from the DB. Is it a Concise Bounded Description? Is it a symmetric one? Is it the sum of all the MSGs? Is it just the triples involving the URI as subject, object or both? Pro: it seems important to be able to get a single dump and decompose it into something similar (e.g. a subset) to what the site serves at each URL/URI. This enables a search engine to build indexes that are more accurate (e.g. point to the right linked data). Con: many sites don't use a single strategy; a search engine could try to understand this by itself by analyzing some of the RDF served from the linked data and comparing it with the content of the RDF dump.

References

[1] Linking Open Data on the Semantic Web - http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[2] The Sitemap protocol - http://www.sitemaps.org/protocol.php
[3] The Sitemap protocol: robots.txt extension - http://www.sitemaps.org/protocol.php#submit_robots
[4] RDF Semantics

Acknowledgments

The following people have contributed to the making of this extension:

Chris Bizer (Free University Berlin)
Richard Cyganiak (Free University Berlin)
Stefano Mazzocchi (SIMILE - MIT)
Christian Morbidoni (SEMEDIA - Universita' Politecnica delle Marche)
Michele Nucci (SEMEDIA - Universita' Politecnica delle Marche)
Eyal Oren (DERI Galway)
Leo Sauermann (DFKI)

History and Revision

2/6/2007 - first version created, called Semantic Crawling Ontology and based on an RDF/XML model and circulated for internal review
5/6/2007 - 0.9 incorporates first reviews and adds examples
3/7/2007 - 0.91 complete rewriting of the syntax and the proposed working mechanism as a Sitemap extension
12/7/2007 - 0.91.1 added Appendix C with some ideas and issues as gathered from the Linking Open Data group

Last updated 12/7/2007