Semantic Web Crawling: a Sitemap Extension
Version 0.91.1
Editors:
Giovanni Tummarello (DERI Galway - giovanni.tummarello@deri.org)
Status:
This is a working draft, feedback is requested.
Motivation:
There are many ways by which semantically structured data on the Semantic Web
(SW) can be made available and consumed.
An online SW database can serve descriptions of its URIs/URLs, as in the Linked
Data paradigm [1]. It can offer dumps of the entire database for download. It
can offer a SPARQL access point, or it can embed information in XML dialects
such as RDFa. While the data might be the same in each case, the implications
of accessing it in one way or another can differ greatly.
For example, if a client wanted to execute a relatively simple query over the
DBPedia database, it should probably use the available SPARQL service. On the
other hand if it wanted to execute a great number of queries, it should probably
download the RDF dump and run them locally. Equally, to learn about updates to a
specific concept, it should access DBPedia’s specific URLs rather than
downloading the entire database too often. Similar issues arise with Semantic
Web crawlers, which can easily cause a denial of service when they reach a
database with a great number of linked data URIs.
To address these issues we introduce the Semantic Crawler extension to the
Sitemap protocol [2]. By using this extension, a host can both avoid
denial of service by compliant robots and help compliant clients find
alternative and possibly better ways to access the host data.
Note: The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document
are to be interpreted as described in RFC 2119.
Description:
The sitemap protocol defines a way to create a file by which automatic agents
can obtain a list of URLs which they should index, along with meta information
such as the expected rate of change for each individual URL [2]. The protocol
also defines a way to extend robots.txt so that a robot can find the location
of a website's sitemap [3].
The Semantic Crawler extension is composed of special tags to use within a
sitemap. The main idea is to describe equivalences and alternative ways for a
Semantic Robot (client or crawler) to access Semantic Web data offered by a
host. One such equivalence is declared by the <sc:dataset> tag, which is used
at the same level as <url> tags in the sitemap.
(Note that for the examples in this document we will assume that the sc
namespace is mapped to the location of this extension, that is
http://sindice.com/swcrawling/schema/0.1. This is usually done in the <urlset>
opening sitemap tag, as shown in the Appendix A examples.)
A dataset can have many representations, or ways to be accessed. These are
described by tags to be used within a dataset:
- <sc:linkedDataPrefix> - A prefix for linked data that a server hosts.
  URI/URLs that begin with this prefix will resolve to their RDF description.
- <sc:dataDumpLocation> - The location of an RDF data dump.
- <sc:dataFragmentDump> - The location of a fragment of an RDF data dump, used
  for large datasets which are "split" into several files.
- <sc:sparqlEndpointLocation> - The location of a SPARQL endpoint.
The semantics of the <sc:dataset> tag is that the data underlying each of the
representations is the same.
This means, for example, that the merge of the RDF descriptions of all the
linked data URLs served by the server MUST at least be contained, in RDF terms
[4], in the model pointed to by the <sc:dataDumpLocation> tag or in the
database powering the SPARQL endpoint pointed to by the
<sc:sparqlEndpointLocation> tag.
Cases where this is not true SHOULD be limited to technical delays of relatively
minor importance introduced, e.g., by the creation of the RDF dump happening only
at certain times of the day or by server-side caching of the RDF representation
of popular URLs/URIs.
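As a non-normative illustration, the containment requirement can be sketched as a subset test. This is a simplification and not part of the extension: triples are modelled as plain (subject, predicate, object) tuples and are assumed to contain no blank nodes, so the sketch does not implement full RDF semantics [4]. The function name and sample data are hypothetical.

```python
# Sketch only: triples are (subject, predicate, object) string tuples and are
# assumed to be ground (no blank nodes), so containment is a plain subset test.
Triple = tuple[str, str, str]


def descriptions_contained(descriptions: list[set[Triple]], dump: set[Triple]) -> bool:
    """Return True if the merge of all linked data descriptions is inside the dump."""
    merged: set[Triple] = set().union(*descriptions)
    return merged <= dump


# Hypothetical data: two descriptions served as linked data, one dump.
dump = {
    ("ex:p1", "rdfs:label", "Widget"),
    ("ex:p2", "rdfs:label", "Gadget"),
}
served = [
    {("ex:p1", "rdfs:label", "Widget")},
    {("ex:p2", "rdfs:label", "Gadget")},
]
print(descriptions_contained(served, dump))
```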
If a sitemap contains several dataset definitions, these are treated
independently.
Within a dataset tag the following optional tags can also be used:
- <sc:datasetURI> - It is STRONGLY recommended to set a URI for the dataset.
  This can be used by the website itself, as well as by others, to provide
  further annotation.
- <sc:datasetLabel> - A label describing the dataset.
- <changefreq> - The changefreq element as defined by the sitemap protocol, to
  describe how often it is expected that the dataset will be updated (e.g.
  monthly), see [2].
Constraints and values
<sc:linkedDataPrefix>
There can be any number of <sc:linkedDataPrefix> in a dataset: in this case
the dataset is said to contain the union of all the linked data served under
the different prefixes.
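A client could test whether a URI belongs to a dataset with several prefixes roughly as follows. This is a minimal sketch; the function name and the second prefix are made up for illustration.

```python
def in_dataset(uri: str, prefixes: list[str]) -> bool:
    """Return True if the URI begins with any declared <sc:linkedDataPrefix>."""
    return any(uri.startswith(prefix) for prefix in prefixes)


# Hypothetical dataset declaring two prefixes.
prefixes = ["http://example.org/products/", "http://example.org/vendors/"]
print(in_dataset("http://example.org/products/42", prefixes))
print(in_dataset("http://example.org/about", prefixes))
```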
<sc:dataDumpLocation> and <sc:sparqlEndpointLocation>
There can be any number of <sc:dataDumpLocation> and
<sc:sparqlEndpointLocation> in a dataset: the interpretation is that of
"mirror locations", both for dumps and for SPARQL endpoints.
Using a different host for a mirror is possible (e.g. a backup SPARQL server on
a different host or a mirror RDF dump). A client, however, should believe such
statements only if the same statements are also given in the sitemap on the
second host.
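One possible reading of this cross-check is sketched below. It assumes that "the same statement" means the other host's own sitemap lists the same location; the function name and the shape of the input (a mapping from host name to the locations found in that host's sitemap) are assumptions of this sketch.

```python
from urllib.parse import urlparse


def trusted_locations(sitemap_host: str,
                      declared: list[str],
                      confirmed_by_host: dict[str, set[str]]) -> list[str]:
    """Keep the locations a client may believe.

    A location on the sitemap's own host is accepted as is. A location on
    another host is accepted only if that host's sitemap declares it too.
    """
    trusted = []
    for location in declared:
        host = urlparse(location).hostname
        if host == sitemap_host or location in confirmed_by_host.get(host, set()):
            trusted.append(location)
    return trusted


declared = [
    "http://example.org/cataloguedump.rdf",
    "http://backup.example.org/cataloguedump.rdf",
]
# Nothing has been confirmed by backup.example.org yet, so only the local dump is kept.
print(trusted_locations("example.org", declared, {}))
```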
In case of multiple such tags, the priority integer attribute can be specified
in each tag to indicate the preference of the source:
- A value of 0 for priority means that the source SHOULD NOT be used unless the
  other ones are not responding.
- A value other than 0 indicates that the source SHOULD be used with a
  probability equal to the priority divided by the sum of the priorities of all
  the mirror locations.
- For sources not specifying a priority, the default priority value is 1.
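The priority rules can be sketched as a weighted random choice. The function names are hypothetical. The sketch also makes two assumptions that the text leaves open: a priority 0 source is tried only when no source with a positive priority responds, and the weights are renormalised over the sources that do respond.

```python
import random
from fractions import Fraction

DEFAULT_PRIORITY = 1  # priority assumed when the attribute is absent


def selection_probabilities(mirrors):
    """Map each location to priority / sum of all priorities.

    `mirrors` is a list of (location, priority) pairs; a priority of None
    means the attribute was not specified.
    """
    weights = {loc: DEFAULT_PRIORITY if prio is None else prio for loc, prio in mirrors}
    total = sum(weights.values())
    if total == 0:
        return {loc: Fraction(0) for loc in weights}
    return {loc: Fraction(w, total) for loc, w in weights.items()}


def choose_mirror(mirrors, is_responding=lambda loc: True, rng=random):
    """Pick one location; fall back to a priority 0 source only as a last resort."""
    weights = [(loc, DEFAULT_PRIORITY if prio is None else prio) for loc, prio in mirrors]
    preferred = [(loc, w) for loc, w in weights if w > 0 and is_responding(loc)]
    if preferred:
        locations, ws = zip(*preferred)
        return rng.choices(locations, weights=ws, k=1)[0]
    for loc, w in weights:
        if w == 0 and is_responding(loc):
            return loc
    return None


# Hypothetical mirrors: priority 3, unspecified (defaults to 1) and 0 (backup).
mirrors = [
    ("http://a.example/sparql", 3),
    ("http://b.example/sparql", None),
    ("http://c.example/sparql", 0),
]
print(selection_probabilities(mirrors))
```

With these values the first source is chosen with probability 3/4, the second with 1/4, and the third only when neither of the others responds.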
<sc:dataFragmentDump>
There should be either zero or at least two <sc:dataFragmentDump> in a dataset:
if an RDF dataset is fragmented, there should be at least 2 fragments.
Fragments can also be on a different host, following the same rule as above.
<sc:datasetURI>
There can be zero or one <sc:datasetURI>, and its value should not be the same
as that of any other dataset. Again, it is STRONGLY recommended to set a URI
for the dataset. A URI for the dataset is usually minted in the domain name of
the host but does not need to point to an actual web resource (though it is
suggested by the linked data paradigm that it points, via HTTP, to a resource
which provides a description of the dataset itself, see the examples).
<sc:datasetLabel>, <changefreq>
There can be zero or one <sc:datasetLabel> and zero or one <changefreq>.
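The cardinality constraints of this section can be checked mechanically. The sketch below assumes a dataset has already been parsed into a dictionary mapping tag names (without the sc prefix) to lists of values; that representation and the function names are assumptions of this example.

```python
def validate_dataset(dataset: dict[str, list[str]]) -> list[str]:
    """Return a list of constraint violations for one parsed dataset."""
    problems = []
    # Zero or one of each of these tags.
    for tag in ("datasetURI", "datasetLabel", "changefreq"):
        if len(dataset.get(tag, [])) > 1:
            problems.append(f"more than one <{tag}>")
    # Zero fragments, or at least two.
    if len(dataset.get("dataFragmentDump", [])) == 1:
        problems.append("a single <dataFragmentDump>: expected zero or at least two")
    return problems


def duplicate_dataset_uris(datasets: list[dict[str, list[str]]]) -> set[str]:
    """Return dataset URIs that are used by more than one dataset."""
    seen, duplicates = set(), set()
    for dataset in datasets:
        for uri in dataset.get("datasetURI", []):
            if uri in seen:
                duplicates.add(uri)
            seen.add(uri)
    return duplicates


bad = {
    "datasetLabel": ["Catalog", "Catalogue"],
    "dataFragmentDump": ["http://example.org/cataloguedump_part1.rdf"],
}
print(validate_dataset(bad))
```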
Behaviour from a compliant client
To be considered compliant with this extension, a Semantic Web spider or client
MUST check and interpret dataset tags. This means checking robots.txt, looking
for the sitemap location, retrieving the sitemap and interpreting it to choose
the most appropriate, in general the least intrusive, way to access the data.
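The first step, finding the sitemap location in robots.txt, might look like the sketch below. It relies on the "Sitemap:" line defined by the sitemap protocol [3]. The function name and the sample robots.txt content are hypothetical, and no network access is performed.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Extract the URLs given on "Sitemap:" lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        # Split only on the first colon so that the URL itself stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


robots_txt = """User-agent: *
Disallow: /private/
Sitemap: http://example.org/sitemap.xml
"""
print(sitemap_urls(robots_txt))
```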
For example, it MUST download a full data dump rather than crawl the entire
resolvable URL space, while on the other hand it SHOULD NOT redownload the file
(but rather access the URL/URI directly) if it is looking for a specific update.
In agreement with the sitemap protocol itself, a spider might, however, decide
on a recrawling schedule different from the one indicated in the
<changefreq> tag.
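The choice of access method can be sketched as a small decision function. The task names ("full_copy", "single_resource", "query") and the returned labels are invented for this illustration, and the dataset is assumed to be a dictionary mapping tag names to lists of declared values.

```python
def choose_access(task: str, dataset: dict[str, list[str]]) -> str:
    """Pick the least intrusive representation for a hypothetical task name."""
    has_dump = bool(dataset.get("dataDumpLocation") or dataset.get("dataFragmentDump"))
    if task == "full_copy":
        # Prefer a dump over crawling the whole resolvable URL space.
        return "dump" if has_dump else "crawl_linked_data"
    if task == "query":
        if dataset.get("sparqlEndpointLocation"):
            return "sparql"
        return "dump" if has_dump else "crawl_linked_data"
    if task == "single_resource":
        # A specific update: resolve the URI instead of redownloading the dump.
        return "linked_data_uri"
    raise ValueError(f"unknown task: {task}")


dataset = {
    "dataDumpLocation": ["http://example.org/cataloguedump.rdf"],
    "linkedDataPrefix": ["http://example.org/products/"],
}
print(choose_access("full_copy", dataset))
```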
Different dump formats
This extension does not specify in which RDF flavour the data dumps or the
linked data URLs should be made available. While the format is expected to be,
in general, RDF/XML, HTTP content negotiation can be used to negotiate
different formats. Similarly, a dump might be provided in compressed format
(e.g. gzipped). In such a case a client might want to rely on the MIME type or,
at its own risk, on other rules (e.g. data analysis or file extension
matching).
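As an example of the "data analysis" option, a client could look at the first two bytes of a downloaded dump, which are 0x1f 0x8b for gzip data. This is only a sketch of one heuristic; the function name is hypothetical, and other compression formats are not handled.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_dump(raw: bytes) -> bytes:
    """Return the dump content, decompressing it if it looks gzipped."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


plain = b"<rdf:RDF/>"
print(read_dump(gzip.compress(plain)) == read_dump(plain))
```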
Security Issues
In adopting dataset definitions on a website which serves RDF, adopters must
be aware that, just like robots.txt or sitemaps, the mechanism works on a
voluntary basis. It is up to Semantic Crawler writers to keep the most
respectful behavior toward the resources offered by the server by reading the
sitemap and properly interpreting it.
APPENDIX A: Examples
The following examples illustrate sitemaps using this extension. Feel
free to copy and modify these examples for your own purposes.
The URLSET preamble
To use the extended terms defined here, the usual sitemap urlset element needs
to include the proper namespace definitions. In practice, rather than the
usual:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
.....
.....
You should use
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">
which points the sc namespace to this extension, as defined by the schema available online at http://sindice.com/swcrawling/schema/0.1
Let's now see some concrete examples.
1: A basic example
The following example states that a dataset with label "Product catalog for
Example.org" is available at http://example.org/cataloguedump.rdf. Furthermore,
such data is what powers the resolution of URIs/URLs in the space
http://example.org/products/ . Finally, this data is said to change with a
monthly frequency.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">
<sc:dataset>
<sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
<sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:linkedDataPrefix>http://example.org/products/</sc:linkedDataPrefix>
<changefreq>monthly</changefreq>
</sc:dataset>
<url>...
</url>
<url>...
</url>
.....
</urlset>
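A sitemap of this kind can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a small inline sample. The sample document and the function name are made up for illustration, and the tag names follow the ones defined in the Description section.

```python
import xml.etree.ElementTree as ET

SC_NS = "http://sindice.com/swcrawling/schema/0.1"

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sindice.com/swcrawling/schema/0.1">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
    <sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
    <sc:linkedDataPrefix>http://example.org/products/</sc:linkedDataPrefix>
    <changefreq>monthly</changefreq>
  </sc:dataset>
  <url><loc>http://example.org/</loc></url>
</urlset>"""


def parse_datasets(sitemap_xml: str) -> list[dict[str, list[str]]]:
    """Return one dict per <sc:dataset>, mapping local tag names to their values."""
    root = ET.fromstring(sitemap_xml)
    datasets = []
    for dataset in root.findall(f"{{{SC_NS}}}dataset"):
        entry: dict[str, list[str]] = {}
        for child in dataset:
            # ElementTree tags look like "{namespace}name"; keep only the name.
            name = child.tag.rsplit("}", 1)[-1]
            entry.setdefault(name, []).append((child.text or "").strip())
        datasets.append(entry)
    return datasets


for dataset in parse_datasets(SAMPLE):
    print(dataset)
```

Attributes such as priority are ignored here; they are available through each child element's attrib mapping if needed.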
2: A complete example
The following extends the previous example with statements saying that the same
dataset is available in fragmented form. Also, a URI is defined for the dataset.
Such URI is a URL minted in the same host space (
http://example.org/aboutcatalog.rdf#catalog ). While there is no need for such a
URI to be a URL and to be minted to point to an HTTP retrievable document (in
this case http://example.org/aboutcatalog.rdf), the Semantic Web linked data
paradigm strongly advises doing so. By doing this, the URI can be resolved into
a description of the concept itself (in this case resolving the URI would give
the aboutcatalog.rdf file, which could and should in fact contain statements
about http://example.org/aboutcatalog.rdf#catalog).
Also, this example indicates the address of a SPARQL endpoint for the dataset.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">
<sc:dataset>
<sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
<sc:datasetURI>http://example.org/aboutcatalog.rdf#catalog</sc:datasetURI>
<sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:dataFragmentDump>http://example.org/cataloguedump_part1.rdf</sc:dataFragmentDump>
<sc:dataFragmentDump>http://example.org/cataloguedump_part2.rdf</sc:dataFragmentDump>
<sc:linkedDataPrefix>http://example.org/products/</sc:linkedDataPrefix>
<sc:sparqlEndpointLocation>http://example.org/queryengine/sparql</sc:sparqlEndpointLocation>
<changefreq>monthly</changefreq>
</sc:dataset>
</urlset>
3: Data on different Hosts and priorities
In this example the website wishes to state that it has data served as an RDF
dump and by 2 SPARQL endpoints, one on the same website and one on another.
Please note that for the statement involving an external site, the same
statement MUST also be placed in the sitemap of the other site for this to be
trusted by a compliant crawler.
The priority values in the mirrored data dumps and SPARQL endpoints are set so
that the backup data dump location will not be used unless the primary location
is not responding.
On the other hand, the host wishes that the secondary SPARQL service be used
with probability 1/(10+default) = 1/11, the main one being used the remaining
10/11 of the time.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sindice.com/swcrawling/schema/0.1">
<sc:dataset>
<sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
<sc:datasetURI>http://example.org/aboutcatalog.rdf#catalog</sc:datasetURI>
<sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:dataDumpLocation priority="0">http://backup.example.org/cataloguedump.rdf</sc:dataDumpLocation>
<sc:sparqlEndpointLocation priority="10">http://example.org/queryengine/sparql</sc:sparqlEndpointLocation>
<sc:sparqlEndpointLocation>http://secondary.example.org/queryengine/sparql</sc:sparqlEndpointLocation>
<changefreq>monthly</changefreq>
</sc:dataset>
</urlset>
Appendix B: Schemas
This datamap extension schema is available at
It is reported here for convenience:
Appendix C: Additional Issues and Ideas
To be discussed: whether it is valid for a Sitemap to refer to subdomains, e.g.
for a sitemap posted on http://uniprot.ch to say that their dumps are on
ftp://ftp.uniprot.ch without having an ftp sitemap.
To be discussed: provide an entry point in the linked dataset (e.g. an example
URI). Pro: seems useful. Con: once you have the datasets one could easily find
many URIs by a regex.
To be discussed: allow one to specify which form of description algorithm is
used to serve linked data from the DB. Is it a Concise Bounded Description? Is
it a symmetric one? Is it the sum of all the MSGs? Is it just the triples
involving the URI as subject, object or both? Pro: it seems important to be able
to get a single dump and decompose it into something similar (e.g. a subset) to
what the site serves at each URL/URI. This enables a search engine to build
indexes that are more accurate (e.g. point to the right linked data). Con: many
sites don't use a single strategy; a search engine could try to understand this
by itself by analyzing some of the RDF served from the linked data and comparing
it with the content of the RDF dump.
References
[1] Linking Open Data on the Semantic Web -
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[2] The Sitemap protocol - http://www.sitemaps.org/protocol.php
[3] The Sitemap protocol: robots.txt extension -
http://www.sitemaps.org/protocol.php#submit_robots
[4] RDF Semantics
Acknowledgments
The following people have contributed to the making of this extension:
Chris Bizer (Free University Berlin)
Richard Cyganiak (Free University Berlin)
Stefano Mazzocchi (SIMILE - MIT)
Christian Morbidoni (SEMEDIA - Universita' Politecnica delle Marche)
Michele Nucci (SEMEDIA - Universita' Politecnica delle Marche)
Eyal Oren (DERI Galway)
Leo Sauermann (DFKI)
History and Revision
2/6/2007 - first version created, called Semantic Crawling Ontology and based
on an RDF/XML model and circulated for internal review
5/6/2007 - 0.9 incorporates first reviews and adds examples
3/7/2007 - 0.91 complete rewriting of the syntax and the proposed working
mechanism as a Sitemap extension
12/7/2007 - 0.91.1 added Appendix C with some ideas and issues as gathered from
the Linking Open Data group
Last updated 12/7/2007