Resource Discovery Taskforce

Metadata Guidelines for the RDTF

Andy Powell and Pete Johnston, Eduserv

3 Feb 2011

DRAFT FOR COMMENT

Introduction

This document provides a set of guidelines about how metadata associated with library, museum and archival collections should be made available for the purposes of supporting resource discovery in line with the Resource Discovery Taskforce (RDTF) Vision [1].

Such a vision presents a number of significant metadata challenges because it spans the library, museum and archives sectors, each of which has multiple different metadata standards in current use. The RDTF Vision is about making the metadata from organisations within these sectors available in ways that are compatible with the web and that support the development of services that are useful to researchers, teachers and students.

This set of metadata guidelines is intended to ensure that a pragmatic approach is taken, to ease the burden of metadata reuse, and to help break down any sectoral silos that currently exist. The guidelines are high level and in line with emerging practice elsewhere on the web. Whilst we are not able to recommend the adoption of a single metadata standard, we do believe that by putting this guidance in place it will be possible to create significantly more coherence in the way that metadata is created, managed and used across the library, archives and museum sectors than is currently the case.

Guiding principles

These guidelines have been developed such that they:

The guidelines are intended to help libraries, museums and archives expose existing metadata (and any new metadata that is created using existing practices) in ways that 1) support the development of aggregator services and 2) integrate well with the web of data. The intention is not to change existing cataloguing practice in libraries, museums and archives. Note that no assumptions have been made about the nature of any resulting aggregator services, which may include everything from a simple collaboration between two museums through to a full-blown national ‘cultural heritage’ discovery service.

RDTF Metadata Guidelines

RDTF metadata should be made openly available using one or more of three approaches, referred to below as the community formats approach, the RDF data approach and the Linked Data approach.

All three approaches suggest that all RDTF metadata be made available using non-proprietary formats and under an open licence. For the purposes of these guidelines, a non-proprietary format is considered to be a format for which there is a published specification, usually maintained by a standards organization, which can be used and implemented by anyone [9]. The meaning of ‘open’ is the focus of other work being undertaken by the taskforce. For the time being open is assumed to mean that the metadata is free to use, reuse, and redistribute (subject only, at most, to the requirement to attribute and share-alike) by anyone [10].

This means that for all metadata made available according to these guidelines, software developers who are building aggregations of metadata will be able to:

Similarly, end-users will be able to:

Making metadata available using the community formats approach is reasonably low cost and easy to do for the data provider, whilst encouraging openness and re-use. However, software developers may have to work quite hard to aggregate metadata across multiple providers. Exposing RDF data and Linked Data brings increasing value, in terms of the potential services offered to end-users, and lower barriers to re-use for the developers of aggregated services. However, these approaches are likely to cost data providers more (in terms of time and effort in preparing the metadata for exposure on the web) and software developers more (in terms of handling the complexity of the metadata). See below for an outline of the key benefits and costs for each approach.

The following sections describe each of the approaches.

The community formats approach

Guidance

RDTF metadata that is exposed using the community formats approach must be made available under an open licence, using a non-proprietary file format (such as one of those listed in the examples section below).

The metadata must be made available using simple HTTP GET requests or the OAI-PMH [11].

Where HTTP is used, one or more sitemaps (conforming to the Sitemap protocol [12]) should also be made available, listing the available files. The sitemaps should be listed in a robots.txt file. Sitemaps should use the following RDTF extension to differentiate RDTF files from other content:

<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rdtf="http://purl.org/rdtf"> <!-- namespace extension -->
  <url>
     <rdtf:loc>http://example.org/rdtf/catalogrecords.marc</rdtf:loc>
     ...
  </url>
</urlset>
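
As an illustration of how an aggregator might consume such a sitemap, the following Python sketch extracts the RDTF file locations from it. It assumes the namespace URI and the rdtf:loc element exactly as shown in the example above; the function name and the sample document are hypothetical, and a real harvester would fetch the sitemap over HTTP rather than read it from a string.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RDTF_NS = "http://purl.org/rdtf"  # namespace URI taken from the example above


def rdtf_locations(sitemap_xml: str) -> list:
    """Return the text of every rdtf:loc element found in a sitemap <url> entry."""
    # Parse from bytes so that the XML declaration's encoding is honoured.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locations = []
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        for loc in url.findall(f"{{{RDTF_NS}}}loc"):
            if loc.text and loc.text.strip():
                locations.append(loc.text.strip())
    return locations


# Hypothetical sitemap mixing one RDTF file with one ordinary web page.
SAMPLE = """<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rdtf="http://purl.org/rdtf">
  <url>
    <rdtf:loc>http://example.org/rdtf/catalogrecords.marc</rdtf:loc>
  </url>
  <url>
    <loc>http://example.org/about.html</loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for location in rdtf_locations(SAMPLE):
        print(location)
```

Only entries carrying the RDTF extension element are returned, so ordinary pages listed in the same sitemap are ignored.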

Where HTTP is used, GZip compression [13] may be used to reduce file sizes.

Where possible, all significant resources associated with the collection of interest should be described using separate records. For example, there should be separate records describing a physical museum artifact and any digital surrogates of that artifact (e.g. images). For the purposes of these guidelines, a significant resource is one that is likely to be of interest to end-users, differs from other resources in terms of format or other attributes, and may have different ownership and/or usage restrictions to other resources. Note that ‘resources’ (as used here) may include conceptual entities (e.g. a FRBR ‘work’), people and organisations as well as both physical and digital objects. Where metadata is encoded using CSV [14] (or similar), a record corresponds to a row in the table. Where metadata is encoded using the OAI-PMH, a record corresponds to an OAI-PMH record.

All metadata records should contain an attribute/field/property that can be used as a label or title for the resource. Where metadata is encoded using CSV (or similar), this label should appear in a column called ‘label’ or ‘title’. Where metadata is encoded using CSV (or similar), any identifier for the resource (e.g. an ISBN) should appear in a column called ‘identifier’. In addition, for any metadata encoded using CSV, the first row should contain the column headings, there should be no use of footnotes and all rows should be of the same length.
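
The CSV conventions above lend themselves to a simple automated check. The following Python sketch is one possible way to test a file against them; the function name, the wording of the messages and the case-insensitive matching of column headings are assumptions made for illustration, not part of these guidelines. A missing ‘identifier’ column is only a problem if the records actually carry identifiers.

```python
import csv
import io


def check_csv(text: str) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    # The first row is expected to hold the column headings.
    # Assumption: headings are compared case-insensitively.
    header = [heading.strip().lower() for heading in rows[0]]
    problems = []
    if "label" not in header and "title" not in header:
        problems.append("no 'label' or 'title' column")
    if "identifier" not in header:
        problems.append("no 'identifier' column")

    # Every data row should have the same number of fields as the header.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number} has {len(row)} fields, expected {len(header)}"
            )
    return problems


if __name__ == "__main__":
    sample = "identifier,title,creator\n9780140449136,The Odyssey,Homer\n"
    print(check_csv(sample) or "no problems found")
```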

Metadata about the resources associated with multiple collections may be made available. In general, where HTTP is used, there should be one file per collection; where the OAI-PMH is used, collections should be partitioned into separate repositories or separate sets within a single repository. Note that ‘collection’ (as used here) simply means any grouping of resources for curatorial, discovery or some other purpose.
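
Where collections are exposed as separate OAI-PMH sets, a harvester can request one collection at a time. The Python sketch below builds a ListRecords request URL for a given set, using the verb, metadataPrefix and set arguments defined by the OAI-PMH; the base URL and the set name are hypothetical, and resumption tokens and error handling are left out.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # one set per collection, as suggested above
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical repository and set name.
    print(list_records_url("http://example.org/oai", set_spec="maps"))
```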

Examples

For libraries, typical file format examples include library catalogue records encoded using MARC21 [15] or MODS [16], BibTeX [17], RIS [18], the CrossRef output schema [19], Dublin Core records encoded using XML [20], the Europeana Semantic Elements (ESE) format and formats based on JSON [21], Atom [22] or RSS [23].

For museums, typical file format examples include museum catalogue records encoded using SPECTRUM [24], database tables dumped as CSV files, Dublin Core records encoded using XML, the ESE format and formats based on Atom or RSS.

For archives, typical file format examples include archival descriptions encoded using EAD [25], database tables dumped as CSV files, Dublin Core records encoded using XML, the ESE format and formats based on Atom or RSS.

Benefits

As an end-user:

As a provider:

Costs/issues

As an end-user:

As an aggregator:

The RDF data approach

Guidance

RDTF metadata that is exposed using the RDF data approach must be made available under an open licence as RDF datasets [26].

The RDF datasets must be made available using simple HTTP GET requests as one or more RDF dumps (e.g. files containing RDF/XML [27], N-Triples [28], N-Quads [29] or RDF/JSON [30]).

GZip compression may be used to reduce the file size of the RDF dumps.
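
To make the idea of a compressed RDF dump concrete, the following Python sketch writes a handful of triples as N-Triples and compresses the result with GZip. It is a minimal illustration only: the URIs under example.org are hypothetical, the escaping covers just backslashes, double quotes and newlines, and non-ASCII characters, language tags and datatypes are not handled. A production system would normally use an RDF library instead.

```python
import gzip


def ntriples_line(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Format one triple as an N-Triples line (minimal escaping only)."""
    if literal:
        escaped = (
            obj.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        return f'<{subject}> <{predicate}> "{escaped}" .'
    return f"<{subject}> <{predicate}> <{obj}> ."


def gzip_dump(triples) -> bytes:
    """Serialise (subject, predicate, object, is_literal) tuples and gzip them."""
    lines = [ntriples_line(*triple) for triple in triples]
    return gzip.compress(("\n".join(lines) + "\n").encode("utf-8"))


# Hypothetical data: a museum object and a digital image of it,
# described as two separate resources.
TRIPLES = [
    ("http://example.org/id/object/1", "http://purl.org/dc/terms/title",
     'Bronze "owl" brooch', True),
    ("http://example.org/id/image/1", "http://xmlns.com/foaf/0.1/depicts",
     "http://example.org/id/object/1", False),
]

if __name__ == "__main__":
    dump = gzip_dump(TRIPLES)
    print(gzip.decompress(dump).decode("utf-8"), end="")
```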

The location of all RDF dumps must be disclosed in accordance with the Semantic Web Crawling Sitemap Extension [31].

Metadata about multiple collections may be made available. If so, the dataset corresponding to each collection should be made available using separate RDF dumps.

The dataset in each RDF dump should be described using the Vocabulary of Interlinked Datasets (VoID) [32]. VoID files should be made available over HTTP and should be listed in the sitemap(s) above.
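
A VoID description can be very small. The Python sketch below generates one in Turtle syntax, using the void:Dataset class and void:dataDump property from the VoID vocabulary together with Dublin Core terms for the title and licence. All URIs under example.org, including the licence URI, are placeholders, and the function assumes that the title contains no characters needing escaping.

```python
def void_description(dataset_uri: str, title: str, dump_url: str, licence_uri: str) -> str:
    """Return a minimal VoID description of one dataset, in Turtle syntax."""
    # Assumption: `title` contains no double quotes, backslashes or newlines.
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f"    dcterms:license <{licence_uri}> ;",
        f"    void:dataDump <{dump_url}> .",
    ]) + "\n"


if __name__ == "__main__":
    # Hypothetical collection, dump location and licence.
    print(void_description(
        "http://example.org/dataset/maps",
        "Historic maps collection",
        "http://example.org/dumps/maps.nt.gz",
        "http://example.org/licence",
    ), end="")
```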

All significant resources (as defined above) associated with the collection of interest must be assigned a unique URI. Such URIs should be ‘http’ URIs.

Examples

For libraries, the use of accepted open ontologies (FOAF [33], BIBO [34], DC [35], ORE [36], etc) and other uses of RDF data modeled according to FRBR [37] are acceptable. Examples include the Europeana Data Model (EDM) and the British Library Catalogue Dataset in RDF/XML [38].

For museums, the use of RDF data modeled according to the CIDOC CRM [39] is acceptable.

For archives, the use of RDF data modeled according to the principles underpinning EAD is acceptable.

Benefits

As an end-user:

As an aggregator:

Costs/issues

As a provider:

The Linked Data approach

Guidance

RDTF metadata that is exposed using the Linked Data approach must be made available under an open licence as RDF datasets.

The RDF datasets must be made available using simple HTTP GET requests as one or more RDF dumps (e.g. files containing RDF/XML, N-Triples, N-Quads or RDF/JSON).

GZip compression may be used to reduce the file size of the RDF dumps.

The location of all RDF dumps must be disclosed in accordance with the Semantic Web Crawling Sitemap Extension.

Metadata about multiple collections may be made available. If so, the dataset corresponding to each collection should be made available using separate RDF dumps.

The dataset in each RDF dump must include links to other (external) RDF datasets, for example those describing people, organisations, topics or places.
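
One rough way for a provider to check this requirement is to look for triples whose object is a URI on a different host. The Python sketch below does this for a list of (subject, predicate, object) strings. It rests on several assumptions: objects are plain strings, literals do not themselves look like ‘http’ URIs, and rdf:type statements are ignored because they point at vocabulary terms rather than at other datasets. The example.org URIs are hypothetical.

```python
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def external_links(triples, local_host: str) -> list:
    """Return triples whose object is an http(s) URI on a host other than local_host."""
    found = []
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            continue  # class membership is not treated as a link to another dataset
        parsed = urlparse(obj)
        if (
            parsed.scheme in ("http", "https")
            and parsed.hostname
            and parsed.hostname != local_host
        ):
            found.append((subject, predicate, obj))
    return found


# Hypothetical local data with one link out to DBpedia.
TRIPLES = [
    ("http://example.org/id/person/1", RDF_TYPE,
     "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/id/person/1", "http://xmlns.com/foaf/0.1/name", "Homer"),
    ("http://example.org/id/person/1", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Homer"),
    ("http://example.org/id/book/1", "http://purl.org/dc/terms/creator",
     "http://example.org/id/person/1"),
]

if __name__ == "__main__":
    for triple in external_links(TRIPLES, "example.org"):
        print(triple)
```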

The dataset in each RDF dump must be described using the Vocabulary of Interlinked Datasets (VoID). VoID files must be made available over HTTP and must be listed in the sitemap(s) above.

All significant resources associated with the collection of interest must be assigned a unique ‘http’ URI.

All URIs must dereference to a human-readable HTML description and an RDF description (e.g. using RDF/XML, N-Triples, RDF/JSON or RDFa [40]) of the resource, either by using one of the patterns described in Cool URIs for the Semantic Web [41] or by combining the HTML and RDF descriptions using embedded RDFa.
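
The dereferencing behaviour can be sketched as a small piece of content negotiation, loosely following the 303 redirect pattern described in Cool URIs for the Semantic Web. The Python code below picks between an HTML and an RDF description from the HTTP Accept header and returns the redirect a server might issue. The URI layout (/id/ for resources, /page/ and /data/ for documents) is an assumption made for this example only, and the Accept header parsing is deliberately simplified (wildcards are ignored and HTML is the default).

```python
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def preferred_kind(accept_header: str) -> str:
    """Return 'rdf' or 'html' according to the highest-weighted known media type."""
    best_kind, best_q = "html", 0.0  # default to HTML when nothing matches
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if media_type in RDF_TYPES:
            kind = "rdf"
        elif media_type in HTML_TYPES:
            kind = "html"
        else:
            continue  # unknown types and wildcards are ignored in this sketch
        if q > best_q:
            best_kind, best_q = kind, q
    return best_kind


def redirect_for(resource_uri: str, accept_header: str) -> tuple:
    """Return (status, location) for a 303 redirect to the matching document URI."""
    # Assumed layout: /id/... identifies the resource, /page/... is the HTML
    # description and /data/... is the RDF description.
    target = "/data/" if preferred_kind(accept_header) == "rdf" else "/page/"
    return 303, resource_uri.replace("/id/", target, 1)


if __name__ == "__main__":
    print(redirect_for("http://example.org/id/object/1", "application/rdf+xml"))
    print(redirect_for("http://example.org/id/object/1", "text/html"))
```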

Examples

For libraries, the use of RDF data modeled according to FRBR and including links to other RDF sources (such as people, organisations, topics and places) is acceptable. The JISC OpenBib project [42] provides an example of this.

For museums, the use of RDF data modeled according to the CIDOC CRM and including links to other RDF sources (such as people, organisations, topics and places) is acceptable. The CLAROS project [43] provides an example of this.

For archives, the use of RDF data modeled according to EAD and including links to other RDF sources (such as people, organisations, topics and places) is acceptable. The JISC LOCAH project [44] provides an example of this.

Benefits

As an end-user:

As an aggregator:

As a provider:

Costs/issues

As a provider:

Designing ‘http’ URIs

For metadata that is exposed using the RDF data or Linked Data approaches, all significant resources associated with the collection of interest must be assigned a unique URI. Where the Linked Data approach is used, such URIs must be ‘http’ URIs.

When assigned, all ‘http’ URIs should conform to the Designing URI Sets for the UK Public Sector guidelines [6].

Data model guidelines

Metadata that is exposed using the RDF data or Linked Data approaches should be modeled according to FRBR, the CIDOC CRM or the principles that underpin EAD. Where structural “containment” relationships are required, ORE should be used.

References

  1. RDTF Vision - http://ie-repository.jisc.ac.uk/475/1/JISC%26RLUK_VISION_FINAL.pdf
  2. eFoundations: Resource discovery revisited - http://efoundations.typepad.com/efoundations/2010/08/resource-discovery-revisited.html
  3. Linked Data principles - http://linkeddata.org/
  4. Linked Data Design Issues - http://www.w3.org/DesignIssues/LinkedData.html
  5. Linked Open Data Star Scheme by example - http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/
  6. Designing URI Sets for the UK Public Sector - http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector
  7. Europeana Data Model - http://version1.europeana.eu/c/document_library/get_file?uuid=aff89c92-b6ff-4373-a279-fc47b9af3af2&groupId=10605
  8. Europeana Semantic Elements - http://group.europeana.eu/c/document_library/get_file?uuid=a830cb84-9e71-41d6-9ca3-cc36415d16f8&groupId=10602
  9. Wikipedia: Open Format - http://en.wikipedia.org/wiki/Open_format
  10. Open Definition - http://www.opendefinition.org/
  11. OAI-PMH - http://www.openarchives.org/OAI/openarchivesprotocol.html
  12. Sitemaps XML format - http://www.sitemaps.org/protocol.php
  13. Wikipedia: gzip - http://en.wikipedia.org/wiki/Gzip
  14. Wikipedia: Comma-separated values - http://en.wikipedia.org/wiki/Comma-separated_values
  15. MARC21 - http://www.loc.gov/marc/bibliographic/
  16. MODS - http://www.loc.gov/standards/mods/
  17. BibTeX - http://www.bibtex.org/Format/
  18. Wikipedia: RIS - http://en.wikipedia.org/wiki/RIS_(file_format)
  19. CrossRef output schema - http://www.crossref.org/help/Content/CrossRef%20Schema/CrossRef%20Schema.htm
  20. Guidelines for implementing Dublin Core in XML - http://www.dublincore.org/documents/dc-xml-guidelines/
  21. JSON - http://www.json.org/
  22. Wikipedia: Atom (standard) - http://en.wikipedia.org/wiki/Atom_(standard)
  23. Wikipedia: RSS - http://en.wikipedia.org/wiki/RSS
  24. SPECTRUM - http://www.collectionstrust.org.uk/index.cfm/collection-management/spectrum/
  25. Encoded Archival Description - http://www.loc.gov/ead/
  26. Resource Description Framework - http://www.w3.org/RDF/
  27. RDF/XML - http://www.w3.org/TR/rdf-syntax-grammar/
  28. N-Triples - http://www.w3.org/TR/rdf-testcases/#ntriples
  29. N-Quads - http://sw.deri.org/2008/07/n-quads/
  30. RDF/JSON - http://n2.talis.com/wiki/RDF_JSON_Specification
  31. Semantic Web Crawling: A Sitemap Extension - http://sw.deri.org/2007/07/sitemapextension/
  32. voiD Guide - Using the Vocabulary of Interlinked Datasets - http://vocab.deri.ie/void/guide
  33. FOAF - http://www.foaf-project.org/
  34. BIBO - http://bibliontology.com/
  35. DCMI Metadata Terms - http://dublincore.org/documents/dcmi-terms/
  36. OAI Object Re-use and Exchange - http://www.openarchives.org/ore/
  37. FRBR - http://www.ifla.org/publications/functional-requirements-for-bibliographic-records
  38. British Library Catalogue Dataset in RDF/XML - http://www.archive.org/details/BritishLibraryRdf
  39. CIDOC CRM - http://www.cidoc-crm.org/
  40. RDFa in XHTML: Syntax and Processing - http://www.w3.org/TR/rdfa-syntax/
  41. Cool URIs for the Semantic Web - http://www.w3.org/TR/cooluris/
  42. JISC OpenBib Project - http://bibliography.okfn.org/p/jiscopenbib/
  43. CLAROS Project wiki - http://www.clarosnet.org/wiki/index.php?title=Main_Page
  44. Linked Open COPAC Archives Hub - http://blogs.ukoln.ac.uk/locah/