1 of 19

Danish newspapers registry

From printed books to Linked Data

2 of 19

Statsbiblioteket (The State and University Library) in Aarhus

3 printed volumes with historical metadata about Danish newspaper publishing

Project goal: to digitize the volumes and build a web application for editing existing data and creating new data

3 of 19

Example from the book

4 of 19

Domain model

5 of 19

Data workflow

  1. TEI XML markup schema was generated
  2. Paper books were scanned to PDF (1920 pages)
  3. Page images were processed with optical character recognition (OCR)
  4. 3 TEI XML files were produced (8 MB) and validated
  5. TEI files were converted to TriX (240 MB) using XSLT 2 (1850 lines)
  6. Over 800K RDF quads were imported into the Dydra cloud triplestore
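
To make the target format of step 5 concrete, here is a minimal sketch of what a TriX serialization of quads looks like. The project itself used XSLT 2 for the conversion; this Python stand-in only illustrates the output structure. The function name `quads_to_trix` and the tuple layout are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TRIX_NS = "http://www.w3.org/2004/03/trix/trix-1/"


def quads_to_trix(quads):
    """Serialize (graph, subject, predicate, object) tuples as a TriX string.

    Hypothetical helper: subjects and predicates are URIs, and each object
    is a pair ("uri", value) or ("literal", value).
    """
    ET.register_namespace("", TRIX_NS)
    root = ET.Element(f"{{{TRIX_NS}}}TriX")
    graphs = {}  # graph URI -> its <graph> element
    for graph, subject, predicate, (kind, value) in quads:
        if graph not in graphs:
            g = ET.SubElement(root, f"{{{TRIX_NS}}}graph")
            # The first <uri> child of <graph> names the graph.
            ET.SubElement(g, f"{{{TRIX_NS}}}uri").text = graph
            graphs[graph] = g
        triple = ET.SubElement(graphs[graph], f"{{{TRIX_NS}}}triple")
        ET.SubElement(triple, f"{{{TRIX_NS}}}uri").text = subject
        ET.SubElement(triple, f"{{{TRIX_NS}}}uri").text = predicate
        tag = "uri" if kind == "uri" else "plainLiteral"
        ET.SubElement(triple, f"{{{TRIX_NS}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    graph = "http://dedanskeaviser.dk/graphs/newspapers/39-14"
    item = "http://dedanskeaviser.dk/newspapers/39-14"
    print(quads_to_trix([
        (graph, item, "http://purl.org/dc/terms/identifier", ("literal", "39-14")),
    ]))
```

Triples that share a graph URI end up under one `<graph>` element, which is what lets a single file carry many named graphs into the triplestore.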

6 of 19

URI structure

  • Base URI: http://dedanskeaviser.dk/
  • Container per type: <persons>
  • Item URI template: /persons/{identifier}
    • Build using natural ID/key: dct:identifier
    • Avoid editable properties: dct:title
    • Otherwise, generate unique ID: generate-id()
      • Avoid clashes: /{volume}/{id}
      • Same for blank node IDs
    • Remember to URI-encode: encode-for-uri()
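
The URI rules above can be sketched in a few lines. The project built URIs in XSLT with `encode-for-uri()`; the Python below is only an illustration, and the function name `item_uri`, its parameters, and the placement of the volume segment directly after the container are assumptions.

```python
from urllib.parse import quote

BASE_URI = "http://dedanskeaviser.dk/"


def item_uri(container, identifier, volume=None):
    """Build an item URI from a container name and an identifier.

    When `volume` is given, the identifier is treated as a generated ID
    and is prefixed with the volume so IDs from different volumes
    cannot clash.
    """
    segments = [container]
    if volume is not None:
        segments.append(str(volume))
    segments.append(str(identifier))
    # safe="" makes quote() percent-encode "/" as well, so an identifier
    # can never introduce an extra path segment.
    return BASE_URI + "/".join(quote(s, safe="") for s in segments)


if __name__ == "__main__":
    print(item_uri("newspapers", "39-14"))
    print(item_uri("persons", "d1e42", volume=2))
```

Keying the URI on a stable identifier rather than an editable property such as the title means the URI survives later edits to the record.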

7 of 19

Vocabularies

Documents

  • SIOC sioc:Container, sioc:has_container
  • FOAF foaf:Document, foaf:isPrimaryTopicOf

Domain: use the most specific vocabulary, multiple RDF types

8 of 19

Named graphs

  • Graph-per-resource pattern (from Linked Data Patterns book)
  • Not exactly: more like graph-per-document:
    • 1 document
    • optional domain resource(s), e.g. newspaper and its editions
  • Resources from the same graph are edited together
  • Default graph used for provenance data about named graphs

9 of 19

Layout mockup

10 of 19

Implementation

  • Design domain ontology in OWL
    • define subclasses & subproperties if needed
  • Import RDF files into a triplestore
    • transform to other vocabularies (FRBR, OCLC, schema.org) if needed, e.g. using SPARQL Update
  • Build Graphity web application
    • create Maven Web project with Graphity Platform dependency
    • define declarative webapp structure (“sitemap”)
    • write layout XSLT stylesheets
  • Done!

11 of 19

Graphity sitemap ontology

Linked Data application ontology

  • Configuration (SPARQL endpoint, XSLT master stylesheet)
  • Container and document resources
  • Templates (override default ones)
    • URI: default /{.*}, overridden with e.g. /persons
    • query: default DESCRIBE ?this, overridden with e.g. DESCRIBE <persons>
      • query templates use SPIN Modeling Vocabulary
      • must be DESCRIBE or CONSTRUCT
  • SIOC user accounts and ACL authorizations
    • used by Graphity Platform authorization filter
  • SPIN data quality constraints

12 of 19

Layout stylesheets

  • RDF/XML transformed to XHTML using XSLT 2 stylesheets
    • server-side, run on Saxon
    • client-side, run on Saxon-CE (XSLT 2 processor in the browser)
  • Master includes per-container stylesheets, imports Graphity XSLT
  • Override & customize Graphity default templates
    • Read, Table, List, Map, Edit layout modes (Bootstrap-based)
  • Multilingual
  • HTTP headers, sitemap, ontologies, external Linked Data available as XSLT parameters or XML side-documents

13 of 19

Editing mode

  • Provided by Graphity
  • RDF/POST encodes RDF as standard HTML form inputs
  • RDF graph content manipulated as (X)HTML DOM
    • using both jQuery and Saxon-CE
  • Input validated using SPIN constraints
    • Check for missing properties, for example
  • Valid RDF forwarded to the triplestore
    • named graph URI like /graphs/newspapers/39-14
    • graph content created or replaced
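
The "missing properties" check can be illustrated with a plain-Python stand-in. This is not SPIN and not Graphity's implementation; the required-property list and the function name `missing_properties` are hypothetical, chosen only to show the kind of rule a constraint would express before the graph is forwarded to the triplestore.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"

# Hypothetical rule set: properties every submitted resource must carry.
REQUIRED = (RDF_TYPE, DCT_IDENTIFIER)


def missing_properties(triples, required=REQUIRED):
    """Return {subject: [missing predicates]} for incomplete subjects.

    `triples` is an iterable of (subject, predicate, object) tuples.
    An empty result means the input passes this check.
    """
    seen = {}  # subject -> set of predicates used on it
    for subject, predicate, _obj in triples:
        seen.setdefault(subject, set()).add(predicate)
    report = {}
    for subject, predicates in seen.items():
        missing = [p for p in required if p not in predicates]
        if missing:
            report[subject] = missing
    return report


if __name__ == "__main__":
    submitted = [
        ("http://dedanskeaviser.dk/newspapers/39-14", RDF_TYPE,
         "http://xmlns.com/foaf/0.1/Document"),
    ]
    print(missing_properties(submitted))
```

Only input with an empty report would be forwarded; anything else would be sent back to the edit form with the violations shown.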

14 of 19

Faceted search

  • Provided by Graphity
  • Implemented by adding FILTER clauses to the base query on-the-fly
    • text search as regex()/str()
    • multiple choices as FILTER (?var IN (...))
  • Pagination and ordering supported
    • requires SELECT sub-query
    • OFFSET/LIMIT set on-the-fly
    • ORDER BY set on-the-fly
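
A rough sketch of how such a query could be assembled on the fly is shown below. It is not Graphity's code: the helper names (`build_query`, `regex_filter`, `in_filter`) and the exact query shape (a DESCRIBE wrapped around a paginated SELECT sub-query) are assumptions based on the bullets above.

```python
def regex_filter(var, text):
    """Case-insensitive text search on str(var).

    Escapes backslashes and double quotes for the SPARQL string literal;
    regex metacharacters in `text` are passed through unchanged.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'regex(str({var}), "{escaped}", "i")'


def in_filter(var, uris):
    """Multiple-choice facet: var must be one of the given URIs."""
    choices = ", ".join(f"<{u}>" for u in uris)
    return f"{var} IN ({choices})"


def build_query(pattern, filters=(), order_by=None, limit=20, offset=0):
    """Wrap a graph pattern in DESCRIBE with a paginated SELECT sub-query."""
    lines = [
        "DESCRIBE ?this",
        "WHERE {",
        "  {",
        "    SELECT ?this",
        "    WHERE {",
        f"      {pattern}",
    ]
    lines += [f"      FILTER ({f})" for f in filters]
    lines.append("    }")
    if order_by:
        lines.append(f"    ORDER BY {order_by}")
    lines.append(f"    LIMIT {limit}")
    lines.append(f"    OFFSET {offset}")
    lines += ["  }", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_query(
        "?this <http://purl.org/dc/terms/title> ?title .",
        filters=[regex_filter("?title", "avis")],
        order_by="?title",
        limit=10,
        offset=20,
    ))
```

The sub-query is what makes pagination meaningful: LIMIT and OFFSET apply to the selected resources, and the outer DESCRIBE then returns the full description of just that page. A production version would build the query from a parsed form rather than strings, to avoid injection through user input.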

15 of 19

Newspaper layout

16 of 19

Editing layout

17 of 19

Faceted search layout

18 of 19

Get in touch

We are based in Copenhagen, Denmark and Kaunas, Lithuania

Email: martynas@graphity.org

Twitter: https://twitter.com/pumba_lt

LinkedIn: http://www.linkedin.com/in/martynasjusevicius

Our homepage: http://graphityhq.com

19 of 19

If you are interested in hearing more...

“Graphity: generic Linked Data platform for interactive Web Applications”

13:00-14:00

Thank you! Questions?