Linked Data Corpus Creation with NIF

This document describes best practices for creating Linked Data text corpora using the NLP Interchange Format (NIF).

Target audience

Corpus creators and users seeking to make corpora interoperable and to publish them as linked data. Basic knowledge of RDF is mandatory for conversion. Basic knowledge of linked data and web server access is needed for publication.

Scope

Conversion of existing corpora into RDF using NIF, as well as creation of linked data corpora from textual data.

Core concepts

Corpus

We understand a corpus as a collection of documents. Documents contain text, represented as strings of characters, and annotations that provide more information about these strings. NIF provides a way to identify strings using URIs and to annotate them using an ontology.

String identification via URI

Strings are identified using a URI scheme consisting of three parts: a URI prefix identifying the document within the corpus; a scheme identifier placed between the document URI and the string position; and the character indices of the beginning and end of the string. Character indices in NIF are offset-based: counting starts at zero before the first character and proceeds through the gaps between characters, up to the gap after the last character of the referenced string. For example: http://example.org/corpus/document#offset_4_10

This URI scheme is valid for text/plain. Other MIME types may require different URI schemes.
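The offset counting described above can be illustrated with a minimal Python sketch. The helper names `offset_uri` and `locate` are hypothetical and not part of NIF; the fragment form `#offset_<begin>_<end>` follows the URIs used in the example of this document. The sketch assumes that Python string indices (which count Unicode code points) are the intended unit of counting for the corpus at hand.

```python
from typing import Tuple


def offset_uri(document_uri: str, begin: int, end: int) -> str:
    """Build an offset-based string URI for a text/plain document (hypothetical helper)."""
    if not 0 <= begin <= end:
        raise ValueError("expected 0 <= begin <= end")
    return f"{document_uri}#offset_{begin}_{end}"


def locate(text: str, substring: str) -> Tuple[int, int]:
    """Return the (begin, end) gap offsets of the first occurrence of substring."""
    begin = text.index(substring)  # raises ValueError if the substring is absent
    return begin, begin + len(substring)


text = "The Semantic Web is a good idea."

# Offsets count gaps between characters, so text[begin:end] is the referenced string.
begin, end = locate(text, "Semantic Web")
phrase_uri = offset_uri("http://example.org/sem", begin, end)

# The whole document text runs from gap 0 to gap len(text).
context_uri = offset_uri("http://example.org/sem", 0, len(text))

print(phrase_uri)
print(context_uri)
```

Because the indices count gaps rather than characters, they coincide with Python's slice boundaries, so no off-by-one adjustment is needed when converting between the two.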

String annotation

After assigning URIs to meaningful strings of the corpus, these URIs can be annotated using the NIF core ontology (see the example below).

Example

The Semantic Web is a good idea.

@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Context: holds the full document text
<http://example.org/sem#offset_0_32>
    a nif:String , nif:Context , nif:OffsetBasedString ;
    nif:isString "The Semantic Web is a good idea."@en ;
    nif:beginIndex "0"^^xsd:int ;
    nif:endIndex "32"^^xsd:int .

# Sentence: here it spans the whole text, so it shares the Context URI
<http://example.org/sem#offset_0_32>
    a nif:String , nif:Sentence , nif:OffsetBasedString ;
    nif:anchorOf "The Semantic Web is a good idea."@en ;
    nif:beginIndex "0"^^xsd:int ;
    nif:endIndex "32"^^xsd:int ;
    nif:referenceContext <http://example.org/sem#offset_0_32> .

# Phrase: carries a POS tag (via OLiA) and an entity reference
<http://example.org/sem#offset_4_16>
    a nif:String , nif:Phrase , nif:OffsetBasedString ;
    nif:anchorOf "Semantic Web"@en ;
    nif:beginIndex "4"^^xsd:int ;
    nif:endIndex "16"^^xsd:int ;
    nif:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Semantic_Web> ;
    nif:referenceContext <http://example.org/sem#offset_0_32> .

Document

Sentence

  • Contains the string in nif:anchorOf
  • Refers to the Context with nif:referenceContext

Context

  • Contains the document text in nif:isString
  • nif:beginIndex is always 0

Words, Phrases

  • Contain the string in nif:anchorOf
  • Refer to the Context with nif:referenceContext
  • POS tags mapped via OLiA
  • Entity references via itsrdf:taIdentRef
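The roles listed above can be sketched in Python as two small functions that emit Turtle fragments from a plain-text document. This is only an illustration under stated assumptions: the function names (`turtle_literal`, `string_uri`, `nif_context`, `nif_substring`) are hypothetical, the escaping covers only backslashes, double quotes, and newlines, and the output assumes that the `nif:` and `xsd:` prefixes are declared elsewhere in the target file.

```python
def turtle_literal(value: str, lang: str = "en") -> str:
    """Quote a string as a language-tagged Turtle literal (minimal escaping only)."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"@{lang}'


def string_uri(document_uri: str, begin: int, end: int) -> str:
    """Offset-based URI in angle brackets, ready for use in Turtle."""
    return f"<{document_uri}#offset_{begin}_{end}>"


def nif_context(document_uri: str, text: str) -> str:
    """Context: holds the document text in nif:isString; beginIndex is always 0."""
    return (
        f"{string_uri(document_uri, 0, len(text))}\n"
        f"    a nif:String , nif:Context , nif:OffsetBasedString ;\n"
        f"    nif:isString {turtle_literal(text)} ;\n"
        f'    nif:beginIndex "0"^^xsd:int ;\n'
        f'    nif:endIndex "{len(text)}"^^xsd:int .'
    )


def nif_substring(document_uri: str, text: str, begin: int, end: int,
                  nif_class: str) -> str:
    """Sentence, Phrase or Word: anchorOf is sliced from the context text,
    and nif:referenceContext points back to the Context."""
    if not 0 <= begin <= end <= len(text):
        raise ValueError("offsets must satisfy 0 <= begin <= end <= len(text)")
    return (
        f"{string_uri(document_uri, begin, end)}\n"
        f"    a nif:String , {nif_class} , nif:OffsetBasedString ;\n"
        f"    nif:anchorOf {turtle_literal(text[begin:end])} ;\n"
        f'    nif:beginIndex "{begin}"^^xsd:int ;\n'
        f'    nif:endIndex "{end}"^^xsd:int ;\n'
        f"    nif:referenceContext {string_uri(document_uri, 0, len(text))} ."
    )


text = "The Semantic Web is a good idea."
context = nif_context("http://example.org/sem", text)
phrase = nif_substring("http://example.org/sem", text, 4, 16, "nif:Phrase")

print(context)
print()
print(phrase)
```

Deriving nif:anchorOf from a slice of the context text, rather than passing it in separately, keeps the anchored string and its offsets consistent by construction. Annotations such as nif:oliaLink or itsrdf:taIdentRef would be added as further predicates on the same subject.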

Find a real-world example at http://brown.nlp2rdf.org