Linked Data Corpus Creation with NIF
This document describes best practices to follow for the generation of Linked Data text corpora, using the NLP Interchange Format (NIF).
Target audience
Corpus creators and users seeking to make corpora interoperable and to publish them as linked data. Basic knowledge of RDF is mandatory for conversion. Basic knowledge of linked data and web server access is needed for publication.
Scope
Conversion of existing corpora into RDF using NIF, as well as creation of linked data corpora from textual data.
Core concepts
Corpus
We understand a corpus as a collection of documents. Documents contain text, represented as strings of characters, and annotations that provide more information about these strings. NIF provides a way to identify strings using URIs and to annotate them using an ontology.
String identification via URI
Strings are identified using a URI scheme consisting of: the prefix of the corpus URI; the character indices of the beginning and end of the string; and a scheme identifier between the document URI and the string position identifier. Character indices in NIF are offset-based: counting starts at zero before the first character and counts the gaps between characters, up to the gap after the last character of the referenced string. For example: http://example.org/corpus/document#offset_4_10
This URI scheme is valid for text/plain. Other mime types may require different URI schemes.
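The offset counting described above can be sketched in a few lines of Python. This is only an illustration, not part of NIF tooling: the helper name offset_uri is hypothetical, and the sketch assumes that NIF indices line up with Python string indices, which count Unicode code points.

```python
def offset_uri(document_uri: str, begin: int, end: int) -> str:
    """Build an offset-based string URI: <document_uri>#offset_<begin>_<end>.

    Hypothetical helper for illustration; indices count the gaps between
    characters, starting at zero before the first character.
    """
    if not 0 <= begin <= end:
        raise ValueError("expected 0 <= begin <= end")
    return f"{document_uri}#offset_{begin}_{end}"


text = "The Semantic Web is a good idea."

# Locate the phrase and derive its begin and end offsets.
begin = text.index("Semantic Web")
end = begin + len("Semantic Web")

phrase_uri = offset_uri("http://example.org/sem", begin, end)
document_string_uri = offset_uri("http://example.org/sem", 0, len(text))

print(phrase_uri)
print(document_string_uri)
```

Because the indices count gaps rather than characters, the end index is exclusive, so text[begin:end] recovers exactly the referenced string.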
String annotation
After assigning URIs to meaningful strings of the corpus, these URIs can be annotated using the NIF Core Ontology (see the example below).
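As a rough sketch of this annotation step, the following Python function renders the basic NIF properties for one string as a Turtle snippet. The function name nif_string_turtle and its parameters are assumptions made for illustration, it uses plain string formatting rather than an RDF library, and it assumes the whole document text serves as the reference context.

```python
def nif_string_turtle(document_uri, context, begin, end, nif_class, lang="en"):
    """Render one offset-based NIF string description as a Turtle snippet.

    Hypothetical helper: 'context' is the full document text and
    'nif_class' is a NIF class name such as "Phrase" or "Sentence".
    """
    if not 0 <= begin <= end <= len(context):
        raise ValueError("indices must satisfy 0 <= begin <= end <= len(context)")
    subject = f"{document_uri}#offset_{begin}_{end}"
    # The context is the whole document text, so it spans 0..len(context).
    context_uri = f"{document_uri}#offset_0_{len(context)}"
    # Escape characters that may not appear raw in a short Turtle literal.
    anchor = (context[begin:end]
              .replace("\\", "\\\\")
              .replace('"', '\\"')
              .replace("\n", "\\n")
              .replace("\r", "\\r"))
    return "\n".join([
        f"<{subject}>",
        f"    a nif:String , nif:{nif_class} , nif:OffsetBasedString ;",
        f'    nif:anchorOf "{anchor}"@{lang} ;',
        f'    nif:beginIndex "{begin}"^^xsd:int ;',
        f'    nif:endIndex "{end}"^^xsd:int ;',
        f"    nif:referenceContext <{context_uri}> .",
    ])


snippet = nif_string_turtle(
    "http://example.org/sem", "The Semantic Web is a good idea.", 4, 16, "Phrase"
)
print(snippet)
```

The snippet does not emit prefix declarations, so the nif: and xsd: prefixes would have to be declared once at the top of the resulting Turtle file.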
Website: http://site.nlp2rdf.org
Github: http://github.com/nlp2rdf
Example corpus: http://brown.nlp2rdf.org
Example
The Semantic Web is a good idea.
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The whole document text, which serves as the reference context.
<http://example.org/sem#offset_0_32>
    a nif:String , nif:Context , nif:OffsetBasedString ;
    nif:isString "The Semantic Web is a good idea."@en ;
    nif:beginIndex "0"^^xsd:int ;
    nif:endIndex "32"^^xsd:int .

# The sentence, which here spans the same characters as the context.
<http://example.org/sem#offset_0_32>
    a nif:String , nif:Sentence , nif:OffsetBasedString ;
    nif:anchorOf "The Semantic Web is a good idea."@en ;
    nif:beginIndex "0"^^xsd:int ;
    nif:endIndex "32"^^xsd:int ;
    nif:referenceContext
        <http://example.org/sem#offset_0_32> .

# A phrase with a part-of-speech link and an entity link.
<http://example.org/sem#offset_4_16>
    a nif:String , nif:Phrase , nif:OffsetBasedString ;
    nif:anchorOf "Semantic Web"@en ;
    nif:beginIndex "4"^^xsd:int ;
    nif:endIndex "16"^^xsd:int ;
    nif:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    itsrdf:taIdentRef
        <http://dbpedia.org/resource/Semantic_Web> ;
    nif:referenceContext
        <http://example.org/sem#offset_0_32> .
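When converting an existing corpus, it can help to check that every annotated string agrees with its offsets. The sketch below is one possible sanity check, not a NIF validator: the function name check_annotations and the tuple layout (begin, end, anchor) are assumptions for illustration, and indices are again assumed to match Python string indices.

```python
def check_annotations(context, annotations):
    """Return a list of error messages for inconsistent annotations.

    'annotations' is an iterable of (begin, end, anchor) tuples, where
    'anchor' is the string expected at context[begin:end].
    """
    errors = []
    for begin, end, anchor in annotations:
        if not 0 <= begin <= end <= len(context):
            errors.append(f"offset_{begin}_{end}: indices out of range")
        elif context[begin:end] != anchor:
            errors.append(
                f"offset_{begin}_{end}: expected {anchor!r}, "
                f"found {context[begin:end]!r}"
            )
    return errors


context = "The Semantic Web is a good idea."

# Annotations taken from the example: the sentence and the phrase.
good = [(0, 32, context), (4, 16, "Semantic Web")]
# A deliberately wrong end index, to show what a failure looks like.
bad = [(4, 15, "Semantic Web")]

print(check_annotations(context, good))
print(check_annotations(context, bad))
```

An empty list means every anchor matches the text at its offsets; each mismatch is reported with the offsets it was declared under.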
[Figure: Document (Context), Sentence, and Words/Phrases, connected by nif:referenceContext; itsrdf:taIdentRef links to an external resource.]
Find a real world example at http://brown.nlp2rdf.org
Namespaces and Ontologies
nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
olia: http://purl.org/olia