1 of 14

*cats & LLOD

Menzo Windhouwer

KNAW HuC

CLARIAH

CLARIN

2 of 14

ISO 12620:1999

*cats & LLOD

2

10/5/22

3 of 14

Towards a Data Category Registry

  • Problems with ISO 12620:1999, a hardcoded list of data categories:
    • Not easily extensible
    • Ordering heavily debated
    • Outdated and limited in range at the moment of release


4 of 14

ISO 12620:2009

  • Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources
    • A data model for data category specifications inspired by ISO 11179
    • A procedure to standardize data category specifications, compliant with Annex ST
    • Each data category gets a unique Persistent Identifier (PID)
    • The Max Planck Institute for Psycholinguistics is appointed as the Registration Authority of the ISO/TC 37 DCR
  • Referred to by a growing number of ISO TC 37 standards
    • Lexical Markup Framework (LMF)
    • Linguistic Annotation Framework (LAF)
    • Morpho-syntactic Annotation Framework (MAF)
    • There could be more, e.g., Feature System Declarations (FSD)


5 of 14

Example Data Category specification

  • Data category: /Grammatical gender/
    • Administrative part:
      • Identifier: grammaticalGender
      • PID: http://www.isocat.org/datcat/DC-1297
    • Descriptive part:
      • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
      • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
    • Linguistic part:
      • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
      • French conceptual domain: /masculine/, /feminine/
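
The three parts of the specification above can be held in one small record. This is a minimal sketch, not ISOcat's actual data model: the class name `DataCategory`, its field names, and the helper `values_for` are assumptions made for illustration. Only the identifier, PID, definitions, and value lists come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Illustrative record with the three parts of a specification."""

    # Administrative part
    identifier: str
    pid: str
    # Descriptive part: language code -> definition text
    definitions: dict = field(default_factory=dict)
    # Linguistic part: profile or language name -> allowed values
    conceptual_domains: dict = field(default_factory=dict)

    def values_for(self, domain):
        """Return the values allowed in one conceptual domain (empty if unknown)."""
        return list(self.conceptual_domains.get(domain, []))


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
        "fr": "Catégorie fondée (selon la langue) sur la distinction "
              "naturelle entre les sexes ou d'autres critères formels.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

print(grammatical_gender.values_for("French"))
```

The per-language domain is a subset of the profile-level domain, which is why the French list lacks /neuter/.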


6 of 14

Data Category types

ISOcat introduction

6

10/4/22

  • complex:
    • open: writtenForm (string)
    • closed: grammaticalGender (string)
      • simple: neuter, masculine, feminine
    • constrained: email (string)
      • Constraint: .+@.+
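
How the open, closed, and constrained kinds differ can be shown by checking a value against each. This is a rough sketch under assumptions: the function `is_valid` and its parameters are hypothetical, and the constraint is treated as a regular expression that must match the whole value (as XML Schema patterns do), which the slide does not state.

```python
import re


def is_valid(value, kind, allowed=None, pattern=None):
    """Check a string value against a complex data category of the given kind."""
    if kind == "open":
        # Any string is acceptable.
        return isinstance(value, str)
    if kind == "closed":
        # The value must be one of the listed simple data categories.
        return value in (allowed or ())
    if kind == "constrained":
        # The value must match the constraint as a whole (assumption).
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unknown kind: {kind}")


GENDERS = {"neuter", "masculine", "feminine"}

print(is_valid("huis", "open"))
print(is_valid("neuter", "closed", allowed=GENDERS))
print(is_valid("someone@example.org", "constrained", pattern=r".+@.+"))
```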

7 of 14


No ontological relationships?

  • Rationale:
    • Relation types and modeling strategies for a given data category may differ from application to application;
    • Motivation to agree on relation and modeling strategies will be stronger at individual application level;
    • Integration of multiple relation structures in DCR itself could lead to endless ontological clutter.

Solution under development: RELcat, a Relation Registry

8 of 14

ISOcat - the ISO TC 37/DCR


  • A (coherent) set of Data Categories, in our case for linguistic resources
  • A system to manage this set:
    • Create and edit Data Categories
    • Share Data Categories, e.g., resolve PID references
    • Standardize Data Categories
  • An API for tools to access the DCR
    • e.g., ELAN, LEXUS, CMDI Component Editor, Arbil

  • Grass roots approach
    • Anyone can access the DCR and use or create the data categories (s)he needs

9 of 14


How to make semantics explicit?

  • Associate data categories with your resources
    • using the PIDs
  • Where to put the PIDs?
    • Preferably in a schema
    • Or in the resource itself (redundant)
    • Or in the metadata of the resource (less specific)
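
Putting the PID in a schema could look like the sketch below, which collects the data category references from a small XML Schema. The `dcr:datcat` attribute and the namespace `http://www.isocat.org/ns/dcr` are assumptions about the annotation convention; the slides do not name them. The element names `gender` and `note` are invented for the example.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"
DCR_NS = "http://www.isocat.org/ns/dcr"  # assumed annotation namespace

# Hypothetical schema: one element carries a data category reference.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <xs:element name="gender" type="xs:string"
              dcr:datcat="http://www.isocat.org/datcat/DC-1297"/>
  <xs:element name="note" type="xs:string"/>
</xs:schema>
"""


def datcat_references(schema_text):
    """Map element names to the data category PID they are annotated with."""
    root = ET.fromstring(schema_text)
    refs = {}
    for element in root.iter(f"{{{XS_NS}}}element"):
        pid = element.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:
            refs[element.get("name")] = pid
    return refs


print(datcat_references(SCHEMA))
```

Annotating the schema once covers every resource that follows it, which is why the schema is the preferred place.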

10 of 14


Persistent Identifiers

  • Why not use the Data Category identifier?
    • Ambiguity: different domains use the same term but mean different ‘things’
    • Semantic rot: even in the same domain the meaning of a term changes over time
    • Persistence: for archived resources Data Category references should still be resolvable and point to the specification as it was at/close to time of creation
  • Use the Persistent IDentifier
    • ISO 24619:2011 Language resource management -- Persistent identification and access in language technology applications
    • ISOcat uses ‘cool URIs’
      • http://www.isocat.org/datcat/DC-1297 (/grammaticalGender/)
    • managed by the system
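
The ambiguity argument can be made concrete with a toy lookup table. In this sketch only the DC-1297 entry comes from the slides. The two `example.org` entries, the table name `SPECS`, and the function `pids_by_identifier` are hypothetical and exist only to show two specifications sharing one identifier.

```python
from collections import defaultdict

# PID -> (identifier, note). The example.org entries are hypothetical.
SPECS = {
    "http://www.isocat.org/datcat/DC-1297": ("grammaticalGender", "from the slides"),
    "http://example.org/datcat/A": ("gender", "hypothetical: a grammatical category"),
    "http://example.org/datcat/B": ("gender", "hypothetical: a property of a speaker"),
}


def pids_by_identifier(specs):
    """Group PIDs by identifier to show which identifiers are ambiguous."""
    index = defaultdict(list)
    for pid, (identifier, _note) in specs.items():
        index[identifier].append(pid)
    return dict(index)


index = pids_by_identifier(SPECS)
ambiguous = sorted(name for name, pids in index.items() if len(pids) > 1)
print(ambiguous)
```

A reference by identifier cannot choose between the two "gender" entries, while each PID selects exactly one specification.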

11 of 14


ISOcat vision

[Diagram; labels recovered from the figure]

  • Data Category Registry - ISOcat
  • Relation Registry - RELcat
  • Schema Registry - SCHEMAcat
  • Concept Registry
  • Linguistic knowledge base
  • Linguistic resource (schema)
  • Data categories, Containers, Concepts, Relation

12 of 14

Metadata TDG

  • Standardization efforts of the Metadata Thematic Domain Group (TDG) stalled
    • Large overlap with the work/people at the Athens-Core meetings
      • Community level agreement is maybe enough
    • Motivation for activity should not depend on one person only, the TDG chair
      • The need for explicit and shared semantics is not clear enough yet … more evangelization needed
    • Unfamiliarity with the work
      • Terminologists are more used to this kind of review work
      • Online review vs. old ISO ‘paper’ process
    • Members have little time; it is difficult to sync schedules
      • TDG experts tend to be senior scientists
      • Continuous process vs. sporadic bursts of activity
    • Unpaid work
      • Project funding vs. wide acceptance in the community
      • However, a project might bootstrap a thematic domain
  • The same problems hold for other TDGs
    • Current tendency to tie data category (selection) standardization to a new/revised standard, e.g., MAF and TBX
    • Redesign of the standardization process is coming up
      • ISO is not actively supporting Annex ST Standards as Databases anymore


13 of 14

Linguistic Linked Open Data

  • Linked Data started its rise, including in linguistics


14 of 14

The demise

  • MPI trimmed down The Language Archive to its bare bones, i.e., an archive
    • Development of the *cats stopped
    • MPI stepped down as the DCR Registration Authority
  • ISO TC37 and CLARIN each went their own way
    • CLARIN: took their relevant DCs to the SKOS-based CCR
      • gave them handles
      • assigned national CCR coordinators to manage the content
      • hosted at the Meertens Institute and I followed
    • ISO TC37: took isocat.org and its DCs to
      • a static site
      • Termweb to do curation by a benevolent dictator
      • datcat.info, which leads you to an empty(?) Termweb
  • ISO didn’t like isocat.org anymore and shut down the domain
    • The cool URIs were not cool anymore ☹
