1 of 8

Pursuing the elusive KPI: Filling the gaps in centre self-published standards-related information

CLARIN Standards Committee

2 of 8

What we do

The tasks of the CSC are (quoting from the 2019 Bylaws):

  • to collect, consolidate and prepare for publication in a single place its findings and recommendations related to standards;
  • to maintain the set of standards supported by CLARIN and adapt them to new developments within or outside CLARIN;
  • to publish and promote the standards supported by CLARIN;
  • to develop and implement procedures for the discussion of recommendations and the adoption of new standards;
  • to ensure harmonisation of standards between CLARIN ERIC and related initiatives;
  • to ensure communication with international standards bodies such as (but not limited to) ISO;
  • to advise the BoD in all matters related to standards.

3 of 8

The current questions

Which data standards are accepted for deposit by the various CLARIN B-centres?

Can we create a coherent list of the accepted formats and MIME types?

What are the recommended formats for certain types of resources?

4 of 8

Main objectives for the 2019-2020 cycle

  • To build on the research initiated by Dieter Van Uytvanck for the BoD on the Key Performance Indicators (KPIs) relating to the percentage of centres that publish explicit information on which data formats they accept.
  • From that, we will be able to see which formats are most commonly recommended bottom-up, by the individual centres.
  • Not all centres have published explicit lists of formats:
    • Many use very general statements such as “XML or any machine-readable format”
    • Some point to obsolete CLARIN-related standards lists of various kinds
    • Please see the section at https://www.clarin.eu/content/standards-and-formats#formats
    • If your centre is missing, please consider changing that!
    • (We will be happy to offer some hints, too)

5 of 8

Our current activity

6 of 8

(Partial) recommendations

  • include a timestamp/version number in the list
  • use the English language (in addition to the member language)
  • distinguish between "recommended" and "other" formats, to stress (and reflect) what centres want, not what they have to accept because users bring these formats
  • use an appropriate degree of detail ("XML" alone is usually not specific enough) and recommend best-practice format parameters (e.g. plain text in UTF-8 where possible)
  • follow the format of one of the already existing lists (see next slide)
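
As an illustration only, the recommendations above could be reflected in a machine-readable format list along the following lines. This is a minimal sketch, not a CLARIN schema: the function name `build_format_list`, every field name, the centre name, the version, the date and the listed formats are all hypothetical placeholders chosen for the example.

```python
import json
from datetime import date


def build_format_list(centre, version, updated, recommended, others):
    """Assemble a centre's format list as a plain dictionary.

    The field names are hypothetical; they only mirror the recommendations:
    a version and timestamp, English as the language, and a split between
    recommended formats and others that are merely accepted.
    """
    return {
        "centre": centre,
        "version": version,
        "updated": updated.isoformat(),  # timestamp of the list
        "language": "en",
        "recommended": recommended,
        "others": others,
    }


# All values below are made-up placeholders, not data from any real centre.
example = build_format_list(
    centre="Example Centre",
    version="1.0",
    updated=date(2020, 1, 15),
    recommended=[
        # More specific than just "text" or "XML", with format parameters.
        {"name": "Plain text", "mime_type": "text/plain",
         "parameters": "UTF-8 encoding"},
        {"name": "TEI P5", "mime_type": "application/tei+xml",
         "parameters": "valid against a named schema"},
    ],
    others=[
        {"name": "Microsoft Word (.docx)",
         "mime_type": "application/vnd.openxmlformats-officedocument"
                      ".wordprocessingml.document",
         "parameters": ""},
    ],
)

print(json.dumps(example, indent=2))
```

Keeping "recommended" and "others" as separate lists makes the centre's preference explicit, and the version and date make it possible to tell whether a published list is still current.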

7 of 8

Examples to follow

8 of 8

Thanks!
