1 of 10

Reading-bound data as inline secondary tags

Reading-bound data is best transported as inline secondary tags, as shown both by practical experience and by the implementation complexity of the alternatives.

2 of 10

The Apertium Mission

The Apertium bylaws (https://wiki.apertium.org/wiki/Bylaws) include goals beyond machine translation:

To maintain a modular, documented, open platform for machine translation and other human language processing tasks,
To favour the interchange and reuse of existing linguistic data,
To make integration with other free/open-source technologies easier

3 of 10

What do Apertium devs want?

Of those who replied to the apertium-stuff threads (1, 2, 3), the majority expressed a preference for

Short textual prefixes
Easy reusability of monolingual data

"...we're stating that we're OK having crappy monodixes because we *fix* that later on with trimming. I'm sure that's where we are now, but as a project that focuses a lot on provided free (as in speech) language resources that are later used for many other use cases, I don't feel comfortable with that status. I think we should aim to have as correct as possible dictionaries." [#]

4 of 10

Why?

Empowers language developers to do what they want, without needing to change all modules in the pipe or throw away information arbitrarily
Inline for easier readability and implementability
Backwards compatible - adding secondary data in monolinguals will not affect any pair that doesn't explicitly make use of them
Optional - nobody is forced to use any secondary tags
Pipes start out as simple as ever - they grow over time. Apertium's potential growth is severely limited by the current design. Free-form secondary tags almost eliminate this limitation.

5 of 10

Uses

Note that this is not about LU-bound data - the LU implementation is settled.

Potential uses, non-exhaustive list:

Syntactic function
Sentiment
Semantic prototypes (noun, adjective)
Semantic roles
Verb frames
Dependency

6 of 10

High-level examples

Sentiment: <sn:pos> - positive
Semantic prototypes: <s:Adom> - animal, domesticated; <s:cc-H> - concrete-countable, human-made
Semantic roles: <r:th> - theme; <r:ag> - agent
Verb frames: <vf:exist>
Dependency: <id:5><ip:7><il:subj> - self is #5, parent is #7, arc label is subject
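
To illustrate how little machinery the prefixed tags need, here is a minimal Python sketch that separates primary tags from secondary ones. It assumes that a secondary tag is any tag of the form <prefix:value> and that everything else is a primary tag; the function name split_tags is hypothetical and not part of any Apertium tool.

```python
import re

# Matches the content of any <...> tag.
TAG_RE = re.compile(r"<([^<>]+)>")


def split_tags(analysis):
    """Split a reading such as 'dog<n><pl><s:Adom>' into lemma, primary tags
    and secondary tags.

    Assumption (not from the source): a tag is secondary if and only if it
    contains a colon, i.e. has the shape <prefix:value>.
    """
    lemma = analysis.split("<", 1)[0]
    primary = []
    secondary = {}
    for tag in TAG_RE.findall(analysis):
        prefix, sep, value = tag.partition(":")
        if sep:
            # A prefix may occur more than once, so collect values in a list.
            secondary.setdefault(prefix, []).append(value)
        else:
            primary.append(tag)
    return lemma, primary, secondary


lemma, primary, secondary = split_tags("dog<n><pl><s:Adom><r:atr><id:7><ip:5>")
print(lemma)      # dog
print(primary)    # ['n', 'pl']
print(secondary)  # {'s': ['Adom'], 'r': ['atr'], 'id': ['7'], 'ip': ['5']}
```

A module that does not care about secondary tags can simply ignore the third return value.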

7 of 10

In the stream

^There/there<adv><id:1><ip:2>$

^is/be<vbser><pres><p3><sg><vf:exist><id:2><ip:0>$

^world/world<n><sg><s:Lstar><r:th><id:4><ip:2>$

^dogs/dog<n><pl><s:Adom><r:atr><id:7><ip:5>$

^and/and<cnjcoo><id:8><ip:7>$

^people/people<n><sg><s:H><r:atr><id:9><ip:7>$

...easy to grep for, easy to write scripts for, easy to make filters for.
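
As a rough illustration of that claim, the following Python sketch shows three small filters over a stream like the one above: one that finds surface forms carrying a given secondary tag, one that strips all secondary tags, and one that lists dependency arcs. It assumes one reading per lexical unit, no escaped characters, and that secondary tags are exactly the tags containing a colon; the function names are hypothetical and this is not an existing Apertium tool.

```python
import re

# One lexical unit: ^surface/analysis$
# Assumption: a single reading per unit and no escaped characters.
LU_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")
# Assumption: a secondary tag is any tag of the form <prefix:value>.
SECONDARY_RE = re.compile(r"<[^<>:]+:[^<>]*>")


def grep_tag(stream, tag):
    """Return the surface forms whose reading carries the given tag."""
    return [surface for surface, analysis in LU_RE.findall(stream)
            if tag in analysis]


def strip_secondary(stream):
    """Remove every secondary tag, leaving only the primary tags."""
    return SECONDARY_RE.sub("", stream)


def dependency_arcs(stream):
    """Return (self_id, parent_id, surface) for units with <id:N><ip:N>."""
    arcs = []
    for surface, analysis in LU_RE.findall(stream):
        self_id = re.search(r"<id:(\d+)>", analysis)
        parent_id = re.search(r"<ip:(\d+)>", analysis)
        if self_id and parent_id:
            arcs.append((int(self_id.group(1)), int(parent_id.group(1)), surface))
    return arcs


stream = (
    "^There/there<adv><id:1><ip:2>$ "
    "^is/be<vbser><pres><p3><sg><vf:exist><id:2><ip:0>$ "
    "^dogs/dog<n><pl><s:Adom><r:atr><id:7><ip:5>$"
)

print(grep_tag(stream, "<s:Adom>"))  # ['dogs']
print(strip_secondary(stream))
print(dependency_arcs(stream))       # [(1, 2, 'There'), (2, 0, 'is'), (7, 5, 'dogs')]
```

The strip filter also hints at how a pair that does not use secondary tags could stay unaffected: with the tags removed, the stream is the plain one that existing modules already handle.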

8 of 10

Practical experience

In VISL, GramTrans, and the Greenlandic pipes, we have free-form inline secondary tags. Works great.

VISL's style is fairly ugly and inscrutable - Apertium's doesn't need to be.

9 of 10

Alternatives

Because of the near-veto of the proposed format, alternative implementations were investigated.
The alternatives are quite literally orders of magnitude harder to implement and work with, because:

They put the data in the LU-bound superblank, not on the readings themselves. This is MUCH harder to parse.
To tie data to the reading, IDs are needed - currently about 4 different IDs, and they get very big.
Having to refer to multiple locations is horrendous for humans who try to read the stream.