Encoding formats for text alignments
1
Project
URL
Type
Field
Encoding type
Encoding subtype
Languages
What's being aligned?
Alignment mechanism
Separate files?
Smallest alignment
Cross-project alignment?
Sample alignment
Comments
2
Arabic English Parallel News Text
https://secure.ldc.upenn.edu/
project
CL
SGML
custom
multi
work-version <--> work-version
DOC/p/seg/@id
y
sentences
no
<SentPair EnglishSegId= "2,3,4" ArabicSegId= "2" />
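The sample above links several English segment ids to a single Arabic segment id. A minimal Python sketch of reading one such element follows. It assumes the fragment is well-formed enough for an XML parser, which may not hold for whole files in this SGML corpus; the function name `parse_sent_pair` is hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_sent_pair(fragment):
    """Return (english_ids, arabic_ids) as lists of ints from one SentPair element."""
    elem = ET.fromstring(fragment)
    # Each attribute holds a comma-separated list of segment ids.
    english = [int(i) for i in elem.get("EnglishSegId").split(",")]
    arabic = [int(i) for i in elem.get("ArabicSegId").split(",")]
    return english, arabic

pair = parse_sent_pair('<SentPair EnglishSegId= "2,3,4" ArabicSegId= "2" />')
print(pair)  # ([2, 3, 4], [2])
```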
3
English Translation Treebank: An Nahar Newswire
https://secure.ldc.upenn.edu/
project
CL
SGML
custom
multi
work-version <--> work-version
DOC/p/seg/@id
y
sentences
no
4
Juxta
http://www.juxtasoftware.org/
tool
DH
XML
custom
single
work-version <--> work-version
y
word tokens
no
alignment identified in the software, not in the data
5
Multiple-Translation Chinese (MTC)
https://secure.ldc.upenn.edu/
project
CL
SGML
custom
multi
work-version <--> work-version
DOC/p/seg/@id
y
sentences
no
6
PLUG Word Aligner
http://stp.lingfil.uu.se/corpora/plug/pwa/
project
CL
XML
custom
multi
work-version <--> work-version
TEI-like align/seg values
n
word tokens
no
<align id='svenprf3' link='1-1'><seg lang='sv'><s>Sveriges neutralitetspolitik är av avgörande betydelse för vårt lands fred och oberoende.</s></seg><seg lang='en'><s>Sweden's policy of neutrality is of decisive importance for our peace and independence.</s></seg></align>
7
Classical Text Editor
http://cte.oeaw.ac.at
tool
DH
TXT
custom
multi
work-version <--> work-version
anchors in separate files
y
character
no
Text 1: ...ἀνεχώρησε. {Q\s:33\Q}Μετὰ... and Text 2: ...departed. {Q\s:33\Q}Coming...
private, customized markup scheme; the alignment also supports nine levels of apparatus critici, which are kept in the same file as the main text
8
Alpheios
http://repos1.alpheios.net/exist/rest/db/app/align-entersentence.xhtml
tool
DH
XML
custom
multi
work-version <--> work-version
@n and @n-ref values cross-pointing
n
word tokens
no
<wds lnum="L1"><w n="1-1"><text>I</text><refs nrefs="1-3"/></w></wds><wds lnum="L2"><w n="1-3"><text>mir</text><refs nrefs="1-1"/></w></wds>
embedded in HTML code
9
GATE (General Architecture for Text Engineering)
https://gate.ac.uk/sale/tao/splitch20.html
tool
DH
XML
custom
multi
work-version <--> work-version
n
word tokens
no
See URL
exports to XML
10
XCES
http://www.xces.org/
project
DH
XML
custom (based on TEI)
multi
work-version <--> work-version
two layers of links based on element ids
y
word tokens
no
Too many to illustrate concisely. See examples at http://www.xces.org/schema/#standoff
11
PLUG Word Aligner
http://stp.lingfil.uu.se/corpora/plug/pwa/
project
CL
TXT
Linköping
multi
work-version <--> work-version
separate files, aligned by numbered lines
y
sentences
no
1X:8:8:8:8:(1) Den kommer att fullföljas med kraft och konsekvens.(2) It will be pursued with firmness and consistency.
12
GALE Phase 1 Chinese Newsgroup Parallel Text
https://secure.ldc.upenn.edu/
project
CL
TXT
tab-delimited
multi
work-version <--> work-version
separate files, aligned by numbered newlines
y
sentences
no
13
Computational Historical Semantics
http://www.comphistsem.org/
project
DH
XML
TEI
single
Editions <--> lexicon
w/@lemma
y
word tokens
no
@lemma="[{'id':1284294,'ls':[{'id':7478,'name':'prologus','wfs':[{'id':1284309,'name':'prologus'}]}],'name':'prologus@NN'}]"
I know it's not a strict bitext alignment, but the lexicon alignment mechanism is telling
14
Digital Comparative Edition and Translation of the Shorter Chinese Saṃyukta Āgama
http://buddhistinformatics.chibs.edu.tw/BZA/
project
DH
XML
TEI
multi
work-version <--> work-version
standoff file using elements outside TEI namespace
y
text clusters
no
<mets:file MIMETYPE="text/xml" CHECKSUM="47aa467c7246ca102511c24d57ad2311" SIZE="6843" ID="b356za1336" CHECKSUMTYPE="MD5"><mets:FLocat LOCTYPE="URL" xlink:href="file:///xml/b356za1336.xml"/></mets:file>
15
Digital Thoreau
http://digitalthoreau.org/walden
project
DH
XML
TEI
single
work-version <--> work-version
textcrit module + xml:id
y
phrases, readings
no
<app xml:id="walc04-app-0004"><lem>shutter</lem><rdg wit="#wc_0a">crevices </rdg><rdg wit="#wc_0d"><del>crevices</del><add rendition="pencil">shutter</add></rdg></app>
16
English-Norwegian Parallel Corpus
http://www.hf.uio.no/ilos/english/services/omc/enpc/
project
DH
XML
TEI
multi
work-version <--> work-version
s/@id
y
sentences
no
<s id=DL2T.1.s19 corresp=DL2.1.s18>"Ikke glem at du har levd godt i fire år nå."</s>
includes morphology on word tokens; unusual English POS tags
17
Folger Shakespeare
http://www.folgerdigitaltexts.org/
project
DH
XML
TEI
single
Editions <--> commentary
y
word tokens
no
<w xml:id="w0000980" n="1.1.8">relief</w>
alignment possibilities not yet fully realized
18
InterText
http://wanthalf.saga.cz/intertext
tool
CL
XML
TEI
multi
work-version <--> work-version
Multiple @xml:id identified by stand-off file of link elements
y
word tokens
no
<link type='2-1' xtargets='6:1 6:2;7:1' status='man'/>
Permits multiple export formats
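The stand-off `link` element shown for InterText can be unpacked with a few lines of Python. This is only a sketch: it assumes, from the sample alone, that `;` separates the two sides of the alignment and that whitespace separates ids within a side. The function name `parse_link` and the returned dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_link(fragment):
    """Split one link element into its type, status, and per-side id lists."""
    elem = ET.fromstring(fragment)
    # Assumption: ';' divides the sides, whitespace divides ids within a side.
    sides = [side.split() for side in elem.get("xtargets").split(";")]
    return {"type": elem.get("type"), "status": elem.get("status"), "sides": sides}

link = parse_link("<link type='2-1' xtargets='6:1 6:2;7:1' status='man'/>")
print(link["sides"])  # [['6:1', '6:2'], ['7:1']]
```

In the sample, the `type` value `2-1` matches the number of ids on each side, so a consumer could use it as a consistency check.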
19
MULTEXT-east
http://nl.ijs.si/ME/
project
CL
XML
TEI
multi
work-version <--> work-version
link/@xtargets
y
word tokens
no
<link xtargets="Oen.1.1.1.1 Oen.1.1.1.2 ; Oro.1.2.2.1"/>, <link n="1:1" targets="oana-en.xml#Oen.1.1.2.9 oana-ro.xml#Oro.1.2.3.7"/>, <w xml:id="Oen.1.1.2.2.6" lemma="a" ana="#Di">a</w>
Aligns up to ten at a time
20
PELCRA
http://pelcra.pl/
project
CL
XML
TEI
multi
work-version <--> work-version
y
token
no
<w xml:id="w-1"><fs type="morph"><f name="orth">Historia</f></fs>...</w> plus <w xml:id="w-8"><fs type="morph"><f name="orth">History</f></fs>...</w> plus <linkGrp><link target="#w-1 #w-8" type="simple" pelcra:score="0.621777"/></linkGrp>
21
SAWS
http://ancientwisdoms.ac.uk
project
DH
XML
TEI
multi
work-version <--> work-version, apparatus
lb/@n plus app/@loc; stand-off app + children
y
lines (visually oriented), phrases, readings
no
<lb type="WJ" n="2.30"/>τοῦ ἀνθρώπου εἴτε καὶ ἀπὸ φθόνου, σοφῶς ὑπελθὼν ἔξελε τὸν κατη
22
Versioning Machine
http://v-machine.org
tool
DH
XML
TEI
single
work-version <--> work-version
textcrit module + xml:id
y
phrases, readings
no
<app> <rdg wit="#a660 #ll227">When </rdg> <rdg wit="#h201 #h72 #p1891 #l1894 #cp32">For </rdg> </app>
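The `app`/`rdg` sample above records which witnesses carry which reading. The following Python sketch inverts it into a witness-to-reading lookup. It assumes a namespace-free fragment like the sample; real TEI files normally declare the TEI namespace, in which case the element lookup would need a qualified name. The function name `readings_by_witness` is hypothetical.

```python
import xml.etree.ElementTree as ET

def readings_by_witness(app_fragment):
    """Map each witness id (without the leading '#') to the text of its reading."""
    app = ET.fromstring(app_fragment)
    readings = {}
    for rdg in app.findall("rdg"):
        text = "".join(rdg.itertext()).strip()
        # The wit attribute is a whitespace-separated list of '#id' pointers.
        for pointer in rdg.get("wit", "").split():
            readings[pointer.lstrip("#")] = text
    return readings

sample = ('<app> <rdg wit="#a660 #ll227">When </rdg> '
          '<rdg wit="#h201 #h72 #p1891 #l1894 #cp32">For </rdg> </app>')
print(readings_by_witness(sample)["h72"])  # For
```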
23
VVV Shakespeare
http://www.delightedbeauty.org
project
DH
XML
TEI
multi
work-version <--> work-version
n
no
various
TEI imported into native SQL database
24
mkAlign
http://www.tal.univ-paris3.fr/mkAlign/
tool
DH
XML
TMX
multi
work-version <--> work-version
single file, alignments as sole children of a common parent element, <tu>
n
word tokens
no
<tu><tuv xml:lang="EL"><seg> {καὶ,καί.I+Part} {πάντα,πᾶς.A} {προστίθησιν,προστίθημι.V} {ὑμῖν,ὑμεῖς.PRO+Per2p} {τὰ,ὁ.DET} {ἑαυτοῦ,ἑαυτοῦ.PRO+Ref3s}. </seg></tuv><tuv xml:lang="KA"><seg> {და,და.I+Conj} {ყოველსა,ყოველი.PRO+Det} {თავისა,თავი.PRO+Ref} {თჳსისასა,თჳსი.PRO+Pos} {მოგცემს,მოცემა.V+Mas{თქუენ,თქუენ.PRO+Pers}. </seg></tuv></tu>
Can include part of speech data, embedded as plain text in leaf elements. For TMX see http://www.gala-global.org/oscarStandards/tmx/tmx14b.html
25
PLUG Word Aligner
http://stp.lingfil.uu.se/corpora/plug/pwa/
project
CL
TXT
Uppsala
multi
work-version <--> work-version
single file, lines grouped in numbered sets
n
sentences
no
# fields: (id,source,target) svenprf2 ledamöter members
26
ACL 2005 project
http://www.cse.unt.edu/~rada/wpt05/
project
CL
TXT
multi
work-version <--> work-version
separate files, aligned by unnumbered newlines; companion concordance for token-to-token alignment
y
word tokens
no
18 2 2 P 0.7
27
BOLT Phase 1 Egyptian Arabic Parallel Word Alignment DF
https://secure.ldc.upenn.edu/
project
CL
TXT
multi
work-version <--> work-version
separate files, aligned by unnumbered lines; separate concordance for word tokens
y
word tokens
no
22-23(COR) 19[GLU],20-20,22(COR) 6-9(COR) 16-26(COR) 4-7(COR) 9[TOK]-14(COR) 14[MET]-(MTA)
28
BOLT Phase 2 Chinese Parallel Word Alignment and Tagging SMS
https://secure.ldc.upenn.edu/
project
CL
TXT
multi
work-version <--> work-version
separate files, aligned by unnumbered lines; separate concordance for word tokens
y
word tokens
no
22-25[OMN],26,27,28[POS](GIS)
29
European Parliament Proceedings
http://www.statmt.org/europarl/
project
CL
TXT
multi
work-version <--> work-version
separate files, aligned by unnumbered newlines
y
sentences
no
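Alignment by unnumbered newlines means that line N of one file corresponds to line N of the other, with nothing else marking the link. A minimal Python sketch of pairing such lines follows, with a check that neither side runs out early. The function name `read_parallel` and the two sample sentences are hypothetical, not taken from the corpus.

```python
from itertools import zip_longest

def read_parallel(source_lines, target_lines):
    """Pair line N of the source with line N of the target."""
    pairs = []
    for number, (src, tgt) in enumerate(zip_longest(source_lines, target_lines), start=1):
        # zip_longest pads the shorter input with None, exposing a length mismatch.
        if src is None or tgt is None:
            raise ValueError(f"inputs differ in length at line {number}")
        pairs.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return pairs

# Hypothetical two-line files, given here as lists of lines.
english = ["Resumption of the session\n", "Thank you.\n"]
french = ["Reprise de la session\n", "Merci.\n"]
print(read_parallel(english, french)[1])  # ('Thank you.', 'Merci.')
```

The same function accepts open file objects, since they iterate line by line. Because the format carries no ids, a single dropped line shifts every later pair, which is why the length check is worth keeping.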
30
Hong Kong Laws Parallel Text
https://secure.ldc.upenn.edu/
project
CL
TXT
multi
work-version <--> work-version
y
sentences
no
uses <s id="#.#">...</s> but not in XML-compliant manner
31
Kyoto Free Translation Task
http://www.phontron.com/kftt/
project
CL
TXT
multi
work-version <--> work-version
separate files, aligned by unnumbered newlines
y
sentences
no
32
Moses
http://www.statmt.org/moses/
tool
CL
TXT
multi
work-version <--> work-version
separate files, aligned by unnumbered lines; separate concordance for word tokens
y
word tokens
no
0-0 0-1 1-2 2-3
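The Moses sample `0-0 0-1 1-2 2-3` is a list of token index pairs for one sentence pair. A minimal Python sketch of decoding it follows. It assumes zero-based indices with the source position first and the target position second; the function names and the placeholder tokens (`a b c`, `w x y z`) are hypothetical.

```python
def parse_alignment_line(line):
    """Turn 'i-j i-j ...' into a list of (source_index, target_index) tuples."""
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

def aligned_words(source, target, alignment):
    """Look up the whitespace-separated tokens that each index pair points at."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return [(src_tokens[s], tgt_tokens[t]) for s, t in parse_alignment_line(alignment)]

print(parse_alignment_line("0-0 0-1 1-2 2-3"))
# [(0, 0), (0, 1), (1, 2), (2, 3)]
print(aligned_words("a b c", "w x y z", "0-0 0-1 1-2 2-3"))
# [('a', 'w'), ('a', 'x'), ('b', 'y'), ('c', 'z')]
```

The alignment file only makes sense next to the two sentence files, since the indices refer to tokens on the line with the same number in each.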
33
News Commentary
http://www.statmt.org/wmt13/translation-task.html#download
project
CL
TXT
multi
work-version <--> work-version
separate files, aligned by unnumbered newlines
y
sentences
no
34
Parallel Aligned Hebrew-Aramaic and Greek texts of Jewish Scripture
http://ccat.sas.upenn.edu/gopher/text/religion/biblical/parallel/
project
DH
TXT
multi
work-version <--> work-version
aligned word tokens on same line, tab delimited
n
word tokens
no
--+ '' =;XLM <ge41.1^> E)NU/PNION EI)=DEN
Uses betacode, specialized transliteration scheme
35
Collatex
http://collatex.net
tool
DH
multiple
single
work-version <--> work-version
various: see http://collatex.net/doc/#output
n
word tokens
no
various: see http://collatex.net/doc/#output
Five different output methods supported: JSON, 3 XML formats (TEI, GraphML, customized), GraphViz DOT
36
Open Scripture Information Standard
http://www.bibletechnologies.net/OSISinformation/
project
DH
XML
custom
multi
Bible-version/commentary <--> Bible-version/commentary
separate files, adopting XML schema
y
clauses (verses)
yes
cross-project alignment applies only to Bibles
37
Unified Standard Format Markup
http://ebible.org/usfx/#schema
project
DH
XML
custom
multi
Bible-version/commentary <--> Bible-version/commentary
separate files, adopting XML schema
y
clauses (verses)
yes
cross-project alignment applies only to Bibles
38
Wiki Headlines
http://www.statmt.org/wmt13/translation-task.html#download
project
CL
TXT
multi
work-version <--> work-version
single file, bitexts separated by three pipes
n
word tokens
no
Аарон Хант ||| Aaron Hunt
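A single-file bitext of this kind can be split on the three-pipe delimiter. The Python sketch below assumes exactly one delimiter per line, as in the sample, and rejects anything else; the function name `split_bitext_line` is hypothetical.

```python
def split_bitext_line(line, delimiter="|||"):
    """Split one 'source ||| target' line into a (source, target) tuple."""
    parts = [part.strip() for part in line.split(delimiter)]
    if len(parts) != 2:
        raise ValueError(f"expected exactly one {delimiter!r} in line: {line!r}")
    return parts[0], parts[1]

print(split_bitext_line("Аарон Хант ||| Aaron Hunt"))
# ('Аарон Хант', 'Aaron Hunt')
```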
40
Notes
41
No known DH or CL projects make use of XLIFF, which one might think conducive to this type of work.