Text Encoding Initiative
crash-course
Daniil Skorinkin, DH Network Potsdam
HUJI FUB winter school
Agenda for today
What is TEI
What is Text Encoding Initiative in a nutshell
What TEI is NOT!
But it is a cornerstone of the DH data universe
Some TEI-powered projects
TEI-powered projects examples: epigraphy
TEI-powered projects examples: manuscripts
TEI-powered projects examples: ‘programmable’ literary corpora:
Historical Letter Archives
Personal archives
The main idea of TEI
A digital text is not yet machine-readable data; to a computer it is nothing but a long string of characters
Consider this excerpt from Marlowe:
THE MASSACRE AT PARIS. With the Death of the Duke of Guise. Enter Charles the French King, the Queene Mother, the King of Nauarre, the Prince of Condye, the Lord high Admirall, and the Queene of Nauarre, with others.
Charles
Prince of Nauarre my honourable brother,
Prince Condy, and my good Lord Admirall,
I Wishe this vnion and religious league,
Knit in these hands thus ioyn'd in nuptiall rites,
May not desolue, till death desolue our liues,
And that the natiue sparkes of princely loue,
That kindled first this motion in our hearts:
May still be feweld in our progenye.
Nauar
The many fauours which your grace hath showne,
From time to time, but specially in this:
Shall binde me euer to your highnes will,
In what Queen Mother or your grace commands.
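A computer has no idea that this text has structure, e.g. that it consists of speeches (marked [ ] here), each made up of verse lines (marked { } here):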
THE MASSACRE AT PARIS. With the Death of the Duke of Guise. Enter Charles the French King, the Queene Mother, the King of Nauarre, the Prince of Condye, the Lord high Admirall, and the Queene of Nauarre, with others.
[
Charles
{Prince of Nauarre my honourable brother, }
{Prince Condy, and my good Lord Admirall,}
{I Wishe this vnion and religious league,}
{Knit in these hands thus ioyn'd in nuptiall rites,}
{May not desolue, till death desolue our liues,}
{And that the natiue sparkes of princely loue,}
{That kindled first this motion in our hearts:}
{May still be feweld in our progenye.}
]
[
Nauar
{The many fauours which your grace hath showne,}
{From time to time, but specially in this:}
{Shall binde me euer to your highnes will,}
{In what Queen Mother or your grace commands.}
]
And of course from the text itself a computer has no idea about the metadata
Title: The Massacre at Paris
Author: Christopher Marlowe
Creation date: 1593
Dramatic form: tragedy
Number of characters: 47
Number of female characters: 6
…etc.
The idea of TEI is to store both metadata AND machine-readable data about certain elements of the text together in one file:
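A minimal sketch of that idea in Python, not a full TEI file: the header values come from the metadata list above, while the `<publicationStmt>` and `<sourceDesc>` wording is a placeholder assumed for illustration. It shows that one parsed file answers both metadata questions and structural ones.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny teaching sample: metadata in <teiHeader>, marked-up text in <text>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Massacre at Paris</title>
        <author>Christopher Marlowe</author>
      </titleStmt>
      <publicationStmt><p>Placeholder (assumed)</p></publicationStmt>
      <sourceDesc><p>Placeholder (assumed)</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <sp>
        <speaker>Charles</speaker>
        <l>Prince of Nauarre my honourable brother,</l>
        <l>Prince Condy, and my good Lord Admirall,</l>
      </sp>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)
# Metadata question: who wrote what?
title = root.find(".//tei:titleStmt/tei:title", NS).text
author = root.find(".//tei:titleStmt/tei:author", NS).text
# Structural question: how many verse lines are marked up?
verse_line_count = len(root.findall(".//tei:l", NS))
print(title, "|", author, "|", verse_line_count)
```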
XML as the technological basis for TEI
XML = eXtensible Markup Language
* Developed and maintained by the W3C
Simplest XML-document
<text>
  <line>Line of text</line>
</text>
text, line: elements of an XML document
XML elements are expressed as tag pairs:
<text>, <line>: opening tags
</text>, </line>: closing tags
Reminder once again: XML is NOT a programming language. It does not really DO anything by itself
XML simply encodes the information in an explicit form for the computer to read:
All the rest is for us to implement on top of TEI:
Why then is it still good for you to know some XML?
* For example, well-known formats such as .docx, .xlsx, .epub, .fb2 and .gexf are all XML-based inside
Questions?
Three main rules of XML syntax
[XML is [always a [nested] tree] structure]
<text>
  <line>
    <word>Hello</word> <word>DH-ers!</word>
  </line>
</text>
XML tags must have corresponding closing tags
Tags that open within some parent tags must close before the parent tag closes
Tags cannot open inside something and close outside
<text>
<line>
<word>Hello</word> <word>Jerusalem</word>
</text>
</line>
unclear what’s inside what (line in text? text in line?)
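The nesting rule can be checked mechanically. A small sketch using Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Properly nested: every tag closes inside its parent.
NESTED = "<text><line><word>Hello</word> <word>DH-ers!</word></line></text>"
# Crossed: </text> arrives while <line> is still open.
CROSSED = "<text><line><word>Hello</word> <word>Jerusalem</word></text></line>"

print(is_well_formed(NESTED), is_well_formed(CROSSED))
```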
There must be exactly one root tag
<text>
why don’t we have
</text>
<text>
a common
root tag?
</text>
If this is your entire document, it is wrong
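A quick sketch of how a parser reacts to two root elements, and one possible repair: wrapping both in a single parent. The wrapper name `<doc>` is an arbitrary choice for this example.

```python
import xml.etree.ElementTree as ET

TWO_ROOTS = "<text>why don't we have</text><text>a common root tag?</text>"
ONE_ROOT = "<doc>" + TWO_ROOTS + "</doc>"  # <doc> is a hypothetical wrapper

try:
    ET.fromstring(TWO_ROOTS)
    two_roots_accepted = True
except ET.ParseError as err:
    two_roots_accepted = False
    print("rejected:", err)

# With a single root, both <text> elements become its children.
wrapped = ET.fromstring(ONE_ROOT)
child_tags = [child.tag for child in wrapped]
print(child_tags)
```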
If a tag has no textual content (e.g. just marks some point in your text)...
<text>
<line>
<word>Hello</word> <word>HUJI</word>
</line>
<pagebreak></pagebreak>
</text>
This is allowed.
...you can make this tag <selfclosing/>
<text>
<line>
<word>Hello</word> <word>HUJI</word>
</line>
<pagebreak/>
</text>
This is allowed.
But you can’t open the tag and NOT close it
<text>
<line>
<word>Hello</word> <word>HUJI</word>
</line>
<pagebreak>
</text>
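The three `<pagebreak>` variants above can be compared with a short Python sketch (the helper `describe` is made up for this example). The paired and self-closing spellings parse to the same empty element; the unclosed one is rejected.

```python
import xml.etree.ElementTree as ET

PAIRED = "<text><line>Hello HUJI</line><pagebreak></pagebreak></text>"
SELF_CLOSED = "<text><line>Hello HUJI</line><pagebreak/></text>"
UNCLOSED = "<text><line>Hello HUJI</line><pagebreak></text>"

def describe(xml_text):
    """Return (text, number of children) of the first <pagebreak> element."""
    pagebreak = ET.fromstring(xml_text).find("pagebreak")
    return (pagebreak.text or "", len(pagebreak))

# Both legal spellings describe the same empty element.
same_result = describe(PAIRED) == describe(SELF_CLOSED)

try:
    ET.fromstring(UNCLOSED)
    unclosed_accepted = True
except ET.ParseError:
    unclosed_accepted = False

print(same_result, unclosed_accepted)
```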
XML elements have attributes
<text id="001">
  <line number="1">I am a line</line>
  <line number="2">Me too</line>
</text>
id, number are attributes
1, 2 are values of the number attribute
Values are always in quotes!
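A short sketch of reading attributes in Python. Note that attribute values always come back as strings, so "001" keeps its leading zeros, and that an unquoted value makes the document ill-formed.

```python
import xml.etree.ElementTree as ET

DOC = """<text id="001">
  <line number="1">I am a line</line>
  <line number="2">Me too</line>
</text>"""

root = ET.fromstring(DOC)
text_id = root.get("id")  # attribute values are strings
lines_by_number = {line.get("number"): line.text for line in root.findall("line")}

# The same line with an unquoted value is not well-formed XML.
try:
    ET.fromstring("<line number=1>I am a line</line>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(text_id, lines_by_number, unquoted_accepted)
```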
Questions?
✅ ❌ Let’s check your XML basics
What is wrong with this XML?
<teacher>
<person>Renana Keydar </person>
<person>Noam Maeir</person>
<person>Dennis Mischke</person>
<person>Daniil Skorinkin </person>
<person>Mareike Schumacher</person>
<person>Yael Netzer</person>
<person>Eliese-Sophia Lincke</person>
<person>Barak Sober</person>
<person>Itay Marienberg</person>
<person>Or Rappel-Kroyzer</person>
What is wrong with this XML?
<teacher>
<person>Renana Keydar </person>
<person>Noam Maeir</person>
<person>Dennis Mischke</person>
<person>Daniil Skorinkin </person>
<person>Mareike Schumacher</person>
<person>Yael Netzer</person>
<person>Eliese-Sophia Lincke</person>
<person>Barak Sober</person>
<person>Itay Marienberg</person>
<person>Or Rappel-Kroyzer</person>
</teacher>
⚠️ Every tag has to be closed
What is wrong with this XML?
<teacher>
<person>Renana Keydar </person>
<person>Noam Maeir</person>
<person>Dennis Mischke</person>
<person>Daniil Skorinkin <person>
<person>Mareike Schumacher</person>
<person>Yael Netzer</person>
<person>Eliese-Sophia Lincke</person>
<person>Barak Sober</person>
<person>Itay Marienberg</person>
<person>Or Rappel-Kroyzer</person>
</teacher>
What is wrong with this XML?
<teacher>
<person>Renana Keydar </person>
<person>Noam Maeir</person>
<person>Dennis Mischke</person>
<person>Daniil Skorinkin </person>
<person>Mareike Schumacher</person>
<person>Yael Netzer</person>
<person>Eliese-Sophia Lincke</person>
<person>Barak Sober</person>
<person>Itay Marienberg</person>
<person>Or Rappel-Kroyzer</person>
</teacher>
⚠️ Closing tag should have the /
What is wrong with this XML?
<teacher>
<person>Renana Keydar </person>
<person>Noam Maeir</person>
<person>Dennis Mischke</person>
<person>Daniil Skorinkin </person>
<person>Mareike Schumacher</parson>
<person>Yael Netzer</person>
<person>Eliese-Sophia Lincke</person>
<person>Barak Sober</person>
<person>Itay Marienberg</person>
<person>Or Rappel-Kroyzer</person>
</teacher>
What is wrong with this XML?
<teacher>
<person>Renana Keydar </person>
<person>Noam Maeir</person>
<person>Dennis Mischke</person>
<person>Daniil Skorinkin </person>
<person>Mareike Schumacher</person>
<person>Yael Netzer</person>
<person>Eliese-Sophia Lincke</person>
<person>Barak Sober</person>
<person>Itay Marienberg</person>
<person>Or Rappel-Kroyzer</person>
</teacher>
⚠️ Closing tag should match the opening tag
What is wrong with this XML?
<teacher>
<person>Renana Keydar </person>
<person>Noam Maeir</person>
<person>Dennis Mischke</person>
<person>Daniil Skorinkin </person>
<person>Mareike Schumacher</person>
<person>Yael Netzer</person>
<person>Eliese-Sophia Lincke</person>
<person>Barak Sober</person>
<person>Itay Marienberg</person>
<person>Or Rappel-Kroyzer</person>
</teacher>
It is actually correct :)
OK, so now you know XML
How is all this used in TEI?
TEI
Three examples:
1. Text versioning in TEI:
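Меня преследуют две-три случайных фразы,
Весь день твержу: печаль моя жирна
О Боже, как жирны и синеглазы
Стрекозы смерти, как лазурь черна.
O. Mandelstam
(Literal gloss: I am haunted by two or three chance phrases, / all day I keep repeating: my sorrow is fat. / O God, how fat and blue-eyed / the dragonflies of death are, how black the azure.)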
Original authorial typeset:
This is what actually happened:
Меня преследуют две-три случайных фразы,
Весь день твержу: печаль моя жарка
О Боже, как жирны и синеглазы
Стрекозы смерти, как лазурь черна.
О. Мандельштам
жирна ("fat"), added in place of жарка ("hot")
Or, if we think ‘all versions are created equal’:
Меня преследуют две-три случайных фразы,
Весь день твержу: печаль моя
О Боже, как жирны и синеглазы
Стрекозы смерти, как лазурь черна.
О. Мандельштам
жирна ("fat")
жарка ("hot")
Text versioning in TEI:
<subst>: parent tag grouping a deletion with its replacement (in TEI P5, <choice> does not take <del>/<add>)
<del>: what was deleted by the author
<add>: what was added by the author
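A sketch of the authorial revision of the Mandelstam line, assuming the deletion and addition are grouped under TEI's `<subst>` element and leaving out the TEI namespace for brevity. The helper `reading` is made up for this example: skipping `<add>` gives the first version, skipping `<del>` gives the revised one.

```python
import xml.etree.ElementTree as ET

LINE = ET.fromstring(
    "<l>Весь день твержу: печаль моя "
    "<subst><del>жарка</del><add>жирна</add></subst></l>"
)

def reading(elem, skip):
    """Collect the text of elem, leaving out every element named skip."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            parts.append(reading(child, skip))
        parts.append(child.tail or "")  # text following the child is kept
    return "".join(parts)

before_revision = reading(LINE, skip="add")
after_revision = reading(LINE, skip="del")
print(before_revision)
print(after_revision)
```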
Or, if it was not the author, but the editors:
<choice> — parent tag for a text ‘fork’
<sic> — authorial version
<corr> — corrected version
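The editorial case can be sketched the same way. Here the Mandelstam line is reused purely for illustration, as if the change had been made by an editor; the helper `reading` is hypothetical and the TEI namespace is omitted for brevity.

```python
import copy
import xml.etree.ElementTree as ET

LINE = ET.fromstring(
    "<l>Весь день твержу: печаль моя "
    "<choice><sic>жарка</sic><corr>жирна</corr></choice></l>"
)

def reading(line, keep):
    """Return the line's text keeping only <sic> or only <corr> in each <choice>."""
    drop = "corr" if keep == "sic" else "sic"
    line = copy.deepcopy(line)  # leave the parsed original untouched
    for choice in list(line.iter("choice")):
        for unwanted in choice.findall(drop):
            choice.remove(unwanted)
    return "".join(line.itertext())

as_written = reading(LINE, keep="sic")
as_corrected = reading(LINE, keep="corr")
print(as_written)
print(as_corrected)
```

Because both readings live in the same file, a reader or a script can choose which one to display.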
2. Document structure in TEI
Document structure:
Two first quatrains of the same poem:
Document structure: verse
<l> — any verse line
<lg> — line group, contains one or more verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc.
type attribute can encode the specific stanza type
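A sketch of what verse markup makes possible, using the quatrain quoted earlier. The `type="quatrain"` value is an assumed label, and the TEI namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

STANZA = ET.fromstring("""<lg type="quatrain">
  <l>Меня преследуют две-три случайных фразы,</l>
  <l>Весь день твержу: печаль моя жирна</l>
  <l>О Боже, как жирны и синеглазы</l>
  <l>Стрекозы смерти, как лазурь черна.</l>
</lg>""")

stanza_type = STANZA.get("type")
verse_lines = [verse.text for verse in STANZA.findall("l")]
# Line-final words, e.g. as a starting point for rhyme analysis.
last_words = [verse.rstrip(",.").split()[-1] for verse in verse_lines]
print(stanza_type, len(verse_lines), last_words)
```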
Document structure: long prose (parts, chapters etc.)
The beginning of ‘War and Peace’ in TEI:
Document structure: long prose
<div>: any division unit of a prose work (part, chapter, section etc.)
<p>: a prose paragraph
<s>: a sentence-like unit of prose (s-unit)
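A sketch of nested prose divisions with placeholder sentences (not the actual text of the novel). The `type` and `n` values are assumptions chosen for illustration, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

BODY = ET.fromstring("""<body>
  <div type="part" n="1">
    <div type="chapter" n="1">
      <p><s>Placeholder sentence one.</s> <s>Placeholder sentence two.</s></p>
      <p><s>Placeholder sentence three.</s></p>
    </div>
    <div type="chapter" n="2">
      <p><s>Placeholder sentence four.</s></p>
    </div>
  </div>
</body>""")

# Count sentences per chapter, at any depth below the chapter <div>.
sentences_per_chapter = {
    chapter.get("n"): len(chapter.findall(".//s"))
    for chapter in BODY.iterfind(".//div[@type='chapter']")
}
print(sentences_per_chapter)
```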
Document structure: drama (acts, scenes etc.)
Beginning of Lessing’s Emilia Galotti in TEI:
Document structure: drama
<sp> — speech instance
@who — character ID attribute
<speaker> — the in-text mention of character who’s about to speak
<stage> — stage direction (‘scenic remark’)
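A sketch of the drama elements applied to the Marlowe excerpt from the beginning of these slides. The `who` identifiers (`#charles`, `#navarre`) and the `type="scene"` value are invented for this example, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SCENE = ET.fromstring("""<div type="scene">
  <stage>Enter Charles the French King, the Queene Mother, with others.</stage>
  <sp who="#charles">
    <speaker>Charles</speaker>
    <l>Prince of Nauarre my honourable brother,</l>
    <l>Prince Condy, and my good Lord Admirall,</l>
  </sp>
  <sp who="#navarre">
    <speaker>Nauar</speaker>
    <l>The many fauours which your grace hath showne,</l>
  </sp>
</div>""")

# Verse lines per character, keyed by the @who identifier.
lines_by_character = Counter()
for speech in SCENE.iter("sp"):
    lines_by_character[speech.get("who")] += len(speech.findall("l"))

stage_directions = [stage.text for stage in SCENE.iter("stage")]
print(dict(lines_by_character), len(stage_directions))
```

Counting by `who` instead of by the `<speaker>` text means spelling variants of a name (Nauar, Nauarre) still point to the same character.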
Named entities and dates in TEI:
3. Metadata within the document
Metadata within the document:
Metadata in TEI
<teiHeader> — an element containing all metadata subelements
<sourceDesc> — sources info
<profileDesc> — content metadata info (genres, character features etc.)
…and many more, see ch. 2 of the guidelines: tei-c.org/release/doc/tei-p5-doc/en/html/HD.html
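A sketch of reading header metadata with Python. The header content is the placeholder text from the document stub used in the practice part; only the TEI namespace and element names are taken as given.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

DOC = ET.fromstring("""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Title</title></titleStmt>
      <publicationStmt><p>Publication Information</p></publicationStmt>
      <sourceDesc><p>Information about the source</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation><date when="2023-02-28">28 February 2023</date></creation>
    </profileDesc>
  </teiHeader>
  <text><body><p/></body></text>
</TEI>""")

header = DOC.find("tei:teiHeader", NS)
title = header.find(".//tei:titleStmt/tei:title", NS).text
source = header.find(".//tei:sourceDesc/tei:p", NS).text
# The @when attribute holds the date in machine-readable ISO form.
created = date.fromisoformat(header.find(".//tei:creation/tei:date", NS).get("when"))
print(title, "|", source, "|", created.year)
```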
Not purely text documents:
Postcard metadata in TEI
Postcard content in TEI
Sender-receiver metadata
Practice
How to create an XML file
Editors with XML syntax highlighting
…
Two more XML-tailored & TEI-tagset-aware options:
Let us see how to create and modify a TEI/XML in
Download the zip from Slack
Or here
We try it together now:
You have to wait until it hurts, until it clangs in your ears like the bells of hell, until nothing else counts but it, until it is everything, until you can’t do anything else but.
then sit down and write or stand up and write but write no matter what the other people are doing, no matter what they will do to you.
Lay the line down, a party of one, what a party, swarmed by the light, the time of the times, out of the tips of your fingers.
Both are in your zip
Let us encode it in TEI
This is a TEI document stub for you (in the zip too)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date when="2023-02-28">28 February 2023</date>
      </creation>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- REPLACE THIS WITH YOUR POEM -->
    </body>
  </text>
</TEI>
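The same encoding step can also be scripted. This is only a sketch under stated assumptions: the input is plain text with one verse line per row and a blank line between stanzas, the sample lines are placeholders, the helper `add_poem` is made up, and the stub is shortened to its `<text>` part (so the result is not yet a complete TEI file).

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)  # serialize TEI elements without a prefix

# Shortened stub: the <teiHeader> is left out here for brevity.
STUB = f'<TEI xmlns="{TEI}"><text><body/></text></TEI>'

PLAIN_POEM = """First line of stanza one
Second line of stanza one

First line of stanza two"""

def add_poem(root, plain_text):
    """Append one <lg> per stanza and one <l> per verse line to <body>."""
    body = root.find(f".//{{{TEI}}}body")
    for stanza in plain_text.strip().split("\n\n"):
        lg = ET.SubElement(body, f"{{{TEI}}}lg")
        for verse in stanza.splitlines():
            ET.SubElement(lg, f"{{{TEI}}}l").text = verse.strip()
    return root

encoded = add_poem(ET.fromstring(STUB), PLAIN_POEM)
lines_per_stanza = [len(lg) for lg in encoded.iter(f"{{{TEI}}}lg")]
xml_out = ET.tostring(encoded, encoding="unicode")
print(lines_per_stanza)
print(xml_out)
```

Doing it by hand in an XML editor first is still worthwhile: the script only automates decisions you have already made about what counts as a stanza and a line.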
This is what we want in the end