1 of 91

Text Encoding Initiative

Crash course

Daniil Skorinkin, DH Network Potsdam

HUJI FUB winter school

2 of 91

Agenda for today

  • Text Encoding Initiative (TEI) as the common standard for encoding text-related information in text-oriented humanities
  • XML as the technological foundation for TEI
  • TEI use cases and applications
  • 👩‍💻 👨‍💻 Tackling TEI/XML in XML-aware text editors: encoding one text together

3 of 91

What is TEI

4 of 91

What is the Text Encoding Initiative, in a nutshell?

  • An established markup standard in digital humanities: a set of predefined tags for marking up texts and storing metadata
  • A common unifying ground for encoding poetry, prose, dramatic texts, digitized manuscript collections, letters etc.
  • A community that develops the standard and builds tools for working with TEI-encoded documents
    • e.g. software for web-publishing digital editions, such as TEI Publisher
    • or TEI-based corpus platforms such as DraCor or the COST European Literary Text Collection (ELTeC)

5 of 91

What TEI is NOT!

  • TEI is not a programming language/tool/instrument (unlike Python, R, Stylo, Gephi, AntConc and others).
  • TEI encoding by itself does not do anything and does not help you in any way. Sorry!

But it is a cornerstone of the DH data universe

6 of 91

Some TEI-powered projects

7 of 91

TEI-powered projects examples: epigraphy

8 of 91

TEI-powered projects examples: manuscripts

9 of 91

TEI-powered projects examples: ‘programmable’ literary corpora:

11 of 91

Historical Letter Archives

12 of 91

Personal archives

15 of 91

The main idea of TEI

16 of 91

The main idea of TEI

  • Texts AND their metadata should be preserved in a universal machine-readable format
  • Data encoding should be separated from its online representation
  • The TEI standard as a middle-ground language between machines and humans
  • TEI tags as a layer of machine-readable knowledge on top of the text

17 of 91

A digital text is not yet machine-readable data: to a computer it is nothing but a long string of characters

18 of 91

Consider this excerpt from Marlowe:

THE MASSACRE AT PARIS. With the Death of the Duke of Guise. Enter Charles the French King, the Queene Mother, the King of Nauarre, the Prince of Condye, the Lord high Admirall, and the Queene of Nauarre, with others.

Charles

Prince of Nauarre my honourable brother,

Prince Condy, and my good Lord Admirall,

I Wishe this vnion and religious league,

Knit in these hands thus ioyn'd in nuptiall rites,

May not desolue, till death desolue our liues,

And that the natiue sparkes of princely loue,

That kindled first this motion in our hearts:

May still be feweld in our progenye.

Nauar

The many fauours which your grace hath showne,

From time to time, but specially in this:

Shall binde me euer to your highnes will,

In what Queen Mother or your grace commands.

19 of 91

A computer has no idea that

  • this text consists of some ‘units’
  • much less that some of these units might be more interesting to you:

THE MASSACRE AT PARIS. With the Death of the Duke of Guise. Enter Charles the French King, the Queene Mother, the King of Nauarre, the Prince of Condye, the Lord high Admirall, and the Queene of Nauarre, with others.

[

Charles

{Prince of Nauarre my honourable brother,}

{Prince Condy, and my good Lord Admirall,}

{I Wishe this vnion and religious league,}

{Knit in these hands thus ioyn'd in nuptiall rites,}

{May not desolue, till death desolue our liues,}

{And that the natiue sparkes of princely loue,}

{That kindled first this motion in our hearts:}

{May still be feweld in our progenye.}

]

[

Nauar

{The many fauours which your grace hath showne,}

{From time to time, but specially in this:}

{Shall binde me euer to your highnes will,}

{In what Queen Mother or your grace commands.}

]

20 of 91

And of course from the text itself a computer has no idea about the metadata

Title: The Massacre at Paris

Author: Christopher Marlowe

Creation date: 1593

Dramatic form: tragedy

Number of characters: 47

Number of female characters: 6

…etc.

21 of 91

The idea of TEI is to store both the metadata AND machine-readable data about certain elements of the text together in one file:

25 of 91

XML as the technological basis for TEI

26 of 91

XML = eXtensible Markup Language

  • A general syntax for markup languages (no semantics of its own!)
    • Dialects/extensions like TEI do have semantics
  • Its purpose is to store data, not to describe appearance (the main difference from its relative, HTML)
  • Not tied to any particular software/company*

*developed and maintained by the W3C

27 of 91

The simplest XML document

<text>

<line>Line of text</line>

</text>

text, line: elements of an XML document

XML elements are expressed as tag pairs:

<text>, <line> opening tags

</text>, </line> closing tags
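To see that such a document really is structured data for a program, here is a minimal sketch (not from the original slides) that reads the same two-element document with Python's standard-library parser. The variable names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# The simplest document from the slide, written on one line.
doc = "<text><line>Line of text</line></text>"

root = ET.fromstring(doc)                 # parse the string into a tree
tags = [el.tag for el in root.iter()]     # element names, in document order
line_text = root.find("line").text        # text content of <line>

print(tags)
print(line_text)
```

The parser hands back named elements and their content, not just a string of characters.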

28 of 91

Reminder once again: XML is NOT a programming language. It does not really DO anything by itself

XML simply encodes the information in an explicit form for the computer to read:

29 of 91

All the rest is for us to implement on top of TEI:

30 of 91

Why then is it still good for you to know some XML?

  • It’s a widespread data exchange technology*
  • Human-readable + machine-readable
  • XML is technically just plain text, which means the data will not be lost because some format became obsolete
  • Well suited to tying information to <span>specific spans</span> of a text
  • It is the basis for TEI, and if you do not speak TEI you might get into trouble in the DH hood

* for example, such well-known formats as .docx, .xlsx, .epub, .fb2 and .gexf are all XML-based inside

31 of 91

Questions?

32 of 91

Three main rules of XML syntax

33 of 91

[XML is [always a [nested] tree] structure]

<text>

<line>

<word>Hello</word> <word>DH-ers!</word>

</line>

</text>

XML tags must have corresponding closing tags

Tags that open within some parent tags must close before the parent tag closes

34 of 91

Tags cannot open inside something and close outside

<text>

<line>

<word>Hello</word> <word>Jerusalem</word>

</text>

</line>

unclear what’s inside what (line in text? text in line?)

35 of 91

There must be exactly one root tag

<text>

why don’t we have

</text>

<text>

a common

root tag?

</text>

If this is your entire document, this is wrong
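The three rules can be checked mechanically. The sketch below is an illustration only: the helper name `is_well_formed` is hypothetical, and it relies on Python's standard-library parser rejecting anything that is not well-formed. It tries the nested example, the crossed-tags example and the two-roots example from the preceding slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Return True if the string parses as one well-formed XML document."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

# Properly nested: every tag closes inside its parent.
nested_ok = "<text><line><word>Hello</word> <word>DH-ers!</word></line></text>"
# <line> opens inside <text> but closes outside it.
crossed = "<text><line><word>Hello</word> <word>Jerusalem</word></text></line>"
# Two top-level elements instead of a single root.
two_roots = "<text>why don't we have</text><text>a common root tag?</text>"

for name, doc in [("nested_ok", nested_ok), ("crossed", crossed), ("two_roots", two_roots)]:
    print(name, is_well_formed(doc))
```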

36 of 91

If a tag has no textual content (e.g. just marks some point in your text)...

<text>

<line>

<word>Hello</word> <word>HUJI</word>

</line>

<pagebreak></pagebreak>

</text>

This is allowed:

37 of 91

...you can make this tag <selfclosing/>

<text>

<line>

<word>Hello</word> <word>HUJI</word>

</line>

<pagebreak/>

</text>

This is allowed:

38 of 91

But you can’t open the tag and NOT close it

<text>

<line>

<word>Hello</word> <word>HUJI</word>

</line>

<pagebreak>

</text>
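A small sketch, assuming Python's standard-library parser as the judge, of the point made on the last three slides: the empty tag pair and the self-closing form are the same thing to a parser, while an opened but unclosed tag is rejected. Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring(
    "<text><line>Hello HUJI</line><pagebreak></pagebreak></text>")
short_form = ET.fromstring(
    "<text><line>Hello HUJI</line><pagebreak/></text>")

# Both spellings produce the same tree, so they serialize identically.
same = ET.tostring(long_form) == ET.tostring(short_form)

# Opening <pagebreak> without ever closing it is not well-formed.
try:
    ET.fromstring("<text><line>Hello HUJI</line><pagebreak></text>")
    unclosed_parses = True
except ET.ParseError:
    unclosed_parses = False

print(same, unclosed_parses)
```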

39 of 91

XML elements have attributes

<text id="001">

<line number="1">I am a line</line>

<line number="2">Me too</line>

</text>

id, number are attributes

1 and 2 are values of the number attribute

Values are always in quotes!
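As an illustration (not part of the original slides), attributes can be read back with Python's standard-library parser. Note two things the sketch assumes and demonstrates: attribute values always arrive as strings, and the quotes must be plain straight quotes, since typographic quotes pasted from a word processor are rejected.

```python
import xml.etree.ElementTree as ET

doc = '''<text id="001">
<line number="1">I am a line</line>
<line number="2">Me too</line>
</text>'''

root = ET.fromstring(doc)
text_id = root.get("id")                       # value of the id attribute
lines = {line.get("number"): line.text         # number attribute -> line text
         for line in root.findall("line")}

# Curly quotes around a value are not valid XML syntax.
try:
    ET.fromstring("<text id=\u201c001\u201d/>")
    curly_quotes_ok = True
except ET.ParseError:
    curly_quotes_ok = False

print(text_id, lines, curly_quotes_ok)
```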

40 of 91

Questions?

41 of 91

✅ ❌ Let’s check your XML basics

42 of 91

What is wrong with this XML?

<teacher>

<person>Renana Keydar </person>

<person>Noam Maeir</person>

<person>Dennis Mischke</person>

<person>Daniil Skorinkin </person>

<person>Mareike Schumacher</person>

<person>Yael Netzer</person>

<person>Eliese-Sophia Lincke</person>

<person>Barak Sober</person>

<person>Itay Marienberg</person>

<person>Or Rappel-Kroyzer</person>

43 of 91

What is wrong with this XML?

<teacher>

<person>Renana Keydar </person>

<person>Noam Maeir</person>

<person>Dennis Mischke</person>

<person>Daniil Skorinkin </person>

<person>Mareike Schumacher</person>

<person>Yael Netzer</person>

<person>Eliese-Sophia Lincke</person>

<person>Barak Sober</person>

<person>Itay Marienberg</person>

<person>Or Rappel-Kroyzer</person>

</teacher>

44 of 91

⚠️ Every tag has to be closed

45 of 91

What is wrong with this XML?

<teacher>

<person>Renana Keydar </person>

<person>Noam Maeir</person>

<person>Dennis Mischke</person>

<person>Daniil Skorinkin <person>

<person>Mareike Schumacher</person>

<person>Yael Netzer</person>

<person>Eliese-Sophia Lincke</person>

<person>Barak Sober</person>

<person>Itay Marienberg</person>

<person>Or Rappel-Kroyzer</person>

</teacher>

46 of 91

What is wrong with this XML?

<teacher>

<person>Renana Keydar </person>

<person>Noam Maeir</person>

<person>Dennis Mischke</person>

<person>Daniil Skorinkin </person>

<person>Mareike Schumacher</person>

<person>Yael Netzer</person>

<person>Eliese-Sophia Lincke</person>

<person>Barak Sober</person>

<person>Itay Marienberg</person>

<person>Or Rappel-Kroyzer</person>

</teacher>

47 of 91

⚠️ A closing tag must have the /

48 of 91

What is wrong with this XML?

<teacher>

<person>Renana Keydar </person>

<person>Noam Maeir</person>

<person>Dennis Mischke</person>

<person>Daniil Skorinkin </person>

<person>Mareike Schumacher</parson>

<person>Yael Netzer</person>

<person>Eliese-Sophia Lincke</person>

<person>Barak Sober</person>

<person>Itay Marienberg</person>

<person>Or Rappel-Kroyzer</person>

</teacher>

50 of 91

⚠️ A closing tag must match its opening tag

51 of 91

What is wrong with this XML?

<teacher>

<person>Renana Keydar </person>

<person>Noam Maeir</person>

<person>Dennis Mischke</person>

<person>Daniil Skorinkin </person>

<person>Mareike Schumacher</person>

<person>Yael Netzer</person>

<person>Eliese-Sophia Lincke</person>

<person>Barak Sober</person>

<person>Itay Marienberg</person>

<person>Or Rappel-Kroyzer</person>

</teacher>

52 of 91

It is actually correct :)
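In a longer file, spotting such mistakes by eye gets hard, so it helps to let a parser point at the problem. The sketch below is a hedged illustration: the helper name `first_error` is hypothetical, and it uses the `position` attribute of the standard library's `ParseError` to report the line where the parser gave up. The sample is a shortened version of the quiz document.

```python
import xml.etree.ElementTree as ET

def first_error(xml_string):
    """Return (line, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError as err:
        line, _column = err.position   # where the parser detected the problem
        return line, str(err)
    return None

broken = """<teacher>
<person>Daniil Skorinkin </person>
<person>Mareike Schumacher</parson>
<person>Yael Netzer</person>
</teacher>"""
fixed = broken.replace("</parson>", "</person>")

print(first_error(broken))
print(first_error(fixed))
```

The reported line is where the parser noticed the inconsistency, which is not always where the typo is: a missing slash, for instance, is usually only detected at a later closing tag.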

53 of 91

OK, so now you know XML

54 of 91

How is all this used in TEI?

55 of 91

TEI

  • TEI is an application of XML: as a standard, it defines a concrete set of tags with fixed semantics
  • These semantics, best practices and examples are described in the TEI Guidelines at tei-c.org/
  • Another useful source is the ‘TEI by example’ website: teibyexample.org

56 of 91

Three examples:

  1. Text versioning in TEI
  2. Basic text structuring
  3. Metadata storage inside the doc (the <teiHeader>)

57 of 91

  1. Text versioning in TEI

58 of 91

Text versioning in TEI:

Меня преследуют две-три случайных фразы,

Весь день твержу: печаль моя жирна

О Боже, как жирны и синеглазы

Стрекозы смерти, как лазурь черна.

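O. Mandelstam

(English gloss: Two or three random phrases haunt me, / all day I keep repeating: my sorrow is fat. / O God, how fat and blue-eyed / are the dragonflies of death, how black the azure.)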

60 of 91

Original authorial typescript:

61 of 91

This is what actually happened:

Меня преследуют две-три случайных фразы,

Весь день твержу: печаль моя жарка [‘hot’: crossed out and replaced with жирна, ‘fat’]

О Боже, как жирны и синеглазы

Стрекозы смерти, как лазурь черна.

O. Mandelstam

62 of 91

Or, if we think ‘all versions are created equal’:

Меня преследуют две-три случайных фразы,

Весь день твержу: печаль моя [жирна / жарка]

О Боже, как жирны и синеглазы

Стрекозы смерти, как лазурь черна.

O. Mandelstam

63 of 91

Text versioning in TEI:

64 of 91

Text versioning in TEI:

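<subst> — parent tag for a text ‘fork’ made by the author (it groups a deletion with its replacement; <choice> does not accept <del> and <add>)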

<del> — what was deleted by the author

<add> — what was added by the author

65 of 91

Or, if it was not the author, but the editors:

<choice> — parent tag for a text ‘fork’

<sic> — authorial version

<corr> — corrected version
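Once variants are encoded, each reading can be regenerated on demand. The following is a minimal sketch under stated assumptions, not the encoding shown in the lecture: the authorial change is grouped in `<subst>` (the element the TEI Guidelines provide for pairing `<del>` with `<add>`), the `<sic>`/`<corr>` sentence is a made-up example, the TEI namespace is left out for brevity, and the helper name `render` is hypothetical.

```python
import xml.etree.ElementTree as ET

def render(elem, drop):
    """Return the text inside elem, skipping child elements whose tag is in drop."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in drop:
            parts.append(render(child, drop))
        parts.append(child.tail or "")   # text that follows the child element
    return "".join(parts)

# Authorial revision in the Mandelstam line.
authorial = ET.fromstring(
    "<l>Весь день твержу: печаль моя "
    "<subst><del>жарка</del><add>жирна</add></subst></l>")

# Editorial correction of a (hypothetical) typo.
editorial = ET.fromstring(
    "<p>XML is <choice><sic>machine-redable</sic>"
    "<corr>machine-readable</corr></choice></p>")

print(render(authorial, {"del"}))    # final authorial text
print(render(authorial, {"add"}))    # text before the revision
print(render(editorial, {"sic"}))    # corrected reading
print(render(editorial, {"corr"}))   # text as found in the source
```

The same file therefore serves both a ‘clean’ reading edition and a view of the earlier state, which is the practical payoff of keeping all versions in the markup.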

66 of 91

2. Document structure in TEI

67 of 91

Document structure:

Two first quatrains of the same poem:

68 of 91

Document structure: verse

<l> — any verse line

<lg> (line group) contains one or more verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc.

type attribute can encode the specific stanza type
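A possible encoding of the quatrain quoted earlier, as a sketch rather than the encoding from the lecture: the value `quatrain` for `type` is an assumption, and the TEI namespace is omitted to keep the example short. The Python part only shows that stanza and line boundaries become directly queryable.

```python
import xml.etree.ElementTree as ET

poem = """<lg type="quatrain">
<l>Меня преследуют две-три случайных фразы,</l>
<l>Весь день твержу: печаль моя жирна</l>
<l>О Боже, как жирны и синеглазы</l>
<l>Стрекозы смерти, как лазурь черна.</l>
</lg>"""

stanza = ET.fromstring(poem)
stanza_type = stanza.get("type")                          # kind of line group
verse_lines = [line.text for line in stanza.findall("l")]  # one entry per verse line

print(stanza_type, len(verse_lines))
```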

69 of 91

Document structure: long prose (parts, chapters etc.)

The beginning of ‘War and Peace’ in TEI:

70 of 91

Document structure: long prose

<div> — any structural division of a prose work (part, chapter, section etc.)

<p> — a prose paragraph

<s> — a sentence-like unit (‘s-unit’)

71 of 91

Document structure: drama (acts, scenes etc.)

Beginning of Lessing’s Emilia Galotti in TEI:

72 of 91

Document structure: drama

<sp> — speech instance

@who — character ID attribute

<speaker> — the in-text label naming the character who is about to speak

<stage> — stage direction (‘scenic remark’)
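To connect these tags back to the Marlowe excerpt from the beginning, here is a hedged sketch of how a shortened version of that scene might be encoded and then queried. The `who` values (`#charles`, `#navarre`) and the `type="scene"` value are hypothetical, the TEI namespace is omitted for brevity, and the speeches are cut to a few lines each.

```python
import xml.etree.ElementTree as ET
from collections import Counter

scene = """<div type="scene">
<stage>Enter Charles the French King, the Queene Mother, the King of Nauarre, the Prince of Condye, the Lord high Admirall, and the Queene of Nauarre, with others.</stage>
<sp who="#charles">
<speaker>Charles</speaker>
<l>Prince of Nauarre my honourable brother,</l>
<l>Prince Condy, and my good Lord Admirall,</l>
</sp>
<sp who="#navarre">
<speaker>Nauar</speaker>
<l>The many fauours which your grace hath showne,</l>
<l>From time to time, but specially in this:</l>
<l>Shall binde me euer to your highnes will,</l>
</sp>
</div>"""

root = ET.fromstring(scene)

# Count verse lines per character, using the who attribute of each speech.
lines_per_character = Counter()
for sp in root.iter("sp"):
    lines_per_character[sp.get("who")] += len(sp.findall("l"))

print(dict(lines_per_character))
```

With the speeches marked up, a question such as ‘who speaks how much’ needs only a few lines of code, which is what makes TEI corpora ‘programmable’.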

73 of 91

Named entities and dates in TEI:

74 of 91

3. Metadata within the document

75 of 91

Metadata within the document:

78 of 91

Metadata in TEI

<teiHeader> — an element containing all metadata subelements

<sourceDesc> — sources info

<profileDesc> — content metadata info (genres, character features etc.)

…and many more, see ch. 2 of the guidelines: tei-c.org/release/doc/tei-p5-doc/en/html/HD.html

79 of 91

Not purely text documents:

80 of 91

Postcard metadata in TEI

81 of 91

Postcard content in TEI

82 of 91

Sender-receiver metadata

83 of 91

Practice

84 of 91

How to create an XML document

  1. Nothing stops you from writing XML in a simple notepad
  2. But it is too easy to break the tree structure there, and too hard to navigate without XML syntax highlighting
  3. Some editors (Notepad++, Sublime, OxygenXML, jEdit and some others) can fold and unfold tree elements
  4. Some more XML-tailored editors can also make manual tagging much easier (try selecting some text and pressing Cmd/Ctrl + E in jEdit or Oxygen and you’ll see)

85 of 91

Editors with XML syntax highlighting

  • Notepad++ (Windows)
  • Sublime (everywhere)
  • Atom (everywhere)
  • BBEdit (Mac OS)
  • Microsoft XML Notepad (Windows)
  • Code Browser (Linux, Windows)
  • Brackets (everywhere)

Two more XML-tailored & TEI-tagset-aware options:

  • jEdit (everywhere) — supports TEI (install the plugin), free, but quite buggy and cumbersome, sadly
  • OxygenXML (everywhere) — supports TEI, paid, works quite well

86 of 91

Let us see how to create and modify a TEI/XML document in

  • jEdit
  • OxygenXML
  • A generic code editor (I’ll use Sublime, but Notepad++, Atom and others work the same way)

87 of 91

Download the zip from Slack

Or here

88 of 91

We try it together now:

  • Here is a document with a short poem:

  • Here is the plain text of the document

You have to wait until it hurts, until it clangs in your ears like the bells of hell, until nothing else counts but it, until it is everything, until you can’t do anything else but.

then sit down and write or stand up and write but write no matter what the other people are doing, no matter what they will do to you.

Lay the line down, a party of one, what a party, swarmed by the light, the time of the times, out of the tips of your fingers.

Both are in your zip

Let us encode it in TEI

89 of 91

This is a TEI document stub for you (in the zip too)

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date when="2023-02-28">28 February 2023</date>
      </creation>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- REPLACE THIS WITH YOUR POEM -->
    </body>
  </text>
</TEI>
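One practical detail of this stub is the `xmlns` attribute on `<TEI>`: it puts every element in the TEI namespace, so a program has to ask for elements by their namespaced names. The sketch below is an illustration with Python's standard library; the prefix `tei` and the variable names are arbitrary choices, and the stub is condensed (processing instructions left out).

```python
import xml.etree.ElementTree as ET

# Map an arbitrary prefix to the namespace declared on <TEI>.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

stub = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt><title>Title</title></titleStmt>
<publicationStmt><p>Publication Information</p></publicationStmt>
<sourceDesc><p>Information about the source</p></sourceDesc>
</fileDesc>
<profileDesc>
<creation><date when="2023-02-28">28 February 2023</date></creation>
</profileDesc>
</teiHeader>
<text><body/></text>
</TEI>"""

root = ET.fromstring(stub)

# Metadata lookups must use the namespace prefix on every step.
title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS).text
created = root.find(".//tei:creation/tei:date", TEI_NS).get("when")

# The same lookup without the namespace silently finds nothing.
no_namespace = root.find("teiHeader")

print(title, created, no_namespace)
```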

90 of 91

This is what we want in the end
