Parliamentary corpora

Important

what (and how) is said in parliament has consequences
interesting for researchers, journalists, NGOs, active citizens

Easy

to get (published on the Web)
to disseminate (no copyright, no privacy protection)

Hard

interoperability (formatting, annotation)
comparability (time span, metadata)
language barrier

Needed

a high-quality, uniformly and richly encoded set of European parliamentary corpora


What is CLARIN?

CLARIN is a digital infrastructure offering data, tools and services to support research based on language resources

A distributed network of 70 centres with 24 member countries and 2 observers


What is ParlaMint?

CLARIN flagship project:

ParlaMint I (2020–2021)
ParlaMint II (2022–2023)

Main deliverable:

uniformly encoded transcriptions of speeches from European parliaments
covering at least 2015–2022
rich metadata
linguistically annotated
named entities
machine translated into English
openly available


ParlaMint I (2020–2021)

17 corpora, each compiled by the partner from the respective country
Available for download in CLARIN.SI repository and for analysis in CLARIN.SI noSketch Engine
Described in: Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darǵis, Orsolya Ring, Ruben van Heusden, Maarten Marx & Darja Fišer. The ParlaMint corpora of parliamentary proceedings. Language Resources & Evaluation 57:415–448 (2023). doi:10.1007/s10579-021-09574-0.


ParlaMint II (2022–2023)

WP1: Documentation, interoperability, metadata (Lead: Tomaž Erjavec (IJS), Matyáš Kopp (UFAL))

T1.1: Harmonization of encoding, T1.2: Git management, T1.3: Adding metadata

WP2: Corpus expansion (Lead: Tomaž Erjavec (IJS))

T2.1: Adding new corpora, T2.2: Extending existing corpora, T2.3: Data distribution

WP3: Corpus enrichment (Lead: Nikola Ljubešić (IJS) // Taja Kuzman (IJS), Paul Rayson (UCREL))

T3.1: Machine translation and semantic tagging, T3.2: Multimodality

WP4: Engagement activities (Lead: Darja Fišer (INZ), Çağrı Çöltekin (TUB))

T4.1: Tutorial, T4.2: Hackathon, T4.3: Shared task, T4.4: Showcases

WP5: Coordination (Lead: Maciej Ogrodniczuk (IPI-PAN), Petya Osenova (IICT-BAS))

T5.1: Management, T5.2: Dissemination, T5.3: External monitoring


Harmonization of encoding

Framework:

Text Encoding Initiative (TEI) Guidelines

Parla-CLARIN encoding schema:

a general TEI customisation for parliamentary corpora

ParlaMint encoding schema:

compatible with, but much stricter than, Parla-CLARIN

schema specification
defines 111 TEI elements
encoding guidelines


Element documentation


Git Management

Everything except complete corpora is on GitHub
GitHub Issues: main communication channel in ParlaMint II
GitHub Pages: encoding guidelines
GitHub Actions: validation of corpus samples

https://github.com/clarin-eric/ParlaMint


Added taxonomies

Unification and localisation of taxonomies


Added metadata

Added information on ministers, left-right political orientation from Wikipedia, and Chapel Hill Expert Survey (CHES) political orientation variables
Implemented merging TSV files into XML corpora
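
The TSV-to-XML merge can be illustrated with a minimal sketch. It is not the project's actual script: the column names (`party_id`, `orientation`), the `state` element with its `type` value, and the namespace-free sample header are assumptions made only for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical TSV: one row per party, with an illustrative orientation column.
TSV = "party_id\torientation\nparty.A\tLeft\nparty.B\tCentre-right\n"

# Hypothetical, simplified organisation list (no TEI namespace, for brevity).
XML = """<listOrg>
  <org xml:id="party.A"><orgName>Party A</orgName></org>
  <org xml:id="party.B"><orgName>Party B</orgName></org>
  <org xml:id="party.C"><orgName>Party C</orgName></org>
</listOrg>"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def merge_tsv_into_xml(tsv_text, xml_text):
    """Attach the TSV orientation value to each matching <org> element."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    orientation = {row["party_id"]: row["orientation"] for row in rows}
    root = ET.fromstring(xml_text)
    for org in root.iter("org"):
        value = orientation.get(org.get(XML_ID))
        if value is not None:
            # Assumed target encoding: a <state> child holding the value.
            state = ET.SubElement(org, "state", {"type": "politicalOrientation"})
            state.text = value
    return root


merged = merge_tsv_into_xml(TSV, XML)
print(ET.tostring(merged, encoding="unicode"))
```

Organisations without a TSV row (here `party.C`) are left untouched, so the merge can be re-run as the TSV files grow.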



Adding new corpora (17 + 12 = 29 countries and autonomous regions)

Austria
Basque Country
Bosnia and Herzegovina
Belgium
Bulgaria
Catalonia
Croatia
Czech Republic
Denmark
Estonia
Finland
France
Galicia
Greece
Hungary
Iceland
Italy
Latvia
Netherlands
Norway
Poland
Portugal
Serbia
Slovenia
Spain
Sweden
Turkey
Ukraine


Extending existing corpora


Preparing the corpora by partners

Each partner prepared their own corpus
Including linguistic annotation:

Tokenisation and sentence segmentation
Lemmatisation
Universal Dependencies morphological features
Universal Dependencies syntax
Named entities (4 classes)

Many different tools were used for linguistic annotation, most popular:

UDPipe (8 corpora)
CLASSLA-Stanza (5 corpora)
Stanza (4 corpora)


Validation and deployment

Validation was taken very seriously:

against ParlaMint RelaxNG schemas
against ParlaMint ODD schema
XSLT script for content validation
down-conversion reveals errors
using the corpus reveals further errors
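
Content validation catches problems that schema validation cannot, such as references that point to nothing. The project used an XSLT script for this; the sketch below shows the same idea in Python. It checks only one rule (every `who` attribute on an utterance must match a defined person), and the sample document is a simplified, namespace-free assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified sample: u2 refers to a speaker that is not defined.
SAMPLE = """<TEI>
  <listPerson>
    <person xml:id="SpeakerA"/>
    <person xml:id="SpeakerB"/>
  </listPerson>
  <body>
    <u xml:id="u1" who="#SpeakerA">First speech.</u>
    <u xml:id="u2" who="#SpeakerC">Second speech.</u>
    <u xml:id="u3" who="#SpeakerB">Third speech.</u>
  </body>
</TEI>"""


def find_dangling_speakers(xml_text):
    """Return (utterance id, reference) pairs whose speaker is undefined."""
    root = ET.fromstring(xml_text)
    known = {person.get(XML_ID) for person in root.iter("person")}
    problems = []
    for u in root.iter("u"):
        who = u.get("who")
        # References are written as "#id"; strip the leading hash to compare.
        if who is not None and who.lstrip("#") not in known:
            problems.append((u.get(XML_ID), who))
    return problems


for utterance_id, reference in find_dangling_speakers(SAMPLE):
    print(f"{utterance_id}: unknown speaker reference {reference}")
```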

Deployment pipeline:

adding TSV metadata
fixing noted but uncorrected mistakes
adding common metadata (e.g. titles, extents, tag-usage)
validation
down-conversion (txt, TSV metadata, CoNLL-U, vertical)
packing for a release
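
The down-conversion step can be sketched for a single sentence. The sketch assumes that tokens are `w` and `pc` elements carrying `lemma` and `msd` attributes, that `msd` holds `UPosTag=...` followed by morphological features, and that `join="right"` marks a missing following space. These conventions and the sample sentence are illustrative assumptions; syntax is not read, so the dependency columns stay empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence (no TEI namespace, for brevity).
SAMPLE = """<s xml:id="s1">
  <w lemma="thank" msd="UPosTag=VERB|Mood=Imp|VerbForm=Fin">Thank</w>
  <w lemma="you" msd="UPosTag=PRON" join="right">you</w>
  <pc msd="UPosTag=PUNCT">.</pc>
</s>"""


def sentence_to_conllu(xml_text):
    """Convert one annotated sentence into CoNLL-U-style token lines."""
    sentence = ET.fromstring(xml_text)
    lines = []
    for index, token in enumerate(sentence, start=1):
        # Split "Key=Value|Key=Value" into a dictionary of features.
        pairs = (item.split("=", 1) for item in token.get("msd", "").split("|") if "=" in item)
        feats = dict(pairs)
        upos = feats.pop("UPosTag", "_")
        lemma = token.get("lemma", token.text)
        morph = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
        misc = "SpaceAfter=No" if token.get("join") == "right" else "_"
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        columns = [str(index), token.text, lemma, upos, "_", morph, "_", "_", "_", misc]
        lines.append("\t".join(columns))
    return "\n".join(lines)


print(sentence_to_conllu(SAMPLE))
```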


Data distribution

Download of version 4.0 (CC BY) from the CLARIN.SI repository:

TEI “plain text” format
with added linguistic annotation
machine translated into English
samples and environment: ParlaMint GitHub

Analysis in the following tools:

noSketch Engine (without and with login)
KonText
TEITOK (Maarten Janssen @ UFAL)


TEITOK - Corpus description


TEITOK - People view


TEITOK - Organizations view


TEITOK - Transcriptions view


TEITOK - Corpus search


Machine translation

Pre-trained OPUS-MT models
Cover all ParlaMint languages
Post-processing to correct translations of proper names
The English output contains typical NMT errors, including factual errors!


USAS Semantic tagging

Key semantic categories used by female speakers in all parliaments


Multimodality

Four pilot speech corpora

Complicated:

transcripts have different order from recordings
not all recordings are transcribed or made public
transcripts only vaguely correspond to the speeches
thousands of hours of recordings and tens of millions of words of transcripts


Engagement activities

Tutorials on usage of ParlaMint corpora

Digital Humanities conference 2023, Graz
European Summer University in Digital Humanities 2023, Cluj

ParlaMint tasks in the scope of Helsinki Digital Humanities

Networks of power: gender analysis in selected European parliaments (2022)
Splitting lips: polarization through parliamentary speech (2023)
Echoes of the chambers: studying democracy through parliamentary speeches (2024)

Shared task using ParlaMint corpora held at CLEF 2024

Ideology and Power Identification in Parliamentary Debates


SHOWCASE 1: Networks of power


SHOWCASE 2: Emotions running high


Current work

V4.1 just out:

fixes many bugs and restructures the GitHub repository
Danish: linguistically re-annotated, speeches marked with topics
Portuguese: extended until March 2024
Ukrainian: extended until November 2023, improved uk vs. ru language marking on speech segments

Submitting a paper to Language Resources and Evaluation (LREV)


Future directions

ParlaMint4Ever

“How to make your own ParlaMint corpus” tutorial
ParlaMint Light corpora
Centralising and automating corpus construction

Community project ideas

Extend country and time coverage (contemporary as well as historical debates)
Link to other databases

Comparative Agendas Project, a coding schema of 21 topics for political agendas
PartyFacts metadatabase on political party metadata
V-DEM surveys on the state of democracies

Link to other types of documents (e.g. voting results, legislation, party manifestos)

Beyond parliamentary data

Framework for interoperable corpora production
