1 of 30

From Democracy to Data and Back

ParlaMint

June 27 2024 | CLaDA-BG, Sofia

Darja Fišer

CLARIN ERIC

2 of 30

Parliamentary corpora

  • Important
    • what (and how) is said in parliament has consequences
    • interesting for researchers, journalists, NGOs, active citizens
  • Easy
    • to get (published on the Web)
    • to disseminate (no copyright, no privacy protection)
  • Hard
    • interoperability (formatting, annotation)
    • comparability (time span, metadata)
    • language barrier
  • Needed
    • a high quality, uniformly and richly encoded set of European parliamentary corpora

3 of 30

What is CLARIN?

  • CLARIN is a digital infrastructure offering data, tools and services to support research based on language resources

  • A distributed network of 70 centres with 24 member countries and 2 observers

4 of 30

What is ParlaMint?

CLARIN flagship project:

  • ParlaMint I (2020–2021)
  • ParlaMint II (2022–2023)

Main deliverable:

  • uniformly encoded transcriptions of speeches from European parliaments
  • covering at least 2015–2022
  • rich metadata
  • linguistically annotated
  • named entities
  • machine translated into English
  • openly available

5 of 30

ParlaMint I (2020–2021)

  • 17 corpora, each compiled by the partner from the respective country
  • Available for download in the CLARIN.SI repository and for analysis in the CLARIN.SI noSketch Engine
  • Described in: Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx & Darja Fišer. The ParlaMint corpora of parliamentary proceedings. Language Resources & Evaluation 57:415–448 (2023). 10.1007/s10579-021-09574-0.

6 of 30

ParlaMint II (2022–2023)

WP1: Documentation, interoperability, metadata (Lead: Tomaž Erjavec (IJS), Matyáš Kopp (UFAL))

  • T1.1: Harmonization of encoding, T1.2: Git management, T1.3: Adding metadata

WP2: Corpus expansion (Lead: Tomaž Erjavec (IJS))

  • T2.1: Adding new corpora, T2.2: Extending existing corpora, T2.3: Data distribution

WP3: Corpus enrichment (Lead: Nikola Ljubešić (IJS) // Taja Kuzman (IJS), Paul Rayson (UCREL))

  • T3.1: Machine translation and semantic tagging, T3.2: Multimodality

WP4: Engagement activities (Lead: Darja Fišer (INZ), Çağrı Çöltekin (TUB))

  • T4.1: Tutorial, T4.2: Hackathon, T4.3: Shared task, T4.4: Showcases

WP5: Coordination (Lead: Maciej Ogrodniczuk (IPI-PAN), Petya Osenova (IICT-BAS))

  • T5.1: Management, T5.2: Dissemination, T5.3: External monitoring

7 of 30

Harmonization of encoding

  • Framework:
    • Text Encoding Initiative (TEI) Guidelines
  • Parla-CLARIN encoding schema:
    • a general TEI customisation for parliamentary corpora
  • ParlaMint encoding schema:
    • compatible with, but much stricter than, Parla-CLARIN
  • schema specification
  • defines 111 TEI elements
  • encoding guidelines

8 of 30

Element documentation

9 of 30

Git Management

  • Everything except complete corpora is on GitHub
  • GitHub Issues: main communication channel in ParlaMint II
  • GitHub Pages: encoding guidelines
  • GitHub Actions: validation of corpus samples

10 of 30

Added taxonomies

  • Unification and localisation of taxonomies

11 of 30

Added metadata

  • Added info on ministers, L-R political orientation from Wikipedia, Chapel Hill Expert Survey (CHES) political orientation variables
  • Implemented merging TSV files into XML corpora
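The TSV-to-XML merge can be pictured with a minimal sketch like the one below. It is not the project's actual script: the TSV column names (`party_id`, `orientation`), the toy `listOrg` fragment and the `state` element written into it are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)

# Toy inputs; the column names and the XML fragment are hypothetical.
TSV = "party_id\torientation\nparty.A\tCentre-left\nparty.B\tRight\n"
XML = f"""<listOrg xmlns="{TEI_NS}">
  <org xml:id="party.A"><orgName>Party A</orgName></org>
  <org xml:id="party.B"><orgName>Party B</orgName></org>
  <org xml:id="party.C"><orgName>Party C</orgName></org>
</listOrg>"""


def merge_tsv_into_xml(tsv_text, xml_text):
    """Attach one TSV value to every <org> whose xml:id matches a TSV row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = {row["party_id"]: row for row in reader}
    root = ET.fromstring(xml_text)
    merged = 0
    for org in root.iter(f"{{{TEI_NS}}}org"):
        row = rows.get(org.get(XML_ID))
        if row is None:
            continue  # no metadata available for this organisation
        # Placeholder element and attribute, not the ParlaMint schema.
        state = ET.SubElement(
            org, f"{{{TEI_NS}}}state", {"type": "politicalOrientation"}
        )
        state.text = row["orientation"]
        merged += 1
    return root, merged


root, merged = merge_tsv_into_xml(TSV, XML)
print(merged)  # number of organisations that received metadata
print(ET.tostring(root, encoding="unicode"))
```

Keeping the added metadata in TSV files lets it be edited in a spreadsheet, while the merge step keeps the XML corpus as the single distributed source.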

12 of 30

Adding new corpora (17 + 12 = 29 countries and autonomous regions)

Austria

Basque Country

Bosnia and Herzegovina

Belgium

Bulgaria

Catalonia

Croatia

Czech Republic

Denmark

Estonia

Finland

France

Galicia

Greece

Hungary

Iceland

Italy

Latvia

Netherlands

Norway

Poland

Portugal

Serbia

Slovenia

Spain

Sweden

Turkey

UK

Ukraine

13 of 30

Extending existing corpora

14 of 30

Preparing the corpora by partners

  • Each partner prepared their own corpus
  • Including linguistic annotation:
    • Tokenisation and sentence segmentation
    • Lemmatisation
    • Universal Dependencies morphological features
    • Universal Dependencies syntax
    • Named entities (4 classes)
  • Many different tools were used for linguistic annotation, most popular:
    • UDPipe (8 corpora)
    • CLASSLA-Stanza (5 corpora)
    • Stanza (4 corpora)
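To make the annotation layers concrete, the sketch below reads one sentence in CoNLL-U (one of the down-conversion formats) and exposes the lemma, Universal Dependencies features, syntactic head and relation of each token. The sample sentence is invented, and storing the named-entity label as `NER=...` in the MISC column is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    upos: str
    feats: dict
    head: int
    deprel: str
    misc: dict


def parse_kv(field):
    """Turn 'A=1|B=2' into a dict; '_' means the field is empty."""
    if field == "_":
        return {}
    return {k: v for k, _, v in (item.partition("=") for item in field.split("|"))}


def parse_conllu_sentence(block):
    """Parse the ten tab-separated CoNLL-U columns of one sentence."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append(
            Token(
                form=cols[1],
                lemma=cols[2],
                upos=cols[3],
                feats=parse_kv(cols[5]),
                head=int(cols[6]),
                deprel=cols[7],
                misc=parse_kv(cols[9]),
            )
        )
    return tokens


# Invented sentence; the NER=... key in MISC is an assumption.
SAMPLE = "\n".join([
    "# text = Sofia hosted.",
    "1\tSofia\tSofia\tPROPN\t_\tNumber=Sing\t2\tnsubj\t_\tNER=B-LOC",
    "2\thosted\thost\tVERB\t_\tTense=Past\t0\troot\t_\tSpaceAfter=No|NER=O",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\tNER=O",
])

for tok in parse_conllu_sentence(SAMPLE):
    print(tok.form, tok.lemma, tok.upos, tok.head, tok.deprel, tok.misc.get("NER"))
```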

15 of 30

Validation and deployment

Validation was taken very seriously:

  1. against ParlaMint RelaxNG schemas
  2. against ParlaMint ODD schema
  3. XSLT script for content validation
  4. down-conversion reveals errors
  5. /using the corpus reveals further errors/
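Content validation (step 3) catches problems that a schema cannot express, such as references that point nowhere. The project does this with XSLT; the sketch below swaps in plain Python to illustrate one such check. The specific rule (every `u` element must have a `who` attribute that resolves to a declared `person`) and the toy document, which is not schema-valid, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def check_speaker_references(xml_text):
    """Return a list of messages for utterances with a bad speaker reference."""
    root = ET.fromstring(xml_text)
    declared = {p.get(XML_ID) for p in root.iter(TEI + "person")}
    problems = []
    for u in root.iter(TEI + "u"):
        uid = u.get(XML_ID, "?")
        who = u.get("who")
        if who is None:
            problems.append(f"{uid}: missing @who")
        elif who.lstrip("#") not in declared:
            problems.append(f"{uid}: unknown speaker {who}")
    return problems


# Toy document: structurally simplified and not valid against any schema.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="SpeakerA"/>
  </listPerson>
  <u xml:id="u1" who="#SpeakerA">First speech.</u>
  <u xml:id="u2" who="#SpeakerB">Second speech.</u>
  <u xml:id="u3">Third speech.</u>
</TEI>"""

for message in check_speaker_references(SAMPLE):
    print(message)
# u2: unknown speaker #SpeakerB
# u3: missing @who
```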

Deployment pipeline:

  1. adding TSV metadata
  2. fixing noted but uncorrected mistakes
  3. adding common metadata (e.g. titles, extents, tag-usage)
  4. validation
  5. down-conversion (txt, TSV metadata, CoNLL-U, vertical)
  6. packing for a release
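As an illustration of the down-conversion step, the sketch below turns a linguistically annotated sentence into vertical format, the one-token-per-line input used by concordancers such as noSketch Engine. It assumes tokens are encoded as `w` and `pc` elements inside `s`, with the lemma in a `lemma` attribute; the output columns (form, lemma) are a simplification chosen for this example, not the project's actual column set.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def to_vertical(xml_text):
    """Emit one token per line (form TAB lemma), wrapped in <s> structures."""
    root = ET.fromstring(xml_text)
    lines = []
    for s in root.iter(TEI + "s"):
        lines.append("<s>")
        for tok in s:
            if tok.tag not in (TEI + "w", TEI + "pc"):
                continue  # ignore anything that is not a word or punctuation
            form = (tok.text or "").strip()
            lemma = tok.get("lemma", form)  # punctuation falls back to its form
            lines.append(f"{form}\t{lemma}")
        lines.append("</s>")
    return "\n".join(lines)


# Invented fragment; element and attribute names are assumptions.
SAMPLE = """<seg xmlns="http://www.tei-c.org/ns/1.0">
  <s><w lemma="thank">Thank</w><w lemma="you">you</w><pc>.</pc></s>
</seg>"""

print(to_vertical(SAMPLE))
```

Because every derived format is generated from the same XML source, an error that surfaces in a derived file can be traced back and fixed once in the source.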

16 of 30

Data distribution

  • Download: version 4.0 (CC BY) from the CLARIN.SI repository
  • Analysis in the following tools:
    • noSketch Engine without login and with login
    • KonText
    • TEITOK (Maarten Janssen @ UFAL)

17 of 30

TEITOK - Corpus description

18 of 30

TEITOK - People view

19 of 30

TEITOK - Organizations view

20 of 30

TEITOK - Transcriptions view

21 of 30

TEITOK - Corpus search

22 of 30

Machine translation

  • Pre-trained OPUS-MT models
  • Cover all ParlaMint languages
  • Post-processing to correct translations of proper names
  • The English translations contain typical NMT errors, including factual errors!

23 of 30

USAS Semantic tagging

Key semantic categories used by female speakers in all parliaments

24 of 30

Multimodality

Four pilot speech corpora

Complicated:

  • transcripts have different order from recordings
  • not all recordings are transcribed or made public
  • transcripts only vaguely correspond to the speeches
  • thousands of hours of recordings and tens of millions of words of transcripts

25 of 30

Engagement activities

  • Tutorials on usage of ParlaMint corpora
    • Digital Humanities conference 2023, Graz
    • European Summer University in Digital Humanities 2023, Cluj
  • ParlaMint tasks in the scope of Helsinki Digital Humanities
    • Networks of power: gender analysis in selected European parliaments (2022)
    • Splitting lips: polarization through parliamentary speech (2023)
    • Echoes of the chambers: studying democracy through parliamentary speeches (2024)
  • Shared task using ParlaMint corpora held at CLEF 2024
    • Ideology and Power Identification in Parliamentary Debates

26 of 30

SHOWCASE 1: Networks of power

27 of 30

SHOWCASE 2: Emotions running high

28 of 30

Current work

V4.1 just out:

  • fixes many bugs and restructures the GitHub repository
  • Danish: linguistically re-annotated, speeches marked with topics
  • Portuguese: extended until March 2024
  • Ukrainian: extended until November 2023, improved uk vs. ru language marking on speech segments

Submitting LREV paper:

29 of 30

Future directions

  • ParlaMint4Ever
    • “How to make your own ParlaMint corpus” tutorial
    • ParlaMint Light corpora
    • Centralising and automating corpus construction
  • Community project ideas
    • Extend country and time coverage (contemporary as well as historical debates)
    • Link to other databases
      • Comparative Agendas Project, a coding schema of 21 topics for political agendas
      • PartyFacts metadatabase on political party metadata
      • V-DEM surveys on the state of democracies
    • Link to other types of documents (e.g. voting results, legislation, party manifestos)
  • Beyond parliamentary data
    • Framework for interoperable corpora production

30 of 30

Thank you and see you at

https://www.clarin.eu/parlamint