1 of 20

MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

ACL 2023

July 11, 2023

Presenter:

David Ifeoluwa Adelani (@davlanade)

UCL, United Kingdom

2 of 20


3 of 20

Motivation

  • Part-of-Speech (POS) tagging is the process of assigning the most probable grammatical category (or tag) to each word (or token) in a sentence of a given natural language (see the small example below).
  • A fundamental step for many NLP applications, such as:
    • machine translation,
    • parsing,
    • text chunking,
    • speech recognition and synthesis,
    • spell and grammar checking.
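
For illustration only (the sentence and tags are an example, not drawn from the MasakhaPOS data), a POS tagger maps each token to one of the universal POS tags, e.g. in Python:

# Illustrative only: the kind of output a POS tagger produces,
# using the universal POS tag set from Universal Dependencies.
tagged_sentence = [
    ("The", "DET"), ("child", "NOUN"), ("likes", "VERB"),
    ("mangoes", "NOUN"), (".", "PUNCT"),
]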

3

Figure obtained from Joakim Nivre. Multilingual Dependency Parsing from Universal Dependencies to Sesame Street. In TSD, 2020.

There is no large-scale POS dataset for African languages; only one Sub-Saharan African language (Wolof) has training data in UD.

4 of 20

Contributions

  • We develop MasakhaPOS, the largest POS dataset for 20 typologically diverse African languages.
  • We highlight the challenges of annotating POS for these languages under the Universal Dependencies (UD) guidelines.
  • We explore better ways to extend POS tagging to unseen languages by leveraging cross-lingual transfer methods and geographically/syntactically related source languages.
  • We develop POS taggers using CRF and multilingual pre-trained language models.


5 of 20

Languages and data size

The annotated corpus is drawn from the news domain.

NC: Niger-Congo


[Map of Africa with the focus languages labelled: Hausa, Kiswahili, chiShona, Setswana, isiXhosa, isiZulu, Chichewa, Dholuo, Luganda, Kinyarwanda, Ghomala, Wolof, Mossi, Twi, Ewe, Fon, Yoruba, Igbo, Naija]

Approximate data split: 800 / 100 / 600 train / dev / test sentences.

6 of 20

Annotation methodology

  • The annotation guide is based on Universal Dependencies (UD)

  • Number of part-of-speech tags: 17


Annotation workflow:

  1. Manually label {n=100} sentences.
  2. Train a POS tagger (RemBERT) on the manually labelled sentences.
  3. Automatically label {n=1400} sentences with the tagger.
  4. Fix the incorrect tags.

https://universaldependencies.org/u/pos/
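
A minimal sketch of the bootstrapping step above, assuming a RemBERT tagger already fine-tuned on the manually labelled seed sentences (the checkpoint name and helper function are hypothetical):

# Hypothetical pre-annotation helper; the checkpoint name is illustrative.
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="rembert-pos-seed",        # hypothetical RemBERT checkpoint trained on the ~100 seed sentences
    aggregation_strategy="simple",
)

def pre_annotate(sentences):
    """Propose POS tags automatically; annotators then fix the incorrect ones."""
    for sent in sentences:
        preds = tagger(sent)
        yield [(p["word"], p["entity_group"]) for p in preds]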

7 of 20

Challenges in annotating POS with UD guidelines

  1. Tokenization and word segmentation
    • Definition of a word
      • chiShona word: ndakazomuona
      • In English: I eventually saw him
    • Clitics (see the CoNLL-U sketch after this list)
      • Wolof words: ci ab -> cib
      • In English: in a
    • One unit or multitoken words?
      • In Setswana: ngwana yo o ratang
      • In English: the child who likes
  2. POS ambiguities
    • Verb or conjunction?
      • Example 1:
        • Yorùbá: Olú gbàgbé Bolá tí jàde
        • English: Olu forgot that Bola has gone
      • Example 2:
        • Yorùbá: Olú wọn wá
        • English: Olu said they came
    • Adjective or verb?
    • Adverb or particle?
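
For illustration only (the lemmas and tags below are assumptions, not taken from the MasakhaPOS data): UD handles contractions such as Wolof cib with a multiword-token line whose ID is a range; the surface form carries no tag, while the underlying syntactic words ci and ab are tagged separately.

1-2   cib   _    _
1     ci    ci   ADP
2     ab    ab   DET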

8 of 20

Baseline models

  • Conditional Random Field (CRF)
  • Multilingual pre-trained language models (see the fine-tuning sketch below)
    • Massively multilingual (≥ 100 languages; covers 2-8 of the focus languages)
      • mBERT
      • XLM-R
      • RemBERT
    • African-centric (pre-trained on 11-23 African languages; covers 6-20 of the focus languages)
      • AfriBERTa
      • AfroLM
      • AfroXLMR-base/large
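
As a rough sketch of the PLM baselines (the model name is one plausible choice, not the exact training configuration used), each model is fine-tuned with a token-classification head over the 17 universal POS tags:

# Sketch only: a token-classification head over the 17 universal POS tags.
from transformers import AutoTokenizer, AutoModelForTokenClassification

UD_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
           "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

model_name = "Davlan/afro-xlmr-base"   # one of the African-centric PLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(UD_TAGS),
    id2label=dict(enumerate(UD_TAGS)),
    label2id={tag: i for i, tag in enumerate(UD_TAGS)},
)
# Fine-tuning then follows the standard token-classification recipe:
# align word-level tags to subword tokens and train with cross-entropy.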

9 of 20

Baseline results

Fully supervised setting: 800 training sentences

Multilingual pre-trained language models provide better performance than the CRF baseline.


10 of 20

Baseline results

Fully supervised setting: 800 training sentences

Larger PLMs lead to a further boost in performance.


11 of 20

Baseline results

Fully supervised setting: 800 training sentences

African-centric PLMs that cover more African languages are more effective


12 of 20

Cross-lingual Transfer: Adapting to unseen languages

  • It is impossible to create benchmark POS datasets for all of the over 2,000 African languages.
  • We need effective methods for cross-lingual transfer.
  • How do we adapt existing POS datasets to MasakhaPOS?
  • Some tricks to improve cross-lingual transfer:
    • Identify an available dataset in another language (e.g., on UD),
      • preferably a language that is geographically and linguistically related to the target language.
    • Leverage available monolingual data in the target language for improved transfer,
      • especially with parameter-efficient transfer learning methods such as MAD-X.



14 of 20

Transfer results: on aggregate

Geographically close source languages: Afrikaans, Arabic, English, French, Naija, Romanian and Wolof

FT-Eval: Fine-tune PLM on a source language, perform zero-shot evaluation on the target language.

Parameter-efficient fine-tuning methods:

  • LT-SFT (Ansell et al., 2021)
  • MAD-X (Pfeiffer et al., 2020)


15 of 20

Transfer results: fine-grained analysis

How important is the source language when using MAD-X adaptation?

  • Source languages: English, Romanian, Wolof, and Shona

Evaluation on: Fon, Yoruba, and isiZulu


Multi-source adaptation is very effective

16 of 20

Conclusion

  • We developed MasakhaPOS, the largest POS dataset for 20 typologically diverse African languages.
  • We highlighted the challenges of annotating POS for these languages under the Universal Dependencies (UD) guidelines.
  • We developed POS taggers using CRF and multilingual pre-trained language models, in both fully supervised and cross-lingual transfer settings.


17 of 20

Thank you



18 of 20

BACKUP SLIDE


19 of 20

Parameter-Efficient Fine-tuning with Adapters


STEP 1: Train language adapters on every language of interest.

STEP 2: Train the task adapter together with the source language adapter; only the task adapter is updated, all other parameters are frozen.

STEP 3: Zero-shot transfer to the target language by replacing the source language adapter with the target language adapter, while keeping the task adapter.

Ruder 2022: NLP for African languages @Indaba
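
A minimal sketch of these three steps with the adapters library (the adapter hub IDs and adapter names below are assumptions, not the exact configuration used):

# Sketch of MAD-X-style transfer; hub adapter IDs and names are assumptions.
from adapters import AutoAdapterModel
import adapters.composition as ac

model = AutoAdapterModel.from_pretrained("xlm-roberta-base")

# STEP 1: language adapters (loaded here instead of trained, for brevity).
src_lang = model.load_adapter("en/wiki@ukp")   # source language adapter (assumed hub ID)
tgt_lang = model.load_adapter("sw/wiki@ukp")   # target language adapter (assumed hub ID)

# STEP 2: add a POS task adapter and tagging head; train only the task adapter.
model.add_adapter("pos")
model.add_tagging_head("pos", num_labels=17)
model.train_adapter("pos")                     # freezes everything except the task adapter
model.active_adapters = ac.Stack(src_lang, "pos")
# ... fine-tune on source-language POS data ...

# STEP 3: zero-shot transfer: swap in the target language adapter, keep the task adapter.
model.active_adapters = ac.Stack(tgt_lang, "pos")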

20 of 20

Lottery Ticket Sparse Fine-tunings (LT-SFT)

  • Based on the Lottery Ticket Hypothesis (LTH) (Frankle & Carbin, 2019), which states that each neural model contains a sub-network (a "winning ticket") that, if trained again in isolation, can reach or even surpass the performance of the original model.


Ansell et al. 2022. Composable Sparse Fine-Tuning for Cross-Lingual Transfer. In ACL 2022.
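
A simplified sketch of the idea behind LT-SFT, not the authors' implementation (the real method retrains only the selected parameters in a second phase, which is omitted here): after a first full fine-tuning pass, keep only the largest parameter changes as a sparse difference vector, and compose language and task difference vectors by adding them to the base model.

# Simplified sketch of LT-SFT-style sparse fine-tunings (PyTorch).
import torch

def select_ticket(base_state, tuned_state, keep_fraction=0.05):
    """Keep only the top-k most-changed parameters as a sparse difference vector."""
    diffs = {name: tuned_state[name] - base_state[name] for name in base_state}
    magnitudes = torch.cat([d.abs().flatten() for d in diffs.values()])
    k = max(1, int(keep_fraction * magnitudes.numel()))
    threshold = magnitudes.topk(k).values.min()
    return {name: torch.where(d.abs() >= threshold, d, torch.zeros_like(d))
            for name, d in diffs.items()}

def compose(base_state, *sparse_fts):
    """Compose language and task sparse fine-tunings by adding them to the base weights."""
    composed = {name: p.clone() for name, p in base_state.items()}
    for sft in sparse_fts:
        for name, delta in sft.items():
            composed[name] = composed[name] + delta
    return composed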