1 of 43

A Focus on Machine Translation for African Languages

Jade Abbott & Laura Martinus

2 of 43

DLIndaba 2018

September

Weekly Reading Sessions

Towards NMT For African Languages

ML4D Workshop, NeurIPS, December 2018

Benchmarking NMT for Southern African Languages

ACL 2019 June, WiNLP workshop

3 of 43

Machine Translation

Source Language (unjani) -> Machine Translation Model -> Target Language (how are you?)

4 of 43

Agglutinative

Low Resourced

Low Discoverability

Reproducibility

No Benchmarks

Lack of Focus

PROBLEMS FACING AFRICAN LANGUAGES

6 of 43

Lack of Focus

The general belief was that modernization was best achieved in an imported official language. The reason for this is that such a language is already widely used in science and technology and hence the experience gained in the use of the language can be copied, particularly through transfer of technology.

What is often ignored in this argument is that only a small part of the populace can be involved in a development strategy based on the use of an imported official language.

Ayo Bamgbose, 2011 ~ University of Ibadan
“African Languages Today: The Challenge of and Prospects for Empowerment under Globalization”

7 of 43

Submissions to WiNLP ACL 2019 By Country

8 of 43

Agglutinative

Low Resourced

Low Discoverability

Reproducibility

No Benchmarks

Lack of Focus

PROBLEMS FACING AFRICAN LANGUAGES

9 of 43

Reproducibility

only ones to publish code & data

10 of 43

Reproducibility

BLEU SCORE IS INCOMPARABLE ACROSS DATASETS

11 of 43

Reproducibility

BLEU SCORE IS INCOMPARABLE ACROSS DATASETS

WHAT DO WE NEED?

12 of 43

Reproducibility

BLEU SCORE IS INCOMPARABLE ACROSS DATASETS

WHAT DO WE NEED?

EVALUATION SETS
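
One reason BLEU is hard to compare is that the score depends on the test set and on preprocessing choices such as tokenization. The sketch below is a minimal, unsmoothed corpus-level BLEU with a single reference per sentence. It is an illustration, not the scorer used for the results in these slides. The function names and the toy sentences are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Toy data (hypothetical): the same system output, scored under two tokenizations.
ref = "the cat sat on the mat ."
hyp = "the cat sat on a mat ."
word_bleu = corpus_bleu([hyp.split()], [ref.split()])
char_bleu = corpus_bleu([list(hyp)], [list(ref)])
print(f"word-level BLEU: {word_bleu:.1f}")
print(f"char-level BLEU: {char_bleu:.1f}")
```

The two printed scores differ widely although the translation is identical, which is why a shared evaluation set and a fixed scoring setup are needed before numbers from different papers can be compared.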

13 of 43

Human Benchmarks

Afrikaans   50
isiZulu      8
N. Sotho    28
Setswana    20
Xitsonga    21

14 of 43

Agglutinative

Low Resourced

Low Discoverability

Bad Reproducibility

No Benchmarks

Lack of Focus

PROBLEMS FACING AFRICAN LANGUAGES

16 of 43

Problems Discovering Research?

  • Locked behind paywalls
  • Small linguistic journals
  • Obscure conferences
  • Tiny research groups in Europe
  • DOESN’T EXIST

Problems Discovering Data?

  • Proprietary
  • Hoarded by linguists in region
  • Tiny research groups in Europe
  • Not digitized

17 of 43

Where to Find More Data?

18 of 43

Abbey

Chitumbuka

Ewondo

Jula

Kuhane (Subiya)

Medumba

Pidgin (Cameroon)

Swati

Acholi

Chiyao

Fang

Kabiye

Kwangali

Mende

Pidgin (West Africa)

Tigrinya

Afrikaans

Chokwe

Fante

Kabuverdianu

Kwanyama

Moore

Rumanyo

Tiv

Ahanta

Chopi

Fe'fe'

Kabyle

Kyangonde

Mozambican Sign Language

Rutoro

Toupouri

Aja

Chuabo

Fon

Kalanga (Botswana)

Lamba

Nama

Sango

Tshiluba

Amharic

Cibemba

Frafra

Khana

Lhukonzo

Ndau

Sehwi

Tshwa

Angolan Sign Language

Cinamwanga

Fulfulde (Cameroon)

Kikamba

Limbum

Ndebele

Sena

Tsonga

Arabic (Morocco)

Cinyanja

Ga

Kikaonde

Lingala

Ndebele (Zimbabwe)

Senoufo (Cebaara)

Twi

Attié

Comorian (Ngazidja)

Ghanaian Sign Language

Kikongo

Loma

Ndonga

Sepedi

Twi (Asante)

Awing

Congolese Sign Language

Ghomálá'

Kikuyu

Lomwe

Ngangela

Sepulana

Ugandan Sign Language

Bafia

Dagaare

Gitonga

Kiluba

Luganda

Ngiemboon

Sesotho (Lesotho)

Umbundu

Bakoko

Dagbani

Gokana

Kimbundu

Lunda

Ngombale

Sesotho (South Africa)

Urhobo

Baoule

Damara

Gouro

Kinande

Luo

Nigerian Pidgin

Setswana

Uruund

Bassa (Cameroon)

Dangme

Gun

Kinyarwanda

Luvale

Nyaneka

Seychelles Creole

Venda

Bissau Guinean Creole

Douala

Guéré

Kipende

Macua

Nyungwe

Shona

Wolaita

Bété

Edo

Herero

Kirundi

Mahorian (Roman)

Nzema

Sidama

Xhosa

Changana (Mozambique)

Efik

Ibinda

Kisi

Makaa

Obolo

Silozi

Yacouba

Chichewa

Esan

Idoma

Kisonge

Mambwe-Lungu

Okpe

Somali

Yemba

Chitonga

Ethiopian Sign Language

Igbo

Kongo

Mauritian Creole

Ombamba

South African Sign Language

Yoruba

Chitonga (Malawi)

Eton

Igede

Kpelle

Mbo

Oromo

Swahili

Zambian Sign Language

Chitonga (Zimbabwe)

Ewe

Isoko

Krio

Mbukushu

Otetela

Swahili (Congo)

Zimbabwe Sign Language

Zulu

19 of 43

Where to Find More Data?

Governments

Corporations

Media Agencies

20 of 43

Agglutinative

Low Resourced

Low Discoverability

Bad Reproducibility

No Benchmarks

Lack of Focus

PROBLEMS FACING AFRICAN LANGUAGES

21 of 43

Agglutinative

22 of 43

Agglutinative

Special Tokenization Technique -> Byte Pair Encoding

Subword tokens: y@@, noma, s@@, w@@, ukuthi, on, ing, b@@, as, ng@@

(Sennrich et al., 2015)

['I', 'like', 'ea', 'ting', 'app', 'l', 'es!</w>']
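
A minimal sketch of how Byte Pair Encoding learns its merges, in the spirit of Sennrich et al.: start from characters, then repeatedly merge the most frequent adjacent pair. The toy word counts, the function names, and the tie-breaking rule are assumptions made for this example. Real toolkits differ in details, such as the `@@` continuation marker shown above versus the `</w>` end-of-word marker used here.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols


# Toy corpus (hypothetical counts).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_counts, num_merges=3)
print(apply_bpe("lowest", merges))  # -> ['l', 'o', 'w', 'est</w>']
```

The unseen word "lowest" is still segmented into known pieces, which is the property that makes subword units attractive for agglutinative languages with many rare, long word forms.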

23 of 43

Agglutinative

Low Resourced

Low Discoverability

Bad Reproducibility

No Benchmarks

Lack of Focus

PROBLEMS FACING AFRICAN LANGUAGES

24 of 43

What does it mean to be low-resourced?

Incomplete or no dictionaries

Little or no parallel corpora

Little or no monolingual corpora

25 of 43

  • Model Choice
  • Tokenization
  • Use Monolingual Corpora
  • Data Augmentation
  • Transfer Learning
  • Incorporate Linguistic Information
  • Mad hacks

SOLVING FOR LOW RESOURCE LANGUAGES

27 of 43

Model Choice

Statistical Machine Translation

28 of 43

Model Choice

Neural Machine Translation

(Vaswani et al., 2017, Attention Is All You Need)

Also check out: http://jalammar.github.io/illustrated-transformer/

29 of 43

SMT > NMT

30 of 43

SMT < Properly optimized NMT

(Sennrich & Zhang, 2019, Revisiting Low-Resource Neural Machine Translation)

31 of 43

Our SMT vs NMT Experiments

                     Afrikaans  isiZulu  N. Sotho  Setswana  Xitsonga
# sentences             53 172   26 728    30 777   123 868   192 587
SMT BLEU                 21.37     2.32     10.48      7.47     10.02
Optimized NMT BLEU       20.60     1.34     10.94     15.60     17.98
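
The comparison is easier to read as the BLEU difference (optimized NMT minus SMT) per language, ordered by corpus size. The sketch below only rearranges the numbers reported on this slide. The helper name `bleu_gains` is an assumption made for this example.

```python
# (number of sentences, SMT BLEU, optimized NMT BLEU) as reported on this slide
results = {
    "Afrikaans": (53172, 21.37, 20.60),
    "isiZulu": (26728, 2.32, 1.34),
    "N. Sotho": (30777, 10.48, 10.94),
    "Setswana": (123868, 7.47, 15.60),
    "Xitsonga": (192587, 10.02, 17.98),
}


def bleu_gains(table):
    """Return (language, sentences, NMT minus SMT BLEU), smallest corpus first."""
    rows = sorted(table.items(), key=lambda item: item[1][0])
    return [(lang, n, round(nmt - smt, 2)) for lang, (n, smt, nmt) in rows]


gains = bleu_gains(results)
for lang, n, delta in gains:
    print(f"{lang:<10} {n:>8} sentences  NMT - SMT = {delta:+.2f}")
```

In this table the two largest corpora (Setswana and Xitsonga) show gains of about 8 BLEU for NMT, while the three smaller ones are within about 1 BLEU either way. With only five language pairs this is suggestive rather than conclusive.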

32 of 43

  • Model Choice
  • Tokenization
  • Use Monolingual Corpora
  • Data Augmentation
  • Transfer Learning
  • Incorporate Linguistic Information
  • Mad hacks

SOLVING FOR LOW RESOURCE LANGUAGES

34 of 43

Dictionary -> Multilingual Embeddings

Word Translation Without Parallel Data (Conneau et al., 2017)

35 of 43

Monolingual Corpora -> Unsupervised NMT

(Artetxe et al., 2017)

(Lample et al., 2018)

(Artetxe et al., 2019)

36 of 43

Monolingual Corpora -> Unsupervised NMT

37 of 43

Monolingual Corpora -> Unsupervised NMT

Our experiments for English-to-Zulu BLEU

Algorithm          BLEU
NMT                2.32
Unsupervised MT    4.45
Human Benchmark    8.5

38 of 43

  • Model Choice
  • Tokenization
  • Use Monolingual Corpora
  • Data Augmentation
  • Transfer Learning
  • Incorporate Linguistic Information
  • Mad hacks

SOLVING FOR LOW RESOURCE LANGUAGES

39 of 43

  1. Publish code on GitHub
  2. Publish datasets on OPUS/GitHub
  3. Publish results at conferences/workshops (even early results)

Best Practices for African NMT

40 of 43

41 of 43

42 of 43

Submissions to WiNLP ACL 2019 By Country

43 of 43

CALL TO ACTION

Let’s put Africa on the NMT map!

Train an NMT model for your own language data

Let’s get Africa published at ACL

https://github.com/jaderabbit/masakhane

masakhanetranslation@gmail.com