A Focus on Machine Translation for African Languages
Jade Abbott & Laura Martinus
DLIndaba 2018
September
Weekly Reading Sessions
Towards NMT For African Languages, ML4D Workshop
NeurIPS, December 2018
Benchmarking NMT for Southern African Languages
ACL 2019, June, WiNLP workshop
Machine
Translation
Model
Source Language
Target Language
unjani
how are you?
Machine Translation
PROBLEMS FACING AFRICAN LANGUAGES
Agglutinative
Low Resourced
Low Discoverability
Reproducibility
No Benchmarks
Lack of Focus
Lack of Focus
The general belief was that modernization was best achieved in an imported official language. The reason for this is that such a language is already widely used in science and technology and hence the experience gained in the use of the language can be copied, particularly through transfer of technology.
What is often ignored in this argument is that only a small part of the populace can be involved in a development strategy based on the use of an imported official language.
Ayo Bamgbose, 2011 ~ University of Ibadan
“African Languages Today: The Challenge of and Prospects for Empowerment under Globalization”
Submissions to WiNLP ACL 2019 By Country
Reproducibility
only ones to publish
code & data
Reproducibility
BLEU SCORE IS INCOMPARABLE ACROSS DATASETS
WHAT DO WE NEED?
EVALUATION SETS
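To make the point about comparability concrete, here is a minimal sketch of corpus-level BLEU in plain Python. It is an illustration only, not the scoring script used for the results in this talk. It assumes whitespace tokenization, a single reference per sentence, and no smoothing; the function names `ngrams` and `corpus_bleu` are made up for this example. The score is computed entirely from the hypothesis/reference pairs passed in, so changing the test set or the tokenization changes the number even when the model stays the same.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption for this sketch: sentences are tokenized on whitespace.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped matches: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["the cat sat on the mat"]
    print(corpus_bleu(["the cat sat on the mat"], reference))
    print(corpus_bleu(["the cat sat on a mat"], reference))
```

Because every term depends on the reference sentences, two papers reporting BLEU on different test sets are not measuring the same thing, which is the case for shared evaluation sets.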
Human Benchmarks
Afrikaans | isiZulu | N. Sotho | Setswana | Xitsonga |
50 | 8 | 28 | 20 | 21 |
Agglutinative
Low Resourced
Low Discoverability
Bad Reproducibility
No Benchmarks
Lack of Focus
PROBLEMS FACING AFRICAN LANGUAGES
Problems Discovering Research?
Problems Discovering Data?
Where to Find More Data?
Abbey | Chitumbuka | Ewondo | Jula | Kuhane (Subiya) | Medumba | Pidgin (Cameroon) | Swati |
Acholi | Chiyao | Fang | Kabiye | Kwangali | Mende | Pidgin (West Africa) | Tigrinya |
Afrikaans | Chokwe | Fante | Kabuverdianu | Kwanyama | Moore | Rumanyo | Tiv |
Ahanta | Chopi | Fe'fe' | Kabyle | Kyangonde | Mozambican Sign Language | Rutoro | Toupouri |
Aja | Chuabo | Fon | Kalanga (Botswana) | Lamba | Nama | Sango | Tshiluba |
Amharic | Cibemba | Frafra | Khana | Lhukonzo | Ndau | Sehwi | Tshwa |
Angolan Sign Language | Cinamwanga | Fulfulde (Cameroon) | Kikamba | Limbum | Ndebele | Sena | Tsonga |
Arabic (Morocco) | Cinyanja | Ga | Kikaonde | Lingala | Ndebele (Zimbabwe) | Senoufo (Cebaara) | Twi |
Attié | Comorian (Ngazidja) | Ghanaian Sign Language | Kikongo | Loma | Ndonga | Sepedi | Twi (Asante) |
Awing | Congolese Sign Language | Ghomálá' | Kikuyu | Lomwe | Ngangela | Sepulana | Ugandan Sign Language |
Bafia | Dagaare | Gitonga | Kiluba | Luganda | Ngiemboon | Sesotho (Lesotho) | Umbundu |
Bakoko | Dagbani | Gokana | Kimbundu | Lunda | Ngombale | Sesotho (South Africa) | Urhobo |
Baoule | Damara | Gouro | Kinande | Luo | Nigerian Pidgin | Setswana | Uruund |
Bassa (Cameroon) | Dangme | Gun | Kinyarwanda | Luvale | Nyaneka | Seychelles Creole | Venda |
Bissau Guinean Creole | Douala | Guéré | Kipende | Macua | Nyungwe | Shona | Wolaita |
Bété | Edo | Herero | Kirundi | Mahorian (Roman) | Nzema | Sidama | Xhosa |
Changana (Mozambique) | Efik | Ibinda | Kisi | Makaa | Obolo | Silozi | Yacouba |
Chichewa | Esan | Idoma | Kisonge | Mambwe-Lungu | Okpe | Somali | Yemba |
Chitonga | Ethiopian Sign Language | Igbo | Kongo | Mauritian Creole | Ombamba | South African Sign Language | Yoruba |
Chitonga (Malawi) | Eton | Igede | Kpelle | Mbo | Oromo | Swahili | Zambian Sign Language |
Chitonga (Zimbabwe) | Ewe | Isoko | Krio | Mbukushu | Otetela | Swahili (Congo) | Zimbabwe Sign Language |
| | | | | | | Zulu |
Where to Find More Data?
Governments
Corporations
Media Agencies
Agglutinative
Special Tokenization Technique -> Byte Pair Encoding
y@@ | noma | s@@ | w@@ | ukuthi | on | ing | b@@ | as | ng@@ |
(Sennrich et al., 2015)
['I', 'like', 'ea', 'ting', 'app', 'l', 'es!</w>']
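As a rough sketch of how Byte Pair Encoding learns subword units like the ones above: start from characters, then repeatedly merge the most frequent adjacent pair of symbols. The code below follows the general shape of the algorithm described by Sennrich et al., but it is a simplified illustration, not the tokenizer used in our experiments. The function names and the toy word frequencies are assumptions made for this example. The `@@` suffix and the `</w>` marker are two common ways of recording where words begin and end; this sketch uses `</w>`.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite the vocabulary with every occurrence of `pair` joined into one symbol."""
    # The lookarounds make sure we only match whole symbols, not parts of them.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> frequency dict."""
    # Each word starts as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


if __name__ == "__main__":
    # Toy frequencies, purely illustrative.
    toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, vocab = learn_bpe(toy_corpus, num_merges=5)
    print(merges)
    print(vocab)
```

For agglutinative languages the appeal is that frequent morphemes tend to end up as single units, while rare long words are still representable as a sequence of smaller pieces instead of falling out of the vocabulary.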
What does it mean to be low-resourced?
Incomplete or no dictionaries
Little or no parallel corpora
Little or no monolingual corpora
SOLVING FOR LOW RESOURCE LANGUAGES
Model Choice
Statistical Machine Translation
Model Choice
Neural Machine Translation
(Vaswani et al., 2017, Attention Is All You Need)
Also check out: http://jalammar.github.io/illustrated-transformer/
SMT > NMT
SMT < Properly optimized NMT
(Sennrich & Zhang, 2019, Revisiting Low-Resource Neural Machine Translation)
Our SMT vs NMT Experiments
| Afrikaans | isiZulu | N. Sotho | Setswana | Xitsonga |
# sentences | 53 172 | 26 728 | 30 777 | 123 868 | 192 587 |
SMT BLEU | 21.37 | 2.32 | 10.48 | 7.47 | 10.02 |
Optimized NMT BLEU | 20.60 | 1.34 | 10.94 | 15.60 | 17.98 |
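One way to read the table is to sort the languages by corpus size and look at the BLEU difference between optimized NMT and SMT. The short sketch below does only that arithmetic. The numbers are copied from the table above; the function name `bleu_gains` and the dictionary layout are assumptions for this example, and the sketch makes no claim about why the differences arise.

```python
# Values copied from the SMT vs NMT table above.
RESULTS = {
    "Afrikaans": {"sentences": 53172, "smt": 21.37, "nmt": 20.60},
    "isiZulu": {"sentences": 26728, "smt": 2.32, "nmt": 1.34},
    "N. Sotho": {"sentences": 30777, "smt": 10.48, "nmt": 10.94},
    "Setswana": {"sentences": 123868, "smt": 7.47, "nmt": 15.60},
    "Xitsonga": {"sentences": 192587, "smt": 10.02, "nmt": 17.98},
}


def bleu_gains(results):
    """Return (language, sentences, NMT minus SMT BLEU), smallest corpus first."""
    rows = [
        (language, row["sentences"], row["nmt"] - row["smt"])
        for language, row in results.items()
    ]
    return sorted(rows, key=lambda item: item[1])


if __name__ == "__main__":
    for language, sentences, gain in bleu_gains(RESULTS):
        print(f"{language:<10} {sentences:>8} sentences  NMT - SMT = {gain:+.2f} BLEU")
```

On these five language pairs, the two largest corpora (Setswana and Xitsonga) show the largest gains for NMT, while the three smaller ones are within about one BLEU point either way.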
SOLVING FOR LOW RESOURCE LANGUAGES
Dictionary -> Multilingual Embeddings
Word Translation Without Parallel Data (Conneau et al., 2017)
Monolingual Corpora -> Unsupervised NMT
(Artetxe et al., 2017)
(Lample et al., 2018)
(Artetxe et al., 2019)
Monolingual Corpora -> Unsupervised NMT
Our experiments for English-to-Zulu BLEU
Algorithm | BLEU |
NMT | 2.32 |
Unsupervised MT | 4.45 |
Human Benchmark | 8.5 |
Best Practices for African NMT
Evaluation Set: https://repo.sadilar.org/handle/20.500.12185/506
Code & Data: https://github.com/LauraMartinus/ukuxhumana
Publication: https://www.aclweb.org/anthology/papers/W/W19/W19-3632/
Website: https://ethionlp.github.io/
Data: On its way!
Publication: https://www.aclweb.org/anthology/papers/W/W19/W19-3611/
Submissions to WiNLP ACL 2019 By Country
CALL TO ACTION
Let’s put Africa on the NMT map!
Train an NMT model for your own language data
Let’s get Africa published at ACL
https://github.com/jaderabbit/masakhane
masakhanetranslation@gmail.com