1 of 11

GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks

Jonathan Kabala Mbuya

Antonios Anastasopoulos

{jmbuya, antonis}@gmu.edu

2 of 11

Motivation

https://mitratranslations.com/en/tag/website-language-vesion/

Communication is at the core of human societies

Over 7100 languages in the world

Languages play a key part in communication

Translation helps with cross-language communication

Machine & Speech Translation can help

3 of 11

Challenges

Often oral
No standard written system
Data scarcity (labeled data)

Dialectal and Low-resource Speech Translation

Potential Solution

Use unlabeled data for pretraining
Explore self-supervised speech translation models

Machine Translation

Speech Translation

4 of 11

IWSLT 2023 Tasks

5 of the 6 available Low-resource Tasks

1 Dialectal Task

Task	Train Set Hours	Task type
Irish to English	11	Low-resource
Marathi to Hindi	15.3	Low-resource
Pashto to French	61	Low-resource
Tamasheq to French	17	Low-resource
Quechua to Spanish	1.6	Low-resource
Tunisian Arabic to English	160	Dialectal

Additional Data

Constrained

234 hours of Tamasheq audio only

Unconstrained

Librispeech: 960 hours of English audio only
MuST-C English to French (492 hours)
XLSR-53 data (53 languages)

5 of 11

Proposed Methods: Baseline Models

End-to-end speech translation (E2E)

End-to-end speech translation with ASR encoder initialization (E2E-ASR)
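
The E2E-ASR baseline initializes the speech translation encoder from an ASR model. A minimal sketch of that initialization step follows. It assumes both models expose their parameters as name-to-value mappings (as a PyTorch `state_dict` does) and that encoder parameters share an `encoder.` prefix. The function name, the prefix, and the toy parameter names are hypothetical and are not taken from the GMU systems.

```python
from typing import Dict, List, Tuple


def init_encoder_from_asr(
    st_params: Dict[str, List[float]],
    asr_params: Dict[str, List[float]],
    encoder_prefix: str = "encoder.",  # assumed naming convention
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Return a copy of st_params whose encoder entries come from asr_params.

    Only names that start with encoder_prefix and exist in both models are
    copied; everything else (e.g. the translation decoder) keeps its value.
    """
    initialized = dict(st_params)
    copied = []
    for name, value in asr_params.items():
        if name.startswith(encoder_prefix) and name in initialized:
            initialized[name] = list(value)  # copy so the models do not share lists
            copied.append(name)
    return initialized, copied


# Toy parameters standing in for real checkpoints (hypothetical names).
asr_params = {"encoder.layer0.weight": [0.5, -0.2], "decoder.ctc.weight": [0.1]}
st_params = {"encoder.layer0.weight": [0.0, 0.0], "decoder.layer0.weight": [0.0]}

st_init, copied = init_encoder_from_asr(st_params, asr_params)
print(copied)
```

The ASR output layer is deliberately left behind here: only the encoder is reused, and the translation decoder is trained from its own initialization.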

6 of 11

Proposed Methods: Using Self-Supervised Speech Models

Wav2vec 2.0: W2V-E2E

Hubert: Hubert-E2E

XLSR-53: XLSR-E2E

7 of 11

Layer Removal Results on Wav2vec 2.0

Based on Pasad et al., 2022
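
Layer removal can be illustrated with a small sketch: drop the top (last) layers of the pretrained encoder stack, then fine-tune what remains. The code below is a toy stand-in under stated assumptions, not the authors' implementation. `EncoderStack`, `remove_top_layers`, and the 12-layer depth are hypothetical, and each "layer" is a plain function rather than a Transformer block.

```python
from typing import Callable, List

Layer = Callable[[float], float]


class EncoderStack:
    """Toy stand-in for a stack of Transformer encoder layers."""

    def __init__(self, layers: List[Layer]) -> None:
        self.layers = list(layers)

    def forward(self, x: float) -> float:
        # Apply the layers bottom to top; the last layer is the "top" one.
        for layer in self.layers:
            x = layer(x)
        return x


def remove_top_layers(encoder: EncoderStack, num_removed: int) -> EncoderStack:
    """Return a new stack without the last num_removed layers."""
    if not 0 <= num_removed < len(encoder.layers):
        raise ValueError("num_removed must leave at least one layer")
    kept = encoder.layers[: len(encoder.layers) - num_removed]
    return EncoderStack(kept)


def add_one(x: float) -> float:
    return x + 1.0


full = EncoderStack([add_one] * 12)      # assumed 12-layer encoder
truncated = remove_top_layers(full, 3)   # drop the top 3 layers
print(len(truncated.layers))
```

In a real system the same slicing would be applied to the pretrained model's list of encoder layers before fine-tuning on the speech translation task. Passing `num_removed=0` leaves the stack unchanged.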

8 of 11

Low-resource Task Results

9 of 11

Low-resource Task Results

10 of 11

Dialectal Task Results

11 of 11

Conclusion

Removing the top 3 layers of the Wav2vec 2.0 and XLSR-53 models and fine-tuning on an ST task helps improve the BLEU score on low-resource languages
Our results suggest that the Hubert model does not need any layer removal for fine-tuning to achieve the best results
The Hubert and Wav2vec 2.0 models seem to achieve comparable results
Surprisingly, the XLSR-53 model did not perform as well as Wav2vec 2.0 trained on English data on the low-resource tasks
However, XLSR-53 achieves the best result on the dialectal task