
GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks

Jonathan Kabala Mbuya

Antonios Anastasopoulos

{jmbuya, antonis}@gmu.edu


Motivation

Image source: https://mitratranslations.com/en/tag/website-language-vesion/

    • Communication is at the core of human societies
    • Over 7,100 languages in the world
    • Languages play a key part in communication
    • Translation helps with cross-language communication
    • Machine & Speech Translation can help


Dialectal and Low-resource Speech Translation

Challenges
    • Often oral
    • No standard writing system
    • Data scarcity (labeled data)

Potential Solution
    • Use unlabeled data for pretraining
    • Explore self-supervised speech translation models

[Diagram: Machine Translation vs. Speech Translation]


IWSLT 2023 Tasks

5 of the 6 available Low-resource Tasks

1 Dialectal Task

Task                          Train Set (hours)   Task Type
Irish to English              11                  Low-resource
Marathi to Hindi              15.3                Low-resource
Pashto to French              61                  Low-resource
Tamasheq to French            17                  Low-resource
Quechua to Spanish            1.6                 Low-resource
Tunisian Arabic to English    160                 Dialectal

Additional Data

Constrained
    • 234 hours of Tamasheq audio only

Unconstrained
    • LibriSpeech: 960 hours of English audio only
    • MuST-C English to French (492 hours)
    • XLSR-53 data (53 languages)


Proposed Methods: Baseline Models

End-to-end speech translation (E2E)

End-to-end speech translation with ASR encoder initialization (E2E-ASR)
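To make the two baselines concrete, here is a minimal PyTorch sketch: an encoder-decoder model trained directly on speech-translation pairs (E2E), and the same model whose speech encoder is first initialized from an ASR model with the same encoder architecture (E2E-ASR). The architecture sizes, module names, and the asr_checkpoint.pt path are illustrative assumptions, not the actual GMU training code.

# Minimal sketch of the two baselines (illustrative, not the actual GMU code):
#   E2E     -- train the whole model from scratch on speech-translation pairs
#   E2E-ASR -- same model, but the speech encoder is first initialized from an
#              ASR model trained with the same encoder architecture
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """Conv subsampler + Transformer encoder over log-Mel filterbank frames."""

    def __init__(self, n_mels=80, d_model=256, n_layers=12, n_heads=4):
        super().__init__()
        # Two strided convolutions give roughly 4x temporal subsampling.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats):                 # feats: (batch, time, n_mels)
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)
        return self.encoder(x)                # (batch, time', d_model)


class E2EST(nn.Module):
    """End-to-end speech translation: speech encoder + Transformer text decoder."""

    def __init__(self, vocab_size, d_model=256, dec_layers=6, n_heads=4):
        super().__init__()
        self.encoder = SpeechEncoder(d_model=d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, dec_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, prev_tokens):
        memory = self.encoder(feats)
        x = self.decoder(self.embed(prev_tokens), memory)
        return self.out(x)                    # logits over the target vocabulary


model = E2EST(vocab_size=8000)

# E2E-ASR: before ST fine-tuning, copy encoder weights from an ASR model that
# used the same encoder architecture (path and key prefix are hypothetical).
# asr_state = torch.load("asr_checkpoint.pt")["model"]
# enc_state = {k.removeprefix("encoder."): v
#              for k, v in asr_state.items() if k.startswith("encoder.")}
# model.encoder.load_state_dict(enc_state)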


Proposed Methods: Using Self-Supervised Speech Models

Wav2vec 2.0: W2V-E2E

HuBERT: Hubert-E2E

XLSR-53: XLSR-E2E
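These three systems replace the filterbank encoder of the baseline with a pretrained self-supervised speech encoder that consumes raw waveforms. The sketch below shows the idea using the public HuggingFace checkpoints for Wav2vec 2.0, HuBERT, and XLSR-53; the mapping of system names to checkpoints and the decoder configuration are assumptions for illustration, not the exact submission setup.

# Sketch of plugging a pretrained self-supervised encoder into the same
# encoder-decoder setup (public HuggingFace checkpoints; illustrative only).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, HubertModel

SSL_ENCODERS = {
    "W2V-E2E":    lambda: Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base"),
    "Hubert-E2E": lambda: HubertModel.from_pretrained("facebook/hubert-base-ls960"),
    "XLSR-E2E":   lambda: Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53"),
}


class SSLSpeechTranslation(nn.Module):
    """Pretrained speech encoder (raw waveform in) + Transformer text decoder."""

    def __init__(self, system="W2V-E2E", vocab_size=8000, dec_layers=6, n_heads=4):
        super().__init__()
        self.encoder = SSL_ENCODERS[system]()
        d_model = self.encoder.config.hidden_size
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, dec_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, waveform, prev_tokens):   # waveform: (batch, samples), 16 kHz
        memory = self.encoder(waveform).last_hidden_state
        x = self.decoder(self.embed(prev_tokens), memory)
        return self.out(x)


model = SSLSpeechTranslation("W2V-E2E")
logits = model(torch.randn(1, 16000), torch.tensor([[1, 2, 3]]))
print(logits.shape)   # (1, 3, vocab_size)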


Layer Removal Results on Wav2vec 2.0

Based on Pasad et al., 2022
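The layer-wise analysis of Pasad et al. suggests that the top Transformer layers of Wav2vec 2.0 are the most tied to the pretraining objective, which motivates discarding them before fine-tuning on speech translation. Below is a hedged sketch of what removing the top 3 layers could look like with the HuggingFace Wav2Vec2Model; the actual systems may implement this differently (e.g., in fairseq).

# Sketch of the layer-removal idea: drop the top k Transformer layers of a
# pretrained wav2vec 2.0 encoder before ST fine-tuning (illustrative only).
from transformers import Wav2Vec2Model


def remove_top_layers(model: Wav2Vec2Model, k: int = 3) -> Wav2Vec2Model:
    """Truncate the top k Transformer layers of a wav2vec 2.0-style encoder."""
    model.encoder.layers = model.encoder.layers[:-k]      # nn.ModuleList slicing
    model.config.num_hidden_layers = len(model.encoder.layers)
    return model


encoder = remove_top_layers(Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base"), k=3)
print(encoder.config.num_hidden_layers)   # 9 layers remain (12 - 3)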


Low-resource Task Results


Low-resource Task Results (continued)


Dialectal Task Results


Conclusion

  • Removing the top 3 layers of the Wav2vec 2.0 and XLSR-53 models before fine-tuning on an ST task improves the BLEU score on low-resource languages
  • Our results suggest that the HuBERT model does not need any layer removal before fine-tuning to achieve its best results
  • The HuBERT and Wav2vec 2.0 models seem to achieve comparable results
  • Surprisingly, the XLSR-53 model did not perform as well as Wav2vec 2.0 trained on English data on the low-resource tasks
  • However, XLSR-53 achieves the best result on the dialectal task