1 of 13

Team QUESPA Submission

International Workshop on Spoken Language Translation

IWSLT 2023

Dialog and Low-Resource Track

John E. Ortega (Northeastern University),

Rodolfo Zevallos (Universitat Pompeu Fabra) and

William Chen (Carnegie Mellon University)

2 of 13

Agenda

  • Team QUESPA
  • Quechua background
  • Challenge description
  • Developmental approaches
  • Submitted approaches
  • Final results

3 of 13

Team QUESPA

John E. Ortega (Northeastern University). Former organizer of several low-resource workshops and conferences. Multiple publications on machine translation and NLP for low-resource languages, including Quechua, Galician, and more.

Rodolfo Zevallos (Universitat Pompeu Fabra). Doctoral student with a thesis on low-resource languages. Former organizer of workshops on Indigenous languages of the Americas. Multiple publications on low-resource languages, specializing in Peruvian languages.

William Chen (Carnegie Mellon University). Master's student focused on speech technology. Co-organizer of several low-resource workshops and conferences. Multiple publications and work on speech audio and low-resource techniques.

4 of 13

Quechua Background

  • Quechua is spoken by around 8 million people.
  • Agglutinative, polysynthetic, morphologically complex.
  • About 3 morphemes per word on average (English: about 1.5 morphemes per word).

5 of 13

Challenge Description

  • https://github.com/Llamacha/IWSLT2023_Quechua_data
  • 1.4 hours of translated speech data, Quechua to Spanish
  • 60 hours of additional audio data (not translated)
  • Additional parallel data (for MT)
  • 52k parallel sentences, which are mostly biblical
  • Speech translation is measured by BLEU; if a system scores below 5 BLEU, results are reported using Word Error Rate (WER). A scoring sketch follows this list.
  • Constrained: no external data or systems allowed, though the provided MT parallel data could be used.
  • Unconstrained: external data and/or systems allowed, along with the constrained data.
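
A minimal sketch of the scoring rule above, assuming the sacrebleu and jiwer packages; the example sentences and the threshold handling are illustrative, not the official scoring script.

    # Minimal sketch of the BLEU-then-WER scoring rule, assuming sacrebleu and jiwer.
    import sacrebleu
    import jiwer

    hypotheses = ["los ninos juegan en el rio"]   # system outputs (Spanish)
    references = ["los niños juegan en el río"]   # gold translations

    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.2f}")

    # If the system scores below 5 BLEU, fall back to Word Error Rate.
    if bleu.score < 5.0:
        wer = jiwer.wer(references, hypotheses)
        print(f"WER = {wer:.3f}")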

6 of 13

Developmental Approaches

  • Constrained
    • ASR + MT – Wav2letter++ and MT
    • ASR + MT – FBANK and MT
    • ASR + MT – Totonac recipe without conformer
    • ASR + MT – Totonac recipe with conformer
  • Best ASR WER was 40.2 using the Totonac recipe with a conformer encoder (see the sketch after this list).
  • None of these developmental pipeline systems were kept.
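
A minimal sketch of the conformer encoder idea behind the best constrained ASR system, assuming torchaudio's Conformer module; the layer sizes are illustrative, not the exact Totonac-recipe configuration.

    # Minimal conformer encoder over 80-dim FBANK frames (illustrative sizes).
    import torch
    import torchaudio

    encoder = torchaudio.models.Conformer(
        input_dim=80,                    # 80-dim log-Mel filterbank frames
        num_heads=4,
        ffn_dim=1024,
        num_layers=12,
        depthwise_conv_kernel_size=31,
    )

    frames = torch.randn(1, 200, 80)     # (batch, time, features) dummy input
    lengths = torch.tensor([200])
    encoded, out_lengths = encoder(frames, lengths)
    print(encoded.shape)                 # (1, 200, 80)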

7 of 13

Developmental Approaches

  • Unconstrained
    • ASR + MT – Wav2letter++ and MT
    • ASR + MT – Wav2Vec2 fine-tuned with data augmentation (see the augmentation sketch after this list)
    • ASR + MT – FLEURS model fine-tuned on the constrained data
    • ASR + MT – FLEURS model fine-tuned on 55 hours of additional audio plus the constrained data
    • ASR + MT – FLEURS model fine-tuned on 55 hours of additional audio plus the constrained data, with a language model
  • The wav2letter++ and the FLEURS model fine-tuned on 55 hours were used for the final systems.
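
A minimal sketch of the data-augmentation step behind the fine-tuned Wav2Vec2 system, assuming torchaudio; the file name and perturbation factors are assumptions, not the submission's exact setup.

    # Classic 3-way speed perturbation to enlarge the small Quechua corpus
    # before wav2vec 2.0 CTC fine-tuning (illustrative file and factors).
    import torchaudio

    waveform, sr = torchaudio.load("quechua_utterance.wav")   # hypothetical file

    augmented = []
    for factor in ("0.9", "1.0", "1.1"):
        perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
            waveform, sr, [["speed", factor], ["rate", str(sr)]]
        )
        augmented.append(perturbed)

    # The augmented waveforms would then feed wav2vec 2.0 fine-tuning with a
    # CTC head built over the Quechua transcripts.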

8 of 13

Developmental Approaches

  • Unconstrained/Constrained Machine Translation
    • OpenNMT Transformer on MT Data + Speech Translations
    • OpenNMT Transformer with NLLB (no Quechua data)
    • Fairseq 101 from WMT 2021 (includes Quechua data)
    • Fairseq 101 from WMT 2021 (includes Quechua data, fine-tuned on MT + speech translations)
    • Fairseq 200 NLLB pre-trained language model (no fine-tuning, but includes Quechua data)
    • Fairseq 200 NLLB pre-trained language model (fine-tuned on MT + speech translations)
    • Hugging Face 200 NLLB (includes Quechua data)
  • Three systems were kept and tried for the final approach: OpenNMT, Fairseq 101 NLLB (fine-tuned), and Fairseq 200 NLLB (fine-tuned). An NLLB translation sketch follows this list.
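
A minimal sketch of translating Quechua (quy_Latn) into Spanish (spa_Latn) with an NLLB-200 checkpoint from Hugging Face; the distilled-600M model and the example sentence are assumptions, not the exact submitted system.

    # NLLB-200 inference sketch: Quechua -> Spanish (illustrative checkpoint).
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="quy_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer("Ñuqaqa runasimita rimani.", return_tensors="pt")  # illustrative input
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("spa_Latn"),
        max_length=64,
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])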

9 of 13

Submitted Approaches

  • Primary Constrained
    • 1.25 BLEU
    • S2T - Fairseq speech-to-text transformer with Mel filter bank features (see the feature sketch after this list)
  • Constrained 1
    • 0.13 BLEU
    • Cascade - Wav2Letter++ with OpenNMT transformer.
  • Constrained 2
    • 0.11 BLEU
    • Cascade - Totonac conformer with OpenNMT transformer.
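
A minimal sketch of extracting the 80-dim Mel filter bank (FBANK) features used by the speech-to-text transformer, assuming torchaudio's Kaldi-compatible front end; the file path and settings are illustrative.

    # Log-Mel filterbank (FBANK) feature extraction (illustrative settings).
    import torchaudio

    waveform, sample_rate = torchaudio.load("quechua_utterance.wav")  # hypothetical file
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,          # feature dimension
        frame_length=25.0,        # window length in ms
        frame_shift=10.0,         # hop in ms
        sample_frequency=sample_rate,
    )
    print(fbank.shape)            # (num_frames, 80)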

10 of 13

Submitted Approaches

  • Primary Unconstrained
    • BLEU = 15.36
    • Cascade – ASR (fine-tuned) + MT (fine-tuned)
    • ASR System
      • 102-language FLEURS (Conneau et al., 2023) dataset.
      • Conformer (Gulati et al., 2020) encoder and transformer decoder.
      • Language model
      • Trained using hybrid CTC/attention loss (Watanabe et al., 2017) and hierarchical language identification conditioning (Chen et al., 2023); a loss sketch follows this list.
    • MT System
      • Fairseq (Ott et al., 2019)
      • Flores 101 NLLB (Guzmán et al., 2019) pre-trained language model.
      • Transformer architecture used in WMT 2021, fine-tuned on MT and Speech Translation from constrained data.
  • Contrastive 1 (same system without the language model, 15.27 BLEU) and Contrastive 2 (wav2letter++ cascade, 10.75 BLEU)
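
A minimal sketch of the hybrid CTC/attention objective (Watanabe et al., 2017) used to train the ASR model; the interpolation weight and tensor shapes are illustrative, not the submission's exact configuration.

    # Hybrid CTC/attention training objective (illustrative weight and shapes).
    import torch
    import torch.nn.functional as F

    CTC_WEIGHT = 0.3  # assumed interpolation weight

    def hybrid_ctc_attention_loss(ctc_log_probs, decoder_logits, targets,
                                  input_lengths, target_lengths):
        # CTC branch over encoder frames: ctc_log_probs is (time, batch, vocab).
        ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths, blank=0)
        # Attention branch: cross-entropy over decoder outputs, (batch, T_out, vocab).
        att = F.cross_entropy(
            decoder_logits.reshape(-1, decoder_logits.size(-1)),
            targets.reshape(-1),
        )
        return CTC_WEIGHT * ctc + (1.0 - CTC_WEIGHT) * att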

11 of 13

Submitted Approaches

12 of 13

Final Results

13 of 13

Thanks!