NAVER LABS Europe’s Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track

Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito and Ioan Calapodescu

Overview

  • We introduce multilingual parameter-efficient models for speech translation (ST) that leverage strong pretrained speech and machine translation (MT) models
  • We obtain the winning score on both Quechua-Spanish (Que-Es) and Tamasheq-French (Taq-Fr)
  • We also show competitive results for high-resource ST (IWSLT 2021)
  • We perform extensive additional experiments to investigate:
    • The effect of various hyperparameters
    • Incremental learning strategies
    • Zero-shot performance

Architecture

  • Pretrained MT model frozen, except for its first few encoder layers
  • Low-dimensional (64) bottleneck adapter modules in the remaining layers (see the sketch below)
  • Features extracted from a frozen pretrained speech representation model
  • A single convolutional layer to downsample the speech features
  • Trained on multilingual ST + ASR data
  • Can decode both speech and text with little overhead
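A minimal PyTorch sketch of this layout, assuming a generic interface for the pretrained MT encoder layers; module names, the convolution kernel/stride, and everything except the 64-dimensional bottleneck are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Dim-64 bottleneck adapter with a residual connection."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))

class AdaptedLayer(nn.Module):
    """Frozen pretrained MT layer followed by a trainable adapter.
    The layer is assumed to map [batch, time, d_model] to the same shape."""
    def __init__(self, layer: nn.Module, d_model: int):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False  # keep the pretrained layer frozen
        self.adapter = BottleneckAdapter(d_model)

    def forward(self, x):
        return self.adapter(self.layer(x))

def build_speech_encoder(mt_layers, d_model, n_finetuned=2, speech_dim=1024):
    """First `n_finetuned` MT layers stay fully trainable; the rest get adapters.
    A single convolution maps frozen speech features into the MT embedding space
    while downsampling along the time axis."""
    cnn = nn.Conv1d(speech_dim, d_model, kernel_size=5, stride=2, padding=2)
    layers = nn.ModuleList([
        layer if i < n_finetuned else AdaptedLayer(layer, d_model)
        for i, layer in enumerate(mt_layers)
    ])
    return cnn, layers
```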

IWSLT 2023 Submission

  • Que-Es Primary: XLS-R large (18th layer) + NLLB 3.3B, 2 layers finetuned, ensemble of 3 seeds (seed ensembling sketched after this list)
  • Que-Es Contrastive 1: best dev seed of above
  • Que-Es Contrastive 2: same as above but with NLLB 1.3B
  • Taq-Fr Primary: Niger-Mali wav2vec 2.0 (8th layer) + NLLB 1.3B, 3 layers finetuned, ensemble of 3 seeds
  • Taq-Fr Contrastive 1: best dev seed of above
  • Taq-Fr Contrastive 2: same as Que-Es Primary
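The "ensemble of 3 seeds" entries combine models trained with different random seeds at decoding time. Below is a minimal sketch of one way to do this, averaging per-step log-probabilities under an assumed `model(speech_inputs, prefix) -> logits` interface; the actual submission presumably uses beam search rather than the greedy loop shown here.

```python
import torch

@torch.no_grad()
def ensemble_greedy_decode(models, speech_inputs, bos_id, eos_id, max_len=200):
    """Greedy decoding that averages next-token log-probs over several seeds.
    Each model is assumed to return logits of shape [1, prefix_len, vocab]."""
    prefix = torch.full((1, 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        # Score the same prefix with every seed and average in log-prob space.
        step_logprobs = torch.stack([
            m(speech_inputs, prefix).log_softmax(dim=-1)[:, -1, :]
            for m in models
        ]).mean(dim=0)
        next_token = step_logprobs.argmax(dim=-1, keepdim=True)
        prefix = torch.cat([prefix, next_token], dim=1)
        if next_token.item() == eos_id:
            break
    return prefix
```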

ST Training Data:

  • TED-LIUM v2 En-En
  • mTEDx Fr-Fr, Es-Es, Fr-En, Fr-Es, Es-Fr, Es-En
  • IWSLT 2023 training data

IWSLT 2021 Evaluation

  • We also evaluate on the IWSLT 2021 Multilingual ST task, in both supervised and zero-shot settings
  • We outperform the best submission on the training directions, even though our models require substantially less compute and data to train and are not ensembled
  • We perform worse than FAIR on the zero-shot directions, but these are not truly zero-shot for their submission

Speech Models

  • The wav2vec 2.0 models trained on Tamasheq outperform the massively multilingual models on Taq-Fr
  • wav2vec 2.0: layer 8 gives the best performance on Taq-Fr (layer-wise feature extraction sketched below)
  • HuBERT: layer 11 gives best performance
  • XLS-R L gives best performance on Fr-En
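A small sketch of how frozen features from a specific layer can be extracted, assuming the speech models are loaded through HuggingFace transformers; the checkpoint name is a placeholder, not the actual model used.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

ckpt = "some/wav2vec2-checkpoint"  # placeholder checkpoint name
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt).eval()

@torch.no_grad()
def layer_features(waveform, sampling_rate=16_000, layer=8):
    """Return the hidden states of a chosen layer as frozen speech features.
    hidden_states[0] is the output before the first transformer block;
    hidden_states[i] (i >= 1) is the output of the i-th transformer block."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[layer]
```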

MT Models

  • NLLB gives better performance than mBART at the same size
  • Bigger NLLB models give better performance (though the gains are smaller from 1.3B to 3.3B)

Convolution Layers

  • Varying the number of convolutional layers has little effect on translation quality, but a large effect on decoding speed (see the length arithmetic below)
  • Models with fewer conv layers converge faster
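The speed effect comes from how much the convolutional stack shortens the speech feature sequence that the MT encoder and decoder attend over. A rough sketch of the length arithmetic, assuming stride-2 convolutions with kernel size 5 (illustrative values, not the exact configuration):

```python
def output_length(n_frames: int, n_conv_layers: int, kernel: int = 5, stride: int = 2) -> int:
    """Sequence length after stacking `n_conv_layers` stride-2 convolutions."""
    length = n_frames
    for _ in range(n_conv_layers):
        # Standard convolution arithmetic with padding = kernel // 2.
        length = (length + 2 * (kernel // 2) - kernel) // stride + 1
    return length

# e.g. 1500 wav2vec 2.0 frames (~30 s of audio at 50 frames/s):
for k in (1, 2, 3):
    print(k, "conv layer(s) ->", output_length(1500, k), "positions")
# 1 -> 750, 2 -> 375, 3 -> 188
```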

Incremental Learning

  • Our approach has a high compute cost when retraining for a new language
  • We look at incremental learning: training first on the IWSLT 2021 data, then adapting to Taq-Fr through four different approaches
  • Even when training only 1.6M parameters, we achieve good performance (see the parameter-counting sketch below)
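A hedged sketch of the adapter-only end of this spectrum: freeze the already-trained multilingual model, add new modules for the new language, and train only those. The naming convention used to select trainable modules is an assumption, and the exact configuration behind the 1.6M figure is not reproduced here.

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_substrings=("adapter",)) -> nn.Module:
    """Freeze every parameter whose name does not contain a trainable substring."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)
    return model

def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage with a hypothetical ST model object:
# freeze_except(st_model)
# print(f"{count_trainable(st_model) / 1e6:.1f}M trainable parameters")
```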

Zero-Shot Performance

  • Since we do not train any decoder parameters and only train the first few encoder layers, the model retains zero-shot capabilities
  • We investigate this for two pairs: Taq-En (English seen as a target during ST training) and Taq-Ko (Korean only seen during pretraining); see the target-language forcing sketch below
  • For Taq-En we beat the cascaded baseline across all settings
  • For Taq-Ko, decoder adapters are harmful to performance (affecting 16% of the output sentences with the larger adapters)
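Because the NLLB decoder is left untouched, switching the target language at test time only requires forcing a different target-language token at generation time. A minimal text-to-text illustration with a public NLLB checkpoint (in the ST system the encoder consumes speech-derived states instead, but the decoder-side mechanism is the same):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "facebook/nllb-200-distilled-600M"  # small public NLLB checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt, src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

inputs = tokenizer("Bonjour le monde", return_tensors="pt")
# Decode into Korean by forcing the Korean language token as the first target token.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kor_Hang"),
    max_new_tokens=50,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```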

Thanks for Listening