The USTC’s Dialect Speech Translation System for IWSLT 2023

Pan Deng¹, Shihao Chen, Weitai Zhang¹,², Jie Zhang, Lirong Dai

¹University of Science and Technology of China, Hefei, China

²iFlytek Research, Hefei, China

Dialect Speech Translation

  • Direction: Translating Tunisian Arabic dialect to English
  • Dialect data features:
    • Low-resource
    • Non-standard orthography
    • Somewhat similar to the corresponding standard language (Modern Standard Arabic, MSA)
  • Contribution:
    • Improve data augmentation methods for dialect MT
    • Explore transfer from Tunisian Arabic (Ta) to MSA for E2E ST

Data Conditions

    • Constrained condition:
      • Condition A (Tunisian ST)
    • Unconstrained condition:
      • Condition B (MSA ASR/MT)
      • Condition C (Add private MSA ASR/MT data)

Experiments

    • Cascaded ST System:
      • ASR model ensemble
      • MT data augmentation
      • Robust fine-tuning

    • End-to-end ST System
      • SATE
      • Hybrid SATE

    • Submission ST System
      • Ensemble of cascaded and end-to-end ST systems

Cascaded ST System

    • ASR model ensemble
      • Three ASR model architectures
        • VGG-Conformer
        • VGG-Transformer
        • GateCNN-Conformer
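
The slides do not say how the three ASR architectures are combined. One common approach, assumed here, is to average the models' per-token probabilities at each decoding step and pick the best token from the averaged distribution. The function names and toy scores below are hypothetical and only illustrate that idea.

```python
import math

def ensemble_step(model_log_probs):
    """Average per-token probabilities from several models for one decoding step.

    model_log_probs: list of dicts mapping token -> log-probability,
    one dict per model (a hypothetical toy format, not a real toolkit API).
    Returns a dict mapping token -> log of the averaged probability.
    """
    vocab = set().union(*model_log_probs)
    n = len(model_log_probs)
    combined = {}
    for tok in vocab:
        # A token missing from a model's dict is treated as probability 0.
        avg = sum(math.exp(lp[tok]) if tok in lp else 0.0
                  for lp in model_log_probs) / n
        combined[tok] = math.log(avg) if avg > 0 else float("-inf")
    return combined

def pick_token(model_log_probs):
    """Return the token with the highest averaged probability."""
    combined = ensemble_step(model_log_probs)
    return max(combined, key=combined.get)

# Toy scores standing in for three ASR models (made-up numbers).
toy_scores = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.3), "b": math.log(0.7)},
    {"a": math.log(0.2), "b": math.log(0.8)},
]
best = pick_token(toy_scores)
```

In this toy case the first model alone would choose "a", but the averaged distribution favours "b", which is the kind of disagreement an ensemble is meant to resolve.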

Cascaded ST System

    • MT data augmentation
      • Data augmentation with MSA data

      • Data augmentation with BTFT model
        • Build Ta->En and En->Ta MT models and translate in each direction
        • Pretrain and fine-tune the MSA2Ta model to enhance its translation quality
        • Prepare pseudo Ta-MSA paired data from both the Tunisian and MSA domains
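
Assuming BTFT denotes back-translation plus forward-translation, the first bullet above can be sketched as follows. The translators are passed in as plain callables; `btft_augment` and the toy translators are hypothetical stand-ins for the trained Ta->En and En->Ta models, not the authors' code.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]

def btft_augment(
    ta_sentences: List[str],
    en_sentences: List[str],
    ta2en: Translator,
    en2ta: Translator,
) -> List[Tuple[str, str]]:
    """Build pseudo (Ta, En) pairs by translating in both directions."""
    pairs = []
    # Forward translation: real Ta source, synthetic En target.
    for ta in ta_sentences:
        pairs.append((ta, ta2en(ta)))
    # Back-translation: synthetic Ta source, real En target.
    for en in en_sentences:
        pairs.append((en2ta(en), en))
    return pairs

# Toy translators that only tag their input (placeholders for real models).
toy_ta2en = lambda s: f"<en:{s}>"
toy_en2ta = lambda s: f"<ta:{s}>"

pseudo_pairs = btft_augment(["ta1"], ["en1"], toy_ta2en, toy_en2ta)
```

The same pattern would apply to the MSA2Ta model: feeding it the MSA side of MSA-En data yields pseudo Tunisian sources paired with real English targets.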

Cascaded ST System

    • Robust fine-tuning for MT model
      • Constrained fine-tuning
      • Error adaptation fine-tuning
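
The slides name the two methods but not their objectives. One plausible reading of constrained fine-tuning, assumed here and not confirmed by the source, is a KL-regularised loss that penalises the fine-tuned model for drifting away from the frozen pretrained model. The function names, the `alpha` weight, and the toy distributions are all hypothetical.

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold token under one distribution."""
    return -math.log(probs[gold_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def constrained_loss(tuned_probs, frozen_probs, gold_index, alpha=0.5):
    """Task loss plus a penalty for moving away from the pretrained model.

    alpha is an assumed weighting hyperparameter, not a value from the slides.
    """
    task = cross_entropy(tuned_probs, gold_index)
    penalty = kl_divergence(frozen_probs, tuned_probs)
    return task + alpha * penalty

# Toy next-token distributions over a two-word vocabulary (made-up numbers).
frozen = [0.5, 0.5]
same_as_frozen = constrained_loss([0.5, 0.5], frozen, gold_index=0)
drifted = constrained_loss([0.9, 0.1], frozen, gold_index=0)
```

When the fine-tuned distribution equals the frozen one, the penalty vanishes and only the task loss remains; as the model drifts, the penalty grows even if the task loss falls.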

Cascaded ST System

    • Results of Cascaded ST System
      • Pretraining the model with BTFT data and applying constrained fine-tuning brings considerable improvement
      • Pretraining the MSA2Ta or En2Ta model improves translation quality
      • The pseudo dialect MT data generated by the MSA*2Ta model is more beneficial

End-to-end ST System

    • Models
      • SATE
        • Tunisian ASR
        • Ta2En MT

      • Hybrid SATE
        • Tunisian ASR
        • Ta2MSA* MT encoder (Text Encoder)
        • MSA2En MT
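
The two models above differ only in which pretrained systems initialise their components. The sketch below records that mapping, assuming the usual SATE layout of an acoustic encoder, a text encoder, and a decoder. The assignment of each pretrained model to a specific component, in particular the decoder, is an assumption; `STInit` and `differing_components` are hypothetical names.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class STInit:
    """Which pretrained model initialises each component (assumed layout)."""
    acoustic_encoder: str
    text_encoder: str
    decoder: str

# SATE: Tunisian ASR plus a single Ta2En MT model.
SATE = STInit(
    acoustic_encoder="Tunisian ASR",
    text_encoder="Ta2En MT encoder",
    decoder="Ta2En MT decoder",
)

# Hybrid SATE: the text side goes through MSA instead of directly to English.
HYBRID_SATE = STInit(
    acoustic_encoder="Tunisian ASR",
    text_encoder="Ta2MSA* MT encoder",
    decoder="MSA2En MT decoder",  # assumption: the slide only says "MSA2En MT"
)

def differing_components(a: STInit, b: STInit):
    """List the component names whose initialisation differs between a and b."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]

diff = differing_components(SATE, HYBRID_SATE)
```

Under these assumptions both models share the acoustic side, and the transfer from Tunisian Arabic to MSA happens entirely in the text encoder and decoder.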

End-to-end ST System

    • Results of End-to-end ST System
      • SATE is slightly better than Hybrid SATE in condition C
      • Adding Hybrid SATE as a sub-model in the model ensemble is very useful

Submission System

    • Results of the Submission System
      • Cascaded ST is better than E2E ST in the constrained condition

      • E2E ST outperforms cascaded ST in the unconstrained conditions, especially in condition C
      • Hybrid SATE is a valuable sub-model for the model ensemble, bringing an average BLEU improvement of 0.45

Thank you!