1 of 24


Languages Transferred Within the Encoder:

On Representation Transfer in Zero-Shot Multilingual Translation

Zhi Qu, Chenchen Ding, Taro Watanabe

Zhi Qu

D3 Student, Natural Language Processing Lab., Nara Institute of Science and Technology, Japan;

Internship, Advanced Translation Technology Lab., National Institute of Information and Communications Technology, Japan.

June 26, 2025

2 of 24


Scan it to visit my homepage!

(https://zhiqu22.github.io)

There, you can find my slides for today!


3 of 24


Background

Multilingual Neural Machine Translation

A single system that enables arbitrary translation from multiple languages to multiple languages.

Zero-Shot Translation

Given a model, if the translation pair from language A to language B is seen in training, we call A → B a supervised translation pair.

Conversely, if A → B is not seen in training, we call A → B a zero-shot translation pair.

4 of 24


Background

Two main paradigms:

  • MNMT-specific model: A single model trained on parallel data only.
  • LLM-driven methods: Utilizing or fine-tuning LLMs to support MNMT.

Has the traditional method been abandoned?

We say: No!

But why?

(Diagram: the traditional paradigm vs. the current LLM spotlight.)

5 of 24


Background beyond this work

The comparison between two main paradigms:

  • Traditional:
    • NLLB-3.3B, a dense encoder-decoder model supporting more than 200 languages, whose performance is better than that of GPT-3.5 [Zhu et al., 2024].
    • Gao et al. (2022) and Zhang et al. (2022) show that decoder-only architectures are weak in MNMT.

  • LLM-driven methods:
    • Tower 7B [Alves et al., 2024] supports only 10 languages and underperforms GPT-4.
    • ALMA 13B [Xu et al., 2024] supports only 6 languages and outperforms NLLB-3.3B and GPT-4 on these 6 languages.
    • XALMA 13B (+ adapters) [Xu et al., 2025] supports 50 languages, however, the translation directions are not arbitrary.
    • BigTranslate 13B [Yang et al., 2023], which supports more than 100 languages, is comparable to GPT-3.5.

6 of 24


Problem

Our 1st motivation:

We want to unveil the success of encoder-decoder architectures at the representation level to guide future studies in MNMT.

Problem:

Although many related works have addressed this, they show a discrepancy; in other words, two “seemingly opposite” conclusions.

Our 2nd motivation:

Why? Can we reach a unified conclusion?

7 of 24


Problem

The discrepancy:

  • an ideal encoder is expected to distinguish representations by the target language (Kudugunta et al., 2019; Liu et al., 2021; Tan and Monz, 2023; Stap et al., 2023; Sun et al., 2024).
  • an ideal encoder is expected to learn language-agnostic representations, capturing general cross-lingual features that are transferable across languages (Pan et al., 2021; Gu and Feng, 2022; Gao et al., 2023; Bu et al., 2024).

Analyses based on sentence-level representations extracted from the encoder output:

  1. show that representations are clustered by their target languages (analyzed by SVCCA scores);
  2. show that representations from different source languages are aligned (analyzed by t-SNE).

8 of 24


Setup for investigation

Datasets:

The training and validation data are English-centric; thus, translation between two non-English languages is zero-shot.

  • Europarl-15, 15 European languages (2 × 14 pairs in total). Each language has 189,310 instances, and all languages are semantically parallel.
  • TED-19, 19 various languages (36 pairs), each translation pair contains 103,093 to 214,111 instances.

Models:

Transformer with 6/8/10 encoder layers and 6 decoder layers, with hidden size 512 and feed-forward size 1024.

(Diagram: an encoder-decoder model; the input “[de] Hello, world.” is encoded and decoded into the output “Hallo, welt.”)

9 of 24


Identity Pair

Our tool for analysis:

An identity pair refers to a pseudo, zero-shot pair that translates a sentence into itself, which can represent the optimal state of a language.

Example:

  • A real pair of en → de:

[de] Hello, world. → Hallo, welt.

  • The identity pair of en:

[en] Hello, world. → Hello, world.

  • The identity pair of de:

[de] Hallo, welt. → Hallo, welt.

Explanations:

  • This process involves no influence from another language (why it can represent the state of a language).
  • It is a zero-shot pair, yet the model can perfectly recover the source sentence from the hidden representations (why it is optimal).
  • It is indeed a proxy for language representations, which means the result is indirect. However, it is still a better solution than comparing real translation pairs.

In fact, identity pairs have been intuitively used in prior works (Tiedemann and Scherrer, 2019; Thompson and Post, 2020; Bu et al., 2024), as an assumed indicator of target-language-specific representation states.

However, these works did not define, validate, or systematically utilize identity pairs.
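The construction of an identity pair can be sketched in a few lines; the function name and the bracketed target-language tag format are illustrative assumptions, not the paper's code:

```python
def make_identity_pair(sentence: str, lang_tag: str) -> tuple[str, str]:
    """Build a pseudo zero-shot pair that translates a sentence into itself.

    The target-language tag is prepended to the source, following the common
    MNMT convention of tagging the input with the target language.
    """
    source = f"[{lang_tag}] {sentence}"
    target = sentence
    return source, target

# A real supervised pair (en -> de) vs. the identity pairs of en and de:
real_pair = ("[de] Hello, world.", "Hallo, welt.")
id_en = make_identity_pair("Hello, world.", "en")
id_de = make_identity_pair("Hallo, welt.", "de")
print(id_en)  # ('[en] Hello, world.', 'Hello, world.')
```

Because the tag already asks the model to produce the source language itself, the pair is zero-shot by construction under English-centric training.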

10 of 24


1st Analysis:

Language Transfer Within the Encoder

Given two languages (1) and (2), we have 3 sentence-level cases:

  • (i) Comparing (2) → (2) and (1) → (2) to show target-language features by SVCCA.
  • (ii) Comparing (2) → (2) and (2) → (1) to show source-language features by SVCCA.
  • (iii) Comparing (2) → (2) and (1) → (1) to show language-agnostic features by SVCCA.

Layer-wise results of 6 encoder layers:

* SVCCA refers to singular value canonical correlation analysis, used to compare the similarity between two sets of representations.

We follow Liu et al. (2021) in obtaining sentence-level representations by mean-pooling all tokens.
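As a rough illustration, the mean-pooling step and a simplified SVCCA-style score (SVD reduction of each view, then canonical correlations between the reduced subspaces) might look as follows; the reduction rank and all shapes here are arbitrary choices, not the paper's exact implementation:

```python
import numpy as np

def mean_pool(token_reps: np.ndarray) -> np.ndarray:
    """Sentence-level representation: mean-pool over all tokens (following Liu et al., 2021)."""
    return token_reps.mean(axis=0)

def svcca_score(X: np.ndarray, Y: np.ndarray, keep: int = 10) -> float:
    """Simplified SVCCA: SVD-reduce each view, then average the canonical
    correlations between the reduced subspaces.

    X, Y: (n_sentences, dim) matrices of mean-pooled representations.
    """
    def svd_reduce(A: np.ndarray, k: int) -> np.ndarray:
        A = A - A.mean(axis=0)                      # center each dimension
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] * s[:k]                     # keep the top-k directions

    def orthobasis(A: np.ndarray) -> np.ndarray:
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        return U                                    # orthonormal basis of the column space

    Qx = orthobasis(svd_reduce(X, keep))
    Qy = orthobasis(svd_reduce(Y, keep))
    # Singular values of Qx^T Qy are the canonical correlations.
    corr = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(corr.mean())
```

Comparing, per layer, the mean-pooled representations of (2) → (2) against those of (1) → (2) with such a score corresponds to case (i) above.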

11 of 24


1st Analysis:

Language Transfer Within the Encoder

Given two languages (1) and (2), we have 3 sentence-level cases:

  • (i) Comparing (2) → (2) and (1) → (2) to show target-language features by SVCCA.
  • (ii) Comparing (2) → (2) and (2) → (1) to show source-language features by SVCCA.
  • (iii) Comparing (2) → (2) and (1) → (1) to show language-agnostic features by SVCCA.

Layer-wise results of 8 and 10 encoder layers:

* SVCCA refers to singular value canonical correlation analysis, used to compare the similarity between two sets of representations.

We follow Liu et al. (2021) in obtaining sentence-level representations by mean-pooling all tokens.

12 of 24


1st Analysis:

Language Transfer Within the Encoder

What phenomena can we observe?

  • The target-language feature gradually increases; the source-language feature gradually decreases.
  • At the encoder output, the language features of the representations are very clear:
    • the target-language feature dominates;
    • the source-language and language-agnostic features differ little.

We can conclude:

Translation representations are indeed transferred toward the target language layer by layer.

This supports and supplements the opinion of Kudugunta et al. (2019).

However, how can we explain the semantic alignment?

13 of 24


1st Analysis:

Language Transfer Within the Encoder

Token-level analysis by t-SNE in TED-19:

Four real pairs translated into English, plus the English identity pair.

(Figure: t-SNE of token-level representations; the variance across languages drops from 1.45 at the embedding layer to 0.09 at the encoder output.)

Language Transfer means:

Representations from different languages are semantically aligned in the representational subspace of the target language.

The “seemingly opposite” opinions are two sides of the same thing!

14 of 24


2nd Analysis:

Entanglements Hindering the Transfer

Let’s further explore the case with different target languages:

  • Representations of identities are uniformly distributed.
  • Representations of supervised translations are similar.
  • Representations of zero-shot translations are entangled.

This phenomenon matches the performance ranking, i.e., identity > supervised translation > zero-shot translation.

Namely, the entanglement across languages causes the zero-shot translation deficiency.

15 of 24


3rd Analysis:

Language Features in the Decoder

The ideal situation for the decoder is that its hidden states involve only target-language features.

However, this is hard to reach:

  • The source sentence is aligned rather than completely converted into the target language.
  • The decoder also suffers from the entanglements.

Based on our analysis, the lower layers of the decoder are weak at distinguishing language features.

* Given the time limitation, we give only a brief explanation of this part.

16 of 24

16

Methods

On the encoder side:

Low-rank Language-specific Embedding (LoLE)

  • Initializing a set of embeddings specific to each language.
  • Biasing the token representations at the top two encoder layers toward the target language.
  • Only the head portion of each representation is biased.
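A minimal sketch of this head-portion biasing, assuming the bias is a simple addition of a per-language embedding; DIM, RANK, and NUM_LANGS are illustrative values, not the paper's exact configuration:

```python
import numpy as np

# Illustrative sizes: hidden dimension, biased head portion, number of languages.
DIM, RANK, NUM_LANGS = 512, 64, 15
rng = np.random.default_rng(0)
lang_embed = rng.normal(size=(NUM_LANGS, RANK))   # one low-rank embedding per target language

def apply_lole(hidden: np.ndarray, tgt_lang_id: int) -> np.ndarray:
    """Bias token representations at a top encoder layer toward the target language.

    hidden: (seq_len, DIM) token representations; only the head RANK dimensions
    are biased, leaving the tail untouched.
    """
    biased = hidden.copy()
    biased[:, :RANK] += lang_embed[tgt_lang_id]
    return biased

h = rng.normal(size=(7, DIM))
out = apply_lole(h, tgt_lang_id=3)
```

Restricting the bias to the head portion keeps the added parameters low-rank relative to the full hidden dimension.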

On the decoder side:

Language-specific Contrastive Learning of Representations (LCLR)

  • At the bottom decoder layer’s output.
  • An instance in the training batch serves as the anchor.
  • Instances translating into the same target language are the positive instances.
  • Instances translating into different target languages are the negative instances.
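The sampling scheme above can be sketched as an InfoNCE-style objective; the temperature, pooling, and exact loss form here are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def lclr_loss(reps: np.ndarray, tgt_langs: list[int], tau: float = 0.1) -> float:
    """InfoNCE-style sketch of LCLR over one training batch.

    reps: (batch, dim) representations taken at the bottom decoder layer's output;
    tgt_langs: target-language id of each instance. Every instance acts as an
    anchor; same-target-language instances are positives, the rest are negatives.
    """
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)  # cosine similarity
    sim = reps @ reps.T / tau
    n = len(tgt_langs)
    losses = []
    for i in range(n):
        pos = [j for j in range(n) if j != i and tgt_langs[j] == tgt_langs[i]]
        neg = [j for j in range(n) if tgt_langs[j] != tgt_langs[i]]
        if not pos or not neg:   # skip anchors without a positive or a negative
            continue
        denom = np.exp(sim[i, pos + neg]).sum()
        for p in pos:
            losses.append(-np.log(np.exp(sim[i, p]) / denom))
    return float(np.mean(losses))
```

In practice such a loss would be implemented in the training framework and added to the translation loss; this NumPy version only illustrates how anchors, positives, and negatives are drawn from the batch by target language.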

17 of 24


Datasets:

  • Europarl-15
  • TED-19
  • OPUS-100 (revised version), a large-scale dataset consisting of 95 languages, 188 pairs, and 109.2 million instances in total.

Experimental Setup

Evaluation metrics: SacreBLEU, BERTScore

Training from scratch:

  • 6 encoder layers × 6 decoder layers;
  • Inner size of 1024 for Europarl-15 and TED-19;
  • Inner size of 2048 for OPUS-100.

Fine-tuning, only experimented with TED-19:

  • M2M-418M
  • M2M-1.2B
  • mBART50

18 of 24


Experimental Results

19 of 24


Experimental Results

These pre-trained models are trained with a different strategy rather than adding a target-language tag at the beginning of the input.

The improvement proves that enhancing target-language information is always beneficial.

20 of 24


4th Analysis:

Correlation between Performance and Representation

First of all: averaged scores may hide language-specific tendencies; however, hundreds of pairs are hard to illustrate individually. So, we show scores averaged by language family.

The performance differences show a tendency similar to that of the SVCCA scores.

To prove this, we conduct Pearson Correlation Analysis:

  • Coefficients: range of [0.585, 0.855], mean of 0.77, variance of 0.04.
  • p-values: range of [4e-5, 0.021], mean of 0.002, variance of 3e-5.
  • Conclusion: they are positively correlated.
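The coefficient itself is straightforward to compute (the p-values would come from a statistical package such as scipy.stats.pearsonr); the data below are illustrative, not the paper's numbers:

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two paired samples,
    e.g. per-family performance differences vs. SVCCA scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Perfectly linear paired data gives r = 1.0.
assert abs(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
```

A coefficient near +1 with a small p-value, as reported above, indicates a strong positive correlation between representation quality and translation performance.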

21 of 24


5th Analysis:

Improved Representation

  1. Layer-wise representations in the encoder:
    • Target and source languages are distinguished more clearly.
    • The lower layers focus more on source-language information.

  2. The entanglements:
    • Representations of zero-shot translations are disentangled.

  3. Layer-wise representations in the decoder:
    • The target-language feature becomes more important.

22 of 24


Summary

  1. Our analyses unveil the mechanism of traditional MNMT models:
    • Language representations are indeed transferred into target-language spaces.
    • Representations from different source languages are semantically aligned in the target-language spaces.

  2. Our theory shows how to improve MNMT models effectively: enhancing the target-language feature during the transfer.

  3. Our theory can guide further directions for improving MNMT:
    • Our methodologies in this work are good practice.
    • We have already conducted several works guided by this theory:
      • Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation. Zhe Cao, Zhi Qu, Hidetaka Kamigaito and Taro Watanabe. EMNLP 2024.
      • Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation. Frederikus Hudi, Zhi Qu, Hidetaka Kamigaito and Taro Watanabe. LREC-COLING 2024.
      • Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation. Zhi Qu, Yiran Wang, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe. Under review.
      • Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation. Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe. ACL 2025.

23 of 24


Thank you for listening!

Q&A

24 of 24


Limitation

We mentioned that the identity pair is a proxy.

Thus, our analysis relies on bidirectional training.

We conduct an additional experiment, removing Italian (it) from the decoder side and Dutch (nl) from the encoder side. Then, we use German as an intermediate language to replace the role of the identity pair.

Although the additional result (b) differs slightly from (a), it keeps a similar tendency.