Languages Transferred Within the Encoder:
On Representation Transfer in Zero-Shot Multilingual Translation
Zhi Qu, Chenchen Ding, Taro Watanabe
Zhi Qu
D3 Student, Natural Language Processing Lab., Nara Institute of Science and Technology, Japan;
Internship, Advanced Translation Technology Lab., National Institute of Information and Communications Technology, Japan.
June 26, 2025
Scan it to visit my homepage!
(https://zhiqu22.github.io)
There you can find my slides for today!
Background
Multilingual Neural Machine Translation
A single system that enables arbitrary translation directions among multiple languages.
Zero-Shot Translation
Given a model, if the translation direction from language A to language B exists in training, we say A → B is a supervised translation pair.
On the contrary, if A → B is not seen in training, we say A → B is a zero-shot translation pair.
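Concretely, with English-centric training data, the supervised and zero-shot directions can be enumerated as in this minimal sketch (the language set is illustrative):

```python
# English-centric training: only en<->X directions appear in training.
# The language set is illustrative.
langs = ["en", "de", "fr", "nl"]

all_pairs = {(a, b) for a in langs for b in langs if a != b}
supervised = {(a, b) for (a, b) in all_pairs if "en" in (a, b)}
zero_shot = all_pairs - supervised  # e.g. ("de", "fr") is never seen

print(sorted(zero_shot))
```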
Background
Two main paradigms:
Has the traditional method been abandoned?
We say: No!
But why?
[Figure: the two paradigms, "Traditional" vs. "Current spotlight"]
Background beyond this work
A comparison between the two main paradigms:
Problem
Our 1st motivation:
We want to unveil the success of encoder-decoder architectures at the representation level to guide future studies in MNMT.
Problem:
Although many related works have addressed this, they show a discrepancy: two “seemingly opposite” conclusions.
Our 2nd motivation:
Why? Can we reach a unified conclusion?
Problem
The discrepancy:
Analysis based on sentence-level representations extracted from the encoder output:
Setup for investigation
Datasets:
The training and validation data are English-centric; thus, the translation between two non-English languages is zero-shot.
Models:
Transformer with 6/8/10 encoder layers and 6 decoder layers, size of 512 × 1024.
[Figure: an encoder-decoder model. Input "[de] Hello, world." → Encoder → Decoder → Output "Hallo, welt."]
Identity Pair
Our tool for analysis:
An identity pair is a pseudo, zero-shot pair that translates a sentence into itself, which can represent the optimal state of a language.
Example:
[de] Hello, world. → Hallo, welt.
[en] Hello, world. → Hello, world.
[de] Hallo, welt. → Hallo, welt.
Explanations:
In fact, identity pairs have been intuitively used in prior works (Tiedemann and Scherrer, 2019; Thompson and Post, 2020; Bu et al., 2024), as an assumed indicator of target-language-specific representation states.
However, these works did not explicitly define, validate, or systematically utilize identity pairs.
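Under the common convention of prepending a target-language tag to the encoder input, constructing an identity pair is trivial; the "[xx] " tag format below is an assumption for illustration:

```python
def make_identity_pair(sentence: str, lang: str):
    # The target-language tag equals the source language, so the model
    # is asked to "translate" the sentence into its own language.
    # The "[xx] " tag format is an assumption for illustration.
    source = f"[{lang}] {sentence}"
    target = sentence
    return source, target

pair = make_identity_pair("Hallo, welt.", "de")
# returns ("[de] Hallo, welt.", "Hallo, welt.")
```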
1st Analysis:
Language Transfer Within the Encoder
Given language (1) and language (2), we have three sentence-level cases:
Layer-wise results of 6 encoder layers:
* SVCCA refers to singular value canonical correlation analysis, used to measure the similarity between two sets of representations.
We follow Liu et al. (2021) in comparing sentence-level representations by mean-pooling over all tokens.
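The two steps in the footnote, mean-pooling tokens into a sentence vector and scoring similarity with SVCCA, can be sketched as follows. This is a minimal NumPy implementation of the standard SVCCA recipe; the exact settings used in the paper, such as the variance threshold, are assumptions:

```python
import numpy as np

def mean_pool(token_reps):
    # token_reps: (num_tokens, dim) -> one sentence-level vector
    return token_reps.mean(axis=0)

def svcca(X, Y, var_kept=0.99):
    """Standard SVCCA recipe: SVD-reduce each view, then average the
    canonical correlations (singular values of Qx^T Qy after QR)."""
    def svd_reduce(M):
        M = M - M.mean(axis=0)
        U, S, _ = np.linalg.svd(M, full_matrices=False)
        ratio = np.cumsum(S ** 2) / np.sum(S ** 2)
        k = int(np.searchsorted(ratio, var_kept)) + 1
        return U[:, :k] * S[:k]  # directions kept for `var_kept` variance
    Qx, _ = np.linalg.qr(svd_reduce(X))
    Qy, _ = np.linalg.qr(svd_reduce(Y))
    corr = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(corr, 0.0, 1.0).mean())
```

Here X and Y would be matrices of mean-pooled sentence vectors for the same sentences in two languages; identical inputs score 1.0.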
Layer-wise results of 8 and 10 encoder layers:
What phenomena can we observe?
We can conclude:
Translation representations are indeed transferred to the target language layer by layer.
This supports and supplements the view of Kudugunta et al. (2019).
However, how can we explain the semantic alignment?
Token-level analysis by t-SNE in TED-19:
Four real sentences translated into English, plus the identity pair belonging to English.
[Figure: t-SNE plots at the Embedding Layer (variance 1.45) and the Encoder Output (variance 0.09)]
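The variance contrast (1.45 at the embedding layer vs. 0.09 at the encoder output) can be illustrated with synthetic data. As a dependency-free stand-in for t-SNE, the sketch below measures how the variance of per-language centroids shrinks when language clouds collapse onto a shared subspace; all numbers and the `spread` parameter are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

def language_clouds(spread):
    # One cloud of 50 token vectors per "language"; cloud centers are
    # drawn `spread` apart. All numbers here are illustrative.
    centers = rng.normal(0.0, spread, size=(5, dim))
    return [c + rng.normal(0.0, 0.1, size=(50, dim)) for c in centers]

def centroid_variance(clouds):
    # Variance of per-language centroids: high = languages separated,
    # low = languages aligned in a shared subspace.
    centroids = np.stack([c.mean(axis=0) for c in clouds])
    return float(centroids.var())

var_embedding = centroid_variance(language_clouds(spread=1.0))   # separated
var_encoder   = centroid_variance(language_clouds(spread=0.05))  # aligned
```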
Language Transfer means:
Representations from different languages are semantically aligned in the representational subspace of the target language.
The “seemingly opposite” opinions are two sides of the same thing!
2nd Analysis:
Entanglements Hindering the Transfer
Let’s further explore the case with different target languages:
This phenomenon matches the performance ranking, i.e., identity > supervised translation > zero-shot translation.
Namely, the entanglement across languages causes the zero-shot translation deficiency.
3rd Analysis:
Language Features in the Decoder
The ideal situation for the decoder is that the hidden states involve only target-language features.
However, this is hard to achieve:
Based on our analysis, the lower layers of the decoder are weak at distinguishing language features.
* Given the time limitation, we give only a brief explanation of this part.
Methods
On the encoder side: Low-rank Language-specific Embedding (LoLE)
On the decoder side: Language-specific Contrastive Learning of Representations (LCLR)
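LCLR's exact formulation is given in the paper; as a hedged illustration, a generic InfoNCE-style contrastive loss over decoder states, treating same-target-language states as positives and other languages' states as negatives, can be sketched as follows (all names and the temperature value are assumptions):

```python
import numpy as np

def lclr_style_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: pull the anchor state toward
    same-target-language states (positives) and push it away from
    states of other languages (negatives). This is a generic sketch;
    the paper's exact LCLR formulation may differ."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.array([cos(anchor, p) for p in positives]) / temperature
    neg = np.array([cos(anchor, n) for n in negatives]) / temperature
    logits = np.concatenate([pos, neg])
    m = logits.max()  # max-shift for numerical stability
    return -(np.log(np.exp(pos - m).sum())
             - np.log(np.exp(logits - m).sum()))
```

Aligned positives yield a near-zero loss; misaligned ones yield a large loss.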
Experimental Setup
Datasets:
Evaluation metrics: SacreBLEU, BERTScore
Training from scratch:
Fine-tuning, only experimented with TED-19:
Experimental Results
These pre-trained models are trained with a different strategy instead of prepending a target-language tag at the beginning of the input.
The improvements prove that enhancing target-language information is always beneficial.
4th Analysis:
Correlation between Performance and Representation
First of all: averaged scores may obscure language-specific tendencies; however, hundreds of pairs are hard to illustrate individually.
So, we show averaged scores per language family.
The performance differences show a tendency similar to the SVCCA scores.
To verify this, we conduct a Pearson correlation analysis:
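A Pearson correlation analysis takes only a few lines; the per-family numbers below are hypothetical placeholders, not the paper's results:

```python
import numpy as np

# Hypothetical per-family averages (placeholders, not the paper's data):
# SVCCA similarity vs. the zero-shot BLEU gap for five language families.
svcca_scores = np.array([0.62, 0.55, 0.71, 0.48, 0.66])
bleu_gaps    = np.array([4.1, 5.0, 3.2, 5.8, 3.7])

r = np.corrcoef(svcca_scores, bleu_gaps)[0, 1]  # Pearson r
print(f"Pearson r = {r:.3f}")  # strongly negative for these numbers
```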
5th Analysis:
Improved Representation
Summary
Improving target-language features during transfer.
Thank you for listening!
Q&A
Limitation
We mentioned that the identity pair is a proxy; thus, our analysis is based on bidirectional training.
We conduct an additional experiment, removing it from the decoder and removing nl from the encoder. Then, we use German as an intermediate language to replace the role of the identity pair.
Although the additional result (b) differs slightly from (a), it keeps a similar tendency.