Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking
Nikita Moghe
Mark Steedman
Alexandra Birch
A Task Oriented Multilingual Dialogue System
2
[Figure: pipeline of Language Understanding → Dialogue Manager → Response Generation; each component is data hungry]
English:
User: Hi, I'm looking for a restaurant inside the Arden Fair mall
→ inform(service=restaurant, location=Arden Fair mall)
→ request(cuisine)
System: What kind of restaurant are you looking for?
German:
User: Hallo, ich bin auf der Suche nach einem Restaurant im Arden Fair Einkaufszentrum
(Hi, I'm looking for a restaurant inside the Arden Fair mall)
System: Welche Art von Restaurant suchen Sie denn genau?
(What kind of restaurant exactly are you looking for?)
Is Transfer Learning via Multilingual Encoders sufficient?
… collecting labelled training data for every language is infeasible
“transfer” the knowledge from one language to another using shared multilingual encoders like mBERT
But multilingual encoders are trained on data that is very different from human conversations.
How do we make mBERT more conversational without extensive training?
3
Key Contribution
4
Fine-tuning multilingual pre-trained models on different but related dialogue data and tasks, before fine-tuning on the target task, enhances cross-lingual transfer learning for dialogue state tracking
Proposed Method
5
[Figure: the proposed training pipeline]
Pretrained Language Model (mBERT)
→ Intermediate Fine-tuning: mBERT + classification network predicts the masked words in parallel subtitles, yielding mBERT+
"Und was wird aus uns? Uns bleibt [MASK] Paris. But what [MASK] us? We'll always have Paris." → "immer", "about"
→ Target Task Training: mBERT+ + dialogue state tracker, trained on source + x% target language
"Find me some luxurious hotel in the center" → {price: expensive, area: center}
→ Target Task Evaluation: on source and target language
"Das im westlichen Teil der Stadt, bitte. Kann ich die Telefonnummer haben?" (The one in the western part of the city, please. Can I have the phone number?) → {gegend: westen, request: telefon}
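A minimal sketch of this two-stage recipe, assuming the HuggingFace transformers and datasets libraries; the tiny in-memory dataset, the hyperparameters, and the commented-out tracker step are illustrative placeholders, not the authors' implementation.

```python
# Stage 1: cross-lingual intermediate fine-tuning of mBERT with a masked-LM
# objective on parallel subtitles. Stage 2: plug the resulting encoder
# ("mBERT+") into a dialogue state tracker and train on the target task.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# In practice this would be ~200K parallel subtitle pairs; two examples shown here.
subtitle_pairs = Dataset.from_dict({
    "de": ["Und was wird aus uns? Uns bleibt immer Paris.",
           "Dieser Fledermaus können Sie nichts anhaben."],
    "en": ["But what about us? We'll always have Paris.",
           "You'll never harm the bat."],
})

def encode_pair(example):
    # Concatenate the two languages so the model can use one side to fill in
    # masked words on the other side (TLM-style intermediate task).
    return tokenizer(example["de"], example["en"], truncation=True, max_length=256)

mlm_data = subtitle_pairs.map(encode_pair, remove_columns=["de", "en"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

Trainer(model=mlm_model,
        args=TrainingArguments(output_dir="mbert-plus", num_train_epochs=1),
        train_dataset=mlm_data,
        data_collator=collator).train()
mlm_model.save_pretrained("mbert-plus")   # this is "mBERT+"

# Stage 2 (pseudocode): load "mbert-plus" as the encoder of the dialogue state
# tracker (e.g. SUMBT) and train on 100% source + x% target-language dialogues,
# exactly as one would with vanilla mBERT.
```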
Intermediate Task Data
00:01:15,200 --> 00:01:20,764
Nehmt die Halme, schlagt sie oben ab, entfernt die Blätter
00:01:15,200 --> 00:01:20,764
Take the stalks, cut them off at the top, remove the leaves
Why?
Parallel
Conversational
Continuous
Available for 1782 language pairs
How?
We extract 200K subtitles as they are
6
OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. Lison and Tiedemann, LREC 2016
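Purely as an illustration of "extracting subtitles as they are" (the released OpenSubtitles2016 corpus already provides aligned sentence pairs), here is a sketch that pairs the text of two time-aligned SRT files; the file paths are hypothetical.

```python
# Read an .srt file and keep only the subtitle text, dropping the index line
# and the "00:01:15,200 --> 00:01:20,764" timing line of each block.
def read_srt(path):
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    texts = []
    for block in blocks:
        lines = [line.strip() for line in block.splitlines()]
        text_lines = [l for l in lines if l and "-->" not in l and not l.isdigit()]
        texts.append(" ".join(text_lines))
    return texts

def subtitle_pairs(en_path, de_path):
    # Assumes the two files are already aligned block-by-block (same timestamps).
    return list(zip(read_srt(en_path), read_srt(de_path)))

# subtitle_pairs("movie.en.srt", "movie.de.srt") ->
# [("Take the stalks, cut them off at the top, remove the leaves",
#   "Nehmt die Halme, schlagt sie oben ab, entfernt die Blätter"), ...]
```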
Cross-Lingual Intermediate Tasks
Task: Predict the masked word
7
English subtitle: Who is it, Martin? A bat, Professor. Don't waste your pellets. You'll never harm the bat.
German subtitle: Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.
Cross-Lingual Intermediate Tasks
Cross-lingual dialogue modelling (XDM)
8
Masked input (English context): Who is it, Martin? A [MASK] Professor. Don't waste your pellets
Masked input (German response): Dieser Fledermaus [MASK] Sie nichts anhaben
Original English subtitle: Who is it, Martin? A bat, Professor. Don't waste your pellets. You'll never harm the bat.
Original German subtitle: Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.
Cross-Lingual Intermediate Tasks
Response Masking (RM)
9
Original English subtitle: Who is it, Martin? A bat, Professor. Don't waste your pellets. You'll never harm the bat.
Original German subtitle: Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.
Masked input (English context): Who is it, Martin? A bat, Professor. Don't waste your pellets
Masked input (the entire response is masked): [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
Cross-Lingual Intermediate Tasks
Translation Language Modelling (TLM)
10
Masked input (English and German subtitles concatenated): Who is it, Martin? A [MASK], Professor. Don't waste your pellets. You'll never [MASK] the bat. Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein [MASK] darauf. Dieser Fledermaus können Sie [MASK] anhaben
Original English subtitle: Who is it, Martin? A bat, Professor. Don't waste your pellets. You'll never harm the bat.
Original German subtitle: Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.
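A simplified, word-level sketch contrasting the three intermediate tasks above (XDM, RM, TLM). The real setup masks WordPiece tokens inside mBERT's input format; this only illustrates which positions get masked.

```python
import random

MASK = "[MASK]"

def mask_words(words, prob=0.15):
    """Replace each word with [MASK] with probability `prob` (MLM-style)."""
    return [MASK if random.random() < prob else w for w in words]

def xdm_example(en_context, de_response):
    """Cross-lingual Dialogue Modelling (XDM): source-language context plus
    target-language response, with masked words to predict in both."""
    return mask_words(en_context.split()) + mask_words(de_response.split())

def rm_example(en_context, de_response):
    """Response Masking (RM): the whole target-language response is masked and
    must be reconstructed from the source-language context."""
    return en_context.split() + [MASK] * len(de_response.split())

def tlm_example(en_subtitle, de_subtitle):
    """Translation Language Modelling (TLM): concatenate the parallel subtitles
    and mask words on both sides, so each language can help fill in the other."""
    return mask_words(en_subtitle.split()) + mask_words(de_subtitle.split())

en_context = "Who is it, Martin? A bat, Professor. Don't waste your pellets."
de_response = "Dieser Fledermaus können Sie nichts anhaben"
print(rm_example(en_context, de_response))
# ['Who', 'is', 'it,', ..., 'pellets.', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
```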
Dialogue State Tracking (DST)
Mapping the conversation to a semantic form that stores the user's goals
11
The dialogue state can be expressed in either the source or the target language (U: User, S: System)
Training (multilingual encoder-based Dialogue State Tracker):
U: I want to dine at a lavish restaurant somewhere in the north of the town.
S: Do you have a specific cuisine in mind?
U: How about Italian?
→ Restaurant: {price: expensive, area: north, cuisine: italian}
Evaluation (multilingual encoder-based Dialogue State Tracker):
U: 我正在寻找一家价格便宜且评分为 5 的餐厅。 (I am looking for a restaurant that has a cheap price range and a rating of 5.)
→ Restaurant: {price: cheap, rating: 5}
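A toy illustration of the slide's example: the dialogue state is a set of slot-value pairs accumulated over turns. The turn-level updates below are written by hand, whereas a real tracker predicts them from the dialogue.

```python
# Dialogue state tracking as slot-value accumulation across turns.
turns_and_updates = [
    ("U: I want to dine at a lavish restaurant somewhere in the north of the town.",
     {"restaurant-price": "expensive", "restaurant-area": "north"}),
    ("S: Do you have a specific cuisine in mind?", {}),
    ("U: How about Italian?", {"restaurant-cuisine": "italian"}),
]

state = {}
for utterance, update in turns_and_updates:
    state.update(update)   # a real tracker predicts `update` from the dialogue so far
    print(utterance)
    print("  state:", state)
# Final state: {'restaurant-price': 'expensive', 'restaurant-area': 'north',
#               'restaurant-cuisine': 'italian'}
```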
Experimental Setup
12
Overview of the ninth dialog system technology challenge: DSTC9. Gunasekara et al., arXiv 2020
Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Mrkšić et al., ACL 2017
SUMBT: Slot-utterance matching for universal and scalable belief tracking. Lee et al., ACL 2019
Name | Parallel MultiWoZ | Multilingual WoZ |
Original dataset | English MultiWoZ 2.1 | English WoZ 2.0 |
Domains | 7 | 1 |
Languages | English (En) → Chinese (Zh); Chinese → English | English → German (De); English → Italian (It) |
Dialogue State Tracker | SUMBT (Lee et al., 2019) | Multiple Binary Classification Model |
Training Data | 100% Source + 10% Target | 100% Source |
Dialogue State Language | Source | Target |
All experiments with mBERT
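The table above lists a "Multiple Binary Classification Model" as the tracker for Multilingual WoZ; here is one plausible shape of such a tracker, sketched with PyTorch and HuggingFace transformers. The slot-value inventory, the single shared linear head, and the use of the [CLS] vector are assumptions for illustration, not the authors' exact model.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BinaryDST(nn.Module):
    """mBERT encodes the turn; an independent sigmoid decision is made for every
    candidate (slot, value) pair: is it part of the dialogue state or not?"""

    def __init__(self, slot_values, encoder_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.slot_values = slot_values                    # e.g. [("price", "cheap"), ...]
        hidden = self.encoder.config.hidden_size
        self.heads = nn.Linear(hidden, len(slot_values))  # one logit per (slot, value)

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return torch.sigmoid(self.heads(cls))             # independent binary decisions

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BinaryDST([("price", "cheap"), ("price", "expensive"), ("area", "north")])
batch = tok(["Find me some luxurious hotel in the center"], return_tensors="pt")
probs = model(batch["input_ids"], batch["attention_mask"])   # shape: (1, 3)
```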
Results
13
Competitive Models: Intermediate Fine-tuning
Baseline: vanilla mBERT encoder
Monolingual Dialogue Modelling (MonoDM): fine-tune on monolingual conversations in the source and target languages
Task-Adaptive Pre-Training (TAPT): fine-tune on the target task's dialogues in the source language
14
. Don’t stop pretraining:Adapt language models to domains and tasks, Gururangan et al., ACL 2020
Comparison with Intermediate Fine-tuning models
15
Intermediate fine-tuning helps; cross-lingual intermediate fine-tuning works even better
Competitive Models: Literature
Cross-lingual Neural Belief Tracker (XL-NBT): teacher-student network based transfer learning
Attention-Informed Mixed-Language Training (AMLT): replace high-attention words with target-language synonyms and re-train the DST model
Cross-lingual Code-Switched Augmentation (CLCSA): dynamically replace source-language words with target-language words during DST training (a rough sketch follows below)
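A rough sketch of the code-switching idea behind CLCSA described above; the tiny bilingual dictionary and the replacement probability are made up for illustration and this is not the original implementation.

```python
import random

# Hypothetical English -> German dictionary entries.
BILINGUAL_DICT = {
    "restaurant": ["Restaurant"],
    "cheap": ["billig", "günstig"],
    "north": ["Norden"],
}

def code_switch(sentence, dictionary=BILINGUAL_DICT, prob=0.3):
    """Randomly replace source-language words with target-language translations,
    so the encoder sees mixed-language input during DST training."""
    out = []
    for word in sentence.split():
        translations = dictionary.get(word.lower())
        if translations and random.random() < prob:
            out.append(random.choice(translations))
        else:
            out.append(word)
    return " ".join(out)

# code_switch("I want a cheap restaurant in the north")
# -> "I want a günstig Restaurant in the Norden"   (one possible sample)
```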
17
Comparison with competitive models
18
XL-NBT: A cross-lingual neural belief tracking framework. Chen et al., EMNLP 2018
Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. Liu et al., AAAI 2020
CoSDA-ML: Multi-lingual code-switching data augmentation for zero-shot cross-lingual NLP. Qin et al., IJCAI 2020
Competitive Models: Machine Translation
Translate-train: mBERT encoder + training set machine-translated into the target language
In-language training: mBERT encoder + 100% target-language training set
Translate-test: monolingual BERT + test set machine-translated into the source language at evaluation time
19
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. Hu et al., ICML 2020
Comparison with Machine Translation based methods
20
Multilingual encoder + Intermediate Fine-tuning > Machine Translation-based models
Data Efficiency of Cross-Lingual Intermediate Fine-tuning
21
[Chart: vanilla mBERT vs. mBERT + TLM as the amount of target-language task data varies; e.g., 1.9 vs. 12.3 with 1% of target data]
Cross-lingual intermediate fine-tuning performs better even with as little as 1% of the target-language task data
Ablation Studies
Intermediate data that matches the domain of the target task works better (movie subtitles > news text)
Using more or fewer than 200K subtitles to create intermediate training examples does not further improve performance on these datasets
Using subtitles with dialogue history is better than using isolated subtitles
22
Takeaways
Cross-lingual intermediate fine-tuning:
makes multilingual encoders like mBERT more conversational using only parallel movie subtitles
improves cross-lingual transfer for dialogue state tracking over vanilla mBERT, monolingual intermediate fine-tuning, and machine-translation-based methods
remains effective with as little as 1% of target-language task data
23