1 of 23

Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking

Nikita Moghe

Mark Steedman

Alexandra Birch

2 of 23

A Task Oriented Multilingual Dialogue System

[Diagram: Language Understanding → Dialogue Manager → Response Generation]

English example:
  User: Hi, I'm looking for a restaurant inside the Arden Fair mall
  Language Understanding: inform(service=restaurant, location=Arden Fair mall)
  Dialogue Manager: request(cuisine)
  System: What kind of restaurant are you looking for?

German example (the same exchange in German):
  User: Hallo, ich bin auf der Suche nach einem Restaurant im Arden Fair Einkaufszentrum
  System: Welche Art von Restaurant suchen Sie denn genau?

Every component is data hungry, for English and German alike.
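
The semantic frames in this diagram (inform, request) are simple structured objects; a minimal sketch in Python, where the class name and fields are illustrative rather than tied to any specific toolkit:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class DialogueAct:
        intent: str                                    # e.g. "inform" or "request"
        slots: Dict[str, Optional[str]] = field(default_factory=dict)

    # Language understanding output for the English user turn above:
    nlu_act = DialogueAct("inform", {"service": "restaurant", "location": "Arden Fair mall"})
    # Dialogue manager decision (the value is still unknown, hence None):
    dm_act = DialogueAct("request", {"cuisine": None})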

3 of 23

Is Transfer Learning via Multilingual Encoders sufficient?

Collecting labelled training data for every language is infeasible.

Instead, “transfer” the knowledge from one language to another using shared multilingual encoders like mBERT.

But multilingual encoders are trained on data that differs from human conversations.

How do we make mBERT more conversational without extensive training?


4 of 23

Key Contribution


Fine-tuning multilingual pre-trained models on different but related dialogue data and tasks, before fine-tuning on the target task, enhances cross-lingual transfer learning for dialogue state tracking.

5 of 23

Proposed Method


[Diagram: the proposed pipeline]

Pretrained Language Model: mBERT

Intermediate Fine-tuning: masked prediction on parallel subtitles, e.g.
  "Und was wird aus uns? Uns bleibt [MASK] Paris. But what [MASK] us? We'll always have Paris."
  (the model predicts "immer" and "about"), producing the adapted encoder mBERT+

Target Task Training (source + x% target language) and Target Task Evaluation (source and target language): mBERT+ plus a classification network form the Dialogue State Tracker, e.g.
  "Find me some luxurious hotel in the center" → {price: expensive, area: center}
  "Das im westlichen Teil der Stadt, bitte. Kann ich die Telefonnummer haben?" → {gegend: westen, request: telefon}
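
A minimal sketch of the two-stage recipe, assuming the HuggingFace transformers library: Stage 1 adapts mBERT with masked prediction on parallel subtitle text, Stage 2 hands the adapted encoder to the dialogue state tracker. The toy dataset, the directory name "mbert-intermediate", and the hyperparameters are illustrative, and the actual trackers (SUMBT, the binary-classification model) are only indicated by a comment.

    import torch
    from transformers import (BertTokenizerFast, BertForMaskedLM, BertModel,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    MODEL = "bert-base-multilingual-cased"            # mBERT
    tokenizer = BertTokenizerFast.from_pretrained(MODEL)

    class TextDataset(torch.utils.data.Dataset):
        """Tiny in-memory dataset of tokenized texts."""
        def __init__(self, texts):
            self.enc = tokenizer(texts, truncation=True, max_length=128)
        def __len__(self):
            return len(self.enc["input_ids"])
        def __getitem__(self, i):
            return {k: v[i] for k, v in self.enc.items()}

    # --- Stage 1: cross-lingual intermediate fine-tuning (masked prediction) ---
    # Each training text pairs a subtitle with its translation (toy example).
    parallel_texts = [
        "Und was wird aus uns? Uns bleibt immer Paris. "
        "But what about us? We'll always have Paris."
    ]
    mlm_model = BertForMaskedLM.from_pretrained(MODEL)
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    Trainer(
        model=mlm_model,
        args=TrainingArguments(output_dir="mbert-intermediate",
                               num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=TextDataset(parallel_texts),
        data_collator=collator,
    ).train()
    mlm_model.save_pretrained("mbert-intermediate")   # the adapted encoder, i.e. "mBERT+"

    # --- Stage 2: target task training ---
    # The dialogue state tracker (e.g. SUMBT) re-uses the adapted encoder, adds its
    # classification network, is trained on the source language plus a small fraction
    # of target-language dialogues, and is then evaluated in both languages.
    encoder = BertModel.from_pretrained("mbert-intermediate")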

6 of 23

Intermediate Task Data

00:01:15,200 --> 00:01:20,764

Nehmt die Halme, schlagt sie oben ab, entfernt die Blätter

00:01:15,200 --> 00:01:20,764

Take the stalks, cut them off at the top, remove the leaves

Why?

Parallel

Conversational

Continuous

Available for 1782 language pairs

How?

We extract 200K subtitles as they are


OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. Lison and Tiedemann, LREC 2016
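
A minimal sketch of preparing the 200K intermediate-task examples, assuming the OPUS plain-text release of OpenSubtitles with two line-aligned files; the file names, the history window of two previous segments, and the sampling seed are illustrative.

    import random

    def load_aligned(en_path, de_path):
        """Read the aligned subtitle files; line i of each file is a translation pair."""
        with open(en_path, encoding="utf-8") as f_en, open(de_path, encoding="utf-8") as f_de:
            return [(en.strip(), de.strip()) for en, de in zip(f_en, f_de)]

    def with_history(pairs, history=2):
        """Keep consecutive segments together so every example carries dialogue history."""
        for i in range(history, len(pairs)):
            window = pairs[i - history:i + 1]
            yield (" ".join(en for en, _ in window),
                   " ".join(de for _, de in window))

    pairs = load_aligned("OpenSubtitles.de-en.en", "OpenSubtitles.de-en.de")
    examples = list(with_history(pairs))
    random.Random(13).shuffle(examples)
    examples = examples[:200_000]        # "200K subtitles as they are"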

7 of 23

Cross-Lingual Intermediate Tasks

Task: Predict the masked word


Who is it, Martin? A bat, Professor. Don’t waste your pellets. You’ll never harm the bat.

Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.

English subtitle

German subtitle

Properties:

  1. Use contextual information from dialogue history�
  2. Use cross-lingual information from parallel data

8 of 23

Cross-Lingual Intermediate Tasks

Cross-lingual dialogue modelling (XDM)


Who is it, Martin? A [MASK] Professor. Don’t waste your pellets

Dieser Fledermaus [MASK] Sie nichts anhaben

Who is it, Martin? A bat, Professor. Don’t waste your pellets. You’ll never harm the bat.

Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.

English subtitle

German subtitle
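
A minimal sketch of how an XDM training example could be built from the pair above: the source-language dialogue history and the target-language response are joined into one sequence and tokens on both sides are masked at random. It assumes the HuggingFace mBERT tokenizer; the helper names and the 15% masking rate are assumptions rather than the paper's exact settings.

    import random
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

    def mask_tokens(ids, ratio=0.15, seed=13):
        """Replace a random subset of non-special token ids with [MASK]."""
        rng = random.Random(seed)
        special = set(tokenizer.all_special_ids)
        masked, labels = [], []
        for tok in ids:
            if tok not in special and rng.random() < ratio:
                masked.append(tokenizer.mask_token_id)
                labels.append(tok)          # the model must recover this token
            else:
                masked.append(tok)
                labels.append(-100)         # ignored by the masked-LM loss
        return masked, labels

    def make_xdm_example(source_history, target_response):
        """Source-language dialogue history + target-language response, randomly masked."""
        enc = tokenizer(" ".join(source_history), target_response)
        return mask_tokens(enc["input_ids"])

    ids, labels = make_xdm_example(
        ["Who is it, Martin?", "A bat, Professor.", "Don't waste your pellets."],
        "Dieser Fledermaus können Sie nichts anhaben.")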

9 of 23

Cross-Lingual Intermediate Tasks

Response Masking (RM)


Who is it, Martin? A bat, Professor. Don’t waste your pellets. You’ll never harm the bat.

Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.

Who is it, Martin? A bat, Professor. Don’t waste your pellets

[MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
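
Response Masking can be sketched by reusing the tokenizer from the XDM example: the source-language history is kept intact and every token of the target-language response is masked, so the model must reconstruct the whole response from cross-lingual context. The helper name is illustrative.

    def make_rm_example(source_history, target_response):
        """Keep the history, mask the entire response."""
        hist = tokenizer(" ".join(source_history))["input_ids"]                  # [CLS] ... [SEP]
        resp = tokenizer(target_response, add_special_tokens=False)["input_ids"]
        input_ids = hist + [tokenizer.mask_token_id] * len(resp) + [tokenizer.sep_token_id]
        labels = [-100] * len(hist) + resp + [-100]      # loss only on the response tokens
        return input_ids, labels

    ids, labels = make_rm_example(
        ["Who is it, Martin?", "A bat, Professor.", "Don't waste your pellets."],
        "Dieser Fledermaus können Sie nichts anhaben.")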

10 of 23

Cross-Lingual Intermediate Tasks

Translation Language Modelling (TLM)


Who is it, Martin? A [MASK], Professor. Don’t waste your pellets. You’ll never [MASK] the bat. Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein [MASK] darauf. Dieser Fledermaus können Sie [MASK] anhaben

Who is it, Martin? A bat, Professor. Don’t waste your pellets. You’ll never harm the bat.

Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.
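
Translation Language Modelling, sketched with the same tokenizer and mask_tokens helper as the XDM example: the full source block and its translation are concatenated and masked at random on both sides.

    def make_tlm_example(source_block, target_block):
        """Concatenate the parallel blocks ([CLS] src [SEP] tgt [SEP]) and mask both sides."""
        enc = tokenizer(source_block, target_block)
        return mask_tokens(enc["input_ids"])

    ids, labels = make_tlm_example(
        "Who is it, Martin? A bat, Professor. Don't waste your pellets. You'll never harm the bat.",
        "Wer ist denn da, Martin? Eine Fledermaus, Herr Professor. "
        "Verschwenden Sie kein Schrot darauf. Dieser Fledermaus können Sie nichts anhaben.")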

11 of 23

Dialogue State Tracking (DST)

Mapping a conversation to a semantic form that stores the user's goals


The output dialogue state can be in either the source or the target language (U: User, S: System)

Training:
  Multilingual encoder based Dialogue State Tracker
  U: I want to dine at a lavish restaurant somewhere in the north of the town.
  S: Do you have a specific cuisine in mind?
  U: How about Italian?
  → Restaurant: {price: expensive, area: north, cuisine: italian}

Evaluation:
  Multilingual encoder based Dialogue State Tracker
  U: 我正在寻找一家价格便宜且评分为 5 的餐厅。
     (I am looking for a restaurant with a cheap price range and a rating of 5.)
  → Restaurant: {price: cheap, rating: 5}

12 of 23

Experimental Setup


Overview of the ninth dialog system technology challenge: DSTC9. Gunasekara et al., arXiv 2020

Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Mrkšić et al., ACL 2017

SUMBT: Slot-utterance matching for universal and scalable belief tracking. Lee et al., ACL 2019

Name                      Parallel MultiWoZ               Multilingual WoZ
Original dataset          English MultiWoZ 2.1            English WoZ 2.0
Domains                   7                               1
Languages                 English (En) → Chinese (Zh);    English → German (De);
                          Chinese → English               English → Italian (It)
Dialogue State Tracker    SUMBT (Lee et al., 2019)        Multiple Binary Classification Model (sketched below)
Training Data             100% Source + 10% Target        100% Source
Dialogue State Language   Source                          Target

All experiments use mBERT.
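
The Multilingual WoZ tracker above is a multiple binary classification model; below is a minimal sketch of that idea, with one sigmoid score per candidate (slot, value) pair on top of the mBERT [CLS] vector, assuming PyTorch and HuggingFace transformers. The architecture and the ontology size are illustrative placeholders, not the paper's exact implementation.

    import torch
    from torch import nn
    from transformers import BertModel, BertTokenizerFast

    class BinarySlotValueTracker(nn.Module):
        """One independent yes/no decision per (slot, value) candidate in the ontology."""
        def __init__(self, encoder_name="bert-base-multilingual-cased", num_slot_values=100):
            super().__init__()
            # Swap in the intermediate fine-tuned checkpoint ("mBERT+") here.
            self.encoder = BertModel.from_pretrained(encoder_name)
            self.classifier = nn.Linear(self.encoder.config.hidden_size, num_slot_values)

        def forward(self, input_ids, attention_mask):
            cls = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state[:, 0]
            return torch.sigmoid(self.classifier(cls))   # P(candidate is in the dialogue state)

    tok = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
    batch = tok(["I want to dine at a lavish restaurant in the north of the town."],
                return_tensors="pt")
    probs = BinarySlotValueTracker()(batch["input_ids"], batch["attention_mask"])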

13 of 23

Results


14 of 23

Competitive Models: Intermediate Fine-tuning

Baseline: vanilla mBERT encoder

Monolingual Dialogue Modelling (MonoDM): fine-tune with monolingual chats in the source and target languages

Task Adaptive Pre-Training (TAPT): fine-tune with target-task chats in the source language


Don’t stop pretraining: Adapt language models to domains and tasks. Gururangan et al., ACL 2020

15 of 23

Comparison with Intermediate Fine-tuning models


Intermediate fine-tuning helps; cross-lingual intermediate fine-tuning works even better

16 of 23

Comparison with Intermediate Fine-tuning models


Intermediate fine-tuning helps; cross-lingual intermediate fine-tuning works even better

17 of 23

Competitive Models: Literature

Cross-lingual Neural Belief Tracker (XL-NBT): teacher-student network based transfer learning

Attention Informed Mixed-Language Training (AMLT): replace high-attention words with target-language synonyms and re-train the DST

Cross-lingual Code-Switched Augmentation (CLCSA): dynamically replace source-language words with target-language words during DST training (a sketch follows below)
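
A minimal sketch of the code-switching idea behind CLCSA, assuming a bilingual dictionary (e.g. a MUSE-style lexicon); the dictionary entries, replacement rate, and function name are illustrative, not the authors' implementation.

    import random

    def code_switch(tokens, bilingual_dict, rate=0.3, seed=None):
        """Randomly swap source-language tokens for dictionary translations during training."""
        rng = random.Random(seed)
        switched = []
        for tok in tokens:
            options = bilingual_dict.get(tok.lower())
            if options and rng.random() < rate:
                switched.append(rng.choice(options))   # train on a code-switched utterance
            else:
                switched.append(tok)
        return switched

    # Toy English→German entries; a real run would load a full bilingual lexicon.
    en_de = {"restaurant": ["Restaurant"], "expensive": ["teuer"], "north": ["Norden"]}
    print(code_switch("I want an expensive restaurant in the north".split(), en_de, seed=13))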


18 of 23

Comparison with competitive models


XL-NBT: A cross-lingual neural belief tracking framework. Chen et al., EMNLP 2018
Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. Liu et al., AAAI 2020
CoSDA-ML: Multi-lingual code-switching data augmentation for zero-shot cross-lingual NLP. Qin et al., IJCAI 2020

19 of 23

Competitive Models: Machine Translation

Translate train: mBERT encoder + training set translated into the target language

In-language training: mBERT encoder + 100% target-language training set

Translate test: monolingual BERT + test set translated into the source language at evaluation


XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. Hu et al., ICML 2020

20 of 23

Comparison with Machine Translation based methods


Multilingual encoder + intermediate fine-tuning > models that use machine translation as an intermediate step

21 of 23

Data Efficiency of Cross-Lingual Intermediate Fine-tuning


[Chart: vanilla mBERT: 1.9 vs. mBERT + TLM: 12.3]

Cross-lingual intermediate fine-tuning performs better even with as little as 1% of the target-language task data

22 of 23

Ablation Studies

Intermediate data matched to the domain of the target task works better (movie subtitles > news text)

Creating intermediate training examples from more or fewer than 200K subtitles does not improve performance on these datasets

Using subtitles with dialogue history is better than using isolated subtitles


23 of 23

Takeaways

Cross-lingual intermediate fine-tuning:

  • Yields significant improvements over vanilla mBERT for cross-lingual DST (> 20% on exact match for all directions)
  • Outperforms competitive methods that use machine translation as an intermediate step (Parallel MultiWoZ dataset)
  • Provides an additive effect when combined with state-of-the-art methods (Multilingual WoZ dataset)
  • Requires only ~14 hours of training on a 2080Ti
  • Can be extended to 1782 language pairs
