
Back to the noisy channel

Taro Watanabe

taro at is.naist.jp

NAIST NLP


Early MT research


When I look at an article in Russian, I say: "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." (Warren Weaver, 1947)

... I frankly am afraid the boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful. (Norbert Wiener, 1947)


History of MT

1950: Code breaking

1960: Rule-based MT (Systran); ALPAC report

1990: Example-based MT; IBM Model

2000: Statistical MT (Phrase-based MT, Syntax-based MT)

2013: Neural MT

2016: Google NMT


MT as transfer

昨日麒麟を散歩した。 → I walked a giraffe yesterday.

Source analysis: 散歩 [arg0: ?, arg1: 麒麟, temp: 昨日]

Target structure: walk [arg0: I, arg1: giraffe, temp: yesterday]

Event: walk(?, giraffe), date(yesterday)


MT as transfer: The Vauquois triangle

The same example, placed on the Vauquois triangle: transfer can operate at any of its levels, from Words up through Syntax and Semantics to an Interlingua. The semantic transfer maps 散歩 [arg0: ?, arg1: 麒麟, temp: 昨日] to walk [arg0: I, arg1: giraffe, temp: yesterday]; the interlingua is the event walk(?, giraffe), date(yesterday).


Example-based MT

Look up similar examples and edit them.


Bilingual Data


上海浦东开发与法制建设同步

新华社上海二月十日电(记者谢金虎、张持坚)

上海浦东近年来颁布实行了涉及经济、贸易、建设、规划、科技、文教等领域的七十一件法规性文件,确保了浦东开发的有序进行。

浦东开发开放是一项振兴上海,建设现代化经济、贸易、金融中心的跨世纪工程,因此大量出现的是以前不曾遇到过的新情况、新问题。

对此,浦东不是简单的采取“干一段时间,等积累了经验以后再制定法规条例”的做法,而是借鉴发达国家和深圳等特区的经验教训,聘请国内外有关专家学者,积极、及时地制定和推出法规性文件,使这些经济活动一出现就被纳入法制轨道。

去年初浦东新区诞生的中国第一家医疗机构药品采购服务中心,正因为一开始就比较规范,运转至今,成交药品一亿多元,没有发现一例回扣。

The development of Shanghai's Pudong is in step with the establishment of its legal system

Xinhua News Agency, Shanghai, February 10, by wire (reporters Jinhu Xie and Chijian Zhang)

In recent years Shanghai's Pudong has promulgated and implemented 71 regulatory documents relating to areas such as economics, trade, construction, planning, science and technology, culture and education, etc., ensuring the orderly advancement of Pudong's development.

Pudong's development and opening up is a century-spanning undertaking for vigorously promoting Shanghai and constructing a modern economic, trade, and financial center. Because of this, new situations and new questions that have not been encountered before are emerging in great numbers.

In response to this, Pudong is not simply adopting an approach of "work for a short time and then draw up laws and regulations only after waiting until experience has been accumulated." Instead, Pudong is taking advantage of the lessons from experience of developed countries and special regions such as Shenzhen by hiring appropriate domestic and foreign specialists and scholars, by actively and promptly formulating and issuing regulatory documents, and by ensuring that these economic activities are incorporated into the sphere of influence of the legal system as soon as they appear.

Precisely because as soon as it opened it was relatively standardized, China's first drug purchase service center for medical treatment institutions, which came into being at the beginning of last year in the Pudong new region, in operating up to now, has concluded transactions for drugs of over 100 million yuan and hasn't had one case of kickback.


Guess Translation

(The same parallel text as above.)


Two modeling approaches to MT

  • Direct modeling (i.e., classification)
    • Good learning algorithm + good model
    • Huge numbers of example inputs (in the source language) and outputs (in the target language).
  • Code breaking (i.e., noisy channel, or generative model)
    • Assume that the target language is known.
    • Many examples of translated texts in the source language.
    • Split into sub-models + complex decoding


Code Breaking

  • Assume that the output y was distorted by noise (the noisy channel).
  • Separate the problem into two parts: a translation model and a language model.

[Diagram: Source → Y → Noisy channel → X → Decoder → Y′]
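In equations, the decoder inverts the channel with Bayes' rule, and decoding seeks

\[
\hat{y} = \operatorname*{argmax}_{y} P(y \mid x)
        = \operatorname*{argmax}_{y} \underbrace{P(x \mid y)}_{\text{translation model}}\,\underbrace{P(y)}_{\text{language model}},
\]

since P(x) is constant for a given input.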


Statistical MT

昨日麒麟を散歩した。 ⇄ I walked a giraffe yesterday.

Translation model: P(昨日 | yesterday), P(散歩 | walked), P(麒麟 | giraffe), P(麒麟 | dog), ...; with context: P(昨日 | yesterday I), P(麒麟 | a giraffe), P(散歩した | I walked), P(散歩した | I walk), ...

Language model: P(I walked a giraffe ...), P(he walked a dog ...)

MT as code breaking.


Direct Modeling

  • Directly represent the translation process with a single model.
  • Learn the model parameters from bilingual data, i.e., pairs of source/target texts.

[Diagram: X → transfer → Y]


Neural MT


昨日麒麟を散歩した。

I walked a giraffe yesterday.

Direct modeling by Neural Networks


Deeper model with residual connection

Stack layers for better representations.

Residual connections to avoid vanishing gradients.

RNNs are slow.

Similar networks in Zhou et al. (2016) and Wu et al. (2016).


Parallel computation

  • Efficient training by convolution (CNN), which allows faster computation.
  • However, it requires many layers, e.g., 40, to capture long-distance relations.

Encoder and decoder are both built from multiple layers of CNN + gating.


Long-distance relations

  • Efficient training by parallel computation, with long-distance relations represented by self-attention.
  • Inefficient decoding, since it needs to memorize a long history.

Attention is position-neutral; the Transformer is attention-heavy.


Quality Improvements on WMT14 EN-FR by Neural MT (NMT)



Research Trends

  • Data: Noise / Low resource / Multilingual data
  • Capacity: Deeper modeling / Smallish model
  • OOV: Rare words / Named entities
  • Features: Context / Domain / Style / Syntax / Semantics
  • Inference: Streaming / Non-autoregressive / Quantization
  • Mode: Supervised / Unsupervised
  • Others: Explainability / Controllability / Bias / Hallucination
  • Evaluation: Count-based / Model-based / Learning-based


Back to the noisy channel


History of MT

(Same timeline as above.)


Why noisy channel?

The direct model is biased toward highly predictive outputs, a.k.a. the explaining-away effect (Klein and Manning, 2001).

  • Extremely large data is necessary to avoid the bias.
  • The distribution of x is unknown during decoding, so the model's behavior on unseen inputs is unpredictable.


Why noisy channel?

The noisy channel model selects likely outputs a priori and then explains the observed distribution of x during decoding.

  • The channel model is robust to noise in x, since the prior encodes a strong belief about what the model should predict.
  • Usually small(ish) data is sufficient to train P(x | y).


Noisy channel with NN

Machine Translation

  • Simple and Effective Noisy Channel Model (Yee et al., 2019)
  • Neural Noisy Channel Model (Yu et al., 2017)
  • Document-level MT (Yu et al., 2020)

Other tasks


Simple and Effective Noisy Channel Model

Employ a standard NMT, e.g., Transformer (Vaswani et al., 2017), for P(x | y).

  • Reranking is easy. However, decoding is complex and slow.
  • Model combination of two directions, p(x | y) and p(y | x).
  • Follow-up work for empirically faster decoding (Bhosale et al., 2020).


Decoding: Direct Model

At each time step, keep the k best candidates in beam search.

  • For each prefix, compute p(y | y1 … yt, x) for every candidate next word y, multiplied by p(y1 … yt | x).

[Beam-search lattice: <s> → {I, He} → {walked, walk, walks} → {a, the} → {giraffe, dog, cat}; e.g., p(I | x) and p(He | x), then p(walked | I, x) × p(I | x) and p(walked | He, x) × p(He | x), and so on.]
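To make the bookkeeping concrete, here is a minimal beam-search sketch for the direct model; log_p_next(prefix, x) is a hypothetical scoring function returning next-token log-probabilities, not an API from any of the cited papers.

```python
from heapq import nlargest

def beam_search(x, log_p_next, k=2, max_len=20, eos="</s>"):
    """Direct-model beam search: keep the k best prefixes per step,
    scored by the cumulative log p(y_1 ... y_t | x)."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-prob, prefix)
    for _ in range(max_len):
        expanded = []
        for score, prefix in beams:
            if prefix[-1] == eos:          # finished hypotheses carry over
                expanded.append((score, prefix))
                continue
            for tok, lp in log_p_next(prefix, x).items():
                expanded.append((score + lp, prefix + [tok]))
        beams = nlargest(k, expanded, key=lambda b: b[0])
    return beams
```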


Decoding: Noisy Channel Model

For each prefix, compute p(x | y) and p(y).

  • p(y) is easy when using an RNN.
  • p(x | y) has to be recomputed for every candidate next word of y.
  • Non-monotonicity problem: unlike the direct model, extending y can increase p(x | y), so prefix scores do not shrink monotonically.

[The same lattice, but each prefix is scored by the channel model and LM: p(x | I) × p(I) and p(x | He) × p(He), then p(x | I walked) × p(I walked) and p(x | He walked) × p(He walked), and so on.]


Decoding: Approximation

Filter the vocabulary space by p(y | x).

  • Compute p(x | y) for the k × k candidates in a batched manner.
  • However, the non-monotonicity problem still exists.

[The same lattice: at each step the direct model p(y | y1 … yt, x) proposes the next words for each beam, and only those k × k continuations are rescored with p(x | y) × p(y).]
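A sketch of one step of this approximation, assuming hypothetical scoring functions direct_topk, channel_logp, and lm_logp; the exact weighting in Yee et al. (2019) may differ.

```python
from heapq import nlargest

def channel_step(beams, x, direct_topk, channel_logp, lm_logp, k=2, lam=1.0):
    """One beam-search step: the direct model proposes k next tokens per
    beam, then the k*k continuations are rescored with the channel model
    and language model, log p(x|y) + lam * log p(y)."""
    candidates = []
    for _, prefix in beams:
        for tok in direct_topk(prefix, x, k):   # filter by p(y | x)
            y = prefix + [tok]
            score = channel_logp(x, y) + lam * lm_logp(y)
            candidates.append((score, y))
    return nlargest(k, candidates, key=lambda c: c[0])
```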


Model Combination

Combine p(y | x) and p(x | y) with length normalization.

  • Additional parameter tuning for λ on validation/development data.
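One plausible form of the length-normalized combination (the exact weighting in Yee et al., 2019, may differ):

\[
\frac{1}{|y|}\Bigl[\log p(y \mid x) + \lambda\bigl(\log p(x \mid y) + \log p(y)\bigr)\Bigr],
\]

with λ tuned on validation/development data.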


Experimental Results

[Chart: WMT17 De-En BLEU. Best results by the noisy channel model (NCM); reranking results show larger gains.]


Neural Noisy Channel Model

Introduce a latent variable z to indicate a monotonic alignment between x and y.

  • Based on a segment-based direct model (Yu et al., 2016).
  • Factorize the channel model into two sub-components: alignment and word probabilities.
  • Inference is carried out efficiently by dynamic programming.
  • Decoding is approximated with the direct model.
  • Model combination.


Factorization

Split into two sub-models using z: an alignment model and a word model.

  • p(zi | zi-1, ...): predict the next "segment" of y.
  • p(xi | ...): predict the next word of x.

[Equation: the channel model factorized into the alignment model and the word model via the latent alignment variable z.]
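Written out (a reconstruction following Yu et al., 2017, with conditioning details abbreviated):

\[
p(x \mid y) = \sum_{z} \prod_{i=1}^{|x|}
  \underbrace{p(z_i \mid z_{i-1}, x_{1:i-1}, y)}_{\text{alignment model}}\;
  \underbrace{p(x_i \mid x_{1:i-1}, y_{1:z_i})}_{\text{word model}}
\]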


Alignment: z

For each position in x, specify how to monotonically segment y, i.e., the end position of each span.

  • The idea is very similar to IBM Models (Brown et al., 1993).

[Example: z = (2, 2, 3, 4, 4, 5, 5, 5, 5, 7, 7, 8), i.e., x1 and x2 are generated from y1:2, x3 from y1:3, x4 and x5 from y1:4, and so on.]


Action sequence: a

Transitions are modeled by an action sequence a ∈ {SHIFT, EMIT}^(|x| + |y|), one SHIFT per word of y and one EMIT per word of x.

  • SHIFT: read the next word of y, i.e., increment the index for zi.
  • EMIT: stay and emit xi.

[Example: per position of x, the actions are (SHIFT, EMIT), (EMIT), (SHIFT, EMIT), (SHIFT, EMIT), (EMIT), (SHIFT, EMIT), (EMIT).]
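A tiny sketch (an illustration, not from the paper) of how a monotone alignment z expands into an action sequence of length |x| + |y|:

```python
def z_to_actions(z):
    """Convert a monotone alignment z, where z[i] is the number of y words
    read when emitting x[i], into a SHIFT/EMIT action sequence."""
    actions, read = [], 0
    for zi in z:
        actions += ["SHIFT"] * (zi - read)  # read y up to position zi
        actions.append("EMIT")              # emit the next word of x
        read = zi
    return actions

# e.g. z_to_actions([2, 2, 3])
# -> ['SHIFT', 'SHIFT', 'EMIT', 'EMIT', 'SHIFT', 'EMIT']
```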


Alignment Model

p(zi | zi-1, ...) is modeled through the action sequence a.

  • The alignment probability depends only on prefixes of x and y.


Instantiated as NN

The model is instantiated as two LSTMs, one for x and the other for y.

  • Probabilities for the sub-components are simply MLPs over the concatenated hidden states.

[Diagram: an LSTM over x1, x2, x3 and an LSTM over y1, y2, y3; their hidden states are concatenated.]


Inference

Forward-backward algorithm (Rabiner, 1989) for efficient inference.

  • Backward is not necessary for exponential-family models; you can simply backprop (Eisner, 2016).

[Equation: the forward recurrence; a table memorizes intermediate computations, and the recurrence sums over all alignments z.]
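A standard forward recursion for this model (a reconstruction; α(i, j) accumulates the probability of emitting x1:i with zi = j):

\[
\alpha(i, j) = p(x_i \mid x_{1:i-1}, y_{1:j}) \sum_{j' \le j} \alpha(i-1, j')\, p(z_i = j \mid z_{i-1} = j'),
\qquad
p(x \mid y) = \sum_{j} \alpha(|x|, j).
\]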


Decoding

Approximate decoding by searching for the maximum over y and z.

  • Use the direct model p(y | x) to filter the search space.
  • Less prone to search errors, since the model can compute prefix scores (?).


Model Combination

Model combination of the two directions plus a length bias.

  • No length normalization (?)
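A plausible form of such a combination, with tunable weights (not necessarily the paper's exact formula):

\[
\log p(y \mid x) + \lambda_1 \log p(x \mid y) + \lambda_2 \log p(y) + \lambda_3 |y|.
\]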


Experimental Results

  • Complex decoding similar to Yee et al. (2019), but much faster: O(|x| + |y|) in contrast to O(|x||y|).

[Table: LDC zh-en results.]


Document-level MT

Extend the MT task to document-level context.

  • No special context-aware modeling (cf. Voita et al., 2018):
    • a standard MT model, e.g., Transformer (Vaswani et al., 2017), for P(x | y);
    • a standard LM, e.g., Transformer-XL (Dai et al., 2019), for P(y).
  • Decoding is sentence-wise reranking with model combination.


Factorization

Translate the whole document x into y with a noisy channel model.

  • Sentence translation model: p(xi | yi)
  • Document language model: p(yi | y<i)
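Putting the two together over sentence indices i, with xi and yi the i-th source/target sentences:

\[
p(x, y) = \prod_{i} p(y_i \mid y_{<i})\; p(x_i \mid y_i).
\]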


Graphical Model

A very strong conditional independence assumption between x and y.

  • However, x influences the choice of y during decoding.
  • The same trick as in naïve Bayes classifiers.


Decoding

Compute k-best translations for all sentences in x using q(y | x), then rescore with beam search using p(x | y) p(y).

  • q(y | x) is either a sentence-wise or a document-wise model.

[Diagram: for each sentence xi, candidates yi,1, yi,2, yi,3 are proposed by p(y | x); beam search over the candidate lists with p(x | y) p(y).]
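A minimal sketch of this propose-and-rerank scheme, assuming hypothetical scoring functions channel_logp and doclm_logp:

```python
from heapq import nlargest

def rerank_document(cands, channel_logp, doclm_logp, k=2):
    """cands[i] is the k-best candidate list for sentence i. Keep a beam
    over partial documents y1..yi scored by log p(x|y) + log p(y)."""
    beams = [(0.0, [])]
    for i, options in enumerate(cands):
        expanded = []
        for score, prefix in beams:
            for y in options:
                s = score + channel_logp(i, y) + doclm_logp(prefix, y)
                expanded.append((s, prefix + [y]))
        beams = nlargest(k, expanded, key=lambda b: b[0])
    return beams[0][1]   # best document translation
```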


Model Combination



Experimental Results

[Table: LDC zh-en results.]

Other tasks

Text Classification

Given a label, generate the input text (Ding and Gimpel, 2019).

  • A latent variable captures topics.
  • The generative model, i.e., the channel model, is based on Yogatama et al. (2017).


Experimental Results


Question Answering

Find an answer (a) to a question (q) given a context (c), e.g., a document or an image (Lewis and Fan, 2019).

  • p(a | c): prior for all possible answers.
  • p(q | a, c): conditional language model for questions.
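In noisy channel form, the answer is found by Bayes' rule:

\[
\hat{a} = \operatorname*{argmax}_{a} p(a \mid q, c)
        = \operatorname*{argmax}_{a} p(q \mid a, c)\, p(a \mid c).
\]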


Experimental Results


Robust results on adversarial SQuAD (Jia and Liang, 2017).

Good results on multi-paragraph inputs, even though the model was not trained with such contexts.


Noisy channel modeling is explainable


Interpret which question words are explained by the answer.


Dialogue

Given a context (C), predict the state (B), dialogue act (A), and response (R) (Liu et al., 2021).

  • p(C, B): prior over context and state.
  • p(A, R | C, B): conditional language model.


Experimental Results: MultiWOZ


Grammatical Error Correction

A straightforward application of noisy channel modeling (Flachs et al., 2019).

  • A simple dictionary is used for p(x | c).
  • BERT (Devlin et al., 2019) or GPT-2 (Radford et al., 2019) as the prior.

Sources for the confusion dictionary:

  • Wikipedia edit history
  • Spell checker
  • Number agreement
  • Verb forms


Experimental Results

[Table: BEA 2019 Shared Task results.]


Zero/few-shot text classification

Given a verbalized classification label, predict the input text (Min et al., 2021).

  • A noisy channel variant of Brown et al. (2020).
  • Employ a uniform distribution as the label prior.

[Scoring variants: Zero-shot, Concat, and Ensemble (formulas omitted).]
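A minimal sketch of channel-style zero-shot scoring, assuming a hypothetical LM scoring function lm_logprob; each label is scored by the probability of the input given its verbalized label, with a uniform label prior.

```python
def channel_classify(text, verbalizers, lm_logprob):
    """verbalizers: dict mapping label -> verbalized string,
    e.g. {"positive": "It was great.", "negative": "It was terrible."}.
    lm_logprob(prefix, continuation): log p(continuation | prefix) under an LM."""
    return max(verbalizers,
               key=lambda lbl: lm_logprob(verbalizers[lbl], text))
```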


Experimental Results


Noisy channel with NN

Other tasks

  • Your task/application!


Back to the noisy channel

"The crude force of computers is not science." (COLING review of Brown et al., 1988)

[Slide overlay: "deep learning"]

YANS 2021: Back to the noisy channel

67 of 68

Recipes for the noisy channel

  • Think in the inverse direction.
    • A SOTA direct model in the inverse direction is usually sufficient, without deeper structure.
  • Design sub-components for efficient training and decoding.
    • This is not trivial, and many open questions remain.
  • Complex decoding: use a direct model to filter the search space.
    • Surprisingly, many tricks have already been investigated in the past.
  • Extra model combination and tuning are needed.
    • Usually more stable than data augmentation/fine-tuning of a direct model.


Questions?
