1 of 71

CS458 Natural Language Processing

Lecture 18

Machine Translation

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 71

Introduction

3 of 71

Machine Translation: Application

Information Access

We might want to translate some instructions on the web, perhaps the recipe for a favorite dish, or the steps for putting together some furniture. Or we might want to read an article in a newspaper, or get information from an online resource like Wikipedia or a government webpage in some other language.

4 of 71

Machine Translation: Application

Reduce Digital Divide

The digital divide is the fact that much more information is available in English and other languages spoken in wealthy countries. Web searches in English return much more information than searches in other languages, and online resources like Wikipedia are much larger in English and other higher-resourced languages. High-quality translation can help provide information to speakers of lower-resourced languages.

5 of 71

Machine Translation: Application

Aid Human Translators

Another common use of machine translation is to aid human translators. MT systems are routinely used to produce a draft translation that is fixed up in a post-editing phase by a human translator. This task is often called computer-aided translation or CAT. CAT is commonly used as part of localization: the task of adapting content or a product to a particular language community.

6 of 71

Machine Translation: Application

Human Communication Needs

Finally, a more recent application of MT is to in-the-moment human communication needs. This includes incremental translation, translating speech on-the-fly before the entire sentence is complete, as is commonly used in simultaneous interpretation. Image-centric translation can be used for example to use OCR of the text on a phone camera image as input to an MT system to translate menus or street signs.

7 of 71

Machine Translation Algorithm

The standard algorithm for MT is the encoder-decoder network.

Recall that encoder-decoder or sequence-to-sequence models are used for tasks in which we need to map an input sequence to an output sequence that is a complex function of the entire input sequence.

8 of 71

Machine Translation Algorithm

9 of 71

Machine Translation: Challenges

The words of the target language don’t necessarily agree with the words of the source language in number or order.

In English, the verb write is in the middle of the sentence, while in Japanese, the verb kaita comes at the end.

Japanese sentence doesn’t require pronoun he, while English does.

10 of 71

Machine Translation: Challenges

The ordering differs in major ways: The Chinese order of the noun phrase is “peaceful using outer space conference of suggestions” while the English has “suggestions of the … conference on peaceful use of outer space”

The order differs in minor ways (the date is ordered differently)

11 of 71

Machine Translation: Challenges

English requires the in many places that Chinese doesn’t, and adds some details (like “in which” and “it”) that aren’t necessary in Chinese.

12 of 71

Machine Translation: Challenges

Chinese doesn’t grammatically mark plurality on nouns (unlike English, which has the “-s” in “recommendations”), and so the Chinese must use the modifier 各项/various to make it clear that there is not just one recommendation.

13 of 71

Language Divergences and Typology

14 of 71

Language Divergences and Typology

There are about 7,000 languages in the world.

Some aspects of human language seem to be universal, holding true for every one of these languages, or are statistical universals, holding true for most of these languages.

Yet languages also differ in many ways.

15 of 71

Language Divergences and Typology

Understanding what causes such translation divergences (Dorr, 1994) can help us build better MT models.

We often distinguish the idiosyncratic and lexical differences that must be dealt with one by one (the word for “dog” differs wildly from language to language), from systematic differences that we can model in a general way (many languages put the verb before the grammatical object; others put the verb after the grammatical object). The study of these systematic cross-linguistic similarities and differences is called linguistic typology.

16 of 71

Word Order Typology

German, French, English, and Mandarin, for example, are all SVO (Subject-Verb-Object) languages, meaning that the verb tends to come between the subject and object. Hindi and Japanese, by contrast, are SOV languages, meaning that the verb tends to come at the end of basic clauses, and Irish and Arabic are VSO languages. Two languages that share their basic word order type often have other similarities. For example, VO languages generally have prepositions, whereas OV languages generally have postpositions.

17 of 71

Word Order Typology

18 of 71

Word Order Typology

19 of 71

Lexical Divergences

The ways that languages differ in lexically dividing up conceptual space may be more complex than a simple one-to-many translation problem, leading to many-to-many mappings.

20 of 71

Lexical Divergences: Lexical Gap

Further, one language may have a lexical gap, where no word or phrase, short of an explanatory footnote, can express the exact meaning of a word in the other language.

For example, English does not have a word that corresponds neatly to Mandarin xiào or Japanese oyakōkō (in English one has to make do with awkward phrases like filial piety or loving child, or good son/daughter, for both).

21 of 71

Lexical Divergences: Lexical Gap

Languages differ systematically in how the conceptual properties of an event are mapped onto specific words. Talmy (1985, 1991) noted that languages can be characterized by whether direction of motion and manner of motion are marked on the verb or on the “satellites”: particles, prepositional phrases, or adverbial phrases.

English: The bottle floated out.

Spanish: La botella salió flotando.

The bottle exited floating.

22 of 71

Lexical Divergences: Lexical Gap

Verb-framed languages mark the direction of motion on the verb (leaving the satellites to mark the manner of motion), like Spanish acercarse ‘approach’, alcanzar ‘reach’, entrar ‘enter’, salir ‘exit’.

Satellite-framed languages mark the direction of motion on the satellite (leaving the verb to mark the manner of motion), like English crawl out, float off, jump down, run after.

23 of 71

Morphological Typology

Morphologically, languages are often characterized along two dimensions of variation.

The first is the number of morphemes per word, ranging from isolating languages like Vietnamese and Cantonese, in which each word generally has one morpheme, to polysynthetic languages like Siberian Yupik (“Eskimo”), in which a single word may have very many morphemes, corresponding to a whole sentence in English.

24 of 71

Morphological Typology

The second dimension is the degree to which morphemes are segmentable, ranging from agglutinative languages like Turkish, in which morphemes have relatively clean boundaries, to fusion languages like Russian, in which a single affix may conflate multiple morphemes, like -om in the word stolom (table-SG-INSTR-DECL1), which fuses the distinct morphological categories instrumental, singular, and first declension.

25 of 71

Referential Density

Languages that can omit pronouns are called pro-drop languages. Even among the pro-drop languages, there are marked differences in frequencies of omission. Japanese and Chinese, for example, tend to omit far more than does Spanish.

This dimension of variation across languages is called the dimension of referential density.

26 of 71

Referential density

[El jefe]_i dio con un libro. ∅_i Mostró su hallazgo a un descifrador ambulante.

[The boss]_i came upon a book. [He]_i showed his find to a wandering decoder.

We say that languages that tend to use more pronouns are more referentially dense than those that use more zeros. Referentially sparse languages, like Chinese or Japanese, that require the hearer to do more inferential work to recover antecedents are also called cold languages. Languages that are more explicit and make it easier for the hearer are called hot languages.

27 of 71

Referential density

Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.

28 of 71

Machine Translation using Encoder-Decoder

29 of 71

Encoder-Decoder

The standard architecture for MT is the encoder-decoder transformer or sequence-to-sequence model.

Most machine translation tasks make the simplification that we can translate each sentence independently, so we’ll just consider individual sentences.

30 of 71

Encoder-Decoder

Given a sentence in a source language, the MT task is then to generate a corresponding sentence in a target language.

For example, an MT system is given an English sentence like

The green witch arrived

and must translate it into the Spanish sentence:

Llegó la bruja verde

31 of 71

Encoder-Decoder

MT uses supervised machine learning: at training time the system is given a large set of parallel sentences (each sentence in a source language matched with a sentence in the target language), and learns to map source sentences into target sentences.

In practice, rather than using words (as in the example above), we split the sentences into a sequence of subword tokens (tokens can be words, or subwords, or individual characters).

32 of 71

Encoder-Decoder

The systems are then trained to maximize the probability of the sequence of tokens in the target language y1,..., ym given the sequence of tokens in the source language x1,..., xn:

P(y1,..., ym|x1,..., xn)
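In practice, this probability is factored with the chain rule into per-token predictions, each conditioned on the source tokens and the previously generated target tokens:

\[
P(y_1,\ldots,y_m \mid x_1,\ldots,x_n) \;=\; \prod_{t=1}^{m} P(y_t \mid x_1,\ldots,x_n,\, y_1,\ldots,y_{t-1})
\]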

33 of 71

Encoder-Decoder

Rather than use the input tokens directly, the encoder-decoder architecture consists of two components, an encoder and a decoder. The encoder takes the input words x = [x1,..., xn] and produces an intermediate context h. At decoding time, the system takes h and, word by word, generates the output y:

h = encoder(x)

y_{t+1} = decoder(h, y_1, ..., y_t)   ∀ t ∈ [1, ..., m]
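As a concrete illustration, here is a minimal sketch of the generation loop these equations describe. The names encoder() and decoder() are hypothetical stand-ins for trained model components, with decoder() assumed to return a distribution over the target vocabulary; greedy argmax decoding is used here only for simplicity (better decoding strategies are discussed later).

```python
# Minimal sketch of the encoder-decoder generation loop.
# encoder() and decoder() are hypothetical stand-ins for trained components;
# decoder(h, y) is assumed to return a dict mapping each vocabulary token
# to P(y_{t+1} | x, y_1..y_t).

def translate_greedy(x_tokens, encoder, decoder, BOS="<s>", EOS="</s>", max_len=100):
    h = encoder(x_tokens)                    # h = encoder(x)
    y = [BOS]
    while y[-1] != EOS and len(y) <= max_len:
        probs = decoder(h, y)                # distribution over the next token
        y.append(max(probs, key=probs.get))  # greedy: take the argmax token
    return y[1:]                             # drop the start-of-sentence token
```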

34 of 71

Tokenization

Subword tokenization algorithms, like the BPE algorithm

BPE starts from individual characters (bytes), counts the frequency of every adjacent pair of tokens in the corpus, and merges the pair with the highest frequency. The process is greedy: at each step it simply takes the single most frequent pair.

As a result, there can be more than one way to encode a particular word, and the algorithm has no principled way to decide which subword tokens to prioritize. The same input can therefore be represented by different encodings, which can hurt the accuracy of the learned representations.
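To make the greedy merging concrete, here is a minimal sketch of BPE merge learning over a toy corpus (not a production tokenizer; real systems operate on large corpora with thousands of merges):

```python
from collections import Counter

# Minimal BPE sketch: repeatedly merge the most frequent adjacent pair of symbols.
def learn_bpe(corpus_words, num_merges):
    # represent each word as a tuple of symbols plus an end-of-word marker
    vocab = Counter(tuple(w) + ("</w>",) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # greedy: most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1]); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# e.g., learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], 10)
```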

35 of 71

Tokenization

Some systems use a variant of BPE called the wordpiece algorithm

Such systems typically use a shared vocabulary for the source and target languages, which makes it easy to copy tokens (like names) from source to target.

36 of 71

Tokenization: Wordpiece Algorithm

Instead of choosing the most frequent pair of tokens to merge, the wordpiece algorithm chooses merges based on which one most increases the language model probability of the tokenization.

Wordpieces use a special symbol at the beginning of each token:

Words:

Jet makers feud over seat width with big orders at stake

Wordpieces:

_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

37 of 71

Tokenization: Wordpiece Algorithm

The wordpiece algorithm is given a training corpus and a desired vocabulary size V, and proceeds as follows:

1. Initialize the wordpiece lexicon with characters (for example a subset of Unicode characters, collapsing all the remaining characters to a special unknown character token).

38 of 71

Tokenization: Wordpiece Algorithm

2. Repeat until there are V wordpieces:

(a) Train an n-gram language model on the training corpus, using the current set of wordpieces.

(b) Consider the set of possible new wordpieces made by concatenating two wordpieces from the current lexicon. Choose the one new wordpiece that most increases the language model probability of the training corpus.
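A simplified sketch of the selection criterion in step 2(b) is shown below. For brevity it uses a unigram language model in place of the n-gram model, and takes the corpus as a flat list of current wordpieces: among candidate merges of two adjacent wordpieces, it picks the one that most increases the corpus log-likelihood.

```python
import math
from collections import Counter

# Simplified sketch of the wordpiece selection criterion (unigram LM stand-in).
def unigram_loglik(corpus_tokens):
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())

def apply_merge(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

def best_wordpiece_merge(corpus_tokens):
    # candidate new wordpieces: concatenations of adjacent existing wordpieces
    candidates = {(a, b) for a, b in zip(corpus_tokens, corpus_tokens[1:])}
    base = unigram_loglik(corpus_tokens)
    return max(candidates,
               key=lambda p: unigram_loglik(apply_merge(corpus_tokens, p)) - base)
```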

39 of 71

Tokenization: Alternatives

Unigram Algorithm

In unigram tokenization, instead of building up a vocabulary by merging tokens, we start with a huge vocabulary of every individual unicode character plus all frequent sequences of characters (including all space-separated words, for languages with spaces), and iteratively remove some tokens to get to a desired final vocabulary size.
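The sketch below illustrates only the segmentation step that unigram tokenization relies on: given a fixed vocabulary of tokens with log-probabilities (assumed here to include all single characters), it finds the most probable segmentation of a word by Viterbi search. The iterative pruning of the vocabulary down to the desired size is omitted.

```python
import math

# Viterbi segmentation under a unigram token model.
# token_logprob: dict mapping each vocabulary token to its log-probability;
# assumed to contain every single character so a segmentation always exists.
def viterbi_segment(word, token_logprob):
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)   # (score, backpointer) per position
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_logprob and best[start][0] > -math.inf:
                score = best[start][0] + token_logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # follow backpointers to recover the best segmentation
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))
```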

40 of 71

Tokenization: Alternatives

SentencePiece Algorithm

41 of 71

Creating the Training Data

Machine translation models are trained on a parallel corpus, sometimes called a bitext, a text that appears in two (or more) languages.

Europarl corpus: contains between 400,000 and 2 million sentences each from 21 European languages.

42 of 71

Creating the Training Data

43 of 71

Creating the Training Data: Sentence Alignment

Given two documents that are translations of each other, we generally need two steps to produce sentence alignments:

• a cost function that takes a span of source sentences and a span of target sentences and returns a score measuring how likely these spans are to be translations.

• an alignment algorithm that takes these scores to find a good alignment between the documents.

44 of 71

Creating the Training Data: Sentence Alignment

To score the similarity of sentences across languages, we need to make use of a multilingual embedding space, in which sentences from different languages are in the same embedding space (Artetxe and Schwenk, 2019).

In this cost function, nSents() gives the number of sentences in a span (this biases the metric toward many alignments of single sentences instead of aligning very large spans).
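For reference, the cost function that the nSents() note refers to has roughly the following form (a hedged reconstruction based on the standard multilingual-embedding formulation; the exact normalization may differ from the slide's figure), where x_s and y_s range over S randomly chosen sentences used for normalization:

\[
c(x,y) \;=\; \frac{\bigl(1-\cos(x,y)\bigr)\,\mathrm{nSents}(x)\,\mathrm{nSents}(y)}{\sum_{s=1}^{S}\bigl(1-\cos(x,y_s)\bigr)\;+\;\sum_{s=1}^{S}\bigl(1-\cos(x_s,y)\bigr)}
\]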

45 of 71

Details of the Encoder-Decoder Model

The encoder-decoder architecture is made up of two transformers:

  • an encoder, which is the same as the basic transformer
  • a decoder, which is augmented with a cross-attention layer

The encoder takes the source language input word tokens X = x1, ..., xn and maps them to an output representation Henc = h1, ..., hn via a stack of encoder blocks.

The decoder is essentially a conditional language model that attends to the encoder representation and generates the target words one by one, at each timestep conditioning on the source sentence and the previously generated target language words to generate a token.
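To make the cross-attention layer concrete, here is a minimal PyTorch sketch of a single decoder block (an illustrative sketch, not the lecture's exact architecture): causal self-attention over the target prefix, followed by cross-attention whose queries come from the decoder but whose keys and values come from the encoder output Henc, followed by a feed-forward layer.

```python
import torch
import torch.nn as nn

# Minimal sketch of one decoder block with cross-attention (illustrative only).
class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, h_enc):
        # causal mask: position t may only attend to positions <= t
        t = y.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=y.device), diagonal=1)
        y = self.norm1(y + self.self_attn(y, y, y, attn_mask=mask)[0])
        # cross-attention: queries from the decoder, keys/values from the encoder output
        y = self.norm2(y + self.cross_attn(y, h_enc, h_enc)[0])
        return self.norm3(y + self.ff(y))
```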

46 of 71

Details of the Encoder-Decoder Model

47 of 71

Details of the Encoder-Decoder Model

48 of 71

Decoding in MT: Beam Search

Greedy decoding algorithm: at each time step t in generation, the output yt is chosen by computing the probability of each word in the vocabulary and then choosing the highest-probability word (the argmax): ŷt = argmax_{w ∈ V} P(w | x, y<t).

The beam search algorithm maintains multiple choices until later when we can see which one is best.

49 of 71

Decoding in MT: Beam Search

In beam search we model decoding as searching the space of possible generations, represented as a search tree whose branches represent actions (generating a token), and nodes represent states (having generated a particular prefix). We search for the best action sequence, i.e., the string with the highest probability.

50 of 71

Decoding in MT: Beam Search

51 of 71

Decoding in MT: Beam Search

Instead, MT systems generally decode using beam search, a heuristic search method first proposed by Lowerre (1976).

In beam search, instead of choosing the best token to generate at each timestep, we keep k possible tokens at each step.

This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
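The following is a minimal beam search sketch under simplifying assumptions: log_prob_next(src, prefix) is a hypothetical stand-in for the decoder's next-token log-probabilities, and the final hypothesis is chosen after length normalization.

```python
import heapq

# Minimal beam search sketch. log_prob_next(src, prefix) is a hypothetical
# stand-in returning a dict of token -> log-probability; EOS marks completion.
def beam_search(src, log_prob_next, k=4, max_len=50, EOS="</s>"):
    beam = [(0.0, ["<s>"])]              # (cumulative log-prob, prefix)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            if prefix[-1] == EOS:        # finished hypothesis: set it aside
                completed.append((score, prefix))
                continue
            for tok, lp in log_prob_next(src, prefix).items():
                candidates.append((score + lp, prefix + [tok]))
        if not candidates:
            break
        # keep only the k highest-scoring hypotheses (the beam width)
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
    completed.extend(beam)               # pool finished and unfinished hypotheses
    # length-normalize scores before picking the best hypothesis
    return max(completed, key=lambda c: c[0] / len(c[1]))
```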

52 of 71

Decoding in MT: Beam Search

53 of 71

Decoding in MT: Beam Search

54 of 71

Minimum Bayes Risk Decoding

Minimum Bayes risk or MBR decoding is an alternative decoding algorithm that can work even better than beam search, and it also tends to outperform other decoding algorithms like temperature sampling.
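A minimal sketch of the MBR idea under simplifying assumptions: draw a set of candidate translations by sampling (sample_translation is a hypothetical stand-in), then return the candidate with the highest average utility against the other candidates. The simple token-overlap F1 used here stands in for a real utility such as chrF.

```python
# Minimal MBR decoding sketch (illustrative utility, hypothetical sampler).
def overlap_f1(hyp, ref):
    hyp_toks, ref_toks = set(hyp.split()), set(ref.split())
    if not hyp_toks or not ref_toks:
        return 0.0
    p = len(hyp_toks & ref_toks) / len(hyp_toks)
    r = len(hyp_toks & ref_toks) / len(ref_toks)
    return 2 * p * r / (p + r) if p + r else 0.0

def mbr_decode(src, sample_translation, n_samples=32, utility=overlap_f1):
    # 1. Draw candidate translations, e.g., by temperature sampling.
    candidates = [sample_translation(src) for _ in range(n_samples)]
    # 2. Pick the candidate with the highest average utility against the others,
    #    i.e., the one with minimum Bayes risk under the sample distribution.
    return max(candidates,
               key=lambda c: sum(utility(c, other) for other in candidates
                                 if other is not c))
```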

55 of 71

Translating in low-resource situations

56 of 71

Data Augmentation

Here we briefly introduce two commonly used approaches for dealing with the data sparsity of low-resource settings: backtranslation, which is a special case of the general statistical technique called data augmentation, and multilingual models. We also discuss some sociotechnical issues.

Data augmentation is a statistical technique for dealing with insufficient training data, by adding new synthetic data that is generated from the current natural data.

57 of 71

Data Augmentation: Backtranslation

The most common data augmentation technique for machine translation is called backtranslation.

Backtranslation relies on the intuition that while parallel corpora may be limited for particular languages or domains, we can often find a large (or at least larger) monolingual corpus, to add to the smaller parallel corpora that are available. The algorithm makes use of monolingual corpora in the target language by creating synthetic bitexts.

58 of 71

Data Augmentation: Backtranslation

In backtranslation, our goal is to improve source-to-target MT, given a small parallel text (a bitext) in the source/target languages and some monolingual data in the target language. We first use the bitext to train an MT system in the reverse direction: a target-to-source MT system. We then use it to translate the monolingual target data into the source language. Now we can add this synthetic bitext (natural target sentences, aligned with MT-produced source sentences) to our training data and retrain our source-to-target MT model.
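The recipe can be summarized in a short sketch; train_mt() and translate() are hypothetical stand-ins for whatever MT toolkit is used, and upsample controls how many copies of the natural bitext are included.

```python
# Minimal backtranslation sketch. train_mt() and translate() are hypothetical
# helpers standing in for an actual MT toolkit's training and inference calls.
def backtranslation(bitext, target_monolingual, upsample=1):
    """bitext: list of (src, tgt) pairs; target_monolingual: list of tgt sentences."""
    # 1. Train a reverse (target-to-source) model on the small bitext.
    reverse_mt = train_mt([(tgt, src) for src, tgt in bitext])

    # 2. Translate the monolingual target data back into the source language
    #    (greedy decoding, beam search, or sampling are all possible choices here).
    synthetic_bitext = [(translate(reverse_mt, tgt), tgt) for tgt in target_monolingual]

    # 3. Retrain the forward model on natural + synthetic data,
    #    optionally upsampling the natural bitext.
    training_data = bitext * upsample + synthetic_bitext
    return train_mt(training_data)
```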

59 of 71

Data Augmentation: Backtranslation

Backtranslation has various parameters.

One is how we generate the backtranslated data; we can run the decoder in greedy inference, or use beam search. Or we can do sampling, like the temperature sampling algorithm we saw in Chapter 9.

Another parameter is the ratio of backtranslated data to natural bitext data; we can choose to upsample the bitext data (include multiple copies of each sentence).

60 of 71

Multilingual models

In a multilingual translator, we train the system by giving it parallel sentences in many different pairs of languages.

One advantage of multilingual models is that they can improve the translation of lower-resourced languages by drawing on information from a similar language in the training data that happens to have more resources.

61 of 71

Sociotechnical issues

One problem is that for low-resource languages, especially from low-income countries, native speakers are often not involved as the curators for content selection, as the language technologists, or as the evaluators who measure performance.

62 of 71

MT Evaluation

63 of 71

MT Evaluation

Translations are evaluated along two dimensions:

1. adequacy: how well the translation captures the exact meaning of the source sentence. Sometimes called faithfulness or fidelity.

2. fluency: how fluent the translation is in the target language (is it grammatical, clear, readable, natural).

Using humans to evaluate is most accurate, but automatic metrics are also used for convenience.

64 of 71

Using Human Raters to Evaluate MT

The most accurate evaluations use human raters, such as online crowdworkers, to evaluate each translation along the two dimensions.

For example, along the dimension of fluency, we can ask how intelligible, how clear, how readable, or how natural the MT output (the target text) is.

We can do the same thing to judge the second dimension, adequacy, using raters to assign scores on a scale.

65 of 71

Automatic Evaluation

chrP: percentage of character 1-grams, 2-grams, ..., k-grams in the hypothesis that occur in the reference, averaged.

chrR: percentage of character 1-grams, 2-grams, ..., k-grams in the reference that occur in the hypothesis, averaged.
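These two are combined into an F-score; in the usual chrF formulation, a parameter β weights recall β times as much as precision (β = 2 is a common choice):

\[
\mathrm{chrF}_{\beta} \;=\; (1+\beta^{2})\,\frac{\mathrm{chrP}\cdot\mathrm{chrR}}{\beta^{2}\,\mathrm{chrP}+\mathrm{chrR}}
\]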

66 of 71

Automatic Evaluation

There are various alternative overlap metrics. For example, before the development of chrF, it was common to use a word-based overlap metric called BLEU (for BiLingual Evaluation Understudy), that is purely precision-based rather than combining precision and recall.

67 of 71

Automatic Evaluation: Embedding-Based Methods

The chrF metric is based on measuring the exact character n-grams a human reference and candidate machine translation have in common. However, this criterion is overly strict, since a good translation may use alternate words or paraphrases. A solution first pioneered in early metrics like METEOR (Banerjee and Lavie, 2005) was to allow synonyms to match between the reference x and candidate x̃.
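The sketch below illustrates the general idea behind embedding-based matching (not METEOR itself, which uses synonym dictionaries and stemming): each candidate token is aligned to its most similar reference token by cosine similarity of embeddings, and the similarities are averaged into a soft precision. Here embed() is a hypothetical stand-in for a token-embedding lookup.

```python
import numpy as np

# Minimal sketch of embedding-based soft matching between candidate and reference.
# embed() is a hypothetical stand-in mapping a token to a vector.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def soft_precision(candidate_tokens, reference_tokens, embed):
    ref_vecs = [embed(t) for t in reference_tokens]
    # align each candidate token to its most similar reference token
    sims = [max(cosine(embed(c), r) for r in ref_vecs) for c in candidate_tokens]
    return sum(sims) / len(sims)
```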

68 of 71

Automatic Evaluation: Embedding-Based Methods

69 of 71

Bias and Ethical Issues

70 of 71

Bias and Ethical Issues

When translating a reference to a person described without specified gender, MT systems often default to male gender (Schiebinger 2014, Prates et al. 2019).

One open problem is developing metrics for knowing what our systems don’t know. This matters because MT systems can be used in urgent situations where human translators may be unavailable.

71 of 71

Thank You