CS458 Natural Language Processing
Lecture 18
Machine Translation
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
Introduction
Machine Translation: Application
Information Access
We might want to translate some instructions on the web, perhaps the recipe for a favorite dish, or the steps for putting together some furniture. Or we might want to read an article in a newspaper, or get information from an online resource like Wikipedia or a government webpage in some other language.
Machine Translation: Application
Reduce Digital Divide
The digital divide refers to the fact that much more information is available in English and other languages spoken in wealthy countries. Web searches in English return much more information than searches in other languages, and online resources like Wikipedia are much larger in English and other higher-resourced languages. High-quality translation can help provide information to speakers of lower-resourced languages.
Machine Translation: Application
Aid Human Translators
Another common use of machine translation is to aid human translators. MT systems are routinely used to produce a draft translation that is fixed up in a post-editing phase by a human translator. This task is often called computer-aided translation or CAT. CAT is commonly used as part of localization: the task of adapting content or a product to a particular language community.
Machine Translation: Application
Human Communication Needs
Finally, a more recent application of MT is to in-the-moment human communication needs. This includes incremental translation, translating speech on the fly before the entire sentence is complete, as is commonly used in simultaneous interpretation. Image-centric translation uses, for example, OCR of the text in a phone camera image as input to an MT system, to translate menus or street signs.
Machine Translation Algorithm
The standard algorithm for MT is the encoder-decoder network.
Recall that encoder-decoder or sequence-to-sequence models are used for tasks in which we need to map an input sequence to an output sequence that is a complex function of the entire input sequence.
Machine Translation: Challenges
The words of the target language don’t necessarily agree with the words of the source language in number or order
In English, the verb write is in the middle of the sentence, while in Japanese, the verb kaita comes at the end.
The Japanese sentence doesn’t require the pronoun he, while English does.
Machine Translation: Challenges
The ordering differs in major ways: The Chinese order of the noun phrase is “peaceful using outer space conference of suggestions” while the English has “suggestions of the … conference on peaceful use of outer space”
The order differs in minor ways (the date is ordered differently)
Machine Translation: Challenges
English requires the in many places that Chinese doesn’t, and adds some details (like “in which” and “it”) that aren’t necessary in Chinese.
Machine Translation: Challenges
Chinese doesn’t grammatically mark plurality on nouns (unlike English, which has the “-s” in “recommendations”), and so the Chinese must use the modifier 各项/various to make it clear that there is not just one recommendation.
Language Divergences and Typology
There are about 7,000 languages in the world.
Some aspects of human language seem to be universal, holding true for every one of these languages, or are statistical universals, holding true for most of these languages.
Yet languages also differ in many ways.
Language Divergences and Typology
Understanding what causes such translation divergences (Dorr, 1994) can help us build better MT models.
We often distinguish the idiosyncratic and lexical differences that must be dealt with one by one (the word for “dog” differs wildly from language to language), from systematic differences that we can model in a general way (many languages put the verb before the grammatical object; others put the verb after the grammatical object). The study of these systematic cross-linguistic similarities and differences is called linguistic typology.
Word Order Typology
German, French, English, and Mandarin, for example, are all SVO (Subject-Verb-Object) languages, meaning that the verb tends to come between the subject and object. Hindi and Japanese, by contrast, are SOV languages, meaning that the verb tends to come at the end of basic clauses, and Irish and Arabic are VSO languages. Two languages that share their basic word order type often have other similarities. For example, VO languages generally have prepositions, whereas OV languages generally have postpositions.
Lexical Divergences
The way that languages differ in lexically dividing up conceptual space may be more complex than a simple one-to-many translation problem (one source word mapping to several possible target words), leading to many-to-many mappings.
Lexical Divergences: Lexical Gap
Further, one language may have a lexical gap, where no word or phrase, short of an explanatory footnote, can express the exact meaning of a word in the other language.
For example, English does not have a word that corresponds neatly to Mandarin xiào or Japanese oyakōkō (in English one has to make do with awkward phrases like filial piety or loving child, or good son/daughter for both).
Lexical Divergences: Verb-Framed vs. Satellite-Framed Languages
Languages differ systematically in how the conceptual properties of an event are mapped onto specific words. Talmy (1985, 1991) noted that languages can be characterized by whether direction of motion and manner of motion are marked on the verb or on the “satellites”: particles, prepositional phrases, or adverbial phrases.
English: The bottle floated out.
Spanish: La botella salió flotando.
The bottle exited floating.
Lexical Divergences: Verb-Framed vs. Satellite-Framed Languages
Verb-framed languages mark the direction of motion on the verb (leaving the satellites to mark the manner of motion), like Spanish acercarse ‘approach’, alcanzar ‘reach’, entrar ‘enter’, salir ‘exit’.
Satellite-framed languages mark the direction of motion on the satellite (leaving the verb to mark the manner of motion), like English crawl out, float off, jump down, run after.
Morphological Typology
Morphologically, languages are often characterized along two dimensions of variation.
The first is the number of morphemes per word, ranging from isolating languages like Vietnamese and Cantonese, in which each word generally has one morpheme, to polysynthetic languages like Siberian Yupik (“Eskimo”), in which a single word may have very many morphemes, corresponding to a whole sentence in English.
Morphological Typology
The second dimension is the degree to which morphemes are segmentable, ranging from agglutinative languages like Turkish, in which morphemes have relatively clean boundaries, to fusion languages like Russian, in which a single affix may conflate multiple morphemes, like -om in the word stolom (table-SG-INSTR-DECL1), which fuses the distinct morphological categories instrumental, singular, and first declension.
Referential Density
Languages that can omit pronouns are called pro-drop languages. Even among the pro-drop languages, there are marked differences in frequencies of omission. Japanese and Chinese, for example, tend to omit far more than does Spanish.
This dimension of variation across languages is called the dimension of referential density.
Referential density
[El jefe]_i dio con un libro. ∅_i Mostró su hallazgo a un descifrador ambulante.
[The boss]_i came upon a book. [He]_i showed his find to a wandering decoder.
We say that languages that tend to use more pronouns are more referentially dense than those that use more zeros. Referentially sparse languages, like Chinese or Japanese, that require the hearer to do more inferential work to recover antecedents are also called cold languages. Languages that are more explicit and make it easier for the hearer are called hot languages.
Referential density
Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.
Machine Translation using Encoder-Decoder
Encoder-Decoder
The standard architecture for MT is the encoder-decoder transformer or sequence-to-sequence model.
Most machine translation tasks make the simplification that we can translate each sentence independently, so we’ll just consider individual sentences.
Encoder-Decoder
Given a sentence in a source language, the MT task is then to generate a corresponding sentence in a target language.
For example, an MT system is given an English sentence like
The green witch arrived
and must translate it into the Spanish sentence:
Llegó la bruja verde
Encoder-Decoder
MT uses supervised machine learning: at training time the system is given a large set of parallel sentences (each sentence in a source language matched with a sentence in the target language), and learns to map source sentences into target sentences.
In practice, rather than using words (as in the example above), we split the sentences into a sequence of subword tokens (tokens can be words, or subwords, or individual characters).
Encoder-Decoder
The systems are then trained to maximize the probability of the sequence of tokens in the target language y1,..., ym given the sequence of tokens in the source language x1,..., xn:
P(y1,..., ym|x1,..., xn)
Encoder-Decoder
Rather than use the input tokens directly, the encoder-decoder architecture consists of two components, an encoder and a decoder. The encoder takes the input words x = [x1,..., xn] and produces an intermediate context h. At decoding time, the system takes h and, word by word, generates the output y:
h = encoder(x)
yt+1 = decoder(h, y1,..., yt) ∀t ∈ [1,...,m]
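As a concrete illustration of these two equations, here is a minimal sketch using PyTorch's built-in Transformer. The vocabulary size, layer counts, and random token ids are illustrative assumptions, and positional encodings and padding masks are omitted for brevity; this is a sketch, not a full MT model.

```python
# Minimal sketch of h = encoder(x) and y_{t+1} = decoder(h, y_1..y_t)
# using nn.Transformer (requires a PyTorch version with batch_first).
import torch
import torch.nn as nn

vocab_size, d_model = 8000, 512              # assumed shared subword vocabulary
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
out_proj = nn.Linear(d_model, vocab_size)    # maps decoder states to token logits

src = torch.randint(0, vocab_size, (1, 7))   # source token ids x_1..x_n
tgt = torch.randint(0, vocab_size, (1, 4))   # target prefix y_1..y_t

h = model.encoder(embed(src))                # h = encoder(x)
causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
dec = model.decoder(embed(tgt), h,           # decoder attends to h and the prefix
                    tgt_mask=causal_mask)
logits = out_proj(dec)                       # distribution over the next token
                                             # at each prefix position
```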
Tokenization
Subword tokenization algorithms, like the BPE (byte pair encoding) algorithm, split words into smaller subword tokens.
BPE counts the frequency of each adjacent pair of tokens (initially bytes or characters) in the corpus and merges the pair with the highest frequency. The process is greedy: at each step it merges the single most frequent pair.
There can be more than one way to encode a particular word with the resulting subwords, and the algorithm has no principled way to prioritize which merge to apply first. Hence, the same input can be represented by different encodings, which impacts the accuracy of the learned representations.
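The following toy sketch illustrates one greedy merge step of this kind; the tiny corpus and the end-of-word marker </w> are illustrative assumptions, not the full BPE training algorithm.

```python
# One greedy BPE training step: count adjacent symbol pairs over a tiny
# corpus and merge the most frequent pair everywhere it occurs.
from collections import Counter

corpus = [["l", "o", "w", "</w>"], ["l", "o", "w", "e", "r", "</w>"],
          ["n", "e", "w", "e", "s", "t", "</w>"]]

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))          # count adjacent symbol pairs
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])  # replace the pair with one new symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)            # the most frequent adjacent pair
corpus = merge(corpus, pair)                 # corpus after one merge step
```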
Tokenization
Some systems use a variant of BPE called the wordpiece algorithm.
The tokenizer is trained on a corpus containing both source and target language text, producing a shared vocabulary for the two languages, which makes it easy to copy tokens (like names) from source to target.
Tokenization: Wordpiece Algorithm
Instead of choosing the most frequent pair of tokens to merge, the wordpiece algorithm chooses merges based on which one most increases the language model probability of the tokenization.
Wordpieces use a special symbol at the beginning of each token:
words:
Jet makers feud over seat width with big orders at stake
Wordpieces:
_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
Tokenization: Wordpiece Algorithm
The wordpiece algorithm is given a training corpus and a desired vocabulary size V, and proceeds as follows:
1. Initialize the wordpiece lexicon with characters (for example a subset of Unicode characters, collapsing all the remaining characters to a special unknown character token).
Tokenization: Wordpiece Algorithm
2. Repeat until there are V wordpieces:
(a) Train an n-gram language model on the training corpus, using the current set of wordpieces.
(b) Consider the set of possible new wordpieces made by concatenating two wordpieces from the current lexicon. Choose the one new wordpiece that most increases the language model probability of the training corpus.
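A sketch of this selection loop is given below. The function corpus_log_likelihood() is a placeholder assumption standing in for step (a), training an n-gram language model over the current wordpieces and scoring the training corpus; real implementations also prune the candidate set rather than scoring every possible concatenation.

```python
# Brute-force sketch of wordpiece vocabulary growth (step 2 above).
def grow_wordpiece_vocab(vocab, corpus_log_likelihood, V):
    """vocab: set of current wordpieces; V: desired vocabulary size."""
    while len(vocab) < V:
        best_piece, best_score = None, float("-inf")
        # Candidate new wordpieces: concatenations of two current pieces.
        for a in vocab:
            for b in vocab:
                candidate = a + b
                score = corpus_log_likelihood(vocab | {candidate})
                if score > best_score:
                    best_piece, best_score = candidate, score
        vocab = vocab | {best_piece}   # keep the merge that helps the LM most
    return vocab
```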
Tokenization: Alternatives
Unigram Algorithm
In unigram tokenization, instead of building up a vocabulary by merging tokens, we start with a huge vocabulary of every individual unicode character plus all frequent sequences of characters (including all space-separated words, for languages with spaces), and iteratively remove some tokens to get to a desired final vocabulary size.
Tokenization: Alternatives
SentencePiece Algorithm
Creating the Training Data
Machine translation models are trained on a parallel corpus, sometimes called a bitext, a text that appears in two (or more) languages.
Europarl corpus: contains between 400,000 and 2 million sentences each from 21 European languages.
Creating the Training Data
Creating the Training Data: Sentence Alignment
Given two documents that are translations of each other, we generally need two steps to produce sentence alignments:
• a cost function that takes a span of source sentences and a span of target sentences and returns a score measuring how likely these spans are to be translations.
• an alignment algorithm that takes these scores to find a good alignment between the documents.
Creating the Training Data: Sentence Alignment
To score the similarity of sentences across languages, we make use of multilingual sentence embeddings, in which sentences from different languages live in the same embedding space (Artetxe and Schwenk, 2019).
In the cost function, nSents() gives the number of sentences in a span; this factor biases the metric toward many alignments of single sentences instead of a few alignments of very large spans.
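The sketch below shows one form such a cost function might take, assuming an embed() function that returns a multilingual embedding for a span of sentences; the function name and the exact normalization are assumptions, and real alignment systems use a more elaborate normalization term.

```python
# Hedged sketch of a span-scoring cost for sentence alignment.
# Lower cost means the two spans are more likely to be translations.
import numpy as np

def cost(src_span, tgt_span, embed):
    x = embed(" ".join(src_span))            # pooled embedding of the source span
    y = embed(" ".join(tgt_span))            # pooled embedding of the target span
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # 1 - cos measures dissimilarity; multiplying by the span sizes (nSents)
    # biases the search toward many 1-1 alignments rather than a few huge spans.
    return (1 - cos) * len(src_span) * len(tgt_span)
```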
Details of the Encoder-Decoder Model
The encoder-decoder architecture is made up of two transformers:
The encoder takes the source language input word tokens X = x1,..., xn and maps them to an output representation Henc = h1,..., hn via a stack of encoder blocks.
The decoder is essentially a conditional language model that attends to the encoder representation and generates the target words one by one, at each timestep conditioning on the source sentence and the previously generated target language words to generate a token.
Decoding in MT: Beam Search
Greedy decoding algorithm: at each time step t in generation, the output yt is chosen by computing the probability for each word in the vocabulary and then choosing the highest-probability word (the argmax): yt = argmax w∈V P(w | x, y1,..., yt−1).
The beam search algorithm maintains multiple choices until later when we can see which one is best.
Decoding in MT: Beam Search
In beam search we model decoding as searching the space of possible generations, represented as a search tree whose branches represent actions (generating a token), and nodes represent states (having generated a particular prefix). We search for the best action sequence, i.e., the string with the highest probability.
Decoding in MT: Beam Search
Greedy decoding is not optimal: the token that looks best at one step may lead to a lower-probability sequence overall. Instead, MT systems generally decode using beam search, a heuristic search method first proposed by Lowerre (1976).
In beam search, instead of choosing the best token to generate at each timestep, we keep k possible tokens at each step.
This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
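A minimal sketch of this procedure follows. The function next_log_probs(prefix) is an assumed stand-in for the decoder: it returns log P(token | x, prefix) for every vocabulary token, and the length normalization at the end is one common way to avoid favoring short hypotheses.

```python
# Beam search sketch: keep the k best prefixes at each step.
def beam_search(next_log_probs, k=4, max_len=20, eos="</s>"):
    beams = [([], 0.0)]                          # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:     # keep only the k best
            if prefix[-1] == eos:
                finished.append((prefix, score))  # completed hypothesis
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)                        # include unfinished prefixes
    # Length-normalize so shorter hypotheses are not unfairly favored.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```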
Minimum Bayes Risk Decoding
Minimum Bayes risk or MBR decoding is an alternative decoding algorithm that can work even better than beam search and also tends to be better than the other decoding algorithms like temperature sampling.
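The core idea of MBR is to generate a set of candidate translations and pick the one with the highest average similarity (lowest expected risk) to all the other candidates under a utility metric such as chrF. A hedged sketch, where sample_translation and similarity are assumed stand-ins for the MT sampler and the metric:

```python
# Minimum Bayes risk decoding sketch over a sampled candidate set.
def mbr_decode(sample_translation, similarity, n_samples=32):
    candidates = [sample_translation() for _ in range(n_samples)]
    def expected_utility(i):
        # Average similarity of candidate i to every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], o) for o in others) / len(others)
    best = max(range(n_samples), key=expected_utility)
    return candidates[best]
```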
Translating in low-resource situations
Data Augmentation
For most language pairs, large parallel training corpora are not available. Here we briefly introduce two commonly used approaches for dealing with this data sparsity: backtranslation, which is a special case of the general statistical technique called data augmentation, and multilingual models; we also discuss some sociotechnical issues.
Data augmentation is a statistical technique for dealing with insufficient training data, by adding new synthetic data that is generated from the current natural data.
Data Augmentation: Backtranslation
The most common data augmentation technique for machine translation is called backtranslation.
Backtranslation relies on the intuition that while parallel corpora may be limited for particular languages or domains, we can often find a large (or at least larger) monolingual corpus, to add to the smaller parallel corpora that are available. The algorithm makes use of monolingual corpora in the target language by creating synthetic bitexts.
Data Augmentation: Backtranslation
In backtranslation, our goal is to improve source-to-target MT, given a small parallel text (a bitext) in the source/target languages, and some monolingual data in the target language. We first use the bitext to train an MT system in the reverse direction: a target-to-source MT system. We then use it to translate the monolingual target data to the source language. Now we can add this synthetic bitext (natural target sentences, aligned with MT-produced source sentences) to our training data, and retrain our source-to-target MT model.
Data Augmentation: Backtranslation
Backtranslation has various parameters.
One is how we generate the backtranslated data; we can run the decoder in greedy inference, or use beam search. Or we can do sampling, like the temperature sampling algorithm we saw in Chapter 9.
Another parameter is the ratio of backtranslated data to natural bitext data; we can choose to upsample the bitext data (include multiple copies of each sentence).
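Putting the recipe together, a pseudocode-style sketch follows; train_mt and translate are hypothetical placeholders for "train an MT model on a bitext" and "translate one sentence", not a real library API.

```python
# Backtranslation sketch: create a synthetic bitext from monolingual
# target-language data, then retrain the forward (source -> target) model.
def backtranslation(bitext_src, bitext_tgt, mono_tgt, train_mt, translate):
    # 1. Train a reverse (target -> source) system on the small bitext.
    reverse_model = train_mt(src=bitext_tgt, tgt=bitext_src)
    # 2. Translate the monolingual target data back into the source language.
    synthetic_src = [translate(reverse_model, t) for t in mono_tgt]
    # 3. Combine the synthetic bitext with the natural one (optionally
    #    upsampling the natural bitext) and train the forward system.
    all_src = bitext_src + synthetic_src
    all_tgt = bitext_tgt + mono_tgt
    return train_mt(src=all_src, tgt=all_tgt)
```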
Multilingual models
In a multilingual translator, we train the system by giving it parallel sentences in many different pairs of languages.
One advantage of multilingual models is that they can improve the translation of lower-resourced languages by drawing on information from a similar language in the training data that happens to have more resources.
Sociotechnical issues
One problem is that for low-resource languages, especially from low-income countries, native speakers are often not involved as the curators for content selection, as the language technologists, or as the evaluators who measure performance.
MT Evaluation
Translations are evaluated along two dimensions:
1. adequacy: how well the translation captures the exact meaning of the source sentence. Sometimes called faithfulness or fidelity.
2. fluency: how fluent the translation is in the target language (is it grammatical, clear, readable, natural).
Using humans to evaluate is most accurate, but automatic metrics are also used for convenience.
Using Human Raters to Evaluate MT
The most accurate evaluations use human raters, such as online crowdworkers, to evaluate each translation along the two dimensions.
For example, along the dimension of fluency, we can ask how intelligible, how clear, how readable, or how natural the MT output (the target text) is.
We can do the same thing to judge the second dimension, adequacy, using raters to assign scores on a scale.
Automatic Evaluation
chrP: percentage of character 1-grams, 2-grams, ..., k-grams in the hypothesis that occur in the reference, averaged.
chrR: percentage of character 1-grams, 2-grams, ..., k-grams in the reference that occur in the hypothesis, averaged.
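The sketch below computes chrP, chrR, and the combined chrF score, an F-score that weights recall β times as much as precision. It is a simplified illustration: it averages over n = 1..k and ignores the whitespace-handling details of real chrF implementations.

```python
# Simplified chrF: character n-gram precision/recall, combined into an F-score.
from collections import Counter

def char_ngrams(s, n):
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrF(hyp, ref, k=4, beta=2.0):
    precs, recs = [], []
    for n in range(1, k + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())                   # matching n-grams
        precs.append(overlap / max(sum(h.values()), 1))   # chrP for this n
        recs.append(overlap / max(sum(r.values()), 1))    # chrR for this n
    chrP, chrR = sum(precs) / k, sum(recs) / k
    if chrP + chrR == 0:
        return 0.0
    # F-score weighting recall beta times as much as precision.
    return (1 + beta**2) * chrP * chrR / (beta**2 * chrP + chrR)
```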
Automatic Evaluation
There are various alternative overlap metrics. For example, before the development of chrF, it was common to use a word-based overlap metric called BLEU (for BiLingual Evaluation Understudy), that is purely precision-based rather than combining precision and recall.
Automatic Evaluation: Embedding-Based Methods
The chrF metric is based on measuring the exact character n-grams a human reference and candidate machine translation have in common. However, this criterion is overly strict, since a good translation may use alternate words or paraphrases. A solution first pioneered in early metrics like METEOR (Banerjee and Lavie, 2005) was to allow synonyms to match between the reference x and candidate x̃.
Bias and Ethical Issues
When translating a reference to a person described without specified gender, MT systems often default to male gender (Schiebinger 2014, Prates et al. 2019).
One open problem is developing metrics for knowing what our systems don’t know. This matters because MT systems can be used in urgent situations where human translators may be unavailable.
Thank You