Machine Translation: Introduction
Slides from Dan Jurafsky
Università di Pisa
Human Language Technologies
Dipartimento di Informatica
Outline
What is MT?
Translating a text from one language (the source language) to another (the target language) automatically
Criticism of Altavista, Umberto Eco. 2007
Babelfish 2004
English Original | Italian Translation |
The Works of Shakespeare | Gli impianti di Shakespeare |
Hartcourt Brace | sostegno di Hartcourt |
Speaker of the chamber of deputies | Altoparlante dell’alloggiamento dei delegati |
Studies in the logic of Charles Sanders Pierce | Studi nella logica delle sabbiatrici Pierce del Charles |
Google 2007
English Original | Italian Translation |
The Works of Shakespeare | Le opere di Shakespeare |
Hartcourt Brace | Hartcourt Brace |
Speaker of the chamber of deputies | Presidente della Camera dei deputati |
Studies in the logic of Charles Sanders Pierce | Studi nella logica di Charles Sanders Peirce |
Google Translate
http://www.cocinadominicana.com/acompanamientos-ensaladas-pastelones/1907-tostones.html
Tostones are green plantain (or male) slices, fried, flattened, and then fried again.
Los tostones son rodajas de plátanos verdes (o machos), fritas, aplanadas y luego fritas nuevamente.
Google Translate
Machine Translation
The Story of the Stone (“The Dream of the Red Chamber”)
Chinese gloss: Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come
Hawkes translation: As she lay there alone, Dai-yu’s thoughts turned to Bao-chai. Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.
Machine Translation
Alignment in Machine Translation
Not just literature
Hansards: Canadian parliamentary proceedings
What is MT already good enough for?
What is MT less good for?
MT Early History
1946 Booth and Weaver discuss MT at Rockefeller Foundation in New York
1947-48 idea of dictionary-based direct translation
1949 Weaver memorandum popularized idea
1952 all 18 MT researchers in world meet at MIT
1954 IBM/Georgetown Demo Russian-English MT
1955-65 lots of labs take up MT
IBM/Georgetown Thinking Machine
Warren Weaver memo
Early Research
History of MT: Pessimism
1959/1960: Bar-Hillel “Report on the state of MT in US and GB”
History of MT: Pessimism
The ALPAC report
History of MT: Revival
1976 Meteo, weather forecasts from English to French
Systran (Babelfish) in use for 50 years
1970’s
European focus in MT; mainly ignored in US
1980’s
ideas of using early AI techniques in MT (KBMT, CMU)
Focus on “interlingua” systems, especially in Japan
1990’s
Commercial MT systems
Statistical MT
Speech-to-speech translation
2000’s
Statistical MT takes off
Google Translate
2015
Neural MT takes off
Language Similarities and Divergences
Morphology
Morpheme
Word = Morpheme+Morpheme+Morpheme+…
Stems: root plus derivational morphemes
Lemma: also called base form, root, lexeme
Affixes
Morphological Variation
Isolating languages vs. polysynthetic languages
Agglutinative languages vs. fusional languages
One word, one phrase
uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft
Danube steam shipping electricity main plant construction subordinate company
Donaudampfschifffahrtsgesellschaftskapitän
Donau+dampf+Schiff+Fahrt+s+gesellschafts+kapitän
Danube steam shipping company captain
Index of synthesis
Slide from Holger Diessel
Scale from isolating to synthetic: Vietnamese, English, Russian, Oneida
Isolating language
Vietnamese
Khi tôi đến nhà bạn, chúng tôi bắt đầu làm bài.
When I come house friend PL I begin do lesson
When I came to my friend’s house, we began to do lessons.
Cantonese
keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan
he say entire country most big building house is this building
Slide from Holger Diessel
Synthetic language
Kirundi
Y-a-bi-gur-i-ye abâna
CL1-PST-CL8.them-buy-APPL-ASP CL2.children
He bought them for the children.
Slide from Holger Diessel
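The index of synthesis can be made concrete as the average number of morphemes per word. The sketch below is a minimal illustration, assuming that morpheme boundaries are already marked with `+` or `-` as in the Turkish and Kirundi examples above, and that every space-separated Vietnamese token counts as one morpheme. The function name `index_of_synthesis` is made up for this example.

```python
import re

def index_of_synthesis(words):
    """Average number of morphemes per word.

    Each word is a string whose morphemes are separated by '+' or '-';
    a word without separators counts as a single morpheme.
    """
    morpheme_count = sum(len(re.split(r"[+-]", w)) for w in words)
    return morpheme_count / len(words)

# Segmentations copied from the slides.
turkish = ["uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına"]
kirundi = ["Y-a-bi-gur-i-ye", "abâna"]  # 'abâna' is left unsegmented, as on the slide
# Assumption: every space-separated Vietnamese token is one morpheme.
vietnamese = "Khi tôi đến nhà bạn chúng tôi bắt đầu làm bài".split()

for name, words in [("Vietnamese", vietnamese),
                    ("Kirundi", kirundi),
                    ("Turkish", turkish)]:
    print(f"{name}: {index_of_synthesis(words):.1f} morphemes per word")
```

Under these assumptions the three samples come out at 1.0, 3.5 and 11.0 morphemes per word, which matches the ordering from isolating to synthetic. The samples are far too small to characterize a whole language.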
Polysynthetic language
Noun-incorporation (cf. fox-hunting, bird-watching)
Mohawk
a. r-ukwe’t-í:yo
he-person-nice
He is a nice person
b. wa-hi-‘sereth-óhare-‘se
PST-he/me-car-wash-for
He car-wash for me (= He washed my car)
c. kvtsyu v-kuwa-nya’t-ó:’ase
fish FUT-they/her-throat-slit
They will throat-slit a fish
Slide from Holger Diessel
Index of fusion
Scale from agglutinative to fusional: Swahili, Russian, Oneida
Slide from Holger Diessel
Agglutinative language
Words are formed by stringing together morphemes without changing them
Turkish
Case | SG | PL |
Nominative | adam | adam-lar |
Accusative | adam-ı | adam-lar-ı |
Genitive | adam-ın | adam-lar-ın |
Dative | adam-a | adam-lar-a |
Locative | adam-da | adam-lar-da |
Ablative | adam-dan | adam-lar-dan |
Slide from Holger Diessel
Fusional language
A single inflectional morpheme denotes multiple grammatical features, e.g. both tense and person
Russian
Case | SG | PL | SG | PL |
Nominative | stol | stol-y | lip-a | lip-y |
Accusative | stol | stol-y | lip-u | lip-y |
Genitive | stol-a | stol-ov | lip-y | lip |
Dative | stol-u | stol-am | lip-e | lip-am |
Instrumental | stol-om | stol-ami | lip-oj | lip-ami |
Prepositional | stol-e | stol-ax | lip-e | lip-ax |
Slide from Holger Diessel
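The contrast between the Turkish and Russian paradigms above can be sketched in code. In the Turkish paradigm each feature has its own suffix, so a form is built by concatenation. In the Russian paradigm one ending encodes case and number together, so it has to be looked up as a pair. This is a minimal sketch covering only adam and stol. The feature labels and function names are illustrative, and Turkish vowel harmony is ignored, so the suffix table is only valid for back-vowel stems like adam.

```python
# Agglutinative (Turkish): one suffix per feature, concatenated in a fixed order.
TR_NUMBER = {"SG": "", "PL": "lar"}
TR_CASE = {
    "NOM": "", "ACC": "ı", "GEN": "ın",   # 'ı' is the Turkish dotless i
    "DAT": "a", "LOC": "da", "ABL": "dan",
}

def turkish_form(stem, case, number):
    """Stem + number suffix + case suffix (no vowel harmony; an assumption)."""
    return stem + TR_NUMBER[number] + TR_CASE[case]

# Fusional (Russian): one ending per (case, number) pair, here for 'stol'.
RU_STOL_ENDINGS = {
    ("NOM", "SG"): "",   ("NOM", "PL"): "y",
    ("ACC", "SG"): "",   ("ACC", "PL"): "y",
    ("GEN", "SG"): "a",  ("GEN", "PL"): "ov",
    ("DAT", "SG"): "u",  ("DAT", "PL"): "am",
    ("INS", "SG"): "om", ("INS", "PL"): "ami",
    ("PRE", "SG"): "e",  ("PRE", "PL"): "ax",
}

def russian_form(stem, case, number, endings=RU_STOL_ENDINGS):
    """Stem + a single ending that encodes case and number jointly."""
    return stem + endings[(case, number)]

print(turkish_form("adam", "ABL", "PL"))   # adamlardan
print(russian_form("stol", "INS", "PL"))   # stolami
```

The Turkish tables hold 2 + 6 entries for 12 forms, while the Russian table needs one entry per cell, and a different table for lipa.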
Word Order
Segmentation Variation
Inferential Load: cold vs. hot langs
Inferential Load (2)
All noun phrases in blue do not appear in the Chinese text …
But they are needed for a good translation
Lexical Divergences
Lexical Divergences: Specificity
Grammatical constraints
Semantic constraints
Lexical Divergence: many-to-many
Lexical Divergence: lexical gaps
Event-to-argument divergences
Structural divergences
Head Swapping
Thematic divergence
Divergence counts from Bonnie Dorr
Divergence | Spanish | English | % |
Categorial | X tener hambre | Y have hunger | 98% |
Conflational | X dar puñaladas a Z | X stab Z | 83% |
Structural | X entrar en Y | X enter Y | 35% |
Head Swapping | X cruzar Y nadando | X swim across Y | 8% |
Thematic | X gustar a Y | Y likes X | 6% |
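Several of the divergences in the table can be read as rewrite rules over argument slots. The following is a toy sketch, not Dorr's system: it treats X, Y and Z as single-token variables, matches the Spanish-side pattern literally (so the input uses the uninflected verb, as in the table), and emits the English-side pattern without any inflection. The categorial row is left out because its two sides do not share a variable. The names `RULES`, `to_regex` and `transfer` are hypothetical.

```python
import re

VARIABLES = {"X", "Y", "Z"}

# (Spanish-side pattern, English-side pattern), copied from the table.
RULES = [
    ("X dar puñaladas a Z", "X stab Z"),        # conflational
    ("X entrar en Y", "X enter Y"),             # structural
    ("X cruzar Y nadando", "X swim across Y"),  # head swapping
    ("X gustar a Y", "Y likes X"),              # thematic
]

def to_regex(pattern):
    """Turn 'X gustar a Y' into a regex with one named group per variable."""
    parts = [
        f"(?P<{tok}>\\S+)" if tok in VARIABLES else re.escape(tok)
        for tok in pattern.split()
    ]
    return re.compile(r"\s+".join(parts))

def transfer(source):
    """Apply the first rule whose Spanish side matches the whole input."""
    for src_pattern, tgt_pattern in RULES:
        match = to_regex(src_pattern).fullmatch(source)
        if match:
            return " ".join(
                match.group(tok) if tok in VARIABLES else tok
                for tok in tgt_pattern.split()
            )
    return None  # no rule applies

print(transfer("Juan gustar a María"))     # María likes Juan
print(transfer("Ana cruzar río nadando"))  # Ana swim across río
```

The thematic rule swaps the two arguments, and the head-swapping rule moves the manner of motion (nadando) into the main verb. The arguments themselves are passed through untranslated.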
3 “Classical” methods for MT
Three MT Approaches: Direct, Transfer, Interlingual
Direct Translation
Direct MT Dictionary entry
Direct MT
Problems with direct MT
The Transfer Model
English to French
Transfer rules
Japanese
Lexical transfer
English | German |
home | nach Hause (going home) |
home | Heim (home game) |
home | Heimat (homeland, home country) |
home | zu Hause (at home) |
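The table suggests that lexical transfer is a lookup keyed on the word plus some information about how it is used. A minimal sketch, assuming the usage labels (`"motion"`, `"modifier"`, `"country"`, `"location"`) are supplied by an earlier analysis step. The labels and the function `transfer_word` are invented for this illustration and are not taken from any particular system.

```python
# German renderings of English 'home', keyed on a hypothetical usage label.
LEXICAL_TRANSFER = {
    "home": {
        "motion": "nach Hause",    # going home
        "modifier": "Heim",        # home game
        "country": "Heimat",       # homeland, home country
        "location": "zu Hause",    # at home
    },
}

def transfer_word(word, usage):
    """Return the target-language rendering for a word in a given usage."""
    senses = LEXICAL_TRANSFER.get(word)
    if senses is None:
        return word  # unknown word: pass it through unchanged
    return senses[usage]

print(transfer_word("home", "location"))  # zu Hause
print(transfer_word("home", "motion"))    # nach Hause
```

The hard part in practice is deciding the usage label, which this sketch simply takes as given.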
Systran: combining direct and transfer
Transfer: some problems
Interlingua
Interlingua
Mary did not slap the green witch
Interlingua
Direct MT: pros and cons (Bonnie Dorr)
Cons
Pros
Interlingual MT: pros and cons (B. Dorr)
Cons
Pros
Moving toward Statistical MT
Warren Weaver (1947)
When I look at an article in Russian, I say to myself: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.
Kevin Knight slide
Rosetta Stone
Carved in 196 BC
Found in 1799
Decoded in 1822
Egyptian hieroglyphs
Egyptian Demotic
Greek
Kevin Knight slide
Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan:
farok crrrok hihok yorok clok kantok ok-yurp
Kevin Knight slide
Centauri/Arcturan Parallel Corpus
Slide from Kevin Knight
1a. ok-voon ororok sprok . 1b. at-voon bichat dat . | 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . |
2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . | 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . |
3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . | 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . |
4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . | 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . |
5a. wiwok farok izok stok . 5b. totat jjat quat cat . | 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . |
6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . | 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . |
Centauri to Arcturan Translation
Slide from Kevin Knight
Translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Centauri/Arcturan Alignment
Translating this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Slide from Kevin Knight
Centauri/Arcturan Alignment
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
process of elimination
Slide from Kevin Knight
Centauri/Arcturan Alignment
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
cognate?
Slide from Kevin Knight
Centauri/Arcturan Alignment
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
zero fertility
Slide from Kevin Knight
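The reasoning in the alignment slides can be approximated by counting which words occur in the same sentence pairs. The sketch below uses one simple heuristic, the Dice coefficient over sentence-pair co-occurrence. It is not the procedure from Knight's tutorial and not a full statistical alignment model, and the function names are illustrative.

```python
from collections import defaultdict

# The twelve Centauri/Arcturan sentence pairs, final periods dropped.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok",
     "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]

def occurrences(side):
    """Map each word to the set of sentence-pair indices in which it occurs."""
    occ = defaultdict(set)
    for i, pair in enumerate(CORPUS):
        for word in pair[side].split():
            occ[word].add(i)
    return occ

SRC, TGT = occurrences(0), occurrences(1)

def dice(s, t):
    """Dice coefficient between the occurrence sets of s and t."""
    return 2 * len(SRC[s] & TGT[t]) / (len(SRC[s]) + len(TGT[t]))

def best_targets(s):
    """All Arcturan words that share the highest Dice score with s."""
    scores = {t: dice(s, t) for t in TGT}
    top = max(scores.values())
    return {t for t, score in scores.items() if score == top}

for word in "farok crrrok hihok yorok clok kantok ok-yurp".split():
    print(word, "->", sorted(best_targets(word)))
```

On this corpus the heuristic pins down farok (jjat), hihok (arrat), yorok (mat) and clok (bat). It cannot separate kantok from ok-yurp, since both occur only in pair 9, which is where the cognate hint helps. It also links crrrok to zanzanat, although the target word set { jjat, arrat, mat, bat, oloat, at-yurp } leaves crrrok without a counterpart (zero fertility).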
It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates .
1b. Garcia y asociados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
5a. its clients are angry .
5b. sus clientes estan enfadados .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Slide from Kevin Knight
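Once the word pairs are known, the test sentence can be translated by substitution. The lexicon below is read off the twelve sentence pairs above, and `None` marks the zero-fertility word "do". This is a toy sketch that keeps English word order and lowercases the input, which happens to work for this sentence but would not in general. `LEXICON` and `translate` are illustrative names.

```python
# English -> Spanish word pairs read off the parallel sentences above.
# None marks a word with zero fertility (it produces no target word).
LEXICON = {
    "clients": "clientes",
    "do": None,
    "not": "no",
    "sell": "venden",
    "pharmaceuticals": "medicinas",
    "in": "en",
    "europe": "Europa",
}

def translate(sentence):
    """Substitute word by word, dropping zero-fertility words."""
    output = []
    for word in sentence.lower().split():
        target = LEXICON[word]  # raises KeyError for out-of-lexicon words
        if target is not None:
            output.append(target)
    return " ".join(output)

print(translate("Clients do not sell pharmaceuticals in Europe"))
# clientes no venden medicinas en Europa
```

Apart from capitalization, the output matches the Spanish sentence given on the slide.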
Summary