Evaluation Scores of Google Translate in 107 Languages
Source data for Teach You Backwards: An In-Depth Study of Google Translate for 108 Languages (http://teachyoubackwards.com)

The tables in this file report the scores of translations of the same 20 English phrases to all 107 languages in Google Translate, as rated by native or highly competent speakers.

The data in this file is free to use according to the terms of the CC-by license at https://creativecommons.org/licenses/by/4.0/legalcode, with credit to Martin Benjamin, Kamusi Project International, 2018.

Evaluators rated each translation on a scale of A, B, or C. (A = good translation, B = not correct, but someone could understand the idea, C = completely wrong)

The original English phrases given to Google Translate, and the English explanations given to the evaluators, are shown in the phrase table below (the light purple table in the original sheet).
Click here for itemized translations and scores in all languages
Methodology is discussed in the Methodology notes further down (the black boxes in the original sheet).
Click here for language-to-language confidence scores (Tarzan-to-Tarzan intelligibility ratings between non-English pairs)

Bard is a weighted ranking that indicates the proportion of translations that were judged close to human quality. Responses scored A = 5, B = 2.5, C = 0. Maximum score = 100.
Tarzan indicates the percentage of times that translations could be understood, regardless of whether they were judged as human quality. Responses scored A = 5, B = 5, C = 0. Maximum score = 100.
Fail is the percentage of times that translations were judged as completely wrong. Responses scored A = 0, B = 0, C = 5. Maximum score = 100.

Languages in blue are those that this page says run on the Neural Machine Translation Model: https://cloud.google.com/translate/docs/languages
White = Phrase-Based Machine Translation Model.
Languages in orange are those that this page says run on the Neural Machine Translation Model: https://translate.google.com/intl/en/about/languages/
Languages in bold are among the world's top 100 (https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers)
Green-headed table shows GT results for top 100 languages by number of native speakers
Dark-blue-headed table shows GT results for languages not in top 100
Spanish was evaluated separately for SPain and Latin America. Portuguese was evaluated separately for PorTugal, Cape Verde, and BRazil. Traditional and Simplified Chinese were evaluated as one.

# | Original English phrase | English explanation shown to evaluators | Bard | Tarzan | Fail
1 | fly out of London | take an airplane from London | 35.20% | 79.00% | 21.00%
2 | like a bat out of hell | escaping as quickly as possible | 2.40% | 23.80% | 76.20%
3 | out cold | unconscious | 3.80% | 9.50% | 90.50%
4 | out of bounds | unacceptable | 6.70% | 41.00% | 59.00%
5 | out of breath | gasping for air (for example, after running) | 34.80% | 64.80% | 35.20%
6 | out of curiousity | because a person is casually interested in something | 33.30% | 65.70% | 34.30%
7 | out of focus | not clear to see (blurry) | 21.90% | 49.50% | 50.50%
8 | out of his mind | crazy | 4.80% | 28.60% | 71.40%
9 | out of milk | the supply of milk is finished | 4.30% | 19.00% | 81.00%
10 | out of order | does not function (broken) | 31.90% | 50.50% | 49.50%
11 | out of pocket | paid for something from personal money | 15.20% | 50.50% | 49.50%
12 | out of steam | no more energy (exhausted) | 3.30% | 10.50% | 89.50%
13 | out of style | unfashionable | 21.90% | 64.80% | 35.20%
14 | out of the closet | openly homosexual | 9.50% | 24.80% | 75.20%
15 | out of the game | no longer participating in a game | 32.40% | 68.60% | 31.40%
16 | out of the office | away from the office | 46.70% | 82.90% | 17.10%
17 | out of this world | excellent | 3.80% | 17.10% | 82.90%
18 | out of time | a deadline has passed | 14.80% | 63.80% | 36.20%
19 | out of wedlock | between partners who are not married | 30.00% | 61.00% | 39.00%
20 | out on the town | having a fun time going shopping or to bars/restaurants (carousing) | 4.30% | 21.90% | 78.10%

# | Alphabetical | Bard | Tarzan | Fail | Rank | Bard Ranking | Bard | Rank | Tarzan Ranking | Tarzan | Rank | % Fail Ranking | Fail
1 | Afrikaans | 67.5 | 87.5 | 12.5 | 1 | Afrikaans | 67.5 | 1 | Afrikaans | 87.5 | 1 | Bengali | 100
2 | Albanian | 26.25 | 40 | 60 | 2 | German | 60 | 2 | German | 82.5 | 1 | Haitian Creole | 100
3 | Amharic | 30 | 40 | 60 | 2 | Portuguese-PT | 60 | 3 | Portuguese-BR | 80 | 1 | Tajik | 100
4 | Arabic | 32.5 | 40 | 60 | 3 | Spanish-SP | 57.5 | 3 | Portuguese-CV | 80 | 2 | Kurdish | 95
5 | Armenian | 25 | 40 | 60 | 4 | Polish | 56.25 | 4 | Spanish-LA | 75 | 2 | Nepali | 95
6 | Azerbaijani | 37.5 | 55 | 45 | 5 | Chinese | 55 | 4 | Spanish-SP | 75 | 3 | Latin | 90
7 | Basque | 37.5 | 47.5 | 52.5 | 5 | Croatian | 55 | 5 | Polish | 72.5 | 3 | Malaysian | 90
8 | Belarusian | 40 | 55 | 45 | 5 | Spanish-LA | 55 | 6 | Danish | 70 | 3 | Urdu | 90
9 | Bengali | 0 | 0 | 100 | 6 | Dutch | 52.5 | 6 | Greek | 70 | 4 | Maori | 85
10 | Bosnian | 30 | 40 | 60 | 6 | Galician | 52.5 | 7 | Chinese | 65 | 5 | Cebuano | 80
11 | Bulgarian | 40 | 60 | 40 | 6 | Greek | 52.5 | 7 | Croatian | 65 | 5 | Georgian | 80
12 | Catalan | 37.5 | 60 | 40 | 6 | Portuguese-BR | 52.5 | 7 | Dutch | 65 | 5 | Kinyarwanda | 80
13 | Cebuano | 12.5 | 20 | 80 | 6 | Portuguese-CV | 52.5 | 7 | Finnish | 65 | 5 | Persian | 80
14 | Chichewa | 17.5 | 30 | 70 | 7 | Italian | 50 | 7 | Hungarian | 65 | 5 | Punjabi | 80
15 | Chinese | 55 | 65 | 35 | 7 | Latvian | 50 | 7 | Portuguese-PT | 65 | 5 | Uzbek | 80
16 | Corsican | 22.5 | 35 | 65 | 8 | Indonesian | 47.5 | 8 | Welsh | 62.5 | 6 | Xhosa | 77.5
17 | Croatian | 55 | 65 | 35 | 8 | Hungarian | 47.5 | 9 | Indonesian | 60 | 7 | Hawaiian | 75
18 | Czech | 43.75 | 55 | 45 | 8 | Igbo | 47.5 | 9 | Bulgarian | 60 | 7 | Javanese | 75
19 | Danish | 40 | 70 | 30 | 8 | Serbian | 47.5 | 9 | Catalan | 60 | 7 | Lao | 75
20 | Dutch | 52.5 | 65 | 35 | 9 | Finnish | 45 | 9 | French | 60 | 7 | Myanmar/Burmese | 75
21 | Esperanto | 40 | 55 | 45 | 9 | French | 45 | 9 | Galician | 60 | 7 | Pashto | 75
22 | Estonian | 27.5 | 45 | 55 | 9 | Hebrew | 45 | 9 | Igbo | 60 | 7 | Samoan | 75
23 | Filipino | 25 | 35 | 65 | 9 | Swedish | 45 | 9 | Italian | 60 | 7 | Swahili | 75
24 | Finnish | 45 | 65 | 35 | 10 | Czech | 43.75 | 9 | Kazakh | 60 | 7 | Thai | 75
25 | French | 45 | 60 | 40 | 10 | Welsh | 43.75 | 9 | Latvian | 60 | 7 | Yoruba | 75
26 | Frisian | 30 | 40 | 60 | 11 | Japanese | 42.5 | 9 | Macedonian | 60 | 8 | Chichewa | 70
27 | Galician | 52.5 | 60 | 40 | 11 | Kazakh | 42.5 | 9 | Maltese | 60 | 8 | Hindi | 70
28 | Georgian | 10 | 20 | 80 | 11 | Korean | 42.5 | 9 | Odia (Oriya) | 60 | 8 | Icelandic | 70
29 | German | 60 | 82.5 | 17.5 | 11 | Malaysian | 42.5 | 9 | Sesotho | 60 | 8 | Lithuanian | 70
30 | Greek | 52.5 | 70 | 30 | 11 | Romanian | 42.5 | 9 | Slovenian | 60 | 8 | Marathi | 70
31 | Gujarati | 30 | 45 | 55 | 12 | Belarusian | 40 | 9 | Swedish | 60 | 8 | Shona | 70
32 | Haitian Creole | 0 | 0 | 100 | 12 | Bulgarian | 40 | 10 | Azerbaijani | 55 | 8 | Somali | 70
33 | Hausa | 27.5 | 35 | 65 | 12 | Danish | 40 | 10 | Belarusian | 55 | 8 | Zulu | 70
34 | Hawaiian | 15 | 25 | 75 | 12 | Esperanto | 40 | 10 | Czech | 55 | 9 | Kyrgyz | 67.5
35 | Hebrew | 45 | 45 | 55 | 12 | Mongolian | 40 | 10 | Esperanto | 55 | 10 | Corsican | 65
36 | Hindi | 22.5 | 30 | 70 | 12 | Russian | 40 | 10 | Japanese | 55 | 10 | Filipino | 65
37 | Hmong | 27.5 | 40 | 60 | 12 | Scots Gaelic | 40 | 10 | Korean | 55 | 10 | Hausa | 65
38 | Hungarian | 47.5 | 65 | 35 | 12 | Slovenian | 40 | 10 | Romanian | 55 | 10 | Luxembourgish | 65
39 | Icelandic | 20 | 30 | 70 | 12 | Turkish | 40 | 10 | Serbian | 55 | 10 | Tamil | 65
40 | Igbo | 47.5 | 60 | 40 | 13 | Azerbaijani | 37.5 | 11 | Malayalam | 50 | 10 | Tatar | 65
41 | Indonesian | 47.5 | 60 | 40 | 13 | Basque | 37.5 | 11 | Mongolian | 50 | 10 | Telugu | 65
42 | Irish | 25 | 45 | 55 | 13 | Catalan | 37.5 | 11 | Russian | 50 | 10 | Vietnamese | 65
43 | Italian | 50 | 60 | 40 | 14 | Macedonian | 35 | 11 | Scots Gaelic | 50 | 11 | Albanian | 60
44 | Japanese | 42.5 | 55 | 45 | 14 | Maltese | 35 | 11 | Turkish | 50 | 11 | Amharic | 60
45 | Javanese | 17.5 | 25 | 75 | 14 | Odia (Oriya) | 35 | 11 | Turkmen | 50 | 11 | Arabic | 60
46 | Kannada | 20 | 40 | 60 | 14 | Sinhala | 35 | 11 | Yiddish | 50 | 11 | Armenian | 60
47 | Kazakh | 42.5 | 60 | 40 | 15 | Arabic | 32.5 | 12 | Basque | 47.5 | 11 | Bosnian | 60
48 | Khmer | 30 | 45 | 55 | 15 | Sesotho | 32.5 | 13 | Estonian | 45 | 11 | Frisian | 60
49 | Kinyarwanda | 12.5 | 20 | 80 | 16 | Turkmen | 31.25 | 13 | Gujarati | 45 | 11 | Hmong | 60
50 | Korean | 42.5 | 55 | 45 | 16 | Uyghur | 31.25 | 13 | Hebrew | 45 | 11 | Kannada | 60
51 | Kurdish | 3.75 | 5 | 95 | 17 | Amharic | 30 | 13 | Irish | 45 | 11 | Malagasy | 60
52 | Kyrgyz | 22.5 | 32.5 | 67.5 | 17 | Bosnian | 30 | 13 | Khmer | 45 | 11 | Sindhi | 60
53 | Lao | 21.25 | 35 | 65 | 17 | Frisian | 30 | 13 | Norwegian | 45 | 11 | Sundanese | 60
54 | Latin | 5 | 10 | 90 | 17 | Gujarati | 30 | 13 | Sinhala | 45 | 11 | Ukrainian | 60
55 | Latvian | 50 | 60 | 40 | 17 | Khmer | 30 | 13 | Slovak | 45 | 12 | Estonian | 55
56 | Lithuanian | 20 | 30 | 70 | 17 | Sindhi | 30 | 13 | Uyghur | 45 | 12 | Gujarati | 55
57 | Luxembourgish | 27.5 | 35 | 65 | 17 | Ukrainian | 30 | 14 | Albanian | 40 | 12 | Hebrew | 55
58 | Macedonian | 35 | 60 | 40 | 18 | Estonian | 27.5 | 14 | Amharic | 40 | 12 | Irish | 55
59 | Malagasy | 27.5 | 40 | 60 | 18 | Hausa | 27.5 | 14 | Arabic | 40 | 12 | Khmer | 55
60 | Malayalam | 42.5 | 50 | 50 | 18 | Hmong | 27.5 | 14 | Armenian | 40 | 12 | Norwegian | 55
61 | Malaysian | 5 | 10 | 90 | 18 | Luxembourgish | 27.5 | 14 | Bosnian | 40 | 12 | Sinhala | 55
62 | Maltese | 35 | 60 | 40 | 18 | Malagasy | 27.5 | 14 | Frisian | 40 | 12 | Slovak | 55
63 | Maori | 10 | 15 | 85 | 18 | Norwegian | 27.5 | 14 | Hmong | 40 | 12 | Uyghur | 50
64 | Marathi | 15 | 30 | 70 | 18 | Slovak | 27.5 | 14 | Kannada | 40 | 13 | Basque | 52.5
65 | Mongolian | 40 | 50 | 50 | 18 | Tatar | 27.5 | 14 | Malagasy | 40 | 14 | Malayalam | 50
66 | Myanmar/Burmese | 12.5 | 25 | 75 | 18 | Yiddish | 27.5 | 14 | Sindhi | 40 | 14 | Mongolian | 50
67 | Nepali | 2.5 | 5 | 95 | 18 | Albanian | 26.25 | 14 | Sundanese | 40 | 14 | Russian | 50
68 | Norwegian | 27.5 | 45 | 55 | 20 | Armenian | 25 | 14 | Ukrainian | 40 | 14 | Scots Gaelic | 50
69 | Odia (Oriya) | 35 | 60 | 40 | 20 | Filipino | 25 | 15 | Corsican | 35 | 14 | Turkish | 50
70 | Pashto | 20 | 25 | 75 | 20 | Irish | 25 | 15 | Filipino | 35 | 14 | Turkmen | 50
71 | Persian | 10 | 20 | 80 | 21 | Corsican | 22.5 | 15 | Hausa | 35 | 14 | Yiddish | 50
72 | Polish | 56.25 | 72.5 | 27.5 | 21 | Hindi | 22.5 | 15 | Lao | 35 | 15 | Azerbaijani | 45
73 | Portuguese-BR | 60 | 80 | 20 | 21 | Kyrgyz | 22.5 | 15 | Luxembourgish | 35 | 15 | Belarusian | 45
74 | Portuguese-CV | 52.5 | 80 | 20 | 21 | Sundanese | 22.5 | 15 | Tamil | 35 | 15 | Czech | 45
75 | Portuguese-PT | 52.5 | 65 | 35 | 21 | Telugu | 22.5 | 15 | Tatar | 35 | 15 | Esperanto | 45
76 | Punjabi | 10 | 20 | 80 | 21 | Lao | 21.25 | 15 | Telugu | 35 | 15 | Japanese | 45
77 | Romanian | 42.5 | 55 | 45 | 23 | Icelandic | 20 | 15 | Vietnamese | 35 | 15 | Korean | 45
78 | Russian | 40 | 50 | 50 | 23 | Kannada | 20 | 16 | Kyrgyz | 32.5 | 15 | Romanian | 45
79 | Samoan | 15 | 25 | 75 | 23 | Lithuanian | 20 | 17 | Chichewa | 30 | 15 | Serbian | 45
80 | Scots Gaelic | 40 | 50 | 50 | 23 | Pashto | 20 | 17 | Hindi | 30 | 16 | Bulgarian | 40
81 | Serbian | 47.5 | 55 | 45 | 23 | Tamil | 20 | 17 | Icelandic | 30 | 16 | Catalan | 40
82 | Sesotho | 32.5 | 60 | 40 | 23 | Vietnamese | 20 | 17 | Lithuanian | 30 | 16 | French | 40

Methodology:

We translated 20 common English phrases to 102 languages, and then sent a re-worded explanation of the phrase to independent evaluators. We did not send the original English phrase, because early testing showed that knowing the actual words in question influenced the way that evaluators understood the query. For example, if we showed "out of steam" along with its explanation, some evaluators would see that a word for "steam" was given and therefore judge the translation highly, whereas if they only saw the explanation "no more energy (exhausted)", they were able to judge whether the proposed translation captured the usual English meaning.

We tested clusters of two or more words that often occur together, and that have a meaning that is generally discernible when they do (as can be readily seen in Twitter searches); for example, tweets with “out cold” almost always imply unconsciousness. We did not test single words, which can be highly ambiguous (for example, right = correct, legal entitlement, politically conservative, not left, etc.), and therefore too arbitrary in isolation. Although GT does not advertise itself as a dictionary, single-word lookups are a major proportion of real world uses of the service. Nor did we test full sentences, which add a lot of complexity to scoring, since translations might include both correct and incorrect elements; moreover, humans can translate the same sentence many ways, making it impossible to impose a gold standard for full sentences, especially across dozens of languages. It should be noted that GT alters its vocabulary choice on the fly, so the words chosen to translate a phrase may not be those selected in a longer sentence; for example, “run of the mill” is represented in French by “course du moulin” in isolation, and “course de l’usine” when translating a longer tweet (both completely wrong), and other instances might produce other results.

All of the clusters contained the word “out”. This word is ubiquitous in English texts. It is extremely ambiguous in isolation, but often occurs in clusters with unmistakable meanings. WordReference.com gives definitions and French equivalents for nearly 1700 composed expressions that include “out”, from “a fish out of water” to “zoom out” (http://www.wordreference.com/enfr/out?start=1600). We chose 20 formulations that are lexicalized in WordReference as composed forms such as “out of style”, or that, as queried on Twitter, usually reduce to defined meanings when matched with other particular words, such as “out of milk”. The premise is that all of these items have been translated in an electronic dictionary, so are thus similarly viable as units for machine translation.

The expressions were not chosen to be especially simple or difficult, nor based on corpus frequency. Rather, they were chosen because they had clear meanings, and are broadly representative of the types of phrase that ordinary users are likely to seek to translate. The selection (see above) is therefore not rigidly scientific, and we leave it to the reader to decide whether the items provide a fair test of MT capability. The least-recognized phrase, “out cold”, resulted in just 3 “A” ratings (Hausa, Hindi, and Malaysian) and was only understandable to some extent 10 times, while the phrase “out of the office” produced an understandable result 83% of the time.

Scores are not absolute, for two reasons. First, the choice of expressions was arbitrary. A different selection of English expressions would generate different numerical results within each language; for example, scores would probably fall were a larger number of idioms to be included. However, relative results would likely remain the same; a language with high scores for our test set should perform highly with other data, while a language with low scores herein would have similarly low results with other input. Second, most language scores show the subjective opinion of a single reviewer. One could well argue that more reviewers per language would produce more reliable data. We had several languages with multiple reviewers, and found that inter-annotator disagreements were usually minor, with a handful of entries per language being judged good versus marginal, or marginal versus wrong. A single entry being ranked by different evaluators as good versus marginal does not change the Tarzan score and changes the Bard score by 2.5 points, and a disagreement between marginal and wrong changes both Tarzan and Bard by 2.5. We averaged the scores where annotators disagreed, and have kept disagreements visible in the full data release, in the "Itemized Translations and Scores" tab at the bottom of this page. Based on the inter-annotator disagreement values we discovered, the reader is advised to place mental error bars of ±10 around the score that is reported.
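
The Bard, Tarzan, and Fail definitions given in the notes of this file can be reproduced from a list of A/B/C ratings. The following is a minimal sketch, not the script used for the study. It assumes that each of the 20 phrases is stored as a string with one letter per evaluator (a hypothetical encoding), and that "averaged the scores where annotators disagreed" means averaging the per-phrase points across evaluators. The function names and the example ratings are illustrative only and are not taken from the data release.

```python
# Points per rating, copied from the Bard, Tarzan, and Fail definitions in this file.
BARD = {"A": 5.0, "B": 2.5, "C": 0.0}
TARZAN = {"A": 5.0, "B": 5.0, "C": 0.0}
FAIL = {"A": 0.0, "B": 0.0, "C": 5.0}


def phrase_points(entry, weights):
    """Average the points for one phrase over all evaluators who rated it.

    `entry` holds one letter per evaluator, e.g. "A" or "BC" (assumed encoding).
    """
    return sum(weights[letter] for letter in entry) / len(entry)


def language_scores(entries):
    """Return the (Bard, Tarzan, Fail) totals for one language."""
    return tuple(
        sum(phrase_points(entry, weights) for entry in entries)
        for weights in (BARD, TARZAN, FAIL)
    )


# Hypothetical ratings for 20 phrases; the last phrase has two evaluators who disagree.
example = ["A"] * 8 + ["B"] * 5 + ["C"] * 6 + ["BC"]
bard, tarzan, fail = language_scores(example)
print(bard, tarzan, fail)  # 53.75 67.5 32.5
```

With 20 phrases worth up to 5 points each, every score has a maximum of 100, and Tarzan and Fail always sum to 100 because each rating earns 5 points in exactly one of the two.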