The MADAR-Turk corpus adds Turkish sentences to the MADAR Corpus (Bouamor et al., 2018), which provided the first set of parallel sentences to include the dialects of 25 Arab cities in addition to English, French, and MSA. The MADAR Corpus was built on
the Basic Traveling Expression Corpus (BTEC) (Takezawa et al., 2007)
and comprised about 20,000 English tourism-related sentences.
BTEC is conversational in nature, has short sentences, and has
translations in several languages, making it an attractive resource
for building and testing machine translation models.
To create MADAR-Turk, two native Arabic speakers from Syria who are
highly fluent in Turkish translated all 2,000 Damascus dialect
sentences into Turkish. We chose the Damascus entries because our
initial objective was Syrian Arabic to Turkish machine translation.
The sentences came from the following sub-splits in the MADAR Corpus:
200 corpus-6-test-corpus-26-dev
200 corpus-6-test-corpus-26-test
1600 corpus-6-test-corpus-26-train
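
As a quick sanity check, the sub-split sizes listed above can be tallied to confirm that they account for all 2,000 translated sentences. The following is a minimal Python sketch. The split names and counts are copied from the list, while the dictionary and function names are illustrative assumptions and not part of the MADAR or MADAR-Turk release.

```python
# Sub-split sizes copied from the list above.
# SPLIT_SIZES, total_sentences, and split_shares are hypothetical names
# used only for this illustration.
SPLIT_SIZES = {
    "corpus-6-test-corpus-26-dev": 200,
    "corpus-6-test-corpus-26-test": 200,
    "corpus-6-test-corpus-26-train": 1600,
}


def total_sentences(split_sizes):
    """Return the total number of sentences across all sub-splits."""
    return sum(split_sizes.values())


def split_shares(split_sizes):
    """Return each sub-split's fraction of the total sentence count."""
    total = total_sentences(split_sizes)
    return {name: count / total for name, count in split_sizes.items()}


if __name__ == "__main__":
    print("total:", total_sentences(SPLIT_SIZES))
    for name, share in split_shares(SPLIT_SIZES).items():
        print(f"{name}: {share:.0%}")
```

Under these counts, the train sub-split holds 80% of the sentences and the dev and test sub-splits hold 10% each.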