1 of 21

ParsiNLU: A Suite of Language Understanding Challenges for Persian

Daniel Khashabi, Arman Cohan, Siamak Shakeri, Pedram Hosseini, Pouya Pezeshkpour, Malihe Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze Brahman, Sarik Ghazarian, Mozhdeh Gheini, Arman Kabiri, Rabeeh Karimi Mahabadi, Omid Memarrast, Ahmadreza Mosallanezhad, Erfan Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Samghabadi, Mahsa Shafaei, Saber Sheybani, Ali Tazarv, Yadollah Yaghoobzadeh

2 of 21

Motivation

Benchmarks, e.g., GLUE, have played a major role in pushing forward the performance on NLU tasks.

However, such benchmarks (and the advances they bring about) are disproportionately focused on a small subset of the world’s high-resource languages, most notably English.

This work introduces ParsiNLU, the first Persian NLU benchmark.

The goal is to alleviate the resource scarcity of Persian and push forward NLP progress for the language.

3 of 21

ParsiNLU: Overview

Reading Comprehension
Multiple-Choice QA
Sentiment Analysis
Textual Entailment
Question Paraphrasing
Machine Translation

6 tasks

Manual Annotations

External Resources

Experiments w/ SOTA language models

4 of 21

Task 1: Reading Comprehension (1)

Setup: This task is defined as extracting a substring from a given context paragraph that answers a given question.

سوال: نهاوند جزو کدام استان است؟

Question: Nahavand is part of which province?

پاراگراف: نَهاوند شهری در غرب ایران است. این شهر در جنوب غربی استان همدان قرار گرفته است. نهاوند دارای حمعیت …

Paragraph: Nahavand (Navan) is a city in western Iran. This city is located in the southern part of Hamedan province and it is the capital of Nahavand. Nahavand has a population of …

پاسخ: همدان، استان همدان

Answer: Hamedan; Hamedan province
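
The experiments later report F1 for this task. As a minimal sketch of how a predicted span could be scored against several valid answers, the snippet below computes a SQuAD-style token-overlap F1 and takes the best score over the gold answers. The whitespace tokenization, lowercasing, and function names are assumptions for illustration, not the benchmark's official scorer.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between one predicted span and one gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, gold_answers: list[str]) -> float:
    """Score against every valid gold span and keep the best match."""
    return max(token_f1(prediction, gold) for gold in gold_answers)


gold_answers = ["Hamedan", "Hamedan province"]
exact = best_f1("Hamedan province", gold_answers)
partial = best_f1("in Hamedan", gold_answers)
miss = best_f1("Tehran", gold_answers)
print(exact, partial, miss)
```

Taking the maximum over gold answers is what makes annotating all valid spans useful: a prediction is not penalized for matching the shorter or the longer variant.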

5 of 21

Task 1: Reading Comprehension (2)

Overview of the data collection pipeline

Google’s Auto-complete

- a seed set of question terms: Who, Where, …
- repeatedly querying parts
- popular questions of Persian-speaking users of Google

Questions

Filtered out: short questions; open-ended questions with no concrete answers

Annotation: select minimal and coherent spans that contain the answer; correct grammatical errors and typos

Question, Answer, Paragraph

To create a reading comprehension dataset for Persian, we first need to collect questions, and then annotate paragraphs and their answers.

To collect questions, we do not translate questions from the English SQuAD dataset, since many questions there are not common among Persian speakers.

For example, Persian speakers usually do not follow US sports events (e.g., Superbowl, NFL).
Instead, we try to mine questions that are more likely to be of interest to Persian speakers.
To collect such a naturalistic set of questions, we leverage Google's auto-complete API, which reflects popular queries of its users. We start by querying popular search terms and bootstrap from the resulting questions to collect a larger and richer set of Persian questions posed by users of Google.
We also filter out open-ended questions that have no concrete answers, such as “What is the result of the game with Japan?”, and keep complete questions that can be answered from well-established sources (such as Wikipedia).
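
The bootstrapping idea can be sketched as a breadth-first expansion over query prefixes. This is only an illustration under stated assumptions: `suggest` is a hypothetical callable standing in for whatever auto-complete service is queried (no real API is called here), and the toy index and English queries are made up for the example.

```python
from collections import deque
from typing import Callable, Iterable


def bootstrap_questions(
    seeds: Iterable[str],
    suggest: Callable[[str], list[str]],
    max_questions: int = 1000,
) -> list[str]:
    """Expand seed terms into questions by repeatedly querying prefixes."""
    seen: set[str] = set()      # completions already collected
    queried: set[str] = set()   # prefixes already sent to `suggest`
    collected: list[str] = []
    queue = deque(seeds)
    while queue and len(collected) < max_questions:
        prefix = queue.popleft()
        if prefix in queried:
            continue
        queried.add(prefix)
        for completion in suggest(prefix):
            if completion in seen:
                continue
            seen.add(completion)
            collected.append(completion)
            # Re-query with parts of each new completion to find related questions.
            words = completion.split()
            for end in range(2, len(words)):
                queue.append(" ".join(words[:end]))
    return collected[:max_questions]


# Toy stand-in for an auto-complete service (hypothetical data).
FAKE_INDEX = [
    "who invented the telephone",
    "who invented the radio",
    "where is nahavand",
]


def fake_suggest(prefix: str) -> list[str]:
    return [q for q in FAKE_INDEX if q.startswith(prefix) and q != prefix]


questions = bootstrap_questions(["who", "where"], fake_suggest)
print(questions)
```

Passing the suggestion source in as a function keeps the crawl logic testable offline and independent of any particular service.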

After collecting the questions, we retrieve the top 10 relevant articles returned by Google as candidate paragraphs.
Six native speakers read the most relevant documents retrieved from Google and annotate whether the paragraphs include the answer.

During annotation, we marked all the valid spans as answers.

Overall, we have collected 1.3k question-answer-paragraph triplets.

6 of 21

Task 2: Multiple-Choice QA (1)

Setup: Given a natural language question, pick the correct answer from a list of multiple candidates.

Common format for evaluation of fact-retrieval [Richardson et al., 2013; Clark et al., 2020]

بزرگترین قاره‌ی جهان کدام است؟‌

✔ ۱) آسیا ۲)‌ اروپا ۳) آمریکا ۴) آفریقا

What is the largest continent in the world?

✔ 1) Asia 2) Europe 3) Americas 4) Africa

نجاری روزی یک صندلی و شاگردش در سه روز یک صندلی می‌سازد. اگر نجار و شاگردش با هم کارکنند، ۱۲ صندلی رو در چند روز می سازند؟

۱) ۱۲ ✔ ۲)‌ ۹ ۳) ۸ ۴) ۶

A carpenter makes a chair a day and his apprentice makes a chair in three days. If the carpenter and his apprentice work together, in how many days will they make 12 chairs?

1) 12 ✔ 2) 9 3) 8 4) 6
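
As a quick check of why option 2 is correct, the work rates can be added with exact fractions. The variable names are illustrative only.

```python
from fractions import Fraction

# Rates in chairs per day, taken from the question.
carpenter_rate = Fraction(1, 1)    # one chair per day
apprentice_rate = Fraction(1, 3)   # one chair every three days

combined_rate = carpenter_rate + apprentice_rate  # 4/3 chairs per day
days_needed = Fraction(12) / combined_rate        # 12 chairs / (4/3 chairs per day)

print(days_needed)
```

Questions like this require a short chain of arithmetic reasoning rather than fact retrieval, which is part of what makes the task hard for language models.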

7 of 21

Task 2: Multiple-Choice QA (2)

Construction:

(1) Collecting existing exams in Persian:

Annual college entrance exams
Employment exams

(2) Run them through OCR

(3) Manually fix OCR mistakes

Sources of questions. We use existing sources of multiple-choice questions, rather than annotating new ones.

We collect the questions from a variety of sources:

Annual college entrance exams in Iran, for the past 15 years.
Employment exams that are expected to assess an individual’s depth in various topics (accounting, teaching, mathematics, logic, etc.).
Common knowledge questions, which involve questions about topics such as basic science, history, or geography.

Most of the above sources are scanned copies of the original exams in image format.

We use an existing Persian OCR tool to convert the image data to a textual format.

Then 4 annotators fix any mistakes made by the OCR system and convert the result into a structured format.
Overall, this yields 2460 questions with an average of 4.0 candidate answers (Table 2).

To further examine the quality of the annotations, we randomly sampled 100 questions from the annotations and cross-checked the corrected OCR output against the original data.

We found that 94 of these questions exactly matched the original data, and the rest required minor modifications.
We thus conclude that the annotated data is of high quality.
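
A spot check of this kind can be sketched as follows: draw a random sample of items and measure the fraction whose cleaned text exactly matches the original. The function name, the parallel-list layout, and the toy data are assumptions for illustration; in the check described above the result was 94 out of 100.

```python
import random


def exact_match_rate(cleaned: list[str], original: list[str],
                     sample_size: int = 100, seed: int = 0) -> float:
    """Fraction of randomly sampled items whose cleaned text equals the original."""
    if len(cleaned) != len(original) or not cleaned:
        raise ValueError("inputs must be non-empty and aligned")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    indices = rng.sample(range(len(cleaned)), min(sample_size, len(cleaned)))
    matches = sum(cleaned[i] == original[i] for i in indices)
    return matches / len(indices)


# Toy data: 200 aligned items, identical in both lists.
original_items = [f"question {i}" for i in range(200)]
perfect_rate = exact_match_rate(list(original_items), original_items)

# Toy data: every cleaned item differs from its original.
broken_items = [text + " ?" for text in original_items]
broken_rate = exact_match_rate(broken_items, original_items)
print(perfect_rate, broken_rate)
```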

8 of 21

Task 3: Sentiment Analysis (1)

We explored two relatively less investigated domains in Persian Sentiment Analysis:

Movie reviews
Online grocery shopping

Our sentiment labels are on a 5-point Likert scale, [-2, -1, 0, +1, +2], at two levels:

Document (review)-level
Aspect-level

Example: “It tastes good but it’s so expensive even with a special offer. It’s almost double the price of fresh meat.”

Labels:

Overall sentiment: negative (-1)
Taste: positive (+1)
Price: very negative (-2)

For our sentiment analysis task, we chose two relatively less investigated domains in Persian sentiment analysis: movie reviews and online grocery shopping.
We defined our sentiment labels on a 5-point Likert scale at two levels:

1) document-level where we assign a sentiment label to the entire review, and
2) aspect-level where we first tag the aspects in a review and then we assign sentiment labels to those aspects.

Here is an example of our sentiment analysis task from the online grocery shopping domain.

We have a review in which a user comments on two aspects of a meat product:

1) taste with a positive sentiment and
2) price with a very negative sentiment.

And since the weight of the negative comment here is stronger, we assign an overall negative sentiment to the review.
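
To make the two annotation levels concrete, here is a minimal sketch of how one annotated review could be represented. The class and field names are hypothetical, and the label names for 0 and +2 are assumed; only the names for -2, -1, and +1 appear on the slide.

```python
from dataclasses import dataclass, field

# Names for -2, -1, +1 follow the slide; names for 0 and +2 are assumptions.
SENTIMENT_NAMES = {
    -2: "very negative",
    -1: "negative",
    0: "neutral",
    1: "positive",
    2: "very positive",
}


@dataclass
class AnnotatedReview:
    """Hypothetical container holding document-level and aspect-level labels."""
    text: str
    overall: int                                          # document-level label
    aspects: dict[str, int] = field(default_factory=dict)  # aspect -> label

    def __post_init__(self) -> None:
        # Reject labels that fall outside the 5-point scale.
        for label in (self.overall, *self.aspects.values()):
            if label not in SENTIMENT_NAMES:
                raise ValueError(f"label {label} is outside the 5-point scale")


review = AnnotatedReview(
    text="It tastes good but it's so expensive even with a special offer.",
    overall=-1,
    aspects={"taste": 1, "price": -2},
)
for aspect, label in review.aspects.items():
    print(aspect, SENTIMENT_NAMES[label])
```

Note that the overall label is annotated separately rather than computed from the aspect labels, which matches the example: one positive and one very negative aspect yield an overall negative review.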

Now let me explain the annotation process.

9 of 21

Task 3: Sentiment Analysis (2)

Annotation process: Review sources → Defining sentiment aspects → Training annotators → Final annotation

Food & beverages aspects: Purchase value/price, Packaging, Delivery, Product quality, Nutritional value, Taste/smell

Movie review aspects: Music, Sound, Directing, Story/screenplay, Acting/performance, Cinematography, Scene

Total annotated reviews: 2,423

Food & beverages: 1,917
Movie: 506

Annotation tasks and Cohen’s Kappa agreement:

Task 1: Assigning sentiment to a review (0.76, substantial)
Task 2: Tagging aspects in a review (0.49, moderate)
Task 3: Assigning sentiment to aspects (0.47, moderate)

In our annotation process, we started by collecting reviews from two popular Iranian/Persian websites for movie reviews and online shopping.
Then we defined a set of aspects for each domain. (You can see all these aspects in the table on the right side of the slide.)
Then we ran one round of annotation to train our annotators and finalize our annotation guidelines.
Next, we ran the final round of manual annotation to create our benchmark.
We annotated 2,423 reviews from two domains in total. You can see the distributions of sentiment labels on the right side of the slide.
We also computed the Cohen’s Kappa agreement on a randomly selected subset of reviews.

Our agreement results show a substantial agreement for document-level annotation and a moderate-level agreement for aspect-level related tasks.
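
For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below computes it from two label lists and maps the score to the commonly used Landis and Koch bands, which is where the terms "moderate" and "substantial" come from. The toy labels are made up; they are not the benchmark's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and aligned")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_band(kappa: float) -> str:
    """Landis and Koch interpretation bands."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


annotator_a = [1, 1, 0, 0]
annotator_b = [1, 1, 0, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa, agreement_band(kappa))
```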

10 of 21

Task 4: Textual Entailment (1)

Setup: This task is defined as determining the 3-way relationship between two sentences: Entailment, Neutral, or Contradiction.

1- Based on natural sentences: writing a premise and hypothesis based on existing sentences

2- Based on existing datasets: translating MNLI instances using Google Translate

Premise: Poor people in more than a couple of counties in Atlanta receive help from the Atlanta Legal Aid.

پیش فرض: مردم فقیر در بیش از چند شهرستان در آتلانتا از کمک حقوقی آتلانتا کمک میگیرند.

Hypothesis: Atlanta Legal Aid provides civil services to poor people in five metro Atlanta counties.

فرضیه: کمک حقوقی آتلانتا به مردم فقیر در پنج منطقه شهری آتلانتا خدمات مدنی ارائه می دهد.

11 of 21

Task 4: Textual Entailment (2)

Overview of the data collection pipeline:

1- Sampling Sentences with Conjunctive Adverbs from Persian Wikipedia → Human Annotation

S1: فرانسه بارها اعلام داشته که بحران اقتصادی بین المللی است ، پس چاره کار هم باید جهانی باشد.

France has repeatedly stated that the economic crisis is international, so the solution must be global too.

S2: صدام حسين می گويد ترجيح می دهد بميرد ولی به تبعيد نرود.

Saddam Hussein says he would rather die but not go into exile.

S3: پنج تن از آنها به ژاپن بازگشتند همچنین ممکن است بقیه ی ربوده شدگان هنوز زنده باشند.

Five of them came back to Japan. Also, the rest of the abductees might still be alive.

2- MNLI Dataset → Google Translate (En-Fa) → Fixing Translations

P: Corona virus spreads mainly between people who are in close contact with each other.

ویروس کرونا عمدتا بین افرادی که در تماس نزدیک با یکدیگر هستند ، گسترش می یابد.

H: People can be infected when droplets containing the virus are inhaled.

افراد هنگام استنشاق قطرات حاوی ویروس می توانند آلوده شوند.

Output: Textual Entailment Dataset
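
The first pipeline starts from sentences that contain a conjunctive adverb. A minimal sketch of that sampling step is shown below. The adverb list is illustrative only: it contains the three connectives that appear in the example sentences S1 to S3 ("so", "but", "also"), not the full list used for the benchmark, and the whitespace tokenization is an assumption.

```python
# Illustrative list: connectives seen in examples S1-S3 ("so", "but", "also").
CONJUNCTIVE_ADVERBS = {"پس", "ولی", "همچنین"}
PUNCTUATION = "،.؛!؟,;:"


def has_conjunctive_adverb(sentence: str) -> bool:
    """True if any whitespace-separated token is a listed conjunctive adverb."""
    tokens = (token.strip(PUNCTUATION) for token in sentence.split())
    return any(token in CONJUNCTIVE_ADVERBS for token in tokens)


def sample_candidates(sentences: list[str]) -> list[str]:
    """Keep only sentences that could be rewritten into premise/hypothesis pairs."""
    return [s for s in sentences if has_conjunctive_adverb(s)]


sentences = [
    "هوا سرد است پس کلاه بپوش",  # "The weather is cold, so wear a hat"
    "هوا سرد است",               # "The weather is cold"
]
candidates = sample_candidates(sentences)
print(len(candidates))
```

Sentences selected this way already contain two related clauses, which gives annotators natural material for writing a premise and a hypothesis.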

12 of 21

Task 5: Question Paraphrasing (1)

Setup: This task is defined as determining whether two given questions are paraphrases or not:

(1) Based on natural sentences

Mining questions using Google auto-complete

Creating pairs of questions with high token overlap

(2) Based on existing datasets

Getting question pairs from QQP dataset

Translating them using Google Translate

سوال ۱: کدام شهرهای ایران در وضعیت سفید کرونا هستند؟

Q1: Which cities in Iran are in white zones for corona?

سوال ۲: کدام شهرهای ایران در وضعیت قرمز کرونا هستند؟

Q2: Which cities in Iran are in red zones for corona?
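
Pairing questions by token overlap can be sketched with a Jaccard score over token sets. The threshold, function names, and English toy questions are assumptions for illustration. The example also shows why such pairs are useful: two questions can share almost every token and still not be paraphrases.

```python
from itertools import combinations


def token_overlap(question_a: str, question_b: str) -> float:
    """Jaccard overlap between the token sets of two questions."""
    tokens_a = set(question_a.lower().split())
    tokens_b = set(question_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def candidate_pairs(questions: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Index pairs whose overlap meets the (assumed) threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(questions)), 2)
        if token_overlap(questions[i], questions[j]) >= threshold
    ]


questions = [
    "which cities in iran are white zones for corona",
    "which cities in iran are red zones for corona",
    "what are startups",
]
score = token_overlap(questions[0], questions[1])
pairs = candidate_pairs(questions)
print(score, pairs)
```

The selected pairs still need a human label, since high overlap only makes a pair a hard candidate, not a paraphrase.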

13 of 21

Task 5: Question Paraphrasing (2)

Overview of the data collection pipeline:

Incomplete Seed Sentences → Google’s Auto-complete → Human Annotation

QQP Dataset → Google Translate (En-Fa) → Human Annotation

Output: Question Paraphrasing Dataset

Q1. آخرین روزهای زندگی خود را در کجا میخواهید سپری کنید؟

(Where do you want to spend your last days of life?)

Q2. آیا جهان موازی وجود دارد؟

(Are there any parallel universes?)

Q3. استارتاپ ها چیستند؟

(What are startups?)

14 of 21

Task 6: Translation (1)

Setup: We consider the task of translating a given English sentence into Persian, and vice versa.
Several existing resources [Kashefi, 2018; Prokopidis et al., 2016; Pilevar et al., 2011]
Our work: compile an evaluation set for comprehensive evaluation of MT.

آن کسانی که به جان غیب ایمان آرند و نماز به پا دارند و از هرچه روزیشان کردیم به فقیران انفاق کنند.

Who believe in the Unseen, are steadfast in prayer, and spend out of what We have provided for them.

15 of 21

Task 6: Translation (2)

The proposed combination:

Quran [Tiedemann and Nygaard, 2004]
Bible
Query Paraphrasing
Mizan [Kashefi, 2018]

Also an accompanying set for training models.

Our proposed evaluation sets consist of the following:

(i, ii) Evaluation sets based on the Quran and the Bible. These holy books have been translated into many languages, including English and Persian. In fact, several translations of these books are available, which makes them particularly helpful for the automatic evaluation of machine translation, since such metrics work best when provided with several gold standards.

(iii) QQP: we use the data obtained in the construction of the question paraphrasing task to create an evaluation set for translating questions.

(iv) Mizan: we use the evaluation subset of the Mizan corpus (Kashefi, 2018), which is acquired based on a manual alignment of famous literary works and their published Persian translations.

Overall, the combination of these four high-quality subsets yields an evaluation set that contains 47k sentences, from 4 different domains.

We also put together an accompanying training set that contains learning instances from these sets, as well as several other datasets. Feel free to check out the details in the paper.
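
To illustrate why several gold translations help automatic metrics, the sketch below computes a clipped n-gram precision against multiple references, the building block of BLEU-style scores. It is a simplified illustration, not a full BLEU implementation (no brevity penalty, no combination of n-gram orders), and the second reference is a made-up variant for the example.

```python
from collections import Counter


def modified_precision(hypothesis: str, references: list[str], n: int = 1) -> float:
    """Clipped n-gram precision of a hypothesis against one or more references."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp_counts = ngrams(hypothesis)
    if not hyp_counts:
        return 0.0
    # For each n-gram, the most times it occurs in any single reference.
    max_ref_counts: Counter = Counter()
    for reference in references:
        for gram, count in ngrams(reference).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Credit each hypothesis n-gram at most as often as some reference contains it.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


hypothesis = "those who believe in the unseen"
reference_a = "who believe in the unseen"
reference_b = "those who have faith in the unseen"  # made-up variant for illustration

single = modified_precision(hypothesis, [reference_a])
multi = modified_precision(hypothesis, [reference_a, reference_b])
print(single, multi)
```

With one reference, a valid wording choice is penalized; with two, the same hypothesis is fully covered, which is the effect that multiple published translations of the same text provide.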

16 of 21

Experiments: Setup

Human performance:

Evaluate annotations from three human annotators against the gold labels

Model performance:

Trained a variety of models

Persian LM:

ParsBERT [Farahani et al. 2020]
wikiBERT [Pyysalo et al. 2020]

Multilingual LM:

mBERT [Devlin et al. 2019]
mT5 [Xue et al. 2021]

17 of 21

Experimental Findings (1)

Finding 1: The proposed datasets are of decent quality.

Performance (F1)

LMs trained on ParsiNLU Reading Comprehension:
WikiBERT (base): 39.2
ParsBERT (base): 40.7
mBERT (base): 49.0
mT5 (large): 49.2
mT5 (XL): 70.4

H: 86.2

18 of 21

Experimental Findings (2)

Finding 1: The proposed datasets are of decent quality.
Finding 2: English models successfully transfer to Persian.

Performance (F1)

LMs trained on ParsiNLU Reading Comprehension:
WikiBERT (base): 39.2
ParsBERT (base): 40.7
mBERT (base): 49.0
mT5 (large): 49.2
mT5 (XL): 70.4

LMs trained on SQuAD:
mT5 (large): 67.4
mT5 (XL): 68.2

H: 86.2

19 of 21

Experimental Findings (3)

Finding 1: The proposed datasets are of decent quality.
Finding 2: English models successfully transfer to Persian. Joint training on English and Persian helps.

Performance (F1)

LMs trained on ParsiNLU Reading Comprehension:
WikiBERT (base): 39.2
ParsBERT (base): 40.7
mBERT (base): 49.0
mT5 (large): 49.2
mT5 (XL): 70.4

LMs trained on SQuAD:
mT5 (large): 67.4
mT5 (XL): 68.2

LMs trained on ParsiNLU Reading Comprehension + SQuAD:
mT5 (large): 73.6
mT5 (XL): 74.7

H: 86.2

20 of 21

Experimental Findings (4)

Finding 1: The proposed datasets are of decent quality.
Finding 2: English models successfully transfer to Persian. Joint training on English and Persian helps.
Finding 3: ParsiNLU has room for progress.

Performance (F1)

LMs trained on ParsiNLU Reading Comprehension:
WikiBERT (base): 39.2
ParsBERT (base): 40.7
mBERT (base): 49.0
mT5 (large): 49.2
mT5 (XL): 70.4

LMs trained on SQuAD:
mT5 (large): 67.4
mT5 (XL): 68.2

LMs trained on ParsiNLU Reading Comprehension + SQuAD:
mT5 (large): 73.6
mT5 (XL): 74.7

H: 86.2

21 of 21

ParsiNLU: Summary

We introduced ParsiNLU, a comprehensive benchmark of language understanding tasks for Persian.

The benchmark comprises six tasks.
Each task benefits from annotations by native speakers who work professionally in various areas of NLP, and builds on the available resources.

We evaluated state-of-the-art multilingual models on our proposed benchmark.

Comparing fine-tuned LM scores with human evaluations shows that work remains to be done.
