1 of 21

ParsiNLU: A Suite of Language Understanding Challenges for Persian

Daniel Khashabi, Arman Cohan, Siamak Shakeri, Pedram Hosseini, Pouya Pezeshkpour, Malihe Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze Brahman, Sarik Ghazarian, Mozhdeh Gheini, Arman Kabiri, Rabeeh Karimi Mahabadi, Omid Memarrast, Ahmadreza Mosallanezhad, Erfan Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Samghabadi, Mahsa Shafaei, Saber Sheybani, Ali Tazarv, Yadollah Yaghoobzadeh

2 of 21

Motivation

  • Benchmarks such as GLUE have played a major role in advancing performance on NLU tasks.
  • However, such benchmarks, and the advances they bring about, focus disproportionately on a small subset of the world’s high-resource languages, most notably English.
  • This work introduces ParsiNLU, the first Persian NLU benchmark.
    • Goal: alleviate the resource scarcity of Persian and push forward NLP progress for it.

3 of 21

ParsiNLU: Overview

  1. Reading Comprehension
  2. Multiple-Choice QA
  3. Sentiment Analysis
  4. Textual Entailment
  5. Question Paraphrasing
  6. Machine Translation

  • 6 tasks
  • Manual annotations
  • External resources
  • Experiments with SOTA language models

4 of 21

Task 1: Reading Comprehension (1)

Setup: This task is defined as extracting a substring from a given context paragraph that answers a given question.

سوال: نهاوند جزو کدام استان است؟

Question: Nahavand is part of which province?

پاراگراف: نَهاوند شهری در غرب ایران است. این شهر در جنوب غربی استان همدان قرار گرفته است. نهاوند دارای جمعیت …

Paragraph: Nahavand (Navan) is a city in western Iran. This city is located in the southern part of Hamedan province and it is the capital of Nahavand. Nahavand has a population of …

پاسخ: همدان، استان همدان

Answer: Hamedan; Hamedan province
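
The setup above treats every answer as a substring of the context paragraph. A minimal sketch of that constraint follows, assuming an example is stored as a paragraph plus a list of acceptable answer strings. The `find_answer_spans` helper and the shortened English paragraph are illustrative assumptions, not the released data format.

```python
from typing import List, Tuple


def find_answer_spans(paragraph: str, answers: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence of each answer.

    Raises ValueError if an answer is not a substring of the paragraph,
    because the task requires answers to be extractable from the context.
    """
    spans = []
    for answer in answers:
        start = paragraph.find(answer)
        if start == -1:
            raise ValueError(f"answer {answer!r} is not a substring of the paragraph")
        spans.append((start, start + len(answer)))
    return spans


# Hypothetical example built from the English gloss on the slide.
paragraph = "This city is located in the southern part of Hamedan province."
answers = ["Hamedan", "Hamedan province"]
spans = find_answer_spans(paragraph, answers)
extracted = [paragraph[start:end] for start, end in spans]
```

Slicing the paragraph with the returned offsets gives back the answer strings, which is a cheap sanity check to run over a dataset of this kind.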

5 of 21

Task 1: Reading Comprehension (2)

Overview of the data collection pipeline:

  • Google’s Auto-complete
    • A seed set of question terms: Who, Where, …
    • Repeatedly querying parts
    • Popular questions of Persian-speaking users of Google
  • Questions
    • Filtering: short questions; open-ended questions with no concrete answers
  • Select minimal and coherent spans that contain the answer
  • Correct grammatical errors and typos
  • Output: Question, Answer, Paragraph
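
The question-mining step (seed terms, then repeated querying) can be sketched as a breadth-first expansion. This is only an illustration under stated assumptions: `suggest` stands in for an auto-complete service and is a hypothetical callable, `toy_suggest` is a stub over a made-up list, and re-querying word prefixes of each suggestion is one possible reading of "repeatedly querying parts". No real Google API is called.

```python
from collections import deque
from typing import Callable, Iterable, List, Set


def expand_questions(seed_terms: Iterable[str],
                     suggest: Callable[[str], List[str]],
                     max_queries: int = 100) -> Set[str]:
    """Collect suggestions by repeatedly querying seeds and parts of suggestions."""
    queue = deque(seed_terms)
    queried = set(queue)          # prefixes already scheduled, to avoid repeats
    questions: Set[str] = set()
    issued = 0
    while queue and issued < max_queries:
        prefix = queue.popleft()
        issued += 1
        for suggestion in suggest(prefix):
            questions.add(suggestion)
            words = suggestion.split()
            # Schedule every proper word prefix of the suggestion as a new query.
            for k in range(1, len(words)):
                part = " ".join(words[:k])
                if part not in queried:
                    queried.add(part)
                    queue.append(part)
    return questions


# Hypothetical stand-in for an auto-complete service.
TOY_SUGGESTIONS = [
    "who wrote the shahnameh",
    "who is hafez",
    "where is nahavand",
    "where is shiraz located",
]


def toy_suggest(prefix: str) -> List[str]:
    return [q for q in TOY_SUGGESTIONS if q.startswith(prefix)]


collected = expand_questions(["who", "where"], toy_suggest)
```

The `max_queries` cap keeps the expansion bounded, which matters once the stub is replaced by a rate-limited service.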

6 of 21

Task 2: Multiple-Choice QA (1)

  • Setup: given a natural-language question, pick the correct answer from a list of candidates.

  • A common format for evaluating fact retrieval [Richardson et al., 2013; Clark et al., 2020]

بزرگترین قاره‌ی جهان کدام است؟‌

✔ ۱) آسیا ۲)‌ اروپا ۳) آمریکا ۴) آفریقا

What is the largest continent in the world?

✔ 1) Asia 2) Europe 3) Americas 4) Africa

نجاری روزی یک صندلی و شاگردش در سه روز یک صندلی می‌سازد. اگر نجار و شاگردش با هم کارکنند، ۱۲ صندلی رو در چند روز می سازند؟

۱) ۱۲ ✔ ۲)‌ ۹ ۳) ۸ ۴) ۶

A carpenter makes a chair a day, and his apprentice makes a chair in three days. If the carpenter and his apprentice work together, in how many days will they make 12 chairs?

1) 12 ✔ 2) 9 3) 8 4) 6
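
The second example requires a small rate computation rather than fact retrieval. The worked arithmetic below is a sketch of why option 2 is marked correct; the variable names are illustrative.

```python
from fractions import Fraction

carpenter_rate = Fraction(1, 1)    # chairs per day: one chair a day
apprentice_rate = Fraction(1, 3)   # chairs per day: one chair in three days

combined_rate = carpenter_rate + apprentice_rate   # chairs per day when working together
days_for_12_chairs = Fraction(12) / combined_rate  # time = work / rate
```

Using `Fraction` keeps the computation exact, so the result can be compared directly against the integer answer options.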

7 of 21

Task 2: Multiple-Choice QA (2)

  • Construction:
    • (1) Collect existing exams in Persian:
      • Annual college entrance exams
      • Employment exams
    • (2) Run them through OCR
    • (3) Manually fix OCR mistakes

8 of 21

Task 3: Sentiment Analysis (1)

We explored two relatively under-investigated domains in Persian sentiment analysis:

  • Movie reviews
  • Online grocery shopping

Our sentiment labels are on a 5-point Likert scale, [-2, -1, 0, +1, +2], at two levels:

  • Document (review)-level
  • Aspect-level

Example: “It tastes good but it’s so expensive even with a special offer. It’s almost double the price of fresh meat.”

Labels:

  • Overall sentiment: negative (-1)
  • Taste: positive (+1)
  • Price: very negative (-2)
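
One way to hold the two label levels from the example above is a small record type. This is a sketch under assumptions: the `ReviewLabels` class, its field names, and the aspect keys are hypothetical and do not describe the released data format.

```python
from dataclasses import dataclass, field
from typing import Dict

LIKERT_SCALE = (-2, -1, 0, 1, 2)  # from very negative (-2) to very positive (+2)


@dataclass
class ReviewLabels:
    """Document-level and aspect-level sentiment for one review (illustrative schema)."""
    text: str
    overall: int
    aspects: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every score, overall or per aspect, must lie on the 5-point scale.
        for name, score in [("overall", self.overall), *self.aspects.items()]:
            if score not in LIKERT_SCALE:
                raise ValueError(f"{name} score {score} is outside {LIKERT_SCALE}")


example = ReviewLabels(
    text="It tastes good but it's so expensive even with a special offer.",
    overall=-1,
    aspects={"taste": 1, "price": -2},
)
```

Validating in `__post_init__` rejects out-of-range scores at construction time, so a malformed annotation fails early rather than during training.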

9 of 21

Task 3: Sentiment Analysis (2)

[Figure: Review sources]

Annotation process:

  1. Defining sentiment aspects
  2. Training annotators
  3. Final annotation

Food & beverages aspects:

  • Purchase value/price
  • Packaging
  • Delivery
  • Product quality
  • Nutritional value
  • Taste/smell

Movie review aspects:

  • Music
  • Sound
  • Directing
  • Story/screenplay
  • Acting/performance
  • Cinematography
  • Scene

Total annotated reviews: 2,423

  • Food & beverages: 1,917
  • Movie: 506

Annotation tasks and Cohen’s Kappa agreement:

  • Task 1: Assigning sentiment to a review (0.76, substantial)
  • Task 2: Tagging aspects in a review (0.49, moderate)
  • Task 3: Assigning sentiment to aspects (0.47, moderate)
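
For reference, Cohen’s Kappa for two annotators can be computed as below. This is a minimal sketch, not the authors’ evaluation script: the function names are illustrative, and the verbal bands ("moderate", "substantial") are assumed to follow the commonly used Landis and Koch convention, which matches the labels on this slide.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def interpret_kappa(kappa: float) -> str:
    """Verbal band for a kappa value (Landis and Koch convention, assumed)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


toy_kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

With this banding, the reported 0.76 falls in the "substantial" range and 0.49 and 0.47 fall in the "moderate" range.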

10 of 21

Task 4: Textual Entailment (1)

Setup: This task is defined as determining the 3-way relationship between two sentences:

  • Entailment
  • Neutral
  • Contradiction

Two sources of data:

  1. Based on natural sentences: writing a premise and hypothesis based on existing sentences
  2. Based on existing datasets: translating MNLI instances using Google Translate

Premise: Poor people in more than a couple of counties in Atlanta receive help from the Atlanta Legal Aid.

پیش فرض: مردم فقیر در بیش از چند شهرستان در آتلانتا از کمک حقوقی آتلانتا کمک میگیرند.

Hypothesis: Atlanta Legal Aid provides civil services to poor people in five metro Atlanta counties.

فرضیه: کمک حقوقی آتلانتا به مردم فقیر در پنج منطقه شهری آتلانتا خدمات مدنی ارائه می دهد.

11 of 21

Task 4: Textual Entailment (2)

Overview of the data collection pipeline:

  1. Sampling sentences with conjunctive adverbs from Persian Wikipedia
  2. MNLI dataset, translated with Google Translate (En-Fa)
  3. Human annotation and fixing translations, yielding the Textual Entailment dataset

Examples for source 1:

S1: فرانسه بارها اعلام داشته که بحران اقتصادی بین المللی است ، پس چاره کار هم باید جهانی باشد.

France has repeatedly stated that the economic crisis is international, so the solution must be global as well.

S2: صدام حسين می گويد ترجيح می دهد بميرد ولی به تبعيد نرود.

Saddam Hussein says he would rather die but not go into exile.

S3: پنج تن از آنها به ژاپن بازگشتند همچنین ممکن است بقیه ی ربوده شدگان هنوز زنده باشند.

Five of them came back to Japan. Also, the rest of the abductees might still be alive.

Example for source 2:

P: Corona virus spreads mainly between people who are in close contact with each other.

ویروس کرونا عمدتا بین افرادی که در تماس نزدیک با یکدیگر هستند ، گسترش می یابد.

H: People can be infected when droplets containing the virus are inhaled.

افراد هنگام استنشاق قطرات حاوی ویروس می توانند آلوده شوند.

12 of 21

Task 5: Question Paraphrasing (1)

Setup: This task is defined as determining whether two given questions are paraphrases or not.

Two sources of data:

  1. Based on natural sentences:
    • Mining questions using Google auto-complete
    • Creating pairs of questions with high token overlap
  2. Based on existing datasets:
    • Getting question pairs from the QQP dataset
    • Translating them using Google Translate

سوال ۱: کدام شهرهای ایران در وضعیت سفید کرونا هستند؟

Q1: Which cities in Iran are in white zones for corona?

سوال ۲: کدام شهرهای ایران در وضعیت قرمز کرونا هستند؟

Q2: Which cities in Iran are in red zones for corona?
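
The slide mentions pairing questions with high token overlap. One simple way to score overlap is Jaccard similarity over whitespace tokens, sketched below. The choice of Jaccard, the tokenization, and the `token_overlap` name are assumptions for illustration; the slides do not specify the measure actually used.

```python
def token_overlap(question_a: str, question_b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two questions."""
    tokens_a, tokens_b = set(question_a.split()), set(question_b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# The two Persian questions from this slide, without the "سوال" prefixes.
q1 = "کدام شهرهای ایران در وضعیت سفید کرونا هستند؟"
q2 = "کدام شهرهای ایران در وضعیت قرمز کرونا هستند؟"
overlap = token_overlap(q1, q2)  # the questions differ in one token (white vs. red)
```

The two example questions share all but one token yet ask about different zones, which suggests why high-overlap pairs make informative candidates: surface similarity alone does not settle the label.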

13 of 21

Task 5: Question Paraphrasing (2)

Overview of the data collection pipeline:

  1. Incomplete seed sentences, expanded with Google’s Auto-complete
  2. QQP dataset, translated with Google Translate (En-Fa)
  3. Human annotation, yielding the Question Paraphrasing dataset

Q1. آخرین روزهای زندگی خود را در کجا میخواهید سپری کنید؟

(Where do you want to spend your last days of life?)

Q2. آیا جهان موازی وجود دارد؟

(Are there any parallel universes?)

Q3. استارتاپ ها چیستند؟

(What are startups?)

14 of 21

Task 6: Translation (1)

  • Setup: We consider the task of translating a given English sentence into Persian, and vice versa.
  • Several resources already exist [Kashefi, 2018; Prokopidis et al., 2016; Pilevar et al., 2011]
  • Our work: compile a comprehensive evaluation set for MT.

آن کسانی که به جان غیب ایمان آرند و نماز به پا دارند و از هرچه روزیشان کردیم به فقیران انفاق کنند.

Who believe in the Unseen, are steadfast in prayer, and spend out of what We have provided for them.

15 of 21

Task 6: Translation (2)

  • The proposed combination:
    • Quran [Tiedemann and Nygaard, 2004]
    • Bible
    • Query Paraphrasing
    • Mizan [Kashefi, 2018]
  • Also an accompanying set for training models.

16 of 21

Experiments: Setup

  • Human performance:
    • Evaluate annotations from three human annotators against the gold labels

  • Model performance:
    • Trained a variety of models
      • Persian LMs:
        • ParsBERT [Farahani et al. 2020]
        • WikiBERT [Xue et al. 2021]
      • Multilingual LMs:
        • mBERT [Devlin et al. 2019]
        • mT5 [Xue et al. 2021]

17 of 21

Experimental Findings (1)

  • Finding 1: The proposed datasets are of decent quality.

Performance (F1), reading comprehension:

  • Human: 86.2
  • LMs trained on ParsiNLU Reading Comprehension:
    • WikiBERT (base): 39.2
    • ParsBERT (base): 40.7
    • mBERT (base): 49.0
    • mT5 (large): 49.2
    • mT5 (XL): 70.4
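
The scores on this slide are reported as F1. The slides do not define the metric, so the sketch below assumes a SQuAD-style token-overlap F1 with plain whitespace tokenization and no text normalization; `token_f1` and `best_f1` are illustrative names, not the authors’ evaluation code.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and one gold answer."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, gold_answers: List[str]) -> float:
    """Score against several acceptable answers by taking the best match."""
    return max(token_f1(prediction, gold) for gold in gold_answers)


partial = token_f1("Hamedan province", "Hamedan")
full = best_f1("Hamedan province", ["Hamedan", "Hamedan province"])
```

Taking the maximum over gold answers matters for examples with several acceptable spans, such as "Hamedan" and "Hamedan province" in the reading comprehension example.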

18 of 21

Experimental Findings (2)

  • Finding 1: The proposed datasets are of decent quality.
  • Finding 2: Models trained on English data (SQuAD) transfer successfully to Persian.

Performance (F1), reading comprehension:

  • Human: 86.2
  • LMs trained on ParsiNLU Reading Comprehension:
    • WikiBERT (base): 39.2
    • ParsBERT (base): 40.7
    • mBERT (base): 49.0
    • mT5 (large): 49.2
    • mT5 (XL): 70.4
  • LMs trained on SQuAD:
    • mT5 (large): 67.4
    • mT5 (XL): 68.2

19 of 21

Experimental Findings (3)

  • Finding 1: The proposed datasets are of decent quality.
  • Finding 2: Models trained on English data (SQuAD) transfer successfully to Persian. Joint training on English and Persian helps.

Performance (F1), reading comprehension:

  • Human: 86.2
  • LMs trained on ParsiNLU Reading Comprehension:
    • WikiBERT (base): 39.2
    • ParsBERT (base): 40.7
    • mBERT (base): 49.0
    • mT5 (large): 49.2
    • mT5 (XL): 70.4
  • LMs trained on SQuAD:
    • mT5 (large): 67.4
    • mT5 (XL): 68.2
  • LMs trained on ParsiNLU Reading Comprehension + SQuAD:
    • mT5 (large): 73.6
    • mT5 (XL): 74.7

20 of 21

Experimental Findings (4)

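  • Finding 1: The proposed datasets are of decent quality.
  • Finding 2: Models trained on English data (SQuAD) transfer successfully to Persian. Joint training on English and Persian helps.
  • Finding 3: ParsiNLU has room for progress: on reading comprehension, the best model (74.7 F1) still trails human performance (86.2 F1).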

Performance (F1), reading comprehension:

  • Human: 86.2
  • LMs trained on ParsiNLU Reading Comprehension:
    • WikiBERT (base): 39.2
    • ParsBERT (base): 40.7
    • mBERT (base): 49.0
    • mT5 (large): 49.2
    • mT5 (XL): 70.4
  • LMs trained on SQuAD:
    • mT5 (large): 67.4
    • mT5 (XL): 68.2
  • LMs trained on ParsiNLU Reading Comprehension + SQuAD:
    • mT5 (large): 73.6
    • mT5 (XL): 74.7

21 of 21

ParsiNLU: Summary

  • We introduced ParsiNLU, a comprehensive benchmark of language understanding tasks for Persian.
    • The benchmark comprises six tasks.
    • Each task builds on the available resources and on annotations by native speakers who work professionally in various areas of NLP.
  • We evaluated state-of-the-art multilingual models on our proposed benchmark.
    • Comparing fine-tuned LM scores with human evaluations shows that work remains to be done.