1 of 49

Computational Modeling of Gender in Arabic

Nizar Habash

New York University Abu Dhabi

2 of 49

The New York University Global Network
شبكة جامعة نيويورك العالمية

3 of 49

New York University Abu Dhabi
جامعة نيويورك أبوظبي

4 of 49

CAMeL Lab مختبر «كامل» Computational Approaches to Modeling Language Lab

@CamelNlp

scholar.camel-lab.com

http://www.camel-lab.com

Started in 2014
Research Areas: Core Arabic & Arabic dialect NLP; Resource and tool development; Machine translation; Dialogue systems
120+ publications / 20+ resources

6 of 49

Roadmap

Introduction
Arabic Gender and Other Phenomena
Arabic Gender and Language Technologies
Arabic Gender Rewriting
Outlook

7 of 49

Arabic and its Variants

Classical Arabic: Quranic Arabic; historical texts
Modern Standard Arabic: official language; language of news & media; standard writing & grammar; "The National Language"
Dialectal Arabic: predominantly spoken; no official standardization; "The Mother Tongue"; lots of variation from MSA; increasing use on social media

8 of 49

Arabic Orthographic Ambiguity

Arabic script uses optional diacritical marks.
1.5% of newspaper words have some diacritical marks.
Standard Arabic has, on average, 6.8 diacritizations and 2.7 lemmas per word.

ولعين

وَلِعِينَ وَلِعَينٍ وَلَعِينٌ

Infatuated (m.pl) # and for an eye/spring # and cursed
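The ambiguity above can be reproduced in a few lines. This is a minimal sketch, assuming that removing the code points U+064B to U+0652 (plus U+0670) is enough to undo diacritization for these examples; the function name `strip_diacritics` is illustrative and not taken from any particular toolkit.

```python
import re

# Assumption: these code points cover the diacritics used in the examples
# (fathatan through sukun, plus the dagger alif). Not a full normalizer.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(word: str) -> str:
    """Remove optional Arabic diacritical marks from a word."""
    return DIACRITICS.sub("", word)

# The three diacritized readings from the slide, with their glosses.
READINGS = {
    "وَلِعِينَ": "infatuated (m.pl)",
    "وَلِعَينٍ": "and for an eye/spring",
    "وَلَعِينٌ": "and cursed",
}

# All three readings collapse to one written form once diacritics are dropped.
undiacritized = {strip_diacritics(word) for word in READINGS}
print(len(READINGS), "readings ->", len(undiacritized), "written form")
```

A reader (or an NLP system) that sees only the bare form has to recover the intended reading from context.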

9 of 49

وسنقولها

/wasanaqūluhā/

و+ س+ ن+ قول + ها

wa+sa+na+qūl+u+hā

and+will+we+say+it

And we will say it
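As a small illustration of the analysis above, the segments can be stored as data and recombined. This is only a sketch: the tuple layout is an assumption made for the example, and the mood vowel "u" from the transliteration is left out because it is not written in undiacritized script.

```python
# Morpheme-by-morpheme analysis copied from the slide.
# Each entry: (Arabic segment, transliteration, English gloss).
SEGMENTS = [
    ("و", "wa", "and"),
    ("س", "sa", "will"),
    ("ن", "na", "we"),
    ("قول", "qūl", "say"),
    ("ها", "hā", "it"),
]

# Concatenating the segments gives the single written word.
surface = "".join(segment for segment, _, _ in SEGMENTS)
gloss = "+".join(english for _, _, english in SEGMENTS)

print(surface, "=", gloss)
print(len(SEGMENTS), "morphemes in one", len(surface), "letter word")
```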

قال، قالت، قالا، قالوا، قلتَ، قلتِ، قلتما، قلتم، قلتن،

يقولُ، يقولَ، يقل، تقولُ، تقولَ، تقل، تقولين، تقولي،

... فقال، فقالت، فقالا ...

... وسأقولها، وسنقولها، ...

Arabic Morphological Richness

Arabic has a very rich inflectional system.
Gender, number, person, aspect, voice, mood, case, state, and many clitics.
For example, Arabic verbs have 5,400 inflected forms, whereas English verbs have 6 and Chinese verbs have 1!

10 of 49

Arabic Morphological Complexity: Beyond the Binary

Form-Function Discrepancy
8.6% of nominals not “cis-gender”
8.7% of nominals not “cis-number”
Functionally M with F form endings: خليفة Xaliyfah ‘Caliph’; أسامة ÂusAmah ‘Ossama’
Functionally F with M form endings: شمس šams ‘sun’; عين Eayn ‘eye’
Two functional genders: طريق Tariyq ‘road’ (F+M)
Mix of gender and number (FS/MP): سحرة saHarah ‘magicians’

(Alkuhlani & Habash, 2011)
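One way to picture the form-function distinction is to store the two genders as separate fields. The sketch below is illustrative only: the `Nominal` class and its field names are assumptions, the masculine form value for طريق is inferred from its lack of a feminine ending, and the regular entry كريمة is added for contrast.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Nominal:
    word: str
    gloss: str
    form_gender: str        # gender suggested by the ending: "M" or "F"
    functional_gender: str  # gender that drives agreement: "M", "F" or "F+M"

    def has_discrepancy(self) -> bool:
        """True when the surface form does not predict the functional gender."""
        return self.form_gender != self.functional_gender

LEXICON = [
    Nominal("خليفة", "Caliph", "F", "M"),
    Nominal("أسامة", "Ossama", "F", "M"),
    Nominal("شمس", "sun", "M", "F"),
    Nominal("عين", "eye", "M", "F"),
    Nominal("طريق", "road", "M", "F+M"),
    Nominal("كريمة", "generous [fs]", "F", "F"),  # regular: form matches function
]

mismatched = [n.gloss for n in LEXICON if n.has_discrepancy()]
print(mismatched)
```

Keeping both fields means an agreement checker can consult the functional gender while a generator still has access to the surface form.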

11 of 49

Arabic Morphological Complexity: Beyond the Binary

Semantics of the feminine singular ending ة ah
Grammatical Gender: كريم / كريمة kariym / kariymah ‘generous [ms] / generous [fs]’
Biological Sex: أمير / أميرة Âamiyr / Âamiyrah ‘prince / princess’
Collective-Singulative: نمل / نملة naml / namlah ‘ants (type) / one ant’
Exaggerative: نابغ / نابغة nAbig / nAbigah ‘smart / genius’
Ad hoc: مكتب / مكتبة maktab / maktabah ‘office / library’
Singular-Plural (dialectal): فرنسي / فرنسية faransiy / faransiyya ‘French [ms] / French [fs,p]’

12 of 49

Complex Morphosyntactic Agreement

Arabic inflects for gender
Feminine & masculine with nouns, verbs and adjectives
Templatic and affixational morphemes
Form and function discrepancy
Complex gender agreement rules

Example (words in reading order: word, gloss, agreement note):
سكنت	live FS
الموسيقية	the-musician FS
الطرشاء	the-deaf FS	templatic MS
في	in
ثلاث	three M	inverse gender agreement in numbers
مدن	cities FP	templatic MS
شهيرة	famous FS	irrationality agreement P >> FS

Evelyn Glennie

13 of 49

Complex Morphosyntactic Agreement

(Alkuhlani & Habash, 2011)

14 of 49

A Note on Gender Neutral Arabic: Existing Tactics

… beyond the Masculine Generic

Gender-neutral pronouns

أنا، نحن، هما، أنتما ‘I, we, they two, you two’, plus dialectal انتو (y’all)

Diacritic free writing that allows for ambiguous references

كتابك ktAbk ‘your book’ can be read as كتابُكَ kitAbuka [m] or كتابُكِ kitAbuki [f]

Constructions to avoid specifying gender:

Instead of أنا سعيد جدا / أنا سعيدة جدا ‘I am very happy [m] / [f]’, use أنا كلي سعادة ‘I am all-of-me happiness’ or أحس بالسعادة ‘I feel happiness’

Instead of ما اسمكَ / اسمكِ؟ ‘What is your [m] / [f] name?’, use الاسم الكريم؟ ‘the good name?’

15 of 49

A Note on Gender Neutral Arabic: Emerging Tactics

Paired forms

السيد(ة) الوالد(ة) أو ولي(ة) الأمر

Mr.(f) Parent^m (f) or Guardian^m (f)

يوم في حياة موظف/ة

A day in the life of an employee^m/f

نشكركم/ن على مساندتكم/ن و مشاركتكم/ن المتواصلة

We thank you[mp/fp] for your[mp/fp] support and your[mp/fp] continued participation

16 of 49

A Note on Gender Neutral Arabic: Experimental Tactics

New (Queer) pronouns

أنتم + أنتن => أنتمن
you^mp + you^fp => you^p
antum + antunna => antumunna

Ibdal factory for Language and Queer Translations

17 of 49

A Note on Gender Neutral Arabic: Experimental Tactics

New (Queer) Grammar

أنهما كويريات مسلمين
They^D are Queer^FP Muslim^MP/D

A post about Mauree Turner, member of the Oklahoma House of Representatives.
They are the first publicly non-binary U.S. state lawmaker and the first Muslim member of the Oklahoma Legislature.

18 of 49

Roadmap

Introduction
Arabic Gender and Other Phenomena
Arabic Gender and Language Technologies
Arabic Gender Rewriting
Outlook

19 of 49

MT 2023

Gender translation errors in machine translation for morphologically rich languages persist

male doctor

female nurse

20 of 49

The world is biased. The data is biased. The models are biased.

But even if we fix all of these, we will still have a problem!

NLP systems are mostly gender-unaware single-output systems.

male doctor

female nurse

21 of 49

NLP systems are mostly fragile gender-unaware single-output systems.

male nurse

female nurse

22 of 49

NLP systems are mostly really fragile gender-unaware single-output systems.

male doctor

23 of 49

MT 2023: inconsistent outputs

[Figure: MT outputs marked ♂ ♀ ♂ ♀ ♂]

Post editing as a grammatical error correction task?

24 of 49

MT 2023: inconsistent agreement

Masculine noun agreement is good.
Feminine noun agreement is bad in 8 out of 18 cases (44%)

25 of 49

ChatGPT MT 2023: better, but…

Consistent output, but with typical single-output gender bias

26 of 49

27 of 49

Roadmap

Introduction
Arabic Gender and Other Phenomena
Arabic Gender and Language Technologies
Arabic Gender Rewriting
Outlook

28 of 49

Arabic Gender Rewriting Task

We define the task of Gender Rewriting:

Generating alternatives of a given sentence to match different target user gender contexts

Work with my PhD student, Bashar Alhafni

31 of 49

Arabic Gender Rewriting Task

We define the task of Gender Rewriting:

Generating alternatives of a given sentence to match different target user gender contexts

We focus on Arabic, a gender-marking morphologically rich language

Input: Arabic Sentence, Target User Gender
Output: Gender Rewritten Sentences

NLP System output: أنا طبيب رائع “I am a wonderful [male] doctor”
Gender Rewriting System, Target Gender: Feminine -> أنا طبيبة رائعة
Gender Rewriting System, Target Gender: Masculine -> أنا طبيب رائع
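The input/output contract of the task can be sketched with a toy word-lookup rewriter. This is not the system described in the talk: the two-entry table, the function names, and the "M"/"F" codes are assumptions made purely to show the interface (a sentence plus a target user gender in, one rewritten sentence per target out).

```python
# Toy masculine -> feminine word pairs taken from the slide's example.
# A real system needs context and morphology; this table is illustrative only.
M2F = {"طبيب": "طبيبة", "رائع": "رائعة"}
F2M = {feminine: masculine for masculine, feminine in M2F.items()}

def rewrite(sentence: str, target_gender: str) -> str:
    """Rewrite a sentence word by word toward the target gender ("M" or "F")."""
    table = M2F if target_gender == "F" else F2M
    return " ".join(table.get(word, word) for word in sentence.split())

def alternatives(sentence: str) -> dict:
    """Return one rewritten sentence per target user gender."""
    return {gender: rewrite(sentence, gender) for gender in ("M", "F")}

nlp_output = "أنا طبيب رائع"  # "I am a wonderful [male] doctor"
for gender, text in alternatives(nlp_output).items():
    print(gender, text)
```

Words with no entry in the table, such as the pronoun أنا, pass through unchanged, which mirrors the fact that only some words in a sentence are gender specific.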

32 of 49

Arabic Parallel Gender Corpus (APGC) v2.0 (Alhafni et al., 2022a)

We developed an Arabic parallel gender corpus

First- and second-person sentences; v1.0 (Habash et al., 2019) was first person only

Selected from the OpenSubtitles 2018 dataset (Lison & Tiedemann, 2016)

58,000 English-Arabic sentences containing first- and second-person pronouns: I, me, my, mine, myself, and you, your, yours, yourself

The genders of both speaker and listener are identified: B, 1M/B, B/2M, 1F/B, B/2F, 1M/2M, 1M/2F, 1F/2M, 1F/2F

Parallel M and F versions were introduced in the case of M or F subtags

Keep word order and count, and maintain grammatical agreement
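The nine sentence-level labels listed above are the combinations of three speaker values and three listener values. The sketch below enumerates them; the collapsing of the fully ambiguous pair into a plain "B" follows the label list on the slide, while the function names are assumptions.

```python
from itertools import product

SPEAKER = ["B", "1M", "1F"]   # ambiguous, first-person masculine / feminine
LISTENER = ["B", "2M", "2F"]  # ambiguous, second-person masculine / feminine

def sentence_labels() -> list:
    """Enumerate speaker/listener labels; a fully ambiguous pair is just "B"."""
    labels = []
    for speaker, listener in product(SPEAKER, LISTENER):
        if speaker == "B" and listener == "B":
            labels.append("B")
        else:
            labels.append(f"{speaker}/{listener}")
    return labels

def needs_parallel_versions(label: str) -> bool:
    """Sentences with any M or F subtag get parallel M and F versions."""
    return label != "B"

LABELS = sentence_labels()
print(LABELS)
```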

33 of 49

Arabic Parallel Gender Corpus (APGC) v2.0 (Alhafni et al., 2022a)

Word-level gender annotations: Masculine (1M/2M), Feminine (1F/2F), ambiguous (B)
Extended word-level gender annotations (base form + enclitic): 1M+B, 2M+B, 1M+2F, etc.
Five balanced parallel corpora + English
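An extended word-level label pairs a tag for the base form with a tag for the enclitic. A minimal parser might look like the following; the `WordLabel` type and the set of valid tags are assumptions based on the examples on the slide, not the corpus's official tooling.

```python
from typing import NamedTuple

class WordLabel(NamedTuple):
    base: str      # gender tag of the base word
    enclitic: str  # gender tag of the attached pronoun, if any

# Assumed tag inventory: ambiguous plus first/second person masculine/feminine.
VALID_TAGS = {"B", "1M", "1F", "2M", "2F"}

def parse_word_label(label: str) -> WordLabel:
    """Split an extended label such as "1M+2F" into base and enclitic tags."""
    base, separator, enclitic = label.partition("+")
    if not separator or base not in VALID_TAGS or enclitic not in VALID_TAGS:
        raise ValueError(f"not an extended word-level label: {label!r}")
    return WordLabel(base, enclitic)

for example in ("1M+B", "2M+B", "1M+2F"):
    print(example, "->", parse_word_label(example))
```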

34 of 49

Arabic Parallel Gender Corpus (APGC) v2.0 (Alhafni et al., 2022a)

80,326 Sentences (597K words). 54% contained gendered references
10% of the words are gender specific
70% Train; 10% Dev; 20% Test

35 of 49

Multi-User Gender Rewriting Model (Alhafni et al., 2022b)

I. Gender Identification (GID):

Identify word-level gender label (base form + enclitic)
Fine-tuned CAMeLBERT MSA (Modern Standard Arabic)

II. Out-of-context Word Gender Rewriting:

Corpus-based Rewriter (CorpusR):

Morphological Rewriter (MorphR):

Morphological analyzer and generator (Obeid et al., 2020)

Neural Rewriter (NeuralR):

Seq2Seq with side-constraints
Input: <1F+B>طبيب “[male] doctor”; Output: طبيبة “[female] doctor”
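The side-constraint idea in the example above is to prepend the target label to the input as one special token. The sketch below shows only that input formatting, not the model; the character-level tokenization and the function name are assumptions for illustration.

```python
def encode_with_side_constraint(word: str, target_label: str) -> list:
    """Build a seq2seq input: one tag token followed by the word's characters.

    The tag tells the model which target gender to produce
    (character-level splitting is assumed here for illustration).
    """
    return [f"<{target_label}>", *word]

EXAMPLE_WORD = "طبيب"  # "[male] doctor"
tokens = encode_with_side_constraint(EXAMPLE_WORD, "1F+B")
print(tokens)
```

Because the constraint is part of the input sequence, one trained model can serve every target label instead of needing a separate model per gender.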

III. In-context Ranking & Selection:

Pseudo-log-likelihood (PLL) scores, as defined by Salazar et al. (2020), computed with CAMeLBERT MSA
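A pseudo-log-likelihood score sums, over every position, the masked language model's log-probability of the true token when that position is masked. The sketch below shows this scoring and the selection step with a toy stand-in for the model: `toy_log_prob` is a hypothetical scorer that simply rewards gender-ending agreement, and the callable interface is an assumption, not the API of any library.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: log P(tokens[i] | all other tokens), i.e. the
# masked-LM log-probability of position i when that position is masked.
MaskedLogProb = Callable[[Sequence[str], int], float]

def pseudo_log_likelihood(tokens: Sequence[str], log_prob: MaskedLogProb) -> float:
    """Sum the masked log-probabilities over all positions."""
    return sum(log_prob(tokens, i) for i in range(len(tokens)))

def select_best(candidates, log_prob: MaskedLogProb):
    """Pick the candidate sentence with the highest pseudo-log-likelihood."""
    return max(candidates, key=lambda c: pseudo_log_likelihood(c, log_prob))

def toy_log_prob(tokens: Sequence[str], i: int) -> float:
    """Toy scorer (not a real model): high probability when the word's
    feminine ending matches the other non-initial words."""
    others = [t for j, t in enumerate(tokens) if j != i and j != 0]
    if i == 0 or not others:
        return math.log(0.9)
    feminine = tokens[i].endswith("ة")
    agrees = all(t.endswith("ة") == feminine for t in others)
    return math.log(0.9 if agrees else 0.1)

GOOD = ["أنا", "طبيبة", "رائعة"]  # noun and adjective agree
BAD = ["أنا", "طبيبة", "رائع"]    # adjective left in the masculine
best = select_best([BAD, GOOD], toy_log_prob)
print(" ".join(best))
```

In practice the toy scorer would be replaced by calls to a masked language model, one forward pass per masked position.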

36 of 49

Evaluation & Results

Evaluation:

Treat the gender rewriting problem as a user-aware GEC task
MaxMatch (M²) Scorer: Precision, Recall, F0.5 (Dahlmeier and Ng, 2012)
BLEU (Papineni et al., 2002)
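F0.5 is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The helper below is a small sketch of the formula only; the precision and recall values in the usage lines are made-up numbers for illustration, not results from the talk.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Illustrative values only: high precision, lower recall.
p, r = 0.8, 0.4
f_half = f_beta(p, r)          # F0.5
f_one = f_beta(p, r, beta=1.0) # F1, for comparison
print(round(f_half, 4), round(f_one, 4))
```

With precision above recall, F0.5 comes out higher than F1, which suits an error-correction setting where unnecessary edits are costly.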

Baselines:

Do Nothing

Joint Model (Alhafni et al., 2020)

Sentence-level gender identification and rewriting model
Character-level Seq2Seq model with word level morphological features
Learned representation for the target user gender

37 of 49
