Democratizing Arabic Natural Language Processing

Founders: Zaid Alyafeai and Maged Saeed

Motivation

  • Machine learning has proven its importance in many fields, such as computer vision, NLP, reinforcement learning, and adversarial learning.
  • Unfortunately, little work has been done to make machine learning accessible to Arabic-speaking people.
  • Replicating and testing most NLP models, and collecting datasets for them, takes a lot of time.

Challenges

  • Arabic is morphologically rich and has many dialects.
  • Arabic script uses special characters called diacritics that help readers pronounce words correctly.
  • There is a lack of research and annotated data.
  • Open source is not popular, and a lot of research is closed source.
  • Arabic NLP applications are lacking.
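
To make the diacritics point concrete, here is a minimal Python sketch that strips the eight common short-vowel marks (Unicode U+064B to U+0652) from a string. The function name strip_diacritics and the choice of range are assumptions made for illustration, not the project's own code; a fuller treatment would also decide what to do with marks such as the superscript alef (U+0670).

```python
# Hypothetical helper for illustration only; not taken from the project's code.
# Covers FATHATAN..SUKUN (U+064B..U+0652), the most common Arabic diacritics.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def strip_diacritics(text: str) -> str:
    """Return text with the common Arabic diacritics removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


if __name__ == "__main__":
    # "kataba" written with a fatha after each letter.
    diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(strip_diacritics(diacritized))
```

Diacritization is the reverse task: predicting these marks for undiacritized text, which is much harder because the same letters can carry different marks depending on context.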

Mission

  • Create open-source projects.
  • Open the community's eyes to the significance of natural language processing.
  • Create interactive applications that help Arabic-speaking beginners learn more about machine learning.
  • Provide researchers and developers with model prototypes, reproducible results, and datasets.
  • Create different interfaces targeting different audiences.

Methodology

Tools

  • Python for coding ML and JavaScript for coding the web interface.
  • Google Colab: the interface of a Colab notebook is very similar to Jupyter notebooks, with slight differences. Google offers three hardware options (CPU, GPU, and TPU); the GPU and TPU accelerators speed up training.
  • TensorFlow.js: part of the TensorFlow ecosystem that supports training and inference of machine learning models in the browser.

Models

Datasets

Translation and Embedding

Diacritization

Sentiment Analysis

Meter Classification

Digits Classification

Object Detection

Captioning (1/2)

Captioning (2/2)

Contributions

We encourage people to work on different problems for Arabic. Look for issues labeled "good first issue" in our GitHub repo.

Limitations

  • It is difficult to host everything on GitHub, especially large datasets or models.
  • NLP models are getting bigger. It is difficult to run models like BERT in the browser without distillation or pruning.
  • Open-source contribution is lacking in the Arabic-speaking world.
  • Research is moving fast, and it is difficult to keep up without strong community support.

Future Work

  • Arabic tokenization library.
  • Data processors and visualizers.
  • Data scraping for unsupervised data collection.
  • Work on other models like DistilBERT and train them for Arabic.

Q & A