1 of 14

TEAM: BITS-P

Amulya Ratna Dash, Harpreet Singh Anand*, Yashvardhan Sharma

TRACK: Machine Translation for Indian Languages

Forum for Information Retrieval Evaluation 2023

Date: 17-12-2023

BITS Pilani

Pilani Campus


2 of 14

Problem Statement for FIRE 2023

Task 1

Build a machine translation model to translate sentences for the following language pairs:

  1. Hindi-Gujarati
  2. Gujarati-Hindi
  3. Hindi-Kannada
  4. Kannada-Hindi
  5. Hindi-Odia
  6. Odia-Hindi
  7. Hindi-Punjabi
  8. Punjabi-Hindi
  9. Hindi-Sindhi
  10. Urdu-Kashmiri
  11. Telugu-Hindi
  12. Hindi-Telugu

Task 2

Build machine translation models for the Governance and Healthcare domains. The language pairs for each domain are as follows:

  1. Healthcare:
    • Hindi-Gujarati
    • Gujarati-Hindi
    • Hindi-Kannada
    • Kannada-Hindi
    • Hindi-Odia
    • Odia-Hindi
    • Hindi-Punjabi
    • Punjabi-Hindi

  2. Governance:

    • Hindi-Gujarati
    • Gujarati-Hindi
    • Hindi-Kannada
    • Kannada-Hindi
    • Hindi-Odia
    • Odia-Hindi
    • Hindi-Punjabi
    • Punjabi-Hindi


3 of 14

Proposed Technique - Data Preprocessing

Step 1: Compiling the data

  1. In this step, we combine the data present in the different folders of the BPCC dataset.
  2. The data comes as pairs of text files containing sentences in an Indian language and their English translations.
  3. A CSV (comma-separated values) file is created with two columns: the sentences in a particular Indic language and their English translations. A minimal sketch of this step follows.
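
Below is a minimal Python sketch of this compilation step. The folder layout, file paths, and column names are illustrative assumptions, not the actual BPCC structure:

```python
import csv

def compile_to_csv(indic_file, english_file, out_csv):
    """Zip two line-aligned text files into a two-column CSV."""
    with open(indic_file, encoding="utf-8") as f_ind, \
         open(english_file, encoding="utf-8") as f_eng, \
         open(out_csv, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["indic_sentence", "english_sentence"])
        # The two files are assumed to be parallel: line i of one is the
        # translation of line i of the other.
        for indic, english in zip(f_ind, f_eng):
            writer.writerow([indic.strip(), english.strip()])

# Hypothetical paths for one language pair in one BPCC folder.
compile_to_csv("bpcc/hi-en/train.hi", "bpcc/hi-en/train.en", "hi_en.csv")
```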


4 of 14

Proposed Technique - Data Preprocessing

Step 2: Classification

  1. The second task requires machine translation models better suited to the healthcare and governance domains.
  2. To achieve better results, we classify the dataset into healthcare-related and governance-related sentences.
  3. We do this with the BART-large-mnli model (by Facebook/Meta AI) for zero-shot classification.
  4. Classification is performed on the English translations of the sentences, with the candidate labels “Healthcare”, “Governance” and “Others”.
  5. The data is then copied to another CSV file containing the sentences in the Indic language, their English translations, and the assigned domain label. A minimal sketch of this step follows.
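
Below is a minimal sketch of this step using the Hugging Face transformers zero-shot pipeline. The CSV file names carry over from the (illustrative) Step 1 sketch; batching and error handling are omitted:

```python
import csv
from transformers import pipeline

# Zero-shot classifier backed by facebook/bart-large-mnli.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
labels = ["Healthcare", "Governance", "Others"]

with open("hi_en.csv", encoding="utf-8") as f_in, \
     open("hi_en_labeled.csv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(["indic_sentence", "english_sentence", "domain"])
    for row in reader:
        # Classify on the English side and keep the top-scoring label.
        result = classifier(row["english_sentence"], candidate_labels=labels)
        writer.writerow([row["indic_sentence"],
                         row["english_sentence"],
                         result["labels"][0]])
```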


5 of 14

Proposed Technique - Data Preprocessing

Step 3: Translating the data

  1. In this step, we translate the data from one Indic language to another without using English as a pivot language.
  2. We achieve this with the Indic2Indic pipeline of the IndicTrans model.
  3. IndicTrans is a Transformer-4x (~434M parameters) multilingual NMT model trained on the Samanantar dataset, the largest publicly available parallel corpus collection for Indic languages at the time of its release (14 April 2021).
  4. It is a single-script model, i.e., it converts all the Indic data to the Devanagari script, which allows better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary across Indic languages, and allows using a smaller subword vocabulary. A short illustration of this script conversion follows.
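
The script-conversion idea can be sketched with the indic-nlp-library, which performs a rule-based Unicode mapping between Indic scripts; the example sentence and language codes below are our own illustration, not part of the IndicTrans pipeline itself:

```python
# pip install indic-nlp-library
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

gujarati = "હું ઘરે જાઉં છું"  # illustrative Gujarati sentence ("I am going home")
# Rule-based mapping of Gujarati Unicode codepoints to Devanagari ones.
devanagari = UnicodeIndicTransliterator.transliterate(gujarati, "gu", "hi")
print(devanagari)
```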


6 of 14

Proposed Technique (for FIRE 2023)

Following is the proposed technique for the tasks:

  1. Task 1

    1. Translate the sentences using both the IndicTrans and NLLB models.
    2. Compute the sentence embedding of each generated translation using Google's MuRIL model.
    3. Compute the cosine similarity of each of these embeddings with the embedding generated for the English translation.
    4. Compare the two cosine similarities and accept the translation with the higher cosine similarity. A sketch of this selection step follows.
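
Below is a minimal sketch of this selection step. Mean pooling over MuRIL's token embeddings is our assumption, since the slides do not specify how the sentence embeddings are pooled:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

def embed(sentence):
    """Mean-pooled MuRIL token embeddings as a sentence embedding
    (pooling strategy is our assumption)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)    # mask out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def pick_translation(english, indictrans_out, nllb_out):
    """Keep the candidate whose embedding is closest to the English reference."""
    ref = embed(english)
    sims = [torch.nn.functional.cosine_similarity(ref, embed(cand)).item()
            for cand in (indictrans_out, nllb_out)]
    return indictrans_out if sims[0] >= sims[1] else nllb_out
```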


7 of 14

Proposed Technique (for FIRE 2023)


8 of 14

Proposed Technique (for FIRE 2023)

2. Task 2

  1. Classify the data of around 10 lakh (1 million) Odia and Hindi sentences.
  2. Split the data into two parts: healthcare-related and governance-related.
  3. Fine-tune the NLLB model separately on each part to obtain two domain-specific models, one for healthcare and one for governance. A hedged fine-tuning sketch follows this list.
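
Below is a hedged sketch of this fine-tuning step with Hugging Face transformers. The specific NLLB checkpoint, language codes, CSV column names, and hyperparameters are illustrative assumptions; the slides only say "the NLLB model":

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Assumed checkpoint and FLORES-200 language codes (here Hindi -> Odia).
ckpt = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(ckpt, src_lang="hin_Deva",
                                          tgt_lang="ory_Orya")
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

# One CSV per domain, as produced by the classification step; the file
# and column names are hypothetical.
data = load_dataset("csv", data_files={"train": "healthcare_hi_or.csv"})

def tokenize(batch):
    # `text_target` tokenizes the Odia side under the target language code.
    return tokenizer(batch["hindi"], text_target=batch["odia"],
                     truncation=True, max_length=128)

train = data["train"].map(tokenize, batched=True,
                          remove_columns=data["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="nllb-healthcare",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # repeat with the governance CSV for the second model
```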


9 of 14

Results (FIRE 2023)

  1. TASK 1: General Translation Task


10 of 14

Results (FIRE 2023)

2. TASK 2: Domain-Specific Translation Task


11 of 14

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

[2] Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672 (2022).

[3] Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M Khapra, and Pratyush Kumar. 2021. IndicBART: A pre-trained model for indic natural language generation. arXiv preprint arXiv:2109.02903 (2021).

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[5] Jay Gala, Pranjal A Chitale, Raghavan AK, Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, et al. 2023. IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages. arXiv preprint arXiv:2305.16307 (2023).

[6] Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, et al. 2021. Muril: Multilingual representations for indian languages. arXiv preprint arXiv:2103.10730 (2021).


12 of 14

References

[7] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).

[8] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).

[9] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.

[10] Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Divyanshu Kakwani, Navneet Kumar, et al. 2022. Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association for Computational Linguistics 10 (2022), 145–162.

[11] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014).

[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).

[13] Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. arXiv preprint arXiv:1909.00161 (2019).


13 of 14

References

[14] Jerin Philip, Vinay Namboodiri, and C. V. Jawahar. 2019. A baseline neural machine translation system for Indian languages. arXiv preprint arXiv:1907.12437 (2019).

[15] Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. 2022. IndicXNLI: Evaluating multilingual inference for Indian languages. arXiv preprint arXiv:2204.08776 (2022).

[16] https://github.com/AI4Bharat/indicTrans

[17] https://ai.meta.com/research/no-language-left-behind/


14 of 14

Thank You
