TEAM: BITS-P
Amulya Ratna Dash, Harpreet Singh Anand*, Yashvardhan Sharma
TRACK: Machine Translation for Indian Languages
Forum for Information Retrieval Evaluation 2023
Date: 17-12-2023
BITS Pilani, Pilani Campus
Problem Statement for FIRE 2023
Task 1
Build a machine translation model to translate sentences for the following language pairs:
Task 2
Build machine translation models for Governance and Healthcare domains. The language pairs for each domain are as follows:
1. Healthcare:
2. Governance:
Proposed Technique - Data Preprocessing
Step 1: Compiling the data
Proposed Technique - Data Preprocessing
Step 2: Classification
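One plausible realization of this classification step, given the Governance and Healthcare domains in Task 2, is zero-shot routing of each sentence to a domain label with an NLI model, in the spirit of Yin et al. [13]. The sketch below is only an illustration under that assumption; the checkpoint name and candidate labels are hypothetical and not taken from the slides.

from transformers import pipeline

# Hedged sketch: zero-shot domain routing for the classification step.
# The MNLI checkpoint and the two candidate labels are assumptions.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_domain(sentence: str) -> str:
    """Return the highest-scoring domain label for one sentence."""
    result = classifier(sentence, candidate_labels=["governance", "healthcare"])
    return result["labels"][0]

# Example: classify_domain("The ministry issued new vaccination guidelines.")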
Proposed Technique - Data Preprocessing
Step 3: Translating the data using a smaller subword vocabulary
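One way to run this translation step with the NLLB model (one of the two systems used in the pipeline below) is through Hugging Face transformers. The checkpoint, the English-to-Hindi direction, and the language codes in this sketch are illustrative assumptions; the slides do not state which pairs were processed this way.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hedged sketch: translating sentences with an NLLB checkpoint (assumed setup).
CKPT = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(CKPT, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

def translate_en_to_hi(sentence: str) -> str:
    """Translate one English sentence to Hindi with NLLB."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),
        max_length=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Example: translate_en_to_hi("Healthcare services were expanded in rural districts.")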
Proposed Technique (for FIRE 2023)
The proposed technique for both tasks is as follows (a code sketch follows the list):
1. Translate the sentences using both the IndicTrans and NLLB models.
2. Compute the sentence embedding of each generated translation with Google's MuRIL model.
3. Compute the cosine similarity of each of these embeddings with the embedding of the corresponding English sentence.
4. Compare the two cosine similarities and accept the translation with the higher score.
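Below is a minimal sketch of steps 2-4, assuming mean-pooled sentence embeddings from the publicly available google/muril-base-cased checkpoint and comparison against the English side of each pair; the pooling choice, checkpoint name, and helper names are illustrative assumptions rather than details taken from the slides.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Hedged sketch: pick between IndicTrans and NLLB outputs by comparing
# MuRIL embeddings against the English sentence (assumed setup).
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pooled MuRIL embedding for one sentence, shape (1, hidden_dim)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, dim)

def select_translation(english: str, indictrans_out: str, nllb_out: str) -> str:
    """Return the candidate translation closer to the English sentence."""
    ref = embed(english)
    candidates = {
        indictrans_out: F.cosine_similarity(ref, embed(indictrans_out)).item(),
        nllb_out: F.cosine_similarity(ref, embed(nllb_out)).item(),
    }
    return max(candidates, key=candidates.get)

In use, the selector would be called once per test sentence with the two system outputs, keeping whichever candidate it returns.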
Proposed Technique (for FIRE 2023)
1. Task 1
Proposed Technique (for FIRE 2023)
2. Task 2
Results (FIRE 2023)
Results (FIRE 2023)
2. Task 2: Domain-Specific Translation Task
References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[2] Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672 (2022).
[3] Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M Khapra, and Pratyush Kumar. 2021. IndicBART: A pre-trained model for indic natural language generation. arXiv preprint arXiv:2109.02903 (2021).
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Jay Gala, Pranjal A Chitale, Raghavan AK, Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, et al. 2023. IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages. arXiv preprint arXiv:2305.16307 (2023).
[6] Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, et al. 2021. Muril: Multilingual representations for Indian languages. arXiv preprint arXiv:2103.10730 (2021).
References
[7] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[8] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[9] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[10] Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Divyanshu Kakwani, Navneet Kumar, et al. 2022. Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association for Computational Linguistics 10 (2022), 145–162.
[11] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014).
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[13] Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. arXiv preprint arXiv:1909.00161 (2019).
References
[14] J. Philip, V. Namboodiri, and C. Jawahar. 2019. A baseline neural machine translation system for Indian languages. arXiv preprint arXiv:1907.12437 (2019). https://doi.org/10.48550/arxiv.1907.12437
[15] D. Aggarwal, V. Gupta, and A. Kunchukuttan. 2022. IndicXNLI: Evaluating multilingual inference for Indian languages. arXiv preprint arXiv:2204.08776 (2022).
[16] https://github.com/AI4Bharat/indicTrans
[17] https://ai.meta.com/research/no-language-left-behind/
Thank You