INFOTEXTCM: Addressing Code-mixed Data Retrieval Challenges via Text Classification
ADITYA SHARMA
DEPT. OF CSIS, BITS PILANI
WHY ADDRESS CODE-MIXED IR, AND WHAT ARE ITS CHALLENGES?
PROBLEM STATEMENT
DATASET
APPROACHES IMPLEMENTED IN THIS STUDY
Fig: Framework pipeline
PREPROCESSING APPROACHES
[1] S. Chanda, S. Pal, The effect of stopword removal on information retrieval for code-mixed data obtained via social media, SN Comput. Sci. 4 (2023) 494. doi:10.1007/s42979-023-01942-7.
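To make the preprocessing step concrete, below is a minimal sketch of stopword removal for Hindi-English code-mixed text in the spirit of [1]. The stopword lists are illustrative only (a few English function words plus a hand-picked set of romanized Hindi ones), not the lists evaluated in [1].

```python
# Minimal sketch of stopword removal for Hindi-English code-mixed text.
# The stopword lists below are illustrative, not the ones used in [1].

ENGLISH_STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "in", "for"}
HINDI_ROMAN_STOPWORDS = {"hai", "ka", "ki", "ke", "mein", "se", "aur", "ko", "par"}
STOPWORDS = ENGLISH_STOPWORDS | HINDI_ROMAN_STOPWORDS

def remove_stopwords(text: str) -> str:
    """Drop stopwords from both languages before indexing/retrieval."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(remove_stopwords("Movie ka climax bohot accha hai and the songs are great"))
# -> "movie climax bohot accha songs are great"
```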
Reranking Traditional Models with Pre-trained Language Models
[2] M. N. Raihan, D. Goswami, A. Mahmud, Mixed-Distil-BERT: Code-mixed language modeling for Bangla, English, and Hindi, arXiv preprint arXiv:2309.10272 (2023).
[3] S. Deode, J. Gadre, A. Kajale, A. Joshi, R. Joshi, L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT, 2023. arXiv:2304.11434.
[4] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding, 2020. arXiv:2007.01852.
[5] H. Su, W. Shi, J. Kasai, Y. Wang, Y. Hu, M. Ostendorf, W.-t. Yih, N. A. Smith, L. Zettlemoyer, T. Yu, One embedder, any task: Instruction-finetuned text embeddings, 2023. URL: https://arxiv.org/abs/2212.09741. arXiv:2212.09741.
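A minimal sketch of the two-stage setup this slide describes: a traditional lexical retriever produces candidates, and a pretrained multilingual sentence encoder reranks them. LaBSE [4] is used here via the sentence-transformers library; rank_bm25 stands in for the traditional retriever, and the corpus and query are toy examples, not the study's data.

```python
# Sketch: retrieve with BM25, then rerank the top-k with LaBSE embeddings [4].
# Corpus and query are toy examples.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Yeh phone ka camera bohot accha hai",
    "The battery life is disappointing",
    "Delivery time ekdum fast tha",
]
query = "camera quality kaisi hai"

# Stage 1: traditional lexical retrieval.
bm25 = BM25Okapi([d.lower().split() for d in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])[:3]

# Stage 2: rerank candidates by cosine similarity of sentence embeddings.
model = SentenceTransformer("sentence-transformers/LaBSE")
q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode([corpus[i] for i in top_k], convert_to_tensor=True)
sims = util.cos_sim(q_emb, d_emb)[0]
reranked = [top_k[j] for j in sims.argsort(descending=True)]
print([corpus[i] for i in reranked])
```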
Ranking using Fine-tuned SBERT Models
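A minimal sketch of fine-tuning a sentence-transformers model on (query, relevant document) pairs with MultipleNegativesRankingLoss, a standard recipe for retrieval fine-tuning. The base checkpoint and the training pairs below are placeholders, not the study's actual model or data.

```python
# Sketch: fine-tune an SBERT model on (query, relevant document) pairs.
# Base checkpoint and training pairs are placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/LaBSE")

# Each pair: a code-mixed query and a document judged relevant to it.
train_examples = [
    InputExample(texts=["camera quality kaisi hai",
                        "Yeh phone ka camera bohot accha hai"]),
    InputExample(texts=["battery kitni chalti hai",
                        "The battery life is disappointing"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: other pairs in the batch act as non-relevant documents.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
model.save("sbert-codemixed-ir")  # fine-tuned model then used for ranking
```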
Reranking SBERT Results Using GNN
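The slide does not detail the GNN architecture, so below is one plausible reading as a minimal plain-PyTorch sketch: the SBERT top-k candidates become graph nodes, edges connect candidates whose embeddings are mutually similar, one GCN layer smooths the features over the graph, and the propagated nodes are rescored against the query. The similarity threshold, layer count, and scoring head are all illustrative assumptions.

```python
# Sketch: rerank SBERT top-k candidates with a one-layer GCN over a
# document-similarity graph. Architecture and threshold are illustrative.
import torch
import torch.nn.functional as F

def gcn_rerank(doc_emb: torch.Tensor, query_emb: torch.Tensor,
               sim_threshold: float = 0.5) -> torch.Tensor:
    """doc_emb: (k, d) SBERT embeddings of top-k docs; query_emb: (d,).
    Returns candidate indices sorted by the graph-smoothed score."""
    k, d = doc_emb.shape
    x = F.normalize(doc_emb, dim=1)

    # Adjacency: connect documents whose cosine similarity clears a threshold.
    sim = x @ x.T
    adj = (sim > sim_threshold).float()
    adj.fill_diagonal_(1.0)  # self-loops

    # Symmetric normalization: D^{-1/2} A D^{-1/2}.
    deg_inv_sqrt = adj.sum(dim=1).clamp(min=1.0).pow(-0.5)
    a_hat = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]

    # One GCN layer (weights untrained here; learned in practice).
    w = torch.nn.Linear(d, d, bias=False)
    h = torch.relu(a_hat @ w(x))

    # Score each propagated node against the query embedding.
    scores = F.normalize(h, dim=1) @ F.normalize(query_emb, dim=0)
    return scores.argsort(descending=True)

# Usage: reorder the SBERT top-k by the reranked scores.
doc_emb, query_emb = torch.randn(5, 768), torch.randn(768)
print(gcn_rerank(doc_emb, query_emb))
```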
CONCLUSION
THANK YOU