
INFOTEXTCM: Addressing Code-mixed Data Retrieval Challenges via Text Classification

ADITYA SHARMA

DEPT. OF CSIS, BITS PILANI


WHY ADDRESS CODE-MIXED IR, AND WHAT ARE THE CHALLENGES?

  • Growing popularity of code-mixed languages on social media and other digital platforms.
  • Lack of resources and tools for processing code-mixed text.
  • Challenges:
    • Language identification, complicated by transliteration
      • e.g., "tin" means "three" in romanized Bengali
      • but refers to a metal container or the chemical element in English
    • Syntactic diversity: differing grammatical structures and morphological complexity across the mixed languages
    • Low-resource languages


TASK PROBLEM



DATASET

  • The provided dataset consists of multilingual documents in TREC format.
  • Code-mixed Bengali and English languages.
  • Train queries: 20 queries in code-mixed English and Bengali.
  • Additionally, a relevance-judgment (qrels) file with human annotations marking each document as relevant or non-relevant for a given query (a small loading sketch follows).
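Below is a minimal Python sketch of loading TREC-style relevance judgments; the file name and the exact column layout (query id, iteration, document id, relevance label) are assumptions for illustration, not the exact files provided with the dataset.

```python
from collections import defaultdict

def load_qrels(path):
    """Map each query id to {doc_id: relevance label}."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            qid, _iteration, doc_id, rel = parts
            qrels[qid][doc_id] = int(rel)
    return qrels

# Hypothetical usage: qrels = load_qrels("train_qrels.txt")
```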


APPROACHES IMPLEMENTED DURING THIS STUDY

  • Preprocessing approaches (stopword removal and stemming)
  • Reranking with pre-trained language models (LMs)
  • Ranking using fine-tuned Sentence-BERT (SBERT) language model
  • Reranking SBERT ranking results using Graph Neural Networks (GNNs)


Fig: Framework pipeline


PREPROCESSING APPROACHES

  • Stopword removal (a minimal sketch follows the reference below)
    • NLTK (both English and Bengali libraries)
    • Terrier English stopword list
    • SMART English stopword list
    • Custom English and Bengali stopword list from research article [1].

[1] S. Chanda, S. Pal, The effect of stopword removal on information retrieval for code-mixed data obtained via social media, SN Comput. Sci. 4 (2023) 494. doi:10.1007/S42979-023-01942-7.
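As an illustration, here is a minimal Python sketch of the NLTK-based stopword-removal step, assuming the installed NLTK stopwords corpus provides both the English and Bengali lists mentioned above; the tokenizer choice and example usage are illustrative only.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads (uncomment on first run):
# nltk.download("stopwords"); nltk.download("punkt")

# English plus Bengali lists, as used on this slide.
STOPWORDS = set(stopwords.words("english")) | set(stopwords.words("bengali"))

def remove_stopwords(text):
    """Drop stopwords from a code-mixed Bengali-English string."""
    tokens = word_tokenize(text)
    return " ".join(tok for tok in tokens if tok.lower() not in STOPWORDS)

# Hypothetical usage: clean = remove_stopwords("this is an example query")
```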


Reranking Traditional Models with Pre-trained Language Models

  • The aim is to enhance the rankings of traditional IR models by leveraging transformer-based pre-trained models (a minimal reranking sketch follows the references below).
  • Specifically focusing on Sentence-BERT (SBERT) based language models.
  • We chose the rankings obtained from Hiemstra-LM as the foundation.
  • A few of the models we compared:
    • Mixed-DistilBERT [2],
    • IndicSBERT [3],
    • LaBSE [4],
    • hkunlp-Instructor [5]

[2] M. N. Raihan, D. Goswami, A. Mahmud, Mixed-distil-bert: Code-mixed language modeling for bangla, english, and hindi, arXiv preprint arXiv:2309.10272 (2023).

[3] S. Deode, J. Gadre, A. Kajale, A. Joshi, R. Joshi, L3cube-indicsbert: A simple approach for learning cross-lingual sentence representations using multilingual bert, 2023. arXiv:2304.11434.

[4] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic bert sentence embedding, 2020. arXiv:2007.01852.

[5] H. Su, W. Shi, J. Kasai, Y. Wang, Y. Hu, M. Ostendorf, W.-t. Yih, N. A. Smith, L. Zettlemoyer, T. Yu, One embedder, any task: Instruction-finetuned text embeddings, 2023. URL: https://arxiv.org/abs/2212.09741. arXiv:2212.09741.
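As a rough illustration of this reranking step, the sketch below re-orders the top documents from a first-stage run by cosine similarity between SBERT embeddings, using the sentence-transformers library; the choice of LaBSE and the shape of the candidate list are assumptions, not necessarily the exact setup used in the study.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed model choice

def rerank(query, candidates):
    """Re-order (doc_id, text) pairs from the first-stage Hiemstra-LM run
    by cosine similarity between SBERT embeddings."""
    q_emb = model.encode(query, convert_to_tensor=True)
    d_embs = model.encode([text for _, text in candidates], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_embs)[0]
    order = scores.argsort(descending=True).tolist()
    return [(candidates[i][0], float(scores[i])) for i in order]

# Hypothetical usage:
# reranked = rerank("code-mixed query", [("D1", "doc one text"), ("D2", "doc two text")])
```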


Ranking using Fine-tuned SBERT Models

  • We also experimented with conducting retrieval using SBERT models exclusively.
  • We fine-tuned the SBERT models on the code-mixed corpus so that the models learn weights better suited to the retrieval task (a minimal fine-tuning sketch follows this list).
  • We noticed that the performance of some models declined after fine-tuning. This is likely due to the languages on which each model was originally pre-trained: fine-tuning a model pre-trained on other languages with a code-mixed Bengali-English corpus appears to confuse it further, degrading performance.
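A minimal fine-tuning sketch with the sentence-transformers library is shown below, assuming (query, relevant document) positive pairs derived from the training judgments; the base model name, loss function, and hyper-parameters are illustrative assumptions rather than the exact configuration used.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed base model

# Hypothetical positive pairs: each training query with one relevant document.
train_examples = [
    InputExample(texts=["code-mixed query 1", "relevant document text 1"]),
    InputExample(texts=["code-mixed query 2", "relevant document text 2"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
model.save("sbert-code-mixed-finetuned")
```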


Reranking SBERT Results Using GNN

  • To further improve the ranking results obtained from the fine-tuned Sentence-BERT (SBERT) models, we explored integrating graph neural networks (GNNs).
  • We took the rankings from our best-performing SBERT model, Mixed-DistilBERT, as the baseline.
  • These selected results were then fed into a GNN model, aiming to leverage its capability to understand and process data in a graph-structured form.
  • This approach was motivated by the potential of GNNs to capture complex relationships and interactions within the data (a minimal graph-construction sketch follows below).
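The sketch below illustrates one plausible version of this idea with PyTorch Geometric: nodes are the top-ranked documents, node features are their SBERT embeddings, and edges link document pairs whose embedding similarity exceeds a threshold. The architecture, threshold, and scoring head are assumptions, and the training loop is omitted.

```python
import torch
from torch_geometric.nn import GCNConv
from sentence_transformers import util

class DocGNN(torch.nn.Module):
    """Two-layer GCN that scores each document node for relevance."""
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 1)  # one score per node

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index).squeeze(-1)

def build_edges(doc_embs, threshold=0.5):
    """Connect document pairs whose SBERT-embedding similarity exceeds the threshold."""
    sim = util.cos_sim(doc_embs, doc_embs)
    src, dst = (sim > threshold).nonzero(as_tuple=True)
    mask = src != dst  # drop self-loops
    return torch.stack([src[mask], dst[mask]])

# Hypothetical usage (doc_embs: [k, dim] tensor of SBERT embeddings of the top-k docs):
# edge_index = build_edges(doc_embs)
# scores = DocGNN(doc_embs.size(1))(doc_embs, edge_index)  # rerank documents by these scores
```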


CONCLUSION

  • Addressed Code-mixed IR in Bengali and English
  • Explored the performance of various techniques for code-mixed Bengali-English IR
  • Transformer-based models showed good retrieval performance.
  • However, the GNN-based approach did not perform as well as the fine-tuned transformer-based models.
  • Future work could further improve the existing results and integrate GNNs more extensively into IR systems.


THANK YOU