
INFOTEXTCM: Addressing Code-mixed Data Retrieval Challenges via Text Classification

ADITYA SHARMA

DEPT. OF CSIS, BITS PILANI


WHY ADDRESS CODE-MIXED IR, AND WHAT ARE THE CHALLENGES?

  • Growing popularity of code-mixed languages on social media and other digital platforms.
  • Lack of resources and tools for processing code-mixed text.
  • Challenges:
    • Language identification, complicated by transliteration
      • e.g., "tin" means "three" in romanized Bengali
      • but refers to a metal container or the chemical element in English
    • Syntactic diversity: differing grammatical structures and morphological complexity across the mixed languages
    • Low-resource languages


TASK PROBLEM



DATASET

  • The provided dataset consists of multilingual documents in TREC format.
  • Code-mixed Bengali and English languages.
  • Train queries: 20 queries in code-mixed English and Bengali.
  • Additionally, a relevance-judgment (qrels) file with human annotations marking each document as relevant or non-relevant for a given query (a small loading sketch follows).
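Below is a minimal Python sketch of loading TREC-style relevance judgments; the file name and the exact column layout (query id, iteration, document id, relevance label) are assumptions for illustration, not the exact files provided with the dataset.

```python
from collections import defaultdict

def load_qrels(path):
    """Map each query id to {doc_id: relevance label}."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            qid, _iteration, doc_id, rel = parts
            qrels[qid][doc_id] = int(rel)
    return qrels

# Hypothetical usage: qrels = load_qrels("train_qrels.txt")
```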


APPROACHES IMPLEMENTED DURING THIS STUDY

  • Preprocessing approaches (stopword removal and stemming)
  • Reranking with pre-trained language models (LMs)
  • Ranking using fine-tuned Sentence-BERT (SBERT) language model
  • Reranking SBERT ranking results using Graph Neural Networks (GNNs)


Fig: Framework pipeline


PREPROCESSING APPROACHES

  • Stopword removal (a minimal sketch follows the reference below)
    • NLTK (both English and Bengali libraries)
    • Terrier English stopword list
    • SMART English stopword list
    • Custom English and Bengali stopword list from research article [1].

[1] S. Chanda, S. Pal, The effect of stopword removal on information retrieval for code-mixed data obtained via social media, SN Comput. Sci. 4 (2023) 494. doi:10.1007/S42979-023-01942-7.
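As an illustration, here is a minimal Python sketch of the NLTK-based stopword-removal step, assuming the installed NLTK stopwords corpus provides both the English and Bengali lists mentioned above; the tokenizer choice and example usage are illustrative only.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads (uncomment on first run):
# nltk.download("stopwords"); nltk.download("punkt")

# English plus Bengali lists, as used on this slide.
STOPWORDS = set(stopwords.words("english")) | set(stopwords.words("bengali"))

def remove_stopwords(text):
    """Drop stopwords from a code-mixed Bengali-English string."""
    tokens = word_tokenize(text)
    return " ".join(tok for tok in tokens if tok.lower() not in STOPWORDS)

# Hypothetical usage: clean = remove_stopwords("this is an example query")
```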


Reranking Traditional Models with Pre-trained Language Models

  • The aim is to enhance the rankings of traditional IR models by leveraging transformer-based pre-trained models (a minimal reranking sketch follows the references below).
  • Specifically focusing on Sentence-BERT (SBERT) based language models.
  • We chose the rankings obtained from Hiemstra-LM as the foundation.
  • A few of the models we compared:
    • Mixed-DistilBERT [2],
    • IndicSBERT [3],
    • LaBSE [4],
    • hkunlp-Instructor [5]

[2] M. N. Raihan, D. Goswami, A. Mahmud, Mixed-distil-bert: Code-mixed language modeling for bangla, english, and hindi, arXiv preprint arXiv:2309.10272 (2023).

[3] S. Deode, J. Gadre, A. Kajale, A. Joshi, R. Joshi, L3cube-indicsbert: A simple approach for learning cross-lingual sentence representations using multilingual bert, 2023. arXiv:2304.11434.

[4] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic bert sentence embedding, 2020. arXiv:2007.01852.

[5] H. Su, W. Shi, J. Kasai, Y. Wang, Y. Hu, M. Ostendorf, W.-t. Yih, N. A. Smith, L. Zettlemoyer, T. Yu, One embedder, any task: Instruction-finetuned text embeddings, 2023. URL: https://arxiv.org/abs/2212.09741. arXiv:2212.09741.
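As a rough illustration of this reranking step, the sketch below re-orders the top documents from a first-stage run by cosine similarity between SBERT embeddings, using the sentence-transformers library; the choice of LaBSE and the shape of the candidate list are assumptions, not necessarily the exact setup used in the study.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed model choice

def rerank(query, candidates):
    """Re-order (doc_id, text) pairs from the first-stage Hiemstra-LM run
    by cosine similarity between SBERT embeddings."""
    q_emb = model.encode(query, convert_to_tensor=True)
    d_embs = model.encode([text for _, text in candidates], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_embs)[0]
    order = scores.argsort(descending=True).tolist()
    return [(candidates[i][0], float(scores[i])) for i in order]

# Hypothetical usage:
# reranked = rerank("code-mixed query", [("D1", "doc one text"), ("D2", "doc two text")])
```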


Ranking using Fine-tuned SBERT Models

  • We also experimented with conducting retrieval using SBERT models exclusively.
  • We fine-tuned the SBERT models on the code-mixed corpus so that the models learn weights better suited to the retrieval task (a minimal fine-tuning sketch follows this list).
  • We noticed that the performance of some models declined after fine-tuning. This is likely due to the languages on which each model was originally pre-trained: fine-tuning a model pre-trained on other languages with a code-mixed Bengali-English corpus appears to confuse it further, degrading performance.
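A minimal fine-tuning sketch with the sentence-transformers library is shown below, assuming (query, relevant document) positive pairs derived from the training judgments; the base model name, loss function, and hyper-parameters are illustrative assumptions rather than the exact configuration used.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed base model

# Hypothetical positive pairs: each training query with one relevant document.
train_examples = [
    InputExample(texts=["code-mixed query 1", "relevant document text 1"]),
    InputExample(texts=["code-mixed query 2", "relevant document text 2"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
model.save("sbert-code-mixed-finetuned")
```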


Reranking SBERT Results Using GNN

  • To further improve the ranking results obtained from the fine-tuned Sentence-BERT (SBERT) models, we explored integrating graph neural networks (GNNs).
  • We took the rankings from our best-performing SBERT model, Mixed-DistilBERT, as the baseline.
  • These selected results were then fed into a GNN model, aiming to leverage its capability to understand and process data in a graph-structured form.
  • This approach was motivated by the potential of GNNs to capture complex relationships and interactions within the data (a minimal graph-construction sketch follows below).
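The sketch below illustrates one plausible version of this idea with PyTorch Geometric: nodes are the top-ranked documents, node features are their SBERT embeddings, and edges link document pairs whose embedding similarity exceeds a threshold. The architecture, threshold, and scoring head are assumptions, and the training loop is omitted.

```python
import torch
from torch_geometric.nn import GCNConv
from sentence_transformers import util

class DocGNN(torch.nn.Module):
    """Two-layer GCN that scores each document node for relevance."""
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 1)  # one score per node

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index).squeeze(-1)

def build_edges(doc_embs, threshold=0.5):
    """Connect document pairs whose SBERT-embedding similarity exceeds the threshold."""
    sim = util.cos_sim(doc_embs, doc_embs)
    src, dst = (sim > threshold).nonzero(as_tuple=True)
    mask = src != dst  # drop self-loops
    return torch.stack([src[mask], dst[mask]])

# Hypothetical usage (doc_embs: [k, dim] tensor of SBERT embeddings of the top-k docs):
# edge_index = build_edges(doc_embs)
# scores = DocGNN(doc_embs.size(1))(doc_embs, edge_index)  # rerank documents by these scores
```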


CONCLUSION

  • Addressed Code-mixed IR in Bengali and English
  • Explored the performance of various techniques for code-mixed Bengali-English IR
  • Transformer-based models showed good retrieval performance.
  • However, the GNN-based approach did not perform as well as the fine-tuned transformer-based models.
  • Future work could further improve the existing results and integrate GNNs more extensively into IR systems.


THANK YOU