SINHALA SIMILARITY AND SEMANTIC PLAGIARISM DETECTION USING MODERN EMBEDDING MODELS

Group/Project ID: 25-26J-545

1/5/2026

GROUP MEMBERS

Ranaweera R.R.M.I.M.

Abhayasiri S.A.U. D

Supervisor

Dinith Primal

1. INTRODUCTION

Plagiarism weakens academic credibility.
Most checkers focus on English and ignore Sinhala text.
Sri Lankan universities still check Sinhala scripts by reading them manually.
Machine translation lets writers copy English sources and repackage them in Sinhala.
This project builds an automated checker for:

• Sinhala-Sinhala copying

• English-to-Sinhala translated copying.

The system runs at sentence level, searches the web, and reports matched lines.

2. RESEARCH PROBLEM

Word2vec cosine matching reached 97% [1], but on a test set of only fifty news articles.
A rule-based web scraper scored 88% yet fails on deep paraphrase [3].
A Siamese LSTM improved sentence-similarity scoring but was never integrated into a full detector [2].
No current tool flags cross-language plagiarism between English and Sinhala.
Small private datasets limit trust in the models and prevent public benchmarking.
Goal: deliver accurate detection on a large open corpus and handle translated copying.

Figure 1: Siamese LSTM Architecture [4]

3. OBJECTIVES

Main Objective

Create an automated plagiarism checker for Sinhala and English

Sub-Objectives

• Sinhala similarity detection

• Cross-language semantic detection

4. SYSTEM ARCHITECTURE

5. DETECTION OF SINHALA SIMILARITY PLAGIARISM

IT21187414 | R.R.M.I.M.Ranaweera |

5.1 Research Questions

1. Can contextual embeddings capture similarity between Sinhala sentences that use paraphrase and synonym swaps?

2. What similarity threshold balances precision and recall for Sinhala plagiarism detection?

3. Does a diverse training corpus improve generalisation across academic, news, and social text?
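
Research question 2 could be explored with a simple threshold sweep. The sketch below is illustrative only: it assumes each sentence pair already has a similarity score in [0, 1] and a gold label (1 = plagiarised), and it picks the candidate threshold with the highest F1. The function names and the choice of F1 as the balancing criterion are assumptions, not part of the project's stated method.

```python
from typing import List, Sequence, Tuple


def precision_recall_f1(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float, float]:
    """Treat pairs with score >= threshold as predicted plagiarised (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(
    scores: Sequence[float], labels: Sequence[int], candidates: List[float]
) -> float:
    """Return the candidate with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])
```

In practice the sweep would run on a held-out validation split, so the chosen threshold is not tuned on the same pairs used to report final precision and recall.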

5.2 Research Gap

Features	Proposed System	Kasthuri 2019 [1]	Nilaxan 2021 [2]	Rajamanthri 2021 [3]
Large, balanced Sinhala corpus	✓	✗	✗	✗
Contextual embeddings (mBERT)	✓	✗	✗ (fastText only)	✗
Handles deep paraphrase	✓	✗	✓	✗
Web-scale source crawl	✓	✗	✗	✓
Real-time API response	✓	✗	✗	✗
Public dataset release	✓	✗	✗	✗

5.3 Technologies and Frameworks

• fastText for subword Sinhala embeddings.

• Multilingual BERT fine-tuned for Sinhala.

• Siamese LSTM built in TensorFlow.

• Flask API for model serving.

• PostgreSQL store for embeddings and match logs.

5.4 Objectives for Sub-Component

Build a balanced Sinhala plagiarism corpus with two thousand labelled documents.

Train a Siamese LSTM that reaches an F1 score above 0.90 on sentence-level detection.

Deploy the model as an API that returns scores in under one second per sentence.

5.5 Methodology (Process)

5.6 Work-Breakdown Structure

Similarity Plagiarism Detection Component

Data Collection (W1-W3)
• Crawl news sites
• Gather student sample essays
• Collect blogs and forums
• Remove duplicates

Data Annotation (W4-W5)
• Draft guidelines
• Validate and Fine-Tune Model
• Dual-label sentences
• Check inter-rater agreement
• Resolve disagreements

Embedding Preparation (W6-W7)
• Train fastText vectors
• Fine-tune mBERT
• Build tokeniser and normaliser

Model Development (W8-W10)
• Design Siamese LSTM
• Train on labelled pairs
• Tune similarity threshold
• Validate with k-fold cross-validation

Deployment (W11-W12)
• Dockerise model
• Build Flask API
• Run load and latency tests
• Write user documentation
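
The "Check inter-rater agreement" task in the annotation phase needs an agreement statistic. The slides do not name one; the sketch below assumes Cohen's kappa, a common choice when two annotators label the same items. The function name and the binary label encoding are illustrative.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one class (for example, non-plagiarised) dominates the sample.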

5.7 Development Progression - I

Collected 2 536 Sinhala sentence pairs from the Kaggle dataset and explored label balance and text length.

Engineered sentence-level features such as sequence similarity, character overlap, word overlap, length ratio, and length difference.

Split data with stratified sampling and handled class imbalance with resampling to keep plagiarised and non-plagiarised pairs balanced.

Trained several models on TF-IDF features and hand-crafted features and compared Logistic Regression, Random Forest, Gradient Boosting, XGBoost, and SVM.

Selected Random Forest as the best model, with an F1 score above 0.98 and strong recall for plagiarised sentences.
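
The hand-crafted features listed above could be computed roughly as follows. This is a minimal sketch, not the project's actual feature code: it assumes whitespace tokenisation, Jaccard overlap for the character and word features, and `difflib.SequenceMatcher` for sequence similarity. Lengths are counted in Unicode code points, so a Sinhala letter with vowel signs counts as more than one unit.

```python
from difflib import SequenceMatcher
from typing import Dict, Set


def jaccard(x: Set[str], y: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def pair_features(a: str, b: str) -> Dict[str, float]:
    """Surface similarity features for one sentence pair."""
    longer = max(len(a), len(b))
    return {
        "sequence_similarity": SequenceMatcher(None, a, b).ratio(),
        "char_overlap": jaccard(set(a.replace(" ", "")), set(b.replace(" ", ""))),
        "word_overlap": jaccard(set(a.split()), set(b.split())),
        "length_ratio": min(len(a), len(b)) / longer if longer else 1.0,
        "length_difference": float(abs(len(a) - len(b))),
    }
```

Each pair yields one fixed-length row, so these values can sit alongside TF-IDF features in the matrix fed to the classifiers.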

F1-Score Comparison on Tested Models

REFERENCES

[1] T. KasthuriArachchi and E. Y. A. Charles, “Deep Learning Approach to Detect Plagiarism in Sinhala Text,” Dec. 2019, doi: https://doi.org/10.1109/iciis47346.2019.9063299.

[2] S. Nilaxan and S. Ranathunga, “Monolingual Sentence Similarity Measurement Using Siamese Neural Networks for Sinhala and Tamil Languages,” 2021 Moratuwa Engineering Research Conference (MERCon), pp. 567–572, Jul. 2021, doi: https://doi.org/10.1109/mercon52712.2021.9525786.

[3] L. Rajamanthri and S. Thelijjagoda, “Plagiarism Detection Tool for Sinhala Language with Internet Resources Using Natural Language Processing,” pp. 156–160, Aug. 2021, doi: https://doi.org/10.1109/iciafs52090.2021.9605852.

[4] D. Kharazi, “Data Science,” Github.io, 2025. https://dkharazi.github.io/notes/ml/nlp/siamese (accessed Sep. 17, 2025).

6. DETECTION OF ENGLISH TO SINHALA SEMANTIC PLAGIARISM

IT21163968 | Abhayasiri S.A.U. D |

6.1 Research Questions

1. Can bilingual embeddings detect English text translated into Sinhala without back-translation?

2. What score marks a sentence pair as translated plagiarism while keeping false alarms low?

3. Does adding domain-mixed data improve cross-language matching in news and academic prose?

6.2 Research Gap

Features	Proposed System	Kasthuri 2019 [1]	Nilaxan 2021 [2]	Rajamanthri 2021 [3]
Cross-language plagiarism handling	✓	✗	✗	✗
Bilingual sentence embeddings (LaBSE)	✓	✗	✗	✗
Machine translation fallback	✓	✗	✗	✗
Deep paraphrase after translation	✓	✗	✗	✗
Web crawl of English sources	✓	✗	✗	✗
API response under 1 s	✓	✗	✗	✗

6.3 Technologies and Frameworks

• LaBSE and LASER for bilingual embeddings.

• MarianMT for sentence-level translation.

• Hugging Face Transformers and Sentence-Transformers.

• Python, TensorFlow, and PyTorch.

• Flask REST API.

• PostgreSQL for match logs.

• Docker and Git for deployment and version control.

6.4 Objectives for Sub-Component

Build a parallel Sinhala-English sentence set with ten thousand pairs.

Train or fine-tune LaBSE to push F1 above 0.88 on translated plagiarism detection.

Integrate translation and similarity services into one endpoint that answers in real time.
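
One way the translation and similarity services might be combined is sketched below. The slides do not say when the translation fallback is triggered, so the rule here is an assumption: score the pair directly with bilingual embeddings, and only when that score lands in an uncertain band translate the English sentence and compare again. `embed`, `translate`, `low`, and `high` are hypothetical placeholders for the LaBSE encoder, the MarianMT translator, and tuned cut-offs.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def score_with_fallback(
    english: str,
    sinhala: str,
    embed: Callable[[str], Vector],      # hypothetical: sentence -> embedding
    translate: Callable[[str], str],     # hypothetical: English -> Sinhala
    low: float = 0.5,                    # assumed lower edge of the uncertain band
    high: float = 0.8,                   # assumed upper edge of the uncertain band
) -> Tuple[float, bool]:
    """Return (similarity score, whether the translation fallback was used)."""
    direct = cosine(embed(english), embed(sinhala))
    if direct <= low or direct >= high:
        return direct, False             # confident either way: skip translation
    translated = translate(english)
    fallback = cosine(embed(translated), embed(sinhala))
    return max(direct, fallback), True
```

Skipping translation for confident scores keeps most requests to a single embedding pass, which helps with the real-time target; the band edges would need tuning on labelled pairs.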

6.5 Methodology (Process)

6.6 Work-Breakdown Structure

Semantic Plagiarism Detection Component

Corpus Building (W1-W3)
• Mine parallel texts
• Align sentences
• Check quality

Model Training (W4-W6)
• Fine-tune LaBSE
• Validate embeddings
• Tune threshold

Translation Module (W7-W8)
• Set up MarianMT
• Evaluate BLEU on sample pairs

Integration (W9-W10)
• Build similarity API
• Merge with translation fallback

Testing & Deployment (W11-W12)
• Load-test API
• Write user docs
• Push Docker image

6.7 Development Progression

Downloaded the MIT sentence-pair dataset from Kaggle and studied class balance, text length, and basic statistics.

Computed LaBSE embeddings for both sentences and derived extra similarity features such as cosine similarity, Euclidean distance, sequence similarity, word overlap, and length-based ratios.

Analysed cosine similarity distribution for similar and different pairs to see the separation of classes.

Built a feature matrix that combines embedding based features with traditional similarity features.

Trained several classifiers on the balanced feature set and compared Logistic Regression, Random Forest, Gradient Boosting, and XGBoost.
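
The two embedding-based features named above follow directly from their definitions. The sketch below assumes the LaBSE vectors for each sentence are already available as plain lists of floats; the function names are illustrative, and a real pipeline would more likely use NumPy for speed.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Vector, v: Vector) -> float:
    """Straight-line distance between the two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def embedding_features(u: Vector, v: Vector) -> Dict[str, float]:
    """Embedding-based part of the feature row for one sentence pair."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return {
        "cosine_similarity": cosine_similarity(u, v),
        "euclidean_distance": euclidean_distance(u, v),
    }
```

If the embeddings are unit-normalised, the squared Euclidean distance equals 2 - 2 × cosine similarity, so the two features carry the same information and a tree model gains little from having both.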

Confusion Matrix of the Most Accurate Model - XGBoost

7. COMMERCIALIZATION PLAN

Branding and Logo

8. BUDGET

No.	Cost Item	Description	Amount (LKR)
1	Data collection	Web-crawling bandwidth, paid archives for Sinhala and English text	10 000
2	Annotation labour	Two part-time linguists for sentence-level labelling (50 hours)	15 000
3	Cloud compute (GPU)	Training and evaluation on rented GPU instances (120 GPU-hours)	12 000
4	Software & APIs	Translation API credits, repository hosting, SSL certificate	8 000
5	Domain & server hosting	Similarity.lk domain for 12 months and VPS for demo	5 000
6	Printing & presentation	Posters, handouts, binding of final report	3 000
7	Contingency (≈12 %)	Covers exchange-rate change and minor unforeseen costs	7 000
Total			60 000

9. GANTT CHART

10. EXTERNAL SUPPORT

Mr. Bimsara Kumarasinghe

Bachelor of Information Technology (UCSC), Master of Computer Science (UCSC) (Reading)

THANK YOU!
