SINHALA SIMILARITY AND SEMANTIC PLAGIARISM DETECTION USING MODERN EMBEDDING MODELS

Group/Project ID: 25-26J-545

1/5/2026

GROUP MEMBERS

Ranaweera R.R.M.I.M.

Abhayasiri S.A.U. D

Supervisor

Dinith Primal

1. INTRODUCTION

  • Plagiarism weakens academic credibility.
  • Most checkers focus on English and ignore Sinhala text.
  • Sri Lankan universities handle Sinhala scripts by manual reading.
  • Machine translation lets writers copy English sources and repackage them in Sinhala.
  • This project builds an automated checker for:
      • Sinhala-Sinhala copying
      • English-to-Sinhala translated copying

  • The system runs at sentence level, searches the web, and reports matched lines.

2. RESEARCH PROBLEM

  • Word2vec cosine matching reached 97 %, but on only fifty news articles [1].
  • A rule-based web scraper scored 88 % yet fails on deep paraphrase [3].
  • A Siamese LSTM improved sentence similarity scoring but was never integrated into a full detector [2].
  • No current tool flags cross-language plagiarism between English and Sinhala.
  • Small private datasets limit model trust and public benchmarking.
  • Goal: deliver accurate detection on a large open corpus and handle translated copying.

Figure 1: Siamese LSTM Architecture [4]

3. OBJECTIVES

Main Objective

  • Create an automated plagiarism checker for Sinhala and English.

Sub-Objectives

  • Sinhala similarity detection
  • Cross-language semantic detection

4. SYSTEM ARCHITECTURE

5. DETECTION OF SINHALA SIMILARITY PLAGIARISM

IT21187414 | R.R.M.I.M.Ranaweera

5.1 Research Questions

1. Can contextual embeddings capture similarity between Sinhala sentences that use paraphrase and synonym swaps?

2. What similarity threshold balances precision and recall for Sinhala plagiarism detection?

3. Does a diverse training corpus improve generalisation across academic, news, and social text?
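Research question 2 asks which threshold balances precision and recall. One common way to explore this is to sweep candidate thresholds over validation scores and keep the one with the highest F1. The sketch below assumes that approach; the scores, labels, and function names are illustrative and do not come from the project data.

```python
def precision_recall_f1(scores, labels, threshold):
    """Treat score >= threshold as a 'plagiarised' prediction (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])


# Toy similarity scores and gold labels (hypothetical values).
scores = [0.95, 0.88, 0.81, 0.74, 0.62, 0.55, 0.40, 0.31]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

chosen = best_threshold(scores, labels, candidates)
precision, recall, f1 = precision_recall_f1(scores, labels, chosen)
print(chosen, round(precision, 3), round(recall, 3), round(f1, 3))
```

On real data the sweep would run on a held-out validation split, so that the chosen threshold is not tuned on the test set.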

5.2 Research Gap

Features

Proposed System

Kasthuri 2019 [1]

Nilaxan 2021 [2]

Rajamanthri 2021 [3]

Large, balanced Sinhala corpus

Contextual embeddings (mBERT)

✗ (fastText only)

Handles deep paraphrase

Web-scale source crawl

Real-time API response

Public dataset release

5.3 Technologies and Frameworks

• fastText for subword Sinhala embeddings.

• Multilingual BERT fine-tuned for Sinhala.

• Siamese LSTM built in TensorFlow.

• Flask API for model serving.

• PostgreSQL store for embeddings and match logs.

5.4 Objectives for Sub-Component

Build a balanced Sinhala plagiarism corpus with two thousand labelled documents.

Train a Siamese LSTM that reaches F1 above 0.90 on sentence-level detection.

Deploy the model as an API that returns scores in under one second per sentence.
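The one-second-per-sentence target can be checked with a small timing harness. The sketch below is only an illustration: `score_sentence` is a hypothetical stand-in for the deployed model call, and the helper names are assumptions rather than part of the project code.

```python
import time

LATENCY_BUDGET_S = 1.0  # "under one second per sentence" from the objective


def score_sentence(sentence: str) -> float:
    """Hypothetical placeholder for the real model or API call."""
    return min(1.0, len(sentence) / 100)


def measure_latency(sentences, scorer=score_sentence):
    """Return the wall-clock time in seconds spent scoring each sentence."""
    timings = []
    for sentence in sentences:
        start = time.perf_counter()
        scorer(sentence)
        timings.append(time.perf_counter() - start)
    return timings


def within_budget(timings, budget=LATENCY_BUDGET_S):
    """True when even the slowest sentence stays under the latency budget."""
    return max(timings) < budget


timings = measure_latency(["මම පාසලට යමි", "A short English sentence."])
print(within_budget(timings))
```

Checking the slowest sentence rather than the mean is a design choice made here; a percentile such as p95 would be another reasonable reading of the objective.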

5.5 Methodology (Process)

5.6 Work-Breakdown Structure

Similarity Plagiarism Detection Component

Data Collection (W1-W3)
  • Crawl news sites
  • Gather student sample essays
  • Collect blogs and forums
  • Remove duplicates

Data Annotation (W4-W5)
  • Draft guidelines
  • Validate and Fine-Tune Model
  • Dual label sentences
  • Check inter-rater agreement
  • Resolve disagreements

Embedding Preparation (W6-W7)
  • Train fastText vectors
  • Fine-tune mBERT
  • Build tokeniser and normaliser

Model Development (W8-W10)
  • Design Siamese LSTM
  • Train on labelled pairs
  • Tune similarity threshold
  • Validate with cross fold

Deployment (W11-W12)
  • Dockerise model
  • Build Flask API
  • Run load and latency tests
  • Write user documentation
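The "Check inter-rater agreement" task is often measured with Cohen's kappa when two annotators label the same sentences. The slides do not name a statistic, so kappa is an assumption here, and the labels below are invented for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item list.")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    all_labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = plagiarised, 0 = not plagiarised.
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the annotators agree on 8 of 10 items while chance agreement is 0.5, which gives a kappa of 0.6. Pairs on which they disagree would go to the "Resolve disagreements" step.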

5.7 Development Progression - I

  • Collected 2 536 Sinhala sentence pairs from a Kaggle dataset and explored label balance and text length.

  • Engineered sentence-level features such as sequence similarity, character overlap, word overlap, length ratio, and length difference.

  • Split the data with stratified sampling and used resampling to keep plagiarised and non-plagiarised pairs balanced.

  • Trained several models on TF-IDF and hand-crafted features, comparing Logistic Regression, Random Forest, Gradient Boosting, XGBoost, and SVM.

  • Selected Random Forest as the best model, with an F1 score above 0.98 and strong recall for plagiarised sentences.

F1-Score Comparison on Tested Models
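The hand-crafted sentence-level features listed above could be computed roughly as follows. The slides do not give exact definitions, so this sketch assumes Jaccard overlap for the character and word features and `difflib.SequenceMatcher` for sequence similarity; the function name and the example sentences are illustrative.

```python
from difflib import SequenceMatcher


def jaccard(x: set, y: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def pair_features(a: str, b: str) -> dict:
    """Hand-crafted similarity features for one sentence pair (assumed definitions)."""
    len_a, len_b = len(a), len(b)  # counts Unicode code points, not grapheme clusters
    longest = max(len_a, len_b)
    return {
        "sequence_similarity": SequenceMatcher(None, a, b).ratio(),
        "char_overlap": jaccard(set(a.replace(" ", "")), set(b.replace(" ", ""))),
        "word_overlap": jaccard(set(a.split()), set(b.split())),
        "length_ratio": min(len_a, len_b) / longest if longest else 0.0,
        "length_difference": abs(len_a - len_b),
    }


# Two short Sinhala sentences that share two of their three words.
sent_a = "මම පාසලට යමි"
sent_b = "මම පාසලට ගියෙමි"
features = pair_features(sent_a, sent_b)
print(features)
```

Each pair then becomes one numeric row, which can be placed next to TF-IDF features before training the classifiers compared in this section.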

REFERENCES

[1] T. KasthuriArachchi and E. Y. A. Charles, “Deep Learning Approach to Detect Plagiarism in Sinhala Text,” Dec. 2019, doi: https://doi.org/10.1109/iciis47346.2019.9063299.

[2] S. Nilaxan and S. Ranathunga, “Monolingual Sentence Similarity Measurement Using Siamese Neural Networks for Sinhala and Tamil Languages,” 2021 Moratuwa Engineering Research Conference (MERCon), pp. 567–572, Jul. 2021, doi: https://doi.org/10.1109/mercon52712.2021.9525786.

[3] L. Rajamanthri and S. Thelijjagoda, “Plagiarism Detection Tool for Sinhala Language with Internet Resources Using Natural Language Processing,” pp. 156–160, Aug. 2021, doi: https://doi.org/10.1109/iciafs52090.2021.9605852.

[4] D. Kharazi, “Data Science,” Github.io, 2025. https://dkharazi.github.io/notes/ml/nlp/siamese (accessed Sep. 17, 2025).

6. DETECTION OF ENGLISH TO SINHALA SEMANTIC PLAGIARISM

IT21163968 | Abhayasiri S.A.U. D

6.1 Research Questions

1. Can bilingual embeddings detect English text translated into Sinhala without back-translation?

2. What score marks a sentence pair as translated plagiarism while keeping false alarms low?

3. Does adding domain-mixed data improve cross-language matching in news and academic prose?
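Research question 2 can be read as a constrained choice: pick the lowest score cut-off whose false-alarm rate stays under a chosen cap, which also gives the highest recall available under that cap. The sketch below assumes that reading; the cap of 0.25, the toy scores, and the function name are illustrative only.

```python
def threshold_under_false_alarm_cap(scores, labels, max_fpr):
    """Lowest threshold whose false-positive rate is <= max_fpr.

    Returns (threshold, recall, fpr), or None when no observed score meets the cap.
    Raising the threshold can only lower both recall and the false-positive rate,
    so the first threshold that meets the cap also has the best recall.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    for threshold in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fpr = fp / negatives if negatives else 0.0
        if fpr <= max_fpr:
            recall = tp / positives if positives else 0.0
            return threshold, recall, fpr
    return None


# Toy cross-language similarity scores (hypothetical); 1 = translated plagiarism.
scores = [0.91, 0.84, 0.79, 0.70, 0.66, 0.52, 0.45, 0.30]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
result = threshold_under_false_alarm_cap(scores, labels, max_fpr=0.25)
print(result)
```

With these toy values the lowest qualifying cut-off is 0.66, where one of the four non-plagiarised pairs is still flagged and all four plagiarised pairs are caught.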

6.2 Research Gap

Features

Proposed System

Kasthuri 2019 [1]

Nilaxan 2021 [2]

Rajamanthri 2021 [3]

Cross-language plagiarism handling

Bilingual sentence embeddings (LaBSE)

Machine translation fallback

Deep paraphrase after translation

Web crawl of English sources

API response under 1 s

6.3 Technologies and Frameworks

  • LaBSE and LASER for bilingual embeddings
  • MarianMT for sentence-level translation
  • Hugging Face Transformers, Sentence-Transformers
  • Python, TensorFlow, PyTorch
  • Flask REST API
  • PostgreSQL for match logs
  • Docker and Git for deployment and version control

6.4 Objectives for Sub-Component

Build a parallel Sinhala-English sentence set with ten thousand pairs

Train or fine-tune LaBSE to push F1 above 0.88 on translated plagiarism detection

Integrate translation and similarity services into one endpoint that answers in real time

6.5 Methodology (Process)

6.6 Work Breakdown Structure

Semantic Plagiarism Detection Component

Corpus Building (W1-W3)
  • Mine parallel texts
  • Sentence alignment
  • Quality Check

Model Training (W4-W6)
  • Fine-tune LaBSE
  • Validate embeddings
  • Threshold tuning

Translation Module (W7-W8)
  • Set up MarianMT
  • Evaluate BLEU on sample pairs

Integration (W9-W10)
  • Build Similarity API
  • Merge with translation fallback

Testing & Deployment (W11-W12)
  • Load Test API
  • Write user docs
  • Push Docker Image
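For the "Evaluate BLEU on sample pairs" task, the idea behind BLEU can be shown with a simplified sentence-level version: clipped n-gram precision combined with a brevity penalty. This is a teaching sketch under stated assumptions (whitespace tokens, up to bigrams, no smoothing, one reference), not the project's evaluation script; a real evaluation would normally rely on an established BLEU implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified sentence BLEU: geometric mean of clipped precisions times brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates that are not longer than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


bleu = simple_bleu("the cat sat on mat", "the cat sat on the mat")
print(round(bleu, 3))
```

In this example all five candidate unigrams and three of the four bigrams appear in the reference, and the shorter candidate is penalised by exp(-0.2).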

6.7 Development Progression

  • Downloaded the MIT sentence pair dataset from Kaggle and studied class balance, text length, and basic statistics.

  • Computed LaBSE embeddings for both sentences and derived extra similarity features such as cosine similarity, Euclidean distance, sequence similarity, word overlap, and length-based ratios.

  • Analysed the cosine similarity distribution of similar and different pairs to see how well the classes separate.

  • Built a feature matrix that combines embedding-based features with traditional similarity features.

  • Trained several classifiers on the balanced feature set and compared Logistic Regression, Random Forest, Gradient Boosting, and XGBoost.

Confusion Matrix of the Most Accurate Model (XGBoost)
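The two embedding-based features named above, cosine similarity and Euclidean distance, reduce to a few lines once each sentence has a vector. The sketch below uses short hand-written vectors as stand-ins for real LaBSE embeddings, which have far more dimensions; the variable names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical 4-dimensional stand-ins for an English and a Sinhala sentence embedding.
english_vec = [0.2, 0.7, 0.1, 0.4]
sinhala_vec = [0.25, 0.65, 0.05, 0.45]

pair_row = {
    "cosine_similarity": cosine_similarity(english_vec, sinhala_vec),
    "euclidean_distance": euclidean_distance(english_vec, sinhala_vec),
}
print(pair_row)
```

A row like `pair_row` would then be joined with the traditional similarity features to form the feature matrix passed to the classifiers.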

7. COMMERCIALIZATION PLAN

Branding and Logo

8. BUDGET

# | Cost Item | Description | Amount (LKR)
1 | Data collection | Web-crawling bandwidth, paid archives for Sinhala and English text | 10 000
2 | Annotation labour | Two part-time linguists for sentence-level labelling (50 hours) | 15 000
3 | Cloud compute (GPU) | Training and evaluation on rented GPU instances (120 GPU-hours) | 12 000
4 | Software & APIs | Translation API credits, repository hosting, SSL certificate | 8 000
5 | Domain & server hosting | Similarity.lk domain for 12 months and VPS for demo | 5 000
6 | Printing & presentation | Posters, handouts, binding of final report | 3 000
7 | Contingency (≈12 %) | Covers exchange-rate change and minor unforeseen costs | 7 000
Total | | | 60 000
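As an arithmetic check on the budget, the line items can be summed and the contingency share recomputed. The sketch assumes the "≈12 %" figure is measured against the overall total; the dictionary keys simply mirror the cost items.

```python
# Budget line items in LKR, copied from the budget table.
budget_lkr = {
    "Data collection": 10_000,
    "Annotation labour": 15_000,
    "Cloud compute (GPU)": 12_000,
    "Software & APIs": 8_000,
    "Domain & server hosting": 5_000,
    "Printing & presentation": 3_000,
    "Contingency": 7_000,
}

total = sum(budget_lkr.values())
# Assumption: the contingency percentage is taken relative to the full total.
contingency_share = budget_lkr["Contingency"] / total

print(total, f"{contingency_share:.1%}")
```

The items add up to the stated 60 000 LKR, and 7 000 is about 11.7 % of that total, which matches the rounded 12 % label.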

9. GANTT CHART

10. EXTERNAL SUPPORT

Mr. Bimsara Kumarasinghe

Bachelor of Information Technology (UCSC), Master of Computer Science (UCSC) (Reading)

THANK YOU!
