SINHALA SIMILARITY AND SEMANTIC PLAGIARISM DETECTION USING MODERN EMBEDDING MODELS

Group/Project ID: 25-26J-545

3/9/2026

GROUP MEMBERS

Ranaweera R.R.M.I.M.

Abhayasiri S.A.U. D

Supervisor

Dinith Primal

1. INTRODUCTION

Plagiarism weakens academic credibility.
Most checkers focus on English and ignore Sinhala text.
Sri Lankan universities check Sinhala scripts by reading them manually.
Machine translation lets writers copy English sources and repackage them in Sinhala.
This project builds an automated checker for:

• Sinhala-Sinhala copying

• English-to-Sinhala translated copying.

The system runs at sentence level, searches the web, and reports matched lines.

2. RESEARCH PROBLEM

Word2vec cosine matching reached 97 % on only fifty news articles [1].
A rule-based web scraper scored 88 % yet fails on deep paraphrase [3].
A Siamese LSTM improved sentence similarity scoring but was never integrated into a full detector [2].
No current tool flags cross-language plagiarism between English and Sinhala.
Small private datasets limit model trust and public benchmarking.
Goal: deliver accurate detection on a large open corpus and handle translated copying.

Figure 1: Siamese LSTM Architecture [4]

3. OBJECTIVES

Main Objective

Create an automated plagiarism checker for Sinhala and English

Sub-Objectives

• Sinhala similarity detection

• Cross-language semantic detection

4. SYSTEM ARCHITECTURE

5. DETECTION OF SINHALA SIMILARITY PLAGIARISM

IT21187414 | R.R.M.I.M.Ranaweera |

5.1 Research Questions

1. Can contextual embeddings capture similarity between Sinhala sentences that use paraphrase and synonym swaps?

2. What similarity threshold balances precision and recall for Sinhala plagiarism detection?

3. Does a diverse training corpus improve generalisation across academic, news, and social text?
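
The threshold question above can be made concrete with a simple sweep. The sketch below is a minimal illustration, not the project's implementation: the function names `precision_recall_f1` and `best_threshold`, the candidate grid, and the toy scores are all assumptions. It takes similarity scores with gold labels and returns the threshold that gives the highest F1.

```python
from typing import List, Optional, Tuple


def precision_recall_f1(scores: List[float], labels: List[int],
                        threshold: float) -> Tuple[float, float, float]:
    """Treat pairs with score >= threshold as predicted plagiarism (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def best_threshold(scores: List[float], labels: List[int],
                   candidates: Optional[List[float]] = None) -> Tuple[float, float]:
    """Return (threshold, f1) for the candidate with the highest F1."""
    if candidates is None:
        # Assumed grid: 0.05, 0.10, ..., 0.95
        candidates = [i / 100 for i in range(5, 100, 5)]
    best = max(candidates,
               key=lambda t: precision_recall_f1(scores, labels, t)[2])
    return best, precision_recall_f1(scores, labels, best)[2]


# Toy data, invented for illustration only.
toy_scores = [0.95, 0.88, 0.72, 0.61, 0.40, 0.15]
toy_labels = [1, 1, 1, 0, 0, 0]
threshold, f1 = best_threshold(toy_scores, toy_labels)
```

On real data the sweep would run on a held-out validation split, so the chosen threshold is not tuned on the same pairs used to report the final F1.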

5.2 Research Gap

Features	Proposed System	Kasthuri 2019 [1]	Nilaxan 2021 [2]	Rajamanthri 2021 [3]
Large, balanced Sinhala corpus	✓	✗	✗	✗
Contextual embeddings (mBERT)	✓	✗	✗ (fastText only)	✗
Handles deep paraphrase	✓	✗	✓	✗
Web-scale source crawl	✓	✗	✗	✓
Real-time API response	✓	✗	✗	✗
Public dataset release	✓	✗	✗	✗

5.3 Technologies and Frameworks

• fastText for subword Sinhala embeddings.

• Multilingual BERT fine-tuned for Sinhala.

• Siamese LSTM built in TensorFlow.

• Flask API for model serving.

• PostgreSQL store for embeddings and match logs.

5.4 Objectives for Sub-Component

Build a balanced Sinhala plagiarism corpus with two thousand labelled documents.

Train a Siamese LSTM that reaches F1 above 0.90 on sentence-level detection.

Deploy the model as an API that returns scores in under one second per sentence.

5.5 Methodology (Process)

5.6 Work-Breakdown Structure

Similarity Plagiarism Detection Component

Data Collection (W1-W3)

Crawl news sites

Gather student sample essays

Collect blogs and forums

Remove duplicates

Data Annotation (W4-W5)

Draft guidelines

Validate and Fine-Tune Model

Dual label sentences

Check inter-rater agreement

Resolve disagreements

Embedding Preparation (W6-W7)

Train fastText vectors

Fine-tune mBERT

Build tokeniser and normaliser

Model Development (W8-W10)

Design Siamese LSTM

Train on labelled pairs

Tune similarity threshold

Validate with cross-validation

Deployment (W11-W12)

Dockerise model

Build Flask API

Run load and latency tests

Write user documentation
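
The "Check inter-rater agreement" step under Data Annotation is often measured with Cohen's kappa when two annotators label the same sentences. The slides do not name a metric, so using kappa here is an assumption, and the function name `cohens_kappa` and the toy labels are hypothetical. A minimal sketch:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty item list")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration: 1 = plagiarised, 0 = original.
labels_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
labels_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(labels_a, labels_b)
```

In the toy data the raters agree on 8 of 10 items while chance agreement is 0.5, so kappa comes out near 0.6. Pairs where the raters disagree would then go to the "Resolve disagreements" step.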

5.7 Development Progression II

Developed the full Flask-based web application for Sinhala similarity plagiarism detection with secure login, document upload, dashboard view, and report download.
Integrated the trained Random Forest model and the saved TF-IDF vectorizer into the backend so the system can run live plagiarism prediction on uploaded documents.
Implemented text extraction for PDF and Word files, including an OCR fallback for legacy Sinhala PDF documents using PyMuPDF and Tesseract.
Built the sentence processing pipeline to split extracted text, compare sentence pairs, calculate similarity features, assign confidence scores, and identify suspicious content.
Generated automated PDF reports with overall plagiarism percentage, sentence-level highlights, source attribution, and document summary statistics.
Added document history, file deletion, upload status handling, and dashboard analytics to complete an end-to-end working prototype.
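
The sentence processing pipeline described above (split, compare pairs, compute similarity features, flag suspicious content) can be sketched with the standard library alone. This is a simplified stand-in, not the deployed code: it swaps the Random Forest and TF-IDF vectorizer for raw term counts with a fixed cosine threshold, and every function name, the punctuation set, and the 0.8 threshold are assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def tokenize(sentence: str) -> List[str]:
    """Whitespace tokens with edge punctuation stripped.

    Whitespace splitting is used instead of the regex class \\w because
    \\w does not match combining vowel signs, which would break Sinhala words.
    """
    tokens = (tok.strip(".,!?;:\"'()[]") for tok in sentence.lower().split())
    return [t for t in tokens if t]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard(a: Counter, b: Counter) -> float:
    """Share of distinct tokens that the two sentences have in common."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def compare_documents(document: str, source: str,
                      threshold: float = 0.8) -> List[Dict]:
    """For each document sentence, find the closest source sentence."""
    source_sentences = split_sentences(source)
    report = []
    for sentence in split_sentences(document):
        vec = Counter(tokenize(sentence))
        best_score, best_match, best_jaccard = 0.0, "", 0.0
        for candidate in source_sentences:
            cand_vec = Counter(tokenize(candidate))
            score = cosine(vec, cand_vec)
            if score > best_score:
                best_score, best_match = score, candidate
                best_jaccard = jaccard(vec, cand_vec)
        report.append({"sentence": sentence, "match": best_match,
                       "cosine": best_score, "jaccard": best_jaccard,
                       "suspicious": best_score >= threshold})
    return report


def plagiarism_percentage(report: List[Dict]) -> float:
    """Share of sentences flagged as suspicious, in percent."""
    if not report:
        return 0.0
    return 100.0 * sum(r["suspicious"] for r in report) / len(report)


doc = "The cat sat on the mat. Dogs bark loudly at night."
src = "The cat sat on the mat. Birds sing in the morning."
result = compare_documents(doc, src)
overall = plagiarism_percentage(result)
```

In the toy input the first sentence is copied verbatim and the second shares no tokens with the source, so one of two sentences is flagged. In the real system the per-pair features would instead feed the trained classifier, which assigns the confidence score.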

Web Application Interface and Report Generation Process

REFERENCES

[1] T. KasthuriArachchi and E. Y. A. Charles, “Deep Learning Approach to Detect Plagiarism in Sinhala Text,” Dec. 2019, doi: https://doi.org/10.1109/iciis47346.2019.9063299.

[2] S. Nilaxan and S. Ranathunga, “Monolingual Sentence Similarity Measurement Using Siamese Neural Networks for Sinhala and Tamil Languages,” 2021 Moratuwa Engineering Research Conference (MERCon), pp. 567–572, Jul. 2021, doi: https://doi.org/10.1109/mercon52712.2021.9525786.

[3] L. Rajamanthri and S. Thelijjagoda, “Plagiarism Detection Tool for Sinhala Language with Internet Resources Using Natural Language Processing,” pp. 156–160, Aug. 2021, doi: https://doi.org/10.1109/iciafs52090.2021.9605852.

[4] D. Kharazi, “Data Science,” Github.io, 2025. https://dkharazi.github.io/notes/ml/nlp/siamese (accessed Sep. 17, 2025).

6. DETECTION OF ENGLISH TO SINHALA SEMANTIC PLAGIARISM

IT21163968 | Abhayasiri S.A.U. D |

6.1 Research Questions

1. Can bilingual embeddings detect English text translated into Sinhala without back-translation?

2. What score marks a sentence pair as translated plagiarism while keeping false alarms low?

3. Does adding domain-mixed data improve cross-language matching in news and academic prose?

6.2 Research Gap

Features	Proposed System	Kasthuri 2019 [1]	Nilaxan 2021 [2]	Rajamanthri 2021 [3]
Cross-language plagiarism handling	✓	✗	✗	✗
Bilingual sentence embeddings (LaBSE)	✓	✗	✗	✗
Machine translation fallback	✓	✗	✗	✗
Deep paraphrase after translation	✓	✗	✗	✗
Web crawl of English sources	✓	✗	✗	✗
API response under 1 s	✓	✗	✗	✗

6.3 Technologies and Frameworks

• LaBSE and LASER for bilingual embeddings.

• MarianMT for sentence-level translation.

• Hugging Face Transformers and Sentence-Transformers.

• Python, TensorFlow, and PyTorch.

• Flask REST API.

• PostgreSQL for match logs.

• Docker and Git for deployment and version control.

6.4 Objectives for Sub-Component

Build a parallel Sinhala-English sentence set with ten thousand pairs

Train or fine-tune LaBSE to push F1 above 0.88 on translated plagiarism detection

Integrate translation and similarity services into one endpoint that answers in real time
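
One way to combine the similarity and translation services behind a single call is sketched below. The slides do not say when the translation fallback fires, so the rule used here (translate only when the direct bilingual score is below the threshold, then keep the higher score) is an assumption. `embed` and `translate` are hypothetical callables standing in for a LaBSE encoder and a MarianMT translator, and the stub vectors are invented toy data.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_language_check(english: str, sinhala: str,
                         embed: Callable[[str], Vector],
                         translate: Callable[[str], str],
                         threshold: float = 0.8) -> Dict:
    """Score an English source sentence against a Sinhala suspect sentence."""
    # Step 1: compare directly in the shared bilingual embedding space.
    direct = cosine(embed(english), embed(sinhala))
    if direct >= threshold:
        return {"score": direct, "used_fallback": False, "flagged": True}
    # Step 2 (assumed fallback rule): translate the English sentence into
    # Sinhala and compare the two Sinhala sentences instead.
    fallback = cosine(embed(translate(english)), embed(sinhala))
    score = max(direct, fallback)
    return {"score": score, "used_fallback": True,
            "flagged": score >= threshold}


# Stubs for illustration only; a real system would call actual models here.
_TOY_VECTORS = {"good morning": [1.0, 0.0], "සුභ උදෑසනක්": [0.6, 0.8]}


def stub_embed(text: str) -> Vector:
    return _TOY_VECTORS[text]


def stub_translate(text: str) -> str:
    return {"good morning": "සුභ උදෑසනක්"}[text]


outcome = cross_language_check("good morning", "සුභ උදෑසනක්",
                               stub_embed, stub_translate)
```

Passing the models in as callables keeps the endpoint logic testable without loading any weights, and lets the translation step be skipped entirely when the direct score is already decisive, which helps the real-time target.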

6.5 Methodology (Process)

6.6 Work Breakdown Structure

Semantic Plagiarism Detection Component

Corpus Building (W1-W3)

Mine parallel texts

Sentence alignment

Quality Check

Model Training (W4-W6)

Fine-tune LaBSE

Validate embeddings

Threshold tuning

Translation Module (W7-W8)

Set up MarianMT

Evaluate BLEU on sample pairs

Integration (W9-W10)

Build Similarity API

Merge with translation fallback

Testing & Deployment (W11-W12)

Load Test API

Write user docs

Push Docker Image

6.7 Development Progression II

Expanded the Similarity.lk application into a unified platform that supports both similarity-based and semantic plagiarism detection within one interface.
Designed a reusable preprocessing pipeline for English and Sinhala document handling so the same system can support cross-language semantic comparison.
Structured the backend modularly, with separate folders for models, utilities, templates, reports, uploads, and static resources, for easier model integration.
Built a common result management workflow to display analysis status, plagiarism confidence, downloadable reports, and past document records through the dashboard.
Standardised the reporting format so both modules can present matched sentences, confidence values, and source-based evidence in a consistent output.
Completed the application shell and a deployment-ready project structure that can support the final integration of the English-to-Sinhala semantic plagiarism detector.

Standardised Reporting Format

7. COMMERCIALIZATION PLAN

Branding and Logo

8. BUDGET

No.	Cost Item	Description	Amount (LKR)
1	Data collection	Web-crawling bandwidth, paid archives for Sinhala and English text	10 000
2	Annotation labour	Two part-time linguists for sentence-level labelling (50 hours)	15 000
3	Cloud compute (GPU)	Training and evaluation on rented GPU instances (120 GPU-hours)	12 000
4	Software & APIs	Translation API credits, repository hosting, SSL certificate	8 000
5	Domain & server hosting	Similarity.lk domain for 12 months and VPS for demo	5 000
6	Printing & presentation	Posters, handouts, binding of final report	3 000
7	Contingency (≈12 %)	Covers exchange-rate change and minor unforeseen costs	7 000
Total			60 000

9. GANTT CHART

10. EXTERNAL SUPPORT

Mr. Bimsara Kumarasinghe

Bachelor of Information Technology (UCSC), Master of Computer Science (UCSC) (Reading)

THANK YOU!
