SINHALA SIMILARITY AND SEMANTIC PLAGIARISM DETECTION USING MODERN EMBEDDING MODELS

Group/Project ID: 25-26J-545

3/9/2026

GROUP MEMBERS

Ranaweera R.R.M.I.M.

Abhayasiri S.A.U.D.

Supervisor

Dinith Primal

1. INTRODUCTION

  • Plagiarism weakens academic credibility.
  • Most checkers focus on English and ignore Sinhala text.
  • Sri Lankan universities handle Sinhala scripts by manual reading.
  • Machine translation lets writers copy English sources and repackage them in Sinhala.
  • This project builds an automated checker for:
      • Sinhala-Sinhala copying
      • English-to-Sinhala translated copying
  • The system runs at sentence level, searches the web, and reports matched lines.
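
The sentence-level flow in the last bullet can be sketched roughly as follows. This is an illustration only: the character-trigram Jaccard measure, the 0.6 threshold, and the names `split_sentences`, `char_ngrams`, and `matched_lines` are assumptions made for the example, not the project's actual method, and the web-search step is replaced by an in-memory dictionary of candidate sources.

```python
import re


def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace (a simplifying assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def char_ngrams(sentence, n=3):
    """Return the set of character n-grams of a whitespace-normalised sentence."""
    text = re.sub(r"\s+", " ", sentence.strip())
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return grams or {text}  # very short sentences fall back to the whole string


def jaccard(a, b):
    """Jaccard overlap of two sets, 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matched_lines(document, sources, threshold=0.6):
    """Report (sentence number, source name, score) for sentences above the threshold.

    `sources` maps a hypothetical source name to its text; in the real system
    these candidates would come from the web search step.
    """
    report = []
    for number, sentence in enumerate(split_sentences(document), start=1):
        grams = char_ngrams(sentence)
        best_score, best_source = 0.0, None
        for name, text in sources.items():
            for candidate in split_sentences(text):
                score = jaccard(grams, char_ngrams(candidate))
                if score > best_score:
                    best_score, best_source = score, name
        if best_score >= threshold:
            report.append((number, best_source, round(best_score, 2)))
    return report
```

Character n-grams are used here only because they need no Sinhala tokeniser; an embedding-based score could be dropped into the same loop.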

2. RESEARCH PROBLEM

  • Word2vec cosine matching reached 97% on only fifty news articles [1].
  • A rule-based web scraper scored 88% yet fails on deep paraphrase [3].
  • A Siamese LSTM improved sentence-similarity scoring but was never integrated into a full detector [2].
  • No current tool flags cross-language plagiarism between English and Sinhala.
  • Small private datasets limit trust in the models and prevent public benchmarking.
  • Goal: deliver accurate detection on a large open corpus and handle translated copying.

Figure 1: Siamese LSTM Architecture [4]

3. OBJECTIVES

Main Objective

  • Create an automated plagiarism checker for Sinhala and English

Sub-Objectives

  • Sinhala similarity detection
  • Cross-language semantic detection

4. SYSTEM ARCHITECTURE

5. DETECTION OF SINHALA SIMILARITY PLAGIARISM

IT21187414 | R.R.M.I.M. Ranaweera

5.1 Research Questions

1. Can contextual embeddings capture similarity between Sinhala sentences that use paraphrase and synonym swaps?
2. What similarity threshold balances precision and recall for Sinhala plagiarism detection?
3. Does a diverse training corpus improve generalisation across academic, news, and social text?
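
One way to approach the threshold question is to sweep candidate thresholds over labelled sentence pairs and keep the one with the best F1. The sketch below assumes similarity scores in the range 0 to 1 and binary labels (1 = plagiarised pair); the function names and the candidate grid are illustrative assumptions, not the project's evaluation code.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall and F1 when pairs scoring >= threshold are flagged."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def best_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (lowest one on ties)."""
    if candidates is None:
        # Assumed grid: 0.05, 0.10, ..., 0.95
        candidates = [i / 100 for i in range(5, 100, 5)]
    return max(candidates,
               key=lambda t: precision_recall_f1(scores, labels, t)[2])
```

F1 weights precision and recall equally; if false accusations are considered costlier than misses, the same sweep can select on precision at a minimum recall instead.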

5.2 Research Gap

Systems compared: Proposed System, Kasthuri 2019 [1], Nilaxan 2021 [2], Rajamanthri 2021 [3]

Features compared:

  • Large, balanced Sinhala corpus
  • Contextual embeddings (mBERT): ✗ (fastText only)
  • Handles deep paraphrase
  • Web-scale source crawl
  • Real-time API response
  • Public dataset release

5.3 Technologies and Frameworks

  • fastText for subword Sinhala embeddings
  • Multilingual BERT (mBERT) fine-tuned for Sinhala
  • Siamese LSTM built in TensorFlow
  • Flask API for model serving
  • PostgreSQL store for embeddings and match logs

5.4 Objectives for Sub-Component

  • Build a balanced Sinhala plagiarism corpus of two thousand labelled documents.
  • Train a Siamese LSTM that reaches an F1 score above 0.90 on sentence-level detection.
  • Deploy the model as an API that returns scores in under one second per sentence.
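
The one-second target can be checked with a small timing harness such as the sketch below. The choice of the 95th percentile as the pass criterion and the names `measure_latencies`, `percentile`, and `meets_latency_target` are assumptions for illustration; `score_fn` stands in for whatever call returns a similarity score for one sentence.

```python
import math
import time


def measure_latencies(score_fn, sentences):
    """Time score_fn on each sentence and return elapsed seconds per call."""
    timings = []
    for sentence in sentences:
        start = time.perf_counter()
        score_fn(sentence)
        timings.append(time.perf_counter() - start)
    return timings


def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_latency_target(timings, target_seconds=1.0, pct=95):
    """True when the chosen percentile of the timings is under the target."""
    return percentile(timings, pct) < target_seconds
```

Reporting a high percentile rather than the mean keeps a few slow sentences from being hidden by many fast ones.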

5.5 Methodology (Process)

5.6 Work-Breakdown Structure

Similarity Plagiarism Detection Component

  • Data Collection (W1-W3)
      • Crawl news sites
      • Gather student sample essays
      • Collect blogs and forums
      • Remove duplicates
  • Data Annotation (W4-W5)
      • Draft guidelines
      • Validate and fine-tune model
      • Dual-label sentences
      • Check inter-rater agreement
      • Resolve disagreements
  • Embedding Preparation (W6-W7)
      • Train fastText vectors
      • Fine-tune mBERT
      • Build tokeniser and normaliser
  • Model Development (W8-W10)
      • Design Siamese LSTM
      • Train on labelled pairs
      • Tune similarity threshold
      • Validate with cross-fold validation
  • Deployment (W11-W12)
      • Dockerise model
      • Build Flask API
      • Run load and latency tests
      • Write user documentation
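
For the annotation step, inter-rater agreement between two annotators is commonly summarised with Cohen's kappa. The sketch below is a plain-Python version under the assumption that each annotator assigns one label per sentence pair; the source does not state which agreement statistic the project uses, and the function names are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items the annotators labelled differently, for later resolution."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

The indices returned by `disagreements` give the list of sentences to revisit in the "Resolve disagreements" step.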

5.7 Development Progression II

  • Developed the full Flask-based web application for Sinhala similarity plagiarism detection, with secure login, document upload, dashboard view, and report download.
  • Integrated the trained Random Forest model and the saved TF-IDF vectorizer into the backend so the system can run live plagiarism prediction on uploaded documents.
  • Implemented text extraction for PDF and Word files, including an OCR fallback for legacy Sinhala PDF documents using PyMuPDF and Tesseract.
  • Built the sentence-processing pipeline to split extracted text, compare sentence pairs, calculate similarity features, assign confidence scores, and identify suspicious content.
  • Generated automated PDF reports with overall plagiarism percentage, sentence-level highlights, source attribution, and document summary statistics.
  • Added document history, file deletion, upload status handling, and dashboard analytics to complete an end-to-end working prototype.
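
As a rough illustration of the "similarity features" computed for a sentence pair, the sketch below derives a TF-IDF cosine score plus two simple overlap features in plain Python. The feature set, the whitespace tokenisation, and fitting the IDF on the pair itself are assumptions for the example; the deployed system uses its own saved vectorizer and Random Forest, whose exact features are not listed here. The smoothed IDF formula follows the style of scikit-learn's default.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Sparse TF-IDF vectors (dicts) with smoothed IDF: ln((1+n)/(1+df)) + 1."""
    tokenised = [s.split() for s in sentences]
    n_docs = len(tokenised)
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    return [{tok: count * idf[tok] for tok, count in Counter(toks).items()}
            for toks in tokenised]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_features(sent_a, sent_b, background=()):
    """Hypothetical feature dict for one sentence pair.

    `background` optionally supplies extra sentences so the IDF weights are
    not estimated from the pair alone.
    """
    vec_a, vec_b = tfidf_vectors([sent_a, sent_b, *background])[:2]
    tokens_a, tokens_b = set(sent_a.split()), set(sent_b.split())
    union = tokens_a | tokens_b
    return {
        "tfidf_cosine": cosine(vec_a, vec_b),
        "token_jaccard": len(tokens_a & tokens_b) / len(union) if union else 0.0,
        "length_ratio": min(len(sent_a), len(sent_b))
                        / max(len(sent_a), len(sent_b), 1),
    }
```

A feature dictionary of this kind is what a classifier such as a Random Forest would consume to produce a per-pair confidence score.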

Web Application Interface and Report Generation Process

REFERENCES

[1] T. KasthuriArachchi and E. Y. A. Charles, “Deep Learning Approach to Detect Plagiarism in Sinhala Text,” Dec. 2019, doi: https://doi.org/10.1109/iciis47346.2019.9063299.

[2] S. Nilaxan and S. Ranathunga, “Monolingual Sentence Similarity Measurement Using Siamese Neural Networks for Sinhala and Tamil Languages,” 2021 Moratuwa Engineering Research Conference (MERCon), pp. 567–572, Jul. 2021, doi: https://doi.org/10.1109/mercon52712.2021.9525786.

[3] L. Rajamanthri and S. Thelijjagoda, “Plagiarism Detection Tool for Sinhala Language with Internet Resources Using Natural Language Processing,” pp. 156–160, Aug. 2021, doi: https://doi.org/10.1109/iciafs52090.2021.9605852.

[4] D. Kharazi, “Data Science,” Github.io, 2025. https://dkharazi.github.io/notes/ml/nlp/siamese (accessed Sep. 17, 2025).

6. DETECTION OF ENGLISH TO SINHALA SEMANTIC PLAGIARISM

IT21163968 | Abhayasiri S.A.U.D.

6.1 Research Questions

1. Can bilingual embeddings detect English text translated into Sinhala without back-translation?
2. What score marks a sentence pair as translated plagiarism while keeping false alarms low?
3. Does adding domain-mixed data improve cross-language matching in news and academic prose?

6.2 Research Gap

Systems compared: Proposed System, Kasthuri 2019 [1], Nilaxan 2021 [2], Rajamanthri 2021 [3]

Features compared:

  • Cross-language plagiarism handling
  • Bilingual sentence embeddings (LaBSE)
  • Machine translation fallback
  • Deep paraphrase after translation
  • Web crawl of English sources
  • API response under 1 s

6.3 Technologies and Frameworks

  • LaBSE and LASER for bilingual embeddings
  • MarianMT for sentence-level translation
  • Hugging Face Transformers, Sentence-Transformers
  • Python, TensorFlow, PyTorch
  • Flask REST API
  • PostgreSQL for match logs
  • Docker and Git for deployment and version control

6.4 Objectives for Sub-Component

  • Build a parallel Sinhala-English sentence set of ten thousand pairs.
  • Train or fine-tune LaBSE to reach an F1 score above 0.88 on translated plagiarism detection.
  • Integrate translation and similarity services into one endpoint that answers in real time.
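
A single scoring function that combines the bilingual-embedding check with a translation fallback might look like the sketch below. The decision rule (fall back only when the embedding score lands in an uncertain band between `low` and `high`), the band limits, and all function names are assumptions for illustration. `embed`, `translate`, and `si_similarity` are injected callables standing in for a bilingual encoder such as LaBSE, a translation model such as MarianMT, and a Sinhala-Sinhala similarity scorer.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_language_score(en_sentence, si_sentence, embed,
                         translate=None, si_similarity=None,
                         low=0.5, high=0.8):
    """Return (score, method) for an English/Sinhala sentence pair.

    The bilingual embedding score is used directly when it is clearly low or
    clearly high. In the assumed uncertain band [low, high) the English
    sentence is translated and compared on the Sinhala side, and the higher
    of the two scores is returned.
    """
    score = cosine(embed(en_sentence), embed(si_sentence))
    if low <= score < high and translate and si_similarity:
        fallback = si_similarity(translate(en_sentence), si_sentence)
        return max(score, fallback), "translation_fallback"
    return score, "bilingual_embedding"
```

Calling the translation model only for borderline pairs keeps the common path to a single embedding comparison, which matters for a real-time endpoint.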

6.5 Methodology (Process)

6.6 Work Breakdown Structure

Semantic Plagiarism Detection Component

  • Corpus Building (W1-W3)
      • Mine parallel texts
      • Sentence alignment
      • Quality check
  • Model Training (W4-W6)
      • Fine-tune LaBSE
      • Validate embeddings
      • Threshold tuning
  • Translation Module (W7-W8)
      • Set up MarianMT
      • Evaluate BLEU on sample pairs
  • Integration (W9-W10)
      • Build similarity API
      • Merge with translation fallback
  • Testing & Deployment (W11-W12)
      • Load-test API
      • Write user docs
      • Push Docker image

6.7 Development Progression II

  • Expanded the Similarity.lk application into a unified platform that supports both similarity-based and semantic plagiarism detection within one interface.
  • Designed a reusable preprocessing pipeline for English and Sinhala documents so the same system can support cross-language semantic comparison.
  • Structured the backend modularly, with separate folders for models, utilities, templates, reports, uploads, and static resources, to simplify model integration.
  • Built a common result-management workflow that shows analysis status, plagiarism confidence, downloadable reports, and past document records through the dashboard.
  • Standardised the reporting format so both modules present matched sentences, confidence values, and source-based evidence in a consistent output.
  • Completed the application shell and a deployment-ready project structure that can support the final integration of the English-to-Sinhala semantic plagiarism detector.
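
A shared report structure of the kind described above could be modelled as in the sketch below. The class names, field names, and the 0.5 flagging threshold are hypothetical; they only illustrate how both modules might emit matched sentences, confidence values, and source evidence in one shape.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SentenceMatch:
    """One analysed sentence with its confidence and source evidence."""
    sentence: str
    confidence: float
    source: Optional[str] = None        # URL or document name, if known
    matched_text: Optional[str] = None  # the source sentence it matched


@dataclass
class PlagiarismReport:
    """Report shape shared by the similarity and semantic modules (assumed)."""
    document_name: str
    module: str  # e.g. "similarity" or "semantic"
    total_sentences: int = 0
    matches: List[SentenceMatch] = field(default_factory=list)

    def plagiarism_percentage(self, threshold=0.5):
        """Share of sentences whose confidence reaches the assumed threshold."""
        if self.total_sentences == 0:
            return 0.0
        flagged = sum(1 for m in self.matches if m.confidence >= threshold)
        return round(100.0 * flagged / self.total_sentences, 1)

    def to_json(self):
        """Serialise the report, keeping Sinhala text readable in the output."""
        payload = asdict(self)
        payload["plagiarism_percentage"] = self.plagiarism_percentage()
        return json.dumps(payload, ensure_ascii=False, indent=2)
```

Keeping `ensure_ascii=False` means Sinhala sentences appear as text rather than escape sequences when the JSON is inspected or rendered.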

Standardised Reporting Format

7. COMMERCIALIZATION PLAN

Branding and Logo

8. BUDGET

No. | Cost Item | Description | Amount (LKR)
1 | Data collection | Web-crawling bandwidth, paid archives for Sinhala and English text | 10 000
2 | Annotation labour | Two part-time linguists for sentence-level labelling (50 hours) | 15 000
3 | Cloud compute (GPU) | Training and evaluation on rented GPU instances (120 GPU-hours) | 12 000
4 | Software & APIs | Translation API credits, repository hosting, SSL certificate | 8 000
5 | Domain & server hosting | Similarity.lk domain for 12 months and VPS for demo | 5 000
6 | Printing & presentation | Posters, handouts, binding of final report | 3 000
7 | Contingency (≈12 %) | Covers exchange-rate change and minor unforeseen costs | 7 000
Total | | | 60 000
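
The arithmetic behind the total can be checked in a few lines. The figures are taken directly from the table; the variable names are illustrative. The 7 000 LKR contingency works out to about 11.7 % of the 60 000 total (close to the stated ≈12 %) and about 13.2 % of the 53 000 subtotal before contingency.

```python
# Amounts in LKR, copied from the budget table.
budget_lkr = {
    "Data collection": 10_000,
    "Annotation labour": 15_000,
    "Cloud compute (GPU)": 12_000,
    "Software & APIs": 8_000,
    "Domain & server hosting": 5_000,
    "Printing & presentation": 3_000,
    "Contingency": 7_000,
}

total = sum(budget_lkr.values())
subtotal = total - budget_lkr["Contingency"]

# Contingency expressed against the total and against the pre-contingency subtotal.
contingency_pct_of_total = round(100 * budget_lkr["Contingency"] / total, 1)
contingency_pct_of_subtotal = round(100 * budget_lkr["Contingency"] / subtotal, 1)

print(total, subtotal, contingency_pct_of_total, contingency_pct_of_subtotal)
```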

9. GANTT CHART

10. EXTERNAL SUPPORT

Mr. Bimsara Kumarasinghe

Bachelor of Information Technology (UCSC), Master of Computer Science (UCSC) (Reading)

THANK YOU!
