SINHALA SIMILARITY AND SEMANTIC PLAGIARISM DETECTION USING MODERN EMBEDDING MODELS
Group/Project ID: 25-26J-545
3/9/2026
GROUP MEMBERS
Ranaweera R.R.M.I.M.
Abhayasiri S.A.U. D
Supervisor
Dinith Primal
1. INTRODUCTION
• Sinhala-Sinhala copying
• English-to-Sinhala translated copying
2. RESEARCH PROBLEM
Figure 1: Siamese LSTM Architecture [4]
3. OBJECTIVES
Main Objective
Create an automated plagiarism checker for Sinhala and English
Sub-Objectives
• Sinhala similarity detection
• Cross-language semantic detection
4. SYSTEM ARCHITECTURE
5. DETECTION OF SINHALA SIMILARITY PLAGIARISM
5.1 Research Questions
IT21187414 | R.R.M.I.M.Ranaweera |
1. Can contextual embeddings capture similarity between Sinhala sentences that use paraphrase and synonym swaps?
2. What similarity threshold balances precision and recall for Sinhala plagiarism detection?
3. Does a diverse training corpus improve generalisation across academic, news, and social text?
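The threshold question can be explored by sweeping candidate cut-offs over labelled similarity scores and keeping the one with the highest F1. The sketch below is one possible way to do that, not the project's confirmed procedure; the function names, the scores, and the labels are all made-up illustrative values.

```python
def precision_recall_f1(scores, labels, threshold):
    """Treat score >= threshold as a 'plagiarised' prediction (label 1)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates):
    """Return (threshold, f1) with the highest F1; ties keep the lowest threshold."""
    best = (None, -1.0)
    for t in candidates:
        f1 = precision_recall_f1(scores, labels, t)[2]
        if f1 > best[1]:
            best = (t, f1)
    return best


# Illustrative similarity scores and gold labels (1 = plagiarised pair).
scores = [0.95, 0.88, 0.81, 0.74, 0.62, 0.55, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
candidates = [i / 20 for i in range(6, 20)]  # 0.30, 0.35, ..., 0.95

threshold, f1 = best_threshold(scores, labels, candidates)
print(threshold, round(f1, 3))
```

On this toy data the sweep selects 0.60. With real data the sweep would run on a held-out validation split so the chosen threshold is not tuned on the test set.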
5.2 Research Gap
Features | Proposed System | KasthuriArachchi 2019 [1] | Nilaxan 2021 [2] | Rajamanthri 2021 [3] |
Large, balanced Sinhala corpus | ✓ | ✗ | ✗ | ✗ |
Contextual embeddings (mBERT) | ✓ | ✗ | ✗ (fastText only) | ✗ |
Handles deep paraphrase | ✓ | ✗ | ✓ | ✗ |
Web-scale source crawl | ✓ | ✗ | ✗ | ✓ |
Real-time API response | ✓ | ✗ | ✗ | ✗ |
Public dataset release | ✓ | ✗ | ✗ | ✗ |
5.3 Technologies and Frameworks
• fastText for subword Sinhala embeddings.
• Multilingual BERT fine-tuned for Sinhala.
• Siamese LSTM built in TensorFlow.
• Flask API for model serving.
• PostgreSQL store for embeddings and match logs.
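To illustrate what "subword" means for the fastText item above: fastText represents a word through the character n-grams of the word wrapped in `<` and `>` boundary markers, with lengths 3 to 6 by default. The sketch below shows only that extraction step; `char_ngrams` is a hypothetical helper, not part of the fastText API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# The n-grams are taken over Unicode code points, so Sinhala vowel signs
# count as separate characters from the consonant they attach to.
print(char_ngrams("where", 3, 3))
print(char_ngrams("පොත", 3, 4))
```

Because unseen or inflected Sinhala words still share n-grams with known words, a subword model can produce a vector for them, which is the usual motivation for choosing it over whole-word embeddings.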
5.4 Objectives for Sub-Component
Build a balanced Sinhala plagiarism corpus with two thousand labelled documents.
Train a Siamese LSTM that reaches F1 above 0.90 on sentence-level detection.
Deploy the model as an API that returns scores in under one second per sentence.
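The one-second target can be checked with a small timing harness around whatever function scores a sentence pair. The sketch below is a minimal illustration under that assumption; `timed_call`, `within_budget`, and `dummy_score` are hypothetical names, and `dummy_score` only stands in for the real model call.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def within_budget(latencies, budget_seconds=1.0):
    """True if every observed latency is below the budget."""
    return all(t < budget_seconds for t in latencies)


def dummy_score(sentence_a, sentence_b):
    # Placeholder for the real similarity model (assumption).
    return 1.0 if sentence_a == sentence_b else 0.0


latencies = []
for _ in range(5):
    _, elapsed = timed_call(dummy_score, "sentence one", "sentence two")
    latencies.append(elapsed)
print(within_budget(latencies))
```

In a load test the same check would be applied to latencies measured at the API boundary, since network and serialisation time also count against the budget.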
5.5 Methodology (Process)
5.6 Work-Breakdown Structure
Similarity Plagiarism Detection Component
Data Collection (W1-W3)
• Crawl news sites
• Gather student sample essays
• Collect blogs and forums
• Remove duplicates
Data Annotation (W4-W5)
• Draft guidelines
• Validate and fine-tune model
• Dual-label sentences
• Check inter-rater agreement
• Resolve disagreements
Embedding Preparation (W6-W7)
• Train fastText vectors
• Fine-tune mBERT
• Build tokeniser and normaliser
Model Development (W8-W10)
• Design Siamese LSTM
• Train on labelled pairs
• Tune similarity threshold
• Validate with k-fold cross-validation
Deployment (W11-W12)
• Dockerise model
• Build Flask API
• Run load and latency tests
• Write user documentation
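The annotation phase lists an inter-rater agreement check without naming a metric. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes that metric; the label lists are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = plagiarised sentence pair, 0 = not plagiarised.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohen_kappa(annotator_1, annotator_2), 2))
```

Here the annotators agree on 8 of 10 items and chance agreement is 0.5, giving a kappa of 0.6. Items where the two labels differ are the ones that would go to the "Resolve disagreements" step.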
5.7 Development Progression II
Web Application Interface and Report Generation Process
REFERENCES
[1] T. KasthuriArachchi and E. Y. A. Charles, “Deep Learning Approach to Detect Plagiarism in Sinhala Text,” Dec. 2019, doi: https://doi.org/10.1109/iciis47346.2019.9063299.
[2] S. Nilaxan and S. Ranathunga, “Monolingual Sentence Similarity Measurement Using Siamese Neural Networks for Sinhala and Tamil Languages,” 2021 Moratuwa Engineering Research Conference (MERCon), pp. 567–572, Jul. 2021, doi: https://doi.org/10.1109/mercon52712.2021.9525786.
[3] L. Rajamanthri and S. Thelijjagoda, “Plagiarism Detection Tool for Sinhala Language with Internet Resources Using Natural Language Processing,” pp. 156–160, Aug. 2021, doi: https://doi.org/10.1109/iciafs52090.2021.9605852.
[4] D. Kharazi, “Data Science,” Github.io, 2025. https://dkharazi.github.io/notes/ml/nlp/siamese (accessed Sep. 17, 2025).
6. DETECTION OF ENGLISH TO SINHALA SEMANTIC PLAGIARISM
6.1 Research Questions
IT21163968 | Abhayasiri S.A.U. D |
1. Can bilingual embeddings detect English text translated into Sinhala without back-translation?
2. What score marks a sentence pair as translated plagiarism while keeping false alarms low?
3. Does adding domain-mixed data improve cross-language matching in news and academic prose?
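Questions 1 and 2 both rest on comparing an English sentence embedding with a Sinhala one in a shared vector space. A common way to score such a pair is cosine similarity followed by a cut-off. The sketch below assumes that approach; the three-dimensional vectors and the 0.8 threshold are placeholders, not values from the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_u * norm_v)


def is_translated_match(en_vec, si_vec, threshold=0.8):
    """Flag the pair when similarity reaches the (placeholder) threshold."""
    return cosine_similarity(en_vec, si_vec) >= threshold


# Toy embeddings standing in for bilingual model outputs (assumption).
english_sentence = [0.6, 0.8, 0.0]
sinhala_translation = [0.5, 0.8, 0.1]
sinhala_unrelated = [0.0, 0.1, 0.9]

print(is_translated_match(english_sentence, sinhala_translation))
print(is_translated_match(english_sentence, sinhala_unrelated))
```

No back-translation appears anywhere in this path: the decision uses only the two embeddings, which is the property question 1 asks about.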
6.2 Research Gap
Features | Proposed System | KasthuriArachchi 2019 [1] | Nilaxan 2021 [2] | Rajamanthri 2021 [3] |
Cross-language plagiarism handling | ✓ | ✗ | ✗ | ✗ |
Bilingual sentence embeddings (LaBSE) | ✓ | ✗ | ✗ | ✗ |
Machine translation fallback | ✓ | ✗ | ✗ | ✗ |
Deep paraphrase after translation | ✓ | ✗ | ✗ | ✗ |
Web crawl of English sources | ✓ | ✗ | ✗ | ✗ |
API response under 1 s | ✓ | ✗ | ✗ | ✗ |
6.3 Technologies and Frameworks
6.4 Objectives for Sub-Component
Build a parallel Sinhala-English sentence set with ten thousand pairs.
Train or fine-tune LaBSE to push F1 above 0.88 on translated plagiarism detection.
Integrate translation and similarity services into one endpoint that answers in real time.
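The third objective combines two services behind one endpoint, but the slides do not say when the translation path is used. One plausible design, sketched below as an assumption, scores the pair directly with the bilingual model first and calls the translator only when that score falls in an uncertain band. The three callables, the translation direction, and both cut-offs are hypothetical placeholders.

```python
def check_pair(english, sinhala, bilingual_score, translate_en_to_si,
               monolingual_score, accept=0.80, uncertain=0.60):
    """Score an English/Sinhala sentence pair with a translation fallback."""
    direct = bilingual_score(english, sinhala)
    # Clear accept or clear reject: skip the slower translation step.
    if direct >= accept or direct < uncertain:
        return {"score": direct, "used_fallback": False,
                "plagiarised": direct >= accept}
    # Uncertain band: translate the English side, then compare
    # Sinhala with Sinhala using the monolingual scorer.
    translated = translate_en_to_si(english)
    fallback = monolingual_score(translated, sinhala)
    score = max(direct, fallback)
    return {"score": score, "used_fallback": True,
            "plagiarised": score >= accept}


# Stub services so the sketch runs without any model installed.
result = check_pair(
    "an English sentence", "a Sinhala sentence",
    bilingual_score=lambda en, si: 0.70,
    translate_en_to_si=lambda en: "translated: " + en,
    monolingual_score=lambda a, b: 0.90,
)
print(result)
```

Passing the services in as arguments keeps the decision logic testable on its own, and limiting translation to the uncertain band helps with the real-time requirement because most pairs never reach the translator.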
6.5 Methodology (Process)
6.6 Work Breakdown Structure
Semantic Plagiarism Detection Component
Corpus Building (W1-W3)
• Mine parallel texts
• Sentence alignment
• Quality check
Model Training (W4-W6)
• Fine-tune LaBSE
• Validate embeddings
• Threshold tuning
Translation Module (W7-W8)
• Set up MarianMT
• Evaluate BLEU on sample pairs
Integration (W9-W10)
• Build similarity API
• Merge with translation fallback
Testing & Deployment (W11-W12)
• Load-test API
• Write user docs
• Push Docker image
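For the BLEU evaluation step, the core of the metric is the clipped ("modified") n-gram precision combined with a brevity penalty. The sketch below is a simplified sentence-level version with one reference, whitespace tokens, n-grams up to length 2, and no smoothing. All of these are simplifying assumptions, and a real evaluation would normally use an established corpus-level implementation such as sacreBLEU.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Candidate n-gram matches, clipped by their count in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of modified precisions times the brevity penalty."""
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(log_mean)


reference = "the cat sat on the mat".split()
print(simple_bleu("the cat sat on the mat".split(), reference))
print(simple_bleu("the cat sat on a mat".split(), reference))
```

The clipping step is what stops a translation from scoring well by repeating a common word: each candidate n-gram is credited at most as many times as it occurs in the reference.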
6.7 Development Progression II
Standardised Reporting Format
7. COMMERCIALIZATION PLAN
Branding and Logo
8. BUDGET
No. | Cost Item | Description | Amount (LKR) |
1 | Data collection | Web-crawling bandwidth, paid archives for Sinhala and English text | 10 000 |
2 | Annotation labour | Two part-time linguists for sentence-level labelling (50 hours) | 15 000 |
3 | Cloud compute (GPU) | Training and evaluation on rented GPU instances (120 GPU-hours) | 12 000 |
4 | Software & APIs | Translation API credits, repository hosting, SSL certificate | 8 000 |
5 | Domain & server hosting | Similarity.lk domain for 12 months and VPS for demo | 5 000 |
6 | Printing & presentation | Posters, handouts, binding of final report | 3 000 |
7 | Contingency (≈12 %) | Covers exchange-rate change and minor unforeseen costs | 7 000 |
Total | | | 60 000 |
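The budget arithmetic can be re-checked by summing the line items and computing the contingency share. The sketch below copies the amounts from the table; the variable names are illustrative only.

```python
# Amounts in LKR, copied from the budget table.
budget_lkr = {
    "Data collection": 10_000,
    "Annotation labour": 15_000,
    "Cloud compute (GPU)": 12_000,
    "Software & APIs": 8_000,
    "Domain & server hosting": 5_000,
    "Printing & presentation": 3_000,
    "Contingency": 7_000,
}

total = sum(budget_lkr.values())
contingency = budget_lkr["Contingency"]

# Contingency as a share of the full budget and of the other six items.
share_of_total = contingency / total
share_of_other_items = contingency / (total - contingency)

print(total)
print(round(share_of_total * 100, 1))
print(round(share_of_other_items * 100, 1))
```

The items sum to 60 000 LKR, matching the stated total. The contingency is about 11.7 % of the full budget, consistent with the "≈12 %" in the table, and about 13.2 % when measured against the other six items alone.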
9. GANTT CHART
10. EXTERNAL SUPPORT
Mr. Bimsara Kumarasinghe
Bachelor of Information Technology (UCSC), Master of Computer Science (UCSC) (Reading)
THANK YOU!