1 of 28

X International Conference "Information Technology and Implementation" (IT&I-2023), November 20, 2023, Taras Shevchenko National University of Kyiv, Kyiv, Ukraine


Comparison of Transformer-Based Deep Learning Methods for the Paraphrase Identification Task

Vitalii Vrublevskyi

Oleksandr Marchenko

Taras Shevchenko National University of Kyiv

Dedicated to the tenth anniversary of the Faculty of Information Technology

2 of 28

Global task

Understand and explore the semantic meaning of sentences.


3 of 28

Formulation of the problem

Paraphrase Identification: the system receives two sentences and needs to decide whether they have the same meaning (i.e., they are paraphrases) or not.


4 of 28

Formulation of the problem

Semantic Similarity: in this case, the system needs to estimate the degree of similarity between two sentences.


5 of 28

Paraphrase Example

She decided to postpone the meeting until Friday.

The meeting was rescheduled to Friday by her.


6 of 28

Overview of different approaches to paraphrase identification

  1. Traditional Rule-Based Methods
  2. Machine Learning-Based Methods
  3. Neural Networks (RNN, LSTM, Transformers)


7 of 28

Traditional Rule-Based Methods

Traditional paraphrase detection approaches rely on handcrafted rules, linguistic patterns, and syntactic analysis.
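As a toy illustration of this family of approaches (not taken from the presentation), a minimal overlap rule might look as follows; the tokenisation and threshold are arbitrary assumptions:

```python
# A minimal rule-based baseline (illustrative only): call two sentences
# paraphrases when their word overlap exceeds a hand-picked threshold.

def jaccard_overlap(s1: str, s2: str) -> float:
    """Jaccard similarity over lower-cased word sets."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def is_paraphrase(s1: str, s2: str, threshold: float = 0.5) -> bool:
    return jaccard_overlap(s1, s2) >= threshold

# The paraphrase example from the earlier slide shares few surface words,
# so this simple rule misses it -- exactly the weakness that motivates
# the learned approaches on the following slides.
print(is_paraphrase("She decided to postpone the meeting until Friday.",
                    "The meeting was rescheduled to Friday by her."))  # False
```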


8 of 28

Machine Learning-Based Methods

Models: SVM, Random Forests, and logistic regression.

These methods rely on feature engineering: various linguistic and statistical features are extracted from the text pairs and used to train classifiers.
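A rough sketch of what such a pipeline can look like (the feature set and classifier settings here are illustrative assumptions, not the exact features used in the cited work):

```python
# Hand-crafted pair features fed into a classical classifier (SVM).
import numpy as np
from sklearn.svm import SVC

def pair_features(s1: str, s2: str) -> list:
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2) / max(len(w1 | w2), 1)  # lexical overlap
    len_diff = abs(len(w1) - len(w2))              # length difference
    return [overlap, len_diff]

def train_classifier(pairs, labels):
    """pairs: list of (sentence1, sentence2); labels: 1 = paraphrase, 0 = not."""
    X = np.array([pair_features(a, b) for a, b in pairs])
    clf = SVC(kernel="rbf")
    return clf.fit(X, np.array(labels))
```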


9 of 28

Transformer-Based Models

Transformer-based models, such as BERT, RoBERTa, and DeBERTa.

These models are pre-trained on large text corpora and capture contextual information effectively. Fine-tuning them on paraphrase identification datasets has consistently achieved state-of-the-art results.
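A minimal sketch of this setup with the Hugging Face transformers library (the checkpoint name and two-label head are assumptions chosen for illustration, not the authors' exact configuration):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 labels: paraphrase / not paraphrase

# Both sentences go into one input; fine-tuning then updates all weights
# together with the small classification head on top.
inputs = tokenizer("She decided to postpone the meeting until Friday.",
                   "The meeting was rescheduled to Friday by her.",
                   return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2): scores for the two classes
```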


10 of 28

Transformer models overview

We selected several Transformer models for the analysis: ALBERT, BERT, DistilBERT, BART, ELECTRA, MobileBERT, I-BERT, DeBERTa, RoBERTa, and SqueezeBERT.


11 of 28

BERT

BERT pre-trains deep bidirectional representations by considering both left and right context across all layers. It can be fine-tuned with a single additional output layer to perform tasks such as question answering or natural language inference without any task-specific modifications to its architecture.
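A short sketch of how a sentence pair is packed into a single BERT input (the checkpoint name is an assumption); the single output layer mentioned above is typically placed on top of the [CLS] token:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("She decided to postpone the meeting until Friday.",
          "The meeting was rescheduled to Friday by her.")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'she', 'decided', ..., '[SEP]', 'the', 'meeting', ..., '[SEP]']
print(enc["token_type_ids"])  # 0s for the first sentence, 1s for the second
```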


12 of 28

RoBERTa

Explores how the performance of the original BERT can be optimised with more data and longer training.

Changes the training objective by removing the next-sentence-prediction task.

Highlights the importance of hyperparameter selection and its significant impact on performance.


13 of 28

I-BERT

This model addresses the resource-intensive nature of Transformer-based models such as BERT and RoBERTa by introducing a novel quantisation approach: the memory footprint is reduced by representing values with lower bit precision.
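I-BERT's actual contribution is integer-only (INT8) inference for the whole network, including integer approximations of GELU, Softmax, and LayerNorm. As a loosely related illustration of the lower-bit-precision idea (this is not I-BERT's method), PyTorch's dynamic quantisation can shrink the linear layers of a BERT-style model:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Replace floating-point nn.Linear layers with 8-bit quantised versions,
# cutting the memory footprint of those layers roughly 4x.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```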


14 of 28

SqueezeBERT

A model inspired by SqueezeNet, a computer vision model. The authors apply techniques commonly used in computer vision to reduce the model's size and improve its speed, replacing fully connected layers with grouped convolutions.
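A small sketch of the underlying idea (toy dimensions, not SqueezeBERT's exact layer configuration): a position-wise fully connected layer is equivalent to a 1x1 convolution, and adding groups divides its parameter count:

```python
import torch.nn as nn

hidden = 768
dense    = nn.Linear(hidden, hidden)                           # 590,592 parameters
conv_1x1 = nn.Conv1d(hidden, hidden, kernel_size=1)            # same cost, convolution form
grouped  = nn.Conv1d(hidden, hidden, kernel_size=1, groups=4)  # 148,224 parameters (~4x fewer)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(conv_1x1), count(grouped))
```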


15 of 28

ALBERT

ALBERT was created to overcome the challenges of growing model size and longer training times. To tackle these issues, the authors proposed two parameter-reduction techniques (factorised embedding parameterisation and cross-layer parameter sharing, i.e. repeating layers) that reduce memory usage and speed up BERT training, resulting in a small memory footprint, although compute costs remain the same.
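A toy sketch of these two ideas (the dimensions are ALBERT-base-like assumptions, not the exact implementation):

```python
import torch.nn as nn

vocab, E, H, num_layers = 30000, 128, 768, 12

# (1) Factorised embeddings: vocab -> small E, then project E -> hidden H.
bert_style   = nn.Embedding(vocab, H)                                  # ~23.0M parameters
albert_style = nn.Sequential(nn.Embedding(vocab, E), nn.Linear(E, H))  # ~3.9M parameters

# (2) Cross-layer sharing: one encoder layer reused for every "layer",
# so memory shrinks while the amount of computation stays the same.
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True)

def encode(x):
    for _ in range(num_layers):  # the same weights are applied 12 times
        x = shared_layer(x)
    return x
```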



17 of 28

Metrics

• Accuracy and F1 score.

• Size of the model.

• Number of sentences the model can process per second (a computation sketch follows below).
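A minimal sketch of how these numbers can be obtained, assuming a PyTorch model and a hypothetical predict_fn helper (both are assumptions, not code from the presentation):

```python
import time
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, predict_fn, pairs, labels):
    """predict_fn(sentence1, sentence2) -> 0/1 prediction (hypothetical helper)."""
    start = time.perf_counter()
    preds = [predict_fn(a, b) for a, b in pairs]          # model inference
    elapsed = time.perf_counter() - start
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
        "size_parameters": sum(p.numel() for p in model.parameters()),
        "samples_per_second": len(pairs) / elapsed,
    }
```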


18 of 28

Dataset

The Microsoft Research Paraphrase Corpus (MSRP) was used as the basis.
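The corpus is distributed, among other places, as the MRPC task of the GLUE benchmark, so one convenient way to load it (an assumption, not necessarily how the authors obtained it) is via the Hugging Face datasets library:

```python
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")
print(mrpc)               # train / validation / test splits
print(mrpc["train"][0])   # {'sentence1': ..., 'sentence2': ..., 'label': 0 or 1, 'idx': ...}
```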


19 of 28

Experiment details

All fine-tuning for the models described in this section was done on an NVIDIA Tesla T4 with 16 GB GPU RAM and 50 GB system RAM using Google Colab.

• Each model was fine-tuned for only 5 epochs.

• Each model was fine-tuned five times, and we report the mean and standard deviation (a configuration sketch follows).
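A configuration sketch of this protocol using the Hugging Face Trainer arguments; the train_one_model helper and any hyperparameters not stated on the slide are assumptions:

```python
import numpy as np
from transformers import TrainingArguments

def run_experiment(train_one_model):
    """train_one_model(args) -> test accuracy of one fine-tuning run (hypothetical helper)."""
    scores = []
    for seed in range(5):                             # five independent runs
        args = TrainingArguments(output_dir="out",
                                 num_train_epochs=5,  # only 5 epochs per run
                                 seed=seed)
        scores.append(train_one_model(args))
    return np.mean(scores), np.std(scores)            # reported as mean ± std
```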


20 of 28

Experiment details: Small models


Model Name                         | Test Accuracy (%) | Test F1 score (%) | Samples per second
-----------------------------------|-------------------|-------------------|--------------------
ALBERT base                        | 84.87 ± 1.30      | 88.64 ± 1.19      | 164.88 ± 59.51
DistilBERT base                    | 81.25 ± 0.55      | 86.19 ± 0.46      | 438.70 ± 3.01
Google Electra small discriminator | 81.47 ± 0.58      | 86.50 ± 0.38      | 1242.70 ± 39.59
Google MobileBERT                  | 72.44 ± 4.74      | 80.62 ± 3.71      | 526.57 ± 28.87
SqueezeBERT                        | 84.58 ± 0.49      | 88.22 ± 0.40      | 419.06 ± 6.39

21 of 28

Experiment details: Regular models


Model Name                        | Test Accuracy (%) | Test F1 score (%) | Samples per second
----------------------------------|-------------------|-------------------|--------------------
BERT base                         | 80.12 ± 2.05      | 85.50 ± 1.37      | 213.91 ± 3.09
Facebook BART base                | 85.92 ± 0.21      | 89.74 ± 0.18      | 156.89 ± 5.03
Google Electra base discriminator | 85.31 ± 0.47      | 89.24 ± 0.38      | 207.08 ± 8.96
I-BERT RoBERTa base               | 86.23 ± 0.50      | 89.63 ± 0.52      | 220.21 ± 0.91
Microsoft DeBERTa base            | 86.40 ± 0.48      | 89.86 ± 0.34      | 169.76 ± 67.31

22 of 28

Experiment with LLM


Large language models are powerful deep learning models for various natural language processing tasks. They are based on the Transformer architecture and trained on massive datasets.

We selected the Llama 2 model, released in July 2023.

To fine-tune it, we used a single NVIDIA A100 (40 GB) on Google Colab.

23 of 28

Fine-tuning Llama 2
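A hedged sketch of one typical way to fine-tune Llama 2 7B on a single A100 40 GB, using 4-bit loading with bitsandbytes and LoRA adapters from peft; the model name, target modules, and hyperparameters are assumptions, not necessarily the setup used in this work:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
)

# Train only small low-rank adapter matrices instead of all 7B weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Each training example is a prompt that presents the two sentences and asks
# whether they are paraphrases, with "yes"/"no" as the target completion.
```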


24 of 28

Fine-tuning Llama 2


Model Name | Test Accuracy (%) | Test F1 score (%)
-----------|-------------------|-------------------
Llama 2 7b | 84.52             | 88.48

25 of 28

Conclusions


The intended field of application is one of the main criteria for model selection. Smaller Transformer models such as ALBERT or SqueezeBERT are worth investigating when the model needs to perform relatively well on small devices.

If accuracy is more critical, regular or large models will be the better choice.

26 of 28

Conclusions


Large Language Models can also be used to detect paraphrases, but the cost and complexity of their fine-tuning are significantly higher.

27 of 28

Conclusions


Further improvements in model performance can be made by exploring larger models with even more parameters (for example, the Llama 2 70B model) or by including more sentence-aware context in the Transformer architecture. A good example is the DeBERTa model, which builds separate representations for a word's content and its position.
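For reference, the disentangled attention score from the DeBERTa paper combines content and relative-position representations (notation as in the paper):

$$
\tilde{A}_{i,j} = \underbrace{Q^{c}_{i}\,{K^{c}_{j}}^{\top}}_{\text{content-to-content}}
                + \underbrace{Q^{c}_{i}\,{K^{r}_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
                + \underbrace{K^{c}_{j}\,{Q^{r}_{\delta(j,i)}}^{\top}}_{\text{position-to-content}}
$$

where $Q^{c}, K^{c}$ are projections of the content (word) embeddings, $Q^{r}, K^{r}$ are projections of the relative-position embeddings, and $\delta(i,j)$ is the relative distance between positions $i$ and $j$.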

28 of 28

Thank you for your attention!

Vitalii Vrublevskyi

Faculty of Computer Science and Cybernetics

Taras Shevchenko National University of Kyiv

vitalii.vrublevskyi@gmail.com

Oleksandr Marchenko

Faculty of Computer Science and Cybernetics

Taras Shevchenko National University of Kyiv

omarchenko@univ.kiev.ua
