X International Conference "Information Technology and Implementation" (IT&I-2023), November 20, 2023, Kyiv, Ukraine
Comparison of Transformer-Based Deep Learning Methods for the Paraphrase Identification Task
Vitalii Vrublevskyi
Oleksandr Marchenko
Taras Shevchenko National University of Kyiv
Dedicated to the tenth anniversary of the Faculty of Information Technology
Global task
Understand and explore the semantic meaning of sentences.
Formulation of the problem
Paraphrase Identification: The system receives two sentences and must decide whether they have the same meaning (i.e., whether they are paraphrases).
Formulation of the problem
Semantic Similarity: In this case, the system needs to estimate the degree of similarity between two sentences.
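For illustration, a minimal similarity-scoring sketch: embed both sentences and take the cosine of their embeddings. The sentence-transformers library and the checkpoint name are assumptions for the example, not part of the presented work.

```python
# Minimal sketch: score semantic similarity as the cosine between sentence
# embeddings. The sentence-transformers library and checkpoint name below are
# assumptions for illustration, not the setup used in this work.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

s1 = "She decided to postpone the meeting until Friday."
s2 = "The meeting was rescheduled to Friday by her."

emb = model.encode([s1, s2], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()      # in [-1, 1]; higher means more similar
print(f"similarity: {score:.3f}")
```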
Paraphrase Example
She decided to postpone the meeting until Friday.
The meeting was rescheduled to Friday by her.
Overview of different methods for paraphrase identification
Traditional Rule-Based Methods
Traditional paraphrase detection approaches rely on handcrafted rules, linguistic patterns, and syntactic analysis.
Machine Learning-Based Methods
Models: SVMs, Random Forests, and logistic regression.
These methods rely on feature engineering: various linguistic and statistical features are extracted from the text pairs and used to train the classifiers.
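A hedged sketch of this approach; the concrete features and toy data below are illustrative assumptions, not the exact features used in any specific prior system.

```python
# Sketch of the feature-engineering approach: hand-crafted lexical features
# for a sentence pair, fed to a classical classifier. Features and toy data
# are illustrative assumptions only.
from sklearn.linear_model import LogisticRegression

def pair_features(s1: str, s2: str) -> list:
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    jaccard = len(t1 & t2) / max(len(t1 | t2), 1)                 # word-overlap ratio
    len_diff = abs(len(t1) - len(t2)) / max(len(t1), len(t2), 1)  # relative length difference
    return [jaccard, len_diff]

# toy training pairs: (sentence1, sentence2, is_paraphrase)
pairs = [
    ("She postponed the meeting until Friday.",
     "The meeting was rescheduled to Friday by her.", 1),
    ("He bought a new car.", "The weather is nice today.", 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict([pair_features("She moved the meeting to Friday.",
                                 "She delayed the meeting until Friday.")]))
```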
Transformer-Based Models
Transformer-based models, such as BERT, RoBERTa, and DeBERTa.
These models are pre-trained on large text corpora and can capture contextual information effectively. Fine-tuning them on paraphrase identification datasets has consistently achieved state-of-the-art results.
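A minimal sketch of how such a model is applied to a sentence pair. The checkpoint name is an example, and the classification head is randomly initialized until the model has been fine-tuned on a paraphrase dataset.

```python
# Sketch: a pre-trained Transformer used as a sentence-pair classifier.
# Both sentences are packed into one input and a 2-way head scores
# "paraphrase" vs "not paraphrase". Checkpoint name is an example only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tokenizer("She decided to postpone the meeting until Friday.",
                   "The meeting was rescheduled to Friday by her.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape (1, 2)
print(logits.argmax(dim=-1).item())            # 0 = not paraphrase, 1 = paraphrase
```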
Transformer models overview
We selected several Transformer models for the analysis, including ALBERT, BERT, DistilBERT, BART, ELECTRA, MobileBERT, I-BERT, DeBERTa, RoBERTa, and SqueezeBERT.
BERT
BERT focuses on pre-training deep bidirectional representations by considering both left and right context across all layers. It can be fine-tuned with a single additional output layer to perform tasks such as question answering or natural language inference, without task-specific modifications to its architecture.
RoBERTa
• Explores how the performance of the original BERT can be optimised using more data and longer training times.
• Changes the training objective by removing the next-sentence-prediction task.
• Highlights the importance of hyperparameter selection and its significant impact on performance.
I-BERT
This model addresses the resource-intensive nature of Transformer-based models like BERT and RoBERTa by introducing a novel quantisation approach. It reduces the memory footprint via lower-bit-precision representations.
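As a generic illustration of the lower-bit-precision idea only (not the I-BERT integer-only scheme itself), PyTorch dynamic quantization can convert a model's Linear layers to int8 weights:

```python
# Generic illustration of lower-bit-precision weights (NOT the I-BERT
# integer-only scheme): PyTorch dynamic quantization stores the weights of
# Linear layers as int8, shrinking the memory footprint of those layers.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model_int8 = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(model_int8)   # Linear layers now appear as dynamically quantized modules
```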
SqueezeBERT
A model inspired by SqueezeNet, a computer vision model. The authors apply techniques commonly used in computer vision to reduce the model's size and improve its speed, replacing fully connected layers with grouped convolutions.
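A tiny sketch of the grouped-convolution idea (the sizes are illustrative, not SqueezeBERT's exact configuration): with `groups=4`, a 1×1 convolution of the same width has roughly four times fewer weights than a fully connected layer.

```python
# Illustration of the grouped-convolution idea: with 4 groups, a 1x1
# convolution over the sequence has ~4x fewer weights than a Linear layer
# of the same width. Sizes are illustrative only.
import torch.nn as nn

hidden = 768
fc = nn.Linear(hidden, hidden)
grouped = nn.Conv1d(hidden, hidden, kernel_size=1, groups=4)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc), count(grouped))   # 590592 vs 148224 parameters
```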
ALBERT
It was created to overcome the challenges of growing model size and longer training times. To tackle these issues, the authors proposed two parameter-reduction techniques, such as cross-layer parameter sharing (repeating layers), to reduce memory usage and speed up BERT training, resulting in a much smaller memory footprint (although compute costs remain roughly the same).
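The effect is easy to verify by comparing parameter counts of the publicly available base checkpoints (standard Hugging Face model IDs):

```python
# Comparing parameter counts of the base checkpoints illustrates ALBERT's
# parameter reduction (roughly 12M vs 110M parameters).
from transformers import AutoModel

for name in ["bert-base-uncased", "albert-base-v2"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```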
Metrics
• Accuracy and F1 score.
• Size of the model.
• Number of sentence pairs the model can process per second (samples per second).
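A hedged sketch of how these metrics can be computed; the `predict` callable and the parameter-count proxy for model size are placeholders, not the exact evaluation code used here.

```python
# Sketch of the reported metrics: accuracy and F1 via scikit-learn, model
# size approximated by parameter count, throughput as samples per second.
# `predict` is a placeholder for the fine-tuned model's prediction function.
import time
from sklearn.metrics import accuracy_score, f1_score

def evaluate(predict, pairs, labels, model=None):
    start = time.perf_counter()
    preds = [predict(s1, s2) for s1, s2 in pairs]            # 0/1 predictions
    elapsed = time.perf_counter() - start
    report = {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
        "samples_per_second": len(pairs) / elapsed,
    }
    if model is not None:                                    # size proxy: number of parameters
        report["parameters"] = sum(p.numel() for p in model.parameters())
    return report
```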
Dataset
The Microsoft Research Paraphrase (MSRP) Corpus was used as the basis for the experiments.
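The corpus is distributed as the `mrpc` task of the GLUE benchmark; a minimal loading sketch with the Hugging Face `datasets` library:

```python
# The MSRP corpus is available as the "mrpc" task of the GLUE benchmark.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")
print(mrpc)              # train / validation / test splits
print(mrpc["train"][0])  # {'sentence1': ..., 'sentence2': ..., 'label': 0 or 1, 'idx': ...}
```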
Experiment details
All fine-tuning for the models described in this section was done on an NVIDIA Tesla T4 (16 GB GPU RAM) with 50 GB of system RAM in Google Colab.
• Each model was fine-tuned for only 5 epochs.
• Each model was fine-tuned five times, and we report the mean and standard deviation.
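A condensed, hedged sketch of this protocol; the hyperparameters and the evaluation split below are placeholders, not the exact settings of the reported runs.

```python
# Condensed sketch of the protocol: fine-tune for 5 epochs, repeat with five
# seeds, and report mean ± std. Hyperparameters and the evaluation split are
# placeholders; the tables in this work report test-set scores.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, set_seed)

mrpc = load_dataset("glue", "mrpc")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = mrpc.map(lambda b: tok(b["sentence1"], b["sentence2"], truncation=True,
                             padding="max_length", max_length=128), batched=True)

def compute_metrics(p):
    preds = p.predictions.argmax(-1)
    return {"accuracy": accuracy_score(p.label_ids, preds),
            "f1": f1_score(p.label_ids, preds)}

scores = []
for seed in range(5):
    set_seed(seed)
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=5,
                               per_device_train_batch_size=32, seed=seed),
        train_dataset=enc["train"],
        eval_dataset=enc["validation"],      # placeholder split for the sketch
        compute_metrics=compute_metrics,
    )
    trainer.train()
    result = trainer.evaluate()
    scores.append((result["eval_accuracy"], result["eval_f1"]))

mean, std = np.mean(scores, axis=0), np.std(scores, axis=0)
print(f"accuracy {mean[0]*100:.2f} ± {std[0]*100:.2f}, F1 {mean[1]*100:.2f} ± {std[1]*100:.2f}")
```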
Experiment details: Small models
Model Name | Test Accuracy (%) | Test F1 score (%) | Samples per second |
ALBERT base | 84.87 ± 1.30 | 88.64 ± 1.19 | 164.88 ± 59.51 |
DistilBERT base | 81.25 ± 0.55 | 86.19 ± 0.46 | 438.70 ± 3.01 |
Google Electra small discriminator | 81.47 ± 0.58 | 86.50 ± 0.38 | 1242.70 ± 39.59 |
Google MobileBERT | 72.44 ± 4.74 | 80.62 ± 3.71 | 526.57 ± 28.87 |
SqueezeBERT | 84.58 ± 0.49 | 88.22 ± 0.40 | 419.06 ± 6.39 |
Experiment details: Regular models
Model Name | Test Accuracy (%) | Test F1 score (%) | Samples per second |
BERT base | 80.12 ± 2.05 | 85.50 ± 1.37 | 213.91 ± 3.09 |
Facebook BART base | 85.92 ± 0.21 | 89.74 ± 0.18 | 156.89 ± 5.03 |
Google Electra base discriminator | 85.31 ± 0.47 | 89.24 ± 0.38 | 207.08 ± 8.96 |
I-BERT RoBERTa base | 86.23 ± 0.50 | 89.63 ± 0.52 | 220.21 ± 0.91 |
Microsoft DeBERTa base | 86.40 ± 0.48 | 89.86 ± 0.34 | 169.76 ± 67.31 |
Experiment with LLM
Large language models are powerful deep learning models for various natural language processing tasks. They are based on the Transformer architecture and trained on massive datasets.
We selected the Llama 2 model, released in July 2023.
To fine-tune it, we used a single NVIDIA A100 (40 GB) in Google Colab.
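The slides do not spell out the fine-tuning recipe; a common way to fit a 7B model on a single 40 GB A100 is parameter-efficient fine-tuning (LoRA) on a 4-bit-quantized base model. The sketch below shows that approach under those assumptions, not necessarily the exact setup used in the experiments.

```python
# Hedged sketch: LoRA fine-tuning of a 4-bit-quantized Llama 2 7B, a common
# recipe for a single 40 GB GPU. Not necessarily the exact setup used here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # gated checkpoint on the Hugging Face Hub
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # only the small LoRA adapters are updated

# The model is then trained on prompts of the form (illustrative):
# "Are these sentences paraphrases? Sentence 1: ... Sentence 2: ... Answer: yes/no"
```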
Finetuning Llama 2
Model Name | Test Accuracy (%) | Test F1 score (%) |
Llama 2 7b | 84.52 | 88.48 |
Conclusions
The intended field of application is one of the main criteria for model selection. Smaller Transformer models such as ALBERT or SqueezeBERT are worth investigating, since they perform relatively well and can run on small devices.
If accuracy is more critical, regular or large models are the better choice.
Conclusions
Large language models can also be used to detect paraphrases, but the cost and complexity of their fine-tuning are significantly higher.
Conclusions
Further improvements in model performance can be achieved by exploring large models with even more parameters (for example, the Llama 2 70B model) or by incorporating more sentence-aware context into the Transformer architecture. A good example is the DeBERTa model, which builds separate representations of a word's content and its position.
Thank you for your attention!
Vitalii Vrublevskyi
Faculty of Computer Science and Cybernetics
Taras Shevchenko National University of Kyiv
vitalii.vrublevskyi@gmail.com
Oleksandr Marchenko
Faculty of Computer Science and Cybernetics
Taras Shevchenko National University of Kyiv
omarchenko@univ.kiev.ua