1 of 14

X International Conference “Information Technology and Implementation” (IT&I-2023), Kyiv, Ukraine


Estimation of the Factual Correctness of Summaries of a Ukrainian-language Silver Standard Corpus

Artem Kramov, Seraf AI LLC

Dedicated to the tenth anniversary of the Faculty of Information Technology

2 of 14

The essence of the problem

  • Automatic evaluation of the truthfulness of content generated by large language models (LLMs) remains an open problem.

  • Detection of factual mistakes by means of machine learning may help to perform reinforcement learning from human feedback (RLHF) in a self-supervised manner.

  • Despite the availability of various existing datasets for NLP tasks (summarization, question answering, etc.), many of them still have to be verified for factual consistency.


3 of 14

The essence of the problem (1)

The estimation of the factual correctness of datasets is especially important for low-resource languages (e.g., Ukrainian), since in most cases such datasets (e.g., XL-Sum) are constructed in a self-supervised manner.


4 of 14

The goals of the work

  • Analysis of the state-of-the-art machine learning methods for the detection of factual inconsistency between a document and a summary.
  • Development of a metric to evaluate the factual consistency of a Ukrainian-language document-summary pair in an automatic way.
  • Estimation of the factual correctness of summaries in the existing Ukrainian-language dataset and analysis of problematic cases.
  • Evaluation of the performance of a pre-trained Ukrainian-language summarization model according to the created metric, with further detection of erroneous samples.


5 of 14

Literature analysis

  • According to the literature analysis, hallucination errors are the most common ones for both existing English-language datasets and models:
    • Document: “…The man is a teacher…”, summary: “The man is a doctor.”.
    • Document: “The man is a teacher. He likes his job.”, summary: “School is a nice place.”.

  • In order to detect both hallucinations and incorrect statements in Ukrainian-language summaries, the SummaC method was used as a backbone.
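The SummaC-style scoring can be sketched as follows: split the document and summary into sentences, score every document-summary sentence pair with an NLI model, and aggregate the resulting entailment matrix (in the zero-shot variant, a max over document sentences followed by a mean over summary sentences). A minimal sketch, assuming any NLI scorer can be plugged in; `toy_nli` is a purely illustrative word-overlap stand-in, not a real NLI model:

```python
def summac_zs_score(doc_sentences, summary_sentences, nli_entailment):
    """SummaC-ZS-style aggregation: for each summary sentence take the max
    entailment score over all document sentences, then average the maxima."""
    per_sentence_max = []
    for summ in summary_sentences:
        scores = [nli_entailment(doc, summ) for doc in doc_sentences]
        per_sentence_max.append(max(scores))
    return sum(per_sentence_max) / len(per_sentence_max)

def toy_nli(premise, hypothesis):
    # Illustrative stand-in for an NLI entailment probability: word-overlap ratio.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

doc = ["The man is a teacher.", "He likes his job."]
good, bad = ["The man is a teacher."], ["The man is a doctor."]
print(summac_zs_score(doc, good, toy_nli) > summac_zs_score(doc, bad, toy_nli))  # → True
```

In practice, `nli_entailment` would be a multilingual NLI model (e.g., one of those listed in the results table) rather than the toy overlap function.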


6 of 14

SummaC method


7 of 14

XL-Sum dataset

  • XL-Sum is a silver-standard multilingual summarization dataset created by web-scraping the BBC news portal. Each article web page was parsed, and the short description of the article was treated as its summary.
  • The XL-Sum dataset contains Ukrainian-language samples as well; moreover, the authors provided a pre-trained summarization model for the Ukrainian language.
  • According to previous research, up to 42% of the summaries in the dataset may be erroneous.


8 of 14

Results - Inconsistent summaries discrimination

For each document-summary pair in the dataset, another summary was picked as a distractor. The ROUGE-1 F1 score of the new summary had to be higher than the corresponding score of the original summary. The task consisted in the ability of the metric to detect the original summary by assigning it a higher value.
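The discrimination task above can be sketched as follows; `rouge1_f1` is a simplified unigram-F1 implementation, and `discrimination_accuracy` is an illustrative name rather than code from the work:

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram-overlap F1 between two texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def discrimination_accuracy(triples, metric):
    """triples: (document, original_summary, distractor_summary), where the
    distractor was chosen to have a higher ROUGE-1 F1 than the original.
    Accuracy = share of triples where the metric still ranks the original higher."""
    correct = sum(
        1 for doc, orig, distractor in triples
        if metric(doc, orig) > metric(doc, distractor)
    )
    return correct / len(triples)
```

Here `metric` would be the suggested SummaC-based factual-consistency score computed between the document and a candidate summary.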


Model                                 | Accuracy, % | PCC
paraphrase-multilingual-mpnet-base-v2 | 85.855      | 0.168
distiluse-base-multilingual-cased-v2  | 81.655      | 0.273
xlm-roberta-large-xnli                | 75.664      | -0.075

9 of 14

Results - XL-Sum dataset


10 of 14

Results – Pre-trained model


11 of 14

Results – Metrics for the dataset and model

In order to compare the results between ground-truth and model-generated summaries, the median value was taken as the average score, and the interquartile range (IQR) as the measure of the deviation of the metric.


Summaries       | Median | IQR
Ground-truth    | 0.848  | 0.248
Model-generated | 0.958  | 0.186
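The median and IQR statistics above can be computed with the Python standard library; a minimal sketch (the underlying score lists are not reproduced here):

```python
import statistics

def median_and_iqr(scores):
    """Median as the average score, interquartile range (Q3 - Q1) as the spread."""
    q1, _, q3 = statistics.quantiles(scores, n=4)  # three quartile cut points
    return statistics.median(scores), q3 - q1
```

The median/IQR pair is more robust to the heavy left tail of near-zero consistency scores than a mean/standard-deviation pair would be.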

12 of 14

Conclusions

  • The suggested SummaC-based metric for evaluating the factual consistency of a Ukrainian-language document-summary pair showed ~85% accuracy in the discrimination task, which indicates that the metric may be used to detect erroneous summaries.
  • The analysis of the values of the chosen NLI-based metric on the ground-truth samples of the XL-Sum dataset may indicate that at least 50% of the summaries are erroneous, which matches the results of previous research.
  • Finally, it was shown that the factual-consistency metrics of model-generated summaries are higher than those of ground-truth summaries. Nevertheless, the presence of generated summaries with almost-zero metric scores may indicate the strong impact of the hallucinated dataset on the trained model.


13 of 14

Limitations and future directions

  • The suggested metric may return low values for summaries with a high abstractiveness level. For instance, if a summary sentence is entailed by several non-adjacent document sentences, the final score will not be high by default.

  • The aforementioned problem can be solved by searching for the set of sentences that forms evidence for each summary sentence or its parts (e.g., clauses). Such a problem falls into the category of the search for document origins and requires further investigation.
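The evidence-set search can be sketched as an exhaustive scan over small subsets of (possibly non-adjacent) document sentences; `best_evidence_score` is an illustrative name, and the word-overlap `overlap` scorer stands in for a real NLI model:

```python
from itertools import combinations

def overlap(premise, hypothesis):
    # Illustrative stand-in for an NLI entailment scorer: word-overlap ratio.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

def best_evidence_score(doc_sentences, summary_sentence, entail, max_set_size=2):
    """Search for the subset of document sentences (possibly non-adjacent, up to
    max_set_size of them) whose concatenation best entails the summary sentence."""
    best = 0.0
    for size in range(1, max_set_size + 1):
        for subset in combinations(doc_sentences, size):
            best = max(best, entail(" ".join(subset), summary_sentence))
    return best
```

Note that the exhaustive search grows combinatorially with `max_set_size`, so a practical implementation would need pruning (e.g., keeping only the top-scoring single sentences as candidates before forming sets).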


14 of 14

Authors

Artem Kramov, Ph.D. in Information Technology, large language model engineer at Seraf AI LLC

artemkramov@gmail.com
