
X International Conference “Information Technology and Implementation” (IT&I-2023), Kyiv, Ukraine


THE METHOD FOR DETERMINING THE DEGREE OF SUSPICIOUSNESS OF A PHISHING URL

Serhii Buchyk, Anastasia Shabanova, Oleksandr Buchyk, Serhii Toliupa

Taras Shevchenko National University of Kyiv

Dedicated to the tenth anniversary of the Faculty of Information Technology


Introduction

In today's digital world, with its high level of online activity, network security is becoming an extremely important and challenging task. The rapid development of technology is increasing the number of threats, with phishing attacks coming to the fore. Phishing, a deceptive scheme aimed at extracting confidential information such as passwords, banking data or personal identifiers, has become a central concern of cybersecurity and remains a serious threat to network users.

Information Technology and Implementation, November 20, 2023, Taras Shevchenko National University of Kyiv, Kyiv, Ukraine

According to the Incident Response Threat Summary report, such attacks have increased by 30% and are the main intrusion technique used to obtain sensitive information. The healthcare sector is particularly vulnerable, accounting for 22% of all incidents, and compromised credentials are used in nearly 40% of cases.


Impact of war

When the war in Ukraine broke out in 2022, phishing attacks increased significantly. In the first quarter of that year, fraudulent websites appeared that diverted funds under the guise of humanitarian aid for Ukraine.


Throughout the year, investigations analyzed links in emails aimed specifically at English-speaking users. These emails offered to transfer money to help affected Ukrainians through direct transactions to bitcoin wallets, because the recipient of a cryptocurrency payment is harder to identify than the recipient of a bank transfer.


Phishing statistics

In 2022, email anti-virus programs detected 166,187,118 malicious attachments in emails, an increase of 18 million over the previous year. The highest numbers of detections occurred in February, March, and June 2022, which points to a close connection between external events and changes in the digital environment.


Types of phishing

Today there are many types of phishing, each aiming at different outcomes and using different methods to influence the potential victim. Some rely on mass mailings or standard attacks, while others use carefully designed, personalized approaches tailored to their target audience. Phishing as a modern cyber threat is growing in importance and requires a deeper understanding of its various forms.


Types of phishing (diagram): targeted phishing, vishing, spear phishing, pharming, cloning.


PhishTank

There are initiatives aimed at countering such forms of fraud. One example is the PhishTank platform, a community where users work together to detect and block phishing attacks by submitting links to phishing sites. This improves protection against cybercriminals and overall security on the Internet, and it helps to protect the confidentiality of users' personal data.


General algorithm for determining the degree of suspiciousness of a URL

Fuzzy logic is a mathematical approach for dealing with imprecise and uncertain concepts, which makes it well suited to handling the ambiguity inherent in phishing detection. It can be used to express the degree of similarity between phishing and legitimate elements, especially when text attributes such as URLs or headers are ambiguous.

Using fuzzy values allows us to determine the degree of suspiciousness of a link. In our case, the main basis for analysis and comparison is the current page listing new phishing URLs on the PhishTank website.



Levenshtein distance

Let's consider the first method, based on the Levenshtein distance, which is the minimum number of edits required to transform one string into another. These edits are insertions (adding a character), deletions (removing a character), and substitutions (replacing one character with another). The computation builds a matrix in which each cell holds the distance between a prefix of the first string and a prefix of the second. The maximum length specified in the code is used to normalize the Levenshtein distance to a value between 0 and 1.
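The matrix computation and normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes that the normalizing "maximum length" is the length of the longer string and that the result is reported as a similarity (1 minus the normalized distance).

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # Only one row of the matrix is kept at a time. At the start of step i,
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: divide by the longer length, then invert."""
    max_length = max(len(a), len(b))
    if max_length == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / max_length
```

With this convention, identical strings score 1.0 and strings with nothing in common score close to 0.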


Cosine Similarity

Cosine similarity is a method of measuring the similarity between two vectors in a vector space. For textual comparison, text strings (in this case URLs) are converted into numerical vectors using ASCII character codes. The similarity is calculated as the cosine of the angle between these vectors, which makes it possible to ignore the magnitude of the vectors and focus on their direction.


This method is generally effective for comparing text documents or words because, with a suitable vector representation, it can reflect semantic similarity and context. The method based on Levenshtein distance measures only the number of character edits between two strings, whereas cosine similarity can account for semantic relationships and context, which is important for textual information.
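A minimal sketch of cosine similarity over character codes is shown below. The slides do not say how the vectors are built, so this assumes one plausible scheme: each string becomes a positional vector of character codes, and the shorter vector is zero-padded to the length of the longer one. The function names are hypothetical.

```python
import math


def to_code_vector(text: str, length: int) -> list:
    """Positional vector of character codes, zero-padded to the given length."""
    codes = [ord(ch) for ch in text]
    return codes + [0] * (length - len(codes))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two character-code vectors."""
    length = max(len(a), len(b))
    vec_a = to_code_vector(a, length)
    vec_b = to_code_vector(b, length)
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # the angle is undefined for an empty string
    return dot / (norm_a * norm_b)
```

Because all character codes are positive, this representation tends to give high scores even for unrelated strings of similar length. A character-frequency vector is another common choice and would behave differently.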


Jaccard Similarity

Jaccard similarity is a measure of the similarity between two sets, defined as the number of common elements divided by the total number of unique elements. In the code under consideration, this method is used to determine the similarity between a pattern word and a URL.
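The definition above can be sketched in a few lines. The slides do not state which elements make up the sets, so this illustration assumes sets of individual characters. The function name is hypothetical.

```python
def jaccard_similarity(pattern: str, url: str) -> float:
    """Size of the intersection divided by the size of the union of character sets."""
    set_pattern, set_url = set(pattern), set(url)
    union = set_pattern | set_url
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(set_pattern & set_url) / len(union)
```

For example, "abc" and "bcd" share two of four distinct characters, giving 0.5. Sets of tokens or character n-grams could be substituted without changing the formula.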


A simple average gives all methods the same weight, which can misrepresent the contribution of an individual method. A weighted average is used instead, so that the importance of each approach relative to the others can be taken into account. The effect of each method is denoted E1, E2, E3 and their weights W1, W2, W3.
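One common way to combine the three results is the standard weighted mean, (W1*E1 + W2*E2 + W3*E3) / (W1 + W2 + W3). The slides do not show the exact formula or the weight values, so the sketch below is an assumption, and the function name is hypothetical.

```python
def weighted_score(effects, weights):
    """Weighted mean of the method results E1..En with weights W1..Wn."""
    if len(effects) != len(weights):
        raise ValueError("effects and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(e * w for e, w in zip(effects, weights)) / total_weight
```

With equal weights this reduces to the simple average. Unequal weights shift the result toward the methods considered more reliable.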


Algorithm


How it works

This mathematical analysis is valuable for tracking and investigating the most recently added URLs on the PhishTank website, and it can help to recognize and avoid potentially dangerous sites in time. The algorithm works as follows: it first obtains the HTML content of the PhishTank page using the get_page_content function, then uses BeautifulSoup to parse the HTML and create the phishtank_soup object, which allows convenient interaction with the HTML elements of the page.


Then the user-entered URL (base_url) is compared with each URL in the phishtank_urls list, and the similarity between the two URLs is calculated using the methods described above. If the similarity is greater than the current maximum, most_similar_url is updated.
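The comparison loop can be sketched as below. The names base_url, phishtank_urls and most_similar_url come from the slides. The function find_most_similar and the placeholder similarity based on difflib are assumptions added for illustration, with the placeholder standing in for the authors' combined score. The URLs in the usage example are made up.

```python
from difflib import SequenceMatcher


def placeholder_similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; NOT the combined measure from the slides."""
    return SequenceMatcher(None, a, b).ratio()


def find_most_similar(base_url, phishtank_urls, similarity=placeholder_similarity):
    """Return the listed URL most similar to base_url and its score."""
    most_similar_url = None
    max_similarity = 0.0
    for url in phishtank_urls:
        score = similarity(base_url, url)
        # Update only when the score exceeds the current maximum.
        if score > max_similarity:
            max_similarity = score
            most_similar_url = url
    return most_similar_url, max_similarity


# Usage with made-up URLs on reserved domains.
phishtank_urls = [
    "http://example.com/login",
    "http://paypa1-secure.test/signin",
]
best_url, best_score = find_most_similar(
    "http://paypal-secure.test/signin", phishtank_urls
)
```

Passing the similarity function as a parameter keeps the loop independent of the chosen measure, so any score in the range 0 to 1 can be plugged in.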


Thank you for your attention!
