Phishing URL Detection using Machine Learning

Authors: Rutul Patel, Sanjay Kshetry, Sanket Berad, Justin Zirthantlunga

DOI Link: https://doi.org/10.22214/ijraset.2022.39979

Abstract

As the majority of our financial, business, and other daily activities have moved to the web, we are exposed to more serious threats in the form of cybercrime. URL-based phishing attacks are among the most common threats to web users. In this kind of attack, the attacker exploits human weakness rather than software flaws. It targets both individuals and organizations, induces them to click on URLs that look secure, and steals private information or injects malware into the victim's system. Various machine learning algorithms are being used to detect phishing URLs, that is, to classify a URL as phishing or legitimate. Researchers are continually trying to improve the performance of existing models and increase their accuracy. In this work, we aim to review the machine learning methods used for this purpose, along with the datasets and URL features used to train the models. The performance of different machine learning algorithms and the methods used to improve their accuracy are discussed and analysed. The goal is to create a survey resource that helps researchers learn about current developments in the field and contribute to phishing detection models that yield more accurate results.

I. INTRODUCTION

In 2020, the global pandemic made people's lives heavily dependent on technology. As digitalization became essential, cybercriminals went on an internet crime spree. Recent reports and studies highlight an increased number of security breaches that cost victims large sums of money or expose confidential data. Phishing is a cybercrime that uses both social engineering and technical subterfuge to steal victims' personal identity data or financial account credentials [1]. In phishing, attackers counterfeit trusted websites and misdirect people to these sites, where they are tricked into sharing usernames, passwords, banking or credit card details, and other sensitive credentials. These phishing URLs may be sent to users through email, instant message, or text message. According to the FBI crime report for 2020, phishing was the most common type of cyberattack in 2020, and phishing incidents roughly doubled, from 114,702 in 2019 to 241,342 in 2020 [2]. The Verizon 2020 Data Breach Investigations Report states that 22% of data breaches in 2020 involved phishing [3]. The number of phishing attacks observed by the Anti-Phishing Working Group (APWG) grew through 2020, doubling over the course of the year. In the final quarter of 2020, phishing attacks against financial institutions were the most prevalent.

Phishing attack survey

Phishing attacks against SaaS and webmail sites were down and attacks against e-commerce sites rose, while attacks against media companies decreased slightly from 12.6% to 11.8% [1]. Given the prevailing pandemic situation, many phishing attacks have exploited the global focus on Covid-19. According to the WHO, many hackers and cyber scammers are sending fraudulent emails and WhatsApp messages to people, taking advantage of the coronavirus disease [4]. These attacks arrive as fake job offers, fabricated messages from health organizations, Covid-19 vaccine-themed phishing, and brand impersonation.

In the following section, different phishing detection approaches are analysed, and the machine learning algorithms most commonly used in machine learning-based approaches are discussed.

II. BACKGROUND

A. Phishing Detection

A URL-based phishing attack is carried out by sending users malicious links that appear legitimate and tricking them into clicking on them. In phishing detection, an incoming URL is identified as phishing or not by analysing its various features, and it is classified accordingly. Different machine learning algorithms are trained on various datasets of URL features to classify a given URL as phishing or legitimate.

B. Phishing Detection Approaches

In the list-based approach, there are two lists, called the whitelist and the blacklist, which identify legitimate and phishing URLs respectively. In [5], access to a website is allowed only if its URL is in the whitelist. In [6], a blacklist is used. In the heuristic-based approach, the structure of a phishing URL is analysed. A pattern is built from URLs that were previously classified as phishing, and new URLs are classified by how closely they match this pattern. The methods used to process the features of the URL play a significant role in classifying websites accurately [7].
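The list-based check can be illustrated with a minimal Python sketch. The list contents, the function names, and the decision to match on the hostname only are assumptions made for illustration; they are not taken from [5] or [6].

```python
from urllib.parse import urlsplit

# Hypothetical example lists; a real system would load these from maintained feeds.
WHITELIST = {"example.com", "example.org"}
BLACKLIST = {"phish.example.net"}


def hostname_of(url: str) -> str:
    """Return the lower-cased hostname of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def classify_by_lists(url: str) -> str:
    """Classify a URL as 'phishing', 'legitimate', or 'unknown' by list lookup."""
    host = hostname_of(url)
    if host in BLACKLIST:
        return "phishing"
    if host in WHITELIST:
        return "legitimate"
    # Neither list knows this host, so another approach has to decide.
    return "unknown"
```

The "unknown" outcome reflects the main limitation of list-based detection: a URL that appears on neither list cannot be judged without one of the other approaches.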

The visual similarity-based approach works by comparing the visual similarity of web pages. Websites are classified as phishing or not by taking a server-side view of them, as in [8]. The two sets of data obtained are then compared using image processing techniques. Counterfeit web pages are designed to look very close to the originals, and minor differences that users cannot easily see are easier to detect with image processing techniques.

The content-based approach analyses the content of the pages. This technique extracts features from page content and from third-party services such as search engines and DNS servers. In [9], the authors proposed a detection method that assigns weights to the words extracted from URLs and HTML content. The words may include brand names that attackers use in the URL to make it resemble a legitimate one. Weights are assigned according to the presence of the words at different positions in the URL. The most probable words are selected and then sent to Yahoo search, which returns the domain name with the highest frequency among the top 30 results. The owners of the domain names are compared to decide whether the site is phishing or not. In [10], a logo image is used to determine the identity of web pages by matching legitimate and fake web pages.

The fuzzy rule-based approach allows ambiguous variables to be processed and then incorporates human experts to classify those variables and the relations between them. It is used to classify web pages based on the level of phishing that appears in the pages, using a specific group of metrics and predefined rules [11]. According to the experimental results in that paper, for fuzzy logic systems a lower number of features leads to higher accuracy. If a fuzzy logic algorithm is affected by irrelevant features, the effectiveness of the classifier decreases, and vice versa.

In the machine learning-based approach, machine learning models are built to classify a given URL as phishing or not using supervised learning algorithms. Different algorithms are trained on a dataset and then tested to measure the performance of each model. Any variation in the training data directly affects the performance of the model. This approach provides efficient techniques with high performance for detecting phishing. It is a large field of research, and many papers discuss ML-based phishing detection.

III. MACHINE LEARNING ALGORITHMS

Several machine learning algorithms, such as Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, Logistic Regression, and K-Nearest Neighbor, are used for detecting phishing websites. Among these, Support Vector Machine is a very popular approach that has proved to be efficient and accurate compared to other methods.

Algorithm	Time Complexity	Training Data Size	Interpretability
SVM	O(n^2)	Small	Medium
Decision Tree	O(nd log n)	Small	High
Naïve Bayes	O(nd)	Small	High
k-NN	O(nd)	Small	Medium
Random Forest	O(knd log n)	Small	Medium

IV. SUPPORT VECTOR MACHINE(SVM)

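Support Vector Machine (SVM) is a widely used machine learning method for classification and regression. SVM finds the optimal separating hyperplane between two labels. The classifier can be expressed in terms of a kernel function K(x, x'), which computes the similarity of two feature vectors, and non-negative coefficients α_i, which indicate which training examples lie close to the decision boundary. SVM classifies data by computing the distance to the decision boundary. In its standard form, with training examples x_i, labels y_i ∈ {−1, +1}, and bias b, the decision function is

$$h(x) = \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b$$

and the predicted label is the sign of h(x).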
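As a rough illustration of how a kernel SVM scores a new sample, the Python sketch below evaluates the standard decision function h(x) = Σ α_i y_i K(x_i, x) + b. The Gaussian (RBF) kernel, the label encoding (+1 for phishing, −1 for legitimate), and the hand-picked support vectors and coefficients are all assumptions chosen for illustration; they are not values from a trained model or from the surveyed papers.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian (RBF) kernel: similarity of two feature vectors, in (0, 1]."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, bias=0.0):
    """h(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return bias + sum(
        alpha * y * rbf_kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )


def classify(x, support_vectors, labels, alphas, bias=0.0):
    """Return +1 (phishing) or -1 (legitimate) from the sign of h(x)."""
    return 1 if decision_value(x, support_vectors, labels, alphas, bias) >= 0 else -1


# Toy, hand-picked values (not a trained model).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]      # assumed encoding: -1 legitimate, +1 phishing
alphas = [1.0, 1.0]   # non-negative coefficients
```

A point near (2, 2) is dominated by the positive support vector and is labelled +1, while a point near the origin is labelled −1. In practice the coefficients and bias come from solving the SVM training problem rather than being chosen by hand.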

V. LITERATURE REVIEW

In this section, a few of the research works that apply the aforementioned algorithms are reviewed and their results are summarized. In [11], Rishikesh Mahajan and Irfan Siddavatam chose two algorithms for classification: Random Forest and Support Vector Machine. Their dataset contained 17,058 benign URLs and 19,653 phishing URLs, collected from the Alexa website and PhishTank respectively, with 16 features each. The dataset was divided into training and testing sets in the ratios 50:50, 70:30, and 90:10. Accuracy, false negative rate, and false positive rate were used as performance evaluation metrics. They achieved 97.14% accuracy with the Random Forest algorithm, which also had the lowest false negative rate. The paper concluded that accuracy increases when more data is used for training.

The study conducted by Jitendra Kumar et al. in [12] trained several classifiers, namely Logistic Regression, Naive Bayes, Random Forest, Decision Tree, and K-Nearest Neighbor, on features extracted from the lexical structure of the URL. They built the dataset of URLs so that it addressed the problems of data imbalance, biased training, variance, and overfitting. The dataset contained an equal number of labelled phishing and legitimate URLs and was split in the ratio 7:3 for training and testing. All of the classifiers had almost the same AUC (area under the ROC curve), but the Naive Bayes classifier proved the most suitable, as it had the highest AUC value. Naive Bayes achieved the highest accuracy of 98%, with precision = 1, recall = 0.95, and F1-score = 0.97.

Mehmet Korkmaz et al. proposed in [13] a machine learning-based phishing detection system using 8 different algorithms on three different datasets. The algorithms used were Logistic Regression (LR), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB), XGBoost, Random Forest (RF), and Artificial Neural Network (ANN). It was observed that the models using LR, SVM, and NB had low accuracy rates. In terms of training time, the NB, DT, LR, and ANN algorithms gave better results. They concluded that the RF or ANN algorithm may be used because of its shorter training time together with a high accuracy rate.

Summary of Literature Review

Paper	Approach	Conclusion	Accuracy
[3]	8 different algorithms are applied on three different datasets making use of 48 features.	RF has the highest accuracy on all three datasets. ANN is also preferred.	Dataset 1: 94.59%; Dataset 2: 90%; Dataset 3: 91.26%
[11]	Dataset is split into training and testing sets in 50:50, 70:30, and 90:10 ratios respectively and SVM classifiers are used.	SVM has better accuracy with the least false-negative rate. Accuracy increases when more data is used for training.	50:50 split ratio: 96.72%; 70:30 split ratio: 96.84%; 90:10 split ratio: 98.4%
[5]	A balanced dataset is used to train LR, NB, RF, DT, and k-NN classifiers based on features extracted from the lexical structure of a URL.	The RF and NB classifiers have better accuracies among all classifiers. In terms of AUC, Gaussian Naive Bayes has a slightly higher value of 0.991.	Random Forest: 98.03%; Gaussian Naive Bayes: 97.18%
[13]	Accuracy, F-measure, and AUC are used to evaluate the performance of the classifiers ANN, k-NN, SVM, C4.5 DT, RF, and RoF on the UCI dataset.	RF produces reliable results in terms of Accuracy, F-measure, and AUC. It is faster, robust, and more accurate.	Random Forest: 97.36%

VI. DATASETS

Typically, phishing website data is collected from kaggle.com. kaggle.com is a site where phishing URLs are identified and can be accessed through an API call. Its data is used by companies such as Kaspersky, Mozilla, and Avast. Since it does not store the content of web pages, it is a good source for URL-based analysis [6]. There are also publicly available datasets, such as the UCI Machine Learning Repository dataset used in [11], which contains 8,450 records with 21 features each, and the Kaggle phishing dataset used in [12], which contains 10,050 records with 35 features each.

VII. FEATURE EXTRACTION

URLs have specific characteristics and patterns that can be treated as their features. Fig. 3 shows the relevant parts of a typical URL.

For URL-based analysis, we need to extract these features to form a dataset that can be used to train and test machine learning models. Four categories of features are most commonly considered for feature extraction, as in [10]. They are as follows:

Address bar-based features
Abnormal-based features
HTML and JavaScript-based features
Domain-based features
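To make the first category more concrete, the following Python sketch extracts a handful of address bar-based features from a URL string. The particular features and their names (`url_length`, `host_is_ip`, and so on) are illustrative assumptions and are not the feature set of any surveyed paper. The other three categories need page content or external lookups and are not covered here.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(host: str) -> bool:
    """True if the host is a literal IPv4/IPv6 address instead of a domain name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url: str) -> dict:
    """Extract a few illustrative address bar-based features from a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),                # very long URLs can hide the real host
        "uses_https": parts.scheme == "https",
        "host_is_ip": host_is_ip(host),        # raw IP address instead of a domain
        "has_at_symbol": "@" in url,           # '@' can obscure the true destination
        "hyphen_in_host": "-" in host,         # e.g. brand-login.example.com
        "dots_in_host": host.count("."),       # rough indicator of subdomain depth
    }
```

Each URL in a dataset would be mapped to such a feature dictionary, and the resulting rows form the table on which the classifiers are trained and tested.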

VIII. PERFORMANCE EVALUATION METRICS

To evaluate the effectiveness of a system, we use certain parameters. For each machine learning model, we compute the Accuracy, Precision, Recall, F1 Score, and ROC curve to determine its performance. Each of these metrics is calculated from the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts. In the case of URL classification, True Positive (TP) is the number of phishing URLs that are correctly classified as phishing. True Negative (TN) is the number of legitimate URLs that are correctly classified as legitimate. False Positive (FP) is the number of legitimate URLs that are classified as phishing. False Negative (FN) is the number of phishing URLs that are classified as legitimate. These values are summarized in a table called the confusion matrix.

	Predicted Phishing	Predicted Legitimate
Actual Phishing	TP	FN
Actual Legitimate	FP	TN

Table Of Confusion Matrix For Phishing Detection
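The four cells of the confusion matrix can be counted directly from the true and predicted labels. The sketch below assumes a label encoding of 1 for phishing and 0 for legitimate, and the two short label lists are made-up illustrative data.

```python
from collections import Counter

PHISHING, LEGITIMATE = 1, 0  # label encoding assumed for this sketch


def confusion_counts(actual, predicted):
    """Count TP, FN, FP, and TN with phishing as the positive class."""
    pairs = Counter(zip(actual, predicted))
    return {
        "TP": pairs[(PHISHING, PHISHING)],      # phishing predicted as phishing
        "FN": pairs[(PHISHING, LEGITIMATE)],    # phishing predicted as legitimate
        "FP": pairs[(LEGITIMATE, PHISHING)],    # legitimate predicted as phishing
        "TN": pairs[(LEGITIMATE, LEGITIMATE)],  # legitimate predicted as legitimate
    }


# Made-up labels for eight URLs.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```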

Precision is the number of URLs that are actually phishing out of all the URLs predicted as phishing. It measures the exactness of the classifier. The formula for precision is given by Equation (1) below.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

Recall is the number of URLs that the classifier identified as phishing out of all the URLs that are actually phishing. It is also called sensitivity or true positive rate. It is an important measure and should be as high as possible. The formula for recall is given by Equation (2) below.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

F1-Score is the harmonic mean of precision and recall. It is used to measure precision and recall at the same time. The formula for F1-Score is given by Equation (3) below.

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$

Accuracy is the number of instances that were correctly classified out of all the instances in the test data. The formula for accuracy is given by Equation (4) below.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$
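As a worked example, the Python sketch below computes the four metrics from confusion matrix counts using their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 · precision · recall / (precision + recall), and accuracy = (TP + TN) / (TP + TN + FP + FN). The counts are made-up illustrative values, not results from any surveyed paper, and returning 0.0 for an undefined ratio is a convention assumed here.

```python
def safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, F1-score, and accuracy from confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts for a test set of 200 URLs (100 phishing, 100 legitimate).
metrics = evaluation_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 90/95 ≈ 0.947, recall is 90/100 = 0.900, F1 is 180/195 ≈ 0.923, and accuracy is 185/200 = 0.925. The example shows how a classifier can have high precision while still missing one in ten phishing URLs.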

IX. OBSERVATIONS

Phishing attacks are continually evolving, and the digital world is frequently hit by new kinds of attacks. Consequently, no particular detection approach or algorithm can be labelled as the best one that always gives accurate results. Through the literature survey, we found that Support Vector Machine gives better results in many scenarios. However, the performance of every algorithm varies depending on the dataset used, the train-test split ratio, the feature selection techniques applied, and so on. Researchers aim to build machine learning models that perform phishing detection with the best values for the evaluation parameters and the least training time. Our future work therefore focuses on improving these aspects of phishing detection.

Conclusion

Phishing detection is currently an area of great interest among researchers because of its importance in protecting privacy and providing security. Many methods perform phishing detection by classifying websites using trained machine learning models. In this paper, we presented our systematic survey of existing URL-based phishing detection techniques from several perspectives. Although earlier survey papers exist, they generally focus on phishing detection techniques as a whole, whereas we focused on URL-based detection in detail, with particular attention to features. First, we reviewed the literature on general phishing detection schemes. Second, we examined the structure of URL-based phishing and the commonly used algorithms and features. Third, common data sources were listed, and comparative evaluation results and metrics were presented for a better overview. Finally, we concluded with our decision to proceed with the Support Vector Machine algorithm for more effective phishing URL detection in our project.

References

[1] Patil, Srushti, and Sudhir Dhage. "A methodical overview on phishing detection along with an organized way to construct an anti-phishing framework." 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS). IEEE, 2019.
[2] Garcés, Ivan Ortiz, Maria Fernada Cazares, and Roberto Omar Andrade. "Detection of phishing attacks with machine learning techniques in cognitive security architecture." 2019 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE, 2019.
[3] Ahmed, Abdulghani Ali, and Nurul Amirah Abdullah. "Real-time detection of phishing websites." 2016 IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (ICON). IEEE, 2016.
[4] Nathezhtha, T., D. Sangeetha, and V. Vaidehi. "WC-PAD: Web Crawling based Phishing Attack Detection." 2019 International Carnahan Conference on Security Technology (ICCST). IEEE, 2019.
[5] Shahrivar, Vahid, Mohammad Mahdi Darabi, and Mohammad Izadi. "Phishing Detection Using Machine Learning Techniques." arXiv preprint arXiv:2009.11116 (2020).
[6] Zabihimayvan, Mahdieh, and Derek Doran. "Fuzzy rough set feature selection to enhance phishing attack detection." 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, 2019.
[7] Yong, Kelvin SC, Kang Leng Chiew, and Choon Lin Tan. "A survey of the QR code phishing: the current attacks and countermeasures." 2019 7th International Conference on Smart Computing & Communications (ICSCC). IEEE, 2019.
[8] Cervantes, Jair, et al. "A comprehensive survey on support vector machine classification: Applications, challenges, and trends." Neurocomputing 408 (2020): 189-215.
[9] Butnaru, Andrei, Alexios Mylonas, and Nikolaos Pitropakis. "Towards Lightweight URL-Based Phishing Detection." Future Internet 13.6 (2021): 154.
[10] Tang, Lizhen, and Qusay H. Mahmoud. "A Survey of Machine Learning-Based Solutions for Phishing Website Detection." Machine Learning and Knowledge Extraction 3.3 (2021): 672-694.
[11] Rishikesh Mahajan and Irfan Siddavatam. "Phishing website detection using machine learning algorithms." International Journal of Computer Applications (0975-8887), vol. 181, no. 23, 2018.
[12] Jitendra Kumar, A. Santhanavijayan, B. Janet, Balaji Rajendran, and Bindhumadhava BS. "Phishing website classification and detection using machine learning." International Conference on Computer Communication and Informatics (ICCCI), 2020.
[13] Mehmet Korkmaz, Ozgur Koray Sahingoz, and Banu Diri. "Detection of phishing websites by using machine learning-based URL analysis." 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020.

Copyright

Copyright © 2022 Rutul Patel, Sanjay Kshetry, Sanket Berad, Justin Zirthantlunga. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET39979

Publish Date : 2022-01-17

ISSN : 2321-9653

Publisher Name : IJRASET
