Improved Phishing Detection using Ensemble Models in Machine Learning

Authors: Dasari Sri Sai Phani Venkat

DOI Link: https://doi.org/10.22214/ijraset.2023.54359

Abstract

Phishing attack is one of the simplest ways to obtain sensitive information from innocent users who are unaware. The main motive of the phishers is to acquire critical information like username, password and bank account details etc. using a fake link which actually looks like a genuine one from an authorized source. Users who have good technical knowledge might be able to identify these links easily but for naive users, this is going to be dangerous as it might lead to loss of privacy and assets. There are other techniques of spam detection based on the sender’s information and content-based detection. In this paper we follow an efficient approach in which we can directly identify an URL if it is Phishing or not by its features using Machine Learning. The aim of this paper is to build multiple Machine Learning models, compare their performances and narrow down to best model based on its performance and efficiency.

Introduction

I. INTRODUCTION

Phishing attacks are limitedly defined as stealing of personal information from the users by a non-trusted source which is actually pretending to be an authorized source, however it is not always the case. A link or site is said to a phishing whenever it act as a trusted party, when it is actually not, with a motive to confuse the naive users and make them to perform an action which they would do only with the trust of the party. The phishing link might grab our personal information and misuse it or there are cases in which the link might grab our bank account details or any authentication details of online assets to manipulate the accounts. There is also another possibility of damage with this phishing attacks where clicking the phishing link activates a worm into the user’s device. This worm may give unnecessary permission to the cyber criminals to get over the total control and access of the device. Also, sometimes the worm may cause damage to software and hardware of the system. Phishing is basically a social engineering activity. The cyber criminals who involve in making these phishing attacks are known as phishers. Although phishing is the simplest attack possible to obtain information illegally, it is totally dependent on human weakness. General people may refer these attacks as ‘Hacking’ but it’s a misconception. This is just a trick being played by phishers to trap users. Naive users get into these traps easily as the link seems to be almost real. In recent times, the number of phishing cases are increasing in India. The current spam detection techniques which are being used are based on sender’s reputation and content-based identification, but the phishers are using trustful identities to send the links with generic content as the link is just enough to fulfil their motives. In this case, phishing can only be prevented by identification of the URL links. URL means Uniform Resource Locator. Just like how we have address to any particular location, an URL is an address for location where the resources are stored on the internet. So, these URLs will contain a lot of information. The elements present in the URL, domain information and the content of the link plays a major role in describing an URL. Classification is one of the Machine Learning concepts which can help in detecting Phishing attacks. The differentiation between a benign link and phishing link is possible. The URL link which is being sent by the phishers will be having certain features both internally and externally which will be used to decide whether the link is phishing link or a benign one. Initially, two classes named ‘Phishing’ and ‘Benign’ are considered. Based on the features of the URL, the prediction is made on which class the URL belongs to. This classification approach can be implemented by using various supervised learning algorithms suitable for classification.

II. LITERATURE REVIEW

While the use of social engineering began to rise in the world, Phishing has been a simple and convenient way to manipulate naive users and obtain their data maliciously. The first phishing attack was found to be happened in the mid 1990’s and targeted American online users. The victims unknowingly provided their login details in the phishing links and the Phishers started using the victim’s accounts for spamming and adding likes as said in Reference [1].

In the year 2000, people used to receive mails with title ‘I LOVE YOU’ attached with a love letter. Users who clicked on the letter, it initiated a worm which obtained all the personal image files and sent it to all the contacts in the outlook. This has been published in the article [2].

According to Reference [3], In current times, particularly in India, people are receiving a lot of links pretending to be that they are from official banks or from government saying that the user need to immediately update his identity details like Aadhar no., PAN no. etc. Innocent users tend to click on those links and provide the information thinking that it is genuine. The government and network providers are not able to do anything more than warn the users to stay away from these links as there is no responsible source other than human weakness of the victim.

There are various naive methods to identify phishing attacks but most of them are inefficient. Also, sometimes it really doesn’t matter even if the detection worked after the attack has been committed as the loss has been happened already. So, there needs to be a better approach. The spam detection techniques which are currently in use, which can be seen in SMS applications and Mail applications are based on content-based detection and sender’s behaviour as shown in References [6] and [7].

Sometimes people tend to receive mails stating that they have won lotteries, lucky draws or any other gifts and asks the user to click there to claim their prize. This kind of mails can be detected by content-based detection approaches.

We might see messages and mails getting thrown into spam based on the sender. The user’s reputation is based on his history, the type of messages or mails he was sending, number of reports on him, etc. Also, the way of style he maintains, like some patterns can be considered. In this way, the detection can be done based on the sender’s behaviour and statistics.

The above approaches can be useful, but a methodology is required in order to protect the naive users from phishing. Rather than detecting the mail or message, detecting the URL would be a better way. As we have discussed earlier, the URL detection is totally based on its features, but what exact features? is the biggest question. According to past studies, the features are majorly classified into 3 types: Address-bar based features, HTML-JavaScript based features, Domain based features, as said in Reference [4]. The address bar-based features are directly taken from the URL string. The HTML-JavaScript based are taken by crawling through the source code. The domain-based features are taken from the WHOIS database, DNS records and other APIs, shown in Reference [4].

Based on different kinds of features, many studies have been evolved around the Machine Learning. Using different kinds of datasets, different sets of features, and different algorithms resulted in various outcomes. According to Machine Learning, the suitable concept used for this kind of problems is Classification, can be observed in [8]. Generally, a classifier is built in order to decide to which class a particular input belongs to. Here, the URL, based on its features, need to be classified into a class which it belongs to, either Phishing URL or Benign URL. From [5], the proposed algorithm was SVM.

III. PROPOSED FRAMEWORK

A. System Architecture

The Website link/URL received by the user needs to be identified whether it is a Phishing URL Link or a Benign one. For that, the URL will be passed into the Feature Extraction Program, which is a python program written using various string techniques to extract the address bar-based features and also used various web scraping techniques to extract its content-based features and domain-based features. After successful execution of the program, the Feature Extraction Program gives an output as an Array. The array contains all the required and extracted features of the URL and hence it is called the Feature Array. The Feature Array which is a result of the feature extraction program is used as an input to the Classifier. The model makes a prediction for the feature array that has been provided and gives output as 0 or 1 as it is a binary class classification. ‘0’ says that the URL is Benign and ‘1’ says that the URL is Phishing.

B. Dataset

A dataset has been taken from Kaggle and re-processed. According to the current scenario, only required columns(features) have been taken into the new dataset. The dataset consists of 11,430 URLs in which 5,715 are Phishing and 5,715 are Benign. The Phishing URLs are labelled as ‘1’ and the Benign URLs are labelled as ‘0’.

C. Features

Although there are a large number of features that can be taken from the URL, we have taken the 18 features which play a major role in the classification. The following are the features considered:

Length of the URL
Presence of IP address in the URL
Number of ‘.’ in the URL
Number of ‘-’ in the URL
Number of ‘@’ in the URL
Number of ‘/’ in the URL
Number of ‘//’ in the URL
Presence of HTTP/HTTPS token in the URL
Prefix-Suffix separation in Domain name
URL Shortening Services
Number of Hyperlinks in Source code
WHOIS Registration of Domain
DNS Records
Domain Registration Length
Domain Age
Web Traffic
Google Index
Page Rank

D. Algorithms

Logistic Regression: Logistic Regression is a machine learning algorithm used for Classification. The classification in Logistic regression is based on the probabilities between the classes. The logistic function curve shows the probability of something to which class it belongs.
Support Vector Machine: Support Vector Machine, known as SVM, is one of the most popular Supervised Learning algorithms. It is most widely used for Classification, although it can perform for Regression problems. The main objective of SVM is to create a hyperplane, which can divide the dimensional space into classes. The algorithm uses extreme points of data as support vectors that help in creating the hyperplane.
K-Nearest Neighbor: K-Nearest Neighbor is simply known as KNN algorithm. KNN is the simplest algorithm which can be used for both Regression and Classification. In the current scenario, we are using this algorithm for classification of URLs. KNN differentiate between the classes on the basis of distance between the datapoints. The most common metrics that are used to calculate the distance between the datapoints are Euclidean distance and Manhattan’s distance.
Random Forest: Random Forest is a supervised machine learning algorithm. It is a type of bagging algorithm in Ensemble learning. It is an algorithm built on the idea of integrating various classifiers to solve complex issues and enhance model performance. In Random Forest, instead of depending on one decision tree, multiple decision trees are used for making a prediction based on the majority value. This algorithm can be used under regression and classification problems.
XG Boost: XG Boost Algorithm is a Supervised Machine Learning Algorithm. It was found by Tianqi Chen. XG Boost is a boosting type of algorithm in Ensemble Learning. Boosting is a type of Ensemble technique in machine learning where models are built in sequential manner, a model is built and then based on its performance the next model is built. Here every model will be a weak learner to its next model and this is going to stop when we get a model with good performance which will be a strong learner.
Stacking: Stacking is a type of ensemble learning technique. Ensemble Learning is a learning technique where the predictive power of multiple models called the base models are taken and a resultant predictive model is formed. Apparently, the resultant predictive model formed has a lot more predictive power than the individual model predictive accuracy. In stacking, a model is used to find the final predictive learning classifier from the other base learners using a common data for all.

IV. RESULTS AND DISCUSSIONS

Logistic Regression, SVM and KNN are the conventional algorithms and the Random Forest, XG Boost and Stacking are algorithms working on the principle of Ensemble Learning. Stacking algorithm is a special case as it is customizable, the Level-1 models (weak learners) and meta-model can be taken by our own choice. In this case, the Leavel-1 models are KNN, Random Forest and XG Boost and the meta-model is SVM. Table 1 shows the performance measures of the algorithms when the test size is 20%. The SVM performed least compared to the other 5 algorithms. The Logistic Regression is better than SVM but not the best. KNN gave decent results. Random Forest and XG Boost were performing absolutely good with above 90% scores. The belief that using better performance models as Level-1 models in Stacking could give slight but essential improvement in the performance. There is much distinctness among the outputs of KNN, Random Forest and XG Boost. SVM might be a better choice for the meta model. The implementation of this Stacking model gave the highest scores as shown in the Table 1. The Stacking classifier was performing with the highest accuracy of 97.46%, precision of 97.44%, recall of 97.37% and F1-score of 97.29%.

TABLE I
PERFORMANCE MEASURES

Model	Accuracy (in %)	Precision (in %)	Recall (in %)	F1-Score (in %)
Logistic Regression	78.83	76.86	78.62	89.69
Support Vector Machine	71.57	65.32	74.94	87.88
K-Nearest Neighbor	83.94	83.33	84.43	83.88
Random Forest	94.09	94.54	93.45	93.99
XG Boost	96.63	96.47	96.72	96.60
Stacking Classifier	97.46	97.44	97.37	97.29

Conclusion

Phishing attacks are being quite vulnerable for users from many years.In this paper, best possible approaches were attempted for collecting the best set of features of the URLs and finding the suitable Machine Learning approaches. The traditional algorithms, Logistic Regression, SVM and KNN are not performing upto the mark while Random Forest and XG Boost are giving improved results with the same input data. Stacking technique has given much better results. The significance of ensembling techniques compared to conventional algorithms have been shown. As per understanding, Stacking can be a good technique to make slight improvements for both Classification and Regression problems when we already have performing models. In the future, this methodology can be used to detect phishing websites accurately and reduce the major social engineering problem in the society. The future scope of this methodology would be automation of this model in the mobile and computer applications to detect phishing sites.

References

[1] History of Phishing. [Online]. Available: https://cofense.com/knowledge-center/history-of-phishing/ [2] Love bug virus creates worldwide chaos. [Online]. Available: https://www.theguardian.com/world/2000/may/05/jamesmeek [3] A new phishing attack lurking to scam banking customers: Advisory. [Online]. Available: https://timesofindia.indiatimes.com/business/india-business/a-new-phishing-attack-lurking-to-scam-banking-customers-advisory/articleshow/85236685.cms [4] R. M. Mohammad, F. Thabtah and L. McCluskey, \"An assessment of features related to phishing websites using an automated technique,\" 2012 International Conference for Internet Technology and Secured Transactions, London, UK, 2012, pp. 492-497. [5] J. Rashid, T. Mahmood, M. W. Nisar and T. Nazir, \"Phishing Detection Using Machine Learning Technique,\" 2020 First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 2020, pp. 43-46, doi: 10.1109/SMART-TECH49988.2020.00026. [6] Uur Ozker, Ozgur Koray Sahingoz, \"Content Based Phishing Detection with Machine Learning\", 2020 International Conference on Electrical Engineering (ICEE), 25-27 September 2020. [7] S. Naksomboon, C. Charnsripinyo and N. Wattanapongsakorn, \"Considering behavior of sender in spam mail detection,\" INC2010: 6th International Conference on Networked Computing, Gyeongju, Korea (South), 2010, pp. 1-5. [8] S. Chowdhury and M. P. Schoen, \"Research Paper Classification using Supervised Machine Learning Techniques,\" 2020 Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 2020, pp. 1-6, doi: 10.1109/IETC47856.2020.9249211.

Copyright

Copyright © 2023 Dasari Sri Sai Phani Venkat. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET54359

Publish Date : 2023-06-23

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here