IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Sri Sai Phani Venkat Dasari, Dr. Nitalaksheswara Rao Kolukula
DOI Link: https://doi.org/10.22214/ijraset.2023.55506
A phishing attack is one of the simplest ways to obtain sensitive information from unaware users. The main motive of phishers is to acquire critical information such as usernames, passwords, and bank account details through a malicious link that looks genuine. Users with sound technical knowledge might identify these links quickly, but naive users can be harmed, losing both privacy and assets. Existing spam detection techniques include content-based detection and detection based on the sender's reputation. This project presents an approach that detects phishing attacks from the URL by applying Machine Learning.
I. INTRODUCTION
Phishing is narrowly defined as the theft of personal information from users by a non-trusted source pretending to be a trusted one; however, this definition does not cover every case. A link is considered phishing whenever it poses as a genuine one to confuse naive users and make them perform an action they would perform only for a source they trust. Although phishing is the most straightforward way to obtain information illegally, it depends on the user's weakness. People often call these attacks 'hacking', but that is a misconception: phishing is a trick played by phishers to trap users. The cybercriminals who originate phishing attacks are known as phishers. Phishers usually grab people's personal information and misuse it; the most common cases are trading that information and manipulating accounts. Phishing can also be used to inject a worm into the user's system, which gives control to the phisher or damages the software or hardware. Recently, the number of phishing cases has been increasing all around the world, and existing spam detection techniques have not been able to protect people from them.
URL stands for Uniform Resource Locator. A URL is the address of a location on the internet where specific resources are stored and from which users can access them. URLs therefore carry a lot of information: the behaviour of a URL, its domain information, and the content of its page together describe its nature. Classification, a supervised Machine Learning approach, can help detect phishing attacks by differentiating benign links from phishing links.
II. LITERATURE REVIEW
As the use of social engineering began to rise worldwide, phishing became a simple and convenient way to manipulate naive users and obtain their data maliciously. The first phishing attacks happened in the mid-1990s and targeted America Online (AOL) users. The victims unknowingly entered their login details through the phishing links, and the phishers then used the victims’ accounts for spamming and adding likes, as described in [1]. In 2000, people received emails titled ‘I LOVE YOU’ with a supposed love letter attached. As reported in [2], the systems of users who opened the letter were infected by a worm that overwrote personal files such as images and mailed a copy of itself to every contact in the user’s Outlook address book. According to [3], Indian citizens are receiving links through SMS that pretend to come from official banks or government bodies and claim that the user must immediately update identity details such as the Aadhaar number and PAN. Innocent users tend to click on those links and provide the information, believing the request is genuine. The government and network providers can do little other than warn users to avoid these links.
The current detection techniques in SMS and mail applications are content-based and sender-reputation-based, as shown in [6] and [7]. URL detection is based on the features of the URL. According to [4], these features are classified into address-bar-based, HTML-JavaScript-based, and domain-based features. Many studies have applied Machine Learning to different kinds of features; using different datasets, different sets of features, and different algorithms has produced a range of outcomes. In Machine Learning terms, the suitable concept for this problem is Classification, as can be observed in [8]. Generally, a classifier is built to decide which class a particular input belongs to. Here, the URL must be assigned, based on its features, to one of two classes: Phishing URL or Benign URL. The algorithm proposed in [5] was SVM.
III. PROPOSED FRAMEWORK
A. System Architecture
As shown in Fig. 1, the input URL is passed through the Feature Extraction Program, whose execution results in an array of feature values. The Feature Extraction Program consists of string methods to extract address-based features, web scraping to extract content-based features, and API calls to extract domain-based features. The output array is referred to as the Feature Array. The Feature Array is then passed as input to the ML Classifier, which makes the detection based on its training. If the classifier's output is 0, the URL is benign; otherwise, it is a phishing URL.
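The address-based part of this pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six features below (URL length, dot count, and so on) and the function names are hypothetical stand-ins, not the 18 features used in this work, and the content-based (web scraping) and domain-based (API call) extractors are left out.

```python
import re
from urllib.parse import urlparse


def extract_address_features(url):
    """Return a small feature array built only from the URL string.

    The chosen features are illustrative assumptions, not the paper's list.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return [
        len(url),                                   # total URL length
        host.count("."),                            # number of dots in the host
        int("@" in url),                            # '@' symbol anywhere in the URL
        int("-" in host),                           # hyphen in the host name
        int(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None),  # dotted-quad host
        int(parsed.scheme == "https"),              # HTTPS scheme
    ]


def label_from_output(output):
    """Map the classifier output to a label: 0 is benign, anything else is phishing."""
    return "benign" if output == 0 else "phishing"


url = "http://192.168.0.1/login@secure-update"
features = extract_address_features(url)
```

In a full system the feature array would be handed to the trained classifier, and `label_from_output` would translate its prediction into the benign or phishing decision described above.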
B. Dataset
A dataset was taken from Kaggle and re-processed. For the present purpose, only the required columns (features) were retained. The dataset consists of 11,430 URLs, of which 5,715 are Phishing and 5,715 are Benign. The Phishing URLs are labelled '1', and the Benign URLs are labelled '0'.
C. Data Visualization
Fig. 3 shows the Correlation Heatmap between the features. The lighter colour represents higher correlation and the darker colour represents lower correlation.
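The values behind such a heatmap are pairwise correlation coefficients. The sketch below assumes the Pearson coefficient (the text does not state which coefficient was used) and runs on three made-up toy columns whose names are hypothetical; plotting the resulting matrix as colours is omitted.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns.

    Assumes neither column is constant (a zero variance would divide by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy data for illustration only; not taken from the dataset.
toy = {
    "url_length": [20, 35, 50, 80],
    "num_dots": [1, 2, 3, 5],
    "uses_https": [1, 1, 0, 0],
}
matrix = correlation_matrix(toy)
```

Each cell of the matrix would then be mapped to a colour, with lighter cells for higher values as in Fig. 3.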
D. Features
Although many features can be taken from the URL, we have taken the 18 features that play a significant role in the classification. The following are the features considered:
E. Algorithms
IV. RESULTS AND DISCUSSIONS
Logistic Regression, SVM, and KNN are the conventional algorithms. Random Forest, XG Boost, and Stacking work on the principle of Ensemble Learning. The Stacking algorithm is customizable: the Level-1 models (weak learners) and the meta-model can be chosen freely. In this case, the Level-1 models are KNN, Random Forest, and XG Boost, and the meta-model is SVM. Table I shows the performance measures of the algorithms when the test size is 20%. SVM performed worst of the six algorithms. Logistic Regression did better than SVM but remained well short of the best. KNN gave decent results. Random Forest and XG Boost performed well, with scores above 90%. Using the better-performing models as Level-1 models in Stacking can give a slight but important improvement in performance. The outputs of KNN, Random Forest, and XG Boost differ considerably from one another, and SVM might be a good choice for the meta-model. The implementation of this Stacking model gave the highest scores, as shown in Table I: an accuracy of 97.46%, precision of 97.44%, recall of 97.37%, and F1-score of 97.29%.
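The stacking configuration described above could be assembled roughly as follows. This is a sketch under several assumptions: it uses scikit-learn's StackingClassifier, substitutes scikit-learn's GradientBoostingClassifier for XG Boost so that no extra package is needed, leaves hyperparameters at or near their defaults, and trains on synthetic 18-feature data rather than the Kaggle dataset. It will not reproduce the scores reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 18 numeric features, binary labels.
X, y = make_classification(
    n_samples=600, n_features=18, n_informative=8, random_state=0
)

# Hold out 20% of the samples for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Level-1 models feed their predictions to an SVM meta-model.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        # Assumption: gradient boosting used in place of XG Boost.
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=SVC(),
)
stack.fit(X_train, y_train)
accuracy = accuracy_score(y_test, stack.predict(X_test))
```

StackingClassifier trains the meta-model on cross-validated predictions of the Level-1 models, which keeps the SVM from simply memorising their training-set outputs.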
TABLE I
PERFORMANCE MEASURES
Model | Accuracy (in %) | Precision (in %) | Recall (in %) | F1-Score (in %)
Logistic Regression | 78.83 | 76.86 | 78.62 | 89.69
Support Vector Machine | 71.57 | 65.32 | 74.94 | 87.88
K-Nearest Neighbor | 83.94 | 83.33 | 84.43 | 83.88
Random Forest | 94.09 | 94.54 | 93.45 | 93.99
XG Boost | 96.63 | 96.47 | 96.72 | 96.60
Stacking Classifier | 97.46 | 97.44 | 97.37 | 97.29
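For reference, the F1-score in its standard binary form is the harmonic mean of precision and recall. The short sketch below assumes that definition (the averaging scheme used for the table is not stated) and applies it to the precision and recall listed for the K-Nearest Neighbor and Random Forest rows.

```python
def f1_score_percent(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the K-Nearest Neighbor and Random Forest rows.
knn_f1 = f1_score_percent(83.33, 84.43)
rf_f1 = f1_score_percent(94.54, 93.45)
```

Rounded to two decimals, both values agree with the F1-scores listed for those two rows (83.88 and 93.99).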
V. CONCLUSION
Users have been vulnerable to phishing attacks for many years. In this project, we attempted to collect the best possible set of features and to find suitable Machine Learning approaches for phishing URL detection. The traditional algorithms, Logistic Regression, SVM, and KNN, leave room for improvement. Random Forest and XG Boost give improved results on the same input data, and the Stacking technique gives better results than all of the previous models. This shows the significance of ensemble techniques compared to conventional algorithms. In our understanding, Stacking can be an excellent technique for making slight improvements in Classification and Regression problems when well-performing models are already available. In the future, this methodology can be used to detect phishing websites accurately and to reduce a significant social engineering problem in society. A further step would be to automate this model in mobile and computer applications that detect phishing sites.
REFERENCES
[1] History of Phishing. [Online]. Available: https://cofense.com/knowledge-center/history-of-phishing/
[2] Love bug virus creates worldwide chaos. [Online]. Available: https://www.theguardian.com/world/2000/may/05/jamesmeek
[3] A new phishing attack lurking to scam banking customers: Advisory. [Online]. Available: https://timesofindia.indiatimes.com/business/india-business/a-newphishing-attack-lurking-to-scam-banking-customers-advisory/articleshow/85236685.cms
[4] R. M. Mohammad, F. Thabtah and L. McCluskey, "An assessment of features related to phishing websites using an automated technique," 2012 International Conference for Internet Technology and Secured Transactions, London, UK, 2012, pp. 492-497.
[5] J. Rashid, T. Mahmood, M. W. Nisar and T. Nazir, "Phishing Detection Using Machine Learning Technique," 2020 First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 2020, pp. 43-46, doi: 10.1109/SMART-TECH49988.2020.00026.
[6] Uğur Ozker and Ozgur Koray Sahingoz, "Content Based Phishing Detection with Machine Learning," 2020 International Conference on Electrical Engineering (ICEE), 25-27 September 2020.
[7] S. Naksomboon, C. Charnsripinyo and N. Wattanapongsakorn, "Considering behavior of sender in spam mail detection," INC2010: 6th International Conference on Networked Computing, Gyeongju, Korea (South), 2010, pp. 1-5.
[8] S. Chowdhury and M. P. Schoen, "Research Paper Classification using Supervised Machine Learning Techniques," 2020 Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 2020, pp. 1-6, doi: 10.1109/IETC47856.2020.9249211.
Copyright © 2023 Sri Sai Phani Venkat Dasari, Dr. Nitalaksheswara Rao Kolukula. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET55506
Publish Date : 2023-08-25
ISSN : 2321-9653
Publisher Name : IJRASET