Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: M. Robin Raj Paul, P. Sushanth, Dr. K Santhi Sree
DOI Link: https://doi.org/10.22214/ijraset.2024.63655
Phishing attacks are prevalent in the digital age and continue to rise. This paper explores approaches for detecting such attacks, with the aim of helping to mitigate them in the future. Our primary focus is to show that deep learning methods outperform traditional machine learning models; to this end we evaluate one traditional machine learning model, Naive Bayes, and two deep learning models, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The process starts with loading the dataset containing the URLs, after which the categorical data is transformed and the input features are normalized. The performance of the models was evaluated using Accuracy, Precision, Recall and F1-Score. The results show that the CNN achieved the best performance, outperforming the other two models. We therefore take the view that CNN and other neural-network-based models are the most promising way to mitigate these attacks, and that they can act as a catalyst for developing systems that are resilient to them.
I. INTRODUCTION
Phishing is a well-known cyberthreat that has become increasingly prevalent. It works by using misleading URLs to deceive users into disclosing their private information. Traditional countermeasures such as blacklisting and other heuristic-based approaches help curb these attacks but are not fully effective, which calls for new frameworks and methods. Attackers who engage in phishing often use clever ways of making URLs appear legitimate, which makes the job of those responsible for safeguarding systems much harder. These attacks can deceive even educated and digitally aware people, and they are all the more damaging for those who are not digitally literate. The major challenge is that detection requires models that can continually update, learn and detect on their own, which places the problem squarely in the domain of deep learning. This paper deals with three such models: Naive Bayes, a machine learning algorithm, and CNN and RNN, which are deep learning approaches. Using open-access resources and an open dataset, we built and trained models to detect malicious URLs. We also provide training loss curves and confusion matrices, and compare the performance of the three models in terms of Accuracy, Precision, Recall and F1-Score.
II. RELATED WORK
Researchers have worked on phishing detection for more than a decade. They have used a range of machine learning and deep learning models, and have also built hybrid models, so most of the existing work falls within these domains. Two commonly followed approaches are as follows:
A. Traditional AI-Based Approaches
Rule-based approaches can be used to detect phishing attacks. They are direct, efficient and rely on logical reasoning for detection; however, they require regular updates and can be easily bypassed. With the aim of improving phishing detection, Jain and Gupta [8], Moghimi and Varjani [11] and SatheeshKumar et al. [12] have investigated several rule-based techniques.
B. Deep Learning Based Approaches
All of this research has been instrumental in improving systems that detect phishing URLs. However, as attacks become more diverse and sophisticated, depending on static, rule-based or outwardly focused strategies will no longer suffice, and deep learning models are a fitting answer for coping with such attacks. To determine the most effective models for phishing detection, this paper compares the results of RNNs, CNNs and the conventional Naive Bayes classifier. This will help make the Internet safer for all users.
III. PROPOSED WORK
This project presents a way to detect phishing URLs in order to curb the danger posed by phishing attacks. It aims to achieve high accuracy in detecting such attacks using Convolutional Neural Networks and Recurrent Neural Networks: CNNs can identify hierarchical characteristics in the data and provide a detailed view of the URL characteristics, whereas RNNs process sequential data. The work involves designing, training and evaluating both the RNN and CNN models to distinguish phishing URLs in a dataset of labelled URLs. The RNN is used to learn temporal dependencies that can reveal signs indicative of phishing. The CNN, on the other hand, treats each URL as a one-dimensional input, so it can pick out spatial features and other characteristics of URLs. Taken together, the two models give a more complete view of phishing URL detection. Their performance is then compared and contrasted with that of a widely used traditional machine learning model, the Naive Bayes classifier, which is popular for text-based classification; this comparison illustrates how such models might be applied in the real world.
A. Dataset
The dataset used for this project is the PhiUSIIL Phishing URL Dataset [21]. It holds a large collection of URLs with numerous features and other properties that can be used to train and evaluate the models to detect phishing attacks. The dataset has a total of 235,794 entries and 56 features to choose from. It is on the basis of these features that we decide whether a URL is legitimate or phishing. The dataset was taken from the open UCI Machine Learning Repository and is available for academic and research use.
The dataset includes the following features:
The architecture depicts a comprehensive methodology for detecting phishing URLs using three models: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Naive Bayes. The process starts with the dataset, which contains both authentic and fraudulent URLs. The data is preprocessed to normalize and standardize it, and the preprocessed data is then used to train the CNN, RNN and Naive Bayes models. The performance of each model is then measured, and the comparison results are presented to show how well the models can be expected to perform in real-world situations.
C. Methodology
The first step is data loading and preprocessing. Since the dataset is stored in a CSV file, the file containing the features is loaded using Pandas, with error handling for potential parsing issues. The most important step is then to convert non-numeric columns to a numeric format using scikit-learn’s ‘LabelEncoder’, which ensures that the data is in a numerical format suitable for the machine learning models.
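The loading and encoding step might look roughly like the following minimal sketch. The helper names `load_dataset` and `encode_non_numeric`, the column names and the three-row in-memory sample are assumptions made for illustration only; they are not taken from the paper's code.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_dataset(source):
    """Read a CSV into a DataFrame, returning None if it cannot be loaded."""
    try:
        return pd.read_csv(source)
    except (pd.errors.ParserError, pd.errors.EmptyDataError, FileNotFoundError) as exc:
        print(f"Could not load dataset: {exc}")
        return None


def encode_non_numeric(df):
    """Label-encode every non-numeric column; return the new frame and the encoders."""
    encoded = df.copy()
    encoders = {}
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            encoder = LabelEncoder()
            encoded[column] = encoder.fit_transform(encoded[column].astype(str))
            encoders[column] = encoder  # kept so the mapping can be inverted later
    return encoded, encoders


# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "URL,TLD,URLLength,label\n"
    "https://example.com,com,19,1\n"
    "http://login-example.xyz/verify,xyz,31,0\n"
    "https://example.org/docs,org,24,1\n"
)
raw = load_dataset(sample_csv)
data, encoders = encode_non_numeric(raw)
print(data.dtypes)
```

Keeping one fitted encoder per column is a design choice of this sketch: it allows the integer codes to be mapped back to the original strings with `inverse_transform` when results are inspected.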
The next step is Exploratory Data Analysis, which can be implemented with matplotlib by plotting a label distribution graph showing legitimate versus phishing URLs; this helps in understanding the class balance and the dataset distribution. The next task is data splitting, which can be implemented using Scikit-learn’s ‘train_test_split’ to divide the dataset in an 80:20 ratio for training and testing respectively, so that the trained models can be validated on unseen data. The data should also be normalized/standardized using Scikit-learn’s ‘StandardScaler’ so that all features lie on the same scale, which is crucial for the accuracy of deep learning models like RNN and CNN.
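As an illustration of the split-and-scale step, here is a small sketch on synthetic data. The random feature matrix, the `stratify` argument, the `random_state` value and the choice to fit the scaler on the training portion only are assumptions of this example; the paper states only the 80:20 ratio and the use of ‘StandardScaler’.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
y = np.array([0, 1] * 50)

# 80:20 split; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then reuse it for the test data,
# so that no statistics from the test set leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```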
The next important step is model training. We trained three distinct models: Naive Bayes (Gaussian Naive Bayes), RNN and CNN. Naive Bayes can be implemented using the ‘GaussianNB’ classifier from Scikit-learn, and the RNN and CNN can be implemented using PyTorch (‘nn.RNN’ for the RNN and ‘nn.Conv1d’ for the CNN). The neural models are optimized using the Adam optimizer, i.e. ‘optim.Adam’, minimizing the cross-entropy loss given by ‘nn.CrossEntropyLoss’. They are trained for several iterations, or epochs, using a loop such as ‘for epoch in range(num_epochs)’.
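The training procedure can be sketched as below. This is not the paper's architecture: the class names `UrlCNN` and `UrlRNN`, the layer sizes, the learning rate, full-batch training and the synthetic data are all assumptions chosen to keep the example short. Only the building blocks (‘GaussianNB’, ‘nn.RNN’, ‘nn.Conv1d’, ‘optim.Adam’, ‘nn.CrossEntropyLoss’ and the epoch loop) come from the text.

```python
import numpy as np
import torch
from sklearn.naive_bayes import GaussianNB
from torch import nn, optim


class UrlCNN(nn.Module):
    """Treats each feature vector as a one-channel, one-dimensional signal."""

    def __init__(self, num_features, num_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8 * num_features, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x.unsqueeze(1)))  # (batch, 8, num_features)
        return self.fc(x.flatten(start_dim=1))


class UrlRNN(nn.Module):
    """Reads each feature vector one value at a time, as a sequence."""

    def __init__(self, hidden_size=16, num_classes=2):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        _, hidden = self.rnn(x.unsqueeze(-1))  # hidden: (1, batch, hidden_size)
        return self.fc(hidden[-1])


def train(model, features, labels, num_epochs=20, lr=0.01):
    """Full-batch training loop; returns the loss recorded at each epoch."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    losses = []
    model.train()
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


# Synthetic, already-scaled data standing in for the real features (assumption).
torch.manual_seed(0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)).astype(np.float32)
y = (X.sum(axis=1) > 0).astype(np.int64)

features, labels = torch.from_numpy(X), torch.from_numpy(y)
cnn_losses = train(UrlCNN(num_features=6), features, labels)
rnn_losses = train(UrlRNN(), features, labels)
nb_model = GaussianNB().fit(X, y)  # Naive Bayes needs no epoch loop
```

The per-epoch loss lists returned by `train` are what a training loss curve would be plotted from.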
The final step is model evaluation and analysis, followed by visualisation. The evaluation metrics, Accuracy, Precision, Recall and F1-Score, are computed using Scikit-learn. Confusion matrices are then generated to evaluate each model’s ability to differentiate between authentic and fraudulent URLs, and training loss curves are plotted. Finally, the performance of all three models is displayed.
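To make the relationship between the confusion matrix and the four metrics concrete, the following sketch computes them in plain Python rather than with Scikit-learn, whose `accuracy_score`, `precision_score`, `recall_score` and `f1_score` functions return the same quantities. The helper names, the toy labels and the choice of label 1 as the positive class are assumptions for illustration; the paper does not state its label encoding.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (assumption): 4 positive and 6 negative examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(confusion_counts(y_true, y_pred))      # (tp, fp, tn, fn)
print(classification_metrics(y_true, y_pred))
```

For these toy labels the counts are 3 true positives, 2 false positives, 4 true negatives and 1 false negative, giving an accuracy of 0.7, a precision of 0.6 and a recall of 0.75.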
IV. EXPERIMENTAL ANALYSIS AND RESULTS
Model training and evaluation depend heavily on the number of legitimate and phishing URLs in the dataset, so a thorough understanding of the URL distribution is important for categorizing the classes successfully. A visualization of the label distribution provides a clear picture of the dataset’s class balance.
V. CONCLUSION
In this study we implemented and analyzed different learning models to distinguish phishing URLs. We highlighted the strength of deep learning techniques in detecting phishing attacks by using an improved CNN and an RNN, and compared their performance with a machine learning algorithm, Naive Bayes. The Naive Bayes model acted as the baseline, and the RNN and improved CNN showed exceptional pattern recognition capabilities. From our results we conclude that the Accuracy, Precision, Recall and F1-Score of the improved CNN model were the best and were better than those of Naive Bayes. The CNN model, which includes several convolutional layers, batch normalization and dropout, showed a higher capability in identifying potential phishing URLs. The results indicate how important these models can be in securing the Internet and its users. Future work will involve integrating the trained models into real-time phishing detection frameworks, thereby increasing their relevance and application in network security. A more diverse dataset would help the models train effectively and classify such events properly. Going beyond URL analysis, integrating Natural Language Processing techniques could also help detect phishing content and offer more comprehensive protection. This work can therefore act as a basis for keeping cyberthreats and risks at bay.
REFERENCES
[1] Anti-Phishing Working Group. (Sep. 2022). Phishing Attacks Trends Report-Q2 2022. Accessed: Oct. 15, 2022. [Online]. Available: https://apwg.org/trendsreports/
[2] Cloudflare’s 2023 Phishing Threats Report. Accessed: Oct. 1, 2023. [Online]. Available: https://www.cloudflare.com/lp/2023-phishing-report/
[3] M. Volkamer, K. Renaud, B. Reinheimer, and A. Kunz, "User experiences of TORPEDO: Tooltip-powered phishing email detection," Comput. Secur., vol. 71, pp. 100–113, Nov. 2017, doi: 10.1016/j.cose.2017.02.004.
[4] N. Q. Do, A. Selamat, O. Krejcar, E. Herrera-Viedma, and H. Fujita, "Deep learning for phishing detection: Taxonomy, current challenges and future directions," IEEE Access, vol. 10, pp. 36429–36463, 2022, doi: 10.1109/ACCESS.2022.3151903.
[5] T. Mahara, V. L. H. Josephine, R. Srinivasan, P. Prakash, A. D. Algarni, and O. P. Verma, "Deep vs. shallow: A comparative study of machine learning and deep learning approaches for fake health news detection," IEEE Access, vol. 11, pp. 79330–79340, 2023, doi: 10.1109/ACCESS.2023.3298441.
[6] Google Safe Browsing. Accessed: Oct. 1, 2023. [Online]. Available: https://safebrowsing.google.com/
[7] (2019). Office 365 Advanced Threat Protection Safe Links. Accessed: Jul. 10, 2023. [Online]. Available: https://docs.microsoft.com/enus/office365/securitycompliance/atp-safe-links
[8] A. K. Jain and B. B. Gupta, "A novel approach to protect against phishing attacks at client side using auto-updated white-list," EURASIP J. Inf. Secur., vol. 2016, no. 1, pp. 1–11, Dec. 2016, doi: 10.1186/s13635-016-0034-3.
[9] N. A. Azeez, S. Misra, I. A. Margaret, L. Fernandez-Sanz, and S. M. Abdulhamid, "Adopting automated whitelist approach for detecting phishing attacks," Comput. Secur., vol. 108, Sep. 2021, Art. no. 102328, doi: 10.1016/j.cose.2021.102328.
[10] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Phishing detection based associative classification data mining," Expert Syst. Appl., vol. 41, no. 13, pp. 5948–5959, Oct. 2014, doi: 10.1016/j.eswa.2014.03.019.
[11] M. Moghimi and A. Y. Varjani, "New rule-based phishing detection method," Expert Syst. Appl., vol. 53, pp. 231–242, Jul. 2016, doi: 10.1016/j.eswa.2016.01.028.
[12] M. SatheeshKumar, K. G. Srinivasagan, and G. UnniKrishnan, "A lightweight and proactive rule-based incremental construction approach to detect phishing scam," Inf. Technol. Manage., vol. 23, no. 4, pp. 271–298, Dec. 2022, doi: 10.1007/s10799-021-00351-7.
[13] A. K. Jain and B. B. Gupta, "Phishing detection: Analysis of visual similarity based approaches," Secur. Commun. Netw., vol. 2017, pp. 1–20, Oct. 2017, doi: 10.1155/2017/5421046.
[14] E. Medvet, E. Kirda, and C. Kruegel, "Visual-similarity-based phishing detection," in Proc. 4th Int. Conf. Secur. Privacy Commun. Netw., Sep. 2008, pp. 1–6, doi: 10.1145/1460877.1460905.
[15] W. Liu, X. Deng, G. Huang, and A. Y. Fu, "An antiphishing strategy based on visual similarity assessment," IEEE Internet Comput., vol. 10, no. 2, pp. 58–65, Mar. 2006, doi: 10.1109/MIC.2006.23.
[16] Y. Zhou, Y. Zhang, J. Xiao, Y. Wang, and W. Lin, "Visual similarity based anti-phishing with the combination of local and global features," in Proc. IEEE 13th Int. Conf. Trust, Secur. Privacy Comput. Commun., Sep. 2014, pp. 189–196, doi: 10.1109/TRUSTCOM.2014.28.
[17] G. Varshney, M. Misra, and P. K. Atrey, "Improving the accuracy of search engine based anti-phishing solutions using lightweight features," in Proc. 11th Int. Conf. Internet Technol. Secured Trans. (ICITST), Dec. 2016, pp. 365–370, doi: 10.1109/ICITST.2016.7856731.
[18] Y. Huang, Q. Yang, J. Qin, and W. Wen, "Phishing URL detection via CNN and attention-based hierarchical RNN," in Proc. 18th IEEE Int. Conf. Trust, Secur. Privacy Comput. Commun./13th IEEE Int. Conf. Big Data Sci. Eng., Aug. 2019, pp. 112–119, doi: 10.1109/Trustcom/BIGDATASE.2019.00024.
[19] O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, "Machine learning based phishing detection from URLs," Expert Syst. Appl., vol. 117, pp. 345–357, Mar. 2019, doi: 10.1016/j.eswa.2018.09.029.
[20] S. Singh, M. P. Singh, and R. Pandey, "Phishing detection from URLs using deep learning approach," in Proc. 5th Int. Conf. Comput., Commun. Secur. (ICCCS), Oct. 2020, pp. 1–4, doi: 10.1109/ICCCS49678.2020.9277459.
[21] Dataset: A. Prasad and S. Chandra. (2024). PhiUSIIL Phishing URL (Website). UCI Machine Learning Repository. https://doi.org/10.1016/j.cose.2023.103545.
[22] O. K. Sahingoz, E. Buber, and E. Kugu, "DEPHIDES: Deep learning based phishing detection system," IEEE Access, vol. 12, pp. 8052–8070, 2024, doi: 10.1109/ACCESS.2024.3352629.
Copyright © 2024 M. Robin Raj Paul, P. Sushanth, Dr. K Santhi Sree. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET63655
Publish Date : 2024-07-17
ISSN : 2321-9653
Publisher Name : IJRASET