Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Michael Richard Chinguwo, R. Dhanalakshmi
DOI Link: https://doi.org/10.22214/ijraset.2023.49422
Certificate: View Certificate
Cloud computing enables users to access computing services over the Internet, but this also presents a security risk due to the anonymous nature of the Internet. Social engineering attacks are one of the most common security breaches in cloud computing, where attackers trick cloud users to reveal sensitive information. Detecting phishing attacks in cloud computing is challenging, and various solutions have been proposed, including rule-based and anomaly-based detection methods. Machine learning techniques have proven to be effective in detecting and classifying phishing attacks, particularly for distinguishing between legitimate and phishing websites. This paper proposes an ensemble approach utilizing four different machine learning classifiers to detect phishing websites. The study analyzes various features, such as address bar-based, domain-based, and HTML & JavaScript-based features, and the findings reveal that the proposed ensemble approach outperforms the base classifiers, achieving the highest accuracy of 98.8%.
I. INTRODUCTION
Cloud computing allows users to access computing services over the internet, making it convenient for users to access their data and programs from anywhere as long as they have an internet connection. However, the anonymous nature of the internet also makes it easier for attackers to breach cloud-based services, and social engineering attacks, such as phishing, are prevalent threats in cloud computing. These attacks aim to extract sensitive information from cloud service users, which can lead to unauthorized access to their accounts, programs, and data.
Phishing attacks are a common form of social engineering used against cloud service users, where attackers create fake websites that resemble genuine ones and trick victims into entering their login credentials. Traditional solutions like antivirus software, firewalls, and designated software are insufficient in fully preventing web phishing attacks [15]. Similarly, implementing Secure Socket Layer (SSL) and Digital Certificate (CA) is not enough to protect web users from phishing attacks [10]. Machine learning (ML) has demonstrated promising outcomes in addressing phishing attacks compared to other traditional anti-phishing methods [1].
Phishing attacks in cloud-based environments can pose significant threats to individuals and organizations that rely on cloud services. Attackers can use various phishing techniques to obtain sensitive information, and phishing websites can be designed to appear as legitimate sites, controlled by attackers. To prevent phishing attacks in a cloud-based environment, organizations must educate their employees on the dangers of phishing, how to recognize phishing emails and websites, and how to report them. Implementing multi-factor authentication, regularly monitoring access logs, and using encryption to protect sensitive data are other effective measures.
Employing machine learning to combat phishing attacks in cloud-based systems is a potential solution, as it can aid organizations in detecting and responding to these threats in real-time. Machine learning algorithms can be trained on extensive datasets of known phishing websites to detect patterns and anomalies suggestive of phishing attacks [8]. However, the advancement of technological tools is enabling phishing attackers to create bogus websites that closely resemble legitimate ones, making phishing detection a difficult problem. Existing methods face challenges in improving their accuracy rate, reducing false positives and eliminating false negatives, as well as the risk of overfitting [25]. Therefore, an ensemble machine learning approach that utilizes multiple models combined to enhance accuracy and reduce bias is a promising solution.
II. LITERATURE REVIEW
A. Phishing Attack Trends
In recent years, phishing attacks have become a common form of cybercrime and have been increasing in frequency.
According to the [30] Report, phishing was the second most common action used in data breaches, accounting for 36% of incidents analyzed, which is a significant increase from the 25% reported in 2020. However, in [31] report shows that phishing attacks were the third most common cyber-attack after Ransomware and DoS, comprising 20% of all incidents reported in 2022. The report further reveals that 33,473,532 accounts were breached through phishing, which accounts for 2.9% of the 1,154,259,736 personal records breached in 2022. Additionally, the report indicates that phishing is responsible for more than 60% of all breaches, followed by stolen credentials and Ransomware. Email servers and web application servers were the most commonly breached assets in 2022 due to the high prevalence of phishing. Reference [6] reported a new record in the third quarter of 2022 with 1,270,883 total phishing attacks, marking it as the worst quarter for phishing ever observed by the organization. Meanwhile, credential stuffing attacks, which involve using stolen credentials from one website to gain unauthorized access to another website, are also on the rise. According to [2] report, credential stuffing attacks increased by 77% in the first quarter of 2021 compared to the same period in 2020.
B. Cloud-Based Phishing Attacks
In 2022, there were two phishing campaigns [32], [28] that utilized cloud-based services like AWS and Dropbox to host and distribute their phishing pages and malware. Reference [19] identified two categories of cloud-based phishing attacks, and recommended countermeasures like improved monitoring and intrusion detection systems to deal with them. A study by [21] provided guidance on how to prevent different types of cloud-based phishing attacks, and discussed relevant regulations and compliance frameworks. In their work, [4] analyzed the potential impact of cloud-based phishing attacks on cloud users, and explored various types of attacks and the associated risks of sensitive data loss and financial damages.
C. Phishing Detection by Rule-Based Detection
Rule-based detection is a method of identifying phishing websites by using predefined rules or heuristics. It can be effective at detecting known types of phishing attacks but may not be as effective against new or sophisticated attacks. Researchers in [9], [22], [16] have proposed various rule-based approaches that utilize machine learning algorithms to generate rules and classify websites as either phishing or legitimate based on features such as domain age, SSL certificate, URL length, and specific keywords. Some approaches also incorporate user awareness training and other detection techniques such as data mining algorithms. Rule-based detection can be lightweight and proactive, making it suitable for real-time detection of unknown phishing URLs [27].
D. Phishing Detection by Anomaly-Based Detection
Anomaly-based detection is a technique used to identify phishing websites by comparing their characteristics with those of legitimate websites. Several studies propose anomaly-based approaches to detect phishing URLs. One approach [26] involves extracting features from W3C DOM objects/properties to train a classifier to distinguish between legitimate and phishing pages. Another study [17] suggests a two-stage approach that uses mobile eye tracking to analyze users' eye movements as they view a web page and an anomaly-based detection technique to identify phishing URLs. Another study [14] proposes a method for detecting malicious URLs in instant messaging platforms using anomaly-based techniques and a scoring mechanism with 11 features.
E. Phishing Detection by Machine Learning-Based Detection
Machine learning-based detection uses algorithms that analyze website data to detect phishing attacks. These algorithms are trained on large datasets of legitimate and known phishing websites and learn to differentiate between them based on features such as URL structure, content, and behavior. Various methods [7], [20], [18] have been proposed, such as using SVM, Multilayer Perceptron, Decision tree Induction, Naïve Bayes, and K-Nearest Neighbor algorithms to classify phishing and legitimate URLs. These approaches use different features, such as textual properties, link structures, webpage contents, DNS information, and network traffic.
F. Phishing Detection by Ensemble Machine Learning
Ensemble machine learning techniques are used for phishing website detection by combining multiple models to improve accuracy and robustness [13]. Various data mining techniques are utilized in intelligent systems to classify websites as legitimate or phishing [29]. XGBOOST algorithm in [24] is used in a novel phishing detection model, and bagging and boosting algorithms such as Gradient Boosting and Cat Boosting are proposed in [11] to improve accuracy in phishing website detection.
An optimized stacking ensemble model for phishing website detection is proposed in [5], where the parameters of multiple ensemble machine learning methods are optimized using a genetic algorithm, and the best three classifiers are chosen as base classifiers for a stacking ensemble method.
G. Challenges with Phishing Website Detection
Detecting phishing websites using machine learning can be challenging due to various reasons such as data imbalance, feature selection, dynamic nature of attacks, adversarial attacks, generalization, and interpretability. It is recommended in [7] to carefully design and test machine learning models to ensure their effectiveness and robustness. Studies in [11] recommend using various machine learning algorithms, adopting hybrid technology, evaluating on multiple datasets, and utilizing other evolutionary algorithms to enhance efficiency.
Additionally, [3] suggested that the hybrid approach for predicting phishing websites should be evaluated on various datasets with a wide range of website features. A larger dataset is needed to create reliable models to address the issue of phishing attacks in the business sector.
III. METHODOLOGY
The project aims to develop a new approach for detecting cloud-based phishing attacks using stacking ensemble machine learning. The methodology involves collecting and preprocessing a dataset, training several base machine learning models using various algorithms, identifying the most effective algorithms, and employing a stacking ensemble technique to merge them. A meta-model will then be trained on the base models' predictions to improve the overall performance of the model, which will determine whether a website is legitimate or a phishing site.
A. Data Collection
This research collected a dataset of cloud-based phishing attacks from various public repositories. The dataset includes four datasets with varying numbers of instances, features, and class distributions. Dataset 1 has 11,054 instances with a distribution of 55.7% phishing and 44.3% legitimate instances, while Dataset 2 has 10,000 instances with a distribution of 50% phishing and 50% legitimate instances. Dataset 3 has 1,353 instances with a distribution of 88.5% phishing and 11.5% legitimate instances, and Dataset 4 has 10,000 instances with a distribution of 50% phishing and 50% legitimate instances.
B. Data Preprocessing
Data preprocessing phase of the research involved cleaning, normalizing, selecting, and engineering features of the collected datasets. Various Python libraries such as Pandas, NumPy, Scikit-learn, Seaborn, and Matplotlib were used for data preprocessing. Pandas was used to handle and transform data, NumPy for array manipulation and mathematical operations, Scikit-learn for data preprocessing, model selection, and evaluation, Seaborn for data visualization, and Matplotlib for creating different types of visualizations. These libraries provide a range of functionalities for handling, transforming, analyzing, and visualizing the phishing dataset.
C. Feature Engineering
Feature engineering is the process of selecting, creating, and transforming features from raw data to improve machine learning models. In the context of phishing datasets, it is crucial to extract relevant and discriminative features from URLs to detect phishing attacks. A study by [23] introduced a new tool for automatically extracting important features from websites to predict website legitimacy. T
he authors analyzed 17 features that differentiate phishing websites from legitimate ones and developed a new rule for each feature. The study concluded that "Request URL," "Age of Domain," and "HTTPS and SSL" features are the most significant in detecting phishing websites. The crucial features have been grouped into four categories: Address bar features, Abnormal-based features, HTML and JavaScript-based features, and Domain-based features.
D. Proposed Method
The stacking ensemble machine learning approach involves a formula that combines the predictions of multiple base models to make a final prediction. The formula can be represented as follows:
Final_Pred = Meta-Model (Pred1, Pred2, ..., PredN)
Where:
E. Proposed Approach Algorithm
INPUT: Training data (X, Y), base classifiers (C1, C2, ..., Cn), meta-classifier (M)
SPLIT: Split the training data into two parts: training set (X_train, Y_train) and validation set (X_val, Y_val)
FOR each base classifier Ci in base classifiers:
TRAIN: Train Ci on X_train, Y_train
PREDICT: Use Ci to make predictions on X_val, store predictions as P_val_i
COMBINE: Combine the predictions P_val_i to form a new feature set for X_val
TRAIN: Train M on (X_val, Y_val, P_val_i)
RETURN: Trained meta-classifier M
F. Architecture of Proposed Approach
The proposed model's architecture for detecting phishing websites is illustrated in Fig. 1.
The proposed stacking ensemble machine learning methodology includes three stages: training, ranking, and detection. The training stage involves feature generation, transformation, and training of eight classifiers. The ranking stage selects the top four performing algorithms for the ensemble learning model. The detection stage uses the trained ensemble machine learning model to classify new web pages as phishing or legitimate, and measures the accuracy of the system in identifying phishing attacks in real-time.
IV. RESULTS AND DISCUSSION
The study aimed to assess the effectiveness of a stacking ensemble machine learning technique in detecting cloud-based phishing attacks. The dataset was divided into training and testing sets, and various machine learning models, including Random Forests, Support Vector Machines, K-Nearest Neighbors, Decision Tree, Gradient Boosting Classifier, CatBoost Classifier, XGBoost Classifier, and Multi-layer Perceptron, were trained and tested. The best four models were combined using a stacking ensemble approach. The results showed that the proposed method achieved high accuracy, precision, and recall in detecting phishing websites, outperforming other methods and individual models.
A. Evaluation Metrics
Evaluation metrics are measures used to evaluate machine learning model performance. The metrics assess the accuracy of the model in making accurate predictions on a given dataset. Metrics like accuracy, precision, recall and F1-score are used to evaluate phishing detection system effectiveness. Confusion matrix is used to calculate these metrics and provide detailed breakdown of true positives, false positives, true negatives, and false negatives for the classification model predicting phishing websites.
B. Experiment Results
This study compared the accuracy of various traditional and stacking ensemble machine learning classifiers for detecting phishing attacks. Results showed that the proposed stacking ensemble approach outperformed other classifiers in all datasets, achieving the highest accuracy scores of 97.6%, 86.2%, 88.6%, and 98.8% for Datasets 1 to 4, respectively. The K-Nearest Neighbors classifier had the lowest accuracy score in all datasets. The proposed stacking ensemble approach was found to be more effective than traditional machine learning classifiers for detecting phishing attacks. Table 1 shows evaluation metrics on dataset 1.
TABLE 1
EVALUATION METRICS OF DATASET 1
ML Model |
Accuracy |
f1_score |
Recall |
Precision |
K-Nearest Neighbors |
0.956 |
0.961 |
0.991 |
0.989 |
Support Vector Machine |
0.964 |
0.968 |
0.980 |
0.965 |
Decision Tree |
0.958 |
0.962 |
0.991 |
0.993 |
Random Forest |
0.969 |
0.972 |
0.995 |
0.988 |
Gradient Boosting Classifier |
0.974 |
0.977 |
0.994 |
0.986 |
CatBoost Classifier |
0.972 |
0.975 |
0.994 |
0.989 |
XGBoost Classifier |
0.952 |
0.958 |
0.968 |
0.947 |
Multi-layer Perceptron |
0.967 |
0.970 |
0.996 |
0.977 |
The Proposed Approach |
0.976 |
0.978 |
0.996 |
0.984 |
Fig. 2 shows evaluation of the proposed approach and other classification algorithms based on their accuracy, precision, recall, and F-measure on Dataset 1.
C. Proposed Model Performance on Different Datasets
The proposed stacking ensemble machine learning classifier achieved the highest accuracy score of 98.8% on dataset 4, which had 10000 instances and 48 features and was balanced. The performance on dataset 1, which had an imbalanced dataset with 11054 instances and 30 features, was the second highest with an accuracy score of 97.6%. The lowest performance was observed in Dataset 3, which had an imbalanced dataset with 1353 instances and 8 features. The results suggest that the number of instances and features in a dataset can impact the accuracy of a phishing detection system. Fig. 3 presents an evaluation of the proposed approach's performance on various datasets based on accuracy, precision, recall, and F-measure.
D. Comparison with Existing Research
The study presents a comparison of a stacking ensemble learning approach with existing studies in the field of phishing detection using machine learning. The approach involves combining multiple machine learning models to improve the accuracy of predictions. The study conducted a literature review and evaluated the approach against previous studies using accuracy metrics. Table 2 presents a comparative analysis between our approach and the previous studies.
TABLE 2
COMPARISON OF OUR METHODOLOGY WITH PREVIOUS RESEARCH
Work |
ML Algorithm Used |
No. of Features |
Sample |
Accuracy |
[29] |
RF, ANN, Rotation Forest, KNN, C4.5, CART and NB, SVM |
30 |
2456 |
97.36%
|
[24] |
XGBOOST, PNN, RF, NB, KNN |
30 |
2456 |
97.27%
|
[11] |
Gradient Boost, Cat Boost, RF |
30 |
2456 |
97.40%
|
[12] |
LR, RF, and CatBoost |
30 |
11055 |
97.87 %
|
[5] |
RF, AdaBoost, XGBoost, Bagging, GradientBoost, and LightGBM, RF+GB+ SVM |
30 |
11055 |
97.16%
|
[3] |
BPNN, DNN, SVM, NB, C4.5, KNN |
9 |
1353 |
88.77%. |
The Proposed Approach |
K-Nearest Neighbors, Decision Tree, Multi-layer Perceptron, Random Forest, Cat Boost, Gradient Boost, XGBoost, Support Vector Machines |
30 |
11054 |
97.60%
|
48 |
10000 |
98.80%
|
V. FUTURE WORK
The research suggests future work to improve the interpretability of the stacking ensemble technique and identify key features that contribute to phishing detection. Combining the technique with advanced methods like deep learning and natural language processing is also recommended to improve accuracy, particularly for evolving attack characteristics. Further testing on larger and more diverse datasets is also suggested to evaluate the model's robustness against adversarial attacks. Overall, the proposed approach presents promising results, but continued research is needed to refine and enhance the technique for real-world effectiveness.
This study explored the use of a stacking ensemble machine learning technique for detecting cloud-based phishing attacks, and demonstrated its efficacy in accurately identifying such attacks with high levels of precision, recall, accuracy, and F1 score. The study highlights the importance of feature selection and dataset size in improving the accuracy of phishing detection systems, and recommends that organizations incorporate this technique into their security strategy to better detect and respond to cloud-based phishing attacks. Regular monitoring and updates to security measures are also advised, as attackers continuously evolve their tactics.
[1] Abdelhamid, N., Thabtah, F. A., & Abdel-jaber, H. (2017). Phishing detection: A recent intelligent machine learning comparison based on models content and features. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), 72-77. [2] Akamai. (2021). Phishing for Finance: Volume 7, Issue 2 . Cambridge: Akamai Technologies, Inc. [3] Ali, W., & Ahmed, A. A. (2019). Hybrid intelligent phishing website prediction using deep neural networks with genetic algorithm-based feature selection and weighting. IET Information Security, Volume13, Issue 6, 659-669. [4] Alnuem, M. A., Damodaran, S. K., & Fowler, J. E. (2014). Cloud-based Phishing Attacks: An Overview. International Journal of Computer Applications Volume 107 - No. 5. [5] Al-Sarem, M., Saeed, F., Al-Mekhlafi, Z. G., Mohammed, B. A., Al-Hadhrami, T., Alshammari, M. T., . . . Alshammari, T. S. (2021). An Optimized Stacking Ensemble Model for Phishing Websites Detection. Electronics 2021, 10, 1285., 1-18. [6] APWG. (2022). Phishing Activity Trends Report, 3rd Quarter 2022. Lexington: APWG. [7] Assegie, T. A. (2021). K-Nearest Neighbor Based URL Identification Model for Phishing Attack Detection. Indian Journal of Artificial Intelligence and Neural Networking (IJAINN), Volume-1 Issue-2, 18-21. [8] Basit, A., Zafar, M., Liu, X., Javed, A. R., Jalil, Z., & Kifayat, K. (2021). A comprehensive survey of AI-enabled phishing attacks detection techniques. Telecommunication Systems, 139–154. [9] Basnet, R., Sung, A. H., & Liu, Q. (2012). Rule-Based Phishing Attack Detection. [10] Culafi, A. (2019, November 07). SSL certificate abuse drives growing number of phishing attacks. Retrieved from TechTarget: https://www.techtarget.com/searchsecurity/news/252473628/SSL-certificate-abuse-drives-growing-number-of-phishing-attacks [11] Deekshitha, B., Aswitha, C., Sundar, C. S., & Deepthi, A. K. (2022). URL Based Phishing Website Detection by Using Gradient and Catboost Algorithms. International Journal for Research in Applied Science &Engineering Technology (IJRAS), Volume 10 Issue VI. [12] Fang, L. C., Ayop, Z., Anawar, S., Othman, N. F., Harum, N., & Abdullah, R. S. (2021). URL Phishing Detection System Utilizing Catboost Machine Learning Approach. IJCSNS International Journal of Computer Science and Network Security, VOL.21 No.9, 297-302. [13] Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. Sebastopol: O’Reilly Media, Inc. [14] Guan, D. J., Chen, C.-M., & Lin, J.-B. (2022). Anomaly Based Malicious URL Detection in Instant Messaging. [15] Gupta, B. B., Arachchilage, N. A., & Psannis, K. E. (2018). Defending against Phishing Attacks: Taxonomy of Methods, Current Issues and Future Directions. Telecommunication Systems, 247-267. [16] H F. Fazliya, H. N. (2019). A Rule Based Prediction of Phishing Websites Using Data Mining. Journal of Technology and Value Addition Volume 1 (2). [17] Kaniuk, S., Patel, D., & Ma, X. (2020). Security Study: Phishing URLs Using Mobile Eye Tracking and Anomaly Based Detection. [18] Karnik, R., & Bhandari, G. M. (2016). Support Vector Machine Based Malware and Phishing Website Detection. International Journal of Computing and Technology, Volume 3, Issue 5. [19] Khalil, I. M., Khreishah, A., & Azeem, M. (2014). Cloud Computing Security: A Survey . Computers 2014, 3(1, 1-35. [20] Lakshmi, V. S., & Vijaya, M. (2012). Efficient prediction of phishing websites using supervised. Procedia Engineering 2012, Volume 30, 798-805. [21] Mather, T., Kumaraswamy, S., & Latif, S. (2009). Cloud Security and Privacy: An Enterprise Perspective on Risks andCompliance. Sebastopol: O’Reilly Media, Inc. [22] Mohammad, R. M., Thabtah, F., & McCluskey, L. (2014). Intelligent rule-based phishing websites classification. [23] Mohammad, R. M., Thabtah, F., & Mccluskey, T. L. (2012). An assessment of features related to phishing websites using an automated technique. Internet Technology And Secured Transactions. [24] Musa, H., Gital, A., Zambuk, F. U., Umar, A., Umar, A. Y., & Wazir, J. U. (2019). A Comparative Analysis of Phishing Website Detection Using XGBoost Algorithm. Journal of Theoretical and Applied Information Technology, Vol.97. No 5. [25] P.Kalaharsha, & Mehtre, B. M. (2021). Detecting Phishing Sites - An Overview. [26] Pan, Y., & Ding, X. (2006). Anomaly Based Web Phishing Page Detection. 2006 22nd Annual Computer Security Applications Conference (ACSAC\'06). [27] SatheeshKumar, M., Srinivasagan, K. G., & UnniKrishnan, G. (2022). A lightweight and proactive rule-based incremental construction approach to detect phishing scam. Information Technology & Management; Dec 2022, Vol. 23 Issue 4, 271-298. [28] Scroxton, A. (2022, November 02). Dropbox code compromised in phishing attack. Retrieved from TechTarget: https://www.computerweekly.com/news/252526838/Dropbox-code-compromised-in-phishing-attack [29] Subasi, A., Molah, E., F. A., & Chaudhery, T. J. (2017). Intelligent Phishing Website Detection using Random Forest Classifier. 2017 International Conference on Electrical and Computing Technologies and Applications (ICECTA). [30] Verizon. (2021). Data Breach Investigations Report. New York: Verizon. [31] Verizon. (2022). Data Breach Investigations Report. New York: Verizon. [32] Zurier, S. (2022, August 22). Hackers steal credentials by building phishing pages on AWS. Retrieved from SC Media: https://www.scmagazine.com/news/cloud-security/hackers-steal-credentials-by-building-phishing-pages-on-aws
Copyright © 2023 Michael Richard Chinguwo, R. Dhanalakshmi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET49422
Publish Date : 2023-03-07
ISSN : 2321-9653
Publisher Name : IJRASET
DOI Link : Click Here