Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: B. Deekshitha, Ch. Aswitha, Ch. Shyam Sundar, A. Kavya Deepthi
DOI Link: https://doi.org/10.22214/ijraset.2022.43986
Phishing is one of the most common and most dangerous cybercrime attacks. The aim of these attacks is to steal the information that individuals and organizations use to conduct transactions. Phishing websites carry various hints in their content and in web browser-based information. The existing system uses the Random Forest algorithm. In our proposed system, we use additional ensemble classification algorithms, namely the boosting algorithms Gradient Boosting and Cat Boost, to increase accuracy. The features are extracted based on the website features in the UC Irvine Machine Learning Repository. We carry out a performance analysis comparing the boosting algorithms Gradient Boost and Cat Boost with Random Forest, from which the most suitable algorithm for detecting phishing websites can be determined. This study is intended as an applicable design for automated systems that need high-performing classification of phishing activity on websites.
I. INTRODUCTION
3. Regression: Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and forecasting.
4. Classification: A classification problem is one in which the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”. A classification model attempts to draw a conclusion from observed values: given one or more inputs, it tries to predict the value of one or more outcomes. In short, classification builds a model from the training set and its class labels and then uses that model to assign categorical class labels to new data. Classification models include logistic regression, decision tree, random forest, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
5. Unsupervised machine learning algorithms are used when the training information is neither classified nor labeled. Unsupervised learning studies how systems can infer a function that describes hidden structure in unlabeled data. The system does not determine a correct output; instead it explores the data and draws inferences from it, acting without guidance. Unsupervised learning algorithms fall into two categories:
6. Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
7. Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
A. Applications of Machine Learning
B. Challenges to Machine Learning
C. Applications of Machine Learning
D. Project Deliverables
E. Project Scope
II. BACKGROUND AND RELATED WORK
A. Altyeb Taha
“Intelligent Ensemble Learning Approach for Phishing Website Detection Based on Weighted Soft Voting.” Ensemble learning combines the predictions of several separate classifiers to obtain higher performance than a single classifier. This paper proposes an intelligent ensemble learning approach based on weighted soft voting to enhance the detection of phishing websites.
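The general idea of weighted soft voting can be sketched as follows. This is a minimal illustration, not the weighting scheme of the cited paper: the function name `weighted_soft_vote`, the example probabilities, and the weights are all hypothetical. Each classifier reports a probability per class, the probabilities are averaged using per-classifier weights, and the class with the highest averaged probability is predicted.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-classifier class probabilities with a weighted average.

    probabilities: one list per classifier, each holding P(class) for every class.
    weights: one non-negative weight per classifier (hypothetical values here).
    Returns (predicted_class_index, averaged_probabilities).
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probabilities[0])
    # Weighted average of each class probability across all classifiers.
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Three hypothetical classifiers scoring one URL; class 0 = legitimate, 1 = phishing.
probs = [[0.60, 0.40], [0.30, 0.70], [0.45, 0.55]]
label, avg = weighted_soft_vote(probs, weights=[1.0, 2.0, 1.0])
print(label, avg)
```

Giving the second classifier twice the weight lets its stronger phishing score dominate the first classifier's weaker legitimate score.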
B. Mohammad, R.M., Thabtah, F. and McCluskey
“Predicting Phishing Websites Based on Self-Structuring Neural Network.” Artificial Neural Networks (ANN) are computational models inspired by the structure of the brain that aim to simulate human behaviour, such as learning, association, generalization and abstraction, when subjected to training. In this paper, an ANN of the Multilayer Perceptron (MLP) type was applied to classify websites with phishing characteristics. The results obtained encourage the application of an ANN-MLP to this classification task.
C. Doyen Sahoo, Chenghao Liu, Steven C.H. Hoi
Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. Further, this article provides a timely and comprehensive survey for a range of different audiences, not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications.
D. Alisha Maini; Navan Kakwani; Ranjitha B; Shreya M K; Bharathi R
Technology is evolving at an exponential rate, and so are human minds. Phishing attacks are one form of cybercrime. Traditional anti-phishing techniques, which iterate over blacklists to check whether a URL is legitimate or phishing, are not very useful because phishers can attack using new URLs. Machine learning algorithms can instead be used to train models that learn the semantic differences between legitimate and phishing URLs. To classify legitimate and phishing URLs, eight ML algorithms are trained and tested: Random Forest, Decision Tree, Naive Bayes, AdaBoost, KNN, XGBoost, Support Vector Machines (SVM) and Logistic Regression. To improve the quality of the classification model, an ensemble model is built from these algorithms. In the observed results, XGBoost achieved the highest accuracy among the individual algorithms, and the ensemble model achieved an accuracy higher than all of the individual models.
III. METHODS AND FUNCTIONING
A. Machine Learning Algorithm
Three machine learning classification models, the Gradient Boost classifier, the Cat Boost classifier, and Random Forest, have been selected to detect phishing websites.
B. Random Forest
Random Forest is a supervised learning algorithm used for both classification and regression problems. It builds decision trees on different samples of the data and takes the majority vote of the trees for classification and the average of their outputs for regression.
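The sampling and aggregation steps described above can be sketched as follows. This is a minimal illustration, not a full random forest: tree construction is omitted, and the helper names and the example votes are hypothetical. Each tree would be trained on its own bootstrap sample, and the forest combines the per-tree outputs by majority vote (classification) or by averaging (regression).

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, giving each tree its own sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Classification: return the label predicted by the most trees."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Regression: return the mean of the per-tree predictions."""
    return sum(values) / len(values)


rng = random.Random(0)
rows = list(range(10))                # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)  # same size as rows, duplicates allowed

# Hypothetical outputs of five trees for one website.
tree_votes = ["phishing", "legitimate", "phishing", "phishing", "legitimate"]
forest_label = majority_vote(tree_votes)  # three of the five trees say "phishing"
forest_value = average([0.9, 0.7, 0.8])   # mean of three regression outputs
```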
C. Gradient Boosting
Gradient Boosting is a boosting technique. The main idea of boosting is to combine many weak learners into a single strong model.
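How weak learners add up to a strong model can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes squared-error loss (so each round fits the current residuals), uses single-feature decision stumps as the weak learners, and all function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all xs identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boosting for squared loss: each new stump fits the current residuals."""
    base = sum(ys) / len(ys)  # initial model: the mean of the targets
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Toy data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
print(model(2), model(5))
```

A single stump scaled by the learning rate only moves the prediction part of the way toward the target; the sum of many such stumps closes the remaining gap round by round.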
D. Cat Boost or Categorical Boosting
It is an open-source boosting library developed by Yandex. In addition to regression and classification, Cat Boost can be used in ranking, recommendation systems, forecasting and even personal assistants.
IV. IMPLEMENTATION AND RESULTS
The scikit-learn library was used to import the machine learning algorithms. The dataset is divided into a training set and a testing set in an 80:20 ratio. Each classifier is trained on the training set, and the testing set is used to evaluate its performance. Classifier performance is evaluated by calculating the accuracy score, false negative rate, and false positive rate.
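The split and the evaluation metrics can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the label 1 is assumed to mean "phishing", and the example labels are made up. Precision, recall, and F1 are included because they are derived from the same confusion-matrix counts.

```python
import random


def train_test_split_80_20(rows, seed=0):
    """Shuffle a copy of rows and split it 80:20 into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, FPR, FNR, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a count is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


train, test = train_test_split_80_20(list(range(100)))
scores = evaluate(y_true=[1, 1, 1, 0, 0, 0, 1, 0], y_pred=[1, 1, 0, 0, 0, 1, 1, 0])
print(len(train), len(test), scores)
```

For phishing detection the false negative rate is the share of phishing sites that slip through as legitimate, and the false positive rate is the share of legitimate sites wrongly flagged.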
| ML Model | Accuracy | F1_score | Recall | Precision |
|---|---|---|---|---|
| Gradient boost classifier | 0.974 | 0.977 | 0.994 | 0.986 |
| Cat boost classifier | 0.972 | 0.975 | 0.994 | 0.989 |
| Random forest | 0.976 | 0.970 | 0.995 | 0.988 |
The results show that the Gradient Boost classifier achieves a high detection accuracy of 97.4% and the Cat Boost classifier a detection accuracy of 97.2%, with a lower false negative rate than the decision tree and support vector machine algorithms. The results also show that the detection accuracy for phishing websites increases as more of the dataset is used for training. All classifiers perform well when 90% of the data is used as the training dataset.
V. SHOWING HOW MUCH PERCENT A WEBSITE IS SAFE TO USE
This screen presents the results derived from the experimental evaluation. They are obtained with the algorithms used in the proposed system to achieve the highest accuracy.
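One way such a percentage could be produced is sketched below. The text does not state how the figure is computed, so this is purely an assumption: the classifier is taken to output a phishing probability between 0 and 1, and the safety percentage is taken to be its complement. The function name `safety_percentage` is hypothetical.

```python
def safety_percentage(phishing_probability):
    """Convert a phishing probability in [0, 1] into a 'safe to use' percentage.

    Assumption: safety is the complement of the predicted phishing probability.
    """
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return round(100.0 * (1.0 - phishing_probability), 1)


# Hypothetical classifier output for one website.
print(f"{safety_percentage(0.12)}% safe to use")
```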
Phishing websites are increasing rapidly and causing growing damage to users and organizations. They are becoming one of the biggest threats to people’s daily lives and to the networking environment. In these attacks, the intruder poses as a trusted organization with the intention of stealing sensitive and essential information. A phishing website is a mock website that looks similar to a legitimate one in appearance but leads to a different destination. Unsuspecting users submit their data believing that these websites belong to trusted financial institutions. Hence, there is a need for an efficient mechanism for detecting phishing websites.
In our project, we developed a model that classifies websites as either phishing or legitimate using features extracted from the URL. These features are compared with the features present in the feature extraction dataset and validated accordingly. We applied the Gradient Boost, Cat Boost, and Random Forest algorithms to the model. During testing, the system performed well and as expected. This paper aims to enhance the detection of phishing websites using machine learning. We achieved 97.4% detection accuracy using the Gradient Boost classifier and 97.2% using the Cat Boost classifier, with the lowest false positive rate. The classifiers perform better when more data is used for training. In future work, a hybrid approach that combines the Random Forest algorithm with the blacklist method will be implemented to detect phishing websites more accurately.
[1] Altyeb Taha, “Intelligent Ensemble Learning Approach for Phishing Website Detection Based on Weighted Soft Voting,” November 2021.
[2] Mohammad, R.M., Thabtah, F. & McCluskey, L., “Predicting phishing websites based on self-structuring neural network,” Neural Comput & Applic 25, 443–458 (2014).
[3] Doyen Sahoo, Chenghao Liu, Steven C.H. Hoi, “Malicious URL Detection using Machine Learning: A Survey,” submitted 25 Jan 2017 (v1), last revised 21 Aug 2019 (v3).
[4] A. Maini, N. Kakwani, R. B, S. M K and B. R, “Improving the Performance of Semantic-Based Phishing Detection System Through Ensemble Learning Method,” 2021 IEEE Mysore Sub Section International Conference (MysuruCon), 2021, pp. 463-469.
[5] Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin, “CatBoost: gradient boosting with categorical features support,” v1, 24 Oct 2018.
[6] Bentéjac, C., Csörgő, A. & Martínez-Muñoz, G., “A comparative analysis of gradient boosting algorithms,” Artif Intell Rev 54, 1937–1967 (2021).
[7] Singh and Meenu, “Phishing Website Detection Based on Machine Learning: A Survey,” 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 398-404.
Copyright © 2022 B. Deekshitha, Ch. Aswitha, Ch. Shyam Sundar, A. Kavya Deepthi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET43986
Publish Date : 2022-06-08
ISSN : 2321-9653
Publisher Name : IJRASET