Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Ambalika Dey, Sumit Bhowmik
DOI Link: https://doi.org/10.22214/ijraset.2024.64096
Conventional intrusion detection systems (IDS) are frequently inadequate against the complex cyber threats of today's changing cybersecurity landscape. To create a reliable intrusion detection system, this work investigates the use of machine learning, specifically the RandomForestClassifier algorithm. The study highlights that careful data preprocessing, including feature selection, label encoding, and scaling, is essential to maximize the value of model training, and it uses Optuna for hyper-parameter tuning to improve the classifier's performance. The results show high detection accuracy and resilience to complex intrusions, confirming machine learning's potential as a scalable and reliable answer to contemporary cybersecurity problems.
I. INTRODUCTION
The security and integrity of computer networks are constantly at risk in the current digital era due to increasingly sophisticated cyberattacks. These attacks threaten vital infrastructures and confidential data, rendering conventional intrusion detection techniques inadequate. Advanced, effective solutions that can adapt to and counter growing cyber threats are urgently needed. Machine learning (ML) has become a powerful tool in cybersecurity because of its ability to analyze large amounts of network traffic data and identify patterns and anomalies that could be signs of an intrusion.
The capacity of traditional IDS to identify new or developing threats is hampered by their heavy reliance on predefined rules and signatures. In contrast, machine learning (ML)-based methods use statistical models and algorithms to learn from past data, which increases their ability to identify attack patterns that were previously undetected. By automating the examination of large and intricate datasets, machine learning can detect minute signs of compromise that human analysts overlook. Furthermore, the scalability of machine learning models makes real-time network traffic processing possible, which is essential for today's high-speed networks. Given the ongoing evolution of complex and frequent cyber threats, integrating ML with intrusion detection systems is a noteworthy advancement in improving network security and safeguarding confidential data. This paper investigates using the robust machine-learning algorithm RandomForestClassifier to create a scalable and dependable intrusion detection system (IDS) that can handle the difficulties presented by contemporary cyber threats.
This work aims to create an advanced threat recognition system for intrusion detection by utilizing machine learning. The main goal is to produce a highly accurate IDS using the RandomForestClassifier method. To guarantee that the dataset is best prepared for model training, the methodology includes extensive data pre-processing, such as feature selection, label encoding, and scaling. After the initial model training, hyper-parameter tuning is performed with Optuna, a state-of-the-art optimization framework, to improve the classifier's performance. Data pre-processing covers cleaning the data, dealing with missing values, and converting raw data into a format appropriate for modeling. Feature selection identifies the most pertinent features, those with a substantial correlation with the target variable, by reducing the number of dimensions in the dataset; this phase increases the accuracy and efficiency of the model. Label encoding transforms categorical data into numerical values so that machine learning algorithms can process the data efficiently. Scaling standardises the data to guarantee that each feature contributes equally to the model's learning, a step that is especially crucial for distance-based algorithms. Hyper-parameter tuning with Optuna, an advanced optimization framework, automates the search for the ideal hyper-parameters. Implementing this comprehensive pre-processing stage, feature selection, and Optuna-driven hyperparameter tuning led to a notable enhancement in the model's performance, attaining a detection accuracy of over 99% in many classes. This indicates how well the suggested methodology improves the performance of intrusion detection systems in practical settings. The research thus aims to offer a scalable and dependable solution for practical cybersecurity applications, showcasing machine learning's potential to transform intrusion detection and provide a strong defence against constantly changing cyber threats.
II. BACKGROUND AND RELATED WORK
Intrusion detection systems are an important topic of study in network security and attract much attention from researchers looking to improve and optimize the technology. To be effective, a detection system needs to be stable and efficient. Many researchers apply machine-learning techniques to the KDD CUP 99 dataset. This data set was used for the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. The database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, and was made available to the public to enhance the identification of hostile network activity and thereby improve the capabilities of intrusion detection systems.
In paper [1], the author presented a survey of the various data mining techniques proposed for enhancing anomaly intrusion detection systems and applied classification methods to classify the attacks (intrusions) in the DARPA dataset. The results show that Random Forest performs better than the other classifiers, although it takes longer to run than they do. In paper [2], Artificial Intelligence methods gain the most attention for their ability to learn and evolve, making them more precise and efficient against the huge number of unpredictable attacks. A methodology based on a Genetic Algorithm is therefore proposed for detecting Probing, Denial of Service (DoS), and Remote to Local (R2L) attacks, aiming for maximum detection of these attacks with a minimum false positive rate. Out of the total intrusions in the testing dataset, the approach is expected to detect more than 97% of the intrusions, making it useful against today's changing attack methodologies; if the rules are updated dynamically with the firewall's log data, the method remains effective against new attacks. The authors of paper [3] describe three data-mining-based frameworks for network intrusion detection, covering misuse, anomaly, and hybrid detection, all built on the random forests algorithm. To address the issues with rule-based systems, the random forests algorithm constructs intrusion patterns automatically by learning over training data rather than requiring manual rule coding. The suggested approaches are implemented in a Java program using the WEKA environment and Fortran 77 software, and the experimental results demonstrate that they outperform the best KDD'99 results; the implementations are assessed over several datasets derived from the KDD'99 datasets. By utilizing the intrusion patterns built during the offline phase of the misuse detection framework, the system can automatically detect intrusions in real time. The authors optimize the random forests algorithm's settings and apply a feature selection algorithm to increase the system's accuracy, and they additionally employ sampling techniques to improve the likelihood of detecting minority intrusions. Because misuse detection cannot identify novel intrusions, the authors also propose a new method for unsupervised anomaly detection.
In paper [4], the authors presented a Random Forest model for intrusion detection systems focused on improving detection performance by reducing the input features. In real-world applications, smaller numbers of features are always advantageous in terms of both data management and processing time. The results indicate that RF classification with the reduced feature set (25 features) produces more accurate results than RF classification with all 41 features, and the time required to process 25 features with RF is smaller than the processing time with 41 features. Research on intrusion detection and feature selection using the RF approach is still ongoing because of its good performance, and these findings are useful for applying RF to maximize the detection rate and minimize the false positive rate. In paper [5], the author explains the Random Forest algorithm and notes that the number of trees in the forest and the quality of the results are directly related: the more trees, the more accurate the result. It is important to note that decision-making using an information gain or Gini approach is not the same as creating a random forest. The paper presents an overview of the random forest algorithm and summarizes a survey of various techniques proposed by several researchers. In paper [6], the author uses the Random Forest (RF) algorithm to detect four types of attack: DoS (Denial of Service), Probe, U2R (User to Root), and R2L (Remote to Local). Ten-fold cross-validation is applied for classification, and feature selection is applied to the data set to reduce dimensionality and remove redundant and irrelevant features. The author applies the symmetrical uncertainty of attributes, which overcomes the problems of information gain. The proposed approach is evaluated on the NSL-KDD data set, and random forest modeling is compared with the J48 classifier in terms of accuracy, detection rate (DR), false alarm rate (FAR), and the Matthews Correlation Coefficient (MCC). The experimental results show that accuracy, DR, and MCC are increased by the proposed method for all four attack types, and the authors plan to apply evolutionary computation as a feature selection measure to further improve the classifier's accuracy.
III. PROPOSED METHODOLOGY
This research aims to create and assess a sophisticated threat recognition system that uses machine-learning methods to detect intrusions. By developing and optimizing machine learning models—Random Forest classifiers in particular—to detect a variety of network intrusions, including both known and unknown threats, the research seeks to increase intrusion detection accuracy.
An ensemble learning technique called Random Forest is mainly applied to regression and classification problems. During training, it constructs several decision trees and combines them to generate a more reliable and accurate forecast. The fundamental principle is that a collection of decision trees, or weak learners, can combine to become a strong learner.
A. Random Forest Process
B. Random Forest's advantages over other techniques
C. Equations
1) Classification Prediction Function
The majority vote among all $n$ trees determines the final prediction $\hat{y}$ for an input $x$ in classification:

$$\hat{y} = \operatorname{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\}$$

where $h_i(x)$ is the prediction of the $i$-th tree.
2) Prediction Function (Regression)
The average of all $n$ trees' predictions is the final prediction $\hat{y}$ in regression:

$$\hat{y} = \frac{1}{n}\sum_{i=1}^{n} h_i(x)$$
3) Gini Impurity (for node splitting):
For each node, the Gini impurity $G$ is computed as

$$G = 1 - \sum_{i=1}^{c} p_i^2$$

where $c$ is the total number of classes and $p_i$ is the probability of selecting an element from class $i$.
4) Information Gain (used in some tree splits)
$$IG(T, a) = H(T) - \sum_{v \in \operatorname{values}(a)} \frac{|T_v|}{|T|}\, H(T_v)$$

where $H(T)$ is the entropy of the set $T$ and $T_v$ is the subset for which attribute $a$ has value $v$.
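For a concrete illustration of these formulas, the short Python sketch below (our own illustrative code, not the paper's implementation) computes the Gini impurity of a node and the majority-vote aggregation used for classification:

```python
import numpy as np
from collections import Counter

def gini_impurity(labels):
    """G = 1 - sum(p_i^2) over the classes present at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def majority_vote(tree_predictions):
    """Final class = most frequent prediction among the n trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Example: a node containing 6 'normal' and 2 'attack' samples
print(gini_impurity(["normal"] * 6 + ["attack"] * 2))   # 0.375
print(majority_vote(["attack", "normal", "attack"]))     # 'attack'
```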
To determine which network attributes are most indicative of malicious activity, the work concentrates on identifying and retaining the most important features using methods such as feature importance analysis and recursive feature elimination (RFE).
The dataset is prepared for reliable model training through efficient data pre-processing techniques, such as handling missing values, encoding categorical variables, and scaling features.
The dataset's structure, including the column names, non-null count, and data types, is outlined in this table.
Table I:
Intrusion Detection Dataset Schema
| Sl No. | Column | Non-Null Count | Data Type |
|---|---|---|---|
| 0 | duration | 145585 | int64 |
| 1 | protocol_type | 145585 | object |
| 2 | service | 145585 | object |
| 3 | flag | 145585 | object |
| 4 | src_bytes | 145585 | int64 |
| 5 | dst_bytes | 145585 | int64 |
| 6 | land | 145585 | int64 |
| 7 | wrong_fragment | 145585 | int64 |
| 8 | urgent | 145585 | int64 |
| 9 | hot | 145585 | int64 |
| 10 | num_failed_logins | 145585 | int64 |
| 11 | logged_in | 145585 | int64 |
| 12 | lnum_compromised | 145585 | int64 |
| 13 | lroot_shell | 145585 | int64 |
| 14 | lsu_attempted | 145585 | int64 |
| 15 | lnum_root | 145585 | int64 |
| 16 | lnum_file_creations | 145585 | int64 |
| 17 | lnum_shells | 145585 | int64 |
| 18 | lnum_access_files | 145585 | int64 |
| 19 | lnum_outbound_cmds | 145585 | int64 |
| 20 | is_host_login | 145585 | int64 |
| 21 | is_guest_login | 145585 | int64 |
| 22 | count | 145585 | int64 |
| 23 | srv_count | 145585 | int64 |
| 24 | serror_rate | 145585 | float64 |
| 25 | srv_serror_rate | 145585 | float64 |
| 26 | rerror_rate | 145585 | float64 |
| 27 | srv_rerror_rate | 145585 | float64 |
| 28 | same_srv_rate | 145585 | float64 |
| 29 | diff_srv_rate | 145585 | float64 |
| 30 | srv_diff_host_rate | 145585 | float64 |
| 31 | dst_host_count | 145585 | int64 |
| 32 | dst_host_srv_count | 145585 | int64 |
| 33 | dst_host_same_srv_rate | 145585 | float64 |
| 34 | dst_host_diff_srv_rate | 145585 | float64 |
| 35 | dst_host_same_src_port_rate | 145585 | float64 |
| 36 | dst_host_srv_diff_host_rate | 145585 | float64 |
| 37 | dst_host_serror_rate | 145585 | float64 |
| 38 | dst_host_srv_serror_rate | 145585 | float64 |
| 39 | dst_host_rerror_rate | 145585 | float64 |
| 40 | dst_host_srv_rerror_rate | 145585 | float64 |
| 41 | label | 145585 | object |
The dataset contains 145,585 rows, each of which corresponds to a single observation, sample, or record, and 42 columns, each of which represents a feature or variable, including the target variable (label).
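As a hedged sketch of this loading step (the file name kddcup_data.csv is our assumption; the paper does not specify the exact source file of the 145,585-row sample), pandas can be used to confirm the schema in Table I:

```python
import pandas as pd

# Hypothetical file name; any CSV holding the 41 KDD CUP 99 features plus
# the 'label' column from Table I will do.
df = pd.read_csv("kddcup_data.csv")

print(df.shape)                     # expected: (145585, 42)
df.info()                           # column names, non-null counts, dtypes
print(df["label"].value_counts())   # distribution of the 23 attack labels
```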
This work uses thorough data pre-processing approaches, such as addressing missing values, encoding categorical variables, and scaling features, to guarantee the dataset is ready for dependable model training. Resolving class imbalance is a crucial part of the pre-processing, since it can greatly affect the model's capacity to identify less common forms of intrusion. To counteract this, synthetic instances are created for minority classes with the Synthetic Minority Over-Sampling Technique (SMOTE), ensuring fair representation. SMOTE helps the Random Forest Classifier learn from imbalanced data more successfully, improving its ability to detect rare attack classes. Furthermore, Optuna, a sophisticated framework that automates the search for ideal model parameters, is used for hyperparameter optimisation. This procedure is essential for optimising the model's performance and guaranteeing that it can precisely identify different kinds of intrusions. A variety of metrics, such as accuracy, the confusion matrix, the classification report, and the ROC-AUC score, are used to assess the model's performance. Additionally, the model's performance is examined on a variety of datasets to guarantee its resilience and generalisability.
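A minimal sketch of this balancing step with imbalanced-learn's SMOTE is shown below; X_train and Y_train are placeholders for the already encoded and scaled training split, and the k_neighbors value is our assumption, chosen because some attack classes are extremely rare:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# X_train / Y_train are assumed to be the encoded, scaled training split.
# k_neighbors must be smaller than the size of the rarest class, so the
# default of 5 may need to be lowered (or the rarest classes excluded).
smote = SMOTE(random_state=42, k_neighbors=3)
X_balanced, Y_balanced = smote.fit_resample(X_train, Y_train)

print(pd.Series(Y_train).value_counts())     # imbalanced counts before
print(pd.Series(Y_balanced).value_counts())  # balanced counts after
```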
To make sure the results are not overfit to a single dataset, the model is tested on several test datasets. This stage is essential for establishing the model's dependability in practical applications by demonstrating how well it generalises to fresh, untested data. Feature importance visualization is essential for understanding the respective roles of the features in identifying various forms of network intrusions. By examining these visualisations, we can determine which elements are most helpful in spotting possible dangers. This insight enables us to concentrate on the most significant features, improving the model's detection accuracy. The best model, determined through extensive testing and feature analysis, will be deployed in an actual intrusion detection system (IDS). The aim is to evaluate how well it performs in real-world settings, where it can assess network data instantly and spot possible threats as they materialise. This real-world application will illustrate the model's usefulness and show how to incorporate it into current cybersecurity frameworks to offer a strong defence against both current and new threats.
IV. DATA PROCESSING
In this stage, categorical features are transformed using the encoding approach to create a matrix that includes binary values and categorical features. Categories are then assigned within the service feature, and a LabelEncoder is used to convert the categorical features, turning each category into a number. Next, we divided the dataset label into 23 different attack types and assigned each type a numeric code, as shown in the following table:
TABLE II
NETWORK EVENT LABEL MAPPING
| Label | Network Event |
|---|---|
| 0 | Normal |
| 1 | Buffer Overflow |
| 2 | Load Module |
| 3 | Perl |
| 4 | Neptune |
| 5 | Smurf |
| 6 | Guess Password |
| 7 | Pod |
| 8 | Teardrop |
| 9 | Port Sweep |
| 10 | Ipsweep |
| 11 | Land |
| 12 | FTP Write |
| 13 | Back |
| 14 | IMAP |
| 15 | Satan |
| 16 | PHF |
| 17 | Nmap |
| 18 | Multihop |
| 19 | Warezmaster |
| 20 | Warezclient |
| 21 | Spy |
| 22 | Rootkit |
Following this, the data is divided into two groups: X, a data frame containing the features, and Y, the output variable. The features in X are then scaled.
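The sketch below illustrates one way to carry out this encoding and splitting step with scikit-learn's LabelEncoder; it assumes the DataFrame df from the loading step and that protocol_type, service, and flag are the categorical feature columns (which matches Table I):

```python
from sklearn.preprocessing import LabelEncoder

# Convert each categorical feature into integer codes.
for col in ["protocol_type", "service", "flag"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Encode the 23 attack labels as integers (the paper's mapping is listed
# in Table II; LabelEncoder assigns its own codes automatically).
df["label"] = LabelEncoder().fit_transform(df["label"])

# X holds the feature data frame, Y the output variable.
X = df.drop(columns=["label"])
Y = df["label"]
```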
V. FEATURE SELECTION
A critical step in the pre-processing stage of the data is feature selection. In order to improve the model's performance and interpretability, it entails locating and keeping only the most pertinent elements that substantially aid in the prediction of the target variable.
The following phases provide a comprehensive definition of the Recursive Feature Elimination (RFE) procedure that is applied to the dataset:
A. Step 1: First, separate the features and labels.
Take the labels and features out of the dataset. The dataset is used to extract the feature set (X_train) and target labels (Y_train). The column marked "label" contains the target variable, which frequently denotes the result or class.
B. Step 2: Initialize the Estimator
Select the feature-ranking model that RFE uses. A classifier such as RandomForestClassifier is usually chosen as the base estimator. This model will be applied to assess feature importance.
C. Step 3: Apply Recursive Feature Elimination (RFE)
Iteratively identify and remove the less significant features.
D. Step 4: Complete the Feature Set
Choose the best subset of attributes. The most crucial features are kept when RFE completes its cycles. It is anticipated that these characteristics will have the greatest predictive power and the least amount of redundancy, which will improve model performance.
E. Step 5: Selective Features for Model Training
Utilising the improved feature set, train the model. To make sure the classifier is performance-optimized with the most pertinent data, the model is retrained using only the chosen features.
The RandomForestClassifier is employed alongside Recursive Feature Elimination (RFE) to identify the top ten features from the training set. This approach involves building the model multiple times and recursively removing the least significant features until the optimal set is achieved. Once the top features are selected, StandardScaler is applied to standardize them, ensuring they have a mean of zero and a variance of one; this step is crucial for improving the model's performance and consistency. To evaluate the effectiveness of the model, the dataset is split, with 70% allocated for training and 30% for validation (the 43,676-sample validation set used in the results below). This division helps in assessing how well the model generalizes to new, unseen data.
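A sketch of this selection, scaling, and splitting pipeline is given below; the random_state values are our own assumptions, and X and Y come from the encoding step above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Rank features with RFE wrapped around a RandomForestClassifier
# and keep the ten most important ones.
rfe = RFE(estimator=RandomForestClassifier(random_state=42),
          n_features_to_select=10)
rfe.fit(X, Y)
selected = X.columns[rfe.support_]
print("Selected features:", list(selected))

# Standardize the selected features to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X[selected])

# Hold out 30% of the data as the validation set (43,676 samples).
X_train, X_val, Y_train, Y_val = train_test_split(
    X_scaled, Y, test_size=0.30, random_state=42)
```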
VI. BUILDING MODEL
A high-performing model selection is necessary to obtain a highly accurate classifier that processes real-time data. Using the scikit-learn module in Python, we constructed the intrusion detection model. After isolating the features (X_train) and labels (Y_train), we used a RandomForestClassifier in conjunction with Recursive Feature Elimination (RFE) to determine the top 10 features. In order to ensure zero mean and unit variance, we used StandardScaler to standardise these features.
These high accuracy figures show that the RandomForestClassifier performs extremely well on both the training and validation sets: the training accuracy is 0.9996 (around 99.96%) and the validation accuracy is 0.9983 (almost 99.83%). The model has learnt to predict the target variable with great precision. High accuracy, however, can also indicate overfitting, a condition in which the model performs remarkably well on training data but poorly on fresh, untested data. To make sure the model is resilient, it is crucial to further validate it using methods such as cross-validation, examining confusion matrices, and considering additional metrics (precision, recall, F1-score).
Further, we implemented Optuna for hyperparameter tuning to enhance the RandomForestClassifier's performance. The objective function defines the hyperparameters (n_estimators, max_depth, min_samples_split, and min_samples_leaf) and their corresponding search ranges; using these hyperparameters, it trains a RandomForestClassifier and returns the accuracy on the validation set. The study's goal is to maximise this objective. The study ran one hundred trials, experimenting with various hyperparameter values to determine the optimal combination. The best trial discovered during the optimisation procedure was {'n_estimators': 138, 'max_depth': 32, 'min_samples_split': 9, 'min_samples_leaf': 1}. With these hyperparameters, the final classifier makes predictions on the validation set and achieves a maximum validation accuracy of 0.9982828097811155, roughly 99.83%. This indicates that the RandomForestClassifier keeps a high degree of accuracy after tuning, proving that the model is well optimized and performs admirably on the validation data. Given that the optimised model still performs at a comparable level, this further implies that the initial high accuracy was not exclusively the result of overfitting.
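The following sketch shows how such an Optuna study can be set up; the exact search ranges are our assumptions, and X_train, Y_train, X_val, and Y_val come from the split above:

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def objective(trial):
    # Hyperparameters tuned in this work, with assumed search ranges.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 5, 50),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
    }
    clf = RandomForestClassifier(random_state=42, n_jobs=-1, **params)
    clf.fit(X_train, Y_train)
    return accuracy_score(Y_val, clf.predict(X_val))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("Best trial:", study.best_params)   # e.g. {'n_estimators': 138, ...}
print("Best accuracy:", study.best_value)
```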
VII. FEATURE IMPORTANCE
The feature importance values indicate the contribution of each feature to the model's predictions: the greater the value, the more that feature influences the predictions.
Table III
Feature Importance
| Index | Feature | Importance | Percentage Contribution |
|---|---|---|---|
| 4 | same_srv_rate | 0.289317 | 28.93% |
| 5 | diff_srv_rate | 0.202333 | 20.23% |
| 3 | count | 0.157761 | 15.77% |
| 2 | src_bytes | 0.139157 | 13.92% |
| 1 | flag | 0.084418 | 8.44% |
| 7 | dst_host_same_srv_rate | 0.053143 | 5.31% |
| 0 | service | 0.023987 | 2.40% |
| 6 | dst_host_srv_count | 0.020161 | 2.01% |
| 8 | dst_host_diff_srv_rate | 0.018173 | 1.81% |
| 9 | dst_host_serror_rate | 0.011550 | 1.15% |
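Table III can be reproduced from a fitted classifier as sketched below; best_model and selected are placeholders for the tuned classifier and the RFE-selected feature names assumed from the earlier steps:

```python
import pandas as pd

importance_table = pd.DataFrame({
    "feature": selected,
    "importance": best_model.feature_importances_,
})
# feature_importances_ already sum to 1; normalising keeps this explicit.
importance_table["percentage"] = (
    100 * importance_table["importance"] / importance_table["importance"].sum()
)
print(importance_table.sort_values("importance", ascending=False))
```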
VIII. EXPERIMENTAL RESULT
The experiment was conducted using the KDD CUP 99 data set, which has 42 attributes in total, with the class label as the final attribute. We experimented with different numbers of Random Forest trees. The classification report and confusion matrix offer comprehensive insights into how well the RandomForestClassifier performs across the various classes.
The confusion matrix is used to assess the performance of this classification model; its key observations are summarised below.
A. Description of the Confusion Matrix Analysis
The dataset's true labels, or actual classes, are represented as rows.
The anticipated classes (the model's output) are shown in columns.
From top-left to bottom-right, the diagonal elements display the number of correctly identified instances for each class.
Class 0 has 291 samples in total; all 291 are correctly predicted as Class 0, so the model displays perfect accuracy for this class.
Class 9 has 15,527 samples in total; all 15,527 are correctly predicted as Class 9, again with perfect accuracy.
Class 11 has 26,367 samples in total; with only a few misclassifications, nearly all 26,367 cases are accurately predicted as Class 11, so the model makes very few errors for this class.
The model performs exceptionally well with larger classes (such as Class 9 and Class 11), where nearly all cases are correctly identified, as the confusion matrix demonstrates.
The model's accuracy declines for smaller classes (such as Class 1), which can result in occasional misclassifications.
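These per-class observations can be derived directly from scikit-learn's confusion matrix, as in the hedged sketch below; Y_val, X_val, and best_model are assumed from the earlier steps, and every class is assumed to appear at least once in the validation split:

```python
from sklearn.metrics import confusion_matrix

Y_pred = best_model.predict(X_val)
cm = confusion_matrix(Y_val, Y_pred)

# Rows are true classes, columns are predicted classes; the diagonal
# counts the correctly classified samples of each class.
per_class_acc = cm.diagonal() / cm.sum(axis=1)
for cls, acc in enumerate(per_class_acc):
    print(f"class {cls}: {cm[cls].sum()} samples, per-class accuracy {acc:.3f}")
```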
B. Explanation of Classification Report
The classification report offers the following key metrics.
1) Precision: The ratio of accurately predicted positive observations to the total number of predicted positives is known as precision. It measures how well the model predicts the positive results.
Formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Where:
TP: True Positives (Correctly predict the positive observations)
FP: False Positives (negative observations incorrectly predicted as positive)
High precision indicates that the model does not make many false positive predictions
2) Recall (Sensitivity): The ratio of accurately predicted positive observations to all observations in the actual class is known as recall. It gauges how well the model is able to recognise every positive example.
Formula:

$$\text{Recall} = \frac{TP}{TP + FN}$$
Where:
FN: False Negatives (actual positive observations that were incorrectly classified as negative)
When a model has a high recall, it means that it can catch the majority of positive cases and hardly ever misses a positive class.
3) F1-Score: The harmonic mean of Precision and Recall is known as the F1-Score. It offers a fair assessment that takes into consideration both false positives and false negatives.
Formula:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1-Score comes in handy when you need to strike a balance between precision and recall or when the class distribution is unbalanced.
4) Support
Support refers to the number of actual instances of each class, i.e., the total number of samples belonging to that class. Knowing the number of observations for each class is important when assessing model performance measures; for example, a class with very low support could have a misleading F1-Score because there are relatively few instances.
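The classification report in Table IV and the ROC-AUC score mentioned earlier can be produced as follows (again a sketch over the assumed Y_val, Y_pred, X_val, and best_model; the ROC-AUC line additionally assumes every class occurs in the validation set):

```python
from sklearn.metrics import classification_report, roc_auc_score

# Per-class precision, recall, F1-score and support (Table IV).
print(classification_report(Y_val, Y_pred, digits=2))

# Multi-class ROC-AUC, one-vs-rest, macro-averaged over the classes.
probas = best_model.predict_proba(X_val)
print("ROC-AUC:", roc_auc_score(Y_val, probas, multi_class="ovr"))
```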
TABLE IV
CLASSIFICATION REPORT
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 291 |
| 1 | 0.67 | 0.50 | 0.57 | 4 |
| 2 | 0.00 | 0.00 | 0.00 | 3 |
| 3 | 1.00 | 0.90 | 0.95 | 20 |
| 4 | 1.00 | 1.00 | 1.00 | 1 |
| 5 | 0.89 | 0.97 | 0.93 | 210 |
| 6 | 1.00 | 0.80 | 0.89 | 5 |
| 7 | 0.00 | 0.00 | 0.00 | 1 |
| 8 | 0.00 | 0.00 | 0.00 | 1 |
| 9 | 1.00 | 1.00 | 1.00 | 15527 |
| 10 | 0.74 | 0.39 | 0.51 | 51 |
| 11 | 1.00 | 1.00 | 1.00 | 26367 |
| 12 | 0.00 | 0.00 | 0.00 | 1 |
| 13 | 0.98 | 0.98 | 0.98 | 56 |
| 14 | 0.99 | 0.98 | 0.99 | 130 |
| 15 | 0.99 | 0.98 | 0.99 | 130 |
| 16 | 0.00 | 0.00 | 0.00 | 3 |
| 17 | 0.99 | 0.99 | 0.99 | 258 |
| 18 | 1.00 | 0.99 | 0.99 | 199 |
| 20 | 1.00 | 1.00 | 1.00 | 287 |
| 21 | 0.98 | 0.98 | 0.98 | 255 |
| 22 | 0.67 | 1.00 | 0.80 | 6 |
| accuracy | | | 1.00 | 43676 |
| macro average | 0.71 | 0.69 | 0.69 | 43676 |
| weighted average | 1.00 | 1.00 | 1.00 | 43676 |
5) Observation
The model's good accuracy in most classes, particularly classes 0, 9, 11, 17, 18, and 20, demonstrates its resilience in detecting these categories. Nonetheless, some classes (2, 7, 8, 12, and 16) show precision and recall of zero, indicating difficulties in identifying these particular kinds of intrusions. In spite of these problems, the high weighted-average metrics show that the model performs well overall, handling most classes with little error.
IX. CONCLUSION
For this project we successfully constructed and optimised a Random Forest classifier to detect network intrusions with a high degree of accuracy. Achieving a test accuracy of 99.83% required careful feature selection, careful data preparation, and hyperparameter optimisation with Optuna. This high degree of accuracy highlights the model's stability and dependability in identifying network intrusions with a small margin of error.
The project's success depends on identifying the essential elements that have a big impact on the model's decision-making process. By identifying and limiting the model's features to those with the greatest influence, we made sure it was reliable and effective at processing data and generating predictions. In cybersecurity applications, where accuracy and speed are critical, this phase is essential. The Optuna-powered hyperparameter optimisation further improved the model's performance by adjusting its parameters to produce the optimum outcomes. By combining these approaches, we created a model that performs well on a variety of datasets and is both accurate and generalizable. Another crucial component of this work was the use of sophisticated evaluation criteria, including accuracy, precision, recall, and F1-score, to gauge the model's performance. These measurements allowed the model's advantages and disadvantages to be thoroughly understood, facilitating ongoing development and modification. The high scores obtained across several criteria demonstrate the model's capacity to accurately detect intrusions while minimising false positives and false negatives. The application of SMOTE to address class imbalances played a large part in helping the model function well across all classes, including those that were under-represented in the dataset.
In conclusion, this study demonstrates how machine learning may be used to improve network security. Through feature selection, hyperparameter tuning, and sophisticated evaluation procedures, we have optimised the Random Forest classifier and produced a highly accurate and dependable model that can identify a variety of network intrusions. Organisations can apply the project's findings to real-world cybersecurity scenarios to protect their networks from potential threats. Given the ongoing evolution of cyber threats, the significance of resilient and flexible intrusion detection systems cannot be overstated, and this project demonstrates how machine learning can be essential to reaching that objective.
REFERENCES
[1] Phyu Thi Htun, Kyaw Thet Khaing, "Anomaly Intrusion Detection System using Random Forests and k-Nearest Neighbor", International Journal of P2P Network Trends and Technology (IJPTT), Volume 3, Issue 1, January-February 2013.
[2] Swati Paliwal, Ravindra Gupta, "Denial-of-Service, Probing & Remote to User (R2L) Attack Detection using Genetic Algorithm", International Journal of Computer Applications (0975-8887), Volume 60, No. 19, December 2012.
[3] Jiong Zhang, Mohammad Zulkernine, Anwar Haque, "Random-Forests-Based Network Intrusion Detection Systems", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 38, No. 5, September 2008.
[4] Md. Al Mehedi Hasan, Mohammed Nasser, Shamim Ahmad, Khademul Islam Molla, "Feature Selection for Intrusion Detection Using Random Forest", Journal of Information Security, 2016, 7, 129-140.
[5] Kritika Singh, Bharti Nagpal, "Random Forest Algorithm in Intrusion Detection System: A Survey", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), 2018, ISSN: 2456-3307.
[6] Nabila Farnaaz, M. A. Jabbar, "Random Forest Modeling for Network Intrusion Detection System", Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016).
[7] Abdulla Amin Aburomman, Mamun Bin Ibne Reaz, "A survey of intrusion detection systems based on ensemble and hybrid classifiers", Computers & Security, Volume 65, 2017, pp. 135-152, ISSN 0167-4048, https://doi.org/10.1016/j.cose.2016.11.004.
[8] Max Landauer, Sebastian Onder, Florian Skopik, Markus Wurzenberger, "Deep learning for anomaly detection in log data: A survey", Machine Learning with Applications, Volume 12, 2023, 100470, ISSN 2666-8270, https://doi.org/10.1016/j.mlwa.2023.100470.
[9] A. M. S. Ngo Bibinbe, M. F. Mbouopda, G. R. Mbiadou Saleu, E. Mephu Nguifo, "A survey on unsupervised learning algorithms for detecting abnormal points in streaming data", 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 2022, pp. 1-8, doi: 10.1109/IJCNN55064.2022.9892195.
[10] Ranjeethapriya K, Susila N, Granty Regina Elwin, Balakrishnan S, "Raspberry Pi Based Intrusion Detection System", International Journal of Pure and Applied Mathematics, Volume 119, No. 12, 2018, pp. 1197-1205.
[11] S. Balakrishnan, B. Persis Urbana Ivy, S. Sudhakar Ilango, "A Novel and Secured Intrusion Detection System for Wireless Sensor Networks Using Identity Based Online/Offline Signature", ARPN Journal of Engineering and Applied Sciences, Vol. 13, No. 21, November 2018, pp. 8544-8547.
[12] J. P. Ananth, S. Balakrishnan, S. P. Premnath, "Logo Based Pattern Matching Algorithm for Intrusion Detection System in Wireless Sensor Network", International Journal of Pure and Applied Mathematics, Volume 119, No. 12, 2018, pp. 753-762.
[13] V. Chandola, A. Banerjee, V. Kumar, "Anomaly detection: A survey", ACM Computing Surveys, 41(3), 1-58, 2009, http://dx.doi.org/10.1145/1541880.1541882.
[14] S. Ranshous, S. Shen, D. Koutra, S. Harenberg, C. Faloutsos, N. F. Samatova, "Anomaly detection in dynamic networks: a survey", WIREs Computational Statistics, 7: 223-247, 2015, https://doi.org/10.1002/wics.1347.
[15] A. Patcha, J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends", Computer Networks, 51(12): 3448-3470, 2007.
[16] S. Oswal, S. Shinde, M. Vijayalakshmi, "A Survey of Statistical, Machine Learning, and Deep Learning-Based Anomaly Detection Techniques for Time Series", in: Garg, D., Narayana, V.A., Suganthan, P.N., Anguera, J., Koppula, V.K., Gupta, S.K. (eds) Advanced Computing, IACC 2022, Communications in Computer and Information Science, vol 1782, Springer, Cham, 2023, https://doi.org/10.1007/978-3-031-35644-5_17.
[17] N. A. Heard, D. J. Weston, K. Platanioti, D. J. Hand, "Bayesian Anomaly Detection Methods for Social Networks", The Annals of Applied Statistics, 4(2): 645-662, 2010, http://www.jstor.org/stable/29765524.
Copyright © 2024 Ambalika Dey, Sumit Bhowmik. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET64096
Publish Date : 2024-08-28
ISSN : 2321-9653
Publisher Name : IJRASET