Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Arjun R Nair
DOI Link: https://doi.org/10.22214/ijraset.2024.65559
I. INTRODUCTION
In the rapidly evolving landscape of cybersecurity, network intrusion detection systems (IDS) have become essential tools for safeguarding information assets against malicious activities. As cyber threats grow in complexity and frequency, particularly Distributed Denial of Service (DDoS) attacks, there is an increasing need for advanced detection methodologies that can accurately differentiate between benign and malicious network traffic. Machine learning and deep learning techniques have emerged as powerful approaches to improve the efficacy of IDS by learning intricate patterns from network data.
The Canadian Institute for Cybersecurity’s Intrusion Detection System 2017 (CICIDS2017) dataset has been established as a benchmark for evaluating intrusion detection algorithms. This dataset contains a rich mix of benign traffic and the most up-to-date common attacks, which closely resemble real-world network environments through Packet Capture (PCAP) files. It includes the results of network traffic analysis using CICFlowMeter, offering labeled flow data based on timestamps, source and destination IP addresses, ports, protocols, and attack types in the CSV format. A significant aspect of the CICIDS2017 dataset was the generation of realistic background traffic. Utilizing the B-Profile system proposed by Sharafaldin et al. (2016), the dataset profiles the abstract behavior of human interactions to generate naturalistic benign traffic. This approach models the behavior of 25 users across various protocols including HTTP, HTTPS, FTP, SSH, and email, thereby creating a comprehensive and realistic dataset for intrusion detection research.
Despite the availability of such detailed datasets, accurately detecting anomalies and outliers, especially DDoS attacks, remains a challenge. Factors such as high dimensionality, redundant features, and class imbalance in the data can hinder the performance of traditional detection methods. Therefore, there is a pressing need to develop advanced models that not only improve detection accuracy but also optimize computational efficiency and resource utilization.
A. Problem Statement
This research addresses the challenge of detecting anomalies and DDoS attacks within network traffic by improving intrusion detection accuracy using machine learning algorithms. Specifically, it focuses on enhancing the performance of IDS by leveraging deep learning techniques to classify network flow samples into benign or malicious categories, considering both binary and multiclass classification scenarios.
B. Objectives
The primary objective of this study is to develop a robust intrusion detection model that achieves high accuracy while maintaining computational efficiency. The specific goals include:
II. LITERATURE REVIEW
Intrusion Detection Systems (IDS) are critical components in modern network security infrastructure, designed to monitor network traffic and identify potential threats or unauthorized access [2]. Over the years, IDS have evolved from signature-based detection methods to incorporate advanced machine learning (ML) and deep learning techniques, enabling them to detect novel and sophisticated cyber-attacks [3].
A. Intrusion Detection Systems and Machine Learning
Traditional IDS rely on predefined signatures or anomaly detection rules, which can be insufficient against zero-day attacks or sophisticated evasion techniques [4]. Machine learning offers a dynamic and adaptive approach, where models learn patterns from data to detect anomalies [4]. Studies have demonstrated that ML-based IDS can significantly improve detection rates and reduce false positives [1].
B. Feature Engineering in IDS
Feature engineering is a pivotal process in developing ML models for IDS. It involves selecting and transforming variables to improve model performance [5]. In network intrusion detection, features can include various network flow characteristics such as packet sizes, durations, protocols, and statistical measures [1]. Effective feature selection reduces dimensionality, mitigates the curse of dimensionality, and enhances model interpretability [6]. Techniques like principal component analysis (PCA) and autoencoders have been used to extract significant features [7].
Sharafaldin et al. [1] emphasized the importance of comprehensive feature selection and proposed a new approach for generating reliable datasets, leading to the creation of the CICIDS2017 dataset.
C. CICIDS2017 Dataset in IDS Research
The CICIDS2017 dataset has become a standard benchmark for evaluating IDS models due to its realistic representation of network traffic and inclusion of contemporary attack types [1]. Several studies have utilized this dataset to assess the effectiveness of various detection techniques.
For instance, Maciá-Fernández et al. [8] investigated the impact of metadata features on machine-learned IDS models using the CICIDS2017 dataset. They found that including certain metadata can contaminate the model, leading to overfitting and reduced generalizability. Their study highlights the necessity of careful feature selection to avoid the inclusion of non-representative data that could skew model performance.
Other researchers have employed the dataset to explore deep learning approaches. Yin et al. [9] proposed a deep learning framework using recurrent neural networks (RNN) to achieve high detection rates. Similarly, Ullah and Mahmoud [10] developed a hybrid model combining convolutional neural networks (CNN) and long short-term memory (LSTM) networks, demonstrating improved performance on the CICIDS2017 dataset.
D. Machine Learning Approaches for IDS
Among machine learning algorithms, ensemble methods like Extreme Gradient Boosting (XGBoost) have gained significant attention for IDS applications [11]. XGBoost is known for its scalability, efficiency, and high predictive accuracy, especially on large and complex datasets.
XGBoost operates by building an ensemble of weak learners, typically decision trees, in a sequential manner where each new tree focuses on correcting the errors of the previous ones [11]. This approach makes it powerful for classification tasks within IDS, particularly in anomaly detection where patterns are subtle and complex.
A key advantage of XGBoost is its ability to handle imbalanced datasets, a common issue in IDS where attack instances are rare compared to normal traffic [12]. Techniques such as weighting the loss function and using appropriate evaluation metrics help in addressing class imbalance. Furthermore, XGBoost’s parallelization capability ensures it trains faster compared to other gradient boosting methods.
Comparisons with other algorithms have been explored in the literature. LightGBM, developed by Microsoft, offers faster training and lower memory usage by using histogram-based algorithms and leaf-wise tree growth [13]. Leaf-wise growth often achieves higher accuracy than the level-wise growth used in many other algorithms, particularly when dealing with large datasets or high-dimensional data [14].
CatBoost, another gradient boosting algorithm, is designed to handle categorical features efficiently and reduce overfitting [15]. It incorporates techniques to address the prediction shift caused by target leakage, which is beneficial in IDS applications where data integrity is crucial.
Random Forest, an ensemble method using bagging and feature randomness, constructs multiple decision trees and aggregates their results, offering robustness and ease of interpretation [16]. Although it may not achieve the same level of accuracy as boosted models, Random Forest is valuable for its robustness to overfitting and its performance on a wide range of datasets [17].
In the context of IDS, the choice between these algorithms depends on specific dataset characteristics and computational constraints. For instance, LightGBM might be preferred for very large datasets due to its speed and efficiency [14], while Random Forest may be suitable for problems where model interpretability is a priority.
E. Feature Interpretation and Explainability
Understanding model decisions is crucial in IDS to ensure trust and compliance. Techniques such as Shapley values provide insights into feature contributions to predictions [18]. Shapley values offer a unified approach to interpreting model outputs, helping analysts understand the importance of each feature in the detection process [19].
Moreover, tools like Pandas Profiling (now known as ydata-profiling) facilitate exploratory data analysis by generating comprehensive reports on dataset features [20]. Such tools help in identifying data quality issues, understanding feature distributions, and uncovering patterns that may influence model performance.
Quantile functions are also used in statistical analysis to understand the distribution of features, which is valuable in preprocessing and normalizing data for ML models [21]. By analyzing quantiles, researchers can detect outliers and better prepare data for training robust IDS models.
F. Summary
The integration of advanced machine learning techniques in IDS has shown significant promise in enhancing detection capabilities. The CICIDS2017 dataset serves as a robust benchmark for evaluating these methods. XGBoost, among other algorithms, has demonstrated high accuracy and efficiency in handling the complexities of intrusion detection. However, the choice of algorithm should be tailored to the specific characteristics of the dataset and the requirements of the deployment environment. Feature engineering and model interpretability remain critical components in developing effective IDS solutions.
III. METHODOLOGY
A. Dataset Description
The Canadian Institute for Cybersecurity’s Intrusion Detection System 2017 (CICIDS2017) dataset is utilized in this research as the primary source of network traffic data for intrusion detection analysis. This dataset is renowned for its comprehensive and realistic representation of modern network traffic patterns, including both benign activities and a variety of malicious attacks [1].
The CICIDS2017 dataset was created with the goal of resembling true real-world data, incorporating Packet Capture (PCAP) files that capture network traffic over a period of five consecutive days. The data collection commenced at 9 a.m. on Monday, July 3, 2017, and concluded at 5 p.m. on Friday, July 7, 2017, covering both normal and attack traffic periods.
Key features of the dataset include:
3. Data Format: The dataset includes labeled flow-based features extracted using CICFlowMeter, providing detailed information such as timestamps, source and destination IP addresses, ports, protocols, and attack labels. The data is stored in CSV files, facilitating ease of use for machine learning applications.
The inclusion of both benign and malicious activities, along with comprehensive feature representation, makes the CICIDS2017 dataset suitable for developing and evaluating intrusion detection models under realistic network conditions.
B. Data Preprocessing
Effective data preprocessing is essential to ensure the quality and reliability of the machine learning models developed for intrusion detection. The following steps were undertaken to prepare the dataset for analysis:
1) Data Consolidation:
2) Handling Null Values:
    null_counts = df_data.isnull().sum()
    total_nulls = null_counts.sum()
    print(f"{total_nulls} null entries have been found in the dataset\n")

    df_data.dropna(inplace=True)
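Flow-rate columns in CICFlowMeter exports (for example `Flow Bytes/s`) can contain infinite values, which `dropna` alone does not remove. The following is a minimal sketch of one way to handle them, assuming such values are present; the small frame `toy` and its contents are hypothetical stand-ins for `df_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_data (values are illustrative only).
toy = pd.DataFrame({
    "Flow Duration": [10, 0, 25, 40],
    "Flow Bytes/s": [1200.0, np.inf, np.nan, 350.0],
    "Label": ["Benign", "DDoS", "Benign", "DDoS"],
})

# Treat +/- infinity as missing so that dropna() removes those rows as well.
cleaned = (
    toy.replace([np.inf, -np.inf], np.nan)
       .dropna()
       .reset_index(drop=True)
)

print(len(toy), "rows before cleaning,", len(cleaned), "rows after")
```

Whether dropping such rows is preferable to imputing them depends on how many flows are affected, so the counts are worth inspecting first.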
3) Handling Duplicate Entries:
    duplicate_count = df_data.duplicated().sum()
    print(f"{duplicate_count} duplicate entries have been found in the dataset\n")

    df_data.drop_duplicates(inplace=True)
    df_data.reset_index(drop=True, inplace=True)
4) Data Type Inspection:
    categorical_columns = df_data.select_dtypes(include=['object']).columns.tolist()
    print("Categorical columns:", categorical_columns, '\n')
5) Feature and Target Separation:
    X = df_data.drop('Label', axis=1)
    y = df_data['Label'].copy()
C. Transformation into Binary and MultiClass Classification
The nature of intrusion detection problems can vary based on the specific objectives. In this research, the problem was approached from both binary and multiclass classification perspectives.
1) Justification for Transforming the Problem Binary Classification
2) Implementation of Classification Configurations Binary Classification Configuration
∗ 0 for ’Benign’ traffic.
∗ 1 for all types of attacks.
    # Binarize labels: map 'Benign' to 0 and all other attack labels to 1
    y_b = y.map({'Benign': 0}).fillna(1)
A naive model predicting the majority class would achieve an accuracy equal to the proportion of the dominant class (e.g., if benign traffic is 84.92%, the baseline accuracy is 84.92%).
The model must outperform this baseline to be considered effective.
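The majority-class baseline can be computed directly from the label counts. The sketch below is illustrative only: the function name `majority_baseline` and the toy label list are hypothetical, chosen so that the benign share matches the 84.92% example quoted above.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy) for a model that always predicts
    the most frequent class."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical binarized labels: 8,492 benign (0) and 1,508 attack (1) flows.
toy_labels = [0] * 8492 + [1] * 1508

baseline_label, baseline_acc = majority_baseline(toy_labels)
print(f"Always predicting class {baseline_label} yields accuracy {baseline_acc:.4f}")
```

Any trained classifier whose accuracy does not exceed `baseline_acc` has learned nothing useful about the minority class.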
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Encode each attack label as an integer class index
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)
    y_m = pd.Series(y_encoded)
∗ Some attack types constitute a very small fraction of the dataset.
∗ Special techniques (e.g., resampling, class weighting) may be necessary to ensure these classes are adequately represented during model training.
3) Addressing Class Imbalance
∗ Oversampling minority classes or undersampling majority classes to balance the dataset.
Figure 1: Class Imbalance between binary and multiclass configuration
∗ Using models that incorporate class weights (e.g., XGBoost) to penalize misclassification of minority classes more heavily.
∗ Relying on metrics beyond accuracy, such as precision, recall, F1-score, and area under the ROC curve, to assess model performance on imbalanced data.
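To make the last two points concrete, the sketch below computes a class-weight ratio and the precision, recall, and F1-score from confusion counts. The counts are hypothetical, not results from this study. The ratio of negative to positive samples is the value commonly passed to XGBoost's `scale_pos_weight` parameter in binary tasks; whether that weighting was used here is an assumption.

```python
def scale_pos_weight(n_negative, n_positive):
    """Negative-to-positive ratio, commonly used to up-weight the minority class."""
    return n_negative / n_positive

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score for the positive (attack) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical confusion counts for 10,000 flows, 1,508 of which are attacks.
tp, fn = 1200, 308   # attacks detected / attacks missed
fp, tn = 100, 8392   # benign flagged as attack / benign passed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
weight = scale_pos_weight(tn + fp, tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} weight={weight:.2f}")
```

In this toy case the accuracy is high while the recall is noticeably lower, which is the gap that accuracy alone hides on imbalanced data.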
4) Benefits of Dual Approach
D. Conclusion on Transformation
Transforming the intrusion detection problem into both binary and multiclass classification tasks allows for a robust evaluation of machine learning models.
It addresses different practical needs in network security, from the quick detection of any intrusion to the precise identification of attack types. By meticulously preprocessing the data and thoughtfully configuring the classification tasks, the research sets a solid foundation for developing effective and efficient intrusion detection models.
E. Feature Engineering
Feature selection is a crucial step in developing efficient and effective machine learning models, especially in high-dimensional data scenarios common in intrusion detection systems. In this research, we performed feature selection using the Extreme Gradient Boosting (XGBoost) algorithm. XGBoost is a tree-based ensemble learning algorithm that inherently performs feature selection during its training process. This embedded method helps in selecting the most informative features while reducing the dimensionality of the input space.
1) Introduction to Feature Selection with XGBoost
XGBoost is widely used for classification and regression tasks due to its scalability and speed [11]. One of its key advantages is the ability to automatically compute feature importance scores during model training. These scores indicate the contribution of each feature to the predictive performance of the model.
Explanation of Feature Importance Calculation In decision tree-based algorithms like XGBoost, feature importance scores are derived from the impact of features on reducing impurity in decision trees. One common measure of impurity used in classification tasks is the Gini impurity.
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. For a binary classification problem with classes 0 and 1, the Gini impurity G for a node with N samples is calculated as:
$$G = 1 - \sum_{i=0}^{1} p_i^2, \tag{1}$$
where $p_i$ is the probability of class $i$ in the node.
During the construction of decision trees, at each node, possible splits on each feature are evaluated, and the split that maximally reduces the Gini impurity is selected. The feature importance score is then calculated based on the total reduction in impurity achieved by splitting on that feature across all trees in the ensemble.
Features that contribute more to reducing the impurity (i.e., lead to greater reduction in Gini impurity when used for splitting nodes) are considered more important by the algorithm. These important features can be used for further analysis or as input to other machine learning models, such as neural networks.
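As a worked illustration of the impurity-reduction idea, the sketch below evaluates the binary Gini impurity and the weighted reduction achieved by two candidate splits of a toy node. The function names and label lists are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    p0 = 1.0 - p1
    return 1.0 - (p0 ** 2 + p1 ** 2)

def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]   # perfectly mixed node

# A split that separates the classes completely.
good = impurity_reduction(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent.
poor = impurity_reduction(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(f"parent impurity={gini(parent):.2f}, good split={good:.2f}, poor split={poor:.2f}")
```

A feature that repeatedly offers splits like the first accumulates a large total reduction and therefore a high importance score, while a feature that only offers splits like the second contributes nothing.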
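Relationship Between Gini Index and Feature Importance The Gini index is a common metric for assessing impurity when constructing classification trees. Strictly speaking, XGBoost scores candidate splits by the gain in its regularized, gradient-based objective rather than by Gini impurity [11], but the resulting importance logic is analogous: features whose splits yield larger improvements rank higher. For a node with class probabilities p1 and p2 (where p1 + p2 = 1), the Gini index is calculated as: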
$$GI = 1 - (p_1^2 + p_2^2). \tag{2}$$
Lower values of the Gini index indicate a higher degree of class separation, signifying improved model performance. During tree construction, the algorithm selects splits that maximally reduce the Gini index. Feature importance is determined by the total reduction in the Gini index achieved by splits involving that feature across all nodes of all trees in the ensemble.
2) Implementation of Feature Selection
The feature selection process was implemented using the following code.
The function above calculates the feature importances using an XGBoost classifier trained on the provided training data. It can optionally print out the feature names alongside their importance scores for inspection.
This function selects features whose importance scores exceed a specified threshold.
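The thresholding step can be sketched as follows, assuming the importance scores have already been obtained (for example from the `feature_importances_` attribute of a fitted `xgboost.XGBClassifier`). The function name `select_features`, the feature names, the scores, and the threshold value are all hypothetical and are not the author's implementation.

```python
def select_features(feature_names, importances, threshold=0.01):
    """Return (name, score) pairs whose score exceeds the threshold,
    sorted from most to least important."""
    pairs = [
        (name, float(score))
        for name, score in zip(feature_names, importances)
        if score > threshold
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical importance scores for four flow features.
names = ["Flow Duration", "Fwd Packet Length Max", "Bwd Packet Length Std", "Idle Min"]
scores = [0.02, 0.45, 0.52, 0.001]

feature_tuples = select_features(names, scores, threshold=0.01)
for name, score in feature_tuples:
    print(f"{name}: {score:.3f}")
```

Returning (name, score) tuples keeps the selected names and their scores together, which suits plotting code that reads `tup[0]` and `tup[1]`.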
3) Feature Selection Under Binary and Multiclass Configurations
We performed feature selection separately under binary and multiclass configu- rations.
Binary Configuration We defined the classifier parameters and executed the feature selection process:
Multiclass Configuration Similarly, for the multiclass classification task:
4) Visualization of Feature Importances
We plotted the feature importances for both configurations to visualize the significance of each feature.
    import matplotlib.pyplot as plt

    # Define figure and subplots
    fig, axs = plt.subplots(1, 2, figsize=(16, 8))

    # Binary configuration
    feature_names_b = [tup[0] for tup in feature_tuples_b]
    significance_scores_b = [tup[1] for tup in feature_tuples_b]
    axs[0].barh(feature_names_b, significance_scores_b, color='skyblue')
    axs[0].set_xlabel('Feature Importance')
    axs[0].set_ylabel('Features')
    axs[0].set_title("Feature Importance under Binary Classification")
    axs[0].invert_yaxis()

    # Multiclass configuration
    feature_names_m = [tup[0] for tup in feature_tuples_m]
    significance_scores_m = [tup[1] for tup in feature_tuples_m]
    axs[1].barh(feature_names_m, significance_scores_m, color='red')
    axs[1].set_xlabel('Feature Importance')
    axs[1].set_ylabel('Features')
    axs[1].set_title("Feature Importance under Multiclass Classification")
    axs[1].invert_yaxis()

    # Adjust layout and display
    plt.tight_layout()
    plt.show()
5) Analysis of Unique Features
    # Features unique to binary configuration
    unique_in_b = set(feature_names_b) - set(feature_names_m)

    # Features unique to multiclass configuration
    unique_in_m = set(feature_names_m) - set(feature_names_b)
We identified features that were unique to each configuration:
6) Creation of New Features
While the primary focus was on selecting existing features based on their importance scores, we also explored the creation of new features that could potentially enhance model performance. This involved combining existing features or computing statistical measures that may capture underlying patterns in the data.
Figure 2: Feature Importance Plots for Binary and Multiclass Configurations
The intended impact of creating new features was to provide the model with additional information that could improve its ability to distinguish between benign and malicious traffic, as well as between different types of attacks in the multiclass configuration.
F. Conclusion on Feature Engineering
Feature selection using XGBoost allowed us to identify and retain the most informative features, effectively reducing the dimensionality of the dataset without compromising model performance. The reduction in dimensionality leads to lower computational costs and potentially improves the generalization of the model. The differences in selected features between the binary and multiclass configurations highlight the importance of context when performing feature selection.
G. Outlier Analysis
Note that outlier analysis is performed only for the Binary Configuration. We avoided this procedure for the multiclass setup to save time, as we expect analogous results. Since there is no outlier-handling procedure in the multiclass setting, this analysis serves to enhance our understanding of the problem.
1) Definition of Outlier
An outlier in statistics is a data point that deviates significantly from the overall pattern of the remaining data. Mathematically, there isn’t a universally accepted definition, but two common approaches are often used:
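Z-scores We can define an outlier based on its Z-score, which measures how many standard deviations a data point is away from the mean:
$$Z = \frac{x - \mu}{\sigma} \tag{3}$$
Where:
∗ $x$ is the value of the data point.
∗ $\mu$ is the mean of the feature.
∗ $\sigma$ is the standard deviation of the feature.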
We can then define a threshold for Z-scores (e.g., ±3 standard deviations). Data points exceeding this threshold in absolute value can be considered outliers.
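A minimal sketch of the Z-score rule, using only the standard library. The function name and the toy values are hypothetical, and the population standard deviation is an assumed choice.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # constant feature: no value deviates from the mean
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical feature: twenty typical readings and one extreme value.
values = [10.0] * 20 + [100.0]
print(zscore_outliers(values))
```

Because the mean and standard deviation are themselves inflated by extreme values, this rule can miss outliers in heavily skewed features.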
Interquartile Range (IQR) This approach utilizes the quartiles of the data distribution, where Q1 is the first quartile (25th percentile) and Q3 is the third quartile (75th percentile).
We calculate the Interquartile Range (IQR):
$$\mathrm{IQR} = Q_3 - Q_1 \tag{4}$$
The lower and upper bounds are defined as:
$$\begin{aligned} \text{Lower bound} &= Q_1 - 1.5 \times \mathrm{IQR} \\ \text{Upper bound} &= Q_3 + 1.5 \times \mathrm{IQR} \end{aligned} \tag{5}$$
Outliers can be defined as data points falling outside these bounds. Data points below the lower bound or exceeding the upper bound can be considered potential outliers.
Since the IQR method is robust against skewed distributions, while Z-scores are better for normally distributed data, the IQR approach will be adopted in this analysis.
2) Implementation of Outlier Detection
We implemented the outlier detection using the IQR method on our dataset. The following code demonstrates how outliers are detected:
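One way the IQR rule might be applied column by column with pandas is sketched below. This is an illustrative reconstruction under stated assumptions, not the author's listing: the function name `iqr_outlier_mask`, the toy frame, and the default quantile interpolation are all hypothetical choices.

```python
import pandas as pd

def iqr_outlier_mask(df, factor=1.5):
    """Boolean frame that is True where a value falls outside
    [Q1 - factor * IQR, Q3 + factor * IQR] for its column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # Compare every column against its own bounds.
    return df.lt(lower, axis="columns") | df.gt(upper, axis="columns")

# Hypothetical numeric features: one column with an extreme value, one constant.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
})

mask = iqr_outlier_mask(toy)
rows_with_outlier = mask.any(axis=1)

print(mask.sum())  # outlier count per feature
print(f"{rows_with_outlier.mean():.0%} of rows contain at least one outlier")
```

Counting rows with `any(axis=1)` gives the share of samples affected, which is the kind of dataset-level figure reported in the analysis.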
3) Outlier Boxplot Visualization
To visualize the distribution of features and the presence of outliers, we used boxplots. The following function generates boxplots for the specified features:
    import numpy as np
    import matplotlib.pyplot as plt

    def visualize_boxplots(df, features, figsize=(15, 28)):
        """
        Visualizes boxplots for the specified features using a grid layout.

        Args:
            df (pandas.DataFrame): The dataframe containing the features.
            features (list): A list of feature names to visualize.
            figsize (tuple, optional): The size of the figure. Defaults to (15, 28).
        """
        n_features = len(features)  # Example: 20 features
        rows = int(np.ceil(n_features / 3))  # Number of rows for a 3-column grid
        fig, axes = plt.subplots(rows, 3, figsize=figsize)

        # Flatten the axes array for easy iteration
        axes_flat = axes.ravel()
        for i, feature in enumerate(features):
            axes_flat[i].boxplot(df[feature])
            axes_flat[i].set_title(f"{feature} Boxplot")
            axes_flat[i].set_ylabel("Value")

        # Hide any unused axes in the last row of the grid
        for ax in axes_flat[n_features:]:
            ax.set_visible(False)

        plt.tight_layout()
        plt.show()

    # Visualize boxplots
    visualize_boxplots(X_final_b, X_final_b.columns.tolist())
Interpretation of Boxplots The boxplot elements provide insights into the distribution of data along the y-axis:
Figure 3: Boxplots of Features for Outlier Analysis
Our statistical analysis identified a significant number of outliers, about 25% of the data. To address these outliers, efficient data scaling techniques will be preferred over traditional methods like winsorization, in order to minimize potential information loss. While other outlier detection methods like Isolation Forest and Local Outlier Factor exist, their high computational costs make them impractical for the large size of the current dataset.
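The text does not name the scaling technique, so the following sketch is an assumption made for illustration. It contrasts percentile-based winsorization, which overwrites extreme values, with a median/IQR ("robust") scaling, which rescales them but keeps them recoverable. The function names, percentile limits, and toy data are hypothetical.

```python
import numpy as np

def winsorize(x, lower_pct=5, upper_pct=95):
    """Clip values to the given percentiles; extreme values are overwritten."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def robust_scale(x):
    """Center on the median and divide by the IQR; the transform is invertible."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    return (x - median) / iqr if iqr > 0 else x - median

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

w = winsorize(x)
r = robust_scale(x)

print("winsorized max:", w.max())      # the extreme value has been replaced
print("robust-scaled max:", r.max())   # the extreme value is rescaled, not lost
```

Because the median and IQR are barely affected by the extreme value, the bulk of the scaled feature stays in a narrow range while the outlier remains distinguishable.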
[1] Sharafaldin, I., Lashkari, A. H., & Ghorbani, A. A. (2018). Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP) (pp. 108–116).
[2] Scarfone, K., & Mell, P. (2007). Guide to Intrusion Detection and Prevention Systems (IDPS). NIST Special Publication 800-94.
[3] Ahmed, M., Mahmood, A. N., & Hu, J. (2016). A survey of network anomaly detection techniques. Journal of Network and Computer Applications, 60, 19–31.
[4] Sommer, R., & Paxson, V. (2010). Outside the closed world: On using machine learning for network intrusion detection. In 2010 IEEE Symposium on Security and Privacy (pp. 305–316). IEEE.
[5] Hall, M. A. (1999). Correlation-based feature selection for machine learning. PhD Thesis, University of Waikato.
[6] Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157–1182.
[7] Wang, W., Sheng, Y., Wang, J., Zeng, X., Ye, X., & Huang, Y. (2018). HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection. IEEE Access, 6, 1792–1806.
[8] Maciá-Fernández, G., García-Teodoro, P., & Mirsky, Y. (2022). Establishing the Contaminating Effect of Metadata Feature Inclusion in Machine-Learned Network Intrusion Detection Models. In Engineering Secure Software and Systems (pp. 21–37). Springer. Retrieved from https://link.springer.com/chapter/10.1007/978-3-031-09484-2_2
[9] Yin, C., Zhu, Y., Fei, J., & He, X. (2017). A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access, 5, 21954–21961.
[10] Ullah, I., & Mahmoud, Q. H. (2020). A hybrid model for anomaly-based intrusion detection in software-defined networks. Journal of Network and Computer Applications, 157, 102563.
[11] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
[12] Sun, Y., Kamel, M. S., Wong, A. K., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358–3378.
[13] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 30.
[14] Wang, C., Pang, Y., Li, Y., & Yuan, F. (2020). LightGBM: A novel ensemble boosting model for accurate soccer result prediction based on bookmaker odds. Entropy, 22(4), 437.
[15] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, 31.
[16] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
[17] Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
[18] Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 30.
[19] Molnar, C. (2020). Interpretable Machine Learning. Retrieved from https://christophm.github.io/interpretable-ml-book/shapley.html
[20] DataCamp. (2023). Pandas Profiling (ydata-profiling) in Python: A Guide for Beginners. Retrieved from https://www.datacamp.com/tutorial/pandas-profiling-ydata-profiling-in-python-guide
[21] Ross, K. D. (2021). Quantile functions. In Probability and Statistics (pp. 111–119). Open Educational Resource. Retrieved from https://bookdown.org/kevin_davisross/probsim-book/quantile-functions.html
Copyright © 2024 Arjun R Nair. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET65559
Publish Date : 2024-11-26
ISSN : 2321-9653
Publisher Name : IJRASET