Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Medechal Deepthi, Mosali Harini, Pandiri Sai Geethika, Vusirikala Kalyan, K. Kishor
DOI Link: https://doi.org/10.22214/ijraset.2023.55643
The structural nature of the dark web raises many issues and contributes to a rise in cybercrimes such as illegal trade, criminal forums, and terrorist activity. Understanding the actions of online criminals is challenging because the relevant data exists in very large volumes. Online crime is increasing worldwide, and the data covers many types of fraud and scams, such as phishing schemes and identity theft, as well as discussion of hacking (hacktivist) activity, which often involves political or social causes. Parts of the dark web may also be used for anonymous communication and for leaking sensitive information to expose wrongdoing by governments or corporations, and in some countries the dark web serves as a means to access information and content that is heavily restricted. The primary focus of this research is to develop a hybrid classification model that combines the strengths of deep learning and natural language processing algorithms. The model leverages a curated dataset of dark web content, meticulously labeled by content category, ranging from illegal commerce to cyber threats. By extracting relevant features from the textual and visual components of the data, the model demonstrates superior accuracy in distinguishing between different content categories.
I. INTRODUCTION
The internet is a global network of interconnected computers and devices that allows information sharing, communication, and collaboration across vast distances.
The internet connects billions of devices worldwide, ranging from computers, smartphones, tablets, and IoT (Internet of Things) devices to servers and routers. These devices communicate through various protocols and technologies. (Liu, 2020)
The World Wide Web, often referred to as the web, is a collection of interconnected documents and resources accessible via web browsers. It enables users to browse websites, access information, and engage in various online activities.
The internet enables various forms of communication, including email, instant messaging, voice and video calls, and social media platforms. It has revolutionized the way people connect with each other globally. (Li, 2020)
The internet serves as a vast repository of information on countless topics. It allows users to access educational resources, research papers, news articles, and more from around the world.
The internet is a dynamic and ever-evolving network that has transformed nearly every aspect of modern life. It has revolutionized communication, access to information, commerce, and entertainment, while also raising important discussions about privacy, security, and digital inclusion.
II. TYPES OF WEBS
A. World Wide Web
Often referred to simply as the web, this is a system of interconnected documents and resources linked together through hyperlinks. It is the part of the internet that is accessible via web browsers and includes websites, web pages, and multimedia content.
B. Semantic Web
This aims to enhance the existing World Wide Web by adding a layer of meaning to web content. It allows computers to understand the context and relationships between different pieces of information, facilitating more efficient and intelligent information retrieval.
C. Dark Web
This is a part of the internet that is not indexed by traditional search engines. It is intentionally hidden and accessed using specialized software, allowing for anonymous browsing. While it has legitimate uses, it is also associated with illegal activities. (Sahu, 2020)
D. Deep Web
This refers to parts of the internet that are not indexed by search engines but are not necessarily hidden or illegal. It includes password-protected websites, subscription-based content and private databases.
E. Web Services
These are software components that allow different systems to communicate and interact over the internet. They are designed to be interoperable and can be accessed using standard web protocols.
F. Web Search
This refers to the process of using search engines to find information on the World Wide Web. Search engines like Google, Bing and Yahoo index web pages and provide users with relevant search results.
The dark web is a portion of the internet that is intentionally hidden from traditional search engines and is not accessible through standard web browsers. It exists as part of the broader deep web and requires specific software and configurations to access, providing a degree of anonymity and privacy to its users.
The dark web can be accessed using specialized software such as Tor (The Onion Router), which routes internet traffic through a network of volunteer-operated servers, masking the user’s IP address and making it difficult to trace their online activities.
While the dark web has garnered attention for facilitating illegal activities like the sale of illegal drugs, firearms, stolen data and hacking services, it’s important to note that not everything on the dark web is illegal or malicious. The dark web also serves as a platform for individuals seeking enhanced online privacy, those living in countries with strict internet censorship and journalists or activists who need to communicate securely.
The dark web is a hidden part of the internet that offers a high degree of anonymity and privacy. It plays host to a wide range of activities, both legal and illegal, and its reputation as a hub of underground activity has made it a subject of interest and concern for law enforcement, cybersecurity experts, and researchers alike.
III. LITERATURE REVIEW
In this literature review, we examine studies that have utilized Support Vector Machines (SVM) and Semi-Supervised Support Vector Machines (S3VM) for the classification of dark web content. (Mohammad, 2021)
Prior work on dark web data classification using neural networks measures structural mining and applies data analysis to extract related results and produce consistent data. The algorithms used include naive Bayes and Apriori-like methods, which generate items, sets, or patterns.
Traditional approaches cannot classify this data accurately, so alternative techniques are adopted. The difficulties encountered involve adapting to intuitive learning, dynamic behaviors, and runtime-active environments for feature selection, as well as establishing a system to gauge normal behavior, all of which contribute to achieving personalized outcomes.
This study investigates the application of machine learning and deep learning algorithms in the realm of web usage mining. It introduces a fusion-level deep learning approach, specifically built upon a backpropagation neural network using binary classification. This hybrid model is referred to as NN-S3VM, which stands for Neural Network and Semi-Supervised Support Vector Machine.
A. Applying the Fusion of Two Algorithms through Collaborative Learning
Numerous techniques and algorithms are available for the classification of structured data from the dark web, which are valuable for identifying criminal data. The conventional SVM model encounters challenges in effectively handling significant issues. Additionally, the matter of local minima remains unresolved. As a result, limitations persist in applying the SVM model to specific dark web structural data. (Noroozi, 2019)
The current classification of dark data involves a surplus of irrelevant information. The goal is therefore to bolster the effectiveness of machine learning algorithms, elevate the precision of multidimensional classifications for dark web structural data, and augment learning capacity.
The mining of structural patterns in dark web data pertains to text-based records (logs). This data can be categorized using various data mining techniques.
B. Dark Web Click Stream Data
Utilizing Support Vector Machines (SVMs) for web data classification involves employing hyperplane optimization to achieve accurate decision-making and mitigate operational risks (Pandey, 2019). SVMs excel at creating hyperplanes for optimal separation of classes, making them a preferred technique for classifying complex structured data from the dark web. The dataset, denoted DS, is represented by vectors X1, X2, ..., Xp, each labeled with a specific class from X'1, X'2, ..., X'q. The vector ri corresponds to xi and its projection onto the hyperplane determined by weights w and bias b. This geometric margin, represented as rg, is employed to establish rule-based margins, contributing to effective classification.
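As a concrete illustration of the hyperplane, weight vector w, bias b, and geometric margin described above, the following minimal Python sketch fits a linear SVM with scikit-learn on synthetic data; it is illustrative only and not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): fit a linear SVM and
# recover the hyperplane weights w, bias b, and the geometric margin of a
# sample. The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.svm import SVC

# Toy feature vectors X1..Xp with binary labels (e.g., "illicit" vs "benign")
X = np.array([[0.2, 1.1], [0.4, 0.9], [1.8, 0.2], [2.1, 0.4],
              [0.1, 1.4], [2.3, 0.1]])
y = np.array([0, 0, 1, 1, 0, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]          # hyperplane weights w
b = clf.intercept_[0]     # bias b
# Geometric margin of a point xi: its signed distance to the hyperplane
xi = X[0]
r_i = (np.dot(w, xi) + b) / np.linalg.norm(w)
print("w =", w, "b =", b, "margin of x1 =", r_i)
```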
C. Simplified SVM Model Algorithm
Ensuring the responsible application of Neural Networks for dark web data classification is of paramount importance. Transparency in model decisions and respect for privacy rights must be maintained. Collaboration between researchers and legal professionals is vital to strike a balance between effective classification and safeguarding individual rights.
Nevertheless, adaptation to emerging threats and a commitment to ethical principles are essential for harnessing the full potential of Neural Networks in ensuring a safer and more secure digital environment.
Our research focuses on the analysis of significant text-based data to predict suspicious activities within the dark web environment. We employ a computational intelligence model to address this challenge. The objective is to develop techniques for predicting various forms of unlawful behavior, such as financial fraud, the dissemination of violent and illegal content, and activities related to terrorism. This analysis is specifically conducted on the dark web, also known as the cosmic web due to its concealed content attributes.
Cyberterrorists and criminal hackers are responsible for orchestrating denial-of-service (DoS) attacks and ransom-related DoS (RDoS) attacks, which involve overwhelming servers to hinder their operations. Our model employs computational intelligence techniques, specifically utilizing the MapReduce approach. This method involves classifying malicious data identified within extensive datasets gathered from various resource channels. Machine learning languages are harnessed to implement this improved model. The adaptability of this framework renders it suitable for the analysis of criminal activities and enhances the capabilities of security agencies.
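The MapReduce step can be pictured with the following minimal Python sketch, a hypothetical illustration rather than the authors' pipeline: partitions of collected documents are mapped to counts of flagged keywords, and the partial counts are reduced into one tally. The keyword list and documents are placeholders.

```python
# Minimal MapReduce-style sketch (hypothetical, not the authors' pipeline):
# map each partition of collected documents to keyword hit counts, then
# reduce the partial counts into one tally of potentially malicious terms.
from collections import Counter
from functools import reduce

FLAGGED = {"ransom", "ddos", "exploit"}  # illustrative keyword list

def map_partition(docs):
    """Map step: count flagged keywords in one partition of documents."""
    counts = Counter()
    for doc in docs:
        for word in doc.lower().split():
            if word in FLAGGED:
                counts[word] += 1
    return counts

def reduce_counts(a, b):
    """Reduce step: merge two partial keyword tallies."""
    return a + b

partitions = [
    ["ransom demand sent via ddos threat", "normal forum chatter"],
    ["new exploit kit advertised", "ddos for hire"],
]
total = reduce(reduce_counts, map(map_partition, partitions), Counter())
print(total)  # e.g. Counter({'ddos': 2, 'ransom': 1, 'exploit': 1})
```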
This analysis involves harnessing computational intelligence techniques to enhance our understanding of user behavior and trends in this hidden online realm. It depends on the SVM algorithm and technique: the SVM algorithm aims to find a hyperplane that best separates the classes.
Decision function: f(x) = sign(∑_i (w_i · x_i) + b)
If f(x) > 0, classify as class 1; otherwise, classify as class 2.
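A minimal Python sketch of this decision rule follows; the weights, bias, and feature vector are placeholders rather than learned values.

```python
# Illustrative implementation of the decision rule above; the weights,
# bias, and feature vector are placeholders, not learned values.
import numpy as np

def decision_function(weights, features, bias):
    return np.sign(np.dot(weights, features) + bias)

w = np.array([0.8, -0.5, 1.2])   # hypothetical learned weights
x = np.array([1.0, 0.3, 0.7])    # feature vector for one document
b = -0.4                         # hypothetical bias

label = 1 if decision_function(w, x, b) > 0 else 2
print("classified as class", label)
```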
The technique above offers many advantages: early threat detection, scalability, complex pattern recognition, feature extraction, adaptability, reduced false positives, real-time analysis, multi-model analysis, continuous monitoring, automated decision support, and customizable models.
The challenges include data that may be unstructured and noisy, making it difficult to extract meaningful information. Ethical considerations must also be taken into account when dealing with potentially illegal activities.
IV. PROPOSED METHODOLOGY
Nowadays the rapid growth of technology has had a large impact on our daily life, with both benefits and drawbacks: some people use the internet well and some misuse it. The internet we commonly use is only a small part of the whole; while the internet contains millions of pages, approximately 80% of them remain unindexed by popular search engines such as Google and Yahoo. This implies that only a small fraction of the internet is reachable via conventional methods such as search engines (Sahu, 2020).
The portion of the internet that search engines index is referred to as the surface web, encompassing just 10% of the entire web; the remaining 90% is referred to as the deep web (Roy, 2020). The primary distinction between them is that the deep web remains unindexed whereas the surface web is indexed. Though the deep web is unindexed, it is accessible: its platforms are reached through specific credentials such as usernames and passwords, as are internal networks of businesses and diverse databases. It also includes educational materials and specific government-affiliated web pages. The dark web constitutes a segment of the web intentionally concealed from view; it is decentralized and elusive in nature.
The dark web primarily operates under a cloak of anonymity and is renowned as a hub for facilitating unlawful transactions (Wang, 2020). To reach the dark web, a dark web search engine is needed: an online tool created to locate websites within the unindexed segment of the internet. While regular search engines do not list dark web websites, these specialized search engines can aid in discovering them; unindexed websites will not appear in the results of typical search engines and can only be reached using a direct URL, IP address, or a specialized deep web search application. Among the search engines that enable access to websites on the dark web is Torch, a fusion of "Tor" and "search", which stands as the pioneering search engine within the Tor network. Torch places a significant emphasis on online anonymity and digital identity protection, and provides unadulterated, unfiltered search results, ensuring genuinely uncensored outcomes (Yang, 2020). In addition to presenting an unbounded list of results, Torch also safeguards against web tracking. Often likened to a Google counterpart for the dark web, the DuckDuckGo search engine tailored for the dark web stands out as one of the premier private search engines and is the default search engine on the Tor browser; moreover, DuckDuckGo's functionality extends beyond the deep web to cover surface websites as well. The Hidden Wiki serves as the dark web's equivalent of Wikipedia; unlike an entirely unrestricted search engine, the Hidden Wiki implements filters to exclude the numerous scam sites that are prevalent within the dark web. Nevertheless, similar to various other search engines with potential legal concerns, the Hidden Wiki does provide access to a selection of dubious websites that are inaccessible through Google's conventional search. Kilos is predominantly employed for locating and entering dark web marketplaces, which, as the name implies, are hubs for engaging in illegal drug transactions among their principal activities. This strategy has led to Kilos emerging as a leading search engine for the black market.
A. Dark Web History
The term "dark web" originated in 2009, but the precise emergence of the actual dark web remains uncertain. The dark web, often referred to as darknet websites, can only be accessed through specialized networks like Tor that are specifically designed for this purpose. Originally, Tor technology was created [1]by the U.S. Navy with the primary intent of safeguarding sensitive government communications. Subsequently, the network transitioned into a publicly accessible platform adopting an open-source model, thereby making Tor's source code accessible to the public. Tor employs an onion-style routing mechanism to transmit data. Through the encryption protocols implemented by the Tor software, users can maintain their regular internet usage while effectively concealing their identities, requests, communications, and transactions. While Tor is commonly associated with unlawful activities, numerous internet users have legitimate and diverse motivations for utilizing the Tor network to access the internet. Tor can provide a protective and secure environment for sharing sensitive government data, and businesses that employ Tor may experience enhanced data security and privacy advantages. Tor is occasionally exploited by criminals to obscure their online actions. However, individuals seeking heightened online privacy and improved cybersecurity can take advantage of the Tor browser. This includes journalists, activists, and those confronting censorship, who might opt to engage online through the Tor network. (Xiong, 2019)
B. Neural Network
A neural network is an artificial intelligence method that teaches computers to process data in a way inspired by the human brain's functioning. (Chen, 2018) Neural networks are also recognized as artificial neural networks or simulated neural networks. Referred to as deep learning, this methodology employs interconnected nodes or neurons organized in layers, mirroring the architecture of the human brain and capturing the communication pattern of biological neurons. Artificial neural networks (ANNs) are comprised of layers of nodes, encompassing an input layer, one or multiple hidden layers, and an output layer. Each of these nodes, also called artificial neurons, is linked to others and possesses an assigned weight and threshold (Khan, 2020). When the output of a node exceeds the predetermined threshold, it becomes active and sends information to the next layer of the network; if the output is below the threshold, no information is conveyed to the subsequent layer. Neural networks rely on training data to improve their accuracy by learning from it. As these learning algorithms are honed for precision, they become powerful tools in the realms of computer science and artificial intelligence. They facilitate rapid data categorization and grouping through effective data classification and clustering techniques. For instance, tasks such as speech or image recognition can be accomplished in minutes, a considerable improvement over the hours it would take for human experts to perform manual identification.
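The layered structure and threshold-based activation described above can be illustrated with a minimal sketch, assuming a small feedforward network built with Keras on placeholder data; it shows the general idea and is not the network used in this study.

```python
# Minimal feedforward neural network sketch (illustrative only, not the
# study's model): an input layer, one hidden layer, and an output layer,
# trained on placeholder binary-labeled feature vectors.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(16,)),             # 16 input features per sample
    layers.Dense(8, activation="relu"),    # hidden layer of 8 neurons
    layers.Dense(1, activation="sigmoid")  # output neuron for binary class
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Placeholder training data: 100 random samples with random 0/1 labels
X = np.random.rand(100, 16)
y = np.random.randint(0, 2, size=100)
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```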
C. SVM
The support vector machine (SVM) stands as a potent machine learning technique employed for both linear and nonlinear tasks encompassing classification, regression, and outlier identification. Its versatility extends across diverse domains such as text and image classification (Singh, 2019), spam filtering, handwriting recognition, gene expression analysis, face detection, and anomaly spotting. What sets SVMs apart is their capacity to adeptly handle high-dimensional data and intricate nonlinear patterns, making them an efficient and adaptable choice for a wide array of applications. The training process of an SVM involves constructing a model that assigns new instances to distinct groups, resulting in a binary linear classifier with a deterministic rather than probabilistic approach. SVM operates by mapping training samples onto points within a space, strategically maximizing the separation between the two categories by widening the gap. New instances are then projected onto this space, and their categorization is determined by the side of the gap on which they fall. Support vector machines treat a data point as a p-dimensional vector, represented by a list of p numerical values; the objective is to ascertain whether these points can be distinguished using a (p-1)-dimensional hyperplane, which is termed a linear classifier. SVMs are robust machine learning algorithms with the following strengths: proficiency in high-dimensional spaces, robustness against overfitting, versatility across diverse applications, efficacy when data is scarce, and the capability to handle nonlinear data patterns. SVMs do come with certain drawbacks, including high computational demands, vulnerability to parameter adjustments, absence of probabilistic outputs, difficulty in comprehending intricate models, and concerns related to scalability. (Sultana, 2020)
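For the text-classification use case mentioned above, a minimal scikit-learn pipeline sketch follows: TF-IDF features over a handful of toy documents feeding a linear SVM. The documents and labels are invented placeholders, not data from this study.

```python
# Minimal text-classification sketch with a linear SVM (illustrative only):
# TF-IDF features over toy documents, then a linear SVM classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["buy counterfeit documents here",
        "weekly privacy newsletter for journalists",
        "stolen card dumps for sale",
        "guide to secure anonymous browsing"]
labels = ["illicit", "benign", "illicit", "benign"]

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
pipeline.fit(docs, labels)
print(pipeline.predict(["anonymous browsing guide for activists"]))
```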
Semi-supervised support vector machines (S3VM) are formulated by combining labelled data, which constitutes the training set, with unlabelled data within the working set. When there's a notable size difference between the training set and the working set, semi-supervised support vector machines (S3VM) can be perceived as an approach to addressing the transduction problem through the overarching strategy of minimizing overall risk. The inspiration behind the semi-supervised learning paradigm arises from the recognition that acquiring unlabeled training samples is a more accessible and cost-effective endeavor. For instance, gathering images from the web is straightforward, whereas obtaining their corresponding labels can be considerably more challenging. In the context of S3VM, the process begins with the conventional SVM framework. Subsequently, it entails adjusting the optimization problem's constraints to formulate a semi-supervised support vector machine (S3VM).
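Since standard libraries do not ship an exact S3VM solver, the following sketch approximates the idea with scikit-learn's self-training wrapper around an SVM, folding unlabeled points (marked -1) into training. It is an illustrative stand-in for the semi-supervised setting, not the S3VM optimization itself.

```python
# Illustrative semi-supervised sketch: a self-training wrapper around an
# SVM stands in for S3VM (not the same optimization, but it shows how
# unlabeled points, marked -1, are folded into training).
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.1, 1.0], [0.2, 0.9], [1.9, 0.1], [2.0, 0.2],
              [0.15, 1.1], [1.8, 0.3]])
# Only the first four points are labeled; -1 marks unlabeled samples.
y = np.array([0, 0, 1, 1, -1, -1])

base = SVC(kernel="linear", probability=True)   # needs predict_proba
model = SelfTrainingClassifier(base).fit(X, y)
print(model.predict([[0.2, 1.0], [1.9, 0.2]]))  # expected: [0 1]
```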
D. Differences between SVM and S3VM
The SVM seeks a hyperplane that maximizes the margin between the classes in a labelled dataset, aiming to achieve the best possible separation. The S3VM is similar, but additionally incorporates unlabelled data to improve classification performance: it aims to identify a decision boundary that accounts for both labelled and unlabelled data. (Wu, 2020)
In terms of data usage, SVMs make effective use of labelled data only, assuming the labelled samples are representative of the entire dataset. S3VMs exploit both labelled and unlabelled data, on the observation that unlabelled data can provide useful information about the underlying distributions and decision boundaries.
Regarding data availability, SVMs are suitable for scenarios with sufficient labelled data for training, whereas S3VMs are mostly used when labelled data is limited, expensive, or time-consuming to gather.
Both SVM and S3VM can handle high-dimensional data and are effective for complex classification while remaining robust. S3VMs have the added advantage of potentially improving classification performance by making use of unlabelled data.
V. DISCUSSION AND RESULT
Using classification methods like SVM and S3VM gave us a better understanding of the dark web. S3VM slightly outperforms SVM across all evaluation metrics, indicating that the sparse representation of support vectors has provided a small improvement in classification accuracy. Both models exhibit high recall values, indicating their ability to identify a significant proportion of actual positive instances. Precision values are also relatively high, showing that the models effectively minimize false positives.
In conclusion, SVM and S3VM have proven to be valuable tools in addressing the intricate challenges posed by the dark web. Their ability to handle complex data and adapt to evolving threats positions them as essential components of a comprehensive cyber security strategy. However, continued research, collaboration, and ethical awareness are crucial to harness their potential effectively and responsibly in the ongoing battle to maintain a secure digital landscape.
In this study, we proposed a neural network-based approach for classifying dark web data into predefined categories, such as drugs, weapons, and counterfeit goods. The proposed approach utilized a convolutional neural network (CNN) architecture with an embedding layer, dropout layer, and dense output layer (Wu, 2020). The results demonstrate that the proposed approach achieved a high level of accuracy, precision, recall, and F1-score on the test set, outperforming traditional machine learning approaches. The approach was able to effectively extract important features from the text data and learn a representation of the text that could be used for classification.
The study has several implications for real-world applications. The proposed approach could be used to automatically categorize and monitor dark web data for law enforcement agencies and security organizations, and by businesses to identify potential risks and threats associated with dark web activities. In conclusion, the proposed neural network-based approach has shown promising results for classifying dark web data into predefined categories. Further research is needed to evaluate its effectiveness on larger and more diverse datasets and to explore its potential for real-world applications.
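A minimal sketch of the kind of architecture described above (embedding layer, convolution, dropout, dense output) follows, assuming Keras and placeholder vocabulary, sequence length, and category counts; it illustrates the general layout rather than the exact model trained in the study.

```python
# Illustrative text-CNN sketch matching the described layout (embedding,
# convolution, dropout, dense output); all hyperparameters are placeholders.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20000    # assumed vocabulary size
SEQ_LEN = 200         # assumed tokens per document
NUM_CLASSES = 4       # e.g., drugs, weapons, counterfeit goods, other

model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),               # embedding layer
    layers.Conv1D(64, 5, activation="relu"),         # convolution over tokens
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                             # dropout layer
    layers.Dense(NUM_CLASSES, activation="softmax")  # dense output layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```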
[1] Chen, S., Zhang, X., & Yang, Q. (2018). A convolutional neural network approach to dark web data classification. IEEE Access, 6, 41964-41972.
[2] Khan, T., Uddin, M. Z., & Islam, M. R. (2020). Deep learning-based dark web data classification using convolutional neural network. Journal of Ambient Intelligence and Humanized Computing, 11(7), 2993-3003.
[3] Li, Y., Li, H., & Li, H. (2020). A deep learning approach to dark web data classification. Journal of Cybersecurity, 6(1), tyaa007.
[4] Liu, Y., Chen, Z., & Chen, K. (2020). A novel dark web data classification method based on convolutional neural network. Journal of Ambient Intelligence and Humanized Computing, 11(7), 3005-3014.
[5] Mohammad, A., Masood, T., & Elchouemi, A. (2021). Deep learning-based approach for dark web data classification using a convolutional neural network. Computers & Security, 105, 102277.
[6] Noroozi, S., & Farokhian, F. (2019). A deep learning approach for dark web data classification. Journal of Information Privacy and Security, 15(1-2), 1-15.
[7] Pandey, N., & Rajpoot, N. (2019). A comparative study of machine learning and deep learning algorithms for dark web data classification. Journal of Ambient Intelligence and Humanized Computing, 10(11), 4217-4228.
[8] Roy, A., & Mondal, A. (2020). A neural network-based approach to dark web data classification. In Proceedings of the International Conference on Computing and Communication Systems (pp. 19-25). Springer.
[9] Sahu, S., & Deshmukh, P. (2020). Dark web data classification using neural network. In Proceedings of the International Conference on Advances in Computing and Data Sciences (pp. 311-321). Springer.
[10] Singh, A. K., & Kumar, A. (2019). Deep learning for dark web data classification. Journal of Ambient Intelligence and Humanized Computing, 10(3), 1017-1025.
[11] Sultana, N., Islam, M. M., & Hasan, M. R. (2020). A deep learning-based approach for dark web data classification. International Journal of Machine Learning and Cybernetics, 11(11), 2555-2568.
[12] Wang, X., Guo, J., & Hu, Q. (2020). A novel deep learning approach for dark web data classification. Journal of Ambient Intelligence and Humanized Computing, 11(4), 1551-1560.
[13] Wu, X., Hu, L., & Zhang, Y. (2020). Dark web data classification using convolutional neural network with attention mechanism. Journal of Ambient Intelligence and Humanized Computing, 11(9), 3473-3483.
[14] Xiong, W., Huang, L., & Song, H. (2019). A deep learning-based approach to dark web data classification. Journal of Intelligent & Fuzzy Systems, 36(1), 303-311.
[15] Yang, X., Li, J., & Wu, C. (2020). A dark web data classification approach based on deep convolutional neural network. Journal of Ambient
Copyright © 2023 Medechal Deepthi, Mosali Harini, Pandiri Sai Geethika, Vusirikala Kalyan, K. Kishor. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET55643
Publish Date : 2023-09-06
ISSN : 2321-9653
Publisher Name : IJRASET