Spam Email Detection Using Convolutional Neural Networks: An Empirical Study

Authors: Akshay Merugu, Hrishikesh Goud Chagapuram, Rahul Bollepalli

DOI Link: https://doi.org/10.22214/ijraset.2023.56143

Abstract

This study adapts Convolutional Neural Networks (CNNs), a deep learning architecture used primarily in image analysis, to the detection of phishing emails. By treating email content as multi-dimensional data, we employ CNNs to extract meaningful features and patterns from email headers, text, and attachments. Our approach not only identifies known phishing templates but can also detect emerging and zero-day phishing attacks.

I. INTRODUCTION

Phishing attacks remain a pervasive and evolving threat in the digital landscape, exploiting human vulnerabilities to deceive individuals and organizations into divulging sensitive information. In response to this escalating cyber menace, this research focuses on the development of a novel approach termed "Phishing CNN" for the automated detection of fraudulent emails.

To enhance the robustness and accuracy of our model, we explore various data preprocessing techniques, feature engineering strategies, and transfer learning from related tasks. Furthermore, we delve into the integration of natural language processing (NLP) techniques to analyze email text and identify subtle linguistic cues that may indicate phishing attempts.

The evaluation of Phishing CNN is carried out on a diverse and large-scale dataset, incorporating real-world phishing emails and legitimate correspondence. Our results demonstrate promising accuracy rates, low false positive rates, and excellent generalization performance, positioning Phishing CNN as a valuable tool in the fight against phishing attacks.

Ultimately, this research contributes to the arsenal of cybersecurity tools, offering a reliable and automated approach to detect fraudulent emails, thereby safeguarding individuals and organizations against the financial, reputational, and security risks associated with phishing threats.

II. LITERATURE REVIEW

An extensive literature survey shows that previous research has focused on how CNN algorithms and other deep learning techniques can be applied to the detection of phishing and spam emails.

One such study is:

“A Deep Learning-Based Phishing Detection System Using CNN, LSTM, and LSTM-CNN” by a group of authors based in Saudi Arabia (detailed references are given in a later section of the paper).

That study applied deep learning techniques, namely CNN and LSTM (Long Short-Term Memory, a recurrent neural network (RNN) architecture widely used in deep learning), as well as a combined LSTM-CNN model. Its aim was to classify phishing URLs in order to prevent financial losses and cybercrime, and it demonstrated the efficacy of LSTM, CNN, and LSTM-CNN for that task. Although the present paper addresses phishing delivered through one particular channel, namely email, the cited study gives meaningful insight into how these models work.

Our project builds upon these foundations, further enhancing the efficacy of LSTM, CNN, and LSTM-CNN in the context of email security, thus contributing to the ongoing battle against phishing attacks.

Another paper, by authors from Sichuan University in China, is:

“Phishing Email Detection Using Improved RCNN Model with Multilevel Vectors and Attention Mechanism”

This paper studies how an RCNN model with multilevel vectors and an attention mechanism can improve on the models currently deployed across the internet. It proposes a phishing email detection model named THEMIS, which models emails at the email header, email body, character, and word levels simultaneously. To evaluate THEMIS, its authors used an unbalanced dataset with realistic ratios of phishing and legitimate emails, and the evaluation demonstrates its potential to enhance email security significantly. Together, these research contributions underscore the continuous evolution and innovation in the domain of phishing email detection.

Several other studies have been conducted on this topic, many of which aim to improve detection models through other techniques. The present paper aims to substantially improve accuracy and other detection metrics using NLP (Natural Language Processing) and sequence padding.

III. CNN METHODOLOGY

A. Neural Networks

Imagine a brain-like system that can learn from examples, make decisions, and solve complex problems. At its core, a neural network is a collection of interconnected nodes, often referred to as "neurons." These neurons work together to process information, just like our brain's neurons. Each neuron receives inputs, performs computations, and produces an output. When combined, these neurons can perform tasks ranging from recognizing images to playing games and making predictions. That’s the essence of a neural network—a powerful computational tool inspired by the human brain.

B. Layers of Neurons

Neural networks are typically organized into layers: an input layer, one or more hidden layers, and an output layer. Think of these layers as processing stages. The input layer receives data (like pixel values in an image), the hidden layers analyze and transform this data, and the output layer produces a final result (like classifying an image as a cat or a dog).

C. Connections and Weights

Connections between neurons are like synapses in our brains. Each connection has a "weight" that determines its strength. These weights are crucial because they influence how information flows through the network. During training, the network adjusts these weights to learn from data.

D. Activation Functions

Neurons use activation functions to decide whether to "fire" or pass information to the next layer. Common activation functions include the sigmoid, ReLU (Rectified Linear Unit), and tanh (hyperbolic tangent). These functions introduce non-linearity, allowing neural networks to learn complex patterns.
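The ideas in this section (layers, weighted connections, and activation functions) can be tied together in a few lines. The following is a minimal sketch in plain Python, not code from the original study; the input values, weights, and biases are made-up numbers chosen only for illustration.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Passes positive values through and clips negatives to zero.
    return max(0.0, x)

tanh = math.tanh  # Squashes any real number into the range (-1, 1).

def neuron(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus a bias, passed through an activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation):
    # One neuron per row of weights; every neuron sees every input.
    return [neuron(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]

# Hypothetical toy network: 3 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
inputs = [0.5, -1.0, 2.0]
hidden = dense_layer(inputs, [[0.2, 0.4, 0.1], [-0.5, 0.3, 0.8]], [0.0, 0.1], relu)
output = neuron(hidden, [1.0, -1.0], 0.0, sigmoid)
print(hidden, output)
```

Training would consist of adjusting the numbers in the weight lists so that `output` moves closer to the desired label; that step is omitted here.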

IV. COMPLICATIONS OF USING TRADITIONAL ANNs FOR IMAGE CLASSIFICATION

For a small image, an artificial neural network with multiple hidden layers can be a reasonable choice. But as the size of the image grows (size here refers not only to the actual dimensions of the image but also to its resolution and the number of features it contains), this approach becomes impractical. Images are high-dimensional data, often with millions of pixels. Traditional ANNs struggle to handle this dimensionality: every pixel is connected to every neuron in the first dense layer, which leads to a very large number of weights and parameters, making training slower and more prone to overfitting. Overfitting is a particular risk with small or noisy datasets; overfit models perform well on training data but poorly on new data because they capture noise and outliers. Processing large images with traditional ANNs is also computationally intensive and may require extensive downsampling or cropping to reduce dimensionality, which can result in information loss and hinder the network's ability to recognize fine details. In short, the network needs far more computing power to fit a large image, and its dense layers of connections put a heavy strain on the CPU.
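To make the scale of the problem concrete, the following sketch counts the trainable parameters of a single fully connected layer and of a single convolutional layer applied to the same image. The image size, the number of dense units, and the number of filters are assumptions chosen purely for illustration.

```python
def dense_params(n_inputs, n_units):
    # Every input is connected to every unit, plus one bias per unit.
    return n_inputs * n_units + n_units

def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    # Each filter has one weight per kernel cell per input channel, plus one bias.
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

# Hypothetical 1000 x 1000 RGB image (about one million pixels, three channels).
height, width, channels = 1000, 1000, 3

dense = dense_params(height * width * channels, 128)  # 128 hidden units (assumed)
conv = conv2d_params(3, 3, channels, 32)              # 32 filters of size 3x3 (assumed)

print(f"dense layer: {dense:,} parameters")  # 384,000,128
print(f"conv layer:  {conv:,} parameters")   # 896
```

The convolutional layer stays small because the same few filter weights are reused at every position in the image, whereas the dense layer needs a separate weight for every pixel-to-neuron connection.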

V. CNN OVER TRADITIONAL ANN

Convolutional Neural Network works in two stages namely:

A. Feature Extraction

Features are the particular traits of an image that the convolutional neural network looks for when matching it against other images during the classification phase. Feature extraction involves multiple phases of convolution, along with a reduction phase called pooling that decreases the number of computations involved.

Once an image is loaded into the network, as with any neural network, it is first converted into a matrix of RGB values representing the colors depicted in the picture.

As shown in the image above, the whole picture is initially converted into a matrix of numbers representing the RGB values of its colors. During convolution, features that characterize the picture are then extracted from it, for example the branches, the water, and many others in the given picture. These features are themselves represented as matrices of values.

B. Convolution

In the convolution phase, the matrix of an extracted feature is slid across the image being classified. At each position, the cells of the feature matrix are multiplied with the underlying cells of the image and the products are summed. The resulting values indicate whether, and how strongly, the feature is present at that position.

This is applied for all the features. The resulting feature maps are in turn checked for higher-level features built from the previous ones, and finally a neural network is formed to classify the given image into one category.
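The sliding multiply-and-sum operation described above can be sketched as follows. This is an illustrative plain-Python version (no padding, stride of 1), not the implementation used in the study; the 4x4 image and the 2x2 edge-detecting feature matrix are made-up values. As in most deep learning libraries, the feature matrix is applied without flipping.

```python
def convolve2d(image, kernel):
    # Slide the kernel over the image; at each position multiply
    # overlapping cells and sum the products ("valid" mode, stride 1).
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Hypothetical grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical feature matrix that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The large values in the middle column mark where the edge sits; the zeros mark regions where the feature is absent.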

Pooling is the process of keeping only the maximum value within each small window of the feature matrix, where the window is usually much smaller than the feature matrix and is moved across it by a fixed stride. Several rounds of convolution and pooling take place, reducing the computations required to classify the image.
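As a rough illustration of max pooling, the sketch below reduces a 4x4 matrix to 2x2 by keeping the largest value in each 2x2 window. The window size, stride, and input values are assumptions chosen for the example.

```python
def max_pool2d(matrix, size=2, stride=2):
    # Move a size x size window across the matrix in steps of `stride`
    # and keep only the largest value found in each window.
    pooled = []
    for i in range(0, len(matrix) - size + 1, stride):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, stride):
            window = [matrix[i + di][j + dj]
                      for di in range(size)
                      for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical output of a convolution step.
activations = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

pooled = max_pool2d(activations)
print(pooled)  # [[6, 2], [2, 9]]
```

The pooled matrix has a quarter as many cells as the input, so every later layer has correspondingly less work to do.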

VI. EXISTING SYSTEM

Modern systems are actively evolving in terms of the neural networks they use, yet the underlying architecture remains the same. A typical phishing detection system looks like this:

A. Email Input

The email text is provided as input to the system.

B. Feature Extraction and Preprocessing

This stage involves extracting relevant features from the email text, such as text content, sender information, attachments, etc. The data is preprocessed to make it suitable for analysis.

C. Phishing Detection

This step uses existing phishing detection techniques, which could be based on machine learning models or rule-based systems. These systems analyze the extracted features and determine whether the email is phishing or not.

D. Detection Result

The final output is whether the email is classified as phishing or not phishing.

Existing email detection systems have many flaws, although their strengths cannot be overlooked: such systems are used across many industries and organizations for protection against many kinds of phishing attacks. They typically target standard templates of phishing emails, trying to detect and stop them before they are delivered to the end user. Several upgrades now allow them to detect newer types of emails sent with malicious intent. Yet features such as headers, text images, URLs, and ASCII codes can still mislead the system into treating phishing emails as legitimate. Several other limitations are explored in detail in the next section.
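The four stages above (input, feature extraction, detection, result) can be sketched as a small rule-based pipeline. This is a hypothetical illustration only: the phrase list, the scoring rule, the threshold, and the trusted domain are invented for the example and do not describe any particular deployed product.

```python
import re

# Hypothetical phrases that a template-based rule set might look for.
SUSPICIOUS_PHRASES = (
    "verify your account",
    "urgent action required",
    "password will expire",
)
URL_PATTERN = re.compile(r"https?://\S+")
IP_URL_PATTERN = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}")

def extract_features(sender, subject, body):
    # Feature extraction and preprocessing: turn raw fields into simple counts.
    text = f"{subject} {body}".lower()
    urls = URL_PATTERN.findall(body)
    return {
        "phrase_hits": sum(phrase in text for phrase in SUSPICIOUS_PHRASES),
        "url_count": len(urls),
        "ip_urls": sum(bool(IP_URL_PATTERN.match(url)) for url in urls),
        "sender_domain": sender.rsplit("@", 1)[-1].lower(),
    }

def classify(features, trusted_domains=frozenset({"example.com"})):
    # Phishing detection: a hand-written scoring rule (assumed, not from the paper).
    score = features["phrase_hits"] + 2 * features["ip_urls"]
    if features["sender_domain"] not in trusted_domains and features["url_count"] > 0:
        score += 1
    return "phishing" if score >= 2 else "not phishing"

template_phish = classify(extract_features(
    "security@paypa1-support.test",
    "Urgent action required",
    "Please verify your account at http://192.168.0.1/login now.",
))
legitimate = classify(extract_features(
    "alice@example.com", "Lunch", "See you at noon."))
novel_phish = classify(extract_features(
    "hr@corp-benefits.test", "Updated benefits",
    "Your new benefits form is attached."))

print(template_phish, legitimate, novel_phish)
```

The third message is meant as a phishing attempt that follows no known template; because it matches none of the predefined rules, the sketch labels it as not phishing, which illustrates the kind of weakness discussed in the next section.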

VII. LIMITATIONS OF EXISTING SYSTEM

A. False Positives and False Negatives

Existing systems may generate false positives (legitimate emails classified as phishing) and false negatives (phishing emails classified as legitimate). Achieving a balance between these two types of errors is challenging.
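The two error types can be quantified from a set of predictions. The sketch below assumes labels of 1 for phishing and 0 for legitimate; the example labels and predictions are invented for illustration and are not results from this study.

```python
def confusion_counts(y_true, y_pred):
    # Count true/false positives and negatives (1 = phishing, 0 = legitimate).
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1   # legitimate email flagged as phishing
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1   # phishing email that slipped through
    return tp, fp, tn, fn

def error_rates(tp, fp, tn, fn):
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical ground truth and predictions for eight emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = error_rates(*counts)
print(counts, metrics)
```

Lowering a detector's decision threshold typically reduces the false negative rate at the cost of a higher false positive rate, which is why balancing the two is difficult.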

B. Evolution of Phishing Techniques

Phishers continually adapt and develop new techniques to evade detection. Existing systems may struggle to keep up with evolving phishing tactics.

C. Zero-Day Attacks

Rule-based systems may fail to detect zero-day phishing attacks that employ entirely new strategies, as these systems rely on predefined rules or patterns.

D. Imbalanced Datasets

Machine learning-based systems require large and balanced datasets for training. In practice, obtaining representative datasets with sufficient phishing examples can be challenging.
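When a balanced dataset cannot be obtained, one common mitigation, which is not part of the approach described in this paper, is to weight each class inversely to its frequency during training. The sketch below uses the inverse-frequency heuristic n_samples / (n_classes * class_count), which to our understanding is the same rule behind scikit-learn's "balanced" class-weight option; the 90/10 split is an invented example.

```python
from collections import Counter

def balanced_class_weights(labels):
    # weight(c) = n_samples / (n_classes * count(c)), so rare classes weigh more.
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Hypothetical dataset: 90 legitimate emails (0) and 10 phishing emails (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, a mistake on a phishing example counts nine times as much as a mistake on a legitimate one, so the model cannot score well simply by always predicting the majority class.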

E. Feature Engineering

Traditional machine learning approaches often require manual feature engineering, which can be time-consuming and may not capture all relevant features.

F. Lack of Generalization

Some systems may perform well on specific types of phishing attacks but may struggle to generalize to different variations or new types of attacks.

G. Overfitting

Machine learning models can overfit to the training data, leading to poor performance on unseen data.

H. Resource Intensive

Some machine learning models, especially deep learning models, can be computationally expensive and may not be feasible for all organizations, especially smaller ones with limited resources.

I. Privacy Concerns

Some phishing detection systems may involve the analysis of email content, raising privacy concerns related to user data.

J. Scalability

As the volume of email traffic grows, scalability becomes a concern for some systems. Scalable deployment and real-time detection can be challenging.

K. Learning over Time

Unlike systems that rely on predefined rules, machine learning models like CNNs can continuously improve their performance as they receive more data and feedback. This allows for ongoing refinement and adaptation to changing phishing techniques.

L. Potential for Real-Time Detection

Once trained, CNN-based models can make predictions in real-time, providing immediate detection of phishing emails.

VIII. PROPOSED SYSTEM

Building upon the foundation of existing phishing email detection techniques and leveraging Convolutional Neural Networks (CNNs), we propose an advanced system that enhances the accuracy and effectiveness of identifying phishing emails. Our system aims to address the evolving nature of phishing attacks and provide robust protection against cyber threats.

This is the base idea behind the proposed system. The code implementation is provided in the module section of this paper. The major steps begin with email input, followed by feature extraction similar to that of the existing systems. We then apply text tokenization and sequence padding: the text is tokenized, only the most frequently occurring words are kept for the convolution layer, and each resulting sequence is padded to a common length. Next, we take a phishing email repository that can be found online (details are provided in the reference section).

Our system will employ a deep learning architecture, specifically a CNN, designed to analyze email content comprehensively. CNNs excel in image and text analysis and have demonstrated remarkable capabilities in feature extraction and pattern recognition. The CNN will be configured to automatically extract relevant features from email text, including linguistic patterns, structural characteristics, and textual cues indicative of phishing attempts. In addition to email text, the proposed system may incorporate multiple input channels, such as email headers, sender information, and metadata, allowing for a holistic analysis of emails.

Our system will provide real-time monitoring capabilities, enabling it to scan incoming emails for potential phishing threats as they arrive in the inbox. We may explore the use of ensemble learning techniques, combining the predictions of multiple models to improve accuracy and reduce false positives. An intuitive and user-friendly interface will be developed to allow users to interact with the system, report suspicious emails, and customize detection settings.
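The tokenization and sequence padding step can be illustrated with a short plain-Python stand-in. This is a sketch under stated assumptions, not the authors' code: the helper names, the reserved indices (0 for padding, 1 for out-of-vocabulary words), the vocabulary size, the choice to pad and truncate at the end of each sequence, and the sample emails are all invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and split it into word-like tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts, num_words):
    # Keep only the `num_words` most frequent tokens.
    # Index 0 is reserved for padding and 1 for out-of-vocabulary tokens.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(num_words))}

def texts_to_sequences(texts, vocab, oov_index=1):
    # Replace each token with its integer index.
    return [[vocab.get(tok, oov_index) for tok in tokenize(text)] for text in texts]

def pad_sequences(sequences, maxlen, pad_index=0):
    # Truncate long sequences and right-pad short ones to exactly `maxlen`.
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        padded.append(seq + [pad_index] * (maxlen - len(seq)))
    return padded

# Hypothetical sample emails.
emails = [
    "Verify your account now",
    "Your account is locked, verify now",
    "Lunch at noon?",
]

vocab = build_vocab(emails, num_words=4)
sequences = texts_to_sequences(emails, vocab)
padded = pad_sequences(sequences, maxlen=5)
print(vocab)
print(padded)
```

After this step every email is a fixed-length list of integers, which is the rectangular input shape a convolution layer expects.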

IX. ADVANTAGES OF PROPOSED SYSTEM

A. Deep Learning Architecture

Our system will employ a deep learning architecture, specifically a CNN, designed to analyze email content comprehensively. CNNs excel in image and text analysis and have demonstrated remarkable capabilities in feature extraction and pattern recognition.

B. Multiple Input Channels

In addition to email text, the proposed system may incorporate multiple input channels, such as email headers, sender information, and metadata, allowing for a holistic analysis of emails.

C. User-Friendly Interface

An intuitive and user-friendly interface will be developed to allow users to interact with the system, report suspicious emails, and customize detection settings.

D. Enhanced Accuracy

The deep learning architecture, combined with feature extraction capabilities, is expected to significantly enhance the accuracy of phishing email detection.

E. Real-Time Protection

By offering real-time monitoring, our system can swiftly identify and respond to phishing threats, reducing the risk of successful attacks.

F. Adaptability

Regular model updates ensure that the system remains effective against evolving phishing techniques and tactics.

G. Holistic Analysis

Multiple input channels and comprehensive feature extraction enable a holistic analysis of emails, improving detection capabilities.

H. Ensemble Learning

We may explore the use of ensemble learning techniques, combining the predictions of multiple models to improve accuracy and reduce false positives.
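One simple way to combine predictions is soft voting, in which the phishing probabilities reported by several models are averaged before a threshold is applied. The sketch below is illustrative only; the three model outputs and the 0.5 threshold are assumed values, and the paper does not specify which ensemble method would be used.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    # Weighted average of the per-model phishing probabilities.
    if weights is None:
        weights = [1.0] * len(probabilities)
    average = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    label = "phishing" if average >= threshold else "not phishing"
    return average, label

# Hypothetical outputs of three models for the same email.
avg_a, label_a = soft_vote([0.9, 0.4, 0.7])

# One model raises a false alarm (0.6) but is outvoted by the other two.
avg_b, label_b = soft_vote([0.6, 0.2, 0.3])

print(round(avg_a, 3), label_a)
print(round(avg_b, 3), label_b)
```

The second call shows how averaging can suppress a false positive that a single model would have produced on its own.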

I. Regular Model Updating

To keep the system current and adaptive to emerging threats, regular model updates will be scheduled. These updates will incorporate the latest data and threat intelligence.

The proposed phishing email detection system represents a significant advancement in email security. By leveraging deep learning, real-time monitoring, and ensemble techniques, we aim to provide robust protection against phishing threats.

As we move forward with the development and implementation of this system, our commitment to staying at the forefront of cybersecurity remains unwavering, with the ultimate goal of safeguarding individuals and organizations from the ever-present dangers of phishing attacks.

X. IMPLEMENTATION OF THE PROPOSED SYSTEM
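As a rough illustration of the kind of computation a text CNN performs on a padded token sequence, the sketch below runs one forward pass: embedding lookup, one-dimensional convolution with ReLU, global max pooling, and a sigmoid output. It is not the authors' code. All sizes are assumptions, and the weights are random and untrained, so the resulting score carries no meaning; a practical implementation would more likely use a deep learning framework such as TensorFlow, which the reference list cites.

```python
import math
import random

def embed(sequence, embedding):
    # Look up one vector per token index.
    return [embedding[index] for index in sequence]

def conv1d(embedded, kernel, bias):
    # Slide a kernel spanning `len(kernel)` consecutive tokens over the sequence,
    # taking a weighted sum at each position followed by ReLU.
    width = len(kernel)
    activations = []
    for start in range(len(embedded) - width + 1):
        z = bias
        for k in range(width):
            z += sum(w * x for w, x in zip(kernel[k], embedded[start + k]))
        activations.append(max(0.0, z))
    return activations

def predict(sequence, embedding, kernels, biases, out_weights, out_bias):
    embedded = embed(sequence, embedding)
    # Global max pooling: keep each filter's strongest response.
    pooled = [max(conv1d(embedded, k, b)) for k, b in zip(kernels, biases)]
    z = sum(w * p for w, p in zip(out_weights, pooled)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid: probability-like score

# Hypothetical sizes and random (untrained) weights, seeded for repeatability.
rng = random.Random(0)
vocab_size, embed_dim, kernel_width, n_filters = 6, 4, 3, 2

def random_vector(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

embedding = [random_vector(embed_dim) for _ in range(vocab_size)]
kernels = [[random_vector(embed_dim) for _ in range(kernel_width)]
           for _ in range(n_filters)]
biases = [0.0] * n_filters
out_weights = random_vector(n_filters)

padded_sequence = [2, 3, 4, 5, 0]  # e.g. a tokenized, padded email
score = predict(padded_sequence, embedding, kernels, biases, out_weights, 0.0)
print(score)
```

Training would adjust the embedding, kernel, and output weights so that the score approaches 1 for phishing emails and 0 for legitimate ones.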

XII. FUTURE ADVANCEMENTS

While the proposed system holds great promise, we believe it would benefit from future enhancements in several key areas:

A. Expansion of Datasets

Quantity: Increasing the scale of the training dataset is vital. A larger and more diverse dataset can help enhance model generalization and robustness.
Diverse Sources: Diversifying the sources of the dataset by including emails from various industries, regions, and email clients can better simulate real-world conditions.
Balanced Distribution: Ensuring a balanced distribution of phishing and legitimate emails within the dataset helps the model avoid biases and perform more effectively across both classes.

B. Multilingual Support

Language Diversity: Extending language support beyond English to cover a wide range of languages is essential in addressing international phishing threats.
Multilingual Training: Training the model to detect phishing attempts in multiple languages, with appropriate tokenization and embedding techniques, enhances its applicability in multicultural contexts.

C. Variety of Emails

Email Types: Expanding the scope of the system to analyze different kinds of emails, including promotional emails, newsletters, and transactional emails, can offer a more nuanced understanding of email content.
Rich Media: Incorporating support for emails with rich media content, such as images, attachments, and embedded hyperlinks, can provide more comprehensive analysis capabilities.

D. Enhanced Model Architecture

Ensemble Models: Exploring ensemble learning techniques that combine the predictions of multiple models can further enhance accuracy and reduce false positives.
Recurrent Architectures: Integrating recurrent neural network (RNN) components alongside CNNs can help capture sequential patterns within emails more effectively.

E. Continuous Research and Collaboration

Collaboration: Engaging in collaborative research efforts with academia, cybersecurity organizations, and industry partners to stay at the forefront of phishing threat intelligence.
Threat Intelligence Integration: Incorporating external threat intelligence feeds and APIs to enhance the system's threat detection capabilities.

Conclusion

In an ever-changing cybersecurity landscape, combating phishing email attacks is a daunting challenge. This paper presented a detailed study of phishing email detection using the capabilities of Convolutional Neural Networks (CNNs). Through an in-depth analysis of existing systems, a solid foundation has been laid for an improved system that can reduce the risks associated with phishing attempts. The proposed system combines cutting-edge deep learning techniques with real-time monitoring, adaptability, and user-friendliness, offering a multifaceted approach to the detection of phishing emails. As we conclude this paper, it is vital to emphasize that while technology plays a pivotal role in the fight against cyber threats, user awareness, education, and vigilance are equally critical. Together, through a combination of innovative solutions and informed practices, we can create a formidable defense against phishing emails, fortifying the security of our digital world.

References

[1] “A Deep Learning-Based Phishing Detection System Using CNN, LSTM, and LSTM-CNN” by Zainab Alshingiti, Rabeah Alaqel, Jalal Al-Muhtadi, Qazi Emad Ul Haq, Kashif Saleem, and Muhammad Hamza Faheem
[2] “Convolutional Neural Network Optimization for Phishing Email Classification” by Cameron McGinley and Sergio A. Salinas Monroy
[3] https://www.statista.com/topics/8385/phishing/
[4] https://www.analyticsvidhya.com/blog/2021/03/introduction-to-long-short-term-memory-lstm/#:~:text=LSTM%20(Long%20Short%2DTerm%20Memory,ideal%20for%20sequence%20prediction%20tasks.
[5] https://www.tensorflow.org/api_docs

Copyright

Copyright © 2023 Akshay Merugu, Hrishikesh Goud Chagapuram, Rahul Bollepalli. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET56143

Publish Date : 2023-10-13

ISSN : 2321-9653

Publisher Name : IJRASET
