The abundance of unwanted spam messages complicates the use of Short Message Service (SMS) for efficient communication. This study investigates the development and use of a Naive Bayes-based ham/spam detection system. The Naive Bayes classifier is chosen for its ease of use and its effectiveness in text classification tasks. The dataset used for training and testing consists of SMS messages labeled as "spam" or "ham" (non-spam). Preprocessing methods, including tokenization, stop-word removal, and stemming, are employed to extract pertinent features from the text messages. From labeled examples in the dataset, the Naive Bayes classifier learns how strongly each word is associated with spam or non-spam messages. The classifier's performance is evaluated on a separate testing set using accuracy, precision, and the confusion matrix. Additionally, the impact of varying parameters such as smoothing techniques and feature selection methods on the classifier's performance is analyzed. The experimental results show that the classifier can distinguish between ham and spam messages in SMS communication.
I. INTRODUCTION
A. Overview
SMS is one of the most widely used methods of everyday communication. Because of this extensive usage, it is a frequent target for spam: in 2022, the average user received 19.5 spam SMS messages per month, an increase of 15 percent over the previous year. Nearly three out of five Americans (58 percent) said they received more spam texts than in the previous year. Spam messages are useless messages that contain unwanted marketing promotions or serve as a social engineering tool for hackers.
B. Naive Bayes
Naive Bayes is a supervised machine learning method derived from Bayes' theorem. It is widely applied to text categorization, where training datasets are high dimensional. For SMS spam filtering, this study uses multinomial Naive Bayes with a holdout evaluation strategy.
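As a rough illustration of the multinomial Naive Bayes approach named above, the following minimal sketch scores a message for each class by adding the log prior to the smoothed log likelihood of each of its words, then picks the highest-scoring class. It uses only the Python standard library. The function names `tokenize`, `train_nb`, and `predict_nb`, the whitespace tokenizer, the add-one (Laplace) smoothing value, and the toy messages are illustrative assumptions and are not taken from this study's implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def train_nb(messages, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total = len(labels)
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Additive smoothing: every vocabulary word gets alpha pseudo-counts.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_like, vocab


def predict_nb(model, text):
    """Return the class with the highest posterior log score."""
    log_prior, log_like, vocab = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += log_like[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training data (hypothetical, for illustration only).
train_messages = [
    "win a free prize now",
    "free cash offer call now",
    "are we meeting for lunch today",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_messages, train_labels)
print(predict_nb(model, "free prize offer"))
print(predict_nb(model, "see you at lunch"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.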
II. LITERATURE REVIEW
The study in [7] proposes a hybrid bagging technique for spam email detection that combines the J48 decision tree and Naive Bayes algorithms. Through dataset division and comparison of results, the hybrid system achieves an accuracy of 87.5 percent.
The review in [2] surveys recent advances in machine learning-based spam filtering. It emphasizes the need to account for problem-specific characteristics such as concept drift and highlights the difficulty of updating classifiers built on bag-of-words representations. Although progress has been made, more realistic evaluation settings still need to be explored.
The paper in [6] investigates several forms of Naive Bayes for spam email filtering. By comparing them on realistic datasets, it shows that the choice of Naive Bayes variant matters. Its incremental training approach and ROC curves give insight into performance trade-offs.
The paper in [10] offers a novel explanation for the strong performance of Naive Bayes in classification tasks, based on how dependencies are distributed among attributes. Even when strong dependencies exist, Naive Bayes can still be the best option if they are distributed evenly or cancel each other out. The study also explores optimality conditions under Gaussian distributions.
The paper in [8] highlights the challenges posed by email spam and its impact on users. It proposes a model based on Bayes' theorem and the Naive Bayes classifier to detect spam messages, and it takes the sender's IP address into account to improve spam identification.
The review in [3] examines the challenges of spam email detection, emphasizing the dynamic environment and the presence of adversarial spammers. Unlike traditional reviews, it delves into real-world issues and the strategies used by spammers. Its empirical evaluation highlights the impact of dataset shift, revealing potential performance degradation.
The paper in [9] introduces the Naive Bayes classifier as a probabilistic approach to classification, emphasizes its versatility across domains, and provides an implementation. The implementation is tested on a sample dataset to verify the correctness of the probabilistic computations.
The paper in [4] addresses email spam, which carries risks such as phishing and fraud. It applies machine learning algorithms to identify fraudulent spam emails, evaluates several techniques, and selects the best algorithm based on precision and accuracy.
The research in [5] builds a keyword corpus exclusively from email content and adds text processing techniques to counter the obfuscation methods used by spammers. Tested on the CSDMC2010 SPAM corpus, with 4,327 emails in the training set and 4,292 emails in the testing set, the algorithm attained an accuracy of 92.8 percent.
The research in [1] tackles SMS spam with several machine learning techniques, including logistic regression, Support Vector Machine (SVM), Naive Bayes, and neural networks. Comparing the accuracy of these methods, the study concludes that neural networks outperform the other techniques and are the most effective classifier for distinguishing between ham and spam messages.
III. DATASET
The dataset used for this study is spam.csv, sourced from kaggle.com. It contains the SMS Spam Collection, a set of labeled SMS messages gathered for research on SMS spam detection. The collection includes 5,574 messages in English, each categorized as either ham (legitimate) or spam. The counts used in this study are 4,516 ham messages and 653 spam messages (5,169 in total), which is consistent with duplicate messages having been removed from the original 5,574. The dataset has two columns, v1 and v2: v1 holds the label, which takes the value ham or spam, and v2 holds the message text.
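A minimal sketch of how such a file could be read and divided for holdout evaluation is given below. It assumes the Kaggle layout with a header row in which v1 is the label and v2 is the message text. The helper names `load_sms` and `holdout_split`, the 80/20 split ratio, the random seed, and the inline sample rows are illustrative assumptions, not the configuration used in this study.

```python
import csv
import io
import random
from collections import Counter


def load_sms(handle):
    """Read (label, text) pairs from a CSV with header columns v1 and v2."""
    reader = csv.DictReader(handle)
    return [(row["v1"].strip().lower(), row["v2"]) for row in reader]


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and reserve a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Inline stand-in for spam.csv so the sketch runs without the real file.
sample_csv = io.StringIO(
    "v1,v2\n"
    "ham,Are we still meeting for lunch?\n"
    'spam,"WINNER! Claim your free prize now, call today"\n'
    "ham,I will be home late tonight\n"
    "ham,Can you send me the notes\n"
    "spam,Free entry in a weekly competition text WIN\n"
)

rows = load_sms(sample_csv)
label_counts = Counter(label for label, _ in rows)
train_rows, test_rows = holdout_split(rows, test_fraction=0.2, seed=42)

print(label_counts)
print(len(train_rows), len(test_rows))
```

To use the real file, the same `load_sms` function can be given an open file handle instead of the in-memory sample; the appropriate text encoding for the downloaded file is left to the reader to confirm. Fixing the seed keeps the train/test partition reproducible across runs.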
Conclusion
In this study, we developed and implemented a Naive Bayes classifier for SMS spam identification using the SMS Spam Collection dataset. Through preprocessing and parameter analysis, we were able to differentiate spam from ham messages with a notable degree of accuracy. Our findings demonstrate that the Naive Bayes algorithm performs well on this task. To improve the model further, future work could explore alternative machine learning strategies and refine the feature selection methods.
References
[1] Amani Alzahrani and Danda B Rawat. Comparative study of machine learning algorithms for sms spam detection. In 2019 SoutheastCon, pages 1–6. IEEE, 2019.
[2] Thiago S Guzella and Walmir M Caminhas. A review of machine learning approaches to spam filtering. Expert Systems with Applications, 36(7):10206–10222, 2009.
[3] Francisco Jáñez-Martino, Rocío Alaiz-Rodríguez, Víctor González-Castro, Eduardo Fidalgo, and Enrique Alegre. A review of spam email detection: analysis of spammer strategies and the dataset shift problem. Artificial Intelligence Review, 56(2):1145–1173, 2023.
[4] Nikhil Kumar, Sanket Sonowal, et al. Email spam detection using machine learning algorithms. In 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), pages 108–113. IEEE, 2020.
[5] Pingchuan Liu and Teng-Sheng Moh. Content based spam e-mail filtering. In 2016 International Conference on Collaboration Technologies and Systems (CTS), pages 218–224. IEEE, 2016.
[6] Vangelis Metsis, Ion Androutsopoulos, and Georgios Paliouras. Spam filtering with naive bayes - which naive bayes? In CEAS, volume 17, pages 28–69. Mountain View, CA, 2006.
[7] Priti Sharma and Uma Bhardwaj. Machine learning based spam e-mail detection. International Journal of Intelligent Engineering & Systems, 11(3), 2018.
[8] Thashina Sultana, KA Sapnaz, Fathima Sana, and Jamedar Najath. Email based spam detection. International Journal of Engineering Research & Technology (IJERT), 2020.
[9] Feng-Jen Yang. An implementation of naive bayes classifier. In 2018 International conference on computational science and computational intelligence (CSCI), pages 301–306. IEEE, 2018.
[10] Harry Zhang. The optimality of naive bayes. Aa, 1(2):3, 2004.