Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Neha Reddy, Neha Ram , Pallavi L, Dr. K Sree Rama Murthy, Dr. K. Kranthi Kumar
DOI Link: https://doi.org/10.22214/ijraset.2022.44500
Certificate: View Certificate
Conversational toxicity is a problem that might drive people to cease truly expressing themselves and seeking out other people\'s opinions out of fear of being attacked or harassed. The purpose of this research is to employ natural language processing (NLP) techniques to detect toxicity in writing, which might be used to alert people before transmitting potentially toxic informational messages. Natural language processing (NLP) is a part of machine learning that enables computers to comprehend natural language. Understanding, analysing, manipulating, and maybe producing human data language with the help of a machine are all possibilities. Natural Language Processing (NLP) is a type of artificial intelligence that allows machines to understand and interpret human language instead of simply reading it. Machines can understand written or spoken text and execute tasks such as speech recognition, sentiment analysis, text classification, and automatic text summarization using natural language processing (NLP).
I. INTRODUCTION
In the early days of the Internet, people communicated only by e-mail that encountered spam. At the time, classifying emails as good or bad, whether or not spam was a challenge. Internet communications and data flows have changed significantly over time, especially since the development of social networks. With the advent of social media, classifying content as “good” or “bad” has become more important than ever to prevent social harm and prevent people from engaging in antisocial behavior.
A. Purpose
Every day, vast amounts of data are released by social media networks. This massive volume of data is having a big impact on the quality of human existence However, because there is so much toxicity on the Internet, it can be harmful. Because toxic comments limit people's ability to express themselves and have different opinions, as a result of the negative, there are no positive conversations on social networks. As a result, detecting and restricting antisocial behaviour in online discussion forums is a pressing requirement. These toxic comments might be offensive, menacing, or disgusting. Our goal is to identify these harmful remarks.
B. Project Overview
To classify noxious comments, this study will use different methods of machine learning. To handle the problem of text categorization, Some of the methods we use are logistic entertainment, random forest, SVM classifier, multi-navigation database, and XGBoost classifier. As a result, we maintain a data set using six machine learning algorithms, evaluate their accuracy, and compare them.. We choose the model with the highest accuracy and forecast the toxicity based on unseen data based on its accuracy level.
II. LITERATURE SURVEY
A. Existing Systems
2. LSTM (Long Short Term Memory): Long-term memory is abbreviated as LSTM. According to memory, LSDM is a kind of continuous neural network that works better than conventional continuous neural networks. LSTM is certainly better when it comes to memorizing specific patterns. Like other NNs, LSTM can have multiple hidden layers, storing the relevant data in each cell as it passes through each level, and rejecting inappropriate data. LSTM has a memory function that allows you to store data sequences. It also has another function: it works to delete unwanted data, and since we all know that text data has a lot of unnecessary data, LSTM can delete it to save time and money on calculations. As a result, LSTM's ability to memorize data sequences while deleting additional data makes it an effective text classification tool.
B. Proposed System
We will use five machine learning algorithms to classify the data the online harmful comments: logistic regression, random forest, SVM classifier, Naive Bayes, and XGBoost classifier. We'll use logistic regression to determine whether or not a comment is harmful because it either belongs to the poisonous group or does not. SVM classifiers can be used to separate data values, and can also be used in XGBoost and Random Forest systems. Because we can classify ideas into a wide range of destructive and non-toxic, we will use the concept of results in both directions. Because our data are independent of each other and have nothing in common with two different concepts, we use the Nave Bayes classifier to classify them. Because the data is marked, we can immediately use a controlled machine learning algorithm.
2. Random Forest: Random Forest is a controlled machine learning approach to classification and regression problems. Uses an explicit majority for classification and a moderate delay to create a decision tree of different data. One of the main features of the random forest algorithm is the ability to manage a set of data with classified and continuous variables, such as regression and classification. As for the rating, it performs better than its competitors.
STEPS
Step 1: Random Forest takes random entries from the Q-record database..
Step 2: A separate end tree is created for each model.
Step 3: Each end tree will have an output.
Step 4: The final decision on classification and regression is based on an appropriate or intermediate majority.
3. SVM Classifier: The reference vector machine (SVM) is a machine learning technique with classification and regression monitoring. SVM defines a cloud page that classifies the boundaries between two data sets. SVM is often used to classify data, although it can also be used for regression. This is a fast and reliable method that works well with low data. Another advantage of SVM is that it can explore various input functions without increasing system problems, using different types of core functions.
The Bayes theorem is the foundation of the Naive Bayes algorithm. It's primarily utilized for categorization tasks. In classification, we teach the model what each class belongs to using a labeled dataset, and then the model learns and classifies or labels a new dataset that has never been seen before. All of the features or variables in the Naive Bayes model are treated as independent of one another.
4. XGBoost Classifier: Gradual gain, such as Random Forest (another end-of-tree algorithm), is a controlled machine learning approach for topics such as classification (men, women) and regression (expected value). The most common names to implement this method are Gradient Boosting Machines (GBM) and XGBoost. Gradient Boosting is a training group similar to Random Forest. This means that it links the final model to a set of individual models. These individual models have a poor prognosis and are very consistent, but combining several of these weak models into one group will give the best results. Random forest-related end trees are the most popular weak sample used in gradient reinforcement machines.
III. PROJECT REQUIREMENTS
A. Python
Python is an advanced interactive and semantic programming language. It is written dynamically and has a ridiculously simple syntax. It is written in plain English and has no syntax like Java or C. Seriously, it follows a stroke. Using language keywords, such as when, how, in, etc., makes the code much clearer and simpler. It is a more semantic language because it is a structured language with all the complete structures necessary for a specialized synthetic language. Python is an interactive programming language. Python does not have a compiler, so the compiler executes the code directly at runtime. This is similar to how JS and PHP work.
B. Numpy
NumPy is a Python module that uses arrays to perform a wide range of functions. To ensure faster execution, it can be implemented using the C and C ++ libraries. It works faster with the NumPy module than similar performance without it. This ability to fill signs with pure math is still a correction in February. It will not be added automatically when Python is inserted. Must be installed at the same time as the NumPy pip installation command. The Python number is indicated by NumPy.
C. Pandas
Pandas are a free, open source architecture that facilitates and directly manages linked and encrypted data. It includes several data structures and methods for working with numerical data and time series data. This library is based on the NumPy Python library. Pandas are a fast-paced program, and its users benefit from its speed and performance.
D. Matplotlib
For 2D series diagrams, Matplotlib is a great Python imaging library. Matplotlib is a cross-platform NumPy-based data imaging software compatible with the entire SciPy stack. First launched in 2002 by John Hunter.
One of the most important advantages of imaging is that it allows you to see large amounts of data in easy-to-understand images. Line, strip, skater, histogram and many other maps are available on Matplotlib.
E. Seaborn
Seaborn is a great Python imager for programming statistical graphics. It has attractive default styles and a color palette that make statistical maps very attractive. It uses matplotlib software and is closely related to the panda data structure.Seaporn's mission is to develop data visualization as an important part of data recognition and understanding. It provides a database-based API that allows you to switch between multiple visual representations of a variable for a better understanding of the data.
F. Scikit-Learn
Scikit-Learn is a free Python machine learning library. Supports both controlled and uncontrolled machine learning and includes a number of algorithms for classification, regression, grouping, and size reduction. This library was created using many libraries you know, such as NumPy and SciPy. It can also be used with other libraries, such as Panda and Seaborn.
IV. SYSTEM DESIGN
A. Architecture
We upload our dataset first, then perform preprocessing such as stemming and removing stop words, before applying our machine learning models and calculating accuracy for each model. Finally, the most accurate model is chosen.
The best model is then used to estimate toxicity based on unknown data.
V. OUTPUTS
VI. FUTURE SCOPE
Other machine learning models can be used to calculate accuracy, interference loss, and loss recording to improve performance in future research. We consider RNN, BERT, multilayer perceptron and GRU, all of which are deep learning methods. As a result, we can explore a variety of alternative strategies to help improve the end result.
We tested the accuracy of five machine learning methods: logistic rest, ship bay, random forest, SVM classifier, and XGboost classifier. After careful consideration, we can conclude that the logistic delay has the best performance in terms of accuracy, as it is the most accurate model among other models. As a result, we choose our final model based on its precise nature. If there is a logistic regression model, we get the maximum. We will use the logistic delay model as our newest machine learning strategy, as it works best with our data.
[1] Rahul, H. Kajla, J. Hooda and G. Saini, \"Classification of Online Toxic Comments Using Machine Learning Algorithms,\" 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), 2020, pp. 1119-1123, doi: 10.1109/ICICCS48265.2020.9120939 [2] M. Duggan, “Online harassment 2017,” Pew Res., pp. 1–85, 2017, doi: 202.419.4372. [3] M. A. Walker, P. Anand, J. E. F. Tree, R. Abbott, and J. King, “A corpus for research on deliberation and debate,” Proc. 8th Int . Conf. Lang. Resour. Eval. Lr. 2012, pp. 812–817, 2012. [4] J. Cheng, C. Danescu-Niculescu-Mizil, and J. Leskovec, “Antisocial behavior in online discussion communities,” Proc. 9th Int. Conf. Web Soc. Media, ICWSM 2015, pp. 61–70, 2015. [5] B. Mathew et al., “Thou shalt not hate: Countering online hate speech,” Proc. 13th Int. Conf. Web Soc. Media, ICWSM 2019, no. August, pp. 369–380, 2019. [6] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang, “Abusive language detection in online user content,” 25th Int. World Wide Web Conf. WWW 2016, pp. 145–153, 2016, doi: 10.1145/2872427.2883062. [7] E. K. Ikonomakis, S. Kotsiantis, and V. Tampakas, “Text Classification Using Machine Learning Techniques,” no. August, 2005. [8] M. R. Murty, J. V. . Murthy, and P. Reddy P.V.G.D, “ Text Document Classification basedon Least Square Support Vector Machines with Singular Value Decomposition,” Int. J. Comput. Appl., vol. 27, no. 7, pp. 21–26, 2011, doi: 10.5120/3312-4540. [9] E. Wulczyn, N. Thain, and L. Dixon, “Ex machina: Personal attacks seen at scale,” 26th Int. World Wide Web Conf. WWW 2017, pp. 1391– 1399, 2017, doi: 10.1145/3038912.3052591. [10] H. Hosseini, S. Kannan, B. Zhang, and R. Poovendran, “Deceiving Google’s Perspective API Built for Detecting Toxic Comments,” 2017, [Online]. Available: http://arxiv.org/abs/1702.08138.
Copyright © 2022 Neha Reddy, Neha Ram , Pallavi L, Dr. K Sree Rama Murthy, Dr. K. Kranthi Kumar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET44500
Publish Date : 2022-06-18
ISSN : 2321-9653
Publisher Name : IJRASET
DOI Link : Click Here