Fake Review Detection Using Machine Learning

Authors: M. Brindha, A. Arulselvan Gnanamonickam

DOI Link: https://doi.org/10.22214/ijraset.2023.56364

Abstract

Customer reviews on e-commerce platforms and online services are valuable to both users and vendors. They build brand loyalty, give insight into product experiences, and help vendors increase sales through positive feedback. Unfortunately, this system can be exploited, with fake reviews used to manipulate reputations. Instead of a limited dataset, the proposed system employs a diverse vocabulary sourced from various subjects. Raw data is collected from multiple channels and refined by removing irrelevant, redundant, and unreliable information. The application implements sentiment analysis to categorize reviews and to detect and classify fake ones. Testing involves the Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest algorithms. The solution's core is machine learning, and the highest accuracy is achieved with Random Forest combined with sentiment analysis. The application also extracts post frequency and reviewer response times for further analysis.

I. INTRODUCTION

Social media and online posting have made it easier than ever to express opinions publicly and boldly. These opinions have benefits and drawbacks. They are beneficial when they are used to inform others or to deliver feedback to the person who can help resolve a problem, and for this reason they are considered valuable. That same value makes it tempting for people with bad intentions to manipulate the system: they pose as sincere reviewers and post comments endorsing their own goods or disparaging those of competitors, all without disclosing their true identities or the companies they work for. These individuals are known as opinion spammers, and their activity is called opinion spamming.

Sentiment analysis, also known as opinion mining, is the process of building a system to gather and analyse comments, reviews, and tweets about a product posted on social media, as well as online evaluations of products and services. Opinion mining serves a wide range of users:

Individual consumers: Before making a choice, a buyer can compare the review summaries of a product against those of rival products, ensuring they do not pass up a superior option.
Businesses/Sellers: Opinion mining helps vendors connect with their target market and learn how customers see their offerings and those of their rivals. Encouraging customers to submit product reviews has proven to be an effective way to promote a product through the voices of actual customers.

II. PROPOSED SYSTEM

The method uses Naive Bayes, SVM, Random Forest, and logistic regression to classify reviews with greater accuracy. The reviews come from freely available datasets spanning various sources and categories, including service-based, product-based, customer-feedback, and experience-based reviews, as well as a crawled Amazon dataset. In addition to the review text, other features are employed to increase accuracy, such as a comparison of the review's sentiment, verified-purchase status, rating, review frequency, and product category with the total score. These features are identified and used to build a classifier. Based on the classified training sets, each feature is then given a weight or a probability factor.

The frequency of posts and the time taken by reviewers are also extracted. This is a supervised learning technique that applies different machine learning algorithms to label reviews as fake or genuine.

The advantages are:

The highest accuracy is obtained with Random Forest.
A varied set of datasets is used.
In addition to the sentiment of the review, the user's behavioral features, such as the reviewer's frequency of posts and the time taken to post reviews, are taken into account.
Behavioral feature analysis improves the performance of the fake review detection process.
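
The behavioral features mentioned above can be sketched in a few lines of Python. This is only an illustration: the paper does not define "time taken" precisely, so the sketch assumes it means the gap between consecutive posts by the same reviewer. The record layout, the sample timestamps, and the function name `behavioural_features` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (reviewer_id, ISO timestamp of the post).
posts = [
    ("u1", "2023-01-01T10:00:00"),
    ("u1", "2023-01-01T10:05:00"),
    ("u1", "2023-01-01T10:07:00"),
    ("u2", "2023-01-03T09:00:00"),
    ("u2", "2023-02-10T18:30:00"),
]

def behavioural_features(records):
    """Return {reviewer_id: (post_count, mean_gap_minutes)}."""
    by_reviewer = defaultdict(list)
    for reviewer_id, stamp in records:
        by_reviewer[reviewer_id].append(datetime.fromisoformat(stamp))

    features = {}
    for reviewer_id, times in by_reviewer.items():
        times.sort()
        # Gap in minutes between each pair of consecutive posts.
        gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
        # A reviewer with a single post has no gap to measure.
        mean_gap = sum(gaps) / len(gaps) if gaps else None
        features[reviewer_id] = (len(times), mean_gap)
    return features

features = behavioural_features(posts)
```

A reviewer who posts many reviews within minutes of each other, like `u1` here, would stand out against a reviewer whose posts are weeks apart.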

III. SYSTEM DESIGN

A. System Flow Diagram

Basic symbols in flow charts usually include input, flow lines, process and output. The output from one process can begin another process as an input and multiple processes can be added to an entire system flow diagram.

Parallelogram: Represents the input and output of the system.
Rectangle: Represents a process that needs to be carried out.
Diamond: Indicates a decision to be made.
Oval: Signifies the start and end of the program.
Flow Line: A line with an arrowhead that indicates the flow of data or logic.

IV. SYSTEM IMPLEMENTATION

The system comprises the following modules:

A. Data Collection

Consumer review data collection: Raw review data was collected from different sources and stored in a dataset named Reviews hotel.csv.

B. Data Preprocessing

The data is processed and refined by removing irrelevant and redundant information, as well as noisy and unreliable data, from the review dataset. Naive Bayes is the algorithm used to improve accuracy. The entire review is given as input and tokenized into sentences using the NLTK package.
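
A minimal sketch of this cleaning step is shown below. The paper names NLTK for sentence tokenization (NLTK provides `sent_tokenize` for this); to stay self-contained, the sketch swaps in a simple regular-expression splitter instead. The three-word minimum and the function names `clean_reviews` and `split_sentences` are assumptions made for illustration, not details from the paper.

```python
import re

def clean_reviews(raw_reviews, min_words=3):
    """Drop empty, very short, and duplicate reviews (case-insensitive)."""
    seen = set()
    cleaned = []
    for text in raw_reviews:
        normalised = " ".join(text.split())   # collapse runs of whitespace
        key = normalised.lower()
        if len(normalised.split()) < min_words or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalised)
    return cleaned

def split_sentences(review):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", review) if s]

raw = [
    "Great hotel.  Staff were friendly!",
    "great hotel. staff were friendly!",   # duplicate apart from case
    "",                                    # empty
    "ok",                                  # too short to be informative
    "Room was dirty. Would not return.",
]
reviews = clean_reviews(raw)
sentences = [split_sentences(r) for r in reviews]
```

The regular expression will mis-split abbreviations such as "Dr. Smith", which is one reason a trained tokenizer like the one in NLTK is usually preferred in practice.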

C. Removal of Punctuation Marks

Punctuation marks at the start and end of each review are removed, along with extra white space.
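
One way this step might look in Python, using only the standard library. The function name `strip_edges` is hypothetical; note that punctuation inside the review is deliberately left alone, matching the description above.

```python
import string

def strip_edges(review):
    """Remove leading/trailing punctuation and collapse repeated whitespace."""
    collapsed = " ".join(review.split())
    # str.strip removes any of the given characters from both ends only.
    return collapsed.strip(string.punctuation + " ")

cleaned = strip_edges("  ...Loved   the pool, really!!  ")
print(cleaned)
```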

D. Word Tokenization

Each individual review is tokenized into words and stored in a list for easier retrieval.
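
A small sketch of word tokenization follows. It assumes lowercase alphanumeric tokens with an optional apostrophe suffix (so that "wasn't" stays one token); the paper does not specify its tokenization rules, and the function name `tokenize_words` is illustrative.

```python
import re

def tokenize_words(review):
    """Lowercase the review and return its word tokens as a list."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", review.lower())

tokens = tokenize_words("The room wasn't clean, but the staff were great")
print(tokens)
```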

E. Feature Extraction

The preprocessed data is converted into a set of features by applying certain parameters. The following features are extracted:

Normalized length of the review: Fake reviews tend to be shorter.
Reviewer ID: Identifies a reviewer who posts multiple reviews under the same ID.
Rating: Fake reviews in most scenarios either give 5 out of 5 stars to entice customers or give the lowest rating to competing products, so the rating plays an important role in fake review detection.
Verified Purchase: Fake reviews are less likely than genuine reviews to come from a verified purchase.
Frequency of posts and time taken by reviewers are also extracted.
This combination of features is selected for identifying fake reviews.

This in turn improves the performance of the prediction models.
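
The features listed above could be assembled into one row per review roughly as follows. This is a sketch under stated assumptions: length is normalized by the longest review in the batch (the paper does not say how it normalizes), an "extreme" rating means 1 or 5 stars, and the `Review` class and field names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Review:
    reviewer_id: str
    text: str
    rating: int      # 1 to 5 stars
    verified: bool   # verified-purchase flag

def extract_features(reviews):
    """Return one feature dict per review, in input order."""
    # Longest review in words; fall back to 1 to avoid dividing by zero.
    longest = max(len(r.text.split()) for r in reviews) or 1
    posts_per_reviewer = Counter(r.reviewer_id for r in reviews)
    rows = []
    for r in reviews:
        rows.append({
            "norm_length": len(r.text.split()) / longest,
            "reviewer_posts": posts_per_reviewer[r.reviewer_id],
            "extreme_rating": r.rating in (1, 5),
            "verified": r.verified,
        })
    return rows

sample = [
    Review("u1", "Best hotel ever", 5, False),
    Review("u1", "Amazing amazing stay", 5, False),
    Review("u2", "Clean room, slow check-in, breakfast was fine overall", 4, True),
]
rows = extract_features(sample)
```

Each row can then be passed to a classifier alongside text-derived features such as the sentiment label.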

F. Sentiment Analysis

Reviews are sorted by sentiment into positive, neutral, or negative. This involves predicting whether a review is favourable or unfavourable based on the language used in the review, the review's rating, and other factors. The algorithms used in this module are Random Forest, SVM (Support Vector Machine), Naive Bayes, and logistic regression.
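
To make the idea concrete, here is a minimal sketch of one of the four algorithms, multinomial Naive Bayes with add-one smoothing, written from scratch so that it runs without external libraries. It uses only the review text, not the rating or other factors. The training sentences are invented for the example, the class name `NaiveBayesSentiment` is hypothetical, and a real system would train on a far larger labelled dataset, most likely through an established library.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.label_counts = Counter(labels)       # label -> number of reviews
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the log prior, then add per-word log likelihoods.
            score = math.log(doc_count / total_docs)
            for word in text.lower().split():
                if word in self.vocab:            # ignore unseen words
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for illustration.
train_texts = [
    "lovely room and friendly staff",
    "great location great breakfast",
    "dirty room and rude staff",
    "terrible noise awful breakfast",
    "room was average",
    "stay was average nothing special",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
label = model.predict("friendly staff great location")
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.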

G. Fake Review Detection

Classification assigns each item in a collection to a target class or category. The aim is to predict the target class accurately for every case in the data. Every record in the review file is assigned a weight, and based on that weight it is categorised into one of two classes: Genuine or Fake.
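
The paper does not say how the weight is computed, so the following is only one plausible reading: a logistic score over boolean feature flags, thresholded at 0.5. The weights, the bias, the feature names, and the cut-offs in the comments are all hand-picked assumptions for illustration; in a trained model they would be learned from labelled reviews.

```python
import math

# Hypothetical weights, chosen by hand for illustration only.
WEIGHTS = {
    "extreme_rating": 1.2,    # 1-star or 5-star rating
    "unverified": 1.5,        # no verified purchase
    "short_review": 0.8,      # e.g. normalised length below some cut-off
    "frequent_poster": 1.0,   # e.g. many posts by the same reviewer
}
BIAS = -2.5

def fake_probability(features):
    """Logistic score in (0, 1) computed from boolean feature flags."""
    z = BIAS + sum(WEIGHTS[name] for name, on in features.items() if on)
    return 1 / (1 + math.exp(-z))

def classify(features, threshold=0.5):
    """Map the score to one of the two classes used in the paper."""
    return "Fake" if fake_probability(features) >= threshold else "Genuine"

suspicious = {"extreme_rating": True, "unverified": True,
              "short_review": True, "frequent_poster": True}
ordinary = {"extreme_rating": False, "unverified": False,
            "short_review": False, "frequent_poster": True}

print(classify(suspicious), classify(ordinary))
```

Raising the threshold trades missed fake reviews for fewer genuine reviews flagged by mistake.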

V. EXECUTION

The information is first gathered and saved in a CSV file. The data then undergoes preprocessing, which involves tokenizing the words and removing spaces and punctuation. The next step is feature extraction: once sentiment analysis and the extraction of the review's length, ID, frequency, and rating are complete, fraudulent reviews are identified. Using the graphical user interface (GUI), the user can choose the name of the hotel, select its rating, and write a review based on their level of satisfaction. The application uses Naive Bayes, Random Forest, logistic regression, and support vector machines to determine whether the review is real or fraudulent.

The primary application of Naive Bayes is in feature extraction, while the primary purpose of the Random Forest method is to increase prediction accuracy. A graph-style visualisation is also used to compare the performance of all four algorithms.
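
A comparison of this kind can be sketched without a plotting library by computing accuracy per model and printing a text bar chart. The labels and predictions below are toy values invented for the example; they do not reflect the paper's measurements. The function names `accuracy` and `text_bar_chart` are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def text_bar_chart(scores, width=30):
    """Render {name: score in [0, 1]} as one text bar per line, best first."""
    lines = []
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(score * width)
        lines.append(f"{name:<20} {bar} {score:.2f}")
    return "\n".join(lines)

# Toy labels and predictions, invented for this example (1 = fake, 0 = genuine).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "Naive Bayes":         [1, 0, 0, 1, 0, 1, 1, 0],
    "Logistic Regression": [1, 0, 1, 1, 0, 1, 1, 0],
    "SVM":                 [1, 0, 1, 0, 0, 0, 1, 0],
    "Random Forest":       [1, 0, 1, 1, 0, 0, 1, 0],
}
scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
print(text_bar_chart(scores))
```

Accuracy alone can mislead when fake reviews are rare, so precision and recall are worth reporting alongside it.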

Conclusion

The application is completely menu driven and user friendly, with its front end developed in Python. Appropriate error messages guide the user. Because the programme is user-friendly and makes reports easy to produce, end customers will find it easy to use. It also reduces the need for calculations to be made. The project is complete, and the tests went well. Further improvements to the proposed system are possible and can be made in response to future needs.

To sum up, a technique for identifying fake reviews is essential for preserving the credibility of online review sites and ensuring that consumers can base their judgements on accurate evaluations. The Random Forest algorithm increases accuracy. The software is intended to detect and eliminate dishonest or fake reviews while maintaining the veracity of user-generated material. In a world where customer decisions are heavily influenced by online reviews, developing and implementing efficient techniques for detecting fraudulent reviews is imperative for fostering confidence and openness in online platforms. These solutions improve the quality of user experiences while guarding against dishonest behaviour. Nevertheless, to combat the changing tactics of people looking to manipulate review sites, companies must constantly develop and adapt.

Copyright

Copyright © 2023 M. Brindha, A. Arulselvan Gnanamonickam. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET56364

Publish Date : 2023-10-29

ISSN : 2321-9653

Publisher Name : IJRASET
