IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Khushi Udaysingh Chouhan, Nikita Pradeep Kumar Jha, Roshni Sanjay Jha, Shaikh Insha Kamaluddin, Dr. Sujata Khedkar
DOI Link: https://doi.org/10.22214/ijraset.2023.50123
Text preprocessing is the essential first step for any machine learning model: raw data needs to be cleaned and pre-processed to obtain better performance, and preprocessing is the method that cleans the data and makes it ready to feed to the model. Text classification is at the heart of many software systems that process text documents; its purpose is to automatically classify text documents into two or more defined categories. In this paper, various preprocessing and classification approaches, drawing on NLP and machine learning, are applied to patent documents.
I. INTRODUCTION
A patent is a form of intellectual property, and effective patent analysis can bring many benefits to an enterprise. One of the main patent mining tasks is patent classification. Text mining and classification have been crucial applications and research topics since the advent of digital documents: around 80% of all available information is unstructured, and with increasing digitization we deal with numerous text documents daily, so text preprocessing and classification have become a necessity. Text data contains noise in innumerable forms, such as emoticons, punctuation, and inconsistent casing. Text preprocessing is a method to clean the text data so that it can be used in a model to obtain better performance. Text classification is one of the fundamental tasks in natural language processing (NLP), with broad applications. It is a machine learning technique that assigns a set of predefined categories to text; the resulting classifiers are used to organize, categorize, and structure any kind of text document. Text classification can be done in two ways: manual or automatic.
Manual classification involves a human analyst who reads the content of the text and categorizes it by hand. This may give better accuracy but is time-consuming and strenuous.
Automatic text classification applies machine learning, natural language processing, and other AI-guided techniques to classify text in a faster and more accurate way. For this project, various preprocessing techniques have been implemented: removing punctuation, lowercasing the text, removing stop words, and removing whitespace. After the text is cleaned, it is tokenized, converted into vector form, and then passed to different classification algorithms. The algorithms implemented include OvR (One-vs-Rest), Random Forest classification, an n-gram model, an ensemble model, and LSTM (Long Short-Term Memory). Finally, the analyzed data is visualized to better understand the results and observations.
II. DATASET AND RELATED WORK
The proposed system uses the Contract Understanding Atticus Dataset (CUAD) v1, which falls under the legal domain. It was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The files in CUAD v1 include 1 CSV file, 1 SQuAD-style JSON file, 28 Excel files, 510 PDF files, and 510 TXT files.
Existing projects using the CUAD v1 dataset have implemented contract summarization algorithms, and one of them has experimented with fine-tuning four pre-trained variants of BERT.
III. METHOD
For better performance and accuracy, the proposed system uses various classification algorithms, each of which is explained below.
A. Random Forest Classifier
Random Forest is a supervised machine learning algorithm used for classification. It constructs a collection of decision trees, and the most-voted prediction across the trees is taken as the final prediction. Random Forest classifiers are well suited to the high-dimensional, noisy data encountered in text classification.
RF is an ensemble of decision tree algorithms, where an ensemble simply means a combination of multiple models. Ensembles use two main methods, bagging and boosting. Bagging, also known as bootstrap aggregation, is the technique used by random forests: it creates different training subsets from the training data by sampling with replacement, and the final output is decided by majority voting. Many hyperparameters can be tuned either to enhance performance or to make the model faster; in scikit-learn, important ones include n_estimators (the number of trees), max_depth (the maximum depth of each tree), max_features (the number of features considered at each split), and n_jobs (how many trees are built in parallel).
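As a concrete illustration, here is a minimal sketch of a random forest text classifier on TF-IDF features; the corpus, labels, and hyperparameter values are placeholders, not the authors' exact pipeline:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder corpus; in the paper the documents come from CUAD v1.
texts = ["This agreement shall terminate upon notice.",
         "The licensee agrees to pay royalties quarterly."]
labels = [0, 1]

model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(
        n_estimators=200,     # number of trees; more trees cost more compute
        max_depth=None,       # grow trees fully; set a cap to speed up training
        max_features="sqrt",  # features considered at each split
        n_jobs=-1,            # build trees in parallel
        random_state=42,
    ),
)
model.fit(texts, labels)
print(model.predict(["The contract may be assigned with consent."]))
```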
B. One-vs-Rest
This is the most commonly used strategy for multi-class classification; it is also referred to as One-vs-All (OvA). It splits the multi-class dataset into multiple binary classification problems, trains a binary classifier on each of them, and then makes predictions by selecting the most confident model.
The scikit-learn (sklearn) library is the most useful and robust library for machine learning in Python. It provides a dedicated OneVsRestClassifier class that allows the one-vs-rest strategy to be used with any classifier: the binary classifier to be used is simply passed as an argument to OneVsRestClassifier, making it very easy to use.
After implementing and experimenting with various binary classifiers as the argument to the OvR classifier, the proposed system obtained its highest accuracy using the LogisticRegression class.
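A minimal sketch of this setup, with placeholder documents and labels and an assumed TF-IDF front end:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Placeholder multi-class data; the paper uses CUAD v1 contract text.
texts = ["governing law of this agreement is New York law",
         "the licensee shall pay a quarterly royalty",
         "either party may terminate with thirty days notice"]
labels = ["law", "payment", "termination"]

ovr = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),  # one binary model per class
)
ovr.fit(texts, labels)
print(ovr.predict(["this agreement is governed by the laws of Delaware"]))
```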
C. N-gram Model
N-grams are contiguous sequences of words, symbols, or tokens in a document. In NLP, an n-gram is a contiguous sequence of n items generated from a given dataset or text sample, where n can be any number such as 1, 2, or 3.
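A small illustration of n-gram extraction with scikit-learn's CountVectorizer, where ngram_range controls which values of n are generated (the sentence is a placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["the patent claims are novel"]

for ngram_range, name in [((1, 1), "unigrams"),
                          ((2, 2), "bigrams"),
                          ((1, 2), "unigrams+bigrams")]:
    # Fit a vectorizer restricted to the given n-gram range and show its vocabulary.
    vocab = CountVectorizer(ngram_range=ngram_range).fit(doc).get_feature_names_out()
    print(name, "->", list(vocab))
```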
D. Ensemble Model
Ensemble learning is a powerful machine learning approach used across industries by data science experts. The beauty of ensemble learning techniques is that they combine the predictions of multiple machine learning models. There are several ensemble learning techniques, among them bagging, boosting, stacking, and blending. The proposed system uses stacking and blending, and their results are compared below.
1. Stacking: Stacking combines multiple base models by training a meta-model on their predictions: the base models are trained on the training set, and their predictions become the input features from which the meta-model learns to produce the final prediction.
2. Blending: Blending follows the same approach as stacking but uses only a holdout (validation) set taken from the training set to make predictions; in other words, unlike stacking, the meta-model is trained on predictions made on the holdout set only. In brief, the blending process is: split the training set into a training part and a holdout part; fit the base models on the training part; let them predict on the holdout set and on the test set; train the meta-model on the holdout predictions; and obtain the final output from the meta-model applied to the test-set predictions.
Of these two techniques, blending gives better accuracy than stacking for patent classification: blending reaches 99.40%, whereas stacking reaches 91.79%.
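A minimal blending sketch along these lines, assuming a feature matrix X, labels y, and test matrix X_test have already been vectorized, and using a simplified subset of the base models listed in Table 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def blend_fit_predict(X, y, X_test):
    # Split off a holdout set; only its predictions feed the meta-model.
    X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.25, random_state=42)
    base_models = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=42)]
    hold_preds, test_preds = [], []
    for m in base_models:
        m.fit(X_tr, y_tr)                            # base models see only the training part
        hold_preds.append(m.predict_proba(X_hold))   # meta-features from the holdout set
        test_preds.append(m.predict_proba(X_test))   # meta-features for the final prediction
    meta = LogisticRegression(max_iter=1000)
    meta.fit(np.hstack(hold_preds), y_hold)          # meta-model trained on holdout predictions
    return meta.predict(np.hstack(test_preds))
```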
E. LSTM (Long Short-Term Memory)
LSTM is a deep learning model: a type of recurrent neural network with a good capability for memorizing patterns, widely used for sequential data. An LSTM cell contains three gates: the forget gate, the input gate, and the output gate.
For modeling, the proposed system uses a sequential model with LSTM layers, consisting of an input layer, a hidden layer, and an output layer. The combination of LSTM and NLP gives accurate results for text classification, since the network can discard unused information and handle long-term dependencies, which reduces cost. It gives the best result, with the highest accuracy, among all the implemented models (Random Forest, ensemble, and n-gram).
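A minimal Keras sketch of such a sequential model; the vocabulary size, sequence length, layer sizes, and class count are assumptions, not values taken from the paper:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE, NUM_CLASSES = 20000, 5  # placeholder values

model = Sequential([
    Embedding(VOCAB_SIZE, 128),                # input layer: token ids -> dense vectors
    LSTM(64),                                  # hidden layer with forget/input/output gates
    Dense(NUM_CLASSES, activation="softmax"),  # output layer: class probabilities
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.build(input_shape=(None, 200))           # e.g. padded sequences of length 200
model.summary()
# model.fit(padded_sequences, integer_labels, epochs=5, validation_split=0.1)
```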
IV. FLOW OF THE PROJECT
A. Preprocessing
Typically, data for text classification is collected from the web: from newsgroups, bulletin boards, or broadcasts. It is multi-source, and thus comes in different formats, with different preferred vocabularies and significantly different writing styles, even for documents in the same genre. Text preprocessing is therefore needed to remove the noise from the text corpus and obtain better model accuracy.
Some crucial preprocessing steps, illustrated in the sketch below, include:
1. Stop-word removal: Stop words are a set of commonly used words, in this case in English legal documents; examples in English are "a", "the", "are", "is", and "and". Stop words carry little information, so they are removed from the text corpus to retain the words with more significance.
2. Lowercasing: Words like "corpus" and "Corpus" mean the same thing, but unless the text is converted to lowercase, they are represented as two different words in the vector space model.
3. Tokenization: Tokens are the building blocks of natural language, and tokenization is a way of separating a piece of text into these smaller units.
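A minimal sketch of these steps in Python (using NLTK's stop-word list; the example sentence is a placeholder and the paper's exact implementation may differ):

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    text = text.lower()                                # "Corpus" and "corpus" become one token
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()           # collapse stray whitespace
    tokens = text.split()                              # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # remove low-information stop words

print(preprocess("The Agreement, as amended, IS governed by the laws of Delaware."))
```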
B. Vectorization
Vectorization is the classic approach of converting input data into vectors of real numbers, the format that ML models accept. After preprocessing, the cleaned data is vectorized into a form the classification models can read. There are plenty of ways to perform vectorization.
Two common approaches, sketched below, are bag-of-words counts and TF-IDF weighting.
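A small scikit-learn sketch of both approaches (the documents are placeholders, and the paper's exact vectorizer settings are not specified here):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the patent claims are novel",
        "the examiner rejected the claims"]

bow = CountVectorizer().fit_transform(docs)    # bag-of-words: raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)  # TF-IDF: counts reweighted by rarity
print(bow.toarray())     # one row per document, one column per vocabulary term
print(tfidf.toarray())
```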
C. Classification
Classification is a supervised learning approach that categorizes data and classifies it accordingly. It can be used to classify both structured and unstructured data. Below is a table describing the classifiers used for text classification in the proposed system.
Table 1. Performance comparison between 5 algorithms on CUAD v1 dataset
| Classifier | Sub-classifiers / class used | Accuracy obtained | Advantages | Disadvantages |
|---|---|---|---|---|
| Random Forest | — | 88.23% | It can handle large datasets with high dimensionality and is comparatively less impacted by noise. | It builds many trees, so it uses more computational power and resources; training takes longer than a single decision tree classifier. |
| N-gram | Unigram; Bigram; Trigram; Unigram+Bigram; Bigram+Trigram; Unigram+Bigram+Trigram | 84.967%; 94.11%; 66.66%; 92.15%; 89.54%; 96.07% | It encodes not only keywords but also word ordering automatically; the model is not biased by hand-coded lists and depends entirely on real data; learning features is relatively fast and easy. | It requires a considerable amount of training text to determine the model's parameters, and it can only interpret unseen instances with respect to the training data it has learned. |
| OvR | LogisticRegression | 98.0% | This method is well suited to multi-class classification. | Many binary models have to be created. |
| Ensemble | Blending: LogisticRegression, KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, and OneVsRestClassifier(DecisionTreeClassifier()); Stacking: DecisionTreeClassifier and OneVsRestClassifier(SVC()) | Blending: 99.40%; Stacking: 91.79% | Ensemble models have higher predictive accuracy than individual models; different models can be combined to handle different types of data; ensembles are less noisy and more stable. | Ensembling is hard to tune, and a wrong selection of models may give lower accuracy than the individual models; the stacking method takes longer to execute; ensemble models are expensive in both space and time. |
| LSTM | — | 99.95% | It can handle long-term dependencies and can discard information not needed for prediction. | It is easy to overfit and requires more memory to train. |
D. Visualization
Data visualization is the representation of data through graphics such as bar charts, line charts, or plots. The visual display of information communicates complex data relationships in an understandable form, and it is an essential part of data analysis because it gives deeper insight into the patterns the data exhibits. The proposed system uses several such tools for data visualization.
The text itself can also be visualized, for example by plotting the top words or displaying scatter plots.
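As one example, a minimal matplotlib sketch of a top-words bar chart (the tokens and counts are placeholders):

```python
from collections import Counter

import matplotlib.pyplot as plt

# Placeholder tokens; in practice these come from the preprocessed corpus.
tokens = ["agreement", "party", "agreement", "shall", "party", "agreement", "law"]
words, counts = zip(*Counter(tokens).most_common(3))  # three most frequent tokens

plt.bar(words, counts)
plt.title("Top words in the corpus")
plt.xlabel("Token")
plt.ylabel("Frequency")
plt.show()
```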
V. CONCLUSION
In this paper, a high-performance patent document analyzer and classifier is proposed, and the results of various classifiers are compared. The proposed system implements five methods for better accuracy and performance: the Random Forest classifier, the OvR model, the n-gram model, the ensemble algorithm, and LSTM (a deep learning model). The conclusion is that LSTM (Long Short-Term Memory) gives the best accuracy of all, at 99.95%. In the future, we plan to explore many other techniques and to try larger datasets to improve the performance of the system.