Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Fathima Murshida K, Ruby Fathima A C
DOI Link: https://doi.org/10.22214/ijraset.2024.64747
Certificate: View Certificate
NLP is natural language processing or neuro linguis- tic programming.Natural languages like malayalam are highly inflectional and agglutinative in nature.This is problematic when dealing with nlp based malayalam applications.So that inorder to improve performance of malayalam nlp based applications, word embedding improvement on malayalam corpus is needed.The improvement is based on converting the words contained in the malayalam corpus into a standardised means removing all inflectional parts in the words in the existing malayalam corpus ie taking root words only.All that needed is a stemmer.In this project i have used a malayalam morphological analyser for taking root words of all words in the existing malayalam corpus.The advanatge of removing inflectional parts from all words is that we can reduce the sparsity in the existing malayalam corpus.Also there will be a high hike in frequency of words in the resulting corpus,then the space and time complexity of wordembedding representation of the existing corpus will decreases.According to zipfs law by increasing frequency of words performance of neural word embedding will increases. Zipfs Law is a discrete probability distribution that tells you the probability of encountering a word in a given corpus.By applying zipfs law am proposing there will be improvement on malyalam wordembedding.Here using fasttext, word embeddings are performed and capture dense word vector representation of the malayalam corpus with dimensionality reduction from the sparse word co-occurence matrix.The improvement is mainly used for wordnet,analogy,ontology based malayalam applications. Index Terms—Morphological Analyzer,Zipfs law, Preprocess- ing, Testing, Training
I. INTRODUCTION
NLP stands for Neuro-Linguistic Programming. Neuro refers to your neurology, Linguistic refers to language, pro- gramming refers to how that neural language functions. In other words, learning NLP is like learning the language of your own mind.It is the sub-field of AI that is focused on enabling computers to understand and process human languages.Computers can’t yet truly understand Natural lan- guages in the way that humans do — but they can already do a lot.We might be able to save a lot of time by applying NLP techniques to projects.
In this project am using Malayalam language.Malayalam is an Dravidian Indian language spoken in the Indian state of Kerala. It is one of the 22 scheduled languages of India and was designated a Classical Language in India in 2013. The earliest script used to write Malayalam was the Vatteluttu script, and later the Kolezhuttu, which derived from it. The oldest literary works in Malayalam, distinct from the Tamil tradition, are the Paattus, folk songs, dated from between the 9th and 11th centuries. Grantha script letters were adopted to write Sanskrit loanwords, which resulted in the modern Malay- alam script. Malayalam is a highly inflectional and agglu- tinative language.Algorithmic interpretation of Malayalam’s words and their formation rules continues to be an untackled problem. The word order is generally subject–object–verb, although other orders are often employed for reasons such as emphasis. Nouns are inflected for case and number, whilst verbs are conjugated for tense, mood and causativity (and also in archaic language for person, gender, number and polarity). Being the linguistic successor of the macaronic Manipravalam, Malayalam grammar is based on Sanskrit too.Because of its high complexity nature it is challenging to work on malayalam language.
Morphological analyzer and morphological generator are two essential and basic tools for building any language pro- cessing application. Morphological Analysis is the process of providing grammatical information of a word given its suffix. Morphological analyzer is a computer program which takes a word as input and produces its grammatical structure as output. A morphological analyzer will return its root/stem word along with its grammatical information depending upon its word category. For nouns it will provide gender, number, and case information and for verbs, it will be tense, aspects,and modularity. In my project need to remove inflections from each words in the corpus so that the frequency of words in the resulting corpus will increase. So the resulting corpus will contain only the root words. By increasing the frequency of words,complexity of word embedding will decrease.
Because sparsity is more in inflected words.Also with limitted re- souces will get better output provided that not considering or taking any morphosyntactic information.In my project am using only the conceptual similarity,that is conceptually similar words co occurene so that malayalam NLP based wordnet,ontology,analogy applications can improve their per- formance.
In Natural Language Processing we want to make computer programs that understand, generate and, more generally speak- ing, work with human languages. But there’s a challenge that jumps out: we, humans, communicate with words and sen- tences; meanwhile, computers only understand numbers.For this reason, we have to map those words (sometimes even the sentences) to vectors: just a bunch of numbers ,That’s called text vectorization.It is also termed as feature extrac- tion.Different ways to convert text into numbers are Sparse Vector Representations and Dense Vector Representations.
II. LITERATURE SURVEY
A. State of the art
The data for tuning the embeddings used by the authors is the PPDB 5 or the Paraphrase Database by Ganitkevitch, Van Durme, and Callison-Burch . Specifically,they used ver- sion XXL which contains 86 million paraphrase pairs.
An example of a short paraphrase is: “thrown into jail” which is semantically similar to “taken into custody”. Wieting et al. published their pretrained embeddings called Paragram-Phrase XXL 6 , which are in fact tuned GloVe embeddings, alongside their paper. These em- beddings also have a dimensionality of 300 and have a limited vocabulary of 50,000.In order to apply the embeddings, according to Wieting et al., they should be combined with Paragram-SL999 which are tuned on the SimLex dataset.
III. PROPOSED METHOD
The increasing accuracy of word embedding representation of malayalam languae has a great impact on ontology,analogy representation based NLP based applications. Improved Word Vectors on Malayalam language, which increases the accuracy of pre-trained malayalam wordembeddings in NLP Applica- tions . The main aim is to improve word embeddings that they do not need any labeled data.The improvement is by converting malayalam language into a standardised format by removing inflections from each word in malayalam corpus.
Any large volume of text can be used to get the word embeddings by feeding it to the model, without any kind of labeling.In this project am using fastText wordembedding. Word embeddings are derived by training a model on large text corpus.
A. System Architecture
The system architecture mainly consist of five modules. They are Corpus Creation ,Data Preprocessing, Training Using Neural Networks, Model Evaluation,Dataset Visualisation.
IV. RESULTS
Fig. 1. Graphical result of existing dataset after applying zipfs law
Fig. 2. Graphical result of new malayalam dataset after appying zipfs law
It can be visualise that similar malayalam words or syn- onyms occupy close to each other by using tensorflow em- bedding projector. The Eucledian distance and cosine simi- larity of similar words will be lesser than the fastText pre trained word vectors of malayalam language. It can be predict that by using zipfs law malayalam word embedding can be improved. Also sparsity problem in the existing malayalam corpus is decreased by using less resources that is conceptual similarity of words.Space and time complexity is reduced in the new malayalam corpus. The improved word vectors can be used to all NLP malayalam applications to improve their efficiency,accuracy,speed. In future for machine translation improvement on malayalam this model can be used as a base model and also learn morphosyntatic information on that model.
[1] Abdulaziz M. Alayba, Vasile Palade, Matthew Englandand Rahat Iqbal ” Improving Sentiment Analysis inArabic Using Word Representation”, IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR),2018 . [2] Shaosheng Cao Wei Lu, ”Improving Word Embeddings with Convolu- tional Feature Learning and Subword Information”, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing [3] YuvalPinter,RobertGuthrie,Jacob Eisenstein, ” Mimicking Word Embed- dings using Subword RNNs”, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. . [4] Vishwani Gupta,Sven Giesselbach,Stefan R¨uping,Christian Bauckhage, ”Improving Word Embeddings Using Kernel PCA”, Proceedings of the 4th Workshop on Representation Learning for NLP. [5] Procheta Sen,Debasis Ganguly , ”Word-Node2Vec: Improving Word Em- bedding with Document-Level Non-LocalWord Co-occurrences”, Pro- ceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.. [6] Miguel Ballesteros, Chris Dyer, and Noah A. Smith, ”Improved transition-based parsing by modeling characters instead of words with LSTMs”, In Proc. EMNLP.2015. [7] Marco Baroni and Alessandro Lenci, ”Distributional memory: A general framework for corpus-based semantics”, Computational Linguistics, 36(4):673–721, 2010. [8] Jan A. Botha and Phil Blunsom, ” Compositional morphology for word representations and language modelling”, In Proc. ICML, 2014. [9] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan, ”Joint learning of character and word embeddings”, In Proc. IJCAI,2015. [10] Cicero Nogueira dos Santos and Bianca Zadrozny, ”Learning character- level representations for part-ofspeech tagging”, In Proc. ICML,2014. [11] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, ”Detecting spammer on twitter”, in Proc. 7th Annu. Collaboration, Electron. Mes- saging, Anti-Abuse Spam Conf., Jul. 2012, p. 12 [12] J. Song, S. Lee, and J. Kim, ”Spam filtering in Twitter using sender receiver relationship”, in Proc. 14th Int. Conf. Recent Adv. Intrusion Detection, 2011, pp. 301317. [13] Chao Chen, Yu Wang, Jun Zhang, Yang Xiang, WanleiZhou, ”Statistical Features-Based Real-Time Detection of Drifted TwitterSpam, Twit- terSpam,”IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 12, NO. 4, APRIL 2017. [14] Sayali Kamble,S.M.Sangve, ””Real Time Detection of Drifted Twit- terSpam Based on Statistical Features”, Features”2018 International Conference on Information, Communication, Engineering and Technol- ogy (ICICET) . [15] Surendra Sedhai , Aixin Sun, ”Semi Supervised Spam Detection Twitter Stream”, ”IEEE Transactions On Computational Social Systems 2017. [16] Hajime Watabe, Mondher Boduazizi,And Tomoaki Ohtsuki, ””Hate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection”, ”IEEE Transactions On Computational Social Systems 2018.
Copyright © 2024 Fathima Murshida K, Ruby Fathima A C. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET64747
Publish Date : 2024-10-22
ISSN : 2321-9653
Publisher Name : IJRASET
DOI Link : Click Here