Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Sathwik Chettukindi, Uma Maheshwar Rao Dhinthakurthy, Chakrapani Seepathi, Dr. RRS. Ravi Kumar
DOI Link: https://doi.org/10.22214/ijraset.2023.50477
The main objective of this project is to develop an application that generates questions from a given passage or paragraph. The project is an application of Natural Language Processing (NLP). The application generates different types of questions, such as multiple-choice questions (MCQs). It helps teachers prepare questions for examinations, quizzes, and similar assessments, and it helps students obtain a summary of the text they provide. A summary is first generated from the given text; the application then identifies the key concepts in that summary, extracts keywords from its sentences, and generates MCQs. The options other than the correct answer are called distractors. The application can also paraphrase the input text. Summarization is performed using the T5 transformer technique, distractors are generated using WordNet, and keywords are extracted using the KeyBERT technique. The questions are displayed through a GUI built with Gradio, a user-friendly web interface library. The application also targets organizations: it can generate Frequently Asked Questions (FAQs) for customers, which in turn gives customers more information about the organization. In this way, the application replaces the traditional, labour-intensive method of preparing questions.
I. INTRODUCTION
Question generation using Natural Language Processing (NLP) involves creating questions automatically from a given text or context. NLP techniques are used to analyse the given text and generate questions that are semantically and grammatically correct. The process of question generation typically involves several steps, including text pre-processing, semantic analysis, and question formulation. Text pre-processing involves cleaning and preparing the text for analysis, including removing stop words, stemming, and tokenization. Semantic analysis involves understanding the meaning of the text and identifying relevant concepts and entities. Question formulation involves using this information to generate questions that are relevant and grammatically correct. There are various applications of question generation using NLP, including educational systems, chatbots, and search engines. In educational systems, question generation can be used to create quizzes and tests for students, while in chatbots, it can be used to generate responses to user queries. In search engines, question generation can be used to provide users with more relevant search results by understanding their intent and generating relevant questions.
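For illustration, the pre-processing steps mentioned above can be sketched with the NLTK library (one common choice; the exact pipeline used by a given system may differ):

```python
# A minimal pre-processing sketch using NLTK: tokenization,
# stop-word removal, and stemming, as described above.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(text):
    tokens = word_tokenize(text.lower())              # tokenization
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens
              if t.isalpha() and t not in stop_words]  # stop-word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]           # stemming

print(preprocess("NLP techniques are used to analyse the given text."))
```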
II. LITERATURE SURVEY
Objective and subjective question generation is an active research area in the field of natural language processing. A variety of approaches have been proposed to generate both types of questions using NLP techniques; some use rule-based methods, while others use machine learning algorithms. A study by Li et al. (2020) proposed a neural network-based approach for generating multiple-choice questions. Their approach used a combination of convolutional and recurrent neural networks to encode the input text and generate candidate answers for each question [5]. They evaluated their approach on a dataset of history questions and achieved promising results. Similarly, Wang et al. (2019) proposed a method for generating subjective questions that can be answered with a short sentence [7]. Their approach used a sequence-to-sequence model with attention mechanisms to generate questions from a given text. They evaluated it on a dataset of reading comprehension questions and achieved competitive results compared to existing methods. Text summarization is another active research area in natural language processing that aims to generate concise summaries of longer texts; both extractive and abstractive methods have been proposed. A study by Nallapati et al. (2016) proposed a sequence-to-sequence model with attention mechanisms for abstractive summarization; they evaluated it on the CNN/Daily Mail dataset and achieved state-of-the-art results [8]. Another study by Zhang et al. (2018) proposed an extractive summarization approach based on graph neural networks [7]. They evaluated their approach on a dataset of scientific articles and achieved competitive results compared to existing methods.
Speech-to-text is a well-established application of natural language processing that involves converting spoken language into written text. Several approaches have been proposed to perform speech-to-text using NLP techniques, including Hidden Markov Models (HMMs), Deep Neural Networks (DNNs), and Recurrent Neural Networks (RNNs). A study by Hinton et al. (2012) proposed a deep neural network-based approach for speech recognition that outperformed traditional HMM-based approaches [9]. They used a Deep Belief Network (DBN) to pretrain a Deep Neural Network (DNN) for acoustic modelling, achieving state-of-the-art performance on a dataset of spoken digits.
Similarly, Graves et al. (2013) proposed a recurrent neural network-based approach for speech recognition that used Connectionist Temporal Classification (CTC) to align the predicted text with the spoken input [10]. They evaluated their approach on a dataset of spoken sentences and achieved state-of-the-art results compared to existing methods. Overall, these studies demonstrate the effectiveness of various NLP techniques for objective and subjective question generation, text summarization, and speech-to-text applications. As NLP technology continues to advance, we can expect even more accurate and efficient systems in these areas.
III. PROPOSED SYSTEM
The proposed system is a multilingual Natural Language Processing system. Through the polyglot library, it supports multiple languages. It accepts speech as input, summarizes the resulting text using the T5 transformer technique, and can return speech as output. It generates different types of questions. The system uses the KeyBERT technique to identify keywords and WordNet to generate distractors for MCQs. The proposed system also offers an optional feature to paraphrase the input text.
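As an illustration of the multilingual aspect, a minimal sketch of language detection with polyglot is shown below (the routing of the detected language to downstream models is an assumption; the paper does not detail it, and polyglot requires the pycld2 and PyICU dependencies to be installed):

```python
# Hypothetical illustration of multilingual support via polyglot's
# language detector; the system's exact routing logic is not shown here.
from polyglot.detect import Detector

text = "Bonjour tout le monde"
detector = Detector(text)
print(detector.language.code, detector.language.name)  # e.g. "fr French"
```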
IV. METHODOLOGY
A. Summarization
The summarizer function first preprocesses the text by stripping any whitespace characters and replacing newlines with spaces. It then adds the prefix "summarize:" to the text, which is required by some summarization models to indicate that the input text should be summarized.
The function then encodes the preprocessed text using the tokenizer, with a maximum length of 512 tokens. The encoding output includes the input_ids and attention_mask tensors, which are required inputs for the generate method of the model.
The generate method of the model is then called, with the encoded input tensors, to generate the summarized text. The method uses beam search with a beam width of 3, early stopping, and no repeat n-grams of size 2, to generate the summary. The minimum length of the summary is set to 75 and the maximum length is set to 300.
The generated summary is then decoded using the tokenizer to obtain the final summary text. The postprocess_text function is called to capitalize the first letter of each sentence in the summary, and any leading or trailing whitespace is removed before the summary text is returned.
The script also defines a set_seed function to set the random seed for reproducibility, and imports the necessary libraries including nltk for tokenization and wordnet for synonym and antonym lookups.
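The description above maps onto the Hugging Face transformers API roughly as follows. This is a sketch, not the authors' exact code: the checkpoint name is an assumption, but the generation parameters follow the text (beam width 3, early stopping, no repeated 2-grams, length between 75 and 300 tokens).

```python
# Sketch of the summarizer described above, using transformers with a
# T5 checkpoint (the checkpoint name "t5-base" is an assumption).
import torch
import nltk
from nltk.tokenize import sent_tokenize
from transformers import T5ForConditionalGeneration, T5TokenizerFast

nltk.download("punkt", quiet=True)

def set_seed(seed):
    torch.manual_seed(seed)  # fix the random seed for reproducibility

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def postprocess_text(summary):
    # Capitalize the first letter of each sentence, as described above.
    return " ".join(s.capitalize() for s in sent_tokenize(summary)).strip()

def summarizer(text):
    # Preprocess: strip whitespace, replace newlines, add the T5 task prefix.
    text = "summarize: " + text.strip().replace("\n", " ")
    encoding = tokenizer.encode_plus(
        text, max_length=512, truncation=True, return_tensors="pt")
    outputs = model.generate(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
        num_beams=3,               # beam search with beam width 3
        early_stopping=True,
        no_repeat_ngram_size=2,    # no repeated 2-grams
        min_length=75,
        max_length=300)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return postprocess_text(summary)
```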
B. Subjective
filter_same_sense_words(original, wordlist): This function takes in an original word and a list of words and filters the list to include only words with the same sense as the original. It uses the Sense2Vec library to determine the sense of the original word and compares it to the sense of each word in the list. It returns a list of filtered words.
get_highest_similarity_score(wordlist, wrd): This function takes in a list of words and a single word and calculates the highest similarity score between the single word and any word in the list. It uses the NormalizedLevenshtein measure to calculate the similarity score.
sense2vec_get_words(word, s2v, topn, question): This function takes in a word, a Sense2Vec model, a topn value (i.e. the number of similar words to return), and a question string. It uses the Sense2Vec model to find the best sense of the input word and then returns the topn most similar words that have the same sense as the input word. It filters the similar words to remove duplicates and words that are too similar to the original word or appear in the question string.
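A sketch of these three helpers is shown below, using the sense2vec and strsimpy packages. The model path and the 0.6 similarity threshold are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the sense-filtering and similarity helpers described above.
from sense2vec import Sense2Vec
from strsimpy.normalized_levenshtein import NormalizedLevenshtein

s2v = Sense2Vec().from_disk("s2v_old")  # path to a pretrained sense2vec model

def filter_same_sense_words(original, wordlist):
    # Keep only candidates whose sense tag (e.g. "|NOUN") matches the original's.
    sense = original.split("|")[1]
    return [w[0].split("|")[0].replace("_", " ")
            for w in wordlist if w[0].split("|")[1] == sense]

def get_highest_similarity_score(wordlist, wrd):
    # Highest normalized-Levenshtein similarity between wrd and any list entry.
    nl = NormalizedLevenshtein()
    return max(nl.similarity(each.lower(), wrd.lower()) for each in wordlist)

def sense2vec_get_words(word, s2v, topn, question):
    sense = s2v.get_best_sense(word.replace(" ", "_"))
    if sense is None:
        return []
    similar = s2v.most_similar(sense, n=topn)
    out = []
    for cand in filter_same_sense_words(sense, similar):
        # Drop near-duplicates of the answer and words already in the question.
        if (get_highest_similarity_score(out + [word], cand) < 0.6
                and cand.lower() not in question.lower()):
            out.append(cand)
    return out
```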
mmr(doc_embedding, word_embeddings, words, top_n, lambda_param): This function takes in a document embedding, a list of word embeddings, a list of words, a top_n value (i.e. the number of keywords to return), and a lambda parameter. It uses the cosine similarity between the document embedding and each word embedding, as well as the cosine similarity between each pair of word embeddings, to calculate an MMR (maximal marginal relevance) score for each word. It then selects the top_n words with the highest MMR scores and returns them.
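A self-contained sketch of this computation, following the standard MMR formulation with numpy and scikit-learn, is given below:

```python
# Maximal Marginal Relevance: trade off relevance to the document
# against redundancy with already-selected words.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, top_n, lambda_param):
    # Relevance of each candidate to the document, and candidate-candidate similarity.
    word_doc_sim = cosine_similarity(word_embeddings, doc_embedding)  # (n, 1)
    word_word_sim = cosine_similarity(word_embeddings)                # (n, n)

    keywords_idx = [int(np.argmax(word_doc_sim))]  # most relevant candidate first
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(min(top_n, len(words)) - 1):
        if not candidates_idx:
            break
        relevance = word_doc_sim[candidates_idx, :]
        # Redundancy: similarity of each remaining candidate to what is chosen.
        redundancy = np.max(word_word_sim[np.ix_(candidates_idx, keywords_idx)],
                            axis=1)
        # MMR score = lambda * relevance - (1 - lambda) * redundancy
        scores = (lambda_param * relevance
                  - (1 - lambda_param) * redundancy.reshape(-1, 1))
        best = candidates_idx[int(np.argmax(scores))]
        keywords_idx.append(best)
        candidates_idx.remove(best)

    return [words[i] for i in keywords_idx]
```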
get_distractors_wordnet(word): This function takes in a word and uses the WordNet library to generate a list of hypernyms and hyponyms for that word. It returns a list of possible distractor words.
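With NLTK's WordNet interface, this lookup can be sketched as follows (restricting to the first noun sense is an assumption; the idea is to collect siblings of the word under its hypernym):

```python
# WordNet-based distractors: take the word's first noun synset, go up to
# its hypernym, and collect the other hyponyms (siblings) as distractors.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def get_distractors_wordnet(word):
    distractors = []
    synsets = wn.synsets(word, pos="n")
    if not synsets:
        return distractors
    hypernyms = synsets[0].hypernyms()
    if not hypernyms:
        return distractors
    for hyponym in hypernyms[0].hyponyms():   # siblings of the input word
        name = hyponym.lemmas()[0].name().replace("_", " ")
        if name.lower() != word.lower() and name.title() not in distractors:
            distractors.append(name.title())
    return distractors

print(get_distractors_wordnet("lion"))  # e.g. ['Tiger', 'Leopard', ...]
```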
get_distractors(word, origsentence, sense2vecmodel, sentencemodel, top_n, lambdaval): This function takes in a word, an original sentence (from which the word was extracted), a Sense2Vec model, a Sentence Transformer model, a top_n value (i.e. the number of distractors to generate), and a lambda parameter. It uses the sense2vec_get_words() function to generate a list of similar words with the same sense as the input word, and then uses the mmr() function to select the most relevant distractor words from that list. It also uses the get_distractors_wordnet() function to generate additional distractor words from WordNet, and combines the two lists of distractor words to return a final list.
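Putting the pieces together, a sketch of the combined routine, built on the helpers sketched above, might look as follows (the candidate-pool size of 40 is an assumption):

```python
# Sketch of get_distractors(), combining the helpers above; not the
# authors' exact implementation.
def get_distractors(word, origsentence, sense2vecmodel, sentencemodel,
                    top_n, lambdaval):
    # Candidate distractors that share the answer word's sense.
    distractors = sense2vec_get_words(word, sense2vecmodel, 40, origsentence)
    if len(distractors) == 0:
        return distractors

    # Embed the sentence (with the answer appended) and each candidate,
    # then let MMR pick a relevant but diverse subset.
    doc_embedding = sentencemodel.encode([origsentence + " " + word.capitalize()])
    candidate_embeddings = sentencemodel.encode(distractors)
    filtered = mmr(doc_embedding, candidate_embeddings, distractors,
                   top_n, lambdaval)

    # Top up with WordNet hypernym/hyponym distractors, avoiding duplicates.
    for d in get_distractors_wordnet(word):
        if d.lower() != word.lower() and d not in filtered:
            filtered.append(d)
    return filtered
```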
C. Objective
1. Sense2Vec: Sense2Vec is a Python library that provides word embeddings keyed by sense tags (part-of-speech and named-entity labels), so that the nearest neighbours of a word can be retrieved for the particular sense in which it is used.
2. Sentence Transformer: Sentence Transformer is a Python library for generating dense vector representations of sentences and paragraphs using pre-trained transformer-based models. These models are trained on large-scale natural language processing tasks such as question answering, natural language inference, and machine translation, and have been shown to be highly effective in a variety of downstream tasks such as sentence similarity, sentiment analysis, and text classification.
3. Objective code explanation: This code defines a function get_distractors() that takes a word, an original sentence, and some language models as input, and returns a list of distractors for that word.
The function first generates a list of candidate distractors using a sense2vec model, which is a word embedding model that incorporates information about the different senses of a word. The candidate distractors are generated by finding words that are most similar to the sense of the input word and belong to the same sense category, such as "NOUN", "PERSON", "PRODUCT", "LOC", "ORG", "EVENT", "NORP", "WORK OF ART", "FAC", "GPE", "NUM", or "FACILITY". The function also filters out candidate distractors that are too similar to the original word or already present in the original sentence.
If no candidate distractors are generated by the sense2vec model, the function returns an empty list.
The function then uses a sentence transformers model to calculate embeddings for the original sentence and each candidate distractor. It applies the Maximal Marginal Relevance (MMR) algorithm to select the top n most diverse distractors from the candidate list. MMR is a heuristic that aims to select a diverse set of items from a larger pool of candidates, by maximizing the similarity of each selected item to a target concept, while minimizing its similarity to previously selected items.
The MMR algorithm is applied by calculating the cosine similarity between the embedding of the original sentence and each candidate distractor, as well as the cosine similarity between each candidate distractor and the embeddings of the previously selected distractors. The function selects the top n distractors with the highest MMR scores and returns them as the final distractor list.
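As a usage illustration, the embedding and selection steps described above can be combined as follows. The checkpoint name and example data are assumptions, and mmr() refers to the sketch given earlier in the methodology:

```python
from sentence_transformers import SentenceTransformer

# Assumed checkpoint; any pretrained Sentence Transformer model works similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentence = "The mitochondrion is the powerhouse of the cell."
candidates = ["chloroplast", "ribosome", "nucleus", "golgi apparatus"]

doc_embedding = model.encode([sentence])         # shape (1, 384)
candidate_embeddings = model.encode(candidates)  # shape (4, 384)

# Select a diverse top-3 with the mmr() function sketched above;
# lambda balances relevance against redundancy.
top_distractors = mmr(doc_embedding, candidate_embeddings, candidates,
                      top_n=3, lambda_param=0.7)
print(top_distractors)
```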
V. CONCLUSION
In conclusion, question generation using NLP is a promising area of research that has seen significant advances in recent years. Various approaches have been proposed for generating questions from text, including rule-based systems, neural network models, and unsupervised methods. These approaches have been applied in various domains, including educational systems, chatbots, and search engines, to enhance their functionality and improve user experience. The studies reviewed in this literature survey demonstrate that NLP techniques can effectively capture key information from text and generate relevant and useful questions. However, there are still challenges to be addressed, such as the generation of questions that are diverse, non-redundant, and reflect a range of levels of complexity. Additionally, there is a need for further research to evaluate the performance of question generation systems across multiple domains and languages, and to explore their potential applications in new areas.
REFERENCES
[1] Zhou, Q., Yang, N., Wei, F., & Tan, C. L. (2017). Neural question generation from text: A preliminary study. arXiv preprint arXiv:1704.01792.
[2] Du, X., Shao, J., & Cardie, C. (2017). Learning to ask: Neural question generation for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL).
[3] Alsaedi, M., Cetinic, E., Liu, H., & Yang, H. (2019). Automatic question generation for literature review writing support. Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE).
[4] Gao, J., Li, W., Lin, Y., Zhang, M., Liu, Y., & Huang, Y. (2021). Question generation from knowledge graphs with neural machine translation. IEEE Transactions on Neural Networks and Learning Systems, 32(3), 1213-1223.
[5] Zhang, L., & Lee, W. S. (2003). Question classification using support vector machines. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 26(1), 26-32.
[6] Zhang, X., Guo, D., Yu, Y., & Wang, X. (2021). A natural language processing approach for question generation. Journal of Ambient Intelligence and Humanized Computing, 12(5), 4555-4566.
[7] See, A., Liu, P. J., & Manning, C. D. (2017). Get to the point: Summarization with pointer-generator networks. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1073-1083.
[8] Nallapati, R., Zhai, F., & Zhou, B. (2016). SummaRuNNer: A recurrent neural network-based sequence model for extractive summarization of documents. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 747-756.
[9] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., ... & Kingsbury, B. (2012). Deep neural networks for acoustic modelling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.
[10] Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649.
Copyright © 2023 Sathwik Chettukindi, Uma Maheshwar Rao Dhinthakurthy, Chakrapani Seepathi, Dr. RRS. Ravi Kumar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET50477
Publish Date : 2023-04-15
ISSN : 2321-9653
Publisher Name : IJRASET