IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Hrithick Gokul Y, Dinesh N, Hariharan R
DOI Link: https://doi.org/10.22214/ijraset.2022.46830
Machine translation has undergone a revolution in recent years. Older translation systems developed by linguists drawing on decades of statistical study have been superseded by systems built with deep learning. At the core of such a system is a neural network that takes a sentence as input and generates its translation. This paper discusses a model that uses an intermediate numerical encoding to learn the mapping from one human language to another. Two neural networks connected end-to-end form the basis of the neural translation system: the first encodes a sequence of Spanish words into an array of numbers that captures its meaning, and the second learns to decode those numbers into a sequence of English words with the same meaning.
I. INTRODUCTION
Language barriers can still make it difficult to access information in today's globalized world. In many cases the demand for translation cannot be met by human translators alone; as a result, technologies such as Machine Translation (MT) are becoming more and more popular because of their capacity to address this problem. One effect of this situation is the expanding need for solutions that make content available in several languages to diverse audiences, and MT is highlighted as one of the tools with the greatest potential to meet this need, receiving attention from both academic and professional contexts.

Much like everything else in machine learning, machine translation is developing fast. The tools are becoming simpler to use, GPUs are becoming more powerful, and there is more training data than ever before. With today's technology and software, we can construct a language translation system that is effective enough to be used in real projects. Deep learning has completely changed how we think about machine translation: the greatest expert-built translation systems in the world are being outperformed by comparatively simple machine learning techniques developed by deep learning researchers. The technology underlying this innovation is known as sequence-to-sequence learning. Sequence-to-sequence translation techniques are advancing at such a rapid rate that it is challenging to keep up; several recent developments (such as the addition of attention mechanisms or context tracking) have greatly improved results. Anyone who intends to use sequence-to-sequence learning in a significant way must stay up to date with new advances as they appear.

This study can provide useful guidance to educators, practitioners, and researchers. The paper is organized as follows. Section 2 describes the method used for carrying out the systematic review of articles, Section 3 presents the related work, Section 4 describes the proposed model, Section 5 describes the whole pipeline of the project, and Section 6 concludes the paper.
II. LITERATURE SURVEY
A. Search Strategy
The following international online bibliographic databases were searched in September 2021: (a) IEEE Xplore, (b) SpringerLink, (c) the ACM Digital Library, and (d) ScienceDirect. The search phrases used were "machine translation" AND "translation system". Searches were restricted to the titles of book chapters and of articles appearing in English-language journals and conference proceedings published in 2020 or later.
B. Inclusion Criteria
The articles are chosen based on the following inclusion criteria:
C. Exclusion Criteria
Further filtering took place by excluding the articles based on the following exclusion criteria:
III. RELATED WORK
A. Rule-based Translation
The easiest method is simply to substitute each word in a sentence with its equivalent in the target language. This is easy to implement, since all we need is a bilingual dictionary in which to look up each word's translation. However, the results are poor because grammar and context are disregarded.
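As a toy illustration of this word-for-word approach, consider the sketch below. The dictionary entries are made up for the example and are not part of any real system:

# Word-for-word substitution with a tiny, hand-made Spanish-English dictionary.
dictionary = {
    "yo": "I",
    "tengo": "have",
    "un": "a",
    "gato": "cat",
    "negro": "black",
}

def translate_word_for_word(sentence):
    words = sentence.lower().split()
    # Fall back to the original word when there is no dictionary entry.
    return " ".join(dictionary.get(word, word) for word in words)

print(translate_word_for_word("Yo tengo un gato negro"))
# Prints "I have a cat black": word order and grammar are ignored.

The output shows exactly the weakness described above: the adjective stays in its Spanish position.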
The next step taken to improve the results was to add language-specific rules. For instance, common two-word phrases can be translated as a single unit. Additionally, since nouns and adjectives typically appear in the opposite order in Spanish compared with English, the two are swapped during translation. In principle, if we keep adding rules until every aspect of grammar is handled, the algorithm ought to be able to translate any sentence.
The early machine translation systems operated in this manner: linguists developed complex rules that were coded in one at a time. During the Cold War, several of the world's top linguists worked for years on translation systems intended to make Russian communications easier to understand. The problem is that human language has no clear-cut rules; it is full of special cases, regional variations, and outright rule-breaking. How we speak English was shaped more by who invaded whom hundreds of years ago than by anyone sitting down and writing grammatical rules. Unfortunately, this approach only worked for simple documents with a basic format, such as weather reports; for real-world documents it was unreliable.
B. Statistical Machine Translation
Given adequate training data, statistical machine translation systems perform significantly better than rule-based systems. Franz Josef Och enhanced these concepts and used them to build Google Translate in the early 2000s, making machine translation widely available for the first time. The primary distinction of statistical systems is that they do not attempt to produce one exact translation. Instead, they generate thousands of candidate translations and rank them according to how likely each is to be correct, where "correctness" is judged by how closely a candidate resembles the training data.
Statistical machine translation systems work well, but they are difficult to build and maintain: experts must adjust and fine-tune a new multi-step translation pipeline for every language pair we wish to translate. Because building these pipelines requires so much labor, trade-offs have to be made. There are not enough Georgian-to-Telugu translations being requested to justify a significant investment in that language pair, so if we ask Google to translate Georgian to Telugu, it will internally convert the text into English as an intermediate step. If we ask it to translate from French to English, a far more popular pair, it may use a more sophisticated pipeline.
IV. PROPOSED MODEL
To translate text, the model employs two recurrent neural networks. The first is trained to encode sentences in Spanish, while the second is trained to decode them into English. A recurrent neural network (RNN) is a slightly modified form of a neural network in which the previous state of the network is used as an input to the next computation. This means that the results of previous computations influence the current one. With this mechanism, neural networks can discover patterns in a stream of data; we can use it, for instance, to predict the next word in a sentence from the first few words. RNNs are being employed more and more across natural language processing, since human language is essentially one huge, complex pattern.
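To make the idea of state reuse concrete, here is a minimal sketch of a single recurrent step in NumPy. The dimensions and random weights are placeholders for illustration, not values from the paper:

import numpy as np

# A minimal recurrent step: the previous hidden state is an extra input
# to the current computation (all sizes here are arbitrary placeholders).
input_size, hidden_size = 8, 16
W_x = np.random.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    # The new state depends on the current input AND the previous state.
    return np.tanh(W_x @ x + W_h @ h_prev + b)

h = np.zeros(hidden_size)                   # initial state
for x in np.random.randn(5, input_size):    # a sequence of five word vectors
    h = rnn_step(x, h)                      # information is carried forward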
First, we must divide the text into sentences. The neural network can only interpret one sentence at a time, so feeding it a full paragraph at once would give poor results. Although splitting text into sentences may seem simple, formatting and punctuation can be nested in many different ways, which makes it a surprisingly tricky task. To minimize the number of third-party libraries needed for this project, we use a straightforward sentence splitter written in Python. If this does not meet our needs, we can switch to the more sophisticated sentence-splitting models included in NLP libraries such as spaCy.
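A minimal Python splitter in the spirit described above might look as follows; the regular expression is an assumption for illustration, not the exact splitter used in the project:

import re

def split_sentences(text):
    # Split after '.', '!' or '?' followed by whitespace. This deliberately
    # simple heuristic ignores abbreviations, decimals and other edge cases.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

print(split_sentences("Hola. ¿Cómo estás? Bien, gracias."))
# ['Hola.', '¿Cómo estás?', 'Bien, gracias.']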
After that, the text is normalized. This is the hardest part to get right, and skipping it leads to poor results in the end. Neural networks cannot extrapolate beyond what they have observed in their training data: if the network has never seen the token "cOMPuTERs", it will not automatically understand that it means the same thing as "computers". The text therefore has to be normalized to eliminate as much formatting variance as possible. We ensure that words are capitalized consistently within a sentence, repair any odd punctuation formatting, remove the curly quotation marks that word processors such as MS Word insert, and more. The goal is that, regardless of how the user types a sentence, it is always fed to the model in the same form.
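A minimal sketch of such a normalization step is shown below. Lower-casing everything is one simple way to make capitalization consistent; the exact choices here are assumptions for illustration, not the project's actual normalizer:

import re
import unicodedata

def normalize(sentence):
    # Unicode-normalize and replace curly quotes with plain ASCII ones.
    sentence = unicodedata.normalize("NFKC", sentence)
    sentence = sentence.replace("\u201c", '"').replace("\u201d", '"')
    sentence = sentence.replace("\u2018", "'").replace("\u2019", "'")
    # Collapse repeated whitespace and lower-case for consistent capitalization.
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return sentence.lower()

print(normalize("The  \u201ccOMPuTERs\u201d   are   FAST."))
# the "computers" are fast.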
A. Data Collection
To train a translation model, millions of pairs of identical sentences translated into two different languages are required; such a dataset is called a parallel corpus. The more sentence pairs we have, the better the algorithm learns to translate different kinds of text. To build an industrial-strength model we need at least tens of millions of training sentences, and most importantly these sentences should cover all forms of human communication, from professional documentation to slang and humor.
OPUS (Open Parallel Corpus) has already done the work of collecting sentence pairs in numerous languages. The training script we run later downloads these data sources automatically from OPUS, giving us close to 85 million translated sentence pairs. OPUS organizes its files by language pair, so by downloading the data for a different pair instead of Spanish-English we could just as easily develop, for example, a Finnish-to-Italian translation model.
B. Data Pre-processing
Every data source from OPUS comes as two text files: one contains a list of sentences in English and the other contains the corresponding sentences in Spanish, in the same order. Since this produces a large number of text files, several steps are needed to prepare the training data.
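As a sketch of the first of those steps, aligned sentence pairs can be read from two such parallel files as shown below. The file names are hypothetical, and normalize is the helper sketched in the pre-processing discussion above:

def load_pairs(en_path="corpus.en", es_path="corpus.es", limit=None):
    pairs = []
    # OPUS ships one file per language, aligned line by line.
    with open(en_path, encoding="utf-8") as f_en, \
         open(es_path, encoding="utf-8") as f_es:
        for i, (en, es) in enumerate(zip(f_en, f_es)):
            if limit is not None and i >= limit:
                break
            en, es = en.strip(), es.strip()
            if en and es:                      # skip empty or misaligned lines
                pairs.append((normalize(es), normalize(en)))
    return pairs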
C. Translation Model
The model's general structure is as follows. At each time step the encoder receives a single element of the input sequence, processes it, gathers information about that element, and propagates it forward. The encoder's final internal state is the intermediate vector: it summarizes the whole input sequence so that the decoder can make accurate predictions. Given this representation of the entire sentence, the decoder then predicts one output word at each time step.
1. Encoder Part
The encoder essentially consists of LSTM/GRU cells. As it receives the input sequence, the encoder stores the information it reads in its internal state vectors; only these internal states are kept, and the encoder's outputs are discarded. If the input sequence is m words long, the LSTM takes m time steps to read it, since it reads one element at a time. The input at time step m is x_m. At time step m the LSTM holds two internal states, h_m and c_m (a GRU has only the single state h_m), and produces an output y_m.
Each input word x_m is represented as a vector using a word embedding, which turns every word into a fixed-length vector. The internal states h_m and c_m encode what the LSTM has read up to time step m. The initial states h_0 and c_0 are initialized either randomly or with zeroes; their dimension equals the number of units in the LSTM cell. The final states therefore capture the essence of the whole input sequence. The outputs y_m are the predictions the LSTM makes at each time step; in machine translation we only produce output after reading the complete input sequence, so the encoder's y_m is discarded at every time step.
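As a concrete illustration, the encoder can be sketched in a few lines of Keras. This is a minimal sketch rather than the authors' exact implementation; the vocabulary sizes, embedding dimension, and number of LSTM units are placeholder values:

from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model

# Placeholder sizes; in practice these depend on the corpus and tuning.
src_vocab_size, tgt_vocab_size = 10000, 10000
embed_dim, latent_dim = 256, 512

# Encoder: reads the Spanish token ids and keeps only its final states.
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab_size, embed_dim)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)   # outputs are discarded
encoder_states = [state_h, state_c]                         # h_m, c_m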
2. Decoder Part During Training
Unlike the encoder, which operates in the same way during training and testing, the decoder behaves differently in the two phases. For the decoder to recognize the start and end of a sequence, we add START_ at the beginning of every output sequence and _END after it.
The decoder's initial states (h_0, c_0) are set to the encoder's final states: the encoder gathers the information, and the decoder is trained to produce the output based on that information. To begin generating, we first feed in the START_ token, and we train the decoder to predict _END after the final word of the target English sentence. Here we employ teacher forcing, in which the actual target word from the previous time step, rather than the word the decoder predicted, is fed as input at each time step. The loss is computed on the predicted outputs of every time step, and the errors are backpropagated through time to update the model's parameters. The decoder's final states are ignored, since during training we already have the full output and they serve no further purpose.
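Continuing the encoder sketch above, the decoder during training can be wired up with teacher forcing as follows. This is only a sketch under stated assumptions: the layer names are placeholders, and decoder_input_data is assumed to contain the English sentences shifted right so that they begin with the START_ token:

from tensorflow.keras.layers import Dense

# Decoder (training): teacher forcing, initialized with the encoder's final states.
decoder_inputs = Input(shape=(None,))                # English tokens, starting with START_
decoder_embedding = Embedding(tgt_vocab_size, embed_dim)
dec_emb = decoder_embedding(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = Dense(tgt_vocab_size, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Loss is computed on the predicted word at every time step and backpropagated through time.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
# model.fit([encoder_input_data, decoder_input_data], decoder_target_data, ...)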
3. Decoder Part During Testing
The encoder's final states are used as the decoder's initial states. The decoder's LSTM processes one word per time step, and the first token fed in is START_. The internal states produced at each time step are passed on as the initial states of the next one; for example, the states generated after entering START_ at t = 1 become the initial states at t = 2. Likewise, each time step's predicted word is used as the input of the following time step. When the decoder predicts _END, we know that the sequence has ended.
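Continuing the same sketch, at test time the trained layers are rewired so that the decoder feeds its own prediction back in, one word per time step, until it emits _END. The word_index and index_word dictionaries mapping words to token ids are assumed to come from the pre-processing step; this is an illustrative sketch, not a complete implementation:

import numpy as np

# Inference models built from the trained layers above.
encoder_model = Model(encoder_inputs, encoder_states)

dec_state_h = Input(shape=(latent_dim,))
dec_state_c = Input(shape=(latent_dim,))
dec_states_in = [dec_state_h, dec_state_c]
dec_emb2 = decoder_embedding(decoder_inputs)
dec_out2, h2, c2 = decoder_lstm(dec_emb2, initial_state=dec_states_in)
dec_out2 = decoder_dense(dec_out2)
decoder_model = Model([decoder_inputs] + dec_states_in, [dec_out2, h2, c2])

def decode_sequence(input_seq, word_index, index_word, max_len=50):
    states = encoder_model.predict(input_seq, verbose=0)   # encoder's final states
    target = np.array([[word_index["START_"]]])            # start token
    decoded = []
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target] + states, verbose=0)
        token = int(np.argmax(probs[0, -1, :]))
        word = index_word.get(token, "")
        if word == "_END":                                  # end of sequence
            break
        decoded.append(word)
        target = np.array([[token]])                        # feed prediction back in
        states = [h, c]                                     # carry states forward
    return " ".join(decoded)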
V. CONCLUSION
The amount of training data available and the computing power we have access to are the main constraints on this method, which already outperforms statistical machine translation systems that took two decades to create. No knowledge of linguistic rules is required: the algorithm learns these rules itself, so we do not need experts to fine-tune every stage of the translation pipeline; the computer does that for us. Almost any sequence-to-sequence problem can be attacked with this approach, and many interesting problems turn out to be sequence-to-sequence problems. Techniques for sequence-to-sequence translation are advancing so rapidly that it is hard to keep up; many of the newest advances (such as attention mechanisms and context tracking) do not even have Wikipedia pages yet, even though they are greatly boosting results. Anyone who intends to use sequence-to-sequence learning seriously must stay up to date with new developments as they appear. This study helps the reader develop and test a fundamental translation system with its crucial features, aids in understanding the principles of the encoder-decoder model used in translation, and encourages further research in the area.
Copyright © 2022 Hrithick Gokul Y, Dinesh N, Hariharan R. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id: IJRASET46830
Publish Date: 2022-09-20
ISSN: 2321-9653
Publisher Name: IJRASET