Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Mr. G. Sekhar Reddy, A. Sahithi, P. Harsha Vardhan, P. Ushasri
DOI Link: https://doi.org/10.22214/ijraset.2022.42078
Sign Language Recognition (SLR) is a significant and promising technique for facilitating communication with hearing-impaired people. Here, we are dedicated to finding an efficient solution to the gesture recognition problem. This work develops a sign language (SL) recognition framework with deep neural networks that directly transcribes videos of SL signs into words. We propose a novel approach based on video sequences, which contain both temporal and spatial features, and therefore train two different models, one for each kind of feature. A convolutional neural network (CNN) is trained on the spatial features, using the frames extracted from the video sequences of the training data. A recurrent neural network (RNN) is trained on the temporal features: the trained CNN makes predictions for individual frames, producing a sequence of predictions or pooling-layer outputs for each video, and this sequence is then given to the RNN. Thus, we perform sign language translation: given an input video, the sign shown in it is recognized using the CNN and RNN and converted to text and speech.
I. INTRODUCTION
Sign language is the technique through which deaf and mute individuals communicate. People who know sign language can converse with them; untrained people cannot, since communicating in this way requires learning sign language. For such cases, a sign-language-to-text system is useful because it allows hearing-impaired people to converse with others more fluently. Sign language is a physical form of communication that uses the hands and eyes; different hand shapes and movements are used to express feelings. The task is to translate this sign language into text or speech. Here, we are dedicated to finding an efficient solution to the gesture recognition problem. This research uses deep neural networks to create a sign language (SL) recognition framework that directly transcribes videos of SL signs into words. Both the temporal and the spatial features are learned, using two separate models: a convolutional neural network (CNN) is trained on the spatial properties of the video sequences, and a recurrent neural network (RNN) is trained on the temporal features. As a result, we perform sign language translation, in which a video is provided as input and the sign exhibited in the video is detected and converted into text and speech using the CNN and RNN.
II. LITERATURE SURVEY
There has been a lot of research into hand sign language gesture recognition in recent years. The technologies used to recognize gestures are described below.
A. Vision-based
In vision-based approaches, a camera is used to capture information about the hands or fingers. Vision-based approaches require only a camera, allowing natural human-computer interaction without any additional hardware. By describing artificial vision systems implemented in software and/or hardware, these systems tend to complement biological vision. This is a difficult problem to solve: to attain real-time performance, such systems must be insensitive to background and illumination and independent of the person and the camera. Furthermore, they must be tailored to meet the requirements, which include accuracy and robustness.
2. Automatic Indian Sign Language Recognition for Continuous Video Sequence [2]: the proposed system consists of four primary modules, namely data acquisition, pre-processing, feature extraction, and classification. Skin filtering and histogram matching are applied in the pre-processing step, followed by eigenvector-based feature extraction and an eigenvalue-weighted Euclidean distance-based classification technique. In this work, 24 different alphabets were considered, and a 96 percent recognition rate was achieved.
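To make the classification idea in [2] concrete, the following is a minimal, illustrative sketch of an eigenvalue-weighted Euclidean distance classifier over eigenvector (PCA) features. It is a reconstruction for illustration only, not the exact procedure of [2]; the function names, the number of retained components, and the per-class mean representation are assumptions.

```python
import numpy as np

def eigen_features(frames, k=10):
    """Project flattened frames onto the top-k eigenvectors (PCA) and
    return the projections together with the corresponding eigenvalues."""
    X = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # indices of the top-k components
    return X_centered @ eigvecs[:, top], eigvals[top]

def classify(sample_feat, class_means, eigvals):
    """Return the class whose mean feature vector minimises the
    eigenvalue-weighted Euclidean distance to the sample."""
    best_label, best_dist = None, np.inf
    for label, mean_feat in class_means.items():
        dist = np.sqrt(np.sum(eigvals * (sample_feat - mean_feat) ** 2))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```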
III. MOTIVATION
Hearing-impaired people communicate through hand signs, which makes it difficult for others to understand their language. As a result, systems that recognize the various signs and convey the information to ordinary people are required. The fundamental issue is that many signs cannot be expressed in a single image, whereas they can be captured in video sequences. The key aim here is therefore to detect the sign in a video sequence and translate it into text and speech that people can understand.
IV. EXISTING SYSTEM
A. Typically, image classification is used for sign language recognition, with machine learning algorithms such as K-nearest neighbours, decision trees, and support vector machines classifying the sign shown in a single image.
B. Some researchers employed a Leap Motion Controller (LMC) sensor to measure the angles between the fingers' joints. Devices such as the Kinect sensor have also been used to extract the skeletal features of people.
C. Various works on gesture recognition have used fingertip detection. Other systems use flex sensors, an onboard gyroscope, and an accelerometer to recognize hand gestures; continuous-wave radar signals have also been employed for gesture recognition.
D. Previously, the Hidden Markov Model (HMM) was used to model sign language and other sequences; it is still used in voice recognition systems but is inefficient for sign language. Earlier methods dealing with continuous SL recognition relied on hidden Markov models, which have a limited capacity to capture temporal information.
V. PROPOSED SYSTEM
Not all signs can be expressed in a single image, so a system that recognizes sign language exclusively from images is limited. To compensate for this limitation of existing systems based on image classification, we use a CNN and an RNN to classify videos. The spatial properties of the hand signs are extracted using the CNN. The CNN model's output is fed into the RNN model for sequence modelling, which determines which sign is shown in the video. The detected sign is then translated into text and speech.
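As a rough illustration of this pipeline, the sketch below shows how the output of a trained CNN can be passed to a trained RNN at inference time. It assumes both networks have already been trained and saved as Keras models; the file names, label vocabulary, and tensor shapes are illustrative assumptions, not the paper's actual artifacts.

```python
import numpy as np
from tensorflow.keras.models import load_model

cnn = load_model("cnn_spatial.h5")    # frame-level model (assumed file name)
rnn = load_model("rnn_temporal.h5")   # sequence-level model (assumed file name)
LABELS = ["hello", "thanks", "yes", "no"]   # placeholder gesture vocabulary

def recognise_sign(frames):
    """frames: array of shape (num_frames, H, W, 3), already preprocessed."""
    frame_outputs = cnn.predict(np.asarray(frames))     # one vector per frame
    sequence = np.expand_dims(frame_outputs, axis=0)    # (1, num_frames, dim)
    probabilities = rnn.predict(sequence)[0]
    return LABELS[int(np.argmax(probabilities))]
```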
VI. SYSTEM MODEL
The architecture provides the entire process flow of the system.
The above architecture describes the entire process involved in converting signs in video sequences to text and speech. The signed video uploaded by the user is divided into several frames, from which the hand gestures are extracted. The frames are given as input to the CNN (convolutional neural network), which extracts the spatial features and returns an array of values to the RNN (recurrent neural network). The RNN extracts the temporal features and recognizes the sign in the video. The sign is then converted into text and speech.
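The frame-extraction step described above could look roughly like the following OpenCV sketch; the sampling interval and target frame size are illustrative choices, and the paper's hand-gesture extraction (e.g., cropping or segmentation of the hand region) is not reproduced here.

```python
import cv2

def extract_frames(video_path, every_n=5, size=(224, 224)):
    """Read a signed video and return resized RGB frames, sampling every n-th frame."""
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # OpenCV reads BGR
            frames.append(cv2.resize(frame, size))
        index += 1
    capture.release()
    return frames
```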
VII. IMPLEMENTATION
A. Algorithms Used
Since a video sequence comprises both temporal and spatial features, video classification is a difficult problem. The spatial features are taken from the individual video frames, while the temporal features are extracted by relating the frames over time. To train our model on each type of feature, we used two different learning networks: a CNN for the spatial features and a recurrent neural network for the temporal features.
B. Methodology
Two approaches for training the model on the temporal and spatial features were used; they differed in how the inputs were given to the RNN to train it on the temporal features.
The trained CNN was used to make predictions for frames from both the training and test videos. Each gesture video was broken down into a series of frames, so once the CNN has been trained and its predictions made, the video is represented as a sequence of predictions (a sketch of this step is given below, after the next step).
3. Training RNN (Temporal Features): the videos for each gesture are fed to the RNN as sequences of the predictions for their constituent frames. The RNN thereby learns to recognize each gesture as a sequence of predictions. After the RNN training completes, a model file is created (Fig. 7).
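A minimal sketch of the prediction-sequence step, assuming the trained CNN is a Keras model and reusing the hypothetical extract_frames helper from the earlier sketch; the normalisation and batch size are assumptions.

```python
import numpy as np

def video_to_sequence(cnn, frames, batch_size=32):
    """Return an array of shape (num_frames, output_dim):
    one CNN prediction (or pooling-layer output) per frame."""
    batch = np.asarray(frames, dtype="float32") / 255.0   # simple normalisation
    return cnn.predict(batch, batch_size=batch_size)

# One sequence per gesture video, e.g.:
# sequences = [video_to_sequence(cnn, extract_frames(path)) for path in train_videos]
```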
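The RNN training stage could then be sketched as follows, assuming Keras/TensorFlow with an LSTM layer; the number of classes, maximum sequence length, feature dimension, and training settings are illustrative values, not the paper's reported configuration.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

NUM_CLASSES = 10     # placeholder number of gestures
MAX_FRAMES = 40      # sequences padded/truncated to this length
FEATURE_DIM = 2048   # assumed size of the CNN output vector per frame

def build_rnn():
    model = Sequential([
        LSTM(256, input_shape=(MAX_FRAMES, FEATURE_DIM)),
        Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# sequences: list of per-video CNN output sequences; labels: one-hot gesture labels
# X = pad_sequences(sequences, maxlen=MAX_FRAMES, dtype="float32", padding="post")
# rnn = build_rnn()
# rnn.fit(X, np.asarray(labels), epochs=30, validation_split=0.1)
# rnn.save("rnn_temporal.h5")   # the model file created after training
```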
VIII. RESULTS
The proposed system was successfully tested to demonstrate its effectiveness and feasibility; the sign in the video sequence is converted into text and speech. Fig. 8 shows the command-line interface (CLI) through which the user uploads the sign video. The result is displayed as follows:
After the user uploads the video, the sign in the video is displayed as text and spoken aloud as speech, as shown in Fig. 9.
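The paper does not name the text-to-speech engine it uses; as one possible sketch, the recognized label could be printed on the CLI and spoken with the offline pyttsx3 library.

```python
import pyttsx3

def output_sign(label):
    print("Recognised sign:", label)   # text shown on the CLI
    engine = pyttsx3.init()
    engine.say(label)                  # speech output
    engine.runAndWait()

# output_sign(recognise_sign(frames))
```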
IX. FUTURE SCOPE
This research can be expanded in the future to recognize continuous sign language motions with greater accuracy. The approach used here for individual gestures can likewise be applied to sentence-level sign language. In addition, the current procedure employs two distinct models, a trained Inception CNN and a trained RNN; future work could concentrate on fusing the two into a single model. The proposed system could also be developed and deployed on a Raspberry Pi. Image processing should be upgraded so that the system can communicate in both directions, i.e., it should be capable of converting conventional language to sign language and vice versa, and it should focus on transforming sequences of gestures into sentences and subsequently into text and voice.
X. ACKNOWLEDGMENT
We express our sincere gratitude to our guide, Assistant Professor Mr. G. Sekhar Reddy for suggestions and support during every stage of this work. We also convey our deep sense of gratitude to Professor Dr. K. S. Reddy, Head of the Information Technology department.
XI. CONCLUSION
We presented a vision-based method for interpreting single hand motions from sign language in this paper. To classify the spatial and temporal features, this study used a prediction approach: the spatial features were classified using a CNN, whereas the temporal features were classified using an RNN. This demonstrates how a CNN and an RNN may be combined to learn spatial and temporal information and interpret sign language gesture videos as text or speech.
REFERENCES
[1] Ronchetti, Franco, Facundo Quiroga, Cesar Armando Estrebou, and Laura Cristina Lanzarini. "Handshape recognition for Argentinian Sign Language using ProbSom." Journal of Computer Science and Technology 16 (2016).
[2] Singha, Joyeeta, and Karen Das. "Automatic Indian Sign Language Recognition for Continuous Video Sequence." ADBU Journal of Engineering Technology 2 (2015).
[3] Tripathi, Kumud, Neha Baranwal, and G. C. Nandi. "Continuous Indian Sign Language Gesture Recognition and Sentence Formation." Procedia Computer Science 54 (2015): 523-531.
[4] Cooper, Helen, Eng-Jon Ong, Nicolas Pugeault, and Richard Bowden. "Sign language recognition using sub-units." Journal of Machine Learning Research 13, no. Jul (2012): 2205-2231.
[5] Cooper, Helen, Brian Holt, and Richard Bowden. "Sign language recognition." In Visual Analysis of Humans, pp. 539-562. Springer London, 2011.
[6] Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies with gradient descent is difficult." IEEE Transactions on Neural Networks 5, no. 2 (1994): 157-166.
Copyright © 2022 Mr. G. Sekhar Reddy, A. Sahithi, P. Harsha Vardhan, P. Ushasri. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET42078
Publish Date : 2022-04-30
ISSN : 2321-9653
Publisher Name : IJRASET