Sign language is a mode of communication that uses a variety of hand movements and actions to convey a message. Deciphering these motions can be framed as a pattern recognition problem. People use a range of gestures and behaviours to communicate with one another. This study presents a human-computer interface system that can identify American Sign Language gestures and produce textual output reflecting the meaning of each gesture. To identify and learn gestures, the proposed system employs convolutional neural networks and long short-term memory networks. This will help bridge the communication gap.
I. INTRODUCTION
Sign language is a language for the deaf and dumb that uses simultaneous orientation and motion of hand shapes rather than acoustically conveyed sounds. Deaf and dumb people depend on sign language interpreters to interact. Finding competent and experienced translators for their day-to-day concerns throughout their entire lives, however, is a time-consuming and costly endeavour.
Sign language is the most fundamental form of communication for persons who are deaf or hard of hearing, and those who lack access to interpretation endure difficulties in their everyday lives. Our idea is to produce a system that enables such interactions. Sign communication describes the use of the hands to produce shapes or motions, defined by their relation to the head or other parts of the body, along with distinct facial expressions.
As a result, a classification system must be capable of detecting various hand orientations and gestures, as well as expressions and even hand position. I propose a simple but extendable system capable of distinguishing static and dynamic ASL motions, focusing on the characters a-z. American Sign Language was chosen because the majority of the deaf community uses it.
II. EXISTING LITERATURE
Throughout my research, I came across a number of publications focusing on translation systems for deaf and dumb people, as well as their numerous components and methodologies.
Sakshi Goyal and Ishita Sharma (2013) created a real-time sign identification system that collects the data, which is subsequently divided into numerous frames, with features such as the Gaussian difference and centroids used for feature extraction. [1]
Iker Vazquez Lopez (2017) created a language transcriptor, software that interprets hand movements in photographs by applying image identification and analysis algorithms. The identification of gestures is divided into three phases: hand location, hand segmentation, and categorization. [2] Prof. Radha S. Shirbhate and Mr. Vedant D. Shinde (2020) built a sign language classification system using various computer vision algorithms, such as SVM and KNN, to produce automated sign language recognition implemented in real time using multiple tools. [3]
Mohammad Elham Walizad and Mehreen Hurroo (2020) developed a sign language recognition system in which convolutional networks and machine vision are used. Segmentation strategies are employed to extract the whole skin-tone region. Because the images generated by OpenCV are all shrunk to the same size, there is no visible size difference between shots of different gestures. [4]
III. PROPOSED SYSTEM
In this approach, the camera gathers images and stores them in the database under unique folders for each letter and number.
The image is captured in RGB format and then converted to grayscale, since grayscale stores only intensity information, making it much easier to apply a threshold and obtain a binary image. Grayscale thresholding is then used to turn the pictures into binary images. For noise removal, I use the Gaussian filter, since it smooths the image effectively and is faster than alternatives such as the median filter.
Thresholding is essential for reducing background noise and retaining only the hand in the image. After that, the image is passed to the CNN, which matches the sequence and converts it to text.
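A minimal sketch of this preprocessing chain, assuming OpenCV; the blur kernel size and the use of Otsu's method to pick the threshold are illustrative choices, not taken from the paper.

```python
import cv2

def preprocess(frame_bgr):
    """Convert a captured colour frame into a binary image of the hand."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)   # keep intensity information only
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)          # Gaussian filter to suppress noise
    # Grayscale thresholding to obtain a binary image (Otsu picks the cut-off automatically)
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```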
There are two CNN layers in total. The first CNN layer categorizes the 26 symbols, while the second layer classifies similar-looking symbols.
IV. WORKING SYSTEM
The framework is vision-based. All of the signs are represented with bare hands, which eliminates the need for any artificial devices for interaction.
A. Data Set Generation
To generate the dataset, I used the Open Computer Vision (OpenCV) package. Around 800 photographs of each ASL symbol were captured for training and approximately 200 images of each symbol for testing.
First, each frame displayed by the machine's camera is captured. A region of interest (ROI), symbolized by a blue delimited square as illustrated in the figure below, is designated in each frame. The RGB ROI from the picture sequence is then converted to monochromatic colour.
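A sketch of the capture loop described above, assuming OpenCV and a webcam at index 0; the folder layout, ROI coordinates, and key binding are illustrative assumptions, while the 800 images per symbol follows the text.

```python
import os
import cv2

label = "A"                                   # current ASL symbol being recorded
out_dir = os.path.join("dataset", "train", label)
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)
count = 0
while count < 800:                            # ~800 training images per symbol
    ok, frame = cap.read()
    if not ok:
        break
    x0, y0, x1, y1 = 100, 100, 400, 400       # region of interest (blue square)
    cv2.rectangle(frame, (x0, y0), (x1, y1), (255, 0, 0), 2)
    roi = frame[y0:y1, x0:x1]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)   # monochromatic ROI
    cv2.imshow("capture", frame)
    if cv2.waitKey(1) & 0xFF == ord("c"):     # press 'c' to save the current ROI
        cv2.imwrite(os.path.join(out_dir, f"{count}.jpg"), gray)
        count += 1

cap.release()
cv2.destroyAllWindows()
```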
B. Gesture Classification
Pre-processing — Coloured pictures carry much more information, which requires a huge amount of time and resources to train the model. We remedy this by converting the original coloured image to a black-and-white one.
To forecast the user's final symbol, the technique employs two levels of algorithms.
1. CNN Layer 1
a. The Gaussian blur filter and threshold are applied to the OpenCV frame to obtain the transformed picture after feature extraction.
b. The transformed picture is sent to the CNN model for prediction, and if a letter is recognised for more than 50 frames, it is printed and used to build the word.
c. The blank symbol is used to represent the space between the words.
2. CNN Layer 2
I identified several groups of letters that produce comparable results when detected, and then used classifiers designed specifically for those sets to differentiate between them.
During testing, we discovered that the following symbols were not being detected reliably and were being confused with other symbols:
For D : R and U
For U : D and R
For I : T, D and K
For S : M and N
So, in order to handle the aforementioned cases, we created three distinct classifiers for categorizing these sets (a sketch of how they are applied follows the list):
{D,R,U}
{T,K,D,I}
{S,M,N}
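A sketch of how the second-layer classifiers could be wired into prediction; the model objects, their 128 × 128 single-channel input shape, and the assumption that each sub-classifier's outputs are ordered alphabetically are illustrative, while the three ambiguous sets follow the text.

```python
import numpy as np

AMBIGUOUS_SETS = [
    {"D", "R", "U"},
    {"T", "K", "D", "I"},
    {"S", "M", "N"},
]

def predict_symbol(image, layer1_model, layer2_models, class_names):
    """image: preprocessed 128x128 binary image as a 2-D array."""
    batch = image[np.newaxis, ..., np.newaxis]          # shape (1, 128, 128, 1)
    probs = layer1_model.predict(batch)
    symbol = class_names[int(np.argmax(probs))]

    # Route to the specialised classifier if the symbol belongs to an ambiguous set.
    for idx, group in enumerate(AMBIGUOUS_SETS):
        if symbol in group:
            group_names = sorted(group)                 # assumes alphabetical output order
            sub_probs = layer2_models[idx].predict(batch)
            symbol = group_names[int(np.argmax(sub_probs))]
            break
    return symbol
```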
C. Training/Testing
To reduce superfluous noise, we transform our RGB input frames to grayscale and apply a Gaussian blur. To separate the hand from the backdrop, we use adaptive thresholding and scale our images to 128 × 128. After performing all of the procedures listed above, we feed the pre-processed input images to our model for training and testing. The prediction layer calculates the likelihood that the picture falls into each of the classifications. The output is normalised between 0 and 1, and the values across the classes sum to 1. We accomplish this by using the softmax function.
At the start, the prediction layer's output will be considerably off from the real value. To improve it, we trained the network with labelled data. Cross-entropy is a performance metric used in classification; it is a continuous function that is positive when the prediction differs from the labelled value and zero when they match. As a result, we minimised the cross-entropy by bringing it as near to zero as possible. To do this, we adjust the weights of our neural network at each layer. TensorFlow has a built-in function for calculating cross-entropy. Having defined the cross-entropy loss, we optimised it using gradient descent; specifically, we used the Adam optimizer, a widely used variant of gradient descent.
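A minimal sketch of a network of the kind described, assuming TensorFlow/Keras; the layer sizes and dropout rate are illustrative assumptions, while the 128 × 128 grayscale input, softmax output, cross-entropy loss, and Adam optimizer follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 27  # 26 letters plus the blank symbol

model = models.Sequential([
    # Illustrative convolutional stack; the paper does not specify exact layer sizes.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),   # probabilities normalised to sum to 1
])

# Cross-entropy is driven toward zero during training with the Adam optimizer.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10,
#           validation_data=(test_images, test_labels))
```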
D. Predictability
In this section, a graphical user interface (GUI) is created. We develop a frame that accepts input, analyses it, and then forecasts the outcome using the model we built; the result is presented in the GUI.
V. IMPLEMENTATION
When the count of a detected letter exceeds a certain threshold and no other letter's count is within a certain distance of it, we print the letter and add it to the current string (in the code the count threshold is 50 and the difference threshold is 20).
Otherwise, we clear the current dictionary, which stores the number of detections of each symbol, to reduce the possibility of predicting the wrong letter.
When the count of detected blanks (plain backgrounds) exceeds a certain threshold and the current buffer is empty, no space is printed.
Otherwise, it treats the blank as the end of the word, prints a space, and appends the current word to the sentence below it.
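A minimal sketch of this word-building logic, assuming one predicted symbol is supplied per frame; the counts 50 and 20 follow the values quoted above, while the class and method names are illustrative.

```python
COUNT_THRESHOLD = 50   # frames a symbol must be seen before it is accepted
DIFF_THRESHOLD = 20    # required lead over every other candidate symbol

class SentenceBuilder:
    def __init__(self):
        self.counts = {}       # detections of each symbol since the last reset
        self.word = ""         # letters of the word currently being signed
        self.sentence = ""     # words confirmed so far

    def update(self, symbol):
        """Feed one per-frame prediction ('A'..'Z' or 'blank')."""
        self.counts[symbol] = self.counts.get(symbol, 0) + 1
        best = max(self.counts, key=self.counts.get)
        best_count = self.counts[best]
        if best_count <= COUNT_THRESHOLD:
            return                                    # not enough evidence yet
        runner_up = max((c for s, c in self.counts.items() if s != best), default=0)
        self.counts = {}                              # clear the detection dictionary
        if best_count - runner_up <= DIFF_THRESHOLD:
            return                                    # ambiguous: discard to avoid a wrong letter
        if best == "blank":
            if self.word:                             # end of word: append it to the sentence
                self.sentence += self.word + " "
                self.word = ""
        else:
            self.word += best                         # confirmed letter joins the current word
```

Each frame's predicted symbol from the classifier would be passed to update(), and the GUI would display the current word and sentence.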
VI. RESULTS
The model achieved 95.8 percent accuracy using only layer 1 of the technique, and 98.0 percent accuracy when layer 1 and layer 2 were combined, which is higher than the majority of existing research articles on American Sign Language. Most of those research publications concentrate on the use of devices such as Kinect for hand detection.
On the other hand, while the majority of the projects discussed above make use of Kinect devices, our main goal was to design a project that could be used with widely available resources. A sensor like Kinect is not only not widely available but also prohibitively costly for most of the audience, whereas our solution makes use of a standard laptop camera, which is a significant advantage.
The confusion matrices for the findings are shown below.
VII. FUTURE ENHANCEMENT
In addition to the model, by experimenting with various background-removal methods, we hope to obtain improved accuracy even in the case of complicated backgrounds. We are also considering upgrading the pre-processing to better recognise gestures in low-light conditions.
VIII. CONCLUSION
As part of this initiative, a way for assisting deaf and dumb people in communicating more easily has been developed, so that there should be no communication barriers between us and them.
The convolutional neural network's goal is to find the correct categorization. A sign language recognition system is a powerful tool for developing expert knowledge, detecting edges, and merging imperfect information from several sources.
REFERENCES
[1] Sakshi Goyal, Ishita Sharma, Shanu Sharma, "Sign Language Recognition System For Deaf And Dumb People", International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 4, April 2013.
[2] Iker Vazquez Lopez, "Hand Gesture Recognition for Sign Language Transcription", Boise State University, Research and Economic Development, 2017.
[3] Prof. Radha S. Shirbhate, Mr. Vedant D. Shinde, Ms. Sanam A. Metkari, Ms. Pooja U. Borkar, Ms. Mayuri A. Khandge, "Sign Language Recognition Using Machine Learning Algorithm", International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, Vol. 7, Issue 3, March 2020.
[4] Mehreen Hurroo, Mohammad Elham Walizad, "Sign Language Recognition System using Convolutional Neural Network and Computer Vision", International Journal of Engineering Research & Technology (IJERT), Vol. 9, Issue 12, December 2020.
[5] Mayuresh Keni, Shireen Meher, Aniket Marathe, "Sign Language Recognition System", International Journal of Engineering Research & Technology (IJERT), ICONECT'14 Conference Proceedings.
[6] H.C.M. & W.A.L.V. Kumari, & Senevirathne, W.A.P.B. & Dissanayake, Maheshi, "Image Based Sign Language Recognition System For Sinhala Sign Language", Conference Paper, April 2013.
[7] D. Prakhya, M. Sri Manjari, A. Varaprasadh, NSV Krishna Reddy, D. Krishna, "A Robust Sign Language And Hand Gesture Recognition System Using Convolution Neural Networks".
[8] Hemlata Dakhore, Manali Landge, Shivani Patil, Tanushree Patil, Shrutika Zyate, Ashwini Moon, Raveena Lade, "Sign Language Recognition Using Machine Learning", International Journal of All Research Education and Scientific Methods (IJARESM), ISSN: 2455-6211, Volume 9, Issue 6, June 2021, Impact Factor: 7.429.
[9] Lean Karlo S. Tolentino, Ronnie O. Serfa Juan, August C. Thio-ac, Maria Abigail B. Pamahoy, Joni Rose R. Forteza, and Xavier Jet O. Garcia, "Static Sign Language Recognition Using Deep Learning", International Journal of Machine Learning and Computing, Vol. 9, No. 6, December 2019.