IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Aman Bhoskar, Moin Azam, Sohail Behlim, Asit Sonawane, Prof. Rambhau Lagdive
DOI Link: https://doi.org/10.22214/ijraset.2023.53354
The sign language detection system assists communication between the hearing impaired and the general public. We present the design and implementation of a sign language recognition model that detects commonly used signs reliably. The system also gives beginners a practical way to practice sign language. As part of this study, various human-computer interaction approaches to gesture recognition were investigated and analyzed, and layered image processing was found to be the most effective way to detect hand movements. Even against simple backgrounds, the system can detect certain signs of the language with 70-80% accuracy.
I. INTRODUCTION
Hearing loss has become increasingly common in recent years. Face-to-face communication is an important channel for sharing information, ideas, and connections. Deaf people rely on a few forms of communication, including reading, writing, and sign language. The lack of a shared communication channel needs to be addressed because it has become a barrier to the success of the deaf. Among these forms, sign language is the most widely used.
To provide a natural way to engage and communicate in AR/VR, hand tracking is an essential element and has been an active subject of research. Many devices today include voice control, especially the smartphones we use every day. These voice controls are based on speech recognition algorithms, which have notable disadvantages such as sensitivity to background noise. Gestures are one way to work around this problem.
We therefore created a gesture detection project that recognizes hand gestures for sign language recognition. Computer vision techniques recognize gestures by applying transformations and filters to candidate regions, and the extracted data is then processed to produce the desired result. In this study, we use Google MediaPipe Hands, which works with ordinary digital cameras, to simplify hand landmark detection and processing.
II. LITERATURE SURVEY
American Sign Language (ASL) is a complete language whose grammar has many parallels with spoken languages, but it is distinct from English [5]. In ASL, meaning is expressed through hand gestures and facial expressions. It is the first language of many deaf and hard of hearing people in North America.
Deaf people communicate using sign language, but a communication gap remains because most hearing people do not understand it [1]. Modern technology can close this gap. Techniques such as image processing and machine learning can be used to create systems that translate sign language into text or speech. Such a system makes it easier for hearing-impaired people to talk to anyone, which is a clear benefit. This article briefly describes the extensive research that has been done in this area.
Although widely studied, sign language recognition (SLR) is rarely used in day-to-day settings due to its complexity and high computational requirements [2]. The authors of that study explored the various methods that can be used to create automated sign language translators, reviewing the methods and models used to build working translators in different regions. To improve automatic translation of Spanish sign language, the study also examines various potential applications of artificial intelligence technology.
Gesture recognition is an intuitive, natural, and effective form of human-computer interaction (HCI). Two important applications of gesture recognition are sign language recognition (SLR) and gesture control [3]. SLR targets automatic translation to help deaf people communicate with hearing people.
Since sign language is a highly structured and largely symbolic representation of human movement, SLR also provides a solid foundation for developing general gesture-based HCI. That work discusses research on gesture recognition and reviews applicable models.
Reference [4] presents vision-based deep learning research for sign language recognition. Gesture recognition systems are believed to provide a better and more efficient means of human-computer interaction. Applications include sign language recognition, virtual prototyping, and clinical training.
Sign language is one of the main forms of communication in the deaf and hard of hearing community. Most recent work has focused on the analysis of static gestures in video or images recorded in controlled environments. Many such approaches require the signer to wear sensor gloves or dark gloves, which must be worn throughout the segmentation process.
III. COMPONENTS
A. Functional Requirements
The system should start from PyCharm with the default UI on any Windows machine. The system should be able to compare the user's camera input with the database to recognize signs. Once a sign is recognized, the system should display the recognized sign as text so that the deaf user is informed of the detection. If a problem occurs while launching the application, users should be notified and encouraged to report the problem to the developer. If no connected camera is found, the system should inform the user both in writing and visually.
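As a hedged illustration of the last requirement, the sketch below checks for a connected camera with OpenCV and notifies the user both in text and in a simple window; the camera index and window title are assumptions, not part of the original design.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)  # 0 = default camera; this index is an assumption and may differ
if not cap.isOpened():
    # Written notification on the console
    print("No connected camera was found. Please connect a camera and restart the application.")
    # Visual notification in a simple window
    notice = np.full((120, 640, 3), 255, dtype=np.uint8)
    cv2.putText(notice, "Camera not found", (20, 70),
                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 0, 255), 2)
    cv2.imshow("Sign Language Detector", notice)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    cap.release()
```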
B. Assumptions and Dependencies
The sign language detector depends on a camera connection; the program will not function without one. The user is expected to perform standard sign language correctly for the system to work properly, and users are responsible for performing accurate signs for translation. The application will alert the user if a performed sign cannot be recognized. In that case, the user can also easily generate their own dataset for new words.
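One hedged way to generate such a custom dataset is to record MediaPipe hand landmarks per frame into a CSV file, as sketched below; the label, file name, and key binding are illustrative assumptions rather than part of the described system.

```python
import csv
import cv2
import mediapipe as mp

LABEL = "hello"                    # hypothetical new word chosen by the user
OUTPUT_CSV = "custom_dataset.csv"  # assumed output file name

mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands, \
        open(OUTPUT_CSV, "a", newline="") as f:
    writer = csv.writer(f)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            # One labelled sample per frame: 21 landmarks x (x, y, z) = 63 values
            writer.writerow([LABEL] + [v for p in lm for v in (p.x, p.y, p.z)])
        cv2.imshow("Recording - press q to stop", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```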
IV. METHODOLOGY
A. Media Pipe
Google provides MediaPipe, an open-source, cross-platform framework for building pipelines that process perception data from multiple sources, including audio and video. Face detection and pose estimation are just two of the solutions that ship with MediaPipe. In this study, we use MediaPipe Hands for hand tracking. OpenPose is an example of an alternative keypoint-estimation model: from an input image, OpenPose estimates 2D keypoint coordinates together with confidence values.
In contrast, MediaPipe infers finger landmark coordinates from a detected palm region. Our use of MediaPipe consumes fewer resources than OpenPose because it works on a narrower detection region, and it also improves the accuracy of finger shape prediction, which is a known weakness of OpenPose. The 3D coordinates (X, Y, Z) of 21 landmarks can be obtained from a single frame captured by a monocular camera. The X and Y coordinates of the output are normalized to the width and height of the bounding box. Taking the wrist position as the origin, the Z coordinate represents depth: the closer a point is to the camera, the smaller the value, and the farther from the camera, the larger the value.
This approach provides efficient hand and finger tracking and can be used in low-compute environments such as mobile devices.
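A minimal sketch of extracting the 21 landmarks from one frame with the MediaPipe Hands Python API is shown below; the input file name is a placeholder assumption.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
image = cv2.imread("sign.jpg")  # hypothetical input image
with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    h, w, _ = image.shape
    for idx, lm in enumerate(results.multi_hand_landmarks[0].landmark):
        # x and y are normalized coordinates; z is depth relative to the wrist,
        # with smaller values for points closer to the camera.
        print(f"landmark {idx}: ({lm.x:.3f}, {lm.y:.3f}, {lm.z:.3f}) "
              f"-> pixel ({int(lm.x * w)}, {int(lm.y * h)})")
```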
1) Palm Detection Model
To detect the initial palm location, we use a single-shot detector model optimized for real-time mobile applications, similar to BlazeFace, which is also used in MediaPipe. Detecting hands is a genuinely difficult task: the model must work across a wide range of hand sizes and must recognize occluded and self-occluded hands. Whereas faces have high-contrast patterns, for example around the eyes and mouth, hands lack such distinctive features and are comparatively hard to detect from their visual appearance alone. Our solution addresses these problems with the approaches described below.
First, we train a palm detector rather than a hand detector, because estimating bounding boxes for rigid objects such as fists and palms is much simpler than detecting hands with articulated fingers. In addition, because palms are smaller objects, non-maximum suppression (NMS) still works well in two-hand situations such as handshakes. Palms can also be modeled with square bounding boxes only, ignoring other aspect ratios, which reduces the number of anchors by a factor of roughly 3-5. Second, we use an encoder-decoder feature extractor, similar to an FPN, for larger scene-context awareness even for small objects. Finally, we minimize focal loss during training to cope with the large number of anchors (candidate boxes) produced at different scales.
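As a hedged illustration of the focal-loss term mentioned above (the standard formulation, not the exact training code of this work), a minimal NumPy sketch follows.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Standard binary focal loss over anchor predictions.

    p: predicted probability of the positive (palm) class per anchor
    y: ground-truth label per anchor (1 = palm, 0 = background)
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    # Hard, misclassified anchors (small p_t) dominate; easy ones are down-weighted.
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# Example: two easy anchors plus one hard positive
print(focal_loss(np.array([0.9, 0.1, 0.2]), np.array([1, 0, 1])))
```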
2) Hand Landmark Model
Once the palm has been located in the image, the hand landmark model regresses 21 3D hand landmarks inside the detected region, that is, it predicts the key points directly. The landmark model is robust to self-occlusion and to partially visible hands because it learns a consistent internal representation of hand pose. To obtain ground-truth data for our model, roughly 30K real-world images were annotated with 21 3D coordinates (the Z value was taken from a depth map where available). Synthetic hand models rendered over various backgrounds and mapped to the corresponding 3D coordinates were also used, to cover more of the possible hand poses and to provide additional supervision on the geometric features of the hand.
In this step, we use the MediaPipe library together with the OpenCV (cv2) preprocessing library to build a model that detects hand landmarks for various computer vision applications. This problem statement has many applications in industry, including virtual reality and gaming. MediaPipe is a framework that lets developers build cross-platform, multi-device ML pipelines over video, audio, and other time-series data. The MediaPipe library provides a large collection of models trained by Google on diverse datasets for human detection and tracking. These models track key landmarks of the face, hands, and body, following the skeletal structure and its important points.
All coordinates are normalized in three dimensions. Built by Google developers on top of TensorFlow Lite, the framework makes the dataflow easy to edit and manipulate. The nodes of a MediaPipe pipeline are described in a graph configuration file, commonly a pbtxt file. Each node is implemented as a C++ calculator, a class that extends the MediaPipe base calculator class.
A calculator is connected to the other nodes in the graph by receiving input streams from them, such as a video stream, and by emitting output streams of its own; these connections are released automatically when the graph is torn down. Every stream delivered to a calculator carries a sequence of packets that can hold data of various types.
A calculator can also receive side packets, containers for auxiliary data such as constants or static objects that can be injected into the graph. The simple dataflow pipeline format makes this possible.
The hand tracking solution uses a back-end ML pipeline with two models working together: (a) a palm detection model and (b) a hand landmark model. The landmark sub-model takes the cropped palm region produced by the palm detection model as its input. This design also reduces the need for data augmentation such as rotation, translation, and scaling. The traditional approach is to detect the hand in the current frame and then locate landmarks within the detected region, but this pipeline takes a different route to solving the problem.
Detecting hands is a time-consuming process because the whole image must be scanned and hands appear at many different sizes. First, the palm detector is trained to estimate bounding boxes around rigid objects such as fists and palms, which is much easier than detecting hands with articulated fingers directly from the current frame. Second, an encoder-decoder feature extractor is used to capture larger scene context.
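The sketch below is a hedged illustration of this two-stage pipeline as exposed by the MediaPipe Hands Python API: with static_image_mode=False the landmarks are tracked across frames and the heavier palm detector is re-run only when tracking confidence drops. The window title and thresholds are assumptions.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
# Tracking mode: the palm detector runs only when the landmark tracker loses the hand.
with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                    min_detection_confidence=0.5, min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
        cv2.imshow("Hand tracking", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # Esc quits
            break
cap.release()
cv2.destroyAllWindows()
```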
3) Stage 2: CNN
As explained on the Neural Networks Learning Center page, neural networks are a branch of machine learning and form the foundation of deep learning algorithms. They consist of layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node has associated connections, weights, and a threshold. When a node's output exceeds its threshold, the node activates and passes data on to the next layer of the network; otherwise, no data is passed forward.
While we focus only on feedforward computation here, there are other types of neural networks for different applications and data types. Convolutional neural networks (CNNs) are mainly used in classification and computer vision applications, while recurrent neural networks are often used in speech and natural language processing. Before the advent of CNNs, recognizing objects in images required manual feature extraction. Convolutional neural networks now provide a more efficient way to classify images and identify objects, using matrix multiplication and other elements of linear algebra to find patterns in images.
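A minimal Keras sketch of such a CNN classifier is shown below; the 64x64 grayscale input size and the 26-class output (one class per static alphabet sign) are assumptions for illustration, not the configuration used in this work.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 26  # assumed: one class per static alphabet sign

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),          # assumed 64x64 grayscale gesture crops
    layers.Conv2D(32, 3, activation="relu"),  # learn local patterns via convolution
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```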
4) Stage 3: FNN
A feedforward neural network is a kind of artificial neural network in which connections between nodes do not form cycles. Recurrent neural networks, in which certain connections feed back into earlier nodes, are a later evolution of this idea. The feedforward model is the simplest neural network, as the input is processed in only one direction. Although data may flow through multiple layers, it always moves forward, never backward.
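A minimal sketch of a feedforward classifier over the 63 MediaPipe landmark values (21 landmarks times three coordinates) follows; the layer sizes and the 26-class output are illustrative assumptions.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 26  # assumed label count

# Data flows strictly forward: input -> hidden layers -> softmax output.
model = models.Sequential([
    layers.Input(shape=(63,)),               # 21 landmarks x (x, y, z)
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```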
V. CONCLUSION
Our proposed methodology demonstrates that MediaPipe can be effectively employed as a tool to precisely detect complicated hand gestures, with an average accuracy of 99 percent on the majority of the sign language samples. Although image processing techniques for modeling sign language have advanced over the past few years, those methods still demand a lot of CPU effort, and training a model requires a lot of time. From that angle, this work offers fresh perspectives on the problem. The model is sturdy and economical because it requires less computational power and can be used with smart devices. Training and testing with a variety of sign language datasets indicate that this framework can be efficiently customized for any regional sign language dataset and that high accuracy can be reached. Faster real-time detection, better than the current state of the art, shows how effective the model is. The work can be expanded in the future by adding word-level detection for sign language from videos using MediaPipe and additional algorithms.
REFERENCES
[1] https://ijcrt.org/viewfull.php?&p_id=IJCRT2103503
[2] https://www.sciencedirect.com/science/article/pii/S1877050921000442
[3] K. P. Kour and L. Mathew, "Literature Survey on Hand Gesture Techniques for Sign Language Recognition," Department of Electrical Engineering, NITTTR, Chandigarh, India.
[4] A. Chavan, S. Deshmukh, and F. Fernandes, "Sign Language Detection," Department of Electronics and Telecommunication, Vishwakarma Institute of Technology, Pune.
[5] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney, "RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus," in Language Resources and Evaluation (LREC), pp. 3785-3789, Istanbul, Turkey, May 2012.
[6] Sik-Ho Tsang, "Review: Inception-v3 - 1st Runner Up (Image Classification) in ILSVRC 2015," September 2018, https://sh-tsang.medium.com/review-inception-v3-1st-runner-up-image-classification-in-ilsvrc-2015-17915421f77c
[7] A. Das, S. Gawde, K. Suratwala, and D. Kalbande, "Sign Language Recognition Using Deep Learning on Custom Processed Static Gesture Images," 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, 2018, pp. 1-6, doi: 10.1109/ICSCET.2018.8537248.
[8] M. Xie and X. Ma, "End-to-End Residual Neural Network with Data Augmentation for Sign Language Recognition," 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chengdu, China, 2019, pp. 1629-1633, doi: 10.1109/IAEAC47372.2019.8998073.
[9] M. Kumar, "Conversion of Sign Language into Text," International Journal of Applied Engineering Research, ISSN 0973-4562, vol. 13, no. 9, pp. 7154-7161, 2018.
[10] J. Singha and K. Das, "Recognition of Indian Sign Language in Live Video," International Journal of Computer Applications, vol. 70, 2013, doi: 10.5120/12174-7306.
[11] M. Hurroo and M. Elham, "Sign Language Recognition System using Convolutional Neural Network and Computer Vision," International Journal of Engineering Research & Technology (IJERT), vol. 09, issue 12, December 2020.
[12] R. A. Pranatadesta and I. S. Suwardi, "Indonesian Sign Language (BISINDO) Translation System with ORB for Bilingual Language," 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT), 2019, pp. 502-505, doi: 10.1109/ICAIIT.2019.8834677.
[13] A. Sahoo, G. Mishra, and K. Ravulakollu, "Sign language recognition: State of the art," ARPN Journal of Engineering and Applied Sciences, vol. 9, pp. 116-134, 2014.
[14] A. S. Nikam and A. G. Ambekar, "Sign language recognition using image based hand gesture recognition techniques," 2016 Online International Conference on Green Engineering and Technologies (IC-GET), 2016, pp. 1-5, doi: 10.1109/GET.2016.7916786.
Copyright © 2023 Aman Bhoskar, Moin Azam, Sohail Behlim, Asit Sonawane, Prof. Rambhau Lagdive. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET53354
Publish Date : 2023-05-30
ISSN : 2321-9653
Publisher Name : IJRASET