Computer interaction using hand gestures is the automatic recognition of hand gestures so that people can interact with a computer without touching it. Interaction with a computer is usually done through input devices such as keyboards and mice, and the need for contactless human-computer interaction methods has become especially pressing during the pandemic. Gesture recognition is a significant step towards achieving such contactless interaction. A deep learning approach, the convolutional neural network, is used to recognize hand gestures, and the operation mapped to the recognized gesture is performed.
I. INTRODUCTION
With the recent expansion of computer science technologies such as smartphones and the internet, the field of human-computer interaction has taken on a new meaning: the link between user activity and multiple computers, where a computer is any device that runs programs [1]. We use computers in different forms for many applications. The interaction of humans with computers is a field of great interest, known as HCI (Human-Computer Interaction). HCI has emerged as a design-oriented field of research, directed largely towards the innovation, design, and construction of new kinds of information and interaction technology; however, the philosophical, theoretical, and methodological underpinnings of this attitude to research remain relatively poorly understood within the field [2]. Different input and output devices have been designed over the years with the goal of enabling communication between computers and humans; the two best known are the keyboard and mouse.
These two have traditionally been the standard means of interacting with computers. In this new age of intelligent systems, however, there is a need to develop new ways for humans and computers to interact, so that communicating with computers becomes easier.
Deep learning is a subclass of machine learning that is essentially a neural network with three or more layers. Such networks attempt to simulate the behavior of the human brain, although they are far from matching its ability, allowing them to learn from huge amounts of data. While a neural network with a single layer can still make approximate predictions, additional hidden layers help to optimize the model and improve its accuracy. We use a convolutional neural network to recognize hand gestures.
II. METHODOLOGY
A. Dataset
The dataset is a collection of images of alphabets from American Sign Language, separated into 29 folders that represent the various classes [3]. The training dataset contains 87,000 images, each 200 pixels high and 200 pixels wide. These images are classified into 29 classes: 26 for the letters A-Z, plus SPACE, DELETE, and NOTHING. The latter three classes are very helpful in real-time applications and classification. Four letter gestures are mapped to operation labels as follows:
- F is mapped to SUPER
- V is mapped to VICTORY
- L is mapped to LOSER
- A is mapped to PUNCH
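In Python, this relabeling can be captured as a simple dictionary (a minimal sketch; the variable name is our own):

```python
# Mapping from ASL letter classes to the operation labels used in this work
GESTURE_LABELS = {
    "F": "SUPER",
    "V": "VICTORY",
    "L": "LOSER",
    "A": "PUNCH",
}
```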
B. CNN Architectures
A CNN is a deep learning classification method with a strong record of success in image analysis and classification tasks [4].
The convolutional neural network we built takes input images of dimensions 200 x 200. It starts with a sequential model in which the number of filters, the strides, and the kernel size are set to appropriate values. The ReLU activation function is used, along with batch normalization and max pooling with a pool size of 2 x 2.
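A minimal sketch of such a stack in Keras follows; the filter counts, number of blocks, and dense-layer width are illustrative assumptions, not the exact values used:

```python
# Sketch of the custom CNN described above (Keras / TensorFlow 2.x).
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, BatchNormalization,
                                     MaxPooling2D, Flatten, Dense)

num_classes = 4  # the four gesture classes (SUPER, VICTORY, LOSER, PUNCH)

model = Sequential([
    # Convolution with ReLU activation on 200 x 200 RGB input
    Conv2D(32, kernel_size=(3, 3), activation="relu",
           input_shape=(200, 200, 3)),
    BatchNormalization(),            # normalize activations per batch
    MaxPooling2D(pool_size=(2, 2)),  # halve the spatial dimensions

    Conv2D(64, kernel_size=(3, 3), activation="relu"),
    BatchNormalization(),
    MaxPooling2D(pool_size=(2, 2)),

    Flatten(),
    Dense(128, activation="relu"),
    Dense(num_classes, activation="softmax"),  # one unit per gesture class
])
```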
We also use a pre-trained model, VGG-16, in combination with a custom convolutional neural network (CNN). The output of the VGG-16 base is fed into our custom layers: a convolutional layer with 32 filters, a kernel size of (3, 3), and the default strides of (1, 1) is added.
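A sketch of this transfer-learning variant is shown below; the head layout beyond the stated 32-filter (3, 3) convolution is an illustrative assumption:

```python
# Pre-trained VGG-16 base feeding a custom convolutional head.
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Conv2D, Flatten, Dense

base = VGG16(weights="imagenet", include_top=False,
             input_shape=(200, 200, 3))
base.trainable = False               # keep the pre-trained weights fixed

x = Conv2D(32, kernel_size=(3, 3), strides=(1, 1),
           activation="relu")(base.output)
x = Flatten()(x)
outputs = Dense(4, activation="softmax")(x)  # four gesture classes
model = Model(inputs=base.input, outputs=outputs)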
The ImageDataGenerator module is used for data augmentation, applied before the model works on the images so that the model generalizes better to real-life images. Augmentation effectively enlarges the training data and helps improve accuracy at each epoch.
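A sketch of this augmentation step is given below; the specific transform parameters and the dataset path are assumptions for illustration:

```python
# Data augmentation with Keras's ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel values to [0, 1]
    rotation_range=15,       # small random rotations
    zoom_range=0.1,          # random zooms
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    validation_split=0.2,    # hold out part of the data for validation
)

train_data = train_gen.flow_from_directory(
    "asl_alphabet_train",    # hypothetical path to the class folders
    target_size=(200, 200),
    class_mode="categorical",
    subset="training",
)
```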
The ‘Adam’ optimizer is used, with ‘categorical cross-entropy’ as the loss function, to train the model and evaluate its accuracy. The dataset is split into training and validation sets, and the model is fit on the training set while its performance on the validation set is monitored.
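Continuing the sketches above, the compile-and-fit step might look as follows; the epoch count is an assumption:

```python
# Validation generator drawn from the same augmented split as above.
val_data = train_gen.flow_from_directory(
    "asl_alphabet_train", target_size=(200, 200),
    class_mode="categorical", subset="validation",
)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_data, validation_data=val_data, epochs=10)
```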
Convolution is a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other. The convolution layer transforms the input image in order to extract features from it. In this process, the image is convolved with a kernel, a small matrix whose width and height are smaller than those of the image to be convolved, known as the convolution matrix.
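A standard way to write the discrete 2-D convolution underlying this layer (a textbook formulation, not quoted from the paper; note that deep learning libraries typically compute the unflipped variant, cross-correlation, under the same name):

```latex
S(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(i - m,\; j - n)\, K(m, n)
```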
Following the convolution layer, pooling layers are used to reduce the dimensions of the feature maps. This lessens the number of parameters to learn, the amount of computation performed, and the time taken to train the network. The pooling layer summarizes the features present in a region of the feature map produced by a convolution layer. There are several non-linear functions that implement pooling, such as average and max pooling. The most common pooling technique is max pooling, which partitions the input into a set of rectangles and, for each such sub-region, outputs the maximum value.
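As a tiny worked illustration (NumPy, our own example rather than the paper's code), 2 x 2 max pooling reduces a 4 x 4 feature map to 2 x 2 by keeping the largest value in each block:

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 5, 1],
                 [7, 2, 8, 3],
                 [0, 9, 4, 4]])

# Split into 2x2 blocks, then take the maximum within each block.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 5]
                #  [9 8]]
```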
The non-saturating activation function called the Rectified Linear Unit (ReLU) is used, which effectively eliminates negative values from an activation map by replacing them with zero. It introduces nonlinearity to the overall network without affecting the receptive fields of the convolution layers. Mathematically, it is expressed as f(x) = max(0, x).
Based on the recognized hand gesture, a specific operation is performed using the OS module, and text is converted to voice using gTTS (Google Text-to-Speech), a Python library and CLI tool that interfaces with Google Translate's text-to-speech API [5]. The operation mapped to the recognized gesture is then carried out.
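A sketch of this step is shown below; the specific commands, file names, and playback call are illustrative assumptions, not the paper's exact mapping:

```python
import os
from gtts import gTTS

def perform_operation(label: str) -> None:
    actions = {                      # hypothetical gesture-to-command map
        "SUPER": "notepad",          # e.g. open a text editor
        "VICTORY": "calc",           # e.g. open the calculator
    }
    command = actions.get(label)
    if command:
        os.system(command)           # run the mapped OS command

    tts = gTTS(text=f"Performed operation for {label}")
    tts.save("speech.mp3")           # synthesize speech via Google TTS
    os.system("start speech.mp3")    # play the audio (Windows; assumption)

perform_operation("VICTORY")
```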
III. CONCLUSION
A convolutional neural network is trained on 9,600 image samples spanning the 4 gesture classes, with hyperparameter tuning for optimal results, and its performance is evaluated on a test set of about 2,400 image samples. The accuracy values obtained are fairly good: a training accuracy of about 95% and a testing accuracy of about 94%.
The operations mapped to the respective hand gestures can be personalized as required, and more gestures can be added to the existing system.
In the future, the human-computer interaction domain will grow rapidly, making room for innovation and research. Old human-computer interaction devices will become legacy hardware and be replaced by new ones.
REFERENCES
[1] J. Lazar, J. H. Feng and H. Hochheiser, Research methods in human-computer interaction, Morgan Kaufmann, 2017.
[2] D. Fallman, “Design-oriented human-computer interaction,” in Proc. SIGCHI Conference on Human Factors in Computing Systems, 2003.
[3] “ASL Alphabet,” [Online]. Available: https://www.kaggle.com/datasets/grassknoted/asl-alphabet.
[4] P. S. Neethu, R. Suguna and D. Sathish, “An efficient method for human hand gesture detection and recognition using deep learning convolutional neural networks,” Soft Computing, vol. 24, March 2020.
[5] “gTTS,” [Online]. Available: https://pypi.org/project/gTTS/.
[6] D. Bachmann, F. Weichert and G. Rinkenauer, “Review of three-dimensional human-computer interaction with focus on the leap motion controller,” Sensors, vol. 18, no. 7, 2018.
[7] Zhan, “Hand Gesture Recognition with Convolution Neural Networks,” in IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), 2019.
[8] A. Haria, A. Subramanian, N. Asokkumar, S. Poddar and J. S. Nayak, “Hand Gesture Recognition for Human Computer Interaction,” Procedia Computer Science, vol. 115, pp. 367-374, 2017.