AI Based Virtual Mouse with Hand Gesture and AI Voice Assistant Using Computer Vision and Neural Networks

Authors: J Kumara Swamy, Mrs. Navya V K

DOI Link: https://doi.org/10.22214/ijraset.2023.55412

Abstract

The use of hand gesture recognition in controlling virtual devices has become popular due to the advancement of artificial intelligence technology. A hand gesture-controlled virtual mouse system that utilizes AI algorithms to recognize hand gestures and translate them into mouse movements is proposed in this paper. The system is designed to provide an alternative interface for people who have difficulty using a traditional mouse or keyboard. The proposed system uses a camera to capture images of the user’s hand, which are processed by an AI algorithm to recognize the gestures being made. The system is trained using a dataset of hand gestures to recognize different gestures. Once the gesture is recognized, it is translated into a corresponding mouse movement, which is then executed on the virtual screen. The system is designed to be scalable and adaptable to different types of environments and devices. All the input operations can be virtually controlled by using dynamic/static hand gestures along with a voice assistant. In our work we make use of ML and Computer Vision algorithms to recognize hand gestures and voice commands, which works without any additional hardware requirements. The model is implemented using CNN and mediapipe framework. This system has potential applications like enabling hand-free operation of devices in hazardous environments and providing an alternative interface for hardware mouse. Overall, the hand gesture-controlled virtual mouse system offers a promising approach to enhance user experience and improve accessibility through human-computer interaction.

Introduction

I. INTRODUCTION

The world is full of technology driven factors in our day to day life. We have so many technologies, throughout the world computer technologies are growing simultaneously. They are used to perform various tasks which cannot be performed by humans. In fact they are ruling the human lives because they have a potential to do the tasks which cannot be done by humans. The interaction between human and computer can be done with output device like mouse. The mouse is a device used for interacting with a GUI which includes pointing, scrolling and moving etc.

The hardware mouse in computers and touchpads in laptops will require a huge amount of time to perform complex tasks, incase we are carrying hardware mouse wherever we go it would be damaged sometimes.

After decades the technology has made the mouse functionality from wired into the wireless to improve the functionality and for the easy movements in hassle free manner.

As the technologies started growing there came the speech recognition technique. This recognition is mainly used for the voice recognition purpose for searching something with the help of their voice and for translation purposes but it can take time for recognition to perform mouse functions. Later the human computer interaction evolved with the eye tracking techniques for controlling thecursor of the mouse. The major drawback of this technique is that some may wear contact lens or some may have long eyelashes so it may take some time to capture their eye movement.

Different types of attempts taken by many developers for developing the models for human gesture recognition. Those models require expensive gloves and sensors for capturing and color cap for marking the positions of the fingertips.

The technologies are still emerging, one of the vast technologies artificial intelligence is playing a major role in every sector. Artificial intelligence makes human life fast and comfortable.

To overcome the problems faced in the existing approaches we are going for the latest algorithms and tools in artificial intelligence.

Hand gesture controlled virtual mouse using artificial intelligence is a technology that allows users to control the movement of their computer mouse using hand gestures, without the advent of a physical mouse.

This technology uses a camera vision based approach to track the movements of the user’s hand and to perform mouse functions on the computer screen. The system works by capturing video input from a camera pointed at the user’s hand.

The computer vision algorithms then analyze the video feed to identify the user’s hand and track its movement. This information is given to machine learning models which have been trained to recognize specific hand gestures, such as pointing or swiping, and translate them into corresponding mouse movements.

II. LITERATURE SURVEY

The literature survey was carried out to find various papers published in international journals related to various Bitcoin price prediction algorithms, and associate the best algorithm for the same.

Some work which is related to the AI virtual mouse had been performed previously in that glove were used by the user to recognize and collect data from the system. Later another system used colored pieces of paper which are attached on hands for gesture recognition.

But these systems are not very feasible for performing mouse operations accurately. In a glove based approach recognizing the gloves is not viable and it might be allergic for users who have sensitive skin type. Also wearing gloves for a long time is difficult. It might sweat and result in skin rashes and allergic reactions. In the case of colored tips for gesture recognition and detection will not always give best results.

Now some others have made contributions that use Google’s work with the mediapipe framework. The current gesture controlled virtual mouse uses hand gestures to perform mouse functions, in which we have control over the mouse cursor and perform certain mouse operations like left click, right click, drag and drop, volume control and brightness control etc.

Efforts have been made for hand gesture recognition with camera-based detection of the hand gesture interface.

The vision based technique has been tried out in this system. Utilized a webcam for gesture recognition and detection. And no external devices like sensors and gloves were used. Completely focuses on leveraging the YOLOv5 algorithm and Artificial Intelligence (AI) to recognize hand gestures and improve HCI.
A method for performing mouse functions without any electrical equipment like sensors. It requires a webcam alone. And mouse functions like clicking and dragging files are carried out through hand gestures. The suggested model performance is low with accuracy and lacks more mouse functionality.
In this model the hand’s center is determined, and the hand’s calculated radius is discovered. And using the convex hull technique, fingertip points have been determined. The hand gesture is used to control every mouse movement. And the problem of this approach is the frame must first be saved before being processed for detection, which takes longer than what is needed in real-time.
The system can create coloured masks utilizing techniques for color variation. Later mouse functions are carried out using hand gestures. This approach is difficult in its implementation.
This study focuses on the advanced study of robots with gesture controls. The first section gives an idea of the art for hand gesture identification as it relates to how they are seen and captured by common video cameras. Based on estimations of the smoothed optical flow, we extract a collection of motion features. Face detection is used to produce a user-centric representation of this data, and an effective classifier is trained to differentiate.
They created a machine-user interface that uses straightforward computer vision and multimedia techniques to accomplish hand gesture detection. However, a significant disadvantage is that skin pixel identification and hand segmentation from stored frames must be completed before working with gesture comparison techniques.
They described a system in this study for recognizing hand movements that relies on a mobile phone's camera and a connected mobile projector as a visual feedback medium. Other mobile applications can easily link to their framework to learn gesture recognition. The suggested architecture enables the quick and simple creation of research prototypes that support gestures, diverting the user's focus away from the device and towards the content.

III. PROPOSED SYSTEM

A high-quality finger and hand tracking system, MediaPipe Hands. Machine learning (ML) is used to determine 2D and 3D landmarks of a hand from a single image. Whereas the present state-of-the-art methods often require robust PC settings for inference, scalable and our solution delivers real-time performance on a mobile phone to many hands. It is our sincere desire that making this hand perception capabilities available to the broader research and development community would inspire the development of novel use cases, leading to the creation of novel applications and the discovery of novel research topics.

A. Palm Detection Model

In this a single shot detector model for hand identification and position like detection using face mesh is tailored for mobile real-time usage. mediapipe lite model and full model must recognise hands in a wide range of sizes, with a huge scale span (20x) relative to the picture frame, and in occluded and self-occluded states, making hand detection a challenging problem. Because of the lack of high contrast patterns in the hand region (as is present in the face, for example in the eye and mouth area), reliable visual detection of hands is more challenging. However, precise hand localisation is aided by other information, such as the, torso and arm, or human traits. The mediapipe method is interdisciplinary in nature and uses a variety of techniques to address the aforementioned issues. Training a palm detector instead of a hand detector is the first step since it is far simpler to estimate the bounding boxes of rigid objects like palms and fists than it is to identify hands with movable fingers. For social and self-scenarios, such handovers, the non-maximum suppression method is especially useful due to the small size of the palms involved. Modeling palms using square anchor boxes (connections in ML language) and ignoring other aspect ratios can further minimise the number of anchors by a factor of 4-5. In addition, even for little objects, a feature extractor based on a codec pair is used in order to grasp the whole scene context (similar to the Retina Net approach). Last but not least, because to the sizeable scale variance, palm model minimised attentional drift during training to keep a large number of hooks.

By combining these methods, mediapipe model are able to increase palm identification accuracy to 95.7% on average. To put things in perspective, a baseline of just 86.22% is achieved when using a standard cross entropy loss and no decoder.

B. Hand Landmark Model

Mediapipe approach overcomes these problems by employing a variety of Once the palm has been recognised over the whole image, or direct coordinate prediction, or mediapipe hand landmark model uses regression, to pinpoint 2D and 3D hand-knuckle positions inside the observed hand areas. The model is able to develop a consistent internal hand posture representation and is resilient even when only a portion of a hand is seen or when the hand is partially obscured by the model's own body.

The medipipe have personally labelled over 30,000 photos from the actual world with 21 3D coordinates to use as ground truth data. mediapipe additionally render a high-quality synthetic hand model over different backdrops and map it to the associated 3D coordinates to better cover the available hand positions and give extra oversight on the nature of hand geometry.

C. System Structure For Voice Assistant

The suggested concept for an efficient method of constructing a Personal voice assistant makes use of a Speech Recognition library with several in-built capabilities, allowing the assistant to comprehend the command provided by the user and replying to the user in voice, using Text to Speech operations. Once an assistant has recorded a user's voice instruction, underlying speech-to-text techniques can be used :

Conceptual Design
One component of the system architecture is a microphone for capturing speech patterns.
The second is the transcription of audio files.Comparing the input to a set of rules that have already been established.
Resulting in the expected outcome.

D. Simple Procedure

The primary process of a voice assistant is depicted in the diagram below. The process of converting spoken words into writing is called "speech recognition." The computer then uses the character set of the command to locate and run the relevant script. However, that is not the only layer of intricacy. No matter how much time you put in, there is still another component that greatly affects whether or not a product is seen. The voice recognition equipment is easily distracted by background noise. A possible explanation for this is because people have trouble telling the difference between your speech and background noises like a dog barking or a helicopter passing overhead

The voice recognition equipment is easily distracted by background noise. A possible explanation for this is because people have trouble telling the difference between your speech and background noises like a dog barking or a helicopter passing overhead

E. Process Flow Diagram

In the form of Siri, Google Voice, and Bixby, voice assistants are already built into our mobile devices. Sales of smart speakers like the Amazon Echo and Google Home are increasing at the same rate as sales of smart phones were a decade ago, according to a recent NPR research, which found that around one in six Americans currently has one. However, you may feel as though the voice revolution is still far off at work. One obstacle is the rise of open offices, since no one wants to be "that guy" who won't stop yelling at his computer. There are 3 separate parts to this Assistant.

The very first is the capability of the assistant to recognise the user's voice and act on it. Second, processing the user's input by determining its meaning and applying it appropriately. Finally, the assistant provides the user with the outcome in real time by voice. The assistant will begin by collecting data from the user. The assistant will take the user's analogue voice input and transform it into digital text.

IV. ALGORITHMS AND TOOLS USED

For the purpose of hand and finger detection we are using the one of the effective open source library mediapipe, it is one type of the framework based on the cross platform features which was developed by google and Opencv to perform some CV related tasks. This algorithm uses machine learning related concepts for detecting the hand gesture and to track their movements.

A. Mediapipe

Google created the open-source MediaPipe framework to enable the development of cross-platform, real-time computer vision applications. For processing and analyzing video and audio streams, it offers a number of pre-made tools and components, such as object detection, pose estimation, hand tracking, facial recognition, and more.Developers can quickly construct intricate pipelines using MediaPipe that combine numerous algorithms and processes and execute in real-time on a variety of h/w platforms, like CPUs, GPUs, and specialized accelerators like Google's Edge TPU. Additionally, the framework has interfaces helps us interacting with other well-liked machine learning libraries, including TensorFlow and PyTorch, and it supports several programming languages, like C++, Python, and Java.

For computer vision and ML tasks, MediaPipe is a comprehensive library that offers a many of features. Here are a few of the library's main attributes and features :

Video and Audio Processing: MediaPipe provides tools for processing and analyzing video and audio streams in real-time. This includes functionalities such as video decoding, filtering, segmentation, and synchronization.
Facial Recognition: MediaPipe can detect and track facial landmarks, including eyes, nose, mouth, and eyebrows, in real-time. This functionality is useful for applications such as facial recognition, emotion detection, and augmented reality.
Hand Tracking: MediaPipe can track hand movements in real-time, allowing for hand gesture recognition and interaction with virtual objects.
Object Detection: MediaPipe can detect and track objects in real-time using machine learning models. This functionality is useful for applications such as augmented reality, robotics, and surveillance.
Pose Estimation: MediaPipe can estimate the poses of human bodies in real-time, allowing for applications such as fitness tracking, sports analysis, and augmented reality.

For a variety of tasks, such as object detection, position estimation, facial recognition, and more, MediaPipe offers tools for training and deploying machine learning models. All in all, MediaPipe is a potent tool kit that gives programmers the ability to easily create sophisticated real-time computer vision and ML applications.

B. Opencv

A computer vision and ML software library called OpenCV is available for free download. Its objective is to aid programmers in the development of computer vision applications. Filtering, feature identification, object recognition, tracking, and other processing operations for images and videos are all available through OpenCV. Python, Java, and MATLAB are just a few of the numerous programming languages that it has bindings for. It is written in C++.Robotics, self-driving cars, AR, medical image analysis, and other fields are just a few of the fields where OpenCV can be employed. A wide range of algorithms and tools are included in the library, making it simple for programmers to build sophisticated computer vision applications.

The steps listed below can be used to broadly classify OpenCV's operation:

Loading and Preprocessing the Image/Video: OpenCV can load images or videos from a variety of sources such as files, cameras, or network streams. Once the image or video is loaded, it can be preprocessed by applying filters or transforming the image to a different color space, such as converting a color image to grayscale.
Feature Detection and Description: OpenCV can detect and extract features from an image or video, such as edges, corners, and blobs. These features can be used to identify objects or track their motion over time. OpenCV also provides algorithms for describing these features, which can be used to match them across multiple frames or images
Object Detection and Recognition: OpenCV can be used to detect and recognize objects in an image or video. This can be done using a variety of techniques, such as template matching, Haar cascades, or deep learning-based methods.
Tracking: OpenCV can track objects in a video stream by estimating their position and motion over time. This can be done using a variety of algorithms, such as optical flow, mean-shift, or Kalman filtering.
Image and Video Output: Finally, OpenCV can be used to display or save the processed images or videos. This can be done by showing the images in a window, writing the video frames to a file, or streaming the video over a network. In general, OpenCV offers a large variety of tools and techniques for working with image and video data, making it a potent library for computer vision applications.

V. IMPLEMENTATION

A. The Camera Used in the OpenAI Virtual Mouse System

The proposed system uses web camera for capturing images or video based on the frames. For capturing we are using CV library Opencv which is belongs to python web camera will start capturing the video and Opencv will create a object of video capture. To AI based virtual system the frames are passed from the captured web camera.

B. Capturing the Video and Processing

The capturing of the frame was done with the AI virtual mouse system until the program termination.Then the video captured has to be processed to find the hands in the frame in each set. The processing takes places is it converts the BRG images into RGB images,which can be performed with the below code,

image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)

image.flags.writeable = False

results = hands.process(image)

This code is used to flip the image in the horizontal direction then the resultant image is converted from the BRG scale to RGB scaled image.

C. Rectangular Region for Moving through the Window

The windows display is marked with the rectangular region for capturing the hand gesture to perform mouse action based on the gesture. when the hands are find under those rectangular area the detection begins to detect the action based on that the mouse cursor functions will be performed. The rectangular region is drawn for the purpose of capturing the hand gestures through the web camera which are used for mouse cursor operations.

Mouse Functions Depending on the Hand Gestures and Hand Tip Detection Using Computer Vision:

For the Mouse Cursor Moving around the Computer Window.
To Perform Left Button Click operation
To Perform Right Button Click operation
To perform a double click operation
To perform scrolling operation
To perform drag and drop operation
To perform multiple item selection
To perform volume controlling
To perform brightness controlling
For No Action / neutral gesture to be Performed on the Screen

F. Voice Assistant

The System Structure For Voice Assistant

Conceptual Design : One component of the system architecture is a microphone for capturing speech patterns.
The second is the transcription of audio files
Then Comparing the input to a set of rules that have already been established.
Resulting in the expected outcome.

The voice assistant feature has been included to launch gesture recognition through voice commands. And added certain features to improve the user engagement and they can assess whatever they need with less amount of effort and in hassle free manner. The voice assistant features which can be performed through the voice commands are:

a. To launch and end the gesture recognition

b. To search for something over internet

c. To find a location what we are looking for

d. To get an idea about Date and time

e. To copy and paste contents

f. Sleep / wakeup the voice assistant

g. To exit voice assistant

C. Merits

AI virtual mouse system is useful for many applications , as it can be used to reduce the space for using the physical mouse, and it can be used in situations where we cannot use the physical mouse. system eliminates the usage of devices, and it improves the human-computer interaction.
Amidst the COVID-19 situation, it is not safe to use the devices by touching them because it may result in a possible situation of spread of the virus by touching the devices, so the proposed AI virtual mouse can be used to control the PC mouse functions without using the physical mousesystem can be used to control robots and automation systems without the usage of devices 2D and 3D images can be drawn using the AI virtual system using the hand gestures.

Conclusion

AI virtual mouse using hand gestures is an innovative and exciting technology that has the potential to revolutionize the way we interact with computers. Here with the aid of a real-time camera, we have created a system to manage the mouse pointer and carry out its function. It offers users a more natural, intuitive, and accessible way to control the cursor on the screen, without the need for a traditional input device, a mouse. Furthermore, with additional voice assistant support, AI virtual mouse using hand gestures can further enhance the user experience. Voice assistant which is integrated with the virtual mouse system will provide users with even more control over their devices. Users can given voice commands to do a range of tasks, such as opening applications, navigating through menus, and performing web searches, in addition to controlling the cursor on the screen using hand gestures. As technology continues to evolve, we can expect to see even more innovative solutions that enhance the user experience and improve accessibility for all. This assistant does a good job at carrying out some of the duties the user specifies. In addition, this assistant may do a wide variety of tasks, including as text message delivery to the user\'s mobile device, YouTube automation, and information retrieval from Wikipedia and Google, all in response to a single voice query. The OpenAI Based voice assistant has allowed us to automate several services with the voice commands. The majority of the user\'s work, such as online searching, is simplified by this tool. Our goal is to make this tool so capable that it can take the position of human server administrators entirely. The project was constructed utilizing modules from open-source software that have the support of the Anaconda community, so any changes may be implemented quickly. The project\'s modular design allows for further customization and the installation of new features without impacting the workings of the existing system.

References

[1] Krishnamoorthi, M., Gowtham, S., Sanjeevi, K., & Revanth Vishnu, R. (2022). Virtual mouse using YOLO. In the international conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (pp. 1-7). [2] Roshnee Matlani., Roshan Dadlani., Sharv Dumbre., Shruti Mishra., & Abha Tewari. (2021). Virtual Mouse Hand Gestures. In the International Conference on Technology Advancements and innovations (pp. 340-345). [3] Shriram, S., Nagaraj, B., Jaya, J., Sankar, S., & Ajay, P. (2021). Deep Learning Based Real -Time AI Virtual Mouse System Using Computer Vision to Avoid COVID-19 Spread. In the Journal of Healthcare Engineering (pp. 3076-3083). [4] Varun, K.S., Puneeth, I., & Jacob, T.p. (2019). Virtual Mouse Implementation using OpenCV. In the International Conference on Trends in Electronics and Informatics (pp. 435-438) [5] Mayur, Yeshi., Pradeep, Kale., Bhushan, Yeshi., & Vinod Sonawane. (2016). Hand Gesture Recognition for Human-Computer Interaction. In the international journal of scientific development and research (pp. 9-13). [6] Guoli Wang., et.al. (2015). Optical Mouse Sensor-Based Laser Spot Tracking for HCI input, Proceedings of the Chinese Intelligent Systems Conference (pp. 329-340). [7] Baldauf, M., and Frohlich, p. (2013). Supporting Hand Gesture Manipulation of Projected Content with mobile phones. In the European conference on computer vision (pp. 381-390). [8] P. Nandhini, J. Jaya, and J. George, ?Computer vision system for food quality evaluation—a review,? in Proceedings of the 2013 International Conference on Current Trends in Engineering and Technology (ICCTET), Coimbatore, India, July 2013. [9] A. Haria, A. Subramanian, N. Asokkumar, S.Poddar, and J. S. Nayak, Hand gesture recognition for human computer interaction,? Procedia Computer Science, vol. 115, pp. 367–374, 2017. [10] Tharsanee, R.M., Soundariya, R.s., Kumar, A.S., Karthiga, M., & Sountharrajan, S. (2021). Deep Convolutional neural network-based image classification for COVID-19 diagnosis. In Data Science for COVID-19 (pp. 117-145). Academic Press.

Copyright

Copyright © 2023 J Kumara Swamy, Mrs. Navya V K. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET55412

Publish Date : 2023-08-19

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here