IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Dr. Nisha Auti, Atharva Pujari, Anagha Desai, Shreya Patil, Sanika Kshirsagar, Rutika Rindhe
DOI Link: https://doi.org/10.22214/ijraset.2023.51825
Automatic Speech Recognition (ASR) technology has revolutionized human-computer interaction by allowing users to communicate with computer interfaces using their voice in a natural way. Speaker recognition is a biometric recognition method that identifies individuals based on their unique speech signal, with potential applications in security, communication, and personalization. Sentiment analysis is a statistical method that analyzes the unique acoustic properties of the speaker's voice to identify emotions or sentiments in speech, allowing automatic speech recognition systems to categorize speech as Positive, Neutral, or Negative. While sentiment analysis has been developed for various languages, further research is required for regional languages. This project aims to improve the accuracy of automatic speech recognition systems by implementing advanced audio signal processing and sentiment analysis detection. The proposed system will identify the speaker's voice and analyze the audio signal to detect the context of speech, including the identification of foul language and aggressive speech. The system will be developed for a Marathi language dataset, with potential for further development in other languages.
I. INTRODUCTION
Speaker recognition, also known as voice recognition, is a technology that enables automatic identification of an individual based on their unique voice characteristics. It is commonly used in various applications, such as security systems, virtual assistants, and biometric authentication. The system architecture of speaker recognition comprises various components, including an audio input device, a feature extraction module, a machine learning algorithm, and a speaker model database.
The first step in speaker recognition involves capturing the audio input and converting it into digital form through pre-processing techniques. The feature extraction module then extracts relevant features from the signal that can be used to identify the speaker, such as pitch, spectral content, and other voice characteristics. These extracted features are then fed into a 1-Dimensional Convolutional Neural Network (CNN) that compares them to a database of speaker models to identify the most probable speaker. Techniques like clustering or classification can be used by the CNN to improve the accuracy of the identification process.
Once the speaker has been identified, the system can proceed to perform sentiment analysis on the same audio. Sentiment analysis is the process of identifying and extracting the sentiment expressed in a given audio file in .wav format. This process classifies speech into categories such as Neutral, Aggressive, Negative, or Abusive to generate a Graded Sentiment Analysis report. The ability to analyze sentiment can be beneficial in applications such as call center quality monitoring, customer feedback analysis, and market research.
Overall, speaker recognition is an essential technology with numerous applications, including speech recognition in virtual assistants, secure access to restricted areas, voice dialing, banking, and computing. The system architecture combines audio input and signal processing, machine learning, and a database of speaker models, making it a powerful tool for automatically identifying individuals by their unique voice characteristics and for performing sentiment analysis on the audio file.
II. BACKGROUND
Speaker recognition is the process of identifying a person by their voice. In audio signal processing, various techniques and algorithms can be used to extract speech from an audio signal. Sentiment analysis of audio signals is a method for determining whether an audio message carries positive or negative sentiment. This can be used to improve speech recognition systems by providing a more accurate transcription of the original audio messages.
The speaker recognition algorithm is one of several techniques used in our work, together with feature extraction methods such as Mel-frequency cepstral coefficients (MFCC) and the Mel-frequency cepstral transform (MFCT). Another approach to recognizing speakers is to use neural networks. This technique has been shown to outperform classical acoustic modeling for most applications because it allows more complex models without sacrificing recognition accuracy.
Speaker recognition methods have two important modules: enrolment and verification. At the time of enrolment, the subject's audio signals are captured, important features are extracted from them, and finally a model, also termed a voice print, is created. In the verification phase, a voice sample, commonly termed an utterance, is matched against the previously created model. The important difference between identification and verification is that verification matches the utterance against a single voice print, whereas identification matches the utterance against multiple models.
In order to perform speaker recognition and sentiment analysis over audio signals, several technologies are used, such as speech recognition systems, text-to-speech synthesis engines, and acoustic models. These technologies can serve different purposes, such as voice command systems that speak commands or reminders to users, or entertainment applications that read books or play music.
Speaker recognition is a very powerful tool for identifying speakers in any given environment, but its accuracy directly affects an application's overall performance. That is because speaker recognition depends on many factors: the environment (temperature, background noise), acoustics (room size, acoustic properties), and even age (a speaker's voice changes as they get older). All of these variables can affect how well the system performs when it comes time to identify speakers.
A. Limitations of Existing System
The major limitation is that the existing system can only differentiate between male and female voices; it cannot determine the identity of the person to whom the voice belongs. The tonal frequencies of the Marathi language have been studied only at a very small scale, which is a major limitation we are trying to overcome. Another limitation is that the existing system cannot perform speaker recognition and sentiment analysis of the speech at the same time, and there is no deployment mechanism for the existing models.
Features of neural network-based speaker recognition models:
III. METHODOLOGY
A. System Architecture
1. User Input: The user provides an audio recording of the speaker, typically as an MP3 file, which serves as the raw input to the system.
2. MP3 to Wave Conversion: MP3 files undergo lossy compression, so some of the original audio data is discarded during encoding. Conversion of MP3 files to WAV is preferred because WAV files are uncompressed and preserve all of the decoded audio data without further loss. WAV files are popular in professional audio editing software, allowing more in-depth adjustments to levels and effects, and they are well suited for archiving and maintaining high-quality audio recordings.
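For reference, this conversion step can also be scripted. The sketch below uses the pydub library, which is an assumption on our part (the paper performs the step manually in WavePad, see Section III.B), and the file names are hypothetical.

```python
# Minimal sketch: decode an MP3 file and re-export it as uncompressed PCM WAV.
# pydub requires ffmpeg to be installed for MP3 decoding.
from pydub import AudioSegment

def mp3_to_wav(mp3_path: str, wav_path: str) -> None:
    audio = AudioSegment.from_mp3(mp3_path)   # decompress the MP3 stream
    audio.export(wav_path, format="wav")      # write raw PCM samples to WAV

mp3_to_wav("speaker_sample.mp3", "speaker_sample.wav")  # hypothetical file names
```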
3. Processed User Input: Converting an MP3 file to Wave format decompresses the encoded audio into raw samples; the data discarded during MP3 compression cannot be restored, but the resulting uncompressed representation is more convenient and precise to work with. The Wave format is well suited to professional audio editing and production, and the converted Wave file can be used for mixing, mastering, or as an archive copy of the recording. However, it is important to note that Wave files have a much larger file size than MP3 files due to their uncompressed nature.
4. Feature Extraction: To perform feature extraction on audio signals in Wave format, the raw audio data is analyzed to extract relevant characteristics that represent the audio signal. These features can include frequency content, amplitude, duration, and temporal structure. Feature extraction methods include time-domain, frequency-domain, and cepstral analysis. Extracted features include spectral, temporal, and statistical features, such as spectral centroid, zero-crossing rate, and mean signal energy. These features are useful for audio classification and signal processing applications.
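A hedged sketch of this step is shown below using the librosa library; librosa is an assumption on our part, since the paper does not name the feature-extraction library, and the chosen features simply mirror the examples listed above.

```python
# Sketch: extract a few spectral, temporal and statistical features from a WAV file.
import librosa
import numpy as np

def extract_features(wav_path: str) -> dict:
    y, sr = librosa.load(wav_path, sr=None)   # raw waveform and its sample rate
    return {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1),        # spectral envelope
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(y).mean(),
        "mean_energy": float(np.mean(y ** 2)),                                    # average signal energy
    }
```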
5. Robust Processing: To enhance the quality and accuracy of audio data, robust processing of wave format audio signals involves applying techniques that can mitigate various types of distortions and noise that can occur during the recording, transmission, or storage of audio data. Techniques for robust processing include noise reduction, error correction, dynamic range compression, equalization, and source separation. These techniques can improve the signal-to-noise ratio, reduce errors, adjust the frequency response of the audio signal, and separate individual sound sources from a mixture of sound sources.
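The sketch below illustrates two of the simpler techniques mentioned above, peak normalization (a basic form of dynamic range control) and a pre-emphasis filter that boosts high frequencies relative to low-frequency noise; the exact robust-processing techniques used in the system are not specified, so this is only an illustration.

```python
# Sketch: two common robustness steps applied to a 1D waveform (NumPy array).
import numpy as np

def normalize_peak(signal: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    # Scale the waveform so its largest absolute sample equals target_peak.
    peak = np.max(np.abs(signal)) + 1e-9
    return signal * (target_peak / peak)

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    # y[n] = x[n] - coeff * x[n-1]: emphasizes high frequencies over low-frequency noise.
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```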
6. CNN Algorithm: The ResNet-based model architecture takes a 1D audio signal with a specified sample rate and channel as input. Residual blocks with varying numbers of convolutional layers and filters are used to process the input, with skip connections aiding in gradient flow. After passing through residual blocks, the output is downsampled, flattened, and passed through two fully connected layers before being fed into the output layer. The model is compiled using the Adam optimizer and sparse categorical cross-entropy loss function, and includes early stopping and model checkpoint callbacks. The build_model function takes the number of classes and model name as input arguments.
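A Keras sketch consistent with this description is given below. The filter schedule, dense layer sizes, and default input shape are illustrative assumptions; the text above only specifies residual blocks with skip connections, downsampling, two fully connected layers, a softmax output, the Adam optimizer, and sparse categorical cross-entropy.

```python
# Sketch of the ResNet-style 1D CNN described above (sizes are assumptions).
from tensorflow import keras

def residual_block(x, filters, conv_num=3, activation="relu"):
    s = keras.layers.Conv1D(filters, 1, padding="same")(x)       # shortcut branch
    for _ in range(conv_num - 1):
        x = keras.layers.Conv1D(filters, 3, padding="same")(x)
        x = keras.layers.Activation(activation)(x)
    x = keras.layers.Conv1D(filters, 3, padding="same")(x)
    x = keras.layers.Add()([x, s])                                # skip connection
    return keras.layers.Activation(activation)(x)

def build_model(num_classes, model_name="speaker_model", input_shape=(8000, 1)):
    # input_shape default is an assumption (one second of 8 kHz, single-channel audio).
    inputs = keras.layers.Input(shape=input_shape, name="input")
    x = inputs
    for filters in (16, 32, 64, 128, 128):                        # assumed filter schedule
        x = residual_block(x, filters)
    x = keras.layers.AveragePooling1D(pool_size=3, strides=3)(x)  # downsample
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(256, activation="relu")(x)
    x = keras.layers.Dense(128, activation="relu")(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax", name="output")(x)
    model = keras.models.Model(inputs=inputs, outputs=outputs, name=model_name)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```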
7. Speaker Recognition: To evaluate the performance of the speaker recognition model, a test dataset (test_ds) is created from the audio files and their corresponding labels. The audio samples are then batched, shuffled, and augmented with noise using the add_noise function. The audio waveforms are transformed into their frequency-domain representation using audio_to_fft for each batch, and the model is used to predict the speaker labels with model.predict. A sample of audio files (SAMPLES_TO_DISPLAY) is randomly selected from the batch, and their predicted labels (y_pred) are compared with their true labels (labels). If the predicted label matches the true label, the message "Welcome" is printed, followed by the predicted label with the message "The speaker is". Otherwise, only the predicted label is printed with the message "Sorry".
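A hedged sketch of this evaluation loop is shown below. The helpers add_noise and audio_to_fft, the dataset test_ds, the constant SAMPLES_TO_DISPLAY, and the additional names noises and class_names are assumed from the description above and are not reproduced here.

```python
# Sketch of the evaluation loop (assumes eager TensorFlow and that add_noise,
# audio_to_fft, test_ds, model, noises, class_names and SAMPLES_TO_DISPLAY
# already exist as described in the text).
import numpy as np

for audios, labels in test_ds.take(1):                 # one batch of test samples
    audios = add_noise(audios, noises, scale=0.5)      # augment with background noise
    ffts = audio_to_fft(audios)                        # frequency-domain representation
    y_pred = np.argmax(model.predict(ffts), axis=-1)   # predicted speaker indices
    labels = labels.numpy()

    # Randomly pick a few samples and report whether the speaker was recognised.
    for index in np.random.choice(len(audios), SAMPLES_TO_DISPLAY, replace=False):
        if labels[index] == y_pred[index]:
            print("Welcome")
            print("The speaker is", class_names[y_pred[index]])
        else:
            print("Sorry")
            print("Predicted speaker:", class_names[y_pred[index]])
```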
8. Sentiment Analysis: The analyze_sentiment function is designed to perform sentiment analysis on an audio file. The function takes a filename as input, loads the audio file using the SpeechRecognition library, and transcribes the speech to text using the Google Speech Recognition API with Marathi language specified. The TextBlob class from the textblob library is used to analyze the sentiment of the transcribed text and returns a polarity score between -1 (negative sentiment) and +1 (positive sentiment). Finally, the sentiment score is printed with a label indicating whether the sentiment is positive, negative, or neutral.
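A sketch of the analyze_sentiment function, consistent with the description above, is shown below; the exact source code is not reproduced in the paper, and "mr-IN" is assumed as the Marathi language code for the Google Speech Recognition API.

```python
# Sketch: transcribe a Marathi .wav file and score its sentiment with TextBlob.
import speech_recognition as sr
from textblob import TextBlob

def analyze_sentiment(filename: str) -> float:
    recognizer = sr.Recognizer()
    with sr.AudioFile(filename) as source:             # expects a .wav file
        audio = recognizer.record(source)
    text = recognizer.recognize_google(audio, language="mr-IN")   # speech-to-text
    polarity = TextBlob(text).sentiment.polarity       # -1 (negative) .. +1 (positive)
    if polarity > 0:
        print("Positive sentiment:", polarity)
    elif polarity < 0:
        print("Negative sentiment:", polarity)
    else:
        print("Neutral sentiment:", polarity)
    return polarity
```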
9. Graded Sentiment Analysis: To analyze the sentiment of the transcribed text, the code uses the SentimentIntensityAnalyzer object from the nltk library. It loads the audio file of the speaker using the SpeechRecognition library and records the audio data with the record method of the AudioFile object. Then, the recognize_google method of the SpeechRecognition library is used to transcribe the speech, specifying the language of the speech as Marathi. The polarity_scores method of the SentimentIntensityAnalyzer object is then used to obtain sentiment scores, including a compound score that ranges from -1 (very negative) to 1 (very positive). The sentiment scores are printed to the console using the print function. The accuracy of the sentiment analysis may depend on the quality of the transcribed text and the accuracy of the speech recognition.
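A sketch of the graded sentiment analysis step is given below, again with "mr-IN" assumed as the language code; note that VADER's lexicon is English, so the resulting scores depend heavily on the quality of the transcription.

```python
# Sketch: transcribe the speaker's audio and obtain graded (compound) sentiment
# scores with NLTK's SentimentIntensityAnalyzer.
import nltk
import speech_recognition as sr
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")   # one-time download of the VADER lexicon

def graded_sentiment(filename: str) -> dict:
    recognizer = sr.Recognizer()
    with sr.AudioFile(filename) as source:
        audio = recognizer.record(source)
    text = recognizer.recognize_google(audio, language="mr-IN")
    scores = SentimentIntensityAnalyzer().polarity_scores(text)
    print(scores)                # includes a 'compound' score in [-1, 1]
    return scores
```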
B. Algorithm
1. MP3 to WAV Conversion (WavePad)
a. Open WavePad and load the MP3 file you want to convert.
b. Go to the "File" menu and select "Save As".
c. Choose "Waveform Audio File Format (WAV)" as the output format.
d. Select the output location and file name.
e. Click "Save" to begin the conversion process.
Once the conversion is complete, use the editing tools to split the audio file at desired points and save the split files to the desired location.
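A scripted alternative to this manual splitting step is sketched below using pydub (again an assumption on our part); the split points and file names are hypothetical.

```python
# Sketch: split a WAV file into segments at fixed points (in milliseconds),
# as a scripted alternative to the WavePad editing tools.
from pydub import AudioSegment

audio = AudioSegment.from_wav("speaker_sample.wav")     # hypothetical file
split_points_ms = [0, 15_000, 30_000, len(audio)]       # hypothetical split points

for i, (start, end) in enumerate(zip(split_points_ms[:-1], split_points_ms[1:])):
    audio[start:end].export(f"segment_{i}.wav", format="wav")
```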
2. Speaker Recognition
Step 1: Data Preparation
Prepare labeled audio recordings of speakers for training and evaluation
Split the data into training and validation sets
Step 2: Feature Extraction
Extract speech features from audio recordings, such as MFCC, PLP, or other relevant features
Normalize the feature values if needed
Step 3: Model Definition
Define the 1D CNN model with residual blocks, downsampling, and fully connected layers described in Section III.A
Step 4: Model Training
Compile the model by specifying the loss function, optimizer, and evaluation metrics
Train the model using the prepared training data and validation data
Monitor the training progress and tune hyperparameters as needed
Step 5: Model Evaluation
Evaluate the trained model on a separate test set or cross-validation set
Compute performance metrics such as accuracy, precision, recall, F1-score, etc.
Step 6: Prediction/Inference
Use the trained model to make predictions on new, unseen audio recordings.
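A minimal sketch of Steps 4 to 6 is given below, assuming train_ds, valid_ds, and test_ds are prepared datasets of (feature, label) pairs, NUM_SPEAKERS is the number of enrolled speakers, and build_model is the function sketched in Section III.A; all of these names are assumptions.

```python
# Steps 4-6: train the model with callbacks, evaluate on held-out data, and
# run inference on new recordings (names are illustrative).
from tensorflow import keras

model = build_model(num_classes=NUM_SPEAKERS)      # model from the Section III.A sketch

callbacks = [
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint("speaker_model.h5", save_best_only=True),
]

history = model.fit(train_ds, validation_data=valid_ds, epochs=50, callbacks=callbacks)

test_loss, test_acc = model.evaluate(test_ds)      # Step 5: held-out evaluation
print("Test accuracy:", test_acc)

predictions = model.predict(test_ds)               # Step 6: inference on unseen audio
```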
3. Sentiment analysis
Step 1: Import Libraries
Import the TextBlob library in Python
Import the SpeechRecognition library as sr in Python
Step 2: Sentiment Analysis
Load the audio file with sr, transcribe the speech to text using the Google Speech Recognition API with Marathi specified, and compute the polarity score of the transcribed text with TextBlob
C. Mathematical Model
The residual block function takes an input `x` of shape `(n, c)` and produces an output `y` of shape `(n, filters)`, where `filters` is the number of filters and `conv_num` is the number of convolutional layers in the block. The operations in the block can be expressed mathematically as:
1. `s = Conv1D(filters, 1, padding='same')(x)`, where `s` is the skip-connection branch: a 1D convolution with filter size 1 applied to the block input so that it has `filters` output channels and can be added back to the convolutional path below.
2. `for i in range(conv_num - 1):`
`x = Conv1D(filters, 3, padding='same')(x)`, where `x` is the result of applying a 1D convolution with filter size 3 to `x` with `filters` output filters and same padding.
`x = Activation(activation)(x)`, where `activation` is the activation function used in the block (default is ReLU).
3. `x = Conv1D(filters, 3, padding='same')(x)`, where `x` is the result of applying a 1D convolution with filter size 3 to `x` with `filters` output filters and same padding.
4. `y = Add()([x, s])`, where `y` is the sum of `x` and `s`.
5. `y = Activation(activation)(y)`, where `activation` is the activation function used in the block.
The model architecture is a series of residual blocks with an increasing number of filters, followed by average pooling, flatten, and dense layers; each residual block takes the previous block's output as input and produces an output whose channel dimension equals its filter count. The operations in the model can be expressed mathematically as:
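A sketch of these operations is given below, assuming `B` residual blocks with filter counts `f_1 <= ... <= f_B`; the exact counts and layer widths are not specified above, so the notation is kept generic.

```
h_0 = x                                         (input signal)
h_b = ResBlock(h_{b-1}, f_b),   b = 1, ..., B   (residual blocks, increasing filters)
p   = Flatten(AvgPool(h_B))                     (downsampling and flattening)
z_1 = ReLU(W_1 p + b_1)                         (first fully connected layer)
z_2 = ReLU(W_2 z_1 + b_2)                       (second fully connected layer)
y   = softmax(W_out z_2 + b_out)                (class probabilities over speakers)
```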
IV. TECHNOLOGY STACK
V. FUTURE SCOPE
The project described above has the potential to be extended in various ways to support more advanced applications. For instance, it can be trained to perform speaker and sentiment analysis in multiple languages. Real-time analysis of audio streams can also be implemented, especially for use cases such as call centers, video conferencing, and customer support.
Additionally, the project can be integrated with popular voice assistants such as Siri, Alexa, and Google Assistant to provide personalized and accurate responses based on the user's voice. It can also be utilized in speech therapy applications, helping individuals with speech disorders or disabilities by providing feedback on their speech patterns, tone, and emotions. The project can also be utilized in various other applications such as political analysis, content analysis, and online reputation management. For instance, in political analysis, the project can be used to analyze the tone and emotions of politicians in their speeches, providing insights into their public image and their effect on the general public. In mental health applications, the project can be used to recognize and analyze the emotions expressed by patients in their speech, providing valuable insights for diagnosis and treatment. Overall, the project can be further developed and extended to cater to various future applications and advancements.
Our proposed system is a deep neural network-based model for speaker recognition and tonal speech detection. It is designed to recognize speakers from their speech signals and then analyze their speech to detect emotional tones such as joy, sorrow, aggression, and others. The system is specifically developed for Marathi, one of the Indian regional languages, and other audio signals. While there have been numerous studies on tonal speech recognition systems for Asian continental languages and Indo-European languages, very little work has been done on tonal speech recognition for Indian regional languages. Our system fills this gap and provides a secure and protected environment for speaker authentication and voice identification. As the area of speech recognition and speaker identification is continually evolving, there are vast possibilities for future enhancements. With improved recognition accuracy and speech quality, voice authentication can provide a secure foundation for the various services we use in our daily lives. Communication will become easier and more reliable for everyone as technology advances in this field.
Copyright © 2023 Dr. Nisha Auti, Atharva Pujari, Anagha Desai, Shreya Patil, Sanika Kshirsagar, Rutika Rindhe. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET51825
Publish Date : 2023-05-08
ISSN : 2321-9653
Publisher Name : IJRASET