This article describes a neural network-based text-to-speech (TTS) synthesis system that can generate spoken audio in a variety of speaker voices. We show that the proposed model can convert natural-language text into speech in a target language, both synthesizing and translating speech from text. We quantify the importance of pretrained voice modules for obtaining the best generalization performance. Finally, using randomly sampled speaker embeddings, we show that speech can be synthesized in speaker voices not used during training, indicating that the model has learned high-quality speaker representations. We also introduce a multilingual system with an auto-tuner that translates ordinary text into another language, making multilingual output possible for a variety of applications.
I. INTRODUCTION
Voice cloning uses a computer to generate speech in the voice of a real person, using a neural network to clone that person's unique voice. This project uses a TTS system trained on a dataset of paired text and speech, which allows the system to learn how letters, words, and sentences sound (i.e., their waveforms). However, the resulting audio can only resemble the voices represented in the training dataset; this means the TTS system must be trained on speech from the target speaker in order to generate that specific voice. The text is then converted into natural-sounding speech. Synthetic speech can be generated by concatenating recorded speech segments. Alternatively, a synthesizer can combine a speech model with other characteristics of the human voice to create a fully "synthesized" speech output.
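The concatenative approach mentioned above can be sketched in a few lines. This is a minimal, illustrative example only: the "unit database" below is hypothetical (real systems store recorded diphone or phone waveforms indexed by linguistic context, not hand-made sample lists):

```python
# Toy concatenative synthesis: look up a prerecorded "segment" for
# each text unit and join the segments into one output waveform.
# UNIT_DB is a hypothetical stand-in for a real recorded-unit database.

UNIT_DB = {
    "h": [0.0, 0.2, 0.1],   # stand-ins for recorded speech samples
    "i": [0.3, 0.4, 0.2],
}

def synthesize(text):
    """Concatenate the recorded segment for each unit in the text."""
    waveform = []
    for unit in text:
        waveform.extend(UNIT_DB.get(unit, [0.0]))  # silence for unknown units
    return waveform

audio = synthesize("hi")  # six samples: segment for "h" then "i"
```

Real concatenative systems additionally smooth the joins between segments to avoid audible discontinuities; that step is omitted here.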
A. Voice Cloning
Voice cloning uses a computer to generate speech in the voice of a real person, using a neural network to clone that voice. The model consists of an encoder and a decoder and uses a vocoder to convert text to speech. After receiving the text data, the model detects endpoints and evaluates the output on the condition that the speech is clearly intelligible. An auto-tuner is also used to correct the pitch and smooth the voice.
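The encoder → decoder → vocoder flow described above can be sketched as follows. All three stages here are placeholder functions standing in for trained neural networks (e.g., a spectrogram-predicting decoder and a neural vocoder); only the data flow is real:

```python
# Hedged sketch of a TTS pipeline: text -> encoder -> decoder -> vocoder.
# Each stage is a toy stand-in for a trained network.

def encoder(text):
    """Map characters to integer embeddings (toy stand-in)."""
    return [ord(c) for c in text]

def decoder(embeddings, speaker_id=0):
    """Produce a fake 2-bin 'spectrogram' conditioned on a speaker id."""
    return [[e * 0.01 + speaker_id for _ in range(2)] for e in embeddings]

def vocoder(spectrogram):
    """Flatten the spectrogram frames into a 1-D 'waveform'."""
    return [sample for frame in spectrogram for sample in frame]

wave = vocoder(decoder(encoder("hi"), speaker_id=1))
```

The key design point the sketch preserves is that the speaker identity conditions only the decoder, so the same text encoding can be rendered in different voices.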
The system currently supports over 60 languages. Modern multilingual text-to-speech systems typically require large amounts of training data or handle only a few languages, but deep learning techniques enable this model to train on small amounts of data while achieving high-quality synthesis and stable voice cloning across multiple languages (English, German, French, Chinese, and Russian).
B. Tortoise (Text-To-Speech) Synthesis
The goal of this work is to build a TTS system that can produce natural speech for a variety of speakers in a data-efficient manner. Speech synthesis is a technology that allows a computer to convert written text into audible speech.
Tortoise is a text-to-speech synthesis system that produces highly realistic synthetic speech. Its design priorities, in order, are:
Powerful multi-voice functionality.
Very realistic prosody and intonation.
C. Auto Tuner
Auto-Tune uses a proprietary device to measure and alter the pitch of vocal and instrumental music recordings and performances. Training data consists of performance pairs that are identical except for pitch. Such pairs are needed for model training but are difficult to find naturally.
Therefore, we construct training inputs by detuning high-quality vocal performances and train a model to predict the pitch shifts that restore the original pitch.
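The detune-and-restore pair construction described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: pitch is represented as a list of values in cents, the detuning is a single constant shift, and the model that would learn to predict the restoring shift is out of scope:

```python
import random

# Build one (input, target) training pair: detune a clean pitch track
# by a random shift, and label the pair with the shift that restores
# the original pitch. Values are in cents (hypothetical contour below).

def make_training_pair(clean_pitch_cents, max_detune=100.0, rng=None):
    rng = rng or random.Random(0)                     # seeded for repeatability
    shift = rng.uniform(-max_detune, max_detune)      # applied detuning
    detuned = [p + shift for p in clean_pitch_cents]  # network input
    target = -shift                                   # restoring correction
    return detuned, target

clean = [0.0, 100.0, 200.0]        # hypothetical clean pitch contour
detuned, target = make_training_pair(clean)
restored = [p + target for p in detuned]  # applying the target undoes the detune
```

A real pipeline would apply time-varying shifts to audio rather than a constant offset to a pitch contour, but the supervision signal is constructed the same way: the label is whatever correction maps the detuned input back to the original.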
II. WORKING
Architecture diagrams provide visual representations of a software system's components. In software systems, the term architecture refers to the various functions, their implementation, and their interactions. An architecture diagram shows the overall structure of the system and the relationships, limits, and boundaries between its individual elements.
V. ACKNOWLEDGEMENT
This paper was supported by Alard College of Engineering & Management, Pune 411057. We are very thankful to all those who provided us with valuable guidance towards the completion of this Seminar Report on "Autotuned voice cloning enabling multilingualism" as part of the syllabus of our course. We express our sincere gratitude to the department for providing us with valuable assistance and the requirements for the system development. We are very grateful to Prof. Sakshi Shejole for guiding us in the right manner, resolving our doubts, giving us her time whenever required, and contributing her knowledge and experience to this project.
VI. CONCLUSION
In this research paper, we studied auto-tuned voice cloning that enables multilingualism. In the future, we plan to apply this model in Google Maps and other transportation services to create a familiar voice that sounds natural and makes spoken instructions fast and easy to understand.