Hello everyone, and welcome to the most anticipated deep dive, where we talk about how we built the first production-ready CLVC (cross-lingual voice cloning) model and scaled it to over a million users!
This is a long post, so please bear with me.
We built a custom Text-to-Speech model that speaks 13 languages (10 in production). The model can do cross-lingual voice cloning, meaning that with just 20 minutes of data it was able to clone a voice across these languages (the 10 in production being English, Hindi, Bengali, Punjabi, Gujarati, Marathi, Kannada, Oriya, Tamil, and Telugu). We called this NeoDub.
Samples: check out the NeoDub page.
Mr Beast in Hindi?
Shyam (Dubverse Speaker) in multiple languages? Samples in English, Punjabi, Tamil, and Telugu.
Or a real news reporter from Jagaran (a prominent Indian news network), proficient only in Hindi, can now speak in 7 more languages (the rest are here). Samples in Hindi, Bengali, Gujarati, and Tamil.
First, let's dive into the impact these speakers have had since launch.
Dubverse launched 21 speakers in 12 languages, out of which two got the most traction: Shaan and Sunidhi. Both are native Hindi speakers capable of speaking 10 more languages via NeoDub.
After a Mean Opinion Score (MOS) analysis (across different versions) comparing our speakers against Azure and GCP speakers, the Dubverse speakers came out on top.
The speakers are used at the core of our web app, and their usage analytics are comparable to those of the GCP and Azure speakers.
The speakers were also provided as an API to different partners, who published over 15K news articles with it in the very first month.
In this blog we will discuss the journey of researching and productionizing NeoDub. The first step in that journey was the literature survey.
Literature Survey:
This meant learning about the SOTA TTS systems out there and anticipating the deployment side of these models, such as the real-time factor (RTF). I have gone through at least 30-40 papers on TTS (in total, as this is a continuous process), starting with Tacotron, Tacotron2, Mellotron, FastSpeech, Talknet2, GlowTTS, YourTTS, etc.
I tried training (tiny experiments with) Tacotron2, Mellotron, FastSpeech, and Talknet2, and in the end settled on Talknet2 for its training speed (you can expect the model to be trained within 4-5 hours end to end on a small dataset). It is blazingly fast, and you can do style transfer off the shelf!
Less training time means more experiments and iterations, which helps with rapid learning and getting to the final results. All the tiny experiments were done on LJSpeech.
Resources:
Now that we had a baseline understanding of the model, we needed the other resources, i.e. data and compute. Previously I had been using Colab.
Now I had access to GCP V100s for my next experiments. These GPUs are way faster for training than Colab's T4, and I could access them without any connection errors.
For the data part, we collected open-source datasets from IIT Madras, which included around 10 hours of data per speaker (male and female) in 13 languages. In addition to this, we had the LJSpeech and LibriTTS datasets.
Model Understanding:
The Talknet2 model consists of three trainable components, plus a pre-trained vocoder:
Duration predictor: the same text can be spoken with different durations and speeds by different speakers, so it is necessary to predict the duration of the generated audio. It does so by predicting a duration vector, which consists of the duration of each phoneme and the blank symbol (~).
You can train an ASR system to force-align the phonemes and the blank symbol to get the ground truths.
Pitch predictor: the same duration vector can be spoken with different pitch (variation), and this variability is predicted by a pitch predictor. We used CREPE to extract the pitch ground truth for training (a minimal sketch follows after this list).
Mel Generator: This module takes in the Pitch and Duration tensors and generates the Mel Spectrogram.
Vocoder: a pre-trained HiFi-GAN is used to generate the audio waveform from the mel spectrogram.
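As mentioned above, CREPE gives us the frame-level pitch ground truth. Here is a minimal sketch of what that extraction could look like; the file path, hop size, and confidence threshold are illustrative choices, not our exact pipeline.

```python
# Minimal sketch: extracting a pitch contour with CREPE as a training target.
# File path and post-processing are illustrative, not the exact pipeline.
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read("speaker_clip.wav")       # hypothetical mono training clip
time, frequency, confidence, activation = crepe.predict(
    audio, sr, viterbi=True, step_size=10          # 10 ms hops, Viterbi smoothing
)

# Zero out low-confidence frames so the pitch predictor learns to output
# 0 Hz for silences and unvoiced regions (threshold is an assumption).
frequency[confidence < 0.5] = 0.0
```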
Talknet2 is easy to train since all of these modules use a 1D depth-wise separable convolutional architecture; it is very fast and light on training compute.
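To give a feel for why this is cheap, here is a minimal PyTorch sketch of one 1D depth-wise separable convolution block of the kind such modules stack; the channel count and kernel size are made-up values, not the actual Talknet2 configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """One depth-wise separable 1D conv block (illustrative sizes, not the exact Talknet2 config)."""
    def __init__(self, channels: int = 256, kernel_size: int = 5):
        super().__init__()
        # Depth-wise: one filter per channel, so far fewer parameters than a full conv.
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Point-wise: 1x1 conv that mixes information across channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (batch, channels, time)
        return self.act(self.norm(self.pointwise(self.depthwise(x))))

block = DepthwiseSeparableConv1d()
out = block(torch.randn(2, 256, 100))          # (batch=2, channels=256, frames=100)
```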
The fun part was that you can take a pre-recorded audio clip and its text, pass them through the ASR to get the duration vector, extract the pitch, feed everything through the mel-generator model, and you have style transfer to another speaker in no time.
Applied AI for NeoDub:
Version 0.1 (Jan 2022):
The Talknet baseline model, trained from scratch on the LJSpeech dataset (24 hours), is able to give good results for American English.
When trained on Hindi speakers, the model lacked stability and the voice quality was not that good, mainly because the speaker data was limited to only 6 hours. In addition to this, the pronunciations were bad for common words.
Version 0.2 (Jan 2022):
Using Espeak-ng to convert graphemes to phonemes reduced word pronunciation errors by a big margin. But the audio quality was still shaky and not of production quality.
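For illustration, grapheme-to-phoneme conversion with an Espeak-ng backend can be done via the phonemizer package roughly like this; the exact options and language codes in our pipeline may differ.

```python
# Sketch: graphemes -> phonemes with the espeak-ng backend.
# Requires espeak-ng installed on the system and `pip install phonemizer`.
from phonemizer import phonemize

hindi_text = "नमस्ते दुनिया"
phonemes = phonemize(
    hindi_text,
    language="hi",              # espeak-ng language code for Hindi
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
)
print(phonemes)                 # IPA-like phoneme string fed to the model instead of raw graphemes
```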
Version 0.3 (March 2022):
Pretraining the model on English data: now that we have a common ground of Espeak-ng phonemes (170 in total), English and Hindi can be converted into the same phoneme set. We can use this to train the model first on the 24 hours of English data, which gives the model stability, and then train it again on the Hindi dataset.
This resulted in good-quality audio output, good enough to put into production. Hence we started collecting our own data from professionals and released the first 4 original Dubverse Hindi speakers.
For production, the Talknet model is designed to be really fast, using 1D separable convolution layers, which reduce computation and the number of parameters at the same model depth. This results in an 11x RTF (roughly 11x faster than real time) without any optimisations; with 4 workers on a single T4, it worked like a charm.
This version used the previous datasets to provide stability and was fine-tuned on the collected speakers (1 hour per speaker).
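For context, RTF here is how many seconds of audio the model produces per second of compute. A rough way to check it (with `synthesize` as a placeholder for the actual model call) is:

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 22050) -> float:
    """Rough RTF check: seconds of audio generated per second of compute.
    `synthesize` is a placeholder for the model's text-to-waveform call."""
    start = time.perf_counter()
    waveform = synthesize(text)            # assumed to return a 1-D array of samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return audio_seconds / elapsed         # a value around 11 would match the figure above
```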
Version 0.4 (May 2022):
The model at this point still had shrieking and artefacts present in the generated audio. To solve this, we attached a GST (Global Style Tokens) module to the mel generator. These artefacts were only present for the Dubverse-collected data and not for the open-source datasets; the main reason for this is data quantity, since the Dubverse speakers had only 1 hour each.
GST module solved this problem effectively.
(In the GST architecture figure below, replace Tacotron with our model.)
GST takes in the same target audio clip, which we randomly chop to 3-5 s. It creates a fixed-size embedding, which is used as a query over the model's learned style tokens, and the resulting style embedding is given as input to the model.
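A minimal sketch of that token-attention idea in PyTorch is below; the sizes are illustrative, and the real module also includes a convolutional/recurrent reference encoder that turns the 3-5 s clip into the query embedding.

```python
import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    """Minimal GST sketch: a reference embedding queries a bank of learned style tokens.
    Sizes are illustrative, not the actual configuration."""
    def __init__(self, ref_dim: int = 128, num_tokens: int = 10,
                 token_dim: int = 256, heads: int = 4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))   # learned style tokens
        self.query_proj = nn.Linear(ref_dim, token_dim)
        self.attn = nn.MultiheadAttention(embed_dim=token_dim, num_heads=heads,
                                          batch_first=True)

    def forward(self, ref_embedding):          # ref_embedding: (batch, ref_dim) from a 3-5 s clip
        q = self.query_proj(ref_embedding).unsqueeze(1)           # (batch, 1, token_dim)
        kv = self.tokens.unsqueeze(0).expand(q.size(0), -1, -1)   # (batch, num_tokens, token_dim)
        style, _ = self.attn(q, kv, kv)
        return style.squeeze(1)                # style embedding fed to the mel generator

gst = GlobalStyleTokens()
style = gst(torch.randn(2, 128))               # (2, 256)
```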
Some great insights from the GST paper:
It can provide speaker information, and hence language information (more on this in Version 2.0).
If speaker information is provided separately, it can also act as a residual encoder, capturing information that is not present in the text or any other inputs, like noise.
It also replicates tonality, and thus carries pitch information as well.
In our model, since it is single-speaker (no speaker information is needed) and the pitch is provided separately, GST won't learn those. Thus GST is used to provide the residual information, which helps us get rid of the shrieking voices.
Version 1.0 (Aug 2022):
The previous versions were all single-speaker models and were quite difficult to manage in production. To solve this, we made it a multi-speaker model by introducing a speaker embedding. Moreover, this made the model more effective: a single model has now seen more data in more variations. Hence pronunciation mistakes also went down after this update.
This update is also interesting because you can select different speaker embeddings in the duration, pitch, and energy modules, and a different speaker embedding in the mel generator, resulting in different variations of the same sentence.
Instead of using self-learned speaker embeddings, we used x-vectors. An x-vector captures the speaker identity, along with speaking rate and some pitch information. It was computed at the audio-clip level and fed to the model.
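As an illustration of clip-level x-vector extraction, a pre-trained extractor such as SpeechBrain's VoxCeleb x-vector model can be used roughly like this (an example extractor, not necessarily the exact one we used):

```python
# Sketch: clip-level x-vector extraction with a pre-trained SpeechBrain model.
# `pip install speechbrain torchaudio`; the model expects 16 kHz mono audio.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

extractor = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_xvect",
)

signal, sr = torchaudio.load("speaker_clip.wav")   # hypothetical path
xvector = extractor.encode_batch(signal)           # (1, 1, 512) speaker embedding
xvector = xvector.squeeze()                        # fed to the model alongside phonemes and pitch
```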
Version 1.1 (Sep 2022):
We introduced a language embedding and an energy predictor to the model architecture; now all the languages and speakers were trained in a single model.
Version 2.0 (Jan 2023):
The previous model was multilingual and multi-speaker, working in 13 languages; it is fast and is in production.
However, if you change the language embedding in Version 1.1 to English for a Hindi speaker, it won't work. The reason is that the speaker embedding also carries language information (our initial hypothesis); later we found that the pitch was also carrying speaker, and hence language, information.
For the language information to be carried only by the language embedding, we have to limit the speaker information to the speaker embedding alone, and introduce the speaker embedding at a later stage of the model layers rather than at the start.
To remove the speaker information, we deployed a gradient reversal layer (GRL), which worked really well.
We basically attach a new task of classifying the speaker from the nth layer, where we introduce the speaker embedding. The classifier tries to reduce the speaker-classification loss, but during backpropagation the GRL reverses the sign of the gradient, which pushes the model to get rid of speaker-specific information.
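A minimal PyTorch sketch of such a gradient reversal layer, with an illustrative speaker classifier attached, looks like this (sizes and scaling factor are assumptions):

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # reversed gradient flows into the main network

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: hidden features from the nth layer go through the GRL before a
# speaker classifier, so minimizing speaker-classification loss pushes the
# backbone to *remove* speaker information. Shapes are illustrative.
hidden = torch.randn(8, 256, requires_grad=True)             # (batch, features)
speaker_logits = torch.nn.Linear(256, 21)(grad_reverse(hidden))
```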
Now the same model can speak in 13 languages!
The same was deployed for the pitch, duration and energy models.
Version 2.5 (Feb 2024, today):
The model has gone through multiple evaluation cycles while in production and has been used by 1 million users on the Dubverse.ai platform; it is also the only model available in Oriya! More than 20 Dubverse IP speakers are available in 12 languages on the product.
Limitations
Voice cloning with this architecture requires a quick fine-tuning on at least 5 minutes of data, which after augmentations becomes around 20 minutes. This only works for speakers whose voices are in the vicinity of the training-data speakers. The fine-tuning needs computational resources (it needs to train with additional languages' data as well). With the upcoming new models, this is possible with only 3-5 s of audio.
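The exact augmentations are not spelled out here, but simple speed perturbations of the kind sketched below (using torchaudio's SoX effects as an example, not our actual recipe) are one way a few minutes of clean audio becomes roughly 4x as much training data:

```python
# Illustrative only: speed-perturbed copies of a short cloning-target clip.
import torchaudio

waveform, sr = torchaudio.load("cloning_target.wav")      # hypothetical ~5-minute clip

augmented = [waveform]
for speed in ("0.9", "1.1"):
    effects = [["speed", speed], ["rate", str(sr)]]        # change speed, keep the sample rate
    out, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, sr, effects)
    augmented.append(out)
```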
Since the model is trained on low-resource data, its language understanding and acoustic knowledge of each language are limited.
There is a dependency on the Espeak-ng phonemizer, which is rule-based and in turn causes most of the pronunciation errors.
What Next?
NeoDub is still supervised in nature, which limits its ability to be trained on large unlabelled data. The current rise of the self-supervised learning paradigm enables training on large unlabelled corpora. In NeoDub's case this could be done by replacing the duration vector with VQ-VAE or wav2vec2 + k-means units, which means we would not need text to get the duration vector, and hence the mel generator could be trained on thousands of hours of unlabelled data for better audio quality.
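As a rough sketch of what such self-supervised units could look like, frame-level wav2vec 2.0 features can be clustered with k-means into discrete pseudo-phoneme units; the checkpoint name and cluster count below are illustrative assumptions:

```python
# Rough sketch: discrete units from wav2vec 2.0 + k-means as a possible
# drop-in replacement for the text-derived duration vector.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.cluster import KMeans

model_name = "facebook/wav2vec2-base"                  # illustrative checkpoint choice
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

def frame_features(waveform_16k: np.ndarray) -> np.ndarray:
    """1-D float waveform at 16 kHz -> (frames, hidden) wav2vec 2.0 features."""
    inputs = feature_extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state   # (1, frames, 768)
    return hidden.squeeze(0).numpy()

# K-means over features from unlabelled audio gives frame-level discrete units
# (fitted here on a single dummy clip just to show the shapes).
feats = frame_features(np.random.randn(16000 * 3).astype("float32"))   # 3 s of dummy audio
units = KMeans(n_clusters=100, n_init=10).fit_predict(feats)           # one unit id per frame
```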
Check out our open-source efforts on the same by visiting MahaTTS.
Until next time,
Jaskaran
Follow me for more deep learning content on LinkedIn and Twitter.