Improving Indic Text to Speech using ChatGPT
Our plans on fixing the pronunciation issues using LLMs ⚡️
I don’t think I need to explain how OpenAI has taken the world by storm. I have been seeing a lot of prompt handbooks and prompt injection manuals across domains.
We decided to give it a shot to solve some burning problems at Dubverse.
The problem
We are working with one of the nation’s biggest brands on a text-to-speech project. The way it works is simple: they give us the files in text format, we generate the voiceovers, and our vendors listen to them and try to fix pronunciation issues. The fix is usually made by changing the phonemes or the spelling in the Dubverse web app.
We were working with Indian English, so the text had a lot of proper nouns that the TTS system would pronounce incorrectly. The reviewers could sometimes correct the pronunciation with creative respellings like the following.
devotees :: deivoties
deity :: deeitie
Dalhousie :: dal housy
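A spelling fix of this kind boils down to a whole-word substitution before the text reaches the TTS engine. Here is a minimal sketch of that idea in Python. It is not how the Dubverse web app does it; the `RESPELLINGS` table and `apply_respellings` function are hypothetical names, and the table only contains the three examples above.

```python
import re

# Hypothetical respelling table built from the three examples above.
# Keys are lowercase so lookups can be case-insensitive.
RESPELLINGS = {
    "devotees": "deivoties",
    "deity": "deeitie",
    "dalhousie": "dal housy",
}

def apply_respellings(text, respellings=RESPELLINGS):
    """Replace whole words with their TTS-friendly respelling."""
    if not respellings:
        return text
    # \b keeps "deity" from matching inside a longer word such as "deities".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in respellings) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: respellings[m.group(0).lower()], text)

print(apply_respellings("The devotees visited Dalhousie."))
```

The hard part is not the substitution but finding a respelling that works in the first place, which is what the rest of this post is about.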
There were some instances where I had to sit with the team till 3 am to attempt to fix them. We tried working with POS taggers to extract relevant words and then used Voxabot (an advanced SSML editor) to get the correct pronunciation. But we could only get a handful of them corrected.
I am the go-to automation guy at Dubverse; hence I decided to work on a fix. I gave LLMs a shot as we had already tried phonemes and spellings.
Solution 1: Few Shot Learning on GPT3
The premise here is simple. LLMs are good at recognising patterns and sticking to them.
For instance, if you want to generate a band name + a song written by them, you could go two ways. Head to the OpenAI Playground (account + subscription needed) and type the following.
Generate a band name:
This would give you a band name. You would then take that band name (say: Linkin Locals) and type the following into the playground.
Generate a song name written by Linkin Locals
This is a no-brainer method, also called Zero Shot generation. But we can do better. What if I showed it what band names look like and then asked it to follow the pattern and give me the outputs? Let’s see what a prompt here would look like.
Create a new song title and a new band name.
Band name: Linkin Park
Song title: Numb
###
Band name: Maroon5
Song title: Animals
###
Whoa, this will give us both the band name and the song title in just one query. The output will also follow the “###” pattern, so every new result is consistently formatted as:
Band name: The Nights
Song title: Ride of a Lifetime
###
You can use Python (or any programming language) logic on this to parse the outputs.
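As a rough illustration of that parsing step, here is a small sketch that splits a completion on the “###” separator and reads the two labelled fields. The function name `parse_band_songs` is made up for this example, and it assumes the model sticks to the “Band name:” / “Song title:” labels shown above.

```python
def parse_band_songs(completion):
    """Split a '###'-delimited completion into (band name, song title) pairs."""
    pairs = []
    for record in completion.split("###"):
        fields = {}
        for line in record.strip().splitlines():
            # Split on the first colon only, so titles may contain colons.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip().lower()] = value.strip()
        # Skip empty or incomplete records instead of failing on them.
        if "band name" in fields and "song title" in fields:
            pairs.append((fields["band name"], fields["song title"]))
    return pairs

completion = "Band name: The Nights\nSong title: Ride of a Lifetime\n###\n"
print(parse_band_songs(completion))
```

Records that are missing either field are dropped silently here; in a real pipeline you would probably want to log them.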
And that’s what we did. We gave it the manually corrected spellings and asked it to predict the spelling of a new word. Here is what the prompt would have looked like:
Following are the spelling changes in a document. Generate the spelling for a new word based on the spellings seen.
devotees :: deivoties
deity :: deeitie
dalhousie :: dal housy
cities ::
Roughly, the idea was for the model to pick up the pattern in the spelling changes and give a spelling of “cities” that would work with our TTS system.
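Assembling such a prompt from a list of corrected pairs is plain string formatting. The sketch below shows one way to do it; `build_few_shot_prompt` is a hypothetical helper, and sending the prompt to the model is left out on purpose.

```python
INSTRUCTION = (
    "Following are the spelling changes in a document. "
    "Generate the spelling for a new word based on the spellings seen."
)

def build_few_shot_prompt(examples, new_word):
    """Build the few-shot prompt: instruction, known pairs, then the open query."""
    lines = [INSTRUCTION]
    lines += [f"{word} :: {respelling}" for word, respelling in examples]
    # The last line ends at the separator so the model completes the respelling.
    lines.append(f"{new_word} ::")
    return "\n".join(lines)

examples = [
    ("devotees", "deivoties"),
    ("deity", "deeitie"),
    ("dalhousie", "dal housy"),
]
print(build_few_shot_prompt(examples, "cities"))
```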
I ran this through the corrected dataset I had, and well, this experiment didn’t work. I tried a lot of prompts and different ways to give it information, but all failed ):
Solution 2: Hindi + ChatGPT
I’ll admit this — I am a sucker for tech gossip. I followed the prompt injection stuff on Reddit before Riley Goodside made it mainstream.
Using GPT3, I had already tried giving it Hindi spellings with a technique known as Self Ask, but I guess it was limited by the amount of Hindi training data it had seen.
After working out the corrections with a colleague for hours, I gave the language idea another shot, a zero-shot attempt (see what I did there?).
Inspired by the prompts on the awesome ChatGPT Prompts GitHub repository, I crafted something similar.
I want you to act as an English pronunciation assistant for Hindi speaking people. I will write you words in Hindi and you will only answer their pronunciations, and nothing else. The replies must not be translations of my words but only pronunciations. Pronunciations should use English Latin letters for phonetics. Do not write explanation of my replies. My first word is “हेलो”
This surprisingly worked! My best guess is that by leaning on its zero-shot translation capabilities, I got it to spell out how the word is literally pronounced instead of generating made-up spellings.
We still use this prompt internally to fix some pronunciations that seem impossible to correct. The long-term plan is to integrate something like this into the product itself, so we can delight our users at scale (:
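To reuse the prompt for many words, it can be turned into a template. This is only a sketch: `build_pronunciation_prompt` is a hypothetical helper, it assumes the word has already been written in Devanagari, and pasting the result into ChatGPT is still a manual step.

```python
# The prompt above, with the word left as a placeholder.
PROMPT_TEMPLATE = (
    "I want you to act as an English pronunciation assistant for Hindi "
    "speaking people. I will write you words in Hindi and you will only "
    "answer their pronunciations, and nothing else. The replies must not be "
    "translations of my words but only pronunciations. Pronunciations should "
    "use English Latin letters for phonetics. Do not write explanation of my "
    "replies. My first word is “{word}”"
)

def build_pronunciation_prompt(hindi_word):
    """Fill the template with one Hindi word (assumed to be in Devanagari)."""
    return PROMPT_TEMPLATE.format(word=hindi_word.strip())

print(build_pronunciation_prompt("हेलो"))
```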
Do visit our website and follow us on Twitter.
Join our Discord community to get the scoop on the latest in Audio Generative AI!
We also launched NeoDub today. It enables you to clone your voice and speak any language!
withoutwax,
Tanay