When Whisper 1.0 Gets It Wrong: An Inside Look at Speech-to-Text Failures
Discover the untold truth of OpenAI's Whisper. Join us on a journey through its failed cases in this technical study.
OpenAI dropped a bomb on the speech recognition scene with their speech-to-text model, Whisper. This multi-task audio model is capable of working its magic on a whopping 97 different languages. Whisper represents a major leap forward in the field of ASR, and it's sure to be a game-changer for all you language lovers out there. But let's face it, we've all seen those appealing blogs that talk about Whisper's architecture and use cases, and this ain't one of those blogs.
We're going to dive into the world of Whisper's unsuccessful cases in this article because, let's be real, nothing is perfect, not even Whisper. So relax, get a cup of coffee, and get ready to learn about the less glamorous side of Whisper 1.0. But don't worry, it's not all bad news. In fact, these failed cases can help us become more conscious of the potential issues and come up with better use cases or integrations. Either way, you're in for a treat. 😋
So, how did they do it?
Well, the creators behind the Whisper model used a technique called weak supervision during training, pumping 680,000 hours of multilingual and multitask data into the machine. And get this: they pulled all that data straight from the web. We're talking about web scraping at scale!
They then put plenty of effort into the pre-processing phase to clean that data up. Each audio recording was resampled to 16,000 Hz and divided into 30-second segments for training. Additionally, to make sure each audio clip and its paired text were in the same language, they ran language detection over the dataset.
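To make that pre-processing step concrete, here is a minimal sketch of resampling to 16,000 Hz and cutting the waveform into 30-second chunks. It uses librosa purely for illustration; the library choice, the function name, and the chunking details are our assumptions, not the authors' actual pipeline.
import librosa

SAMPLE_RATE = 16_000      # Whisper's expected sample rate
CHUNK_SECONDS = 30        # training segments are 30 seconds long

def to_training_segments(audio_path):
    # librosa resamples to the requested rate while loading (mono audio assumed here)
    audio, _ = librosa.load(audio_path, sr=SAMPLE_RATE)
    chunk_size = SAMPLE_RATE * CHUNK_SECONDS
    # slice the waveform into consecutive 30-second windows; the last one may be shorter
    return [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]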
And the result of all this hard work?
A 680,000-hour dataset made up of 117,000 hours of non-English audio spanning 96 different languages, 125,000 hours of X→en translation data, and 438,000 hours of English transcription.
Yes, you read that right!
The limitations below were observed mostly on low-resource languages. Please note that results can vary from run to run, so these issues may not always occur.
# Default experimentation setting using the OpenAI whisper Python package for audio transcription is demonstrated below.
import whisper

audio_path = '/content/audio.wav'
task = 'transcribe'
language = 'kn'  # language codes are lowercase; 'kn' is Kannada
transcribe_args = {'task': task, 'language': language}

# large-v2 is the largest multilingual checkpoint available at the time of writing
model = whisper.load_model('large-v2')
transcript = model.transcribe(audio_path, **transcribe_args,
                              initial_prompt=None)  # no prompt was used for these runs
# Default experimentation setting using the OpenAI Whisper API for audio transcription is demonstrated below.
import openai

audio_path = '/content/audio.wav'
audio_file = open(audio_path, 'rb')

# 'verbose_json' returns segment-level timestamps, which is where most of the
# issues discussed below show up
transcript = openai.Audio.transcribe('whisper-1', audio_file,
                                     response_format='verbose_json',
                                     prompt=None)
Repeated Outputs and Missing Actual Segments
Whisper runs into a problem with repeating output segments after moments of silence or non-speech activity, and sometimes it even misses actual output segments altogether. You know, those moments when someone pauses to cough or when the background noise gets a little too loud.
Let me walk you through what this looks like in practice.
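Concretely, Whisper's result is a dict with a `segments` list, each entry carrying `start`, `end`, and `text`. After a long cough or stretch of silence you can see the same sentence echoed across several consecutive segments, while the words actually spoken never show up at all. Below is a minimal post-processing sketch that collapses consecutive duplicates; it's a crude workaround rather than a fix, and `collapse_repeats` is our own hypothetical helper, not part of the whisper package.
import whisper

# same setup as the snippet earlier in the post
model = whisper.load_model('large-v2')
result = model.transcribe('/content/audio.wav', task='transcribe', language='kn')

def collapse_repeats(segments):
    # drop consecutive segments whose text is identical: a crude guard
    # against Whisper looping on the same output after silence or noise
    cleaned = []
    for seg in segments:
        if cleaned and seg['text'].strip() == cleaned[-1]['text'].strip():
            # stretch the previous segment instead of keeping the duplicate
            cleaned[-1]['end'] = seg['end']
        else:
            cleaned.append(seg)
    return cleaned

result['segments'] = collapse_repeats(result['segments'])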
Merging of Non-Voice Activity and Voice Active Speech: A Folly Leading to Erroneous Timestamps
One of the major issues with Whisper is that it outputs a chunk of text after a period of silence or other non-voice activity, and combines the timestamps for the silence and the subsequent speech-active segment. It's like getting Whisper results for an audio clip where a person is having a conversation with someone who keeps interrupting themselves with random noises or coughing fits.
To make matters worse, the timestamps get all messed up: the coughing and the speech are lumped into a single segment with one pair of timestamps, making the output difficult to use for subtitling or dubbing.
One common workaround also makes the point clearer: keep the non-speech out of Whisper's input in the first place.
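Here's a rough sketch of that idea, using Silero VAD via torch.hub to find the speech regions, transcribing only those, and shifting the timestamps back onto the original timeline. The helper names come from the silero-vad repository, but the glue code and the reuse of the Kannada setup from earlier are our own assumptions, not a tested recipe.
import torch
import whisper

SAMPLE_RATE = 16_000

# Silero VAD ships a small model plus helper functions via torch.hub
vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

asr_model = whisper.load_model('large-v2')

wav = read_audio('/content/audio.wav', sampling_rate=SAMPLE_RATE)
speech_regions = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)

segments = []
for region in speech_regions:
    # transcribe only the detected speech, so coughs and silence never reach Whisper
    chunk = wav[region['start']:region['end']].numpy()
    result = asr_model.transcribe(chunk, task='transcribe', language='kn')
    offset = region['start'] / SAMPLE_RATE
    for seg in result['segments']:
        # shift timestamps back onto the original audio's timeline
        seg['start'] += offset
        seg['end'] += offset
        segments.append(seg)
This is essentially the same trick that the VAD-based projects listed at the end of this post (WhisperX, Stable-ts) apply in a more polished way.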
Inaccurate Starting Timestamps
Whisper's got another problem that'll make you want to pull your hair out. Sometimes a recording starts with nothing but silence, and if that happens, you can kiss accurate timestamps goodbye: the starting timestamp of the first segment can land in the wrong place, which throws off the accuracy of the timestamps for the rest of the transcript.
It's like trying to navigate through a maze blindfolded. You'll never know where you are, and you'll keep running into walls.
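If the silence sits right at the start, one cheap workaround is to trim it off before transcribing and then add the trimmed duration back onto every timestamp. Below is a minimal sketch with librosa; the top_db=30 threshold is a guess you'd tune per recording, and this is an illustration of the idea rather than a guaranteed fix.
import librosa
import whisper

model = whisper.load_model('large-v2')

audio, sr = librosa.load('/content/audio.wav', sr=16_000)
# trim the quiet lead-in (and tail); index[0] is the sample where the kept audio starts
trimmed, index = librosa.effects.trim(audio, top_db=30)
lead_offset = index[0] / sr

result = model.transcribe(trimmed, task='transcribe', language='kn')
for seg in result['segments']:
    # re-anchor the timestamps to the original, untrimmed recording
    seg['start'] += lead_offset
    seg['end'] += lead_offset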
Challenges in Identifying Repeated Words
On certain occasions, the Whisper speech-to-text system has difficulty with repeated words within a segment: when a word is spoken several times in a row, some of the repetitions may simply be omitted from the transcribed output.
Lost in Translation: The Tragic Consequences of Contextual Inaccuracy
In the pursuit of translation, with poor audio and condition_on_previous_text=True (the default), a misheard word can send Whisper off course: because each 30-second window is decoded with the previous output as context, one off-base segment can drag everything after it off-base too, causing the listener to be barking up the wrong tree.
This deviation from the intended context snowballs from segment to segment, and the sketch below shows the knob that controls it.
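The openai-whisper package exposes this knob directly as condition_on_previous_text. Turning it off makes each 30-second window decode independently, so one misheard segment can no longer poison the ones after it; the trade-off is losing some cross-segment consistency. A quick sketch with the same setup as before:
import whisper

model = whisper.load_model('large-v2')

# condition_on_previous_text=True (default): each window sees the prior text as a prompt
# condition_on_previous_text=False: windows are decoded independently, containing the damage
transcript = model.transcribe('/content/audio.wav',
                              task='translate',
                              language='kn',
                              condition_on_previous_text=False)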
What's next?
So there you have it. While OpenAI's Whisper may appear to solve a lot of your language problems, bear in mind that it is still an AI solution, prone to mistakes just like any other, especially for low-resource languages. But don't be disheartened; in these mistakes lies the opportunity to learn, adapt, and conduct further study on how to overcome the flaws of Whisper 1.0. So let's accept the shortcomings of Whisper 1.0 and use them to fuel our creativity and innovation. Because, in the end, it's not about having a perfect machine, but about how we utilise it to make the world a better place.
In fact, several open-source repositories have already been developed to enhance the results and address various issues. Some of the active repositories built on top of Whisper 1.0 include:
WhisperX: Improves Whisper's timestamp accuracy through forced alignment with phoneme-based ASR models and VAD preprocessing. It's designed for multilingual use cases, and the authors claim it produces more accurate transcriptions and timestamps.
Whisperer: With this repository, you can automatically create speaker-separated text-audio datasets from raw audio files. The tool splits audio files by speaker, labels the speakers across files, and allows configurable audio splitting, effectively sidestepping the non-speech problems.
Whisper Timestamped: This repository presents a method for improving word-level timestamp accuracy and confidence scores in multilingual automatic speech recognition using Whisper models. The approach predicts word timestamps and assigns a confidence score to each word and segment, based on the probabilities of subword tokens.
Stable-ts: Claims to improve word-level timestamps, which allows for more natural segment grouping, and includes the option to suppress silence using silero-VAD.
Thank you for reading this! We hope you learned something new today.
At Dubverse, we have developed solutions to address these unsuccessful cases for our clients. Do visit our website and follow us on Twitter.
We also launched NeoDub sometime back. It enables you to clone your voice and speak any language!
Join our Discord community to get the scoop on the latest in Audio Generative AI!
See you later!!!