Automatic Speech Recognition With Whisper

A look at decoding, spelling mistakes and hallucinations, fine-tuning, and more

Shiry Yonash
Better Programming


Image source: Shutterstock, by durantelallera

Whisper is a new speech recognition model from OpenAI. The speech community is enthusiastic about it because it is free and open source. Many blog posts have already been published about it. This article won’t be another how-to guide for Whisper. Instead, it will focus on less-discussed topics like decoding, dealing with spelling mistakes and hallucinations, adding a language model, fine-tuning, and more.

The idea for this article was born when I took part in the HuggingFace Whisper fine-tuning event. The event’s goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of your choice. It was a successful and well-planned event. I learned a lot about Whisper, both theoretically and practically, and picked up some interesting insights that I’d like to share.

But before we get into all that, let’s ask: What’s the big deal with Whisper? What makes it unique? What distinguishes it from previous SOTA models?
Whisper differs for three reasons:

  1. The massive amount of data on which it trained (more on this in the section below).
  2. Whisper was trained in a completely supervised manner, unlike some previous SOTA models, such as Wav2Vec2, which were pretrained using self-supervision.
  3. Whisper was trained to produce almost everything required for an ASR pipeline (in a single model). It performs voice activity detection, language detection, speech-to-text, translation, and alignment (but only partial alignment) for 96 languages.

If you haven’t already, I highly recommend reading the Whisper paper (https://arxiv.org/abs/2212.04356).

Whisper Training Data

Since around 2016, ASR models have consistently improved and achieved a word error rate (WER) lower than human-level WER. Yet, we rarely see such accurate transcriptions in the wild, for example, when transcribing a random YouTube video. Why is this the case?

Many ASR models are trained or fine-tuned using the well-known Librispeech dataset. But Librispeech’s 960 hours represent a small subset of the vast space of all possible speech (see image below). These ASR models are typically tested on the Librispeech test set, so they are trained and tested on the same distribution.

The approach for Whisper was to train on as much as 680,000 hours of speech from the internet, as diverse as possible. When tested on the Librispeech test set, Whisper is therefore performing out-of-distribution inference. Its ability to generalize to different datasets and domains comes from this massive amount of diversely distributed data, which amounts to roughly 77 years of continuous listening.

The image is taken from https://www.youtube.com/watch?v=fZMiD8sDzzg

Whisper Architecture

Whisper is a Transformer-based encoder-decoder model. It maps audio spectrogram features to a sequence of text tokens.

Image from article: https://arxiv.org/pdf/2212.04356.pdf

First, the raw audio inputs are converted to a log-Mel spectrogram by a feature extractor. The Transformer encoder then encodes the spectrogram to form a sequence of encoder-hidden states. Finally, the decoder auto-regressively predicts text tokens, conditional on the previous tokens and the encoder's hidden states.
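To make this flow concrete, here is a minimal sketch using the Hugging Face transformers implementation of Whisper. The checkpoint ("openai/whisper-small") and the sample audio clip are arbitrary example choices:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Load a short 16 kHz example clip
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]

# Feature extractor: raw waveform -> log-Mel spectrogram features
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

# Encoder + auto-regressive decoder: spectrogram features -> text tokens
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])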

Adding an External Language Model

It is well understood that adding a language model to an ASR model improves its performance.

Does it make sense to add such a model to Whisper? How can it be done?

Shallow fusion

When we use a Wav2Vec2 CTC model with an n-gram language model (LM), the LM is external, meaning it is not part of the same model as Wav2Vec2 and is trained independently from it. Adding an external LM in this way is called shallow fusion.
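For intuition, shallow fusion simply combines the two scores when ranking hypotheses during decoding. A schematic sketch (the function and the weight alpha are illustrative, not part of any particular library):

def shallow_fusion_score(acoustic_log_prob, lm_log_prob, alpha=0.5):
    # Hypotheses are ranked by log P_ASR(y | x) + alpha * log P_LM(y),
    # where alpha is a tunable interpolation weight (0.5 is an arbitrary example)
    return acoustic_log_prob + alpha * lm_log_prob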

For more details on adding an external LM to Wav2Vec2, see my previous article.

Deep fusion

When we use a sequence-to-sequence model such as Whisper, the decoder is an internal language model. It is part of the same model as the encoder. This is called deep fusion: the internal LM is trained together with the encoder, as part of the Whisper model, in an end-to-end fashion. Generally, deep fusion models outperform models with shallow fusion.

Spelling mistakes

When playing with Whisper, especially in languages other than English, you can encounter spelling mistakes in the transcriptions. When scoring the model with the word error rate (WER) metric, these mistakes can significantly increase the WER.

If the decoder essentially plays the role of a language model in Whisper, why do we get these errors?

This appears to be a limitation of the current Whisper model. One possible explanation is that the model was trained on these errors. Perhaps, when the training dataset was created, the filters responsible for excluding machine-generated transcriptions (made by other ASR systems) were less effective for non-English languages. If the training data contained only correctly spelled text, these mistakes wouldn’t make sense.

What can we do?

  1. Fine-tune Whisper on the specific language using a dataset with validated transcriptions.
  2. Allow Whisper to return several candidate sentences (hoping that some do not contain spelling errors) and rank them using an external language model (see the sketch after this list).
  3. Add an external LM to the Whisper decoder. This is not directly supported, but the TokenDecoder class can be extended to select tokens based on a language model. For more information, see this discussion.
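Here is a rough sketch of option 2 using Hugging Face transformers: generate several beam-search candidates and rerank them by the loss of an external causal LM. The "gpt2" checkpoint is only a placeholder; for a non-English language you would swap in an LM trained on that language:

import torch
from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          AutoTokenizer, AutoModelForCausalLM)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
lm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def rank_candidates(input_features):
    # Return several beam-search hypotheses instead of only the best one
    outputs = asr.generate(input_features, num_beams=5, num_return_sequences=5)
    candidates = processor.batch_decode(outputs, skip_special_tokens=True)

    def lm_loss(text):
        ids = lm_tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            return lm(ids, labels=ids).loss.item()  # mean negative log-likelihood

    # Lower LM loss means the external LM finds the text more fluent
    return sorted(candidates, key=lm_loss)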

Hallucinations

This is a strange and amusing Whisper phenomenon. The model may occasionally output a transcription that does not correspond to anything said in the audio, or a very long, repetitive text that resembles the correct transcription but repeats it many times.

What causes this, and how can it be avoided?

  1. Hallucinations, in my experience, can occur when there is a long period of silence in the audio. Long periods of silence seem to confuse the decoder for some reason, so try to avoid them. This is possible with a Voice Activity Detector (VAD) model that divides the audio into chunks without long silences. While Whisper can detect voice activity, other VAD models perform better, and Whisper users recommend using an external VAD (for example, the Silero VAD).
  2. Check the length of your input audio samples. The Whisper model can only process 30 seconds of audio at a time, so any audio longer than 30 seconds is truncated during training. While the audio is shortened, the reference transcript remains unchanged, which creates a mismatch between the audio and its transcription. To avoid this, the best practice is to remove all samples longer than 30 seconds from your train set (see the sketch after this list).
  3. Consider a larger Whisper model size (more parameters).
  4. If you’re using the Hugging Face framework, make sure you have the most recent transformers version installed (I was using transformers version 4.24.0 and got rid of the hallucinations by upgrading to the latest version).
  5. If you get a repetition loop with hallucinated text, and none of the above helps, you can try changing the compression_ratio_threshold option (more on this in the decoding section below).
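As mentioned in point 2, removing over-long samples is easy to do with the datasets library. A sketch, assuming a Hugging Face dataset with an "audio" column (Common Voice in Hindi is used only as an example):

from datasets import load_dataset, Audio

dataset = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

MAX_DURATION_S = 30.0

def is_short_enough(audio):
    # Duration in seconds = number of samples / sampling rate
    return len(audio["array"]) / audio["sampling_rate"] < MAX_DURATION_S

dataset = dataset.filter(is_short_enough, input_columns=["audio"])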

Whisper Decoding

Decoding strategies

Different decoding strategies can produce different transcriptions and help to avoid failure cases. But first, what exactly are the various decoding strategies?

Greedy search — simply chooses the token with the highest probability as its next token. The major drawback of greedy search is that it misses high-probability tokens hidden behind a low-probability token.

Beam search — Beam search reduces the risk of missing hidden high-probability token sequences by keeping a fixed number of the most likely hypotheses (beams) at each time step and eventually choosing the hypothesis with the highest overall probability.

Best-of-n sampling: Decoding with temperature sampling can control the randomness and diversity of the transcription output. The basic concept is the same as temperature sampling in natural language generation: the softmax function with a temperature parameter is used to “stretch” or “shrink” the predicted probability distribution over the set of possible tokens, and a token is then sampled from this modified distribution. High temperatures result in more random sampling and a more diverse set of transcriptions, whereas low temperatures result in less random sampling and a more predictable set of transcriptions. With best-of-n sampling, several candidates are sampled in this way and the one with the highest overall probability is kept.
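To make these strategies concrete, here is how each one can be selected through DecodingOptions in the openai-whisper package (the parameter values are illustrative):

import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Greedy search: temperature 0 and no beams (the default)
greedy = whisper.decode(model, mel, whisper.DecodingOptions(temperature=0.0))

# Beam search: keep the 5 most likely hypotheses at each time step
beam = whisper.decode(model, mel, whisper.DecodingOptions(beam_size=5))

# Best-of-n sampling: sample 5 candidates at temperature 0.7 and keep the most likely one
sampled = whisper.decode(model, mel, whisper.DecodingOptions(temperature=0.7, best_of=5))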

Decoding heuristics

In Whisper, how are these decoding strategies used?

The Whisper paper describes a complex decoding strategy, which includes some heuristics to try to improve transcription reliability. In practice, these heuristics improve transcription quality but make inference much slower (up to 6 times slower), so you should consider this tradeoff.

The heuristics described in the paper are used when calling the transcribe function:

import whisper

model = whisper.load_model("base")

# transcribe() applies the fallback heuristics described in the paper
result = model.transcribe("audio.mp3")
print(result["text"])

When calling the decode function, these heuristics are not performed, and the decoding depends on the DecodingOptions specified:

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

The heuristics described in the paper work as follows: for each audio segment, a beam search with default parameters is performed first. If the result is unsatisfactory, decoding falls back to sampling with temperature, starting at a temperature of 0.2 and checking whether the transcription is adequate. If it still isn’t, the temperature is gradually increased in steps of 0.2, up to 1.0. Because of this fallback loop, inference can be up to six times slower when calling the transcribe function.

What qualifies as a good transcription? One that meets two requirements:

  1. The average log probability of the transcription is above a threshold.
  2. The compression ratio of the transcription is below a threshold. A text string with many repeated tokens compresses better than a string with more unique tokens, so the compression ratio of the decoded text can be used to identify and avoid transcriptions stuck in a repetition loop.

After achieving a good transcription, one more check is performed. If the no-speech probability for the segment is greater than a certain threshold and the average log probability is less than another threshold, the audio segment is considered no-speech, and no transcription is returned (the segment is skipped).

Decoding Heuristics. Image by author
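A rough sketch of these checks in code. The function names are my own, but the threshold defaults mirror those exposed by the openai-whisper transcribe function (compression_ratio_threshold, logprob_threshold, and no_speech_threshold):

import zlib

def compression_ratio(text):
    # Repetitive text compresses well, so a high ratio signals a repetition loop
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def needs_fallback(avg_logprob, text,
                   compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    # The segment is re-decoded at a higher temperature if either check fails
    return (compression_ratio(text) > compression_ratio_threshold
            or avg_logprob < logprob_threshold)

def is_no_speech(no_speech_prob, avg_logprob,
                 no_speech_threshold=0.6, logprob_threshold=-1.0):
    # The segment is skipped when it looks like non-speech and the model is unsure of the text
    return no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold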

Fine-Tuning Whisper

HuggingFace created excellent tutorials, and you can find almost everything you need to know about fine-tuning Whisper in this repository.

Here are some takeaways from my fine-tuning experience:

How to fine-tune

Whisper was trained in a supervised manner, so no architecture changes are required for fine-tuning. Simply load the pretrained model and begin training on your dataset.
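A minimal sketch of what that looks like with Hugging Face transformers, following their fine-tuning tutorial. The checkpoint, language, and hyperparameters are example choices, and train_dataset, eval_dataset, and data_collator stand in for the data-preparation steps covered in the tutorial:

from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

# Load the pretrained checkpoint as-is; no architecture changes are needed
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # prepared as in the HF tutorial
    eval_dataset=eval_dataset,
    data_collator=data_collator,        # pads input features and label ids
    tokenizer=processor.feature_extractor,
)
trainer.train()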

Overfitting

Whisper will quickly overfit when fine-tuned on a small dataset. What are your options?

  • Add more training data by using both the training and validation splits of your dataset to train your model (in HF, split="train+validation").
  • Add more training data by combining multiple datasets into a larger training corpus. Mixing datasets in HF can be done according to this guide (see the sketch after this list).
  • Add regularization by setting the dropout to a low non-zero value to prevent overfitting. In HF, after loading the model, do the following:
model.config.dropout = 0.1
  • Add augmentations. Many participants in the fine-tuning event reported that augmentation reduces WER.
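A sketch of the first two bullets using the datasets library (Common Voice and FLEURS in Hindi are example corpora; before interleaving, the columns and preprocessing of both datasets must be aligned):

from datasets import load_dataset, interleave_datasets

# Use both the train and validation splits of Common Voice for training
common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation")

# Mix in a second corpus to enlarge the training set
# (align the column names and preprocessing of both datasets before this step)
fleurs = load_dataset("google/fleurs", "hi_in", split="train")
combined = interleave_datasets([common_voice, fleurs])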

Evaluation vs. Training Time

During training, you perform evaluations from time to time. I found it weird that a single evaluation step takes much longer than a single training step, even with the same batch size. Why is this the case?

During training, the Whisper decoder does not generate auto-regressively. Instead, it uses teacher forcing: the ground-truth tokens are fed to the decoder, which predicts all positions in a single forward pass.

When evaluating, the model does one forward pass of the encoder and then auto-regressively generates tokens in the decoder. It does as many forward passes of the decoder as the number of tokens generated. This makes an evaluation step significantly slower than a training step.

For this reason, I did only a few evaluations while training, and each evaluation was limited in the number of samples it evaluated.
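With the Hugging Face trainer, for example, this can be done by evaluating only every few thousand steps and on a fixed subset of the evaluation split. The numbers below are arbitrary, and eval_dataset stands for your prepared evaluation set:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",
    max_steps=4000,
    evaluation_strategy="steps",
    eval_steps=1000,                 # evaluate only every 1,000 training steps
    predict_with_generate=True,      # auto-regressive generation is what makes eval slow
)

# Evaluate on a small, fixed subset instead of the full evaluation split
small_eval_set = eval_dataset.shuffle(seed=42).select(range(500))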

I hope you find this helpful,

Happy Whispering!
