Fixing YouTube Search with OpenAI’s Whisper
How to use OpenAI’s Whisper for better speech-enabled (audio) search

OpenAI’s Whisper is a new state-of-the-art (SotA) model in speech-to-text. It can almost flawlessly transcribe speech across dozens of languages and even handle poor audio quality or excessive background noise.
The domain of the spoken word has always been somewhat out of reach for ML use cases. Whisper changes that for speech-centric applications. We will demonstrate its power alongside other technologies like transformers and vector search by building a new and improved YouTube search.
Search on YouTube is good but has its limitations, especially when it comes to answering questions. With the enormous amount of content on the platform, there should be an answer to almost every question.
Yet if we ask a specific question like “what is OpenAI’s CLIP?”, instead of a concise answer we get a list of long videos that we must watch through to find the answer.
What if all we want is a short 20-second explanation? The current YouTube search has no solution for this. Maybe there’s a good reason to encourage users to watch as much of a video as possible (more ads, anyone?).
Whisper is the solution to this problem and many others involving the spoken word. This article will explore the idea behind a better speech-enabled search.
The Idea
We want to get specific timestamps that answer our search queries. YouTube does support time-specific links in videos, so a more precise search with these links should be possible.
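As a quick illustration, a time-specific link is just the normal video URL with a start offset in seconds appended to it. A minimal helper for building one might look like this (the video ID below is only a placeholder):

```python
def timestamped_url(video_id: str, start_seconds: float) -> str:
    """Build a YouTube link that starts playback at a given offset."""
    # youtu.be links accept a `t` query parameter with a start time in seconds.
    return f"https://youtu.be/{video_id}?t={int(start_seconds)}"

# e.g. prints https://youtu.be/VIDEO_ID?t=95
print(timestamped_url("VIDEO_ID", 95.0))
```

If we can find the exact segment of a video that answers a query, returning a link like this drops the viewer straight into the answer.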

To build something like this, we first need to transcribe the audio in our videos to text. YouTube automatically captions every video, and the captions are okay — but OpenAI just open-sourced something called “Whisper”.
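To give a sense of how little code transcription takes, here is a minimal sketch using the open-source whisper package; the model size and the audio filename are placeholders, not fixed choices:

```python
# pip install openai-whisper
import whisper

# Load one of Whisper's pretrained checkpoints ("small" is a reasonable
# trade-off between speed and accuracy; larger models are more accurate).
model = whisper.load_model("small")

# Transcribe audio extracted from a video (placeholder filename).
result = model.transcribe("video_audio.mp3")

# Alongside the full transcript, Whisper returns time-aligned segments,
# which is exactly what we need for timestamped search results.
for segment in result["segments"]:
    print(f"[{segment['start']:6.1f}s -> {segment['end']:6.1f}s] {segment['text'].strip()}")
```

Each segment carries start and end times in seconds, so pairing it with the timestamped-link helper above gives us search results that jump straight to the relevant moment.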