OpenAI Whisper is a new artificial intelligence system that approaches human-level performance in speech recognition. It was developed by OpenAI, an artificial intelligence research lab, with the goal of improving the quality of speech-to-text systems. Its largest model has about 1.6 billion parameters and can transcribe and translate speech audio from 97 languages. Whisper was trained on 680,000 hours of audio data collected from the web and shows robust zero-shot performance on a wide range of automatic speech recognition (ASR) tasks. These capabilities can benefit many applications, such as virtual assistants and smart speakers.
This video can help you understand the benefits of Whisper.
OpenAI introduced Whisper on September 21, 2022, in this article. The release should accelerate the use of artificial intelligence in applications that rely on speech technology. Here are some examples:
You record in any language, and the API extracts the text.
In this example, the API extracts text from a YouTube video.
Let’s experiment with the open-source Whisper package in Python to extract the text from a YouTube video.
# Author: Lawrence Teixeira
# Date: 02/11/2022
# Requirements to run this script:
#   pip install git+https://github.com/openai/whisper.git
#   pip install pytube
# Whisper also needs the ffmpeg command-line tool installed to decode audio.

# import the necessary packages
import pytube as pt
import whisper

# download the audio track of the YouTube video (Introduction to Whisper: The speech recognition)
yt = pt.YouTube("https://www.youtube.com/watch?v=Bf6Z5bjlHcI")
# filter() returns a query of matching streams, so pick the first audio-only stream
stream = yt.streams.filter(only_audio=True).first()
stream.download(filename="audio.mp3")

# load the model
model = whisper.load_model("medium")

# transcribe the audio file
result = model.transcribe("audio.mp3")

# print the text extracted from the video
print(result["text"])
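Besides the full text, the dictionary returned by transcribe also carries a "segments" list with start and end times in seconds. The sketch below shows one way to turn those segments into timestamped lines. The helper names and the sample dictionary are hypothetical and only mimic the shape of a real result, so the snippet runs without downloading a model.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to an HH:MM:SS.mmm string."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(result: dict) -> list:
    """Build one '[start --> end] text' line per transcribed segment."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text'].strip()}")
    return lines


# Hypothetical sample shaped like the output of model.transcribe();
# in practice, pass the real `result` from the script above.
sample_result = {
    "text": " Whisper is an open source model. It was released last week.",
    "segments": [
        {"start": 0.0, "end": 3.5, "text": " Whisper is an open source model."},
        {"start": 3.5, "end": 6.25, "text": " It was released last week."},
    ],
}

for line in segments_to_lines(sample_result):
    print(line)
```

Lines in this form are a convenient starting point for subtitles or for jumping to a specific moment in the video.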
Text extracted from the video “Introduction to Whisper: The speech recognition.”
“Whisper is an open source deep learning model for speech recognition that was released by Oppenai last week. Oppenai’s tests of Whisper show that it can do a good job of transcribing not just English audio, but also audio in a number of other languages. Developers and researchers who have worked with Whisper and seen what it can do are also impressed by it. But the release of Whisper may be just as important for what it tells us about how artificial intelligence AI research is changing, and what kinds of applications we can expect in the future. Whisper from Oppenai is open to all kinds of data. One of the most important things about Whisper is that it was trained with many different kinds of data. Whisper was trained on 680,000 hours of data from the web that was supervised by people who spoke different languages and did different tasks. A third of the training data is made up of audio examples that are not in English. Whisper can reliably transcribe English speech and perform at a state-of-the-art level with about 10 languages, an Oppenai representative told VentraBeat in written comments. It can also translate from those languages into English. Even though the lab’s analysis of languages other than English isn’t complete, people who have used it say it gives good results. Again, the AI research community has become more interested in different kinds of data. This year, Bloom was the first language model to work with 59 different languages. Meta is also working on a model that can translate between 200 different languages. By moving toward more data and language diversity, more people will be able to use and benefit from deep learning’s progress. Make your own test since Whisper is open source. Developers and users can choose to run it on their laptop, desktop workstation, mobile device, or a cloud server. OpenAI made Whisper in five different sizes. 
Each size traded accuracy for speed in a proportional way, with the smallest model being about 60 times faster than the largest. Developers who have used Whisper and seen what it can do are happy with it, and it can make cloud-based ASR services, which have been the main choice until now, less appealing. And Lobs expert Noah Giff told VentraBeat, At first glance, Whisper seems to be much more accurate than other SaaS products. Since it is free and can be programmed, it will probably be a very big problem for services that only do transcription. Whisper was released as an open source model that was already trained, and that anyone can download and run on any computer platform they want. In the past few months, commercial AI research labs have been moving in the direction of being more open to the public. You can make your own apps. There are already a number of ways to make it easier for people who don’t know how to set up and run machine learning models to use Whisper. One example is a project by journalist Peter Stern and GitHub engineer Christina Warren to make a free, secure, and easy to use transcription app for journalists based on Whisper. In the cloud, open source models like Whisper are making new things possible. Platforms like Hugging Face are used by developers to host Whisper and make it accessible through API calls. Jeff Bootyer, growth and product manager at Hugging Face, told VentraBeat, It takes a company 10 minutes to create their own transcription service powered by Whisper and start transcribing calls or audio content, even at a large scale. Hugging Face already has a number of services based on Whisper, such as an app that translates YouTube videos. Or, you can tweak existing apps to fit your needs. And fine-tuning, which is the process of taking a model that has already been trained and making it work best for a new application, is another benefit of open source models like Whisper. 
For example, Whisper can be tweaked to make ASR work better in a language that the current model doesn’t do as well with. Or, it can be tweaked to understand medical or technical terms better. Another interesting idea would be to fine-tune the model for tasks other than ASR, like verifying the speaker, finding sound events, and finding keywords. Hugging Face’s technical lead, Philip Schmidt, told VentraBeat that people have already told them that Whisper can be used as a plug-and-play service to get better results than before. When you put this together with fine-tuning the model, the performance will get even better. Fine-tuning for languages that were not well represented in the pre-training dataset can make a big difference in how well the system works.”
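As you can see, the transcript closely follows what was spoken, although a few proper nouns are misheard (for example, “Oppenai” for OpenAI and “VentraBeat” for VentureBeat). Note that this example uses the medium model. Here are the other models we can use to trade speed for accuracy.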
Available models and languages
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.
For English-only applications, the .en models tend to perform better, especially at the smaller sizes such as base.en. According to the Whisper documentation, the difference becomes less significant for the larger models.
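The naming convention described above can be captured in a small helper. This is a minimal sketch, assuming the five size names published in the Whisper repository at release time (tiny, base, small, medium, large); the function name pick_model_name is hypothetical and not part of the whisper package.

```python
# Size names as published in the Whisper repository at release time.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large")
# Four of the five sizes have an English-only variant; "large" does not.
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")


def pick_model_name(size: str, english_only: bool = False) -> str:
    """Return the model name for a given size and language setting."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {MODEL_SIZES}")
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    # Fall back to the multilingual model (the only option for "large").
    return size


print(pick_model_name("base", english_only=True))
print(pick_model_name("large", english_only=True))
```

The returned string can be passed to whisper.load_model in place of the hard-coded "medium" used earlier.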
Whisper’s performance varies widely depending on the language. The figure below shows a word error rate (WER) breakdown by language on the Fleurs dataset using the large model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D of the paper.
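For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below is a simplified illustration and not the evaluation code used in the paper: it only lowercases and splits on whitespace, whereas published evaluations usually apply more careful text normalization. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One misheard word out of five gives a WER of 0.2.
print(word_error_rate("released by OpenAI last week",
                      "released by Oppenai last week"))
```

A lower WER means a more accurate transcript, and a value of 0 means the transcript matches the reference word for word.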
Conclusion: Although there is still some debate about how well Whisper works in practice, the concept behind it is worth thinking about. With more and more businesses moving towards automated marketing and customer service, Whisper could be a valuable tool for those looking to get ahead in the industry. Have you tried using Whisper or any other similar tools? Let us know in the comments!
Follow the official Whisper references:
That’s it for today!