Transcript
Hi everyone. Hope you're having a good time at Conf42. Welcome to
my session on reinventing speech to text transcriptions.
I'm Pratim Bhosale, developer advocate at SurrealDB.
In today's talk, we will be covering a couple of areas,
starting with the costs associated with speech to
text transcription APIs, followed by an introduction
to Whisper and Whisper CPP. We'll also understand Go bindings
and how they relate to our discussion. We are going to see some use
cases of transcription services, and we'll end the
session with a live demo to show Whisper CPP in action.
So without further ado, let's dive in.
What is speech to text transcription? Speech to text
transcription is nothing but converting spoken language
into written text. All your voice assistants use
speech to text, be it Siri, Alexa, or even
OK Google. I started exploring speech to text APIs
when I wanted to have subtitles for my meetings,
but most of the applications that I tried were paid, and
then my developer instinct kicked in and
I thought of building one for myself. I explored the
solutions offered by Google and Amazon, but they were super,
super expensive, and that's when I decided to go
ahead with an open source solution. These services get
too expensive after a certain point because of
their pricing model, and I didn't want
to spend a lot of money. That's where OpenAI's Whisper came
into the picture. But before we go ahead and understand what
Whisper is, let's take a look at the pricing
of both Google's and Amazon's speech to text
APIs. To give you a sense
of how much it would cost me to use Google's speech to text API,
the screenshot on the screen should speak for itself. If you cross
the given threshold, you might have to shell out at least $1,000
every month on API transcriptions. This was
definitely not helping my case, so I decided to explore
Whisper. So let's see what Whisper
is and what Whisper CPP is. Whisper is an
open source automatic speech recognition system developed by
OpenAI. It has been trained on a vast amount of
multilingual and multitask supervised data collected
from the web. It is one of the most underrated models
from OpenAI. Companies like Snap Inc. (the creator of
Snapchat), Shopify, and a lot of other companies are
already using the Whisper API. You can see the architecture of
the Whisper model on the screen. The Whisper architecture is
essentially a method for converting spoken language
into written text. It works in a step by step manner
using a specific type of model called a transformer.
The speech is divided into small parts,
each up to 30 seconds long, and then converted into a format
that can be understood by the model: a log-Mel spectrogram, which
represents the speech visually, showing its features and
patterns. The model also has two parts to
it, an encoder and a decoder. The encoder processes
the speech and the decoder converts it into text.
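To make the 30-second windowing concrete, here is a small, self-contained Go sketch of how raw audio samples could be split into the fixed-length parts the model works on. This is an illustration rather than Whisper's own code; the real pipeline also converts each window into a log-Mel spectrogram, which is omitted here, and 16 kHz is the sample rate Whisper expects:

```go
package main

import "fmt"

// sampleRate is the 16 kHz mono rate that Whisper expects as input.
const sampleRate = 16000

// chunk30s splits raw PCM samples into windows of at most 30 seconds,
// mirroring the fixed-length segments the Whisper model operates on.
func chunk30s(samples []float32) [][]float32 {
	const window = 30 * sampleRate
	var chunks [][]float32
	for start := 0; start < len(samples); start += window {
		end := start + window
		if end > len(samples) {
			end = len(samples) // the final chunk may be shorter than 30 s
		}
		chunks = append(chunks, samples[start:end])
	}
	return chunks
}

func main() {
	audio := make([]float32, 45*sampleRate) // 45 seconds of silence
	for i, c := range chunk30s(audio) {
		fmt.Printf("chunk %d: %d samples\n", i, len(c))
	}
}
```

Running this on 45 seconds of audio produces one full 30-second chunk and one 15-second remainder.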
This model can do more than just transcribe speech
to text. It can also identify the language being spoken,
provide timestamps for when certain phrases are spoken,
transcribe speech in multiple languages, and even translate
speech into English. This is done
using special tokens that tell the model which
tasks to perform. And here's where Whisper CPP
comes into the picture. Whisper CPP is nothing but a
lightweight implementation of Whisper. It is a C/C++ implementation
of the Whisper model, which allows for faster execution
and lower resource consumption compared to other implementations.
Now you must be thinking, where is Go in this?
Where does Go come into the picture? Well,
let's talk about Go bindings. In order to use
Whisper CPP in your Golang projects,
we will be using the Go bindings that are provided by
the project. Before we go ahead and understand how the Go
bindings are used, let's understand what
Go bindings are. Well, Go bindings are a way
to call functions or use data structures from other programming languages
within your Go code. This is useful when you want
to leverage existing libraries or APIs written in
another language while still writing your main application code.
The process usually involves a bridge between
the Go code and the code in the target language using the foreign
function interface (FFI) of the target language.
This also makes sure that there is seamless integration of
Whisper CPP into your Golang application.
I built a basic CLI application which converts
the audio from a YouTube video into text. Let me take
you through the code so that you get a better understanding of how Whisper
CPP works. We will then head over to the demo.
I will be explaining the major function from the code,
which is the transcribe function. The transcribe function
that you see on the screen is responsible for initializing the transcription
model using whisper.New. We are passing it
the path of our audio file as well as the path of the model.
This is the Whisper model from OpenAI. We go
ahead and create a new context for our model. For those
who are not familiar with what a context is,
let me explain it in the context of the Whisper
architecture. Context usually refers to the
surrounding information or the environment that helps
the model better understand the speech it is processing. When
transcribing speech, having a broader context allows the
model to more accurately recognize and interpret the
words and phrases being spoken. This is
because the meaning and pronunciation of words can be
influenced by the words that come before and after them.
For example, when transcribing a conversation,
knowing the topic being discussed, or having access to
previous sentences can help the model better predict
the words and phrases likely to be spoken next.
This additional information can lead to improved transcription
accuracy and overall performance. The next step in
building our application is to decode the WAV file.
A WAV file is the audio format accepted
by Whisper. We decode it into a slice variable called
data, but we first need to check that the sample rate of
the audio and the number of channels match what is
accepted by Whisper. If not, we will be returning an error.
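As an illustration of that check, here is a small, self-contained Go sketch that parses a canonical 44-byte PCM WAV header and verifies the sample rate and channel count. The 16 kHz mono requirement matches what Whisper expects as input; the function and constant names here are my own, not the ones from the actual CLI code:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Whisper expects 16 kHz mono PCM input (assumed constants for this sketch).
const (
	whisperSampleRate = 16000
	whisperChannels   = 1
)

// validateWAV reads the canonical 44-byte PCM WAV header and checks that
// the sample rate and channel count match what Whisper accepts.
func validateWAV(header []byte) error {
	if len(header) < 44 || string(header[0:4]) != "RIFF" || string(header[8:12]) != "WAVE" {
		return fmt.Errorf("not a WAV file")
	}
	channels := binary.LittleEndian.Uint16(header[22:24])
	rate := binary.LittleEndian.Uint32(header[24:28])
	if rate != whisperSampleRate || channels != whisperChannels {
		return fmt.Errorf("want %d Hz mono, got %d Hz, %d channel(s)",
			whisperSampleRate, rate, channels)
	}
	return nil
}

// makeHeader builds a minimal PCM WAV header, used here to test validateWAV.
func makeHeader(sampleRate uint32, channels uint16) []byte {
	var b bytes.Buffer
	b.WriteString("RIFF")
	binary.Write(&b, binary.LittleEndian, uint32(36)) // chunk size (no samples)
	b.WriteString("WAVEfmt ")
	binary.Write(&b, binary.LittleEndian, uint32(16)) // fmt subchunk size
	binary.Write(&b, binary.LittleEndian, uint16(1))  // audio format: PCM
	binary.Write(&b, binary.LittleEndian, channels)
	binary.Write(&b, binary.LittleEndian, sampleRate)
	binary.Write(&b, binary.LittleEndian, sampleRate*uint32(channels)*2) // byte rate
	binary.Write(&b, binary.LittleEndian, channels*2)                    // block align
	binary.Write(&b, binary.LittleEndian, uint16(16))                    // bits per sample
	b.WriteString("data")
	binary.Write(&b, binary.LittleEndian, uint32(0)) // data size (empty)
	return b.Bytes()
}

func main() {
	fmt.Println(validateWAV(makeHeader(16000, 1))) // <nil>
	fmt.Println(validateWAV(makeHeader(44100, 2))) // prints a mismatch error
}
```

In the real application the decoding library reports these values after parsing the file; the point is simply to reject audio that isn't 16 kHz mono before handing it to the model.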
We then pass the data variable to the context's Process method,
which does the actual transcription. The final step
is to print the results. For this demo,
I have also dockerized the dependencies to avoid cross
platform issues. You can see the commands to compile Whisper
CPP on the screen. Now, before we dive into the
demo, let's talk about use cases.
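Putting the walkthrough above together, the overall shape of the transcribe function is roughly the following. Treat this as pseudocode modeled on the whisper.cpp Go bindings: the exact names and signatures of whisper.New, NewContext, Process, and NextSegment vary between versions of the bindings, and error handling is abbreviated.

```
// Pseudocode sketch of the transcribe flow; names are illustrative.
func transcribe(modelPath, wavPath string) error {
    model, _ := whisper.New(modelPath)   // load the Whisper model
    defer model.Close()
    ctx, _ := model.NewContext()         // fresh context for this run
    data := decodeWAV(wavPath)           // []float32 samples, 16 kHz mono
    if err := ctx.Process(data, nil, nil); err != nil {
        return err                       // the actual transcription step
    }
    for {
        seg, err := ctx.NextSegment()    // read out transcribed segments
        if err != nil {
            break                        // EOF once all segments are read
        }
        fmt.Println(seg.Text)            // print the result
    }
    return nil
}
```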
Transcription services are used in a plethora
of use cases like meetings, interviews,
podcasts, customer support interactions, voice assistants,
real time closed captioning, and many more. The versatility and
efficiency of Whisper make it a valuable tool for developers
working on a variety of projects.
Now let's move ahead to our demo and check out how you
can build your own application using Whisper CPP. Let
me show you the 30-second video that we are going to transcribe. I found it
on YouTube, and it's a quick 30-second video that we are
going to watch before we see its transcription.
So you're running a little late today and you haven't
had your fresh cup of coffee yet. No matter the
weather or traffic, we deliver fresh coffees and bagels.
The Java Cafe. Yep,
that was it. And now let's get back.
Okay, let's go ahead and build our Docker image first.
This will take a little bit of time.
Two thousand years later... So
finally, the Docker image has built.
We are now going to run the container.
I've added a short YouTube link which is around 30
seconds. A longer YouTube video would have taken
a long time, so I'm adding a shorter video.
It has now started the transcription. This will
take some time and the
larger the video, the more time the
transcription will take. The transcription is finally done and
now you can see the result. You can see exactly all
the words that were mentioned in the video as a
part of your transcription. You can see it says, "So you're running
a little late today and you haven't had your fresh cup of coffee
yet. No matter the weather or traffic, we deliver fresh coffee and
bagels." And there was music in between. It also says "The Java
Cafe." Yeah, and that was our 30-second video, which has been
completely transcribed. And with that, we come to
an end of this session. Thank you so much for going through
my session and giving me this opportunity. If you have any questions regarding
the session, please feel free to reach out to me on Twitter at Bhosale
Pratim. Hope you have a great day and hope you like the rest of the
conference.