Transcript
Hi everyone. Hope you're having a good time at Conf42. Welcome to
my session on reinventing speech to text transcriptions.
I'm Pratim Bhosale, developer advocate at SurrealDB.
In today's talk, we will be covering a couple of areas,
starting with the costs associated with speech to
text transcription APIs, followed by an introduction
to Whisper and Whisper CPP. We'll also understand Go bindings
and how they relate to our discussion. We are going to see some use
cases of transcription services, and we'll end the
session with a live demo to show Whisper CPP in action.
So without further ado, let's dive in.
What is speech to text transcription? Speech to text
transcription is nothing but converting spoken language
into written text. All your voice assistants use
speech to text, be it Siri, Alexa, or even
OK Google. I started exploring speech to text APIs
when I wanted to have subtitles for my meetings,
but most of the applications that I tried were paid, and
then my developer instinct kicked in and
I thought of building one for myself. I explored the
solutions offered by Google and Amazon, but they were super,
super expensive, and that's when I decided to go
ahead with an open source solution. These services get
too expensive after a certain point because of
their pricing model, and I didn't want
to spend a lot of money. That's where OpenAI's Whisper came
into the picture. But before we go ahead and understand what
Whisper is, let's take a look at the pricing
of both Google's and Amazon's speech to text
APIs. To give you a sense
of how much it would cost me to use Google's speech to text API,
the screenshot on the screen should speak for itself. If you cross
the given threshold, you might have to shell out at least $1,000
every month on API transcriptions. This was
definitely not helping my case, so I decided to explore
Whisper. So let's see what Whisper
is and what Whisper CPP is. Whisper is an
open source automatic speech recognition system developed by
OpenAI. It has been trained on a vast amount of
multilingual and multitask supervised data collected
from the web. It is one of the most underrated models
from OpenAI. Companies like Snap Inc. (the creator of
Snapchat), Shopify, and a lot of other companies are
already using the Whisper API. You can see the architecture of
the Whisper model on the screen. The Whisper architecture is
essentially a method for converting spoken language
into written text. It works in a step by step manner
using a specific type of model called a transformer.
The speech is divided into small parts,
each up to 30 seconds long, and then converted into a format
that can be understood by the model: a log-Mel spectrogram, which
represents the speech visually, showing its features and
patterns. The model also has two parts to
it, an encoder and a decoder. The encoder processes
the speech and the decoder converts it into text.
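To make the 30-second windowing concrete, here is a small, self-contained Go sketch of how raw audio samples could be split into the fixed-length parts the model works on. This is an illustration rather than Whisper's own code; the real pipeline also converts each window into a log-Mel spectrogram, which is omitted here, and 16 kHz is the sample rate Whisper expects:

```go
package main

import "fmt"

// sampleRate is the 16 kHz mono rate that Whisper expects as input.
const sampleRate = 16000

// chunk30s splits raw PCM samples into windows of at most 30 seconds,
// mirroring the fixed-length segments the Whisper model operates on.
func chunk30s(samples []float32) [][]float32 {
	const window = 30 * sampleRate
	var chunks [][]float32
	for start := 0; start < len(samples); start += window {
		end := start + window
		if end > len(samples) {
			end = len(samples) // the final chunk may be shorter than 30 s
		}
		chunks = append(chunks, samples[start:end])
	}
	return chunks
}

func main() {
	audio := make([]float32, 45*sampleRate) // 45 seconds of silence
	for i, c := range chunk30s(audio) {
		fmt.Printf("chunk %d: %d samples\n", i, len(c))
	}
}
```

Running this on 45 seconds of audio produces one full 30-second chunk and one 15-second remainder.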
This model can do more than just transcribe speech
to text. It can also identify the language being spoken,
provide timestamps for when certain phrases are spoken,
transcribe speech in multiple languages, and even translate
speech into English. This is done
using special tokens that tell the model which
tasks to perform. And here's where Whisper CPP
comes into the picture. Whisper CPP is nothing but a
lightweight implementation of Whisper. It is a C/C++ implementation
of the Whisper model, which allows for faster execution
and lower resource consumption compared to other implementations.
Now you must be thinking, where is Go in this?
Where does Go come into the picture? Well,
let's talk about Go bindings. In order to use
Whisper CPP in your Golang projects,
we will be using the Go bindings that are provided by
the project. Before we go ahead and understand how the Go
bindings are used, let's understand what
Go bindings are. Well, Go bindings are a way
to call functions or use data structures from other programming languages
within your Go code. This is useful when you want
to leverage existing libraries or APIs written in
another language while still writing your main application code.
The process usually involves a bridge between
the Go code and the code in the target language using the foreign
function interface (FFI) of the target language.
This also makes sure that there is seamless integration of
Whisper CPP into your Golang application.
I built a basic CLI application which converts
the audio from a YouTube video into text. Let me take
you through the code so that you get a better understanding of how Whisper
CPP works. We will then head over to the demo.
I will be explaining the major function from the code,
which is the transcribe function. The transcribe function
that you see on the screen is responsible for initializing the transcription
model using whisper.New. We are passing it
the path of our audio file as well as the path of the model.
This is the Whisper model from OpenAI. We go
ahead and create a new context for our model. For those
who are not familiar with what a context is,
let me explain it in the context of the Whisper
architecture. Context usually refers to the
surrounding information or the environment that helps
the model better understand the speech it is processing. When
transcribing speech, having a broader context allows the
model to more accurately recognize and interpret the
words and phrases being spoken. This is
because the meaning and pronunciation of words can be
influenced by the words that come before and after them.
For example, when transcribing a conversation,
knowing the topic being discussed, or having access to
previous sentences can help the model better predict
the words and phrases likely to be spoken next.
This additional information can lead to improved transcription
accuracy and overall performance. The next step in
building our application is to decode the WAV file.
A WAV file is the audio format accepted
by Whisper. We decode it into a slice variable called
data, but we first need to check that the sample rate of
the audio and the number of channels match what is
accepted by Whisper. If not, we will be returning an error.
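As an illustration of that check, here is a small, self-contained Go sketch that parses a canonical 44-byte PCM WAV header and verifies the sample rate and channel count. The 16 kHz mono requirement matches what Whisper expects as input; the function and constant names here are my own, not the ones from the actual CLI code:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Whisper expects 16 kHz mono PCM input (assumed constants for this sketch).
const (
	whisperSampleRate = 16000
	whisperChannels   = 1
)

// validateWAV reads the canonical 44-byte PCM WAV header and checks that
// the sample rate and channel count match what Whisper accepts.
func validateWAV(header []byte) error {
	if len(header) < 44 || string(header[0:4]) != "RIFF" || string(header[8:12]) != "WAVE" {
		return fmt.Errorf("not a WAV file")
	}
	channels := binary.LittleEndian.Uint16(header[22:24])
	rate := binary.LittleEndian.Uint32(header[24:28])
	if rate != whisperSampleRate || channels != whisperChannels {
		return fmt.Errorf("want %d Hz mono, got %d Hz, %d channel(s)",
			whisperSampleRate, rate, channels)
	}
	return nil
}

// makeHeader builds a minimal PCM WAV header, used here to test validateWAV.
func makeHeader(sampleRate uint32, channels uint16) []byte {
	var b bytes.Buffer
	b.WriteString("RIFF")
	binary.Write(&b, binary.LittleEndian, uint32(36)) // chunk size (no samples)
	b.WriteString("WAVEfmt ")
	binary.Write(&b, binary.LittleEndian, uint32(16)) // fmt subchunk size
	binary.Write(&b, binary.LittleEndian, uint16(1))  // audio format: PCM
	binary.Write(&b, binary.LittleEndian, channels)
	binary.Write(&b, binary.LittleEndian, sampleRate)
	binary.Write(&b, binary.LittleEndian, sampleRate*uint32(channels)*2) // byte rate
	binary.Write(&b, binary.LittleEndian, channels*2)                    // block align
	binary.Write(&b, binary.LittleEndian, uint16(16))                    // bits per sample
	b.WriteString("data")
	binary.Write(&b, binary.LittleEndian, uint32(0)) // data size (empty)
	return b.Bytes()
}

func main() {
	fmt.Println(validateWAV(makeHeader(16000, 1))) // <nil>
	fmt.Println(validateWAV(makeHeader(44100, 2))) // prints a mismatch error
}
```

In the real application the decoding library reports these values after parsing the file; the point is simply to reject audio that isn't 16 kHz mono before handing it to the model.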
We then pass the data variable to the context's Process method,
which does the actual transcription. The final step
is to print the results. For this demo,
I have also dockerized the dependencies to avoid cross
platform issues. You can see the commands to compile Whisper
CPP on the screen. Now, before we dive into the
demo, let's talk about use cases.
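Putting the walkthrough above together, the overall shape of the transcribe function is roughly the following. Treat this as pseudocode modeled on the whisper.cpp Go bindings: the exact names and signatures of whisper.New, NewContext, Process, and NextSegment vary between versions of the bindings, and error handling is abbreviated.

```
// Pseudocode sketch of the transcribe flow; names are illustrative.
func transcribe(modelPath, wavPath string) error {
    model, _ := whisper.New(modelPath)   // load the Whisper model
    defer model.Close()
    ctx, _ := model.NewContext()         // fresh context for this run
    data := decodeWAV(wavPath)           // []float32 samples, 16 kHz mono
    if err := ctx.Process(data, nil, nil); err != nil {
        return err                       // the actual transcription step
    }
    for {
        seg, err := ctx.NextSegment()    // read out transcribed segments
        if err != nil {
            break                        // EOF once all segments are read
        }
        fmt.Println(seg.Text)            // print the result
    }
    return nil
}
```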
Transcription services are used in a plethora
of use cases like meetings, interviews,
podcasts, customer support interactions, voice assistants,
real time closed captioning, and many more. The versatility and
efficiency of Whisper make it a valuable tool for developers
working on a variety of projects.
Now let's move ahead to our demo and check out how you
can build your own application using Whisper CPP. Let
me show you the 30-second video that we are going to transcribe. I found it
on YouTube, and it's a quick 30-second video that we are
going to watch before we see its transcription.
So you're running a little late today and you haven't
had your fresh cup of coffee yet. No matter the
weather or traffic, we deliver fresh coffees and bagels.
The Java Cafe. Yep,
that was it. And now let's get back.
Okay, let's go ahead and build our Docker image first.
This will take a little bit of time.
Two thousand years later... So
finally, the Docker image has built.
We are now going to run the container.
I've added a short YouTube link which is around 30
seconds. A longer YouTube video would have taken
a long time, so I'm adding a shorter video.
It has now started the transcription. This will
take some time and the
larger the video, the more time the
transcription will take. The transcription is finally done and
now you can see the result. You can see exactly all
the words that were mentioned in the video as a
part of your transcription. You can see it says, "So you're running
a little late today and you haven't had your fresh cup of coffee
yet. No matter the weather or traffic, we deliver fresh coffee and
bagels." And there was music in between. It also says "The Java
Cafe." Yeah, and that was our 30-second video, which has been
completely transcribed. And with that, we come to
an end of this session. Thank you so much for going through
my session and giving me this opportunity. If you have any questions regarding
the session, please feel free to reach out to me on Twitter at Bhosale
Pratim. Hope you have a great day and hope you like the rest of the
conference.