Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everybody, my name is Mitri, I'm from Ukraine and I'm working at Lightsource AI as a Rust lead software engineer. A little disclaimer: if you love big money and you are a brilliant Rust engineer, you are very welcome to try out and become a part of Lightsource. But today I'm going to talk about a slightly different passion of mine: multimedia programming, and especially video and audio programming with Rust. I've been working on a project that lets you create video for about one and a half years, and this talk is actually my journey of creating videos and audio with Rust. So at the very start, I am going to begin from scratch.
What is a video? Today, video is the most popular type of content on the Internet. You know that today the Internet consists of something like 90% videos with kittens. In fact, people are spending an enormous amount of time watching videos: according to the stats, the average Internet user spends more than 10 hours per week watching videos. Not so much, you say, but that's actually dozens or hundreds of billions of hours of watching videos per week. That's a lot. But the technology behind those videos is completely outdated.
I don't even want to talk about Adobe After Effects or professional software written in C that could be replaced with safer Rust. I'm actually talking about what real users are using to create videos for TikTok, for example. By the way, did you notice how we ended up in a world where vertical video is the correct one? I didn't yet. But people are using really weird software today to create videos. They are using mobile apps, mobile video editors, mobile graphics editors, and those are really, really far away from being efficient. When I first understood that the most popular video editor renders video through the web, through the browser, through HTML and CSS, I was like, oh my God, I need to somehow fix this.
And it's actually a great time for Rust, because Rust today is probably the only correct way, I won't be afraid to say this, to work with audio or video. When we start digging deeper into what a video is, you will notice that the main problem of video is memory: handling videos or audio at runtime in garbage-collected languages is mostly impossible. But today we will start literally from scratch, and I will try to describe how I did it and how you can work with videos in Rust.
So let's start from what a video is: what is actually a video file and how does it work? If we take the most common container for videos, it's MP4. MP4 is not a codec, it's only a container for media; it can contain different types of codecs and the information for different types of codecs. The MP4 file actually consists of an image stream. Each image, as you can see on the slide right now, has a presentation and a decoding timestamp, because the encoded stream itself really depends on the ordering of frames. For example, the presentation order can be 0, 1, 2, 3, 4, but the decoding order is different, because the encoder can use information from the next frame in order to decode the current one; that's what the schema shows. But you can already understand that each cell in this graph is actually a separate image, like a PNG or an RGB bitmap in some format. We will talk about this a little bit later.
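To make the ordering concrete, here is a minimal sketch of how a demuxed stream can be modeled; the types are hypothetical, not from any particular library, but every encoded frame arrives as a packet carrying its own presentation and decoding timestamps.

```rust
/// Illustrative model of what an MP4 demuxer hands you per stream.
/// (Hypothetical types; real libraries such as libav expose richer structs.)
struct Packet {
    /// Compressed frame data (not raw pixels).
    data: Vec<u8>,
    /// Presentation timestamp: when the frame should be shown.
    pts: i64,
    /// Decoding timestamp: when the decoder must process it.
    dts: i64,
}

/// Packets arrive in decoding order; the player re-sorts by `pts`,
/// because an encoder may reference *future* frames to compress the current one.
fn presentation_order(mut packets: Vec<Packet>) -> Vec<Packet> {
    packets.sort_by_key(|p| p.pts);
    packets
}
```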
We also have audio in the video: there is an audio stream here as well. The audio stream also consists of frames, but these frames are a little bit different, because audio is made of samples, and a sample covers a much smaller amount of time: we typically record 44,100 or 48,000 samples per second. We also capture frames of audio and connect them with the same architecture of presentation and decoding timestamps.
And that's actually the basis for all the codecs you can find today on the Internet, in open source, everywhere, because a codec itself is mostly about the math: how to make a specific sequence of images eat less space on your hard drive. There are a lot of codecs today you may know, like MPEG-4, H.264, AV1 and many others; some support transparent images, some are better for streaming, some are better for the web, some are better for professional video editing and are used mostly by cinema creators. But most of them are essentially giant specifications of the mathematics used inside the codec to make the video lighter, so it's easier to transfer over the network, to stream it, and so on and so forth.
For example, the second most popular codec today, and probably the one that should be used everywhere, is HEVC, the High Efficiency Video Codec, also known as H.265. Its specification is over 700 pages, and there are plenty of implementations of it: you can use the specification to build your own codec, or you can use an open source implementation. Essentially it defines the math of how, for example, when I'm moving somewhere, the video should not store every single frame of my movement. It's enough to capture my position now and where it will be in a second, and then render something in between: capture one position, then another one, and interpolate between them. That's mostly what codecs are doing.
But you, as a user or a developer who makes videos with code, likely don't want to implement a 700-page specification yourself. And that's where Libav helps. Libav is an open library, as you may notice, a C library for audio and video, and you can use it. It's actually an abstraction over all the codecs you may find on the Internet; it supports literally everything, all the audio and video codecs, and so on and so forth. And you may know it under another, more popular name: FFmpeg. Originally FFmpeg was a CLI wrapper around Libav, but right now they are maintained within one repository and Libav is actually part of FFmpeg, which doesn't really matter here. Using FFmpeg you can render any kind of video from any kind of image and audio source, and as we noticed at the very start, literally one command is perfectly enough to generate a video.
So basically, say we want a video at 60 FPS, full HD, and we have an input of a sequence of n images. On your file system you have pic_<frame number>.png files that will then be decoded and transformed into frames using the H.265 or H.264 codec and a specific pixel format, and as an output you get the test.mp4 file.
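As a rough illustration (the file names and exact flags are my assumptions, not the speaker's exact command), invoking FFmpeg from Rust to turn a numbered PNG sequence into a 60 FPS full-HD MP4 could look like this:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Roughly the kind of one-liner described in the talk:
    // read pic_0.png, pic_1.png, ... and mux them into test.mp4.
    let status = Command::new("ffmpeg")
        .args([
            "-framerate", "60",    // input frame rate
            "-i", "pic_%d.png",    // numbered image sequence on disk
            "-s", "1920x1080",     // full-HD output
            "-c:v", "libx264",     // H.264 encoder (libx265 for H.265)
            "-pix_fmt", "yuv420p", // pixel format most players expect
            "test.mp4",
        ])
        .status()?;
    assert!(status.success());
    Ok(())
}
```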
And this actually works, and that's how most videos are made today by the free-to-use (not open source) editors on mobile and on the web. But that's really far from being efficient, just because we don't want to waste a lot of time rendering intermediate image files. We can do it directly: Libav has a public C API, and we can use it from Rust to do the encoding manually.
How it works is pretty easy. You have an image; say it's a BMP, a raw image source. Then you convert it to a YUV image (we will talk about that a bit later) and send it to the encoder, which is an implementation provided by Libav. You get back a packet; a packet is a compressed frame of the video. You assign it specific timestamps for when it should be decoded and when it should be presented, and then you put it into the media container. You do the same with the next image: the same way you get a new packet and send it to your file. The only constraint is that you must encode all the frames one by one, in sequence, simply because codecs use specific math that depends on the order in which the images are encoded.
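A sketch of that loop is below; the `Encoder` and `Muxer` traits are deliberately simplified stand-ins, not the actual libav bindings or my framework's API.

```rust
// Deliberately simplified stand-ins for libav bindings (hypothetical API,
// shown only to illustrate the flow; real wrappers differ in detail).
struct YuvFrame { planes: [Vec<u8>; 3], pts: i64 }
struct Packet { data: Vec<u8>, pts: i64, dts: i64 }

trait Encoder {
    fn send_frame(&mut self, frame: &YuvFrame);
    fn receive_packet(&mut self) -> Option<Packet>;
}
trait Muxer {
    fn write_packet(&mut self, packet: Packet);
    fn finish(&mut self);
}

/// The per-frame loop described above: convert, encode, timestamp, mux.
fn encode_sequence(frames: Vec<YuvFrame>, enc: &mut dyn Encoder, mux: &mut dyn Muxer) {
    for frame in frames {
        enc.send_frame(&frame);                // hand a raw YUV frame to the codec
        while let Some(mut packet) = enc.receive_packet() {
            packet.pts = frame.pts;            // presentation timestamp
            packet.dts = frame.pts;            // decoding timestamp (no reordering here)
            mux.write_packet(packet);          // append the compressed frame to the container
        }
    }
    mux.finish();                              // flush the encoder, finalize the MP4
}
```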
Now, a bit more about this YUV image. It's actually a legacy from past times, and that's a very interesting story I'd like to talk about a little. Basically, instead of RGB color channels, YUV is just a different representation of colors: you have the luma, or brightness, channel, which essentially defines the black-and-white picture, and two additional chroma color planes, which together give you the fully colored image. It's a legacy from analog television. When television companies faced the problem of supporting both black-and-white and color television, instead of making a complete breaking change or laying three new cables, they came up with a scheme that lets you keep using the same first cable plus two additional ones to produce the colored image, or still fall back to black and white if you need to. That's a kind of neat engineering solution.
But as a result, today you need to do something like this for each frame of your video. For pretty much any kind of image, when you render it, you need to loop over all of its pixels; for a 4K video that's a loop over roughly 8 million pixels per frame, converting each one with specific math. To be fair, YUV takes less space on the hard drive, because it can store less information for the color planes.
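A minimal sketch of that per-pixel math, assuming the common BT.601 full-range coefficients (the exact matrix depends on the color space your encoder expects):

```rust
/// Convert one RGB pixel to Y'UV using BT.601 full-range coefficients.
/// A real pipeline also does 4:2:0 chroma subsampling, i.e. it keeps one
/// U and one V sample per 2x2 block of pixels, which is why YUV frames
/// take less space than RGB.
fn rgb_to_yuv(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    let (r, g, b) = (r as f32, g as f32, b as f32);
    let y = 0.299 * r + 0.587 * g + 0.114 * b;
    let u = -0.169 * r - 0.331 * g + 0.500 * b + 128.0;
    let v = 0.500 * r - 0.419 * g - 0.081 * b + 128.0;
    (
        y.clamp(0.0, 255.0).round() as u8,
        u.clamp(0.0, 255.0).round() as u8,
        v.clamp(0.0, 255.0).round() as u8,
    )
}
```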
But once you know this, once you know how to convert an image from RGB to YUV, send it to the encoder and get a video out, you know that you can create a video. And here is the problem: you need to render the images.
And how to render images is an interesting question. It's a problem because today, surprisingly, the browser is still the most popular way to render any kind of static content. There are a lot of developers, a lot of front-end engineers producing a lot of content, and they use the browser for literally everything, even for video rendering. And that gets really ridiculous. If you try to find what these two images have in common, you will likely fail, I suppose. On the left side you can see a streamer with a fancy background and notifications appearing on the screen with animations and progress bars; on the right side you can see something completely different, the GitHub preview images for social media, the Open Graph previews. But it turns out both of these are rendered in a browser. OBS, the Open Broadcaster Software, uses a browser to render all the animatable content for notifications and everything, and all the plugins use it. And GitHub also uses a headless browser to make those Open Graph previews, which is really far from being efficient.
But I've been trying to find a way to make the frame and image rendering more efficient. And if you think it all through, it turns out we need a format that is fixed in size, because an image has fixed dimensions; that is easily animatable; that is friendly to the developer experience and easy to understand; that supports the GPU for fast and efficient rendering; and that has a specification, so it can be rendered and understood by a user. The most important part: it must be debuggable, which means nobody wants to dig through the guts of your source code and debug it just to understand a problem; it must be easy to figure out, to write, and to render. And it appears to be really hard to find something that covers all of these criteria.
But surprisingly, there has been, not always, but for a long time, a format that is perfectly fine, meets all of these criteria, and is much better than pretty much anything else for rendering any kind of static content. It's SVG, Scalable Vector Graphics, and it's used everywhere, especially on the web. It may be confusing, it may even be horrifying; those path puzzles have always horrified developers. But in fact SVG is a pretty self-contained format and it allows you to render literally everything. For example, here you can see that we are rendering SVG using a Rust macro. This is the public API of my framework; we are getting closer and closer to the actual demo of it.
You define the SVG of a frame. You define a rectangle that will be the background, because it has the full width and full height. Then you define a simple animation: here you can see that from zero to the fifth second it transitions from white to some other color, and then step by step through other colors. Then you have a text with a specific font family at specific coordinates, and another text. And, ta-da, it renders something like this.
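For reference, here is a hand-written, plain-SVG sketch of roughly that kind of animated frame; it was written for this transcript, is not generated by the framework, and the colors and font are made up:

```rust
// A hand-written example (not the framework's output) of what an animated
// SVG frame can look like: a full-size background rectangle whose fill
// animates over five seconds, plus some positioned text.
const FRAME_SVG: &str = r#"
<svg xmlns="http://www.w3.org/2000/svg" width="1920" height="1080">
  <rect width="100%" height="100%" fill="white">
    <animate attributeName="fill" dur="5s"
             values="white;#ffd166;#ef476f" fill="freeze"/>
  </rect>
  <text x="960" y="540" text-anchor="middle"
        font-family="Inter" font-size="96">Hello, world!</text>
</svg>
"#;
```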
This is the editor of frames, and yeah, welcome to my framework. This is pretty much the first public demo of the framework I'm working on, and that's how it works: it renders the SVG, it renders a timeline, and, as a bonus, it literally renders an SVG, so it's a super debuggable, super easy to preview, easy to construct, and easy to understand format. What is even more important, it's specified and pretty widely popular. In Figma, for example, you can literally construct your frame, right-click on it, copy it as SVG, paste it directly into your Rust macro, then use any kind of Rust expression inside the frame definition, and you get a preview of the video.
That's a really important part of making videos: you must make the process of making videos really fun and really smooth. And with Rust it's really possible. How do we deal with it? We have the video definition, which you've seen, with an SVG definition that is internally transformed into an AST and sent to the WASM bridge, which sends the correct frame to the editor app. The frame definition can use some APIs from the core of frames (by the way, that's the name of my framework), like animation, subtitles, and so on and so forth. And we create the editor app, which gets the SVG and shows it at the correct time.
It's pretty simple. We also have the renderer lib, which takes the same video definition, creates the images from the SVG, sends them to the encoder, and gives you the real video file. And creating the WASM bridge in Rust is really simple: you just define a macro and you have a completely working WASM-based editor that consumes the video at 60 FPS with ease.
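A tiny sketch of what such a bridge can look like with wasm-bindgen; the function and helper names here are illustrative, not the framework's real API, which is generated by a macro.

```rust
use wasm_bindgen::prelude::*;

/// Hypothetical bridge function: the editor running in the browser asks
/// for the frame at a given time and gets back the SVG markup to display.
#[wasm_bindgen]
pub fn frame_at(time_seconds: f64) -> String {
    // `render_frame_svg` stands in for the video definition compiled to WASM.
    render_frame_svg(time_seconds)
}

fn render_frame_svg(_time_seconds: f64) -> String {
    // Placeholder body for the sketch.
    "<svg xmlns='http://www.w3.org/2000/svg'/>".to_string()
}
```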
But now you also need to render your SVGs, which may not look like an easy task at first glance, because SVG, despite being pretty popular, is still a web-based specification and it's really hard to render. But thanks to the great Rust ecosystem and the awesome Rust community, when I started the project there was already a library for SVG rendering. It was created by RazrFalcon, and it has something like 1,500 tests that cover pretty much all the use cases of the SVG specification, making its SVG rendering even more precise than the Chrome browser, which is actually impressive. At first I simply depended on this library to render images. Right now I have my own fork of it, to make it more efficient for sequential rendering: to not re-render parts of the SVG that haven't changed and to reuse work between frames more efficiently. But in fact you can still render SVG in Rust directly without any kind of problems.
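For instance, with the resvg/tiny-skia stack, which I presume is the library being described here; exact signatures have shifted between resvg versions, so treat this as a sketch against a recent release rather than a definitive recipe.

```rust
use resvg::{tiny_skia, usvg};

/// Rasterize an SVG string into an RGBA pixmap that can be fed to the encoder.
/// Sketch against a recent resvg release; older versions take extra arguments
/// (e.g. a separate font database), so check the version you actually use.
fn rasterize(svg: &str, width: u32, height: u32) -> tiny_skia::Pixmap {
    let tree = usvg::Tree::from_str(svg, &usvg::Options::default()).expect("valid SVG");
    let mut pixmap = tiny_skia::Pixmap::new(width, height).expect("non-zero dimensions");
    resvg::render(&tree, tiny_skia::Transform::identity(), &mut pixmap.as_mut());
    pixmap // pixmap.data() is RGBA, ready for the RGB -> YUV conversion step
}
```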
But the problem is that this renders on the CPU, and the CPU is a pretty bad idea for rendering pretty much anything. On the flip side, when we're talking about an automation tool, a programmatic tool that automates video rendering where you give some input and get a video file as output, it turns out to be close to the best case, because nobody will actually build GPU infrastructure, which is pretty expensive, just for one feature. If you're building professional software, that's another story. But even so, it's still pretty efficient.
If we focus on the rendering, we do a clever simplification of the SVG ahead of time, because you don't need to support every SVG effect; you just need paths with a specific color scheme. And for the video I showed you a couple of slides ago, the hello-world video, you probably remember it, this did a pretty great job. You can see it's not sped up: it's real-world performance, rendering 9,900 full-HD frames
completely on the CPU, thanks to Rayon, Rust-style parallelization, and compile-time optimizations: you avoid all the extra copying, and the rendering is parallelized across all the cores, like I've shown here. Basically, the idea of CPU-based rendering is that each of your CPU cores renders a specific file, because, as you remember, a video cannot be filled with unordered frames. So we prepare the video so that the chunks are easy to concatenate, and then just concatenate them at the end.
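A sketch of that chunking idea with Rayon; the encode and concatenate helpers are hypothetical stand-ins for the real rendering and muxing code.

```rust
use rayon::prelude::*;

/// Hypothetical pipeline: split the frame range into chunks, render and encode
/// each chunk into its own segment file in parallel (frames stay ordered inside
/// a chunk), then concatenate the segments into the final video.
fn render_parallel(total_frames: usize, chunk_size: usize) {
    let segments: Vec<String> = (0..total_frames)
        .collect::<Vec<_>>()
        .par_chunks(chunk_size)                              // one chunk per worker
        .map(|frame_indices| encode_segment(frame_indices))  // returns a segment path
        .collect();                                          // order is preserved
    concatenate_segments(&segments, "output.mp4");
}

// Stand-ins for the real rendering/encoding code.
fn encode_segment(frame_indices: &[usize]) -> String {
    format!("segment_{}.mp4", frame_indices[0])
}
fn concatenate_segments(_segments: &[String], _out: &str) {}
```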
That gives a pretty performant way of rendering pretty much anything, except the things that break performance really hard: shadows, gradients, and blur are the killers of the CPU. It's simply the nature of the CPU that you have to process each pixel and calculate each position one by one. Anything that requires several passes, or smoothing a huge number of pixels across a gradient to get the right precision for every pixel, kills the performance completely.
And here is the most amazing part of this project: a GPU renderer created completely from scratch. It's still not 100% working, still not 100% compatible with the SVG stack, but it's amazingly fast.
Our CPU renderer is still very fast, though. If we compare the FPS, the frames per second rendered without encoding, on CPU versus GPU, you'll notice that the hello-world video, because it's so simple (it renders only text and a plain color, but still looks pretty nice), renders really fast on the CPU thanks to parallelization, while the GPU is less parallelizable across different frames; when you parallelize over the images it becomes only slightly faster. But when we increase the resolution, the results get pretty much even. And for blurs and gradients, when you have a full-page gradient that is then blurred, which requires something like six passes over all the pixels of the background, the GPU becomes much faster.
How it works is a pretty interesting question. Because we simplify the SVG down to paths only, we can do tessellation: the algorithm that parses the path, the dots and the vectors, into vertices and indices, hands them directly to the GPU, and gets the rendering result straight back from the GPU, which is much faster for effects and shading.
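The talk doesn't say which tessellator the renderer uses, but the idea can be sketched with the lyon crate: a vector path gets broken into triangles, vertices plus indices, that the GPU can rasterize directly.

```rust
use lyon::math::point;
use lyon::path::Path;
use lyon::tessellation::{
    BuffersBuilder, FillOptions, FillTessellator, FillVertex, VertexBuffers,
};

/// Tessellate a simple triangular path into vertex/index buffers for the GPU.
fn tessellate_demo() -> VertexBuffers<[f32; 2], u16> {
    let mut builder = Path::builder();
    builder.begin(point(0.0, 0.0));
    builder.line_to(point(100.0, 0.0));
    builder.line_to(point(50.0, 80.0));
    builder.end(true); // close the sub-path
    let path = builder.build();

    let mut buffers: VertexBuffers<[f32; 2], u16> = VertexBuffers::new();
    FillTessellator::new()
        .tessellate_path(
            &path,
            &FillOptions::default(),
            // Keep only the 2D position of each generated vertex.
            &mut BuffersBuilder::new(&mut buffers, |v: FillVertex| v.position().to_array()),
        )
        .expect("tessellation failed");
    buffers // upload `buffers.vertices` and `buffers.indices` to the GPU
}
```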
And we don't have much time left; I don't know how I ended up in a situation where I don't have enough time. But we also need to have audio inside the video. And because we can work with images pretty flexibly, we can do exactly the same with audio. We can even generate the audio with math: since we know that audio has a sinusoidal, frequency-based nature, we can generate some sound right from the code.
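For example, a 440 Hz sine tone at 48 kHz takes only a few lines; this is a minimal sketch, and a real pipeline would hand these samples on to the audio encoder.

```rust
/// Generate `seconds` of a pure 440 Hz sine tone as 48 kHz f32 samples.
fn sine_tone(seconds: f32) -> Vec<f32> {
    const SAMPLE_RATE: f32 = 48_000.0;
    const FREQUENCY: f32 = 440.0; // concert pitch A
    let total = (seconds * SAMPLE_RATE) as usize;
    (0..total)
        .map(|n| {
            let t = n as f32 / SAMPLE_RATE;
            (2.0 * std::f32::consts::PI * FREQUENCY * t).sin() * 0.5 // 0.5 = volume
        })
        .collect()
}
```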
But in fact what a real user needs is something like this: nothing more than just remixing audio. This preview uses images, but you get pretty much the same result in an audio file. You actually need to load the video with the decoder, load the audio, and then mix them. What is really awesome is that you can define, and here is how, an audio map with frames; it will then mix all the audio tracks into one and put the result into the file.
But what is more interesting: since we already have the audio data in memory, we can provide an API for users to create audio visualizations. That was actually how the project started. I had a podcast at the time, and I never liked podcasts, because you always have the same problem: you don't understand who is talking right now. But with frames, and with really not that much code, just an API that takes the vector of frequencies and renders it using rectangles, you can get a pretty awesome-looking visualization with pretty much any kind of design, and render a one-hour video within 15 minutes with frames, without any additional GPU usage, only on the CPU, which is pretty impressive.
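As a sketch of the idea (not the framework's actual visualization API): take the per-frame frequency magnitudes and turn each into an SVG rectangle whose height follows the value.

```rust
/// Turn a vector of frequency magnitudes (0.0..=1.0) into SVG bar rectangles.
/// Illustrative only; the real framework exposes its own API for this.
fn bars_svg(magnitudes: &[f32], width: f32, height: f32) -> String {
    let bar_width = width / magnitudes.len() as f32;
    let rects: String = magnitudes
        .iter()
        .enumerate()
        .map(|(i, m)| {
            let bar_height = m.clamp(0.0, 1.0) * height;
            format!(
                r#"<rect x="{:.1}" y="{:.1}" width="{:.1}" height="{:.1}" fill="steelblue"/>"#,
                i as f32 * bar_width,
                height - bar_height, // anchor bars to the bottom edge
                bar_width * 0.8,     // leave a small gap between bars
                bar_height,
            )
        })
        .collect();
    format!(
        r#"<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">{rects}</svg>"#
    )
}
```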
And I think we don't have a lot of time left, but I can say that what I've learned over one and a half years of developing this video creation framework is that videos are really interesting. And if you think the same, this is the invitation: I'm really glad to invite you to try out frames, because it has just come out into beta testing. Yeah, right now. I know, that's amazing. Starting from today, you can sign up in our Discord and just put your GitHub name into the beta channel. You can sign up either from this QR code or by going to the frames studio and trying out the internals of this project: play with the CPU and GPU rendering, create pretty much any kind of video, and automate it however you want. Thank you for watching this talk.