Transcript
This transcript was autogenerated. To make changes, submit a PR.
Today we will talk about deploying ML solutions with low latency in Python.
You can find me at my LinkedIn address mentioned here.
As the world moves forward with research on improving the accuracy of deep learning algorithms, we face an imminent problem of deploying those algorithms. As many of my fellow researchers will know, everyone present here is involved, directly or indirectly, in improving the algorithms we use today. However, the developers who use those algorithms often face the problem of making them run in real time, even after applying the techniques we'll be discussing today. So what do I mean by low latency? Latency is a term used to describe the performance of an ML pipeline: it is the time taken by the ML algorithm to process a single piece of data, such as one image or one video frame. Throughput, in contrast, is the amount of data processed per unit time, and latency is roughly inversely proportional to throughput. Our focus in this talk will be on deploying our ML pipelines with low latency. As you can see in this graph, the entire ML pipeline starts with the input data and goes through several steps before we get the output.
The bottleneck we face most of the time is model inference, and there are several methods by which we can improve the latency of the algorithms.

The first one is weight quantization, wherein you deploy the model using multiple quantization methods such as FP16 and INT8. There are two types of quantization: post-training quantization and quantization-aware training. I cannot go into detail on each of these methods, as time won't permit me to do so, but basically you can try either of them to check which one gives you better accuracy. Quantization often involves a trade-off between accuracy and speed: if you go with INT8 you will get much better speed, but there will be a drop in accuracy. I generally find that FP16 is the best method to go with, because it preserves my accuracy and also gives me about 1.5 to 2 times the speed of FP32.
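As a minimal sketch of what this looks like in code (an illustration, not from the talk): assuming a trained PyTorch model, post-training dynamic INT8 quantization and FP16 conversion can each be tried in a couple of lines. The tiny `model` below is a placeholder for your own network.

```python
import torch

# Placeholder standing in for a trained FP32 model.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)

# Post-training dynamic quantization to INT8 (fast to try, may cost accuracy).
int8_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# FP16 inference on the GPU, usually a good accuracy/speed trade-off.
fp16_model = model.half().cuda()
with torch.no_grad():
    out = fp16_model(torch.randn(1, 512).half().cuda())
```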
The second method that is often used is model pruning. The basic concept is that you prune certain layers and connections of the model based on several experiments: you run your model on a bunch of images to see which layers can be omitted or skipped, so that those parameters won't be calculated at all. In this way you can sometimes remove something like 99% of your connections and still preserve your accuracy, although that is the best-case scenario.
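For illustration (again not from the talk), here is a minimal sketch of magnitude-based pruning using PyTorch's built-in utilities; the `layer` is a hypothetical stand-in for a layer your experiments flagged as over-parameterized.

```python
import torch
import torch.nn.utils.prune as prune

# Hypothetical layer identified as a pruning candidate.
layer = torch.nn.Linear(256, 256)

# Zero out the 90% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Make the pruning permanent (drops the mask and the original weights).
prune.remove(layer, "weight")
```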
Another method that is quite popular today is knowledge distillation, which is the concept of transferring the knowledge of a bigger model into a smaller model. Suppose you have the Stanford Cars data set and you have trained a ResNet-152, and you get a test classification accuracy of 95%. Now, when you train a ResNet-18 on the same data set, you might not find that it gives you 95% accuracy; you often see that it is limited to 80% to 90% because of its shallow depth. Using knowledge distillation, you can transfer the knowledge learned by the ResNet-152 and then use the ResNet-18 with close to 95% accuracy.
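A minimal sketch of how the distillation loss is commonly written in PyTorch, assuming you already have teacher and student logits; the temperature and weighting values are typical defaults, not numbers from the talk.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend cross-entropy on the labels with a KL term that matches the
    student's softened predictions to the teacher's softened predictions."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Inside the training loop (teacher frozen, gradients only for the student):
# with torch.no_grad():
#     teacher_logits = teacher_model(images)
# loss = distillation_loss(student_model(images), teacher_logits, labels)
```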
The fourth method, which is the topic of my talk today, is framework-based deployment. We will look at two frameworks, TensorRT and DeepStream, and see how they help us deploy these algorithms.
So what is TensorRT? TensorRT is an SDK for high-performance deep learning inference. It is provided by NVIDIA, and the whole of TensorRT is written in C++ with Python bindings available. It includes a deep learning inference optimizer and a runtime. The optimizer's job is to optimize your model: when you convert it to TensorRT, it optimizes the entire model and converts the layers using advanced CUDA methods in C++. The runtime is responsible for actually running your TensorRT engines. TensorRT often delivers low latency and high throughput for many deep learning applications; I have used TensorRT in industry and it works great. It supports both Python and C++, and nowadays TensorRT supports conversion from multiple frameworks such as TensorFlow, PyTorch, MXNet, Theano, ONNX, etc. For reference, I have linked TensorRT's official documentation and developer page below.

So how does TensorRT do all this? It is responsible for optimizing your model, and it does so using the methods shown here: layer and tensor fusion, kernel auto-tuning, precision calibration, and dynamic tensor memory. It is also possible to use multi-stream execution with TensorRT, which means you can use batch processing without reworking your code. We'll be talking about these methods in the following slides.
Let's talk about weight and activation precision calibration. To quantize full-precision information into INT8 while minimizing accuracy loss, TensorRT must perform a process called calibration to determine how best to represent the weights and activations as 8-bit integers. This calibration step requires you to provide TensorRT with a representative sample of the input training data; no additional fine-tuning or retraining of the model is necessary, and you don't need access to the entire training data set, you just give it a sample. Calibration is a completely automated and parameter-free method for converting your model from FP32 to INT8.
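To make this concrete, here is a hedged sketch of what an INT8 calibrator can look like with the TensorRT Python API (class and method names follow TensorRT 7/8 and may differ slightly in other versions); the batch source and cache file name are placeholders.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds a small, representative sample of training data to TensorRT."""

    def __init__(self, sample_batches, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(sample_batches)        # list of NCHW float32 arrays
        self.cache_file = cache_file
        first = sample_batches[0]
        self.batch_size = first.shape[0]
        self.device_input = cuda.mem_alloc(first.nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                             # no more data: calibration done
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

You would then attach it while building the engine, roughly via `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = EntropyCalibrator(samples)`.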
What is kernel auto-tuning? During its optimization phase, TensorRT also chooses from hundreds of specialized kernels that are created by default, many of them hand-tuned and optimized for a range of parameters and target platforms. As an example, there are several different algorithms for computing convolutions. TensorRT will pick the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size, and other parameters. This ensures that the deployed model is tuned for high performance on the specific deployment platform as well as for the specific neural network being deployed.
Note also that TensorRT converts your model for a particular deployment platform. You cannot build a TensorRT engine on, say, an NVIDIA 1050 Ti and then use it on a 2060 Ti; you have to build it on the particular GPU it will be used on. So what is dynamic tensor memory?
TensorRT also reduces the memory footprint and improves memory reuse by designating memory for each tensor only for the duration of its use, avoiding memory allocation overhead for fast and efficient execution. And what is multi-stream execution? As I mentioned before, it is basically the ability of TensorRT to process multiple input streams in parallel, and it does so beautifully.
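A rough sketch of the idea, assuming you already have a deserialized `engine` and per-stream input/output bindings (both placeholders here): each input stream gets its own execution context and CUDA stream, and inference is launched asynchronously on all of them.

```python
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

# One execution context and one CUDA stream per input stream (two here).
contexts = [engine.create_execution_context() for _ in range(2)]
streams = [cuda.Stream() for _ in range(2)]

# Hypothetical per_stream_bindings: device pointers prepared per stream.
# for ctx, stream, bindings in zip(contexts, streams, per_stream_bindings):
#     ctx.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
# for stream in streams:
#     stream.synchronize()
```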
And what is layer and tensor fusion? During the optimization step, TensorRT looks at the entire model and fuses several layers or tensors together, so that you reduce the parameter calculations and reduce the number of times data has to be passed from one layer to another: the fused layers become a single block. Suppose you have a convolution layer, an activation function layer, and a fully connected layer in a network. What TensorRT will do is combine all three into a single module. This reduces the time the data takes to traverse the network, and it also reduces some of the overhead incurred by each layer call. Here you can see the difference between the original network and the TensorRT-optimized network. This is an example of Google's GoogLeNet architecture, which won the ImageNet competition in 2014. As shown here, after layer and tensor fusion, the left side has many layers, whereas the right side has only a small number of layers.
A deep learning framework makes multiple function calls to invoke each layer, and since each layer runs on the GPU, that translates to multiple CUDA kernel launches. The kernel computation is often very fast relative to the kernel launch overhead and the cost of reading and writing the tensor data for each layer. This results in a memory bandwidth bottleneck and under-utilization of the available GPU resources. TensorRT addresses this by vertically fusing kernels to perform the sequential operations together; this layer fusion reduces kernel launches and avoids writing to and reading from memory between layers. In the figure shown, the convolution, bias, and ReLU layers of various sizes can be combined into a single layer called CBR. A simple analogy is making three separate trips to the supermarket to buy three items versus buying all three in a single trip. TensorRT also recognizes layers that share the same input data and filter size but have different weights: instead of three separate kernels, TensorRT fuses them horizontally into a single wider kernel, shown as the 1×1 CBR. TensorRT also eliminates the concatenation layer by pre-allocating output buffers and writing into them in a strided fashion, and that removes a lot of the overhead.
I actually ran several measurements myself. These are three networks that I tested with TensorRT on my 1650 Ti GPU. You can see the number of layers before fusion and the number of layers after fusion: the latter is much smaller, which means far fewer kernel calls.
So how do you make TensorRT work? Suppose we are using a PyTorch-based model. What you simply have to do is convert the PyTorch model to ONNX and import the ONNX model into TensorRT. You don't have to specify anything else: TensorRT will automatically apply its optimizations and generate an engine. You can then use that engine to perform inference on the GPU. The process is similar for a TensorFlow-based model or an MXNet-based model.
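Here is a minimal sketch of that PyTorch → ONNX → TensorRT path, using ResNet-18 purely as a stand-in model; the API calls follow the TensorRT 7/8 Python API and may need small changes on other versions.

```python
import torch
import torchvision
import tensorrt as trt

# 1) Export a PyTorch model to ONNX (weights don't matter for this sketch).
model = torchvision.models.resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx", opset_version=11)

# 2) Parse the ONNX file and build a TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30        # 1 GB workspace
config.set_flag(trt.BuilderFlag.FP16)      # FP16 usually keeps accuracy

engine = builder.build_engine(network, config)
with open("model.engine", "wb") as f:
    f.write(engine.serialize())            # serialize once, reuse later
```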
Another way you can convert to TensorRT is by using the network definition API provided in C++ and Python. It does give you a benefit: you can get better accuracy and better speed, marginally better in some cases and exceptionally better in others. You can try out both methods, but the easiest one is to directly use the ONNX parser that is provided with TensorRT.
These are some of the metrics I gathered on another GPU, a GTX 1080, using a batch size of one, with a RetinaFace ResNet-50-based model as well as the RetinaFace MobileNet-0.25 model. With FP32 and an input shape of 640 by 480 I got 81 FPS, and with the INT8-based model I got 190 FPS. These are better than real time, and you can now use multi-stream processing with these models; believe me, RetinaFace won't give you this by default on PyTorch or TensorFlow. The RetinaFace MobileNet-based model, even in FP32, gave me 400 FPS. And if you look at the object detection model YOLOv5, you can see the FPS metrics on the right and they are super awesome: you can even use YOLOv5-large on a GTX 1080 for real-time processing.
What are the best practices for deploying the model? Try multiple quantization methods, and do not discard INT8 too quickly. Do not build an engine for each inference, as that is a large overhead: serialize the engine to disk and then reuse it for your inference. Also try out different workspace sizes, because that can reduce your memory footprint.
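A short sketch of the "serialize once, reuse forever" practice, assuming the engine file was written at build time as shown earlier:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Load the engine serialized at build time and reuse it for every inference
# call, instead of rebuilding the engine each time.
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# context.execute_v2(...) / execute_async_v2(...) can now be called repeatedly.
```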
Some things to keep in mind while using TensorRT: the engines generated are specific to the machine; installation takes time without Docker, and I usually go with Docker whenever I'm installing TensorRT because that's much easier; and there are multiple APIs for conversion, namely the ONNX parser, the UFF parser, the network definition API in C++, and the network definition API in Python.
That was all I wanted to say about TensorRT. It is a very broad topic and I would recommend you go check it out. Moving forward, we will be talking about DeepStream. It builds on TensorRT, but it is a pipeline provided by NVIDIA specifically for deploying deep learning and ML solutions. It is a multi-platform, scalable framework with TLS security. It can deploy on the edge as well as on any cloud, it supports both Python and C++, and it uses GStreamer; NVIDIA has custom-developed the GStreamer plugins for the GPU, so you get low overhead in the pre-processing and post-processing steps. With TensorRT we saw that our target was the model inference phase, which was the bottleneck; but if we want even more speed after converting the model to an engine, we can use DeepStream, as it will optimize the entire pipeline. The applications and services of DeepStream are shown below. You can use Python or C++. The DeepStream SDK provides hardware-accelerated plugins, bi-directional IoT messaging, OTA model updates, reference applications, and Helm charts. Below that there is a CUDA layer which is used to deploy your models, and you can use any of the NVIDIA computing platforms shown here.
So what is the process of a DeepStream pipeline? The first step is capturing your stream. It could be a raw stream, an RTSP stream, an HTTP stream, or a video recorded to disk. It is generally read using the CPU; it's not read using the GPU right now. After that you often have to decode the stream, because it can be in multiple formats, and that decoding is actually done on the GPU, so it's much faster than on the CPU, as you can imagine. After that, DeepStream does image processing, in case you want any pre-processing steps such as scaling, dewarping, cropping, et cetera, and all of these steps are done on the GPU. With DeepStream you also get automatic batching, so you don't have to rewrite your pipeline to batch frames together before sending them to the model; DeepStream does this job on its own, and this part runs on the CPU. After that you have your classifiers, detectors, or segmentation models, which run on TensorRT or on the Triton Inference Server. DeepStream also provides tracking out of the box; it runs on both the GPU and the CPU, and the tracker is quite easy to use as it's already built into DeepStream. After that you can do two things: either you visualize your results on an on-screen display, with that conversion done on the GPU, or you store them in the cloud or write them to disk, and that can be done over an HDMI cable, to a SATA drive, or using the NVENC plugin.
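As an illustration of how such a pipeline can be wired up from Python, here is a sketch built from the standard DeepStream sample elements (nvstreammux, nvinfer, nvdsosd, and so on); the input file and the nvinfer config path are placeholders.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)
# Decode an H.264 file, batch it, run TensorRT inference, draw boxes, display.
pipeline = Gst.parse_launch(
    "filesrc location=sample.h264 ! h264parse ! nvv4l2decoder ! m.sink_0 "
    "nvstreammux name=m batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=config_infer_primary.txt ! "
    "nvvideoconvert ! nvdsosd ! nveglglessink"
)
pipeline.set_state(Gst.State.PLAYING)
try:
    GLib.MainLoop().run()          # run until interrupted
finally:
    pipeline.set_state(Gst.State.NULL)
```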
These are some of the models that NVIDIA provides; they have custom-built all of these for DeepStream. The use cases are pretty specific, as the model names suggest, and you can see that you get 1100 FPS on a T4 inference server and real-time FPS on a Jetson Xavier, which is quite awesome. You can even run face detection on a Jetson Nano at 95 FPS. And if you also apply model pruning and the several other steps we discussed before, you can get even better FPS from these models.
Here are several additional resources on TensorRT and several resources on DeepStream. The slide deck is provided to you, and I hope you will check these topics out using these two resources. Thank you, and I'm glad to be present here. The code for the RetinaFace TensorRT conversion is shown at the GitHub link; please go and see how you can convert your RetinaFace-based model to TensorRT and deploy it on any machine. Thank you.