Transcript
This transcript was autogenerated. To make changes, submit a PR.
Today we will talk about deploying ML solutions with low latency in Python.
You can find me at my LinkedIn address mentioned here.
As the world moves forward with research on improving the accuracy of deep learning algorithms, we face an imminent problem of deploying those algorithms. As many of my fellow researchers will know, everyone present here is involved, directly or indirectly, in improving the algorithms we use today. However, the developers who use those algorithms often face the problem of making them run in real time, even after applying the techniques we'll be discussing today. So what do I mean by low latency? Latency is a term used to describe the performance of an ML pipeline: it is the time taken by the ML algorithm to process a single piece of data, such as one image or one video frame. Throughput, in contrast, is the amount of data processed per unit time, and latency is roughly inversely proportional to throughput. Our focus in this talk will be on deploying our ML pipelines with low latency. As you can see in this graph, the entire ML pipeline starts with the input data and goes through several steps before we get the output.
The bottleneck we face most of the time is model inference, and there are several methods by which we can improve the latency of the algorithms.

The first one is weight quantization, wherein you deploy the model using multiple quantization methods such as FP16 and INT8. There are two types of quantization: post-training quantization and quantization-aware training. I cannot go into detail on each of these methods, as time won't permit me to do so, but basically you can try either of them to check which one gives you better accuracy. Quantization often involves a trade-off between accuracy and speed: if you go with INT8 you will get much better speed, but there will be a drop in accuracy. I generally find that FP16 is the best method to go with, because it preserves my accuracy and also gives me about 1.5 to 2 times the speed of FP32.
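As a minimal sketch of what this looks like in code (an illustration, not from the talk): assuming a trained PyTorch model, post-training dynamic INT8 quantization and FP16 conversion can each be tried in a couple of lines. The tiny `model` below is a placeholder for your own network.

```python
import torch

# Placeholder standing in for a trained FP32 model.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)

# Post-training dynamic quantization to INT8 (fast to try, may cost accuracy).
int8_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# FP16 inference on the GPU, usually a good accuracy/speed trade-off.
fp16_model = model.half().cuda()
with torch.no_grad():
    out = fp16_model(torch.randn(1, 512).half().cuda())
```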
The second method that is often used is model pruning. The basic concept is that you prune certain layers and connections of the model based on several experiments: you run your model on a bunch of images to see which layers can be omitted or skipped, so that those parameters won't be calculated at all. In this way you can sometimes remove something like 99% of your connections and still preserve your accuracy, although that is the best-case scenario.
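For illustration (again not from the talk), here is a minimal sketch of magnitude-based pruning using PyTorch's built-in utilities; the `layer` is a hypothetical stand-in for a layer your experiments flagged as over-parameterized.

```python
import torch
import torch.nn.utils.prune as prune

# Hypothetical layer identified as a pruning candidate.
layer = torch.nn.Linear(256, 256)

# Zero out the 90% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Make the pruning permanent (drops the mask and the original weights).
prune.remove(layer, "weight")
```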
Another method that is quite popular today is knowledge distillation, which is the concept of transferring the knowledge of a bigger model into a smaller model. Suppose you have the Stanford Cars data set and you have trained a ResNet-152, and you get a test classification accuracy of 95%. Now, when you train a ResNet-18 on the same data set, you might not find that it gives you 95% accuracy; you often see that it is limited to 80% to 90% because of its shallow depth. Using knowledge distillation, you can transfer the knowledge learned by the ResNet-152 and then use the ResNet-18 with close to 95% accuracy.
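A minimal sketch of how the distillation loss is commonly written in PyTorch, assuming you already have teacher and student logits; the temperature and weighting values are typical defaults, not numbers from the talk.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend cross-entropy on the labels with a KL term that matches the
    student's softened predictions to the teacher's softened predictions."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Inside the training loop (teacher frozen, gradients only for the student):
# with torch.no_grad():
#     teacher_logits = teacher_model(images)
# loss = distillation_loss(student_model(images), teacher_logits, labels)
```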
The fourth method, which is the topic of my talk today, is framework-based deployment. We will look at two frameworks, TensorRT and DeepStream, and see how they help us deploy these algorithms.
So what is TensorRT? TensorRT is an SDK for high-performance deep learning inference. It is provided by NVIDIA, and the whole of TensorRT is written in C++ with Python bindings available. It includes a deep learning inference optimizer and a runtime. The optimizer's job is to optimize your model: when you convert it to TensorRT, it optimizes the entire model and converts the layers using advanced CUDA methods in C++. The runtime is responsible for actually running your TensorRT engines. TensorRT often delivers low latency and high throughput for many deep learning applications; I have used TensorRT in industry and it works great. It supports both Python and C++, and nowadays TensorRT supports conversion from multiple frameworks such as TensorFlow, PyTorch, MXNet, Theano, ONNX, etc. For reference, I have linked TensorRT's official documentation and developer page below.

So how does TensorRT do all this? It is responsible for optimizing your model, and it does so using the methods shown here: layer and tensor fusion, kernel auto-tuning, precision calibration, and dynamic tensor memory. It is also possible to use multi-stream execution with TensorRT, which means you can use batch processing without reworking your code. We'll be talking about these methods in the following slides.
Let's talk about weight and activation precision calibration. To quantize full-precision information into INT8 while minimizing accuracy loss, TensorRT must perform a process called calibration to determine how best to represent the weights and activations as 8-bit integers. This calibration step requires you to provide TensorRT with a representative sample of the input training data; no additional fine-tuning or retraining of the model is necessary, and you don't need access to the entire training data set, you just give it a sample. Calibration is a completely automated and parameter-free method for converting your model from FP32 to INT8.
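To make this concrete, here is a hedged sketch of what an INT8 calibrator can look like with the TensorRT Python API (class and method names follow TensorRT 7/8 and may differ slightly in other versions); the batch source and cache file name are placeholders.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds a small, representative sample of training data to TensorRT."""

    def __init__(self, sample_batches, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(sample_batches)        # list of NCHW float32 arrays
        self.cache_file = cache_file
        first = sample_batches[0]
        self.batch_size = first.shape[0]
        self.device_input = cuda.mem_alloc(first.nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                             # no more data: calibration done
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

You would then attach it while building the engine, roughly via `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = EntropyCalibrator(samples)`.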
What is kernel auto-tuning? During its optimization phase, TensorRT also chooses from hundreds of specialized kernels that are created by default, many of them hand-tuned and optimized for a range of parameters and target platforms. As an example, there are several different algorithms for computing convolutions. TensorRT will pick the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size, and other parameters. This ensures that the deployed model is tuned for high performance on the specific deployment platform as well as for the specific neural network being deployed.
Note also that TensorRT converts your model for a particular deployment platform. You cannot build a TensorRT engine on, say, an NVIDIA 1050 Ti and then use it on a 2060 Ti; you have to build it on the particular GPU it will be used on. So what is dynamic tensor memory?
TensorRT also reduces the memory footprint and improves memory reuse by designating memory for each tensor only for the duration of its use, avoiding memory allocation overhead for fast and efficient execution. And what is multi-stream execution? As I mentioned before, it is basically the ability of TensorRT to process multiple input streams in parallel, and it does so beautifully.
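A rough sketch of the idea, assuming you already have a deserialized `engine` and per-stream input/output bindings (both placeholders here): each input stream gets its own execution context and CUDA stream, and inference is launched asynchronously on all of them.

```python
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

# One execution context and one CUDA stream per input stream (two here).
contexts = [engine.create_execution_context() for _ in range(2)]
streams = [cuda.Stream() for _ in range(2)]

# Hypothetical per_stream_bindings: device pointers prepared per stream.
# for ctx, stream, bindings in zip(contexts, streams, per_stream_bindings):
#     ctx.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
# for stream in streams:
#     stream.synchronize()
```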
And what is layer and tensor fusion? During the optimization step, TensorRT looks at the entire model and fuses several layers or tensors together, so that you reduce the parameter calculations and reduce the number of times data has to be passed from one layer to another: the fused layers become a single block. Suppose you have a convolution layer, an activation function layer, and a fully connected layer in a network. What TensorRT will do is combine all three into a single module. This reduces the time the data takes to traverse the network, and it also reduces some of the overhead incurred by each layer call. Here you can see the difference between the original network and the TensorRT-optimized network. This is an example of Google's GoogLeNet architecture, which won the ImageNet competition in 2014. As shown here, after layer and tensor fusion, the left side has many layers, whereas the right side has only a small number of layers.
A deep learning framework makes multiple function calls to invoke each layer, and since each layer runs on the GPU, that translates to multiple CUDA kernel launches. The kernel computation is often very fast relative to the kernel launch overhead and the cost of reading and writing the tensor data for each layer. This results in a memory bandwidth bottleneck and under-utilization of the available GPU resources. TensorRT addresses this by vertically fusing kernels to perform the sequential operations together; this layer fusion reduces kernel launches and avoids writing to and reading from memory between layers. In the figure shown, the convolution, bias, and ReLU layers of various sizes can be combined into a single layer called CBR. A simple analogy is making three separate trips to the supermarket to buy three items versus buying all three in a single trip. TensorRT also recognizes layers that share the same input data and filter size but have different weights: instead of three separate kernels, TensorRT fuses them horizontally into a single wider kernel, shown as the 1×1 CBR. TensorRT also eliminates the concatenation layer by pre-allocating output buffers and writing into them in a strided fashion, and that removes a lot of the overhead.
I actually ran several measurements myself. These are three networks that I tested with TensorRT on my 1650 Ti GPU. You can see the number of layers before fusion and the number of layers after fusion: the latter is much smaller, which means far fewer kernel calls.
So how do you make TensorRT work? Suppose we are using a PyTorch-based model. What you simply have to do is convert the PyTorch model to ONNX and import the ONNX model into TensorRT. You don't have to specify anything else: TensorRT will automatically apply its optimizations and generate an engine. You can then use that engine to perform inference on the GPU. The process is similar for a TensorFlow-based model or an MXNet-based model.
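Here is a minimal sketch of that PyTorch → ONNX → TensorRT path, using ResNet-18 purely as a stand-in model; the API calls follow the TensorRT 7/8 Python API and may need small changes on other versions.

```python
import torch
import torchvision
import tensorrt as trt

# 1) Export a PyTorch model to ONNX (weights don't matter for this sketch).
model = torchvision.models.resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx", opset_version=11)

# 2) Parse the ONNX file and build a TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30        # 1 GB workspace
config.set_flag(trt.BuilderFlag.FP16)      # FP16 usually keeps accuracy

engine = builder.build_engine(network, config)
with open("model.engine", "wb") as f:
    f.write(engine.serialize())            # serialize once, reuse later
```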
Another way you can convert to TensorRT is by using the network definition API provided in C++ and Python. It does give you a benefit: you can get better accuracy and better speed, marginally better in some cases and exceptionally better in others. You can try out both methods, but the easiest one is to directly use the ONNX parser that is provided with TensorRT.
These are some of the metrics I gathered on another GPU, a GTX 1080, using a batch size of one, with a RetinaFace ResNet-50-based model as well as the RetinaFace MobileNet-0.25 model. With FP32 and an input shape of 640 by 480 I got 81 FPS, and with the INT8-based model I got 190 FPS. These are better than real time, and you can now use multi-stream processing with these models; believe me, RetinaFace won't give you this by default on PyTorch or TensorFlow. The RetinaFace MobileNet-based model, even in FP32, gave me 400 FPS. And if you look at the object detection model YOLOv5, you can see the FPS metrics on the right and they are super awesome: you can even use YOLOv5-large on a GTX 1080 for real-time processing.
What are the best practices for deploying the model? Try multiple quantization methods, and do not discard INT8 too quickly. Do not build an engine for each inference, as that is a large overhead: serialize the engine to disk and then reuse it for your inference. Also try out different workspace sizes, because that can reduce your memory footprint.
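A short sketch of the "serialize once, reuse forever" practice, assuming the engine file was written at build time as shown earlier:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Load the engine serialized at build time and reuse it for every inference
# call, instead of rebuilding the engine each time.
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# context.execute_v2(...) / execute_async_v2(...) can now be called repeatedly.
```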
Some things to keep in mind while using TensorRT: the engines generated are specific to the machine; installation takes time without Docker, and I usually go with Docker whenever I'm installing TensorRT because that's much easier; and there are multiple APIs for conversion, namely the ONNX parser, the UFF parser, the network definition API in C++, and the network definition API in Python.
That was all I wanted to say about TensorRT. It is a very broad topic and I would recommend you go check it out. Moving forward, we will be talking about DeepStream. It builds on TensorRT, but it is a pipeline provided by NVIDIA specifically for deploying deep learning and ML solutions. It is a multi-platform, scalable framework with TLS security. It can deploy on the edge as well as on any cloud, it supports both Python and C++, and it uses GStreamer; NVIDIA has custom-developed the GStreamer plugins for the GPU, so you get low overhead in the pre-processing and post-processing steps. With TensorRT we saw that our target was the model inference phase, which was the bottleneck; but if we want even more speed after converting the model to an engine, we can use DeepStream, as it will optimize the entire pipeline. The applications and services of DeepStream are shown below. You can use Python or C++. The DeepStream SDK provides hardware-accelerated plugins, bi-directional IoT messaging, OTA model updates, reference applications, and Helm charts. Below that there is a CUDA layer which is used to deploy your models, and you can use any of the NVIDIA computing platforms shown here.
So what is the process of a DeepStream pipeline? The first step is capturing your stream. It could be a raw stream, an RTSP stream, an HTTP stream, or a video recorded to disk. It is generally read using the CPU; it's not read using the GPU right now. After that you often have to decode the stream, because it can be in multiple formats, and that decoding is actually done on the GPU, so it's much faster than on the CPU, as you can imagine. After that, DeepStream does image processing, in case you want any pre-processing steps such as scaling, dewarping, cropping, et cetera, and all of these steps are done on the GPU. With DeepStream you also get automatic batching, so you don't have to rewrite your pipeline to batch frames together before sending them to the model; DeepStream does this job on its own, and this part runs on the CPU. After that you have your classifiers, detectors, or segmentation models, which run on TensorRT or on the Triton Inference Server. DeepStream also provides tracking out of the box; it runs on both the GPU and the CPU, and the tracker is quite easy to use as it's already built into DeepStream. After that you can do two things: either you visualize your results on an on-screen display, with that conversion done on the GPU, or you store them in the cloud or write them to disk, and that can be done over an HDMI cable, to a SATA drive, or using the NVENC plugin.
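As an illustration of how such a pipeline can be wired up from Python, here is a sketch built from the standard DeepStream sample elements (nvstreammux, nvinfer, nvdsosd, and so on); the input file and the nvinfer config path are placeholders.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)
# Decode an H.264 file, batch it, run TensorRT inference, draw boxes, display.
pipeline = Gst.parse_launch(
    "filesrc location=sample.h264 ! h264parse ! nvv4l2decoder ! m.sink_0 "
    "nvstreammux name=m batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=config_infer_primary.txt ! "
    "nvvideoconvert ! nvdsosd ! nveglglessink"
)
pipeline.set_state(Gst.State.PLAYING)
try:
    GLib.MainLoop().run()          # run until interrupted
finally:
    pipeline.set_state(Gst.State.NULL)
```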
These are some of the models that NVIDIA provides; they have custom-built all of these for DeepStream. The use cases are pretty specific, as the model names suggest, and you can see that you get 1100 FPS on a T4 inference server and real-time FPS on a Jetson Xavier, which is quite awesome. You can even run face detection on a Jetson Nano at 95 FPS. And if you also apply model pruning and the several other steps we discussed before, you can get even better FPS from these models.
Here are several additional resources on TensorRT and several resources on DeepStream. The slide deck is provided to you, and I hope you will check these topics out using these two resources. Thank you, and I'm glad to be present here. The code for the RetinaFace TensorRT conversion is shown at the GitHub link; please go and see how you can convert your RetinaFace-based model to TensorRT and deploy it on any machine. Thank you.