Conf42 Machine Learning 2023 - Online

Cascade models in computer vision to boost accuracy and performance

Abstract

Hi! I have been following your events, and I would be thrilled to give a talk about my favourite field, computer vision. I have experience with real tasks, articles, lectures, and hackathons, and I would love to share that experience, especially because some low-cost techniques can really boost performance.

Summary

  • A computer vision engineer at Diagnacad talks about a technique in computer vision. The technique uses two neural nets in cascade. Both nets are tuned in a specific way to complement each other. It can be applied to business problems such as monitoring security cameras while keeping false alarms down.
  • You need to collect data as close to inference conditions as possible. As the first neural network, the detector, I often use YOLO models. I highly recommend converting your models to TensorRT. The second thing I recommend is having a serving solution.
  • Try using a classifier as your second model, which runs on the detector's crop whenever the detector finds anything. This increases precision without sacrificing speed.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, my name is Argo, and I'm a computer vision engineer at Diagnacad. Today I would like to talk about a cool little technique in computer vision, and we will discuss different cases where you can use it.
Let's start with some basics. Computer vision is, generally speaking, when we use an algorithm to analyze an image. In our case, we're talking about deep learning and neural networks. For image processing, classification is the simplest problem, and it is used when we want to know if there is a cat or a dog in the image, for example. So classification is the most basic one: we give an image to a neural net and get only a class prediction. But in the real world, classification is not that useful just by itself. In a lot of cases, we need an object detection or segmentation algorithm. In an object detection task, we have two subtasks. The first one is to localize our object, and the second one is to classify that object. When we find an object, we localize it with a bounding box, or bbox. Computationally speaking, object detection is a lot pricier than just classification, so we are going to keep that in mind.
Let's get closer to a real case where we can make use of an object detection algorithm for solving business problems. Imagine that we have a lot of security cameras, and we want to make sure that our building is safe. Maybe it's a school or a shopping center. We want to detect any situation where someone gets in with a gun. It would be really expensive to have a lot of security guards watching every security camera. So it would be nice to have a neural net which can process video streams from all cameras and send an alert if something suspicious is happening. That's an object detection task. But because the system will work 24/7, we will get false alarms, and quite a lot of them. By false alarm, I mean when the model thinks that, for example, a cell phone is a handgun and gives a false alert.
So we would like to have an operator to validate those alarms.
Difficulties? Yeah, there are some in this task, and we're going to talk about a couple of them. The first one is that our solution needs to be fast. It means that it should be able to process a lot of frames per second, and the more cameras we have in our building, the more frames we need to process. We can't use an infinite amount of hardware, of course, because if the solution is expensive, it doesn't make any sense. The faster the neural net we have, the more frames per second we can process on one server. The second problem is a lot of false alerts. We have already talked about it a little bit, but generally speaking, our video stream should be processed 24/7, and that's more than 2 million frames per day per camera. That means even if we have a really low rate of false positives, we're still going to get them. So we have to deal with it. And the third problem is that our first two problems don't play well with each other, unfortunately, as there's always a tradeoff between accuracy and speed. The faster the neural net we have, the less accuracy we get and the more false alerts we create. Usually, if we want more accurate models, we need more hardware for running bigger models at the needed speeds.
But is there anything that can help a little bit with this tradeoff? Yeah, fortunately there are several techniques, and I'm going to talk about one of them, which has shown its efficacy in different cases. So how can we increase our accuracy without losing processing speed, while using the same amount of computing hardware? We can get pretty close to that with this idea. The technique is to use two neural nets in cascade, so one after another, with both nets tuned in a specific way to complement each other. The first neural net is a fast detector. It runs on the whole frame, downscaled to 640 by 640, localizes the object of interest, and classifies it.
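
As a side note, the "more than 2 million frames per day per camera" figure can be checked with simple arithmetic. The sketch below assumes a 25 fps stream and a per-frame false positive rate of 1e-5; both numbers are illustrative assumptions, not values from the talk, and treating frames as independent is a simplification.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def frames_per_day(fps: float) -> int:
    """Frames one camera produces in a day at the given frame rate."""
    return int(fps * SECONDS_PER_DAY)


def expected_false_alerts_per_day(fps: float,
                                  false_positive_rate: float,
                                  cameras: int = 1) -> float:
    """Expected false alerts per day, treating every frame independently.

    false_positive_rate is a hypothetical per-frame probability that the
    model raises an alert when nothing dangerous is in the frame.
    """
    return frames_per_day(fps) * false_positive_rate * cameras


if __name__ == "__main__":
    fps = 25  # assumed frame rate, not stated in the talk
    print("frames per day per camera:", frames_per_day(fps))
    print("expected false alerts per day, 1 camera:",
          expected_false_alerts_per_day(fps, 1e-5))
    print("expected false alerts per day, 10 cameras:",
          expected_false_alerts_per_day(fps, 1e-5, cameras=10))
```

Even with one false positive per hundred thousand frames, a single camera would be expected to raise roughly twenty false alerts a day under these assumptions, which is why a validation step matters.
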
This first model, the detector, should be tuned to have higher recall at the cost of precision, which means that the model will try to always detect a gun but will often give a false positive on a cell phone or something else. The detector should be trained on images with guns that are relevant to inference. The second neural net is a classifier. It runs on the crop made by the detector, so it runs only if the detector finds anything in the frame. The classifier's task is to validate whether the object chosen by the detector is a gun or not. The classifier should be trained on crops from the detector. So again, real images, but crops from the detector. And the classifier should have several classes: not only a gun, but also some other objects which are often detected as a gun, for example a cell phone, an umbrella, a bag, or something else. The goal of the detector is to detect an object which might be a gun, and the classifier should be able to correctly validate that object. So training of both nets is done based on that. I should mention that the classifier is easier to retrain based on real cases after deployment, so that's a great thing too.
So why does it work? Let's discuss a couple of things. First of all, why is it fast? Basically, the speed of our solution is tied to the speed of the first neural network, the detector. That's because our second net runs only in cases when the first net finds anything, so 99% of the time the second net is idle. The second thing is that because we have our classifier, and it can really validate our prediction, we can pick a fast detector.
And now let's discuss why we gain accuracy. First of all, we use an ensemble of two models. Second, the second model gets a full-resolution part of the image, not downscaled like for the first net. Third, we use different data sets for our models, and our classifier is tuned to distinguish guns from other similar objects. And one more thing: from the detector we need good detection, and from the classifier we need good classification, so we can choose models based exactly on that.
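
The cascade described above could be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `fake_detector`, `fake_classifier`, and `make_demo_frame` are hypothetical stand-ins for real models and real video frames, the thresholds 0.2 and 0.8 are assumed values, and the detector is assumed to see a plain resize of the frame to 640 by 640 (real detectors often use letterbox padding instead, which would change the box mapping).

```python
from dataclasses import dataclass
from typing import List, Tuple

Frame = List[List[int]]          # toy grayscale frame: rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in full-frame pixels


@dataclass
class Detection:
    """A detector output in the downscaled (det_size x det_size) space."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def scale_box(det: Detection, det_size: int, frame_w: int, frame_h: int) -> Box:
    """Map a box from detector space back to full-resolution pixels."""
    sx, sy = frame_w / det_size, frame_h / det_size
    return (max(0, int(det.x1 * sx)), max(0, int(det.y1 * sy)),
            min(frame_w, int(det.x2 * sx)), min(frame_h, int(det.y2 * sy)))


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the box out of the full-resolution frame."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


def cascade(frame, detector, classifier, det_size=640,
            det_threshold=0.2, cls_threshold=0.8, target="gun"):
    """Run the detector, then validate each detection with the classifier."""
    frame_h, frame_w = len(frame), len(frame[0])
    alerts = []
    for det in detector(frame):
        if det.score < det_threshold:   # low threshold keeps recall high
            continue
        box = scale_box(det, det_size, frame_w, frame_h)
        # The classifier only runs here, on a full-resolution crop.
        label, prob = classifier(crop(frame, box))
        if label == target and prob >= cls_threshold:
            alerts.append((box, prob))
    return alerts


# --- Hypothetical stand-ins for real models and data ---
def fake_detector(frame):
    """Pretends to find two gun-like objects in 640x640 coordinates."""
    return [Detection(100, 100, 200, 200, 0.6),
            Detection(400, 400, 500, 500, 0.3)]


def fake_classifier(crop_pixels):
    """Calls bright crops 'gun' and everything else 'cell phone'."""
    total = sum(sum(row) for row in crop_pixels)
    count = sum(len(row) for row in crop_pixels)
    mean = total / count if count else 0.0
    return ("gun", 0.95) if mean > 128 else ("cell phone", 0.9)


def make_demo_frame(width=1280, height=720):
    """Black frame with one bright rectangle where the first box lands."""
    frame = [[0] * width for _ in range(height)]
    for y in range(112, 225):
        for x in range(200, 400):
            frame[y][x] = 255
    return frame


if __name__ == "__main__":
    print(cascade(make_demo_frame(), fake_detector, fake_classifier))
```

In this toy run both detections pass the low detector threshold, but only the first survives the classifier, so one alert is raised instead of two. When the detector returns nothing, the classifier is never called, which is why the cascade's throughput is dominated by the detector.
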
We do not need to sacrifice the classification power of the detector just because it is fast.
Here are some metrics. As you can see, on the right side we have slightly decreased recall but noticeably increased precision. And that's good because we don't want to spam operators with false alerts.
Now let's touch a little bit on the specifics of collecting the data set for this technique. Of course, you should remember the general things, like collecting a balanced, diverse, and representative data set, having good and consistent labeling, and so on. But here are a couple of things that I want to mention. For the detector, you need to collect data as close to inference as possible. If your inference is going to be on security cameras, you would want to collect your data set on cameras close to those. For sure, you need to place them correctly and have the correct angles, just like the inference cameras are going to have. Your object in the frame should be close to inference too. So in the example with a gun, you don't need to bring the gun to, like, 10 cm in front of the camera, because that is not realistic. So collect realistic images. For the classifier, you would want to create a lot of crops with your detector. It's a good idea to run the detector with a low threshold to get a lot of false positives. Then try to analyze the images and select groups, like cell phones: if a group shows up often, create a separate class for it. Then you can create a class named "other" and put there all images which do not belong to any created class.
Yeah, I will also touch on deployment briefly. As we have said, it is important for us to have a fast solution, so we can't just train a model in PyTorch or TensorFlow and call it a day. I highly recommend converting your models to TensorRT. It's basically an optimization for Nvidia GPUs, and it will make your model, like, two to five times faster, and you will need less VRAM at inference. Both things are really good when you want to process as many video streams as possible.
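
Going back to the classifier data set for a moment, the grouping step (review the low-threshold crops, give frequent look-alikes their own class, send the rest to "other") might be sketched like this. The function name `assign_classes`, the `min_count` cutoff, and the example tags are all hypothetical; the talk does not say how often a group has to show up before it deserves its own class.

```python
from collections import Counter
from typing import Iterable, List


def assign_classes(tags: Iterable[str], min_count: int = 50,
                   always_keep: Iterable[str] = ("gun",)) -> List[str]:
    """Turn per-crop review tags into classifier training labels.

    tags        -- one human-assigned tag per detector crop
    min_count   -- assumed cutoff: tags seen at least this often stay a class
    always_keep -- tags that stay a class however rare (the target object)
    """
    tags = list(tags)
    counts = Counter(tags)
    kept = {tag for tag, n in counts.items() if n >= min_count}
    kept |= set(always_keep)
    # Everything that did not earn its own class is pooled into "other".
    return [tag if tag in kept else "other" for tag in tags]


if __name__ == "__main__":
    # Toy review result for crops produced with a low detector threshold.
    reviewed = (["gun"] * 3 + ["cell phone"] * 5 + ["umbrella"] * 4
                + ["wallet", "remote"])
    labels = assign_classes(reviewed, min_count=4)
    print(Counter(labels))
```

With these toy numbers, "cell phone" and "umbrella" become their own classes, "gun" is kept regardless of count, and the single "wallet" and "remote" crops end up in "other".
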
With PyTorch, you can also have a look at TorchScript, but basically TensorRT is always better, I mean faster. So have a look at TensorRT. The second thing I recommend is having a serving solution, serving software such as Triton Inference Server, TorchServe, or maybe TensorFlow Serving. Basically, it's like a backend where you store your model and then send requests with images to process. That will help you with scaling your deployed solution.
And let's have a look at a couple of examples, so you know where to start. As the first neural network, the detector, I often use YOLO models. You can have a look at YOLOv8. There are several different models, from the nano version to extra large, so you can really find what you're looking for. And what is important, YOLO models convert to TensorRT really easily, so keep an eye on that. As our second model, the classifier, I often start with EfficientNet, maybe EfficientNet version two. It also has different sizes, around eight, so you can choose.
All right, let's sum up. When you need to deploy an object detection solution and you need to increase precision without sacrificing speed, try using a classifier as your second model, which runs on the detector's crop when it finds anything. Thanks for your attention.

Argo Saakyan

Computer Vision Engineer / Researcher

Argo Saakyan's LinkedIn account


