Abstract
Drones with mounted cameras provide significant advantages over fixed cameras for object detection and visual tracking scenarios. Given their recent adoption in the wild and the latest advances in computer vision models, many aerial datasets have been introduced.
In this talk, we’ll explore recent advances in object detection, comparing the challenges of natural images with those recorded by drones. Given the successes achieved by pretraining image classifiers on large datasets, and transferring the learned representations, a set of object detectors fine-tuned on publicly available aerial datasets will be presented and explained. We’ll highlight existing libraries that mitigate the cost of training large models from scratch, by including pretrained model weights and model variants found in the literature. Both Convolutional Neural Networks and the newly developed Transformers applied to vision will be covered and compared, outlining the main features of each architecture. The presentation will be accompanied by code snippets for aiding understanding and delivering practical examples.
This is aimed at a general audience familiar with Python. Knowledge of Computer Vision is a plus but not a requirement as we’ll introduce the necessary concepts. We’ll ground the presented model architectures and libraries on the task of object detection applied to aerial datasets and demonstrate that state-of-the-art methods are within everyone’s reach.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Eduardo Dixo, I'm a senior data scientist at Continental, and today I'm going to talk to you about object detectors using CNNs and transformers applied to images recorded by drones. First I'm going to introduce the task of object detection and also the dataset that we'll be using. Next we'll see some common CNN-based architectures like Faster R-CNN and RetinaNet, before discussing the transformer and seeing how the Detection Transformer performs on our dataset.
So let's begin by introducing the task of object detection. Object detection can be regarded as follows: given an input image, we want to find all the objects that are present in that image. We need to spatially locate them using bounding boxes, and we also need to classify them into a set of predefined classes. If we compare object detection with image classification, in image classification we usually have a single main target, whereas in object detection we may have a varying number of objects present in the image, with different poses and different scales, and this makes the task more challenging than image classification.
For example, the dataset that we are going to use is the VisDrone dataset, which contains nearly 6,000 training images and 500 validation images. It contains ten categories, of which we are only interested in cars: we are going to build an object detector for a single class, cars. This dataset is interesting because it records images under different conditions: different weather, different lighting, different object densities in the scenes, and different object scales. There are some fast-motion artifacts caused by the movement of the cars or the movement of the drone during flight, and the bounding boxes are also annotated for occlusion and truncation. Applications of such an object detector include road safety, traffic monitoring, or even driving assistance, such as finding free parking slots. First, let's make a distinction between one-stage and two-stage object detectors.
Two-stage object detectors contain a region proposal network that outputs high-confidence region proposals that should contain an object. It is not concerned with the class of the object, only with whether there is an object or not. Then the object detector head, which typically performs bounding box regression to find the position of the object and object classification to find its class, can attend to these proposed regions. By doing so, it works with a much smaller set of candidate regions that might contain an object, which eliminates many of the false positives we would have otherwise. A one-stage detector, on the other hand, generates a dense sampling of possible object locations: it generates many candidate object locations with different shapes and aspect ratios and processes them directly to learn the class labels and bounding boxes. The first model that we are going to discuss is the Faster R-CNN.
Faster R-CNN is a two-stage object detector that employs two models: a region proposal network and a classifier head that performs bounding box regression and object classification. We will start by following the typical data flow of an image as it goes through the architecture. The image goes through the backbone, whose goal is to extract high-level semantic feature maps from the image that will be useful later for the region proposal network and for the classifier. This can typically be achieved by any off-the-shelf convolutional architecture like ResNet or VGG. As the image goes through these convolutional layers it gets downsampled, so the feature map has a smaller width and height but much more depth, meaning that the feature map of the last stage of the backbone has many channels. Next we have the region proposal network, which predicts the object bounds as well as an objectness score, meaning whether there is an object or not. It is a fully convolutional network: it receives as input the feature maps from the backbone and slides a window over them. At each position of the sliding window it generates k anchor boxes, where the number of anchor boxes is parameterized by k, and it has two sibling outputs: one with two times the number of anchor boxes, for classifying each anchor into foreground or background, and the other with four times the number of anchor boxes, for the bounding box coordinates.
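As a rough illustration of these two sibling outputs (not the exact Detectron2 implementation; the anchor count and channel sizes are assumptions for the example):

```python
import torch
import torch.nn as nn

# Minimal sketch of a Faster R-CNN-style RPN head.
# k = anchors per spatial location, in_channels = backbone feature depth (both illustrative).
k, in_channels = 9, 256

shared_conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
objectness = nn.Conv2d(256, 2 * k, kernel_size=1)   # 2k scores: foreground vs. background per anchor
bbox_deltas = nn.Conv2d(256, 4 * k, kernel_size=1)  # 4k values: box coordinate offsets per anchor

features = torch.randn(1, in_channels, 50, 50)      # a feature map from the backbone
hidden = torch.relu(shared_conv(features))
scores, deltas = objectness(hidden), bbox_deltas(hidden)
print(scores.shape, deltas.shape)                   # (1, 18, 50, 50), (1, 36, 50, 50)
```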
Finally, we now have a set of regions proposed by this region proposal network module. In a very naive way, we could simply crop the image using these region proposals and feed the crops into another classifier just to get the object class. However, we want to make this end to end and reuse the feature maps we have already computed with the backbone. To do so, we map the proposals of the region proposal network onto the feature maps using a region-of-interest pooling layer, which extracts a fixed-size feature map for each of these proposals from the backbone feature map. The reason these are fixed size is that we are going to use fully connected layers, which expect a fixed-size input. Then we have the classifier that predicts the object class as well as the bounding box coordinates.
We are going to use the Detectron2 library, which is a PyTorch-based deep learning framework for object detection and also semantic segmentation, and we are going to use Faster R-CNN with a ResNet-50 backbone and a feature pyramid network (FPN). The reason we use the feature pyramid network is that our dataset contains objects at very different scales: we have small cars and also large cars that we want to detect, depending on the altitude at which the drone is flying. By using the feature pyramid network we can improve multi-scale object detection, because the goal of the FPN is to build high-level semantic feature maps across all the pyramid levels from a single image at a single resolution. This is done by merging the bottom-up pathway, which is the feature maps from our CNN backbone, with upsampled maps from the top-down pathway through lateral connections in the feature pyramid network architecture. For training the Faster R-CNN, the first step is to register our dataset.
We do this so that Detectron2 knows how to obtain it. If we already have the annotations in the JSON COCO format, we can use register_coco_instances directly. In this case we have prepared the annotations in this format, so we can use register_coco_instances, and we also pass the base path of the images so it knows where to fetch them from.
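In code, registering a dataset annotated in COCO JSON format is a one-liner; the dataset names and paths below are placeholders:

```python
from detectron2.data.datasets import register_coco_instances

# Dataset names and paths are illustrative placeholders.
register_coco_instances("visdrone_cars_train", {}, "annotations/train.json", "images/train")
register_coco_instances("visdrone_cars_val", {}, "annotations/val.json", "images/val")
```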
Next, Detectron2 uses a key-value config system based on YAML files that already provides common functionality and operations. If we require more advanced features, we can drop down to the Python API or derive from a base config file and implement the attributes. Here, what we do is first load the default configuration, then inherit from the configuration file of the model that we want to fine-tune. We specify the training and test datasets that we registered previously, we specify the number of workers for the multiprocessing part, and we load the pretrained model weights from the Detectron2 model zoo. Then we have the learning rate, the maximum number of iterations, the batch size, and the steps at which to decay the learning rate. All of these are very important parameters that we should tune to get the best metrics, but also to squeeze the best performance out of the GPU. Finally, we specify the number of classes for this particular architecture, which is one, because we are only interested in detecting cars.
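A sketch of such a configuration for Faster R-CNN with a ResNet-50 FPN backbone; the dataset names and hyperparameter values are illustrative, not necessarily the exact ones used in the talk:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()  # load the default configuration
# Inherit from the config of the model we want to fine-tune.
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("visdrone_cars_train",)
cfg.DATASETS.TEST = ("visdrone_cars_val",)
cfg.DATALOADER.NUM_WORKERS = 4
# Pretrained weights from the Detectron2 model zoo.
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.BASE_LR = 0.00025          # learning rate (illustrative)
cfg.SOLVER.MAX_ITER = 20000           # maximum number of iterations (illustrative)
cfg.SOLVER.IMS_PER_BATCH = 4          # batch size (illustrative)
cfg.SOLVER.STEPS = (15000,)           # iterations at which to decay the learning rate
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1   # single class: car
```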
Finally, we can launch the training using the DefaultTrainer class, which provides standard training logic out of the box. If we needed to, we could also implement our own Python training loop or subclass this DefaultTrainer. Since we are not loading from a checkpoint, we pass resume=False.
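Launching training then looks roughly like this, assuming the cfg object from the sketch above:

```python
from detectron2.engine import DefaultTrainer

trainer = DefaultTrainer(cfg)            # standard out-of-the-box training logic
trainer.resume_or_load(resume=False)     # not resuming from a checkpoint
trainer.train()
```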
Now let's take a look at a one-stage detector. RetinaNet is a powerful one-stage detector that employs the feature pyramid network we have seen before, which helps with multi-scale detection of the objects, and also two sibling networks, one for classification and the other for bounding box regression. One-stage detectors were typically regarded as being faster than two-stage ones but lagging behind them in accuracy. The authors of RetinaNet attributed this to the high class imbalance between foreground and background that may occur. The reason is that, if you remember, one-stage detectors sample a large set of candidate regions; many of them will be background and easy negatives that do not contribute a useful learning signal for the network, and they can even overwhelm the training loss.
So what they propose is a novel loss called the focal loss, which adds a modulating factor to the standard cross-entropy loss and down-weights the well-classified examples so that the model can focus more on the hard examples.
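A minimal sketch of the binary (sigmoid) focal loss used by RetinaNet, written here directly from the paper's formula rather than taken from any particular library:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights well-classified examples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```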
For RetinaNet we also use a ResNet-50 backbone, for comparison with the Faster R-CNN, and we again use the Detectron2 library. Registering the dataset requires no changes, and launching the training also requires no changes, but we need to change the configuration file.
In this case we need to inherit from the appropriate model configuration and load the appropriate weights from the model zoo, and for setting the number of classes we need to access a different attribute of the config, MODEL.RETINANET.NUM_CLASSES.
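The config changes, sketched against the same placeholders as before:

```python
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
cfg.MODEL.RETINANET.NUM_CLASSES = 1   # note the RetinaNet-specific attribute
```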
After training both models, we see that they both have good COCO evaluation metrics. We are using average precision, which basically penalizes missing detections as well as having too many duplicate detections for the same object. The average precision is very similar for both models: RetinaNet is better at detecting larger objects but worse at smaller objects, yet looking at the overall average precision they are evenly matched, and the inference results look similar as well.
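These numbers come from Detectron2's COCO evaluation; a sketch of how one might run it on the registered validation split, assuming the cfg and trainer objects from the earlier sketches:

```python
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

evaluator = COCOEvaluator("visdrone_cars_val", output_dir="./output")
val_loader = build_detection_test_loader(cfg, "visdrone_cars_val")
print(inference_on_dataset(trainer.model, val_loader, evaluator))  # reports AP, AP50, APs/APm/APl, ...
```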
Another thing that is commonly employed in computer vision is data augmentation, to aid the generalization of the network. The reason is that we want our object detector to work under different lighting, viewpoints, scales, et cetera. So we can define an augmentation policy that bakes in these transformations, and we pass our dataset through this augmentation policy, enriching the dataset that we then use for training our model. In this example we have a horizontal flip; on the left we can see the augmentation policy used, with some random brightness, random saturation, and random contrast. To use this augmentation policy we use the DatasetMapper in Detectron2, which takes a dataset and maps it into the format used by the model, a dictionary with keys such as image and instances. So we read the image, transform it using the augmentation policy we have defined, take care of transforming the bounding boxes as well, and then generate the data in the format the model expects.
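A sketch of wiring an augmentation policy into training via Detectron2's DatasetMapper, by overriding the trainer's data loader; the specific transforms and their ranges are illustrative:

```python
import detectron2.data.transforms as T
from detectron2.data import DatasetMapper, build_detection_train_loader
from detectron2.engine import DefaultTrainer

class AugTrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        mapper = DatasetMapper(
            cfg,
            is_train=True,
            augmentations=[
                T.RandomFlip(prob=0.5, horizontal=True),  # horizontal flip
                T.RandomBrightness(0.8, 1.2),
                T.RandomSaturation(0.8, 1.2),
                T.RandomContrast(0.8, 1.2),
            ],
        )
        # The mapper reads the image, applies the transforms (bounding boxes included)
        # and returns the dict format the model expects (image, instances, ...).
        return build_detection_train_loader(cfg, mapper=mapper)
```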
But we are not limited to the augmentations available in Detectron2. We can also integrate external libraries like Albumentations or Kornia; these libraries have a very large collection of transformations that are not readily available in Detectron2, like RandomSunFlare, which we can also use.
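As a rough sketch of what an Albumentations pipeline with bounding-box support looks like (the chosen transforms and placeholder data are illustrative; wiring it into Detectron2 still requires a custom mapper):

```python
import numpy as np
import albumentations as A

transform = A.Compose(
    [A.RandomSunFlare(p=0.1), A.RandomBrightnessContrast(p=0.5)],
    bbox_params=A.BboxParams(format="coco", label_fields=["category_ids"]),
)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder image
bboxes = [[100, 120, 50, 30]]                     # one COCO-format box: x, y, width, height
out = transform(image=image, bboxes=bboxes, category_ids=[0])
augmented_image, augmented_boxes = out["image"], out["bboxes"]
```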
One comment is that we used data augmentation for training the Faster R-CNN and the RetinaNet, but we didn't see improvements, even when training for more iteration steps.
Now we will discuss transformers. The transformer was originally proposed as a sequence-to-sequence model for machine translation and is now a standard in natural language processing, but it has also found its way into computer vision and other tasks. It's a very general-purpose architecture that lacks the inductive biases of CNNs, for example locality and translation invariance, but given a large enough amount of data it can learn these from the data and perform on par with, or even surpass, CNNs.
The vanilla transformer uses an encoder and a decoder. The encoder and the decoder each have two main modules, multi-head self-attention and a feed-forward network, and around each module we employ a residual connection and layer normalization. The decoder also uses cross-attention: in cross-attention the keys and values come from the encoder and the queries come from the decoder. When we talk about the differences between applying transformers in NLP and in vision, there are differences in scale and resolution. Scale, because in NLP the words serve as the basic elements of preprocessing, while in object detection the objects may vary in scale, so they may be composed of different numbers of pixels; and resolution, because images are comprised of a very large number of pixels. Since self-attention is central to the transformer, let's see what makes it so appealing compared to other layers.
In the table on the bottom left, t stands for the sequence length and d for the representation dimensionality of each element of the sequence. We see that self-attention is more parameter efficient than fully connected layers and better at handling variable input sizes. Compared to recurrent layers, it is also more parameter efficient when the sequence length is smaller than the representation dimensionality. Compared to convolutional layers: for convolutional layers to achieve a global receptive field, meaning that every pixel interacts with every other pixel, we typically need to stack many of them on top of each other, whereas in self-attention all parts of the sequence interact with each other within a single layer. Let's take a look at how self-attention works.
Self-attention relates different positions of a sequence to compute a new representation of that sequence. We feed it as input a sequence z, in this case of size t and dimension d, and we compute three matrices: the queries, keys, and values. We do so by multiplying the input with a matrix U_qkv and slicing along the last dimension, which has size three times the head dimension; this gives us the queries, keys, and values. Next, we compute the dot product between the queries and keys (the queries and keys must have the same dimension) and divide by a scaling factor to alleviate vanishing gradient problems. We apply a softmax in a row-wise manner, and this gives the attention matrix, which has size t by t, so it is quadratic in the size of the input sequence, which is one of the bottlenecks of the transformer. Then we multiply this by V, our value matrix, to retrieve the final output.
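A compact sketch of this computation for a single attention head; t, d, and the head dimension are illustrative sizes:

```python
import torch

t, d, d_head = 10, 64, 32                  # sequence length, model dim, head dim (illustrative)
z = torch.randn(t, d)                      # input sequence
U_qkv = torch.randn(d, 3 * d_head)         # single projection producing queries, keys and values

q, k, v = (z @ U_qkv).split(d_head, dim=-1)             # slice the last dimension into Q, K, V
attn = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)   # t x t attention matrix (row-wise softmax)
out = attn @ v                                          # weighted sum of the values
```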
However, the transformer doesn't use this plain self-attention; it uses a generalization of it called multi-head self-attention. Multi-head self-attention is an extension of self-attention in which we run k self-attention operations in parallel: we run many self-attention heads in parallel, concatenate their outputs, and then apply a linear projection back to the dimension d, so as not to blow up the dimensionality.
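In PyTorch this is available out of the box; a sketch of self-attention with eight heads and illustrative sizes:

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 256, 8, 100   # illustrative sizes
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

z = torch.randn(1, seq_len, d_model)        # (batch, sequence, dimension)
out, weights = mha(z, z, z)                 # self-attention: queries, keys and values all come from z
print(out.shape)                            # torch.Size([1, 100, 256]) -- projected back to d_model
```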
Let's now revisit the transformer after having seen how self-attention works. The original transformer has an encoder and a decoder, but we can also use only a part of the architecture. For example, architectures that only use the encoder part, like BERT, are important when we only want a global representation of the sequence and want to build a classifier on top of it, for example for sentiment analysis. Architectures that only use a decoder are used for language modeling, like GPT-2. And we also have architectures that use both the encoder and the decoder, like the Detection Transformer that we'll see next. Another important fact is that self-attention is invariant to the position of the tokens, so it is very common to add positional encodings to the input so that the model can reason about the positions of the parts of the sequence during self-attention in the encoder and decoder blocks.
Now we are going to talk about the Detection Transformer (DETR). The Detection Transformer is a very simple architecture based on a CNN backbone and a transformer encoder-decoder. We feed it an image, the image goes through the CNN backbone, and it generates a feature map with a smaller width and height but a much larger number of channels. Now we have this tensor of width, height, and channels, but we want to feed it into the transformer encoder, which expects a sequence. The way we do this is by flattening the spatial dimensions of the input, multiplying the height and width, and then we can feed it into the transformer encoder.
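The flattening step in code, with an illustrative backbone output shape:

```python
import torch

features = torch.randn(1, 2048, 25, 34)            # (batch, channels, H, W) from the CNN backbone (illustrative)
sequence = features.flatten(2).permute(0, 2, 1)    # -> (batch, H*W, channels): a sequence of H*W tokens
print(sequence.shape)                              # torch.Size([1, 850, 2048])
```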
Then we have the transformer decoder, which takes as input these object queries, which are learned by the model. The object queries determine the number of objects we can detect in an image, so their number must be set larger than the largest number of objects we expect in an image, to give us some slack. They learn to attend to specific areas and specific bounding box sizes in an image. The decoder is also conditioned on the encoder output, and we predict the object class and the bounding box through parallel decoding, so not in an autoregressive way: we output them all in parallel. We are treating the object detection problem as direct set prediction, so we need an appropriate loss for that. They use a bipartite matching loss based on the Hungarian algorithm, which is permutation invariant and forces a unique assignment between the ground-truth and the predicted objects.
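The Hungarian matching itself can be sketched with SciPy; the cost matrix here is random and purely illustrative (in DETR it combines classification and box costs):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 100, 5
cost = np.random.rand(num_queries, num_gt)      # cost[i, j]: cost of matching prediction i to ground truth j
pred_idx, gt_idx = linear_sum_assignment(cost)  # each ground truth gets exactly one prediction
# Predictions not in pred_idx are trained to output the "no object" class.
```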
We are going to use the Hugging Face Transformers library, which contains many transformer models; they recently added vision transformers, such as the Vision Transformer (ViT) for image classification and the Detection Transformer for object detection. We are going to use the variant based on a ResNet-50 backbone with dilated convolutions: the dilated convolution increases the resolution of the feature map by a factor of two at the expense of more computation, but it helps with detecting small-scale objects. Hugging Face provides very comprehensive documentation that also explains the internals of the model, and there are example notebooks by Niels Rogge, linked at the bottom of this slide, that explain how to fine-tune the Detection Transformer. The library provides a feature extractor used for preprocessing the input to the model and for post-processing its output into the COCO annotation format, for example for running the COCO evaluation metrics. We also have the DetrForObjectDetection model, which exposes the logits and the prediction boxes, and a DetrConfig that can be used for instantiating a DetrForObjectDetection model from a configuration.
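A sketch of loading the dilated-ResNet-50 DETR checkpoint from Hugging Face and running it on an image; the checkpoint name is the public facebook/detr-resnet-50-dc5 model, and the image path is a placeholder:

```python
from PIL import Image
from transformers import DetrFeatureExtractor, DetrForObjectDetection

feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50-dc5")
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50-dc5",
    num_labels=1,                      # single class: car
    ignore_mismatched_sizes=True,      # replace the COCO classification head
)

image = Image.open("example_drone_frame.jpg")      # placeholder image path
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)                          # outputs.logits and outputs.pred_boxes
```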
The modifications that we make compared to the notebook are that we use the ResNet-50 with dilated convolutions instead of the plain ResNet-50, we set the maximum size of the image to 1100 so as not to hit out-of-memory errors, and we use a smaller batch size of two instead of four, because on a V100 GPU we would get out-of-memory errors otherwise. After training the Detection Transformer on our dataset, we see that its average precision is quite poor compared to the CNN-based object detectors we have seen previously. The model is able to detect large objects, with a fairly good average precision for large objects, but the average precision is very small for small and medium objects, which can be attributed to the Detection Transformer not being well suited for these small-scale object detection problems. Just as feature pyramid networks helped CNNs address the multi-scale object detection problem, similar approaches could also help improve the Detection Transformer further. In the inference results we see some duplicate detections, which could probably be removed with non-maximum suppression, and we also have some missed detections. So how can we improve these results further? We can, for example, scale the backbone: in all of these experiments we used a ResNet-50, but we could use a larger backbone like a ResNet-101. The data augmentation we tried didn't improve our results, but we could tune the probabilities or change the augmentation transformations to see whether we can get better results. We also now have more publicly available datasets recorded by drones, like MEVA, UAVDT, and so on, and we could use these to build a larger dataset and see if we can get better results from it. Also, we only used static images for the object detection part, but if we think about video object detection, we can exploit the temporal cues across different frames to reduce the number of false positives.
We also have different transformer architectures, for example the Swin Transformer or the Focal Transformer, that could be tested to see if they provide better results. To conclude, we saw that CNNs make for very powerful baselines: we used off-the-shelf pretrained CNN architectures, Faster R-CNN and RetinaNet, and got very good average precision results on VisDrone for detecting cars. Transformer architectures are being increasingly used in research and practice, and we can see that they are being added to mainstream libraries like Hugging Face. The Detection Transformer, for example, is better suited for medium to large objects, but developments similar to the feature pyramid network used for CNNs could also help it. Transformers will continue to be applied to downstream tasks like object detection, image classification, and image representation learning, and we can see many research papers coming from these areas. Last but not least, transformers make for a unifying framework across different fields. Before, we encoded inductive biases by hand into CNNs and LSTMs; the transformer, on the other hand, is a very general-purpose architecture that lacks these inductive biases but can learn them from large-scale data. It has given very good results in natural language processing and is now also producing state-of-the-art results in vision, so it can maybe unify both fields and also bring together the practitioners and researchers from both areas. This concludes my presentation. I want to thank you for listening.