Object Recognition: an introduction and overview (with Tensorflow)
Introduction
Although object recognition was a relatively advanced task until recently, advances in the field of deep learning have made the task a lot easier to tackle, not to mention faster in terms of performance. Current state-of-the-art techniques in machine learning are fast enough to analyze videos and streams in real time, and they don't require extremely large datasets to train the neural networks behind them. In fact, as we point out in these guides, many pre-trained models are available on the internet and can be readily incorporated into the object detection model you wish to build.
A typical object detection approach
Typical object detection frameworks work as follows. The first step is to generate regions of interest using a specialized algorithm. Such regions consist of bounding boxes around the objects of interest, and multiple instances of them typically cover the entire image. After the bounding boxes are created, visual features are extracted for each box. In this step the model checks each box for objects of the predefined classes, based on their visual features. Lastly, overlapping bounding boxes are combined by the model into distinct bounding boxes, a step called “non-maximum suppression”.
Object detection versus image classification
Although these two concepts sound pretty similar, the way they process images is quite different. The use cases differ as well. An image classifier is used to verify whether an image belongs to a certain category. For instance, it will identify whether the image shows a ‘cat’ or a ‘dog’. An object detection algorithm, on the other hand, is designed to spot the location of various objects in a target image; it may spot two cats and three dogs in a single image.
The two concepts are not mutually exclusive. If your goal is to check whether an image belongs to a certain category, sometimes the object in question is too small relative to the full image. In such cases you may want to do object detection first, and apply image classification afterwards.
Take for instance circuit boards and detecting whether they feature any defects. At first, this might seem like a classification problem, given that you have to split the results into correct or defective. However, the defects present on the board can be very small, so you are better off using object recognition and building a dataset to train it on beforehand.
An image classification model is designed to generate image features using various deep learning methods. Naturally, the resulting features are aggregates of the original picture, which is not always what you are interested in. Object detection instead allows you to investigate the image in greater detail and look for patterns at a more granular level.
Data necessary for training object recognition models
Data is required to create an object recognition model, and not just any kind of data. Labelled data in particular is necessary, consisting of images that feature pre-made bounding boxes, along with labels and coordinates. When you have an image, you need to provide the model with the x and y coordinates as well as the class of the object you are interested in, so that the model knows which type of object is located in which part of the image.
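As an illustration, a single labelled example might look like the following. This is a minimal sketch assuming a plain Python dictionary; in practice the format depends on your tooling (Pascal VOC XML, COCO JSON, TFRecord and so on), and the file name and coordinates below are made up.

```python
# Hypothetical annotation for one training image: each object gets a class
# label plus the pixel coordinates of its bounding box (top-left and bottom-right).
annotation = {
    "filename": "street_scene_001.jpg",
    "width": 1280,
    "height": 720,
    "objects": [
        {"class": "dog", "xmin": 104, "ymin": 212, "xmax": 380, "ymax": 610},
        {"class": "cat", "xmin": 720, "ymin": 305, "xmax": 890, "ymax": 540},
    ],
}
```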
Unsurprisingly, a question pops up every time an object recognition problem is discussed: How many labeled images do you need in order to properly train the network? While this is indeed a good question, it’s perhaps more important to ask about the scope of the model and think about how the model is going to be used.
Naturally, it is always recommended to have a sizable dataset to work with, preferably with more than one hundred or even one thousand labeled images. Not only that, but they also have to be representative images, and the numbers above are per one object class only. Furthermore, the term representative in this case means that the images should fit with the type of scenarios and use cases you have in mind at the beginning of the project.
As an example, a traffic sign detection model will need pictures of the same signs in different weather conditions, otherwise your application might fail to perform as expected. Just like with every other machine learning algorithm, do not expect magical results from your models, because they are only as good as the data they were trained on.
Generating region proposals
In order for the model to spot areas of interest in an image, several different algorithms can be used. One of them is named “selective search”, and it is defined as a clustering-based approach. What this means is that the algorithm tries to group pixels according to a pattern and then generate region proposals based on the resulting clusters.
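If you want to experiment with selective search directly, OpenCV ships an implementation in its contrib modules. The snippet below is a minimal sketch, assuming the opencv-contrib-python package is installed and a sample image named input.jpg exists.

```python
import cv2

# Load an image and hand it to OpenCV's selective search implementation.
image = cv2.imread("input.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # trade some recall for speed

# Each proposal is an (x, y, width, height) rectangle; thousands are typical.
rects = ss.process()
print(f"Generated {len(rects)} region proposals")

# Keep only the first few hundred proposals to limit the downstream work.
proposals = rects[:500]
```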
There are other approaches as well, including some that try to extract much more complex visual features from the image in order to generate regions. In addition, you can also opt for a brute-force algorithm that does away with the analysis and simply scans an image from top to bottom in as much detail as possible. This approach does not take into account image features, however.
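The brute-force option usually takes the form of a sliding window. Here is a minimal sketch of the idea; the window size and stride are arbitrary values chosen for illustration.

```python
def sliding_windows(image_width, image_height, window_w=128, window_h=128, stride=32):
    """Yield every (x, y, w, h) window position, ignoring the image content entirely."""
    for y in range(0, image_height - window_h + 1, stride):
        for x in range(0, image_width - window_w + 1, stride):
            yield (x, y, window_w, window_h)

# Even a modest 640x480 image produces a large number of candidate regions.
print(sum(1 for _ in sliding_windows(640, 480)))
```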
As expected, there is a trade-off to consider when picking a region proposal method, namely the total number of regions versus the computational complexity. Common sense dictates that with more regions you get a better chance of spotting the desired object, but generating and evaluating them is computationally expensive. Not only that, but you might also lose the ability to perform object detection in real time, which is a major downside for certain use cases.
Even so, there are scenarios where this approach can work if you know the details beforehand. As an example, you can considerably reduce the number of ROIs while looking for pedestrians in an image, since they typically have an aspect ratio of around 1.5, which allows you to disregard all the other ratios and considerably speed up your model’s execution time.
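As a hedged illustration, filtering proposals by aspect ratio takes only a few lines once you have the rectangles; the 1.5 target and the tolerance below are assumptions for the pedestrian example, not universal values.

```python
def filter_by_aspect_ratio(rects, target=1.5, tolerance=0.25):
    """Keep only (x, y, w, h) boxes whose height/width ratio is close to the target."""
    kept = []
    for (x, y, w, h) in rects:
        if abs(h / w - target) <= tolerance:
            kept.append((x, y, w, h))
    return kept
```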
Extracting the desired features
Since you are often working with images of different sizes, feature extraction is designed to reduce each image to a fixed set of visual features, thus eliminating the problem. In fact, most image classification models are designed to use very strong visual feature extraction methods.
The goal of such models is to extract these features in order to find out which class an image belongs to, and there are different approaches here as well. You can use histogram methods, deep learning, and even filters, although they all achieve the same thing in the end.
An object detection framework is usually built on top of a pretrained image classification model, because this way you can extract the visual features relevant to your use case from a variety of general datasets. Many such datasets can be found around the web, with MS COCO being one example that allows you to train a model on some generic features. It is recommended to experiment with a couple of different approaches, however, especially if you want to improve the model and make it more relevant to your specific needs.
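One common way to reuse a pretrained classifier as a feature extractor is sketched below, using Keras with MobileNetV2 pretrained on ImageNet; the choice of backbone and the input size are assumptions made for the example.

```python
import numpy as np
import tensorflow as tf

# Load MobileNetV2 without its classification head, so the output is a
# spatial feature map rather than class probabilities.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)

# A dummy batch with one 224x224 RGB image, preprocessed the way the network expects.
image = np.random.randint(0, 255, size=(1, 224, 224, 3)).astype("float32")
image = tf.keras.applications.mobilenet_v2.preprocess_input(image)

features = backbone(image)
print(features.shape)  # (1, 7, 7, 1280): a grid of visual features, not class labels
```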
How non-maximum suppression works
In short, the non-maximum suppression algorithm allows you to combine overlapping detections in an image into a single bounding box. This step is very important if your images typically feature many different objects, but it’s also worth noting that NMS can require a bit of hyperparameter tweaking in order to function as intended with your model.
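TensorFlow ships a ready-made implementation of the algorithm; the following is a small sketch with made-up boxes and scores.

```python
import tensorflow as tf

# Three heavily overlapping detections of the same object, plus one separate box.
# Boxes are [y1, x1, y2, x2] in relative coordinates; scores are detection confidences.
boxes = tf.constant([
    [0.10, 0.10, 0.50, 0.50],
    [0.12, 0.11, 0.52, 0.49],
    [0.09, 0.13, 0.48, 0.51],
    [0.60, 0.60, 0.90, 0.90],
])
scores = tf.constant([0.9, 0.75, 0.6, 0.8])

# Keep at most 10 boxes and suppress any box that overlaps a higher-scoring
# one by more than the IoU threshold.
selected = tf.image.non_max_suppression(
    boxes, scores, max_output_size=10, iou_threshold=0.5
)
print(selected.numpy())  # indices of the surviving boxes, e.g. [0, 3]
```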
Metrics used for evaluation
Object recognition comes with many possible evaluation metrics, but one of the most commonly used is the “mAP”, or “mean average precision”. This metric has a range of 0 to 100, with higher values being better, but it’s worth noting that it is not the same thing as classification accuracy.
In simple terms, this works by giving every bounding box a score that represents the likelihood that it contains an object. After that, a PR curve (which stands for precision-recall curve) is generated for each class by varying the score threshold. The average precision is then calculated as the area under the PR curve. Lastly, the mAP is calculated by computing the AP for each class, which is then averaged over all the available classes.
To determine whether a detection counts, the model looks at the “intersection over union”, also known as IoU or overlap. The IoU must be greater than a certain threshold for the detection to be a true positive. A typical threshold is 0.5, which is why the notation mAP@0.5 is often used to indicate the IoU threshold used in the process.
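For reference, the IoU of two boxes is straightforward to compute. A minimal sketch, with boxes given as (xmin, ymin, xmax, ymax) tuples:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# At mAP@0.5, a detection counts as a true positive only if IoU >= 0.5.
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143, so not a match at 0.5
```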
The Tensorflow Detection API
Let's dive into the inner workings of the SSD and Faster R-CNN algorithms in order to understand how they are implemented in Google’s TensorFlow Detection API.
To begin with, a few basic concepts require explaining before going forward. In particular, it’s important to understand how SSD and Faster R-CNN function at their most basic level, since these are the algorithms that power the Tensorflow Detection API.
Finally, we can discuss how Tensorflow uses these concepts in order to achieve its results. To begin with, the great thing about the Tensorflow Detection API is that it encompasses many of the ideas presented above in a single package, and it also allows you to quickly switch from method to method. By using the API, you can define an object detection model through a set of configuration files, while Tensorflow itself takes care of structuring everything else.
The Protos folder
Since the API supports a plethora of different components, it’s recommended to take a look at the “protos” folder, which is where you can find all the important definitions. These include protos for preprocessing as well as for components such as ssd, eval and faster_rcnn.
The SSD model (Single Shot Multibox Detector)
Researchers from Google published the SSD model back in 2016, and it uses a single deep neural network that combines feature extraction with region proposals in order to identify and detect objects.
To begin with, the image is passed through an image classification network, which computes the feature maps. Multiple default boxes with different scales and aspect ratios are applied to these feature maps, and the features are extracted for every box in one step. Each default bounding box also receives a score for every object category. After that, adjustment offsets are calculated for each box, so that it better fits the ground truth.
To handle different scales for objects, different receptive fields correspond to different feature maps in the convolutional network. Thanks to this, the work is done by a single network, which allows for fast computational speeds, such as 59 frames-per-second for an input of 300 x 300.
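To make the default-box idea more concrete, here is a small sketch that generates box centers and shapes for a single feature map; the scale and aspect ratios are illustrative values rather than the exact ones used by any particular SSD configuration.

```python
import itertools

def default_boxes(feature_map_size, scale=0.2, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate (cx, cy, w, h) default boxes, in relative [0, 1] coordinates,
    for a square feature map of the given size."""
    boxes = []
    for i, j in itertools.product(range(feature_map_size), repeat=2):
        cx = (j + 0.5) / feature_map_size
        cy = (i + 0.5) / feature_map_size
        for ar in aspect_ratios:
            w = scale * ar ** 0.5
            h = scale / ar ** 0.5
            boxes.append((cx, cy, w, h))
    return boxes

# A coarse 8x8 feature map with 3 aspect ratios already yields 192 default boxes.
print(len(default_boxes(8)))
```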
Using different configuration files
Since there are a few important parameters to look at when working with the SSD architecture, we are going to analyze a couple of sample configuration files and go in depth for each one of them.
To begin with, not all classification networks are the same, since each one of them comes with certain strengths and weaknesses. Thus, while the ResNet architecture allows for a high overall accuracy, the Inception v3 network is designed for better object detection at multiple different scales. In addition, there is also MobileNet, which is designed to work with very few computational resources.
To get a better understanding of what you gain or lose by picking one network over the other, you can check its performance on ImageNet, as well as consider the total number of parameters used to train the original dataset. The “feature_extractor” section of the configuration file is where the chosen network is specified.
Secondly, you must also consider the parameters regarding the aspect ratios and the default boxes. The labeled data comes with a variety of aspect ratios and scales for the bounding boxes, but the best results are achieved by tailoring them to your specific use case. In addition, you are also making sure that the network does not perform unnecessary work outside the desired scales and aspect ratios.
These parameters can be tweaked in the “ssd_anchor_generator” section. Keep in mind that extending the range of scales and aspect ratios can improve performance, but only in certain situations and with diminishing returns.
The two other sections that require your attention before training the model are “image_resizer” and “data_augmentation_options”. Obviously, working with large image sizes has downsides in terms of performance, but it helps if you are dealing with small objects that can be difficult to detect. Furthermore, if you are dealing with different scales, data augmentation is also a crucial step in the context of SSD. Lastly, you can also fiddle with the “train_config” section in order to set the batch size and the learning rate. These parameter values depend on how big your dataset is, and it’s important to avoid overfitting. An abridged example of how these sections fit together follows below.
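To tie these sections together, here is an abridged, illustrative excerpt of an SSD pipeline configuration. The exact fields and values vary between model releases, so treat the numbers below as placeholders rather than recommendations.

```
model {
  ssd {
    num_classes: 1
    image_resizer {
      fixed_shape_resizer { height: 300 width: 300 }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2"
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
      }
    }
  }
}
train_config {
  batch_size: 24
  data_augmentation_options {
    random_horizontal_flip { }
  }
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate { initial_learning_rate: 0.004 }
      }
    }
  }
}
```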
The Faster R-CNN model
Unlike SSD, the Faster R-CNN model was developed by Microsoft, and as the name clearly implies, it is based on R-CNN, with added performance. As a short backstory, R-CNN detects objects using a multi-phased approach, and it uses a selective search in order to come up with region proposals. These are then run through a classification network, after which an SVM is utilized to classify the different regions.
On the other hand, Faster R-CNN is an end-to-end approach, since it comes with a Region Proposal Network (RPN) instead of default bounding boxes in order to generate the fixed set of regions. On top of that, this RPN achieves nearly cost-free region proposals by making use of the convolutional features taken from the image classification network. Objectness scores and object bounds are predicted at each position by a fully convolutional network, namely the RPN.
In spite of these differences, SSD and the RPN share a fairly similar layout, since in both cases the bounding box predictions are not pulled from thin air. The RPN takes the feature maps and slides a window across them, and at each location (or anchor) the sliding window produces proposals at various aspect ratios and scales. The results come in the form of adjusted bounding boxes, just like with SSD.
To put it in different words, what an RPN does is guide the network’s attention to regions that show promise or are otherwise interesting. All of the components are combined into a single setup, but the training can be done in multiple passes or end-to-end.
Use cases for Faster R-CNN
There are no significant differences between SSD and Faster R-CNN in terms of usage details. However, it’s worth pointing out that if you are going for raw mAP performance, Faster R-CNN is usually better than SSD, but then again you also have to sacrifice more computing power in the process.
As for the relevant sections regarding Faster R-CNN, you can take a look at the “first_stage_anchor_generator” configuration, where you can find the definitions for the anchors generated by the RPN. If you want to fine-tune the model for smaller objects, you can tweak the “stride” parameters, which control the step size of the sliding window.
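As an illustrative excerpt (values are placeholders, not recommendations), the relevant part of a Faster R-CNN pipeline configuration looks roughly like this:

```
model {
  faster_rcnn {
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
  }
}
```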
Lastly, the developers behind Faster R-CNN recommend using it on smaller datasets, although that’s not necessarily a rule.
Final thoughts
Although these architectures are by far the most popular when it comes to object detection, there are others out there as well, many of which achieve similar results. Not all of them are part of the Tensorflow Detection API yet, but they might be added in the future.