Exploring Object Detection Using R-CNN Models — A Comprehensive Beginner’s Guide (Part 2) | Written by Raghav Bali | February 2024

By thedailyposting.com | February 16, 2024


Object detection model

Object detection is a complex process that helps locate and classify objects within a given image. In Part 1, you familiarized yourself with the basic concepts and general framework of object detection. This article briefly describes some important object detection models, with an emphasis on understanding their main contributions.

Common object detection frameworks emphasize that object detection involves several intermediate steps. Building on this idea, researchers have devised many innovative architectures for the object detection task. One way to differentiate these models is by how they approach the task. Models that use multiple components or stages are called multi-stage object detectors; the region-based CNN (R-CNN) model family is a prime example. Many improvements have since been made, giving rise to architectures that solve the task with a single model; such models are called single-stage object detectors and will be discussed in the next article. For now, let’s take a closer look at some of these multi-stage object detectors.

Region-based convolutional neural network

Region-based convolutional neural networks (R-CNN) were first presented by Girshick et al. in a 2013 paper titled “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.” R-CNN is a multi-stage object detection model that became the starting point for faster and more sophisticated variants in subsequent years. Let’s start with this basic model before examining the improvements achieved by the Fast R-CNN and Faster R-CNN models.

The R-CNN model consists of four main components.

  • Region proposal: Region-of-interest (ROI) extraction is the first and most important step in this pipeline. The R-CNN model uses an algorithm called selective search for region proposals. Selective search is a greedy search algorithm proposed by Uijlings et al. Without going into too much detail, it uses a bottom-up, multiscale iterative approach to identify ROIs: at every iteration, the algorithm groups similar regions until the entire image becomes a single region, with similarity between regions computed from color, texture, brightness, and so on. Selective search produces many false-positive (background) ROIs but has a high recall rate. The list of ROIs is passed to the next step for processing.
  • Feature extraction: The R-CNN network uses a pre-trained CNN such as VGG or ResNet to extract features from each ROI identified in the previous step. Before the regions/crops are passed as input to the pre-trained network, they are reshaped or warped to the required dimensions (each pre-trained network accepts inputs only of a specific size). The pre-trained network is used without its final classification layer. The output of this stage is a long list of tensors, one for each ROI from the previous stage.
  • Classification head: The original R-CNN paper used a support vector machine (SVM) as the classifier to identify the class of the object within an ROI. The SVM is a traditional supervised algorithm that is widely used for classification. The output of this step is a classification label for each ROI.
  • Regression head: This module handles the localization aspect of the object detection task. As explained in the previous section, a bounding box can be uniquely identified by four values: the box’s top-left (x, y) coordinates plus its width and height. The regression head outputs these four values for each ROI.
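The four stages above can be sketched end to end. The sketch below is a minimal mock in NumPy: `propose_regions`, `extract_features`, `classify`, and `regress_box` are hypothetical stand-ins (a real pipeline would use selective search, a pretrained CNN, an SVM, and a learned regressor), but the control flow, with one forward pass per ROI, mirrors the description above.

```python
import numpy as np

def propose_regions(image):
    """Stage 1 (mocked): return candidate ROIs as (x, y, w, h) boxes."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    boxes = []
    for _ in range(5):
        x, y = rng.integers(0, w // 2), rng.integers(0, h // 2)
        boxes.append((int(x), int(y), w // 4, h // 4))
    return boxes

def extract_features(image, box, size=(7, 7)):
    """Stage 2 (mocked): warp the crop to a fixed size and flatten it,
    standing in for a pretrained CNN's feature vector."""
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    # Nearest-neighbour resize to the fixed input size the CNN expects.
    rows = np.linspace(0, crop.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size[1]).astype(int)
    return crop[np.ix_(rows, cols)].ravel()

def classify(features):
    """Stage 3 (mocked): an SVM would score object classes here."""
    return int(features.sum() % 3)      # pretend 3-class problem

def regress_box(features, box):
    """Stage 4 (mocked): a regressor would refine (x, y, w, h) here."""
    return box                          # identity stand-in

image = np.arange(64 * 64, dtype=float).reshape(64, 64)
detections = []
for box in propose_regions(image):
    feats = extract_features(image, box)   # one forward pass *per ROI*
    detections.append((classify(feats), regress_box(feats, box)))
```

Note how the feature extractor runs once per proposal; this per-ROI loop is exactly the bottleneck discussed next.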

For reference, this pipeline is depicted visually in Figure 1. As the figure shows, the network requires a separate forward pass through the pretrained network for each ROI. This is one of the main reasons R-CNN models are slow, during both training and inference; the authors report that training the network took over 80 hours and required a huge amount of disk space. The second bottleneck is the selective search algorithm itself.

Figure 1: Components of the R-CNN model. The region proposal component uses selective search, with a pre-trained network such as VGG for subsequent feature extraction; the classification head uses an SVM, alongside a separate regression head. Source: Author

The R-CNN model is a great example of how different ideas can be leveraged as building blocks to solve a complex problem: even in its original setup, R-CNN made use of transfer learning through its pre-trained feature extraction network.

Although the R-CNN model was slow, it provided a good foundation for later object detection models. The computationally expensive and time-consuming feature extraction step was the main target of the Fast R-CNN implementation, introduced in 2015 by Ross Girshick. Fast R-CNN boasts not only faster training and inference but also an improved mAP on the PASCAL VOC 2012 dataset.

The main contributions of the Fast R-CNN paper can be summarized as follows.

  • Region proposal: In the basic R-CNN model, we saw how the pretrained network extracts features from each of the thousands of ROIs that selective search generates from the input image. Fast R-CNN reorders these steps for efficiency. Instead of running the pretrained network thousands of times (once per ROI), Fast R-CNN processes the entire input image through the network only once; each region proposal from selective search is then projected onto the resulting shared feature map. This reordering of components significantly reduces the computational requirements and removes a major performance bottleneck.
  • ROI pooling layer: The ROIs identified in the previous step can be of any size (as determined by the selective search algorithm), but the fully connected layers that follow accept only fixed-size feature maps as input. The ROI pooling layer (a 7×7 output size is mentioned in the paper) converts these arbitrarily sized ROIs into fixed-size output vectors. It works by first dividing each ROI into a grid of equally sized sections and then taking the maximum value of each section (similar to a max-pooling operation); the output is simply the maximum of each equal-size section. The ROI pooling layer significantly speeds up both inference and training.
  • Multitask loss: In contrast to the two distinct components of the R-CNN implementation (the SVM and the bounding-box regressor), Fast R-CNN uses a single multi-head network. This setup allows the network to be trained jointly for both tasks using a multitask loss function: a weighted sum of the classification loss (for object classification) and the regression loss (for bounding-box regression). The loss function is given as:

Lₘₜ = Lₒ + 𝛾 [u ≥ 1] Lᵣ

where Lₒ is the classification loss, Lᵣ is the regression loss, and the indicator [u ≥ 1] is 1 if the ROI contains an object (its true class u is not background) and 0 otherwise, so background ROIs contribute no regression loss; 𝛾 is a weight balancing the two terms. The classification loss is simply a negative log loss, whereas the regression loss used in the original implementation is a smooth L1 loss.
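This multitask loss can be sketched in a few lines of NumPy. The sketch below assumes softmax class probabilities are already available and uses class 0 for background; the function names are illustrative, not from the paper's code.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(class_probs, true_class, box_pred, box_target, gamma=1.0):
    """L = L_cls + gamma * [u >= 1] * L_loc, with class 0 as background."""
    l_cls = -np.log(class_probs[true_class])       # negative log loss
    indicator = 1.0 if true_class >= 1 else 0.0    # no box loss for background
    l_loc = smooth_l1(box_pred - box_target).sum()
    return l_cls + gamma * indicator * l_loc

probs = np.array([0.1, 0.7, 0.2])        # softmax output over 3 classes
pred = np.array([0.5, 0.5, 2.0, 2.0])    # predicted (x, y, w, h) offsets
target = np.array([0.0, 1.0, 2.0, 2.0])  # ground-truth offsets

loss_fg = multitask_loss(probs, true_class=1, box_pred=pred, box_target=target)
loss_bg = multitask_loss(probs, true_class=0, box_pred=pred, box_target=target)
# A background ROI incurs only the classification term.
```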

The original paper details a number of experiments highlighting performance gains from different combinations of fine-tuned hyperparameters and layers of the pre-trained network. The original implementation used a pre-trained VGG-16 as the feature extraction network; since then, faster and improved backbones such as MobileNet and ResNet have emerged, and these can be swapped in for VGG-16 to further improve performance.
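The ROI pooling operation described earlier can also be sketched concretely. The NumPy version below handles a single-channel feature map for illustration; it is not the paper's implementation, just a minimal demonstration of dividing an ROI into a fixed grid and max-pooling each section.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool an arbitrary-sized ROI of a feature map into a fixed grid."""
    x, y, w, h = roi
    region = feature_map[y:y + h, x:x + w]
    out_h, out_w = output_size
    # Split the ROI into an out_h x out_w grid of (near-)equal sections.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            section = region[row_edges[i]:row_edges[i + 1],
                             col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = section.max()   # max over each section
    return pooled

fmap = np.arange(32 * 32, dtype=float).reshape(32, 32)
out = roi_pool(fmap, roi=(3, 5, 21, 14))   # an arbitrary 21x14 ROI
# The output is always 7x7, regardless of the ROI's size.
```

Because every ROI is reduced to the same fixed shape, a single set of fully connected layers can process all proposals, which is what enables the shared computation Fast R-CNN relies on.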

Faster R-CNN is the last member of this family of multi-stage object detectors, and the most sophisticated and fastest variant of them all. Although Fast R-CNN significantly improved training and inference times, it was still penalized by the selective search algorithm. The Faster R-CNN model, published by Ren et al. in a 2016 paper titled “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” primarily addresses the region proposal step. It builds on the Fast R-CNN network by introducing a new component called the Region Proposal Network (RPN). The full Faster R-CNN network is shown in Figure 2 for reference.

Figure 2: Faster R-CNN consists of two main components: 1) a Region Proposal Network (RPN) to identify ROIs, and 2) a Fast R-CNN-like multi-head network with an ROI pooling layer. Source: Author

The RPN is a fully convolutional network (FCN) that generates ROIs. As shown in Figure 2, the RPN consists of only two layers: a 3×3 convolutional layer with 512 filters, followed by two parallel 1×1 convolutional layers (one each for classification and regression). The 3×3 convolution is applied to the feature-map output of the pre-trained network (whose input is the original image). Note that the RPN’s classification layer is a binary classifier that predicts an objectness score (not an object class). Bounding-box regression is performed by the 1×1 convolutional filters relative to anchor boxes. In the setup proposed in the paper, 9 anchor boxes are used per window, so the RPN produces 18 objectness scores (2×K) and 36 position coordinates (4×K), where K = 9 is the number of anchor boxes. Using an RPN instead of selective search improves training and inference times by orders of magnitude.
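The output-shape arithmetic for the RPN heads can be checked with a tiny sketch. The feature-map size below is a hypothetical example (it depends on the backbone and input resolution); the channel counts follow directly from K anchors per sliding-window position.

```python
# RPN head output shapes for K anchor boxes per sliding-window position.
K = 9                    # anchors per position (e.g. 3 scales x 3 aspect ratios)
feat_h, feat_w = 38, 50  # hypothetical backbone feature-map size

cls_channels = 2 * K     # objectness head: object vs. background per anchor
reg_channels = 4 * K     # regression head: (x, y, w, h) offsets per anchor

# Every feature-map position proposes K anchors, so the total number of
# candidate boxes before score-based filtering and NMS is:
num_anchors = feat_h * feat_w * K
```

With K = 9 this gives the 18 objectness scores and 36 coordinates per position quoted above, and tens of thousands of raw anchors per image, which is why the RPN filters proposals by objectness score before passing them on.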

The Faster R-CNN network is an end-to-end object detection network: unlike the basic R-CNN and Fast R-CNN models, which rely on several independently trained components, Faster R-CNN can be trained as a whole.

This concludes our discussion of the R-CNN family of object detectors. We examined their key contributions to better understand how these networks work.
