Cascade R-CNN

General Information

An Intersection over Union (IoU) threshold is used in object detection to define positives and negatives.
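As a minimal sketch of the quantity being thresholded, IoU for two axis-aligned boxes can be computed as follows. The `(x1, y1, x2, y2)` box format and the function name `iou` are assumptions made for illustration, not something Model Playground prescribes.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

A proposal is then treated as a positive when its IoU with a ground-truth box reaches the chosen threshold.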

If the IoU threshold is low, then the object detector tends to produce many noisy detections.

However, simply increasing the IoU threshold degrades the performance of the detector, for two main reasons:

Overfitting during training due to exponentially vanishing positive samples.
A mismatch between the IoUs for which the detector is trained and the IoUs of the hypotheses it receives at inference.
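The first reason can be illustrated with a small sketch that counts how many proposals still qualify as positives as the threshold rises. The IoU values below are made up purely for illustration and do not come from any real detector.

```python
# Hypothetical IoUs between 12 proposals and their best-matching ground truth.
proposal_ious = [0.32, 0.41, 0.48, 0.52, 0.55, 0.58,
                 0.61, 0.63, 0.66, 0.72, 0.78, 0.91]


def count_positives(ious, threshold):
    """Number of proposals whose IoU reaches the threshold."""
    return sum(1 for value in ious if value >= threshold)


# Positives that remain at each IoU threshold.
counts = {t: count_positives(proposal_ious, t) for t in (0.5, 0.6, 0.7)}
```

With these illustrative values, `counts` shrinks from 9 positives at 0.5 to 6 at 0.6 and 3 at 0.7, which is the kind of shortage that makes a single detector trained at a high threshold prone to overfitting.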

Cascade R-CNN addresses these problems with a multi-stage architecture: a sequence of detectors is trained with increasing IoU thresholds, which makes each successive stage more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of one detector is a good distribution for training the next, higher-quality detector.

In the above diagram, we can see that multiple detector stages are used. In each stage, the distribution of the hypotheses is changed through resampling, and a specialized regressor is trained for each of these distributions.
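The resampling idea can be sketched with a toy simulation. Everything here is an assumption made for illustration: `refine` is a stand-in for a learned box regressor (it simply moves each box halfway toward the ground truth), and the boxes and the thresholds 0.5, 0.6, 0.7 are example values.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def refine(box, gt, step=0.5):
    """Toy regressor (assumption): move each coordinate part of the way to gt."""
    return tuple(c + step * (g - c) for c, g in zip(box, gt))


def run_cascade(proposals, gt, thresholds=(0.5, 0.6, 0.7)):
    """Return (threshold, number of positives) for each stage."""
    boxes = list(proposals)
    history = []
    for threshold in thresholds:
        # Resample: label the current boxes with this stage's threshold.
        positives = [b for b in boxes if iou(b, gt) >= threshold]
        history.append((threshold, len(positives)))
        # The refined boxes become the input hypotheses of the next stage.
        boxes = [refine(b, gt) for b in boxes]
    return history


gt = (10.0, 10.0, 50.0, 50.0)
proposals = [(14.0, 12.0, 56.0, 54.0), (4.0, 6.0, 44.0, 46.0),
             (18.0, 18.0, 60.0, 60.0), (0.0, 0.0, 38.0, 38.0)]
history = run_cascade(proposals, gt)
```

In this toy setup none of the raw proposals reaches an IoU of 0.7, yet the third stage sees four positives at that threshold because the earlier stages have already improved the boxes. A real regressor is learned and far less ideal, so the effect is weaker in practice.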

Model Playground: Parameters

Backbone Network

It is the network used to extract the feature map that is later processed by the other sub-networks. In total, Cascade R-CNN has four stages: one for the Region Proposal Network (RPN) and three for detection. The backbone network is usually a ResNet.

Depth of ResNet Model

This is the depth variant of the ResNet that is used as the feature extractor.

IoU thresholds

The IoU thresholds are used to decide whether a bounding box contains background or an object.

Everything above the upper bound is classified as an object, and everything below the lower bound is classified as background. Values between the lower and upper bounds are ignored.
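A minimal sketch of this two-threshold rule is shown below. The function name, the label encoding (1, 0, -1), the default bounds of 0.3 and 0.7, and the handling of values that fall exactly on a bound are all assumptions for illustration.

```python
OBJECT, BACKGROUND, IGNORED = 1, 0, -1


def assign_label(max_iou, lower=0.3, upper=0.7):
    """Label a box from its highest IoU with any ground-truth box."""
    if max_iou >= upper:
        return OBJECT      # confident match with a ground-truth box
    if max_iou < lower:
        return BACKGROUND  # too little overlap with every ground-truth box
    return IGNORED         # ambiguous: does not contribute to the loss


labels = [assign_label(v) for v in (0.85, 0.50, 0.10)]
```

Here `labels` is `[1, -1, 0]`: the first box is an object, the second is ignored, and the third is background.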

Number of Fully Connected Layers

It defines the number of fully connected layers used in the bounding box regressor. Increasing the number of fully connected layers can increase the accuracy of the model, but it can also cause this sub-network to overfit.

Two fully connected layers are a good starting point.

Weights

The weights used to initialize the network. Here, the network is initialized with the R-50 Cascade COCO weights.

Pre NMS Number of proposals for training

It is the maximum number of proposals to be considered for training before non-maximum suppression (NMS). The proposals are sorted by confidence in descending order, and only the ones with the highest confidence are kept.

Pre NMS Number of proposals for testing

It is the maximum number of proposals to be considered for testing before non-maximum suppression. The proposals are sorted by confidence in descending order, and only the ones with the highest confidence are kept.
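The pre-NMS selection described for training and testing can be sketched as a sort followed by a cut. The `(box, score)` tuple format, the function name, and the limits used in the example are assumptions for illustration only.

```python
def pre_nms_top_k(proposals, k):
    """Keep the k highest-scoring proposals; each proposal is (box, score)."""
    ranked = sorted(proposals, key=lambda p: p[1], reverse=True)
    return ranked[:k]


proposals = [
    ((0, 0, 10, 10), 0.40),
    ((5, 5, 20, 20), 0.95),
    ((8, 2, 30, 18), 0.70),
    ((1, 1, 9, 9), 0.10),
]

# A separate limit can be used for training and for testing.
kept_for_training = pre_nms_top_k(proposals, k=3)
kept_for_testing = pre_nms_top_k(proposals, k=2)
```

A lower limit makes NMS cheaper because fewer boxes need to be compared, at the risk of discarding a useful low-confidence proposal.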

Post NMS Number of proposals for training

It is the maximum number of object proposals to be kept during training after non-maximum suppression.

Post NMS Number of proposals for testing

It is the maximum number of object proposals to be kept during testing after non-maximum suppression.
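For illustration, a plain greedy non-maximum suppression followed by the post-NMS cut might look like this. The `(box, score)` format, the function names, and the suppression threshold of 0.7 are assumptions for this sketch; production implementations are vectorized.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(proposals, iou_threshold=0.7):
    """Greedy NMS: keep a box only if it does not overlap a better one."""
    kept = []
    for box, score in sorted(proposals, key=lambda p: p[1], reverse=True):
        if all(iou(box, other) <= iou_threshold for other, _ in kept):
            kept.append((box, score))
    return kept


def post_nms_top_k(proposals, k, iou_threshold=0.7):
    """Run NMS, then keep at most k of the surviving proposals."""
    return nms(proposals, iou_threshold)[:k]


proposals = [
    ((0, 0, 10, 10), 0.9),
    ((1, 0, 11, 10), 0.8),    # overlaps the first box heavily
    ((20, 20, 30, 30), 0.7),
    ((40, 40, 50, 50), 0.6),
]
final = post_nms_top_k(proposals, k=2)
```

The second box is suppressed by the first, so three proposals survive NMS and the cut keeps the two with the highest scores.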

Stages to Freeze

Freezing stages of a neural network is a technique for reducing computation. Once a stage is frozen, its weights are no longer updated, so no gradients need to be computed for it. Freezing many stages makes training faster, but aggressive freezing can degrade the model and result in sub-optimal predictions.
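The effect of freezing can be sketched in a framework-agnostic way. The `Stage` class, `freeze_first`, and `sgd_step` are hypothetical names invented for this example; a real framework would also skip the gradient computation for frozen stages rather than only skipping the update.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    weights: list
    trainable: bool = True


def freeze_first(stages, n):
    """Freeze the first n stages and leave the rest trainable."""
    for index, stage in enumerate(stages):
        stage.trainable = index >= n


def sgd_step(stages, grads, lr=0.5):
    """Apply one gradient step, skipping frozen stages."""
    for stage, grad in zip(stages, grads):
        if not stage.trainable:
            continue  # frozen: weights stay exactly as initialized
        stage.weights = [w - lr * g for w, g in zip(stage.weights, grad)]


stages = [Stage("stem", [1.0]), Stage("res2", [1.0]), Stage("res3", [1.0])]
freeze_first(stages, 2)
sgd_step(stages, grads=[[1.0], [1.0], [1.0]])
```

After the step only `res3` has changed. Early stages tend to hold generic features, which is why freezing them is usually the first choice when training time matters.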

Pooler Resolution

It is the size to which each proposal is pooled before being fed to the box predictor. In Model Playground, the default value is 7.

Pooler Sampling Ratio

After the Regions of Interest (RoIs) are extracted from the feature map, they must be resized to a fixed dimension before being fed to the fully connected layers that perform the actual object detection. RoI Align is used for this: it resizes each RoI by sampling the feature map at points placed on a regular grid. The number of sampling points is defined by the Pooler Sampling Ratio.
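A simplified, single-channel sketch of RoI Align shows how the pooler resolution and the sampling ratio interact. It assumes that the sampling ratio is the number of sample points per bin along each axis, that feature values sit at integer coordinates, and that the RoI is already expressed in feature-map coordinates; real implementations differ in these details.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D list at a fractional (y, x) position."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, resolution, sampling_ratio):
    """Pool an RoI (x1, y1, x2, y2) to a resolution x resolution grid."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / resolution
    bin_h = (y2 - y1) / resolution
    pooled = []
    for i in range(resolution):
        row = []
        for j in range(resolution):
            total = 0.0
            # Sample a regular sampling_ratio x sampling_ratio grid per bin.
            for sy in range(sampling_ratio):
                for sx in range(sampling_ratio):
                    y = y1 + (i + (sy + 0.5) / sampling_ratio) * bin_h
                    x = x1 + (j + (sx + 0.5) / sampling_ratio) * bin_w
                    total += bilinear(feature, y, x)
            row.append(total / sampling_ratio ** 2)  # average the samples
        pooled.append(row)
    return pooled


# Toy 4x4 feature map whose value equals the column index.
feature = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(feature, roi=(0.0, 0.0, 2.0, 2.0),
                   resolution=2, sampling_ratio=2)
```

A higher sampling ratio averages more points per bin, which gives a smoother estimate at a higher cost. The output size is controlled only by the resolution.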
