DeepLabv3+ is a semantic segmentation architecture that builds on DeepLabv3 by adding a simple yet effective decoder module to enhance segmentation results.

Repeated downsampling in a CNN shrinks the feature-map resolution, which lowers prediction accuracy and loses boundary information in semantic segmentation. At the same time, aggregating context around a feature helps segment it better, which is what atrous convolutions accomplish. DeepLabv3+ addresses both issues.

Downsampling is widely adopted in deep convolutional neural networks (CNNs) to reduce memory consumption while preserving transformation invariance to some degree.

Atrous (dilated) convolution is a tool for adjusting the effective field of view of a convolution. It modifies the field of view through a parameter termed the atrous rate. It is a simple yet powerful approach for enlarging the field of view of filters without increasing the number of parameters or the amount of computation.

An atrous/dilated convolution has a wider field of view with the same number of parameters as a normal convolution.
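
To make the parameter-count claim concrete, here is a minimal sketch in TensorFlow/Keras (the framework is an assumption; Model Playground's internals are not shown here) comparing a standard 3x3 convolution with its dilated counterpart:

```python
import tensorflow as tf

# Two 3x3 convolutions: one normal, one with atrous rate 2.
normal = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same")
atrous = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same",
                                dilation_rate=2)  # atrous rate = 2

x = tf.random.normal((1, 65, 65, 32))  # dummy feature map
print(normal(x).shape, atrous(x).shape)              # same spatial size
print(normal.count_params(), atrous.count_params())  # identical: 18496 each
```

With an atrous rate of 2, the 3x3 kernel covers a 5x5 region of the input while keeping exactly the same weights and computation per output position.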

DeepLabv3+ adds a decoder on top of the DeepLabv3-based encoder, addressing the previously noted problem that DeepLabv3 consumes too much time when processing high-resolution images.
Applying the depthwise separable convolution to both the atrous spatial pyramid pooling (ASPP) and decoder modules results in a faster and stronger encoder-decoder network for semantic segmentation.
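
As a hedged illustration of an atrous depthwise separable convolution, the sketch below builds one ASPP-style branch in Keras; the rate and filter counts are illustrative assumptions, not the exact configuration used by DeepLabv3+ or Model Playground:

```python
import tensorflow as tf

# One ASPP-style branch: a depthwise separable convolution with an
# atrous rate, followed by batch norm and ReLU.
aspp_branch = tf.keras.Sequential([
    tf.keras.layers.SeparableConv2D(256, kernel_size=3, padding="same",
                                    dilation_rate=6),  # atrous separable conv
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
])

x = tf.random.normal((1, 33, 33, 512))  # dummy encoder feature map
print(aspp_branch(x).shape)             # (1, 33, 33, 256)
```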

Output stride is the ratio of the size of the input image to the size of the output feature map. It specifies how much spatial reduction the input experiences as it passes through the network.

In Model Playground, the output stride can be set to either 8 or 16.

In the architecture below, the encoder is based on an output stride of 16, i.e. the input image is down-sampled by a factor of 16.
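
A quick arithmetic sketch of what the two output-stride options mean for feature-map size (the 512x512 input resolution is an assumed example):

```python
# Output stride = input size / encoder feature-map size.
def feature_map_size(input_size: int, output_stride: int) -> int:
    return input_size // output_stride

for os_ in (8, 16):  # the two options exposed in Model Playground
    side = feature_map_size(512, os_)  # assume a 512x512 input
    print(f"output stride {os_}: 512x512 input -> {side}x{side} feature map")
```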

Architecture proposed in the original DeepLabv3+ paper by Chen et al.

DeepLabv3+ employs Aligned Xception as its main feature extractor (encoder), albeit with substantial modifications: all max pooling operations are replaced by depthwise separable convolutions with striding.
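
As a rough sketch (shapes and filter counts are assumptions), a strided depthwise separable convolution can halve the spatial resolution the way a max pooling layer would, while remaining learnable:

```python
import tensorflow as tf

# A strided depthwise separable convolution standing in for max pooling:
# it downsamples by 2 like pooling, but with trainable weights.
pool_like = tf.keras.layers.SeparableConv2D(128, kernel_size=3, strides=2,
                                            padding="same")

x = tf.random.normal((1, 64, 64, 64))  # dummy feature map
print(pool_like(x).shape)              # (1, 32, 32, 128): halved resolution
```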

Thanks to the encoder-decoder structure of DeepLabv3+, you can arbitrarily control the resolution of the extracted encoder features via atrous convolution to trade off precision and runtime.

In Model Playground, the feature extraction (encoding) network can be selected as either ResNet or EfficientNet.

This setting specifies the weights used for model initialization; in Model Playground, ResNet101 COCO weights are used.
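
Model Playground's own checkpoint loading is not shown here; as an analogous illustration only, torchvision ships a DeepLabV3 model with a ResNet101 backbone and COCO-trained weights:

```python
import torchvision

# Illustrative only (not Model Playground's actual initialization code):
# DeepLabV3 with a ResNet101 backbone, pretrained on COCO.
model = torchvision.models.segmentation.deeplabv3_resnet101(weights="DEFAULT")
model.eval()  # ready for inference
```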

```python
# Semantic segmentation with a DeepLabv3+ model trained on the ADE20K dataset.
!pip3 install tensorflow
!pip3 install pixellib --upgrade

from pixellib.semantic import semantic_segmentation

segment_image = semantic_segmentation()
segment_image.load_ade20k_model("deeplabv3_xception65_ade20k.h5")
segment_image.segmentAsAde20k("path_to_image", output_image_name="path_to_output_image")

# Xception model trained on ADE20K for segmenting objects:
# http://download.tensorflow.org/models/deeplabv3_xception_ade20k_train_2018_05_29.tar.gz
```

In the benchmark results reported by Chen et al., DeepLabv3+ surpasses various SOTA techniques, including LC, ResNet-DUC-HDC (TuSimple), GCN (Large Kernel Matters), RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3.

Wiki entry for U-Net

Wiki entry for U-Net++
