Dual context prior and refined prediction for semantic segmentation

ABSTRACT Recently, the focus of semantic segmentation research has shifted to the aggregation of context priors and the refinement of boundaries. A typical network adopts context aggregation modules to extract rich semantic features, and utilizes top-down and skip connections to refine boundary details. However, a disadvantage remains: false segmentation occurs when an object has very different textures. The fusion of weakly semantic, low-level features leads to degradation of the context prior. To tackle this issue, we propose a simple yet effective network, dubbed DSPNet, which integrates a dual context prior and spatial propagation. It extends two mainstream lines of current segmentation research: (1) designing a dual context prior module, which attends to the context prior again through a shortcut connection; (2) enabling the network to inherently learn semantic-aware affinity values for each pixel and refine the segmentation. We present detailed comparisons on PASCAL VOC 2012 and Cityscapes. The results demonstrate the effectiveness of our approach.


Introduction
Segmentation is a fundamental task among many computer vision tasks, such as scene parsing (Chen et al. 2019), autonomous driving (Chen et al. 2020) and object detection (Chen et al. 2017a, 2017b), to name a few. Its mission is to assign a category to each pixel, which is crucial to subsequent tasks. With the thriving of Deep Convolutional Neural Networks (DCNNs), particularly the development of FCN (Long, Shelhamer, and Darrell 2015), many breakthroughs in semantic segmentation have been achieved on top of prior work. These improvements should be credited to the adoption of advanced networks as feature extractors, such as ResNet (He et al. 2016), ResNeXt (Xie et al. 2017) and XceptionNet (Chollet 2017). Dilated convolution is also a powerful tool, since it can effectively enlarge receptive fields while retaining high-resolution feature maps. It relieves the issue of intra-class inconsistent segmentation by extracting rich context information. Intra-class inconsistent segmentation means that parts of an object (which belong to the same category) are falsely classified into other classes. Context information is crucial for segmentation mainly because it highlights co-occurrent visual patterns. Nevertheless, as a result of using large windows in both convolution and pooling operations, the segmentations of many prior approaches may lack local location information and precise boundaries, as in PSPNet (Zhao et al. 2017) and Deeplabv3 (Chen et al. 2017e).
A very recent work, Deeplabv3+, improves segmentation through better reconstruction of location information. It performs deconvolution (Long, Shelhamer, and Darrell 2015) and bilinear interpolation over the coarse prediction; after that, low-level features are introduced for the fusion process. Other similar works also focus on prediction refinement. However, although these decoding networks are effective to some extent, redundant boundaries are introduced due to the absence of rich semantic awareness. This aggravates the problem of intra-class inconsistent segmentation, which has been emphasized by prior work (Zhao et al. 2017). As shown in Figure 1, parts of a sheep and a cow are falsely classified as cat and horse, respectively.
With the above discussion, we reconsider how to obtain both refined boundaries and intra-class consistent segmentation. We adopt a recent neural network, UPerNet (Xiao et al. 2018), as our basic network. It includes a top-down connection with an inline context aggregation module, followed by down-top and skip connections. As shown in Figure 2, each lateral branch gradually brings in features of object scales, local location and boundary details, which are crucial premises for generating dense predictions. However, these low-level features, which suffer from weak semantic representation and redundant boundary details, leave the network deficient in the ability to learn the most distinctive features. For instance, a bus is likely to be classified as a car if the network responds too strongly to irrelevant features like windows or wheels. To this end, we introduce a graphical-model-based method to inherently learn semantic-aware global pairwise relationships of an image. In a recent work, Liu et al. utilize an auxiliary network (Liu et al. 2017a) to learn semantic-aware affinity values for high-level vision tasks and achieve promising results. We extend it to online training via a simple yet effective approach. After the processing of the attention module, the context information is introduced once more. Hence, the network obtains the capability of selectively combining category regions and pixel segmentation to suppress intra-class inconsistent segmentation. Meanwhile, with a bold yet reasonable assumption, lateral branches with features of object scales, poses and viewpoints can replace the aforementioned auxiliary network to learn an affinity matrix with rich semantic awareness. During training, our network refines the prediction via online linear propagation, which enables the network to learn pairwise relationships in a local-to-global feature space. Unlike SPN (Liu et al. 2017a), which performs spatial diffusion over the last hidden layer, we directly perform it over the prediction result. The intuition is that the inter-class diffusion should be reciprocally inhibitive after the softmax layer.

Figure 1. Visual results on PASCAL VOC 2012. From the 1st to the 4th column are images, ground truth, and results from PSPNet and Deeplabv3+, respectively. Compared to Deeplabv3+, PSPNet has better results that suppress intra-class inconsistent segmentation.
In summary, there are three contributions in our paper: (1) We review the problem of intra-class inconsistent segmentation, which occurs in the procedure of refining prediction when employing down-top and skip connections. (2) We propose a simple yet effective architecture which incorporates a Dual Context Prior (DCP) module and a refined prediction module. The DCP module can selectively combine category regions and pixel segmentation to suppress intra-class inconsistent segmentation. (3) We develop an online spatial propagation network, which performs local-to-global diffusion over the prediction result by learning pairwise affinity values, yielding precise segmentation.

Related work
Our work is built upon prior works of dilated convolution and context aggregation, prediction refinement, attention module and linear spatial propagation.

Dilated convolution and context aggregation
DCNNs have achieved many astonishing accomplishments in the domain of image classification. He et al. propose the residual convolution module along with a much deeper network (He et al. 2016). Benefiting from this, and for denser segmentation, Deeplab (Chen et al. 2014) utilizes dilated convolution, which can effectively enlarge the receptive field while retaining high-resolution feature maps. Further, to extract context information, GCN (Peng et al. 2017) constructs a large convolution kernel from a series of small ones, and PSPNet (Zhao et al. 2017) employs a parallel pooling module for context aggregation. Deeplabv3 (Chen et al. 2017e) adopts parallel dilated convolutions with different rates. More recently, Zhang et al. propose EncNet (Zhang et al. 2018), which extracts context information much more effectively and sets a new baseline on benchmarks.
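To make the receptive-field effect concrete, the following toy sketch (ours, not from any of the cited works; a 1-D analogue of the 2-D case) computes the span of input positions that a 3-tap kernel covers at different dilation rates:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution with 'valid' padding: the kernel taps are
    spaced `dilation` samples apart, so the receptive field of a k-tap
    kernel is (k - 1) * dilation + 1 while the weight count stays k."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of this single layer
    out = np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])
    return out, span

x = np.arange(16, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])
for d in (1, 2, 4):
    _, span = dilated_conv1d(x, kernel, d)
    print(f"dilation {d}: receptive field {span}")  # 3, 5, 9
```

The parameter count stays fixed while the covered span grows linearly with the dilation rate, which is exactly why dilation preserves resolution where strided pooling would not.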

Prediction refinement
Structures with top-down, down-top and skip connections are widely used among many computer vision tasks, such as object detection (Ren et al. 2017), boundary detection (Xie and Tu 2015; Liu et al. 2017b; Yang et al. 2016; Yu et al. 2017) and semantic segmentation (Long, Shelhamer, and Darrell 2015). To integrate features of different levels, FCN (Long, Shelhamer, and Darrell 2015) adopts a fully convolutional network, while UNet (Ronneberger, Fischer, and Brox 2015) introduces the U-shape structure (Xie and Tu 2015; Peng et al. 2017; Ghiasi and Fowlkes 2016; Lin et al. 2016) with side connections. Lin et al. propose a variant pyramid structure named FPN (Lin et al. 2017) to obtain more precise predictions.

Attention module
The attention module (Mnih et al. 2014; Wang et al. 2017; Chen et al. 2017c), which can make the model more responsive to what we need, has become a powerful tool for deep neural networks (Chen et al. 2016; Hu, Shen, and Sun 2018; Zhang et al. 2018; Yu et al. 2018). The method of (Chen et al. 2016) enables the network to pay attention to information at different scales for semantic segmentation, while PAD-Net (Xu et al. 2018) uses the attention module to control which features from other tasks flow into the target task. A very recent work, SE-Net (Hu, Shen, and Sun 2018), explores cross-channel information to learn channel-wise attention and achieves state-of-the-art performance in image classification. In two recent works on semantic segmentation, both EncNet (Zhang et al. 2018) and DFN (Yu et al. 2018) utilize attention modules to obtain assumption factors, including scale attention factors and global attention factors.
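As a concrete illustration of channel-wise attention in the SE-Net style, here is a minimal NumPy sketch (our own simplification, not the authors' code; `w1` and `w2` stand in for the two fully connected layers, and biases are omitted):

```python
import numpy as np

def se_channel_attention(feat, w1, w2):
    """SE-style channel attention: squeeze (global average pooling),
    excite (two FC layers with ReLU then sigmoid), then rescale channels.
    feat: (C, H, W); w1: (C // r, C); w2: (C, C // r), reduction ratio r."""
    z = feat.mean(axis=(1, 2))               # squeeze: one descriptor per channel
    s = np.maximum(w1 @ z, 0.0)              # excitation, ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # per-channel gate in (0, 1)
    return feat * gate[:, None, None]        # channel-wise rescaling

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))             # reduction ratio r = 4
w2 = rng.standard_normal((8, 2))
out = se_channel_attention(feat, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Because the gate lies strictly in (0, 1), the module can only attenuate channels, never amplify them, which is what makes it usable as a soft feature selector.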

Linear spatial propagation
Affinity matrices, which define pairwise relationships, are widely used for image filtering (Tomasi and Manduchi 1998) and image segmentation (Krahenbuhl and Koltun 2011). They improve performance on these tasks by propagating information over the feature map, and they can also retain edge information for prediction refinement.
For an effective learning strategy, Bertasius et al. (Bertasius, Shi, and Torresani 2016) take features from DCNNs to extract global pairwise relationships and use a random walk network to share weights between nodes, which leads to high-quality semantic segmentation. Since the computation of the random walk network is so expensive that the algorithm cannot converge stably, Liu et al. propose a spatial propagation network (Liu et al. 2017a) for learning the affinity matrix for vision tasks. By constructing a row/column-wise linear propagation model, the spatially varying transformation matrix constitutes an affinity matrix that models the dense global pairwise relationships of the image. Taking the spatial propagation network as a post-processing strategy, Cheng et al. (Cheng et al. 2017) refine the coarse masks of instance-level object segmentation into refined masks.

Proposed method
In this section, we elaborate our methods, including the Dual Context Prior (DCP) module and the refined prediction module. First, we review the linear spatial propagation network and extend it based on the current architecture. Then, we describe how to embed the DCP information into the prediction end. Finally, we give the overall architecture of our network.

Dual context prior (DCP)
As mentioned above, the multi-scale context pooling module can effectively solve the problem of inconsistent segmentation, which may occur when an object has very different textures. As shown in Figure 2, in the decoding part with down-top and skip connections, intra-class inconsistent segmentation occurs again. For this purpose, we introduce the context prior attention again after the fusion of multi-scale features. The attention mechanism has been successfully employed in various tasks for filtering useful information. For instance, DFN (Yu et al. 2018), which extracts channel-wise attention factors, achieves state-of-the-art performance in semantic segmentation. Hence, we utilize an attention mechanism to determine whether the context information should be introduced into the feature map after the fusion of low-level and high-level features. In other words, in some cases the context information may be redundant. The attention module can be regarded as a control gate that determines the usage of the context information. This strategy allows the network to inherently pay, or not pay, attention to the context information. As shown in Figure 3, our proposed DCP module learns an attention factor G from the multi-scale features:

G = σ(w ⊗ F_s),  (1)

where w denotes the weights of the convolution kernel, σ denotes the sigmoid activation function, and F_s denotes the fused features from both low- and high-level features. The final output of this module can be written as:

F_out = F_s + G ⊙ F_g,  (2)

where F_g is the multi-scale context information from the SPP module, as the left module shown in Figure 2.
With respect to the role of DCP module, an intuition is that the network can adaptively select the useful information based on the reintroduced context prior information.
For instance, for a bus, compared with a car, the network should ignore common features such as windows or regions with the same texture under illumination, while retaining the most distinctive features such as boundary and colour texture, to ensure intra-class consistent segmentation. The inspiration for this design comes from (Xu et al. 2018), which uses an attention method to associate features from other tasks. After the fusion between the high-level features from branch 5 (i.e. F_g) and the low-level features (i.e. branches 1, 2, 3, 4), a channel reduction operation (Conv+BN+ReLU) is performed on the fused features to make the channel dimensions of the fused features and F_g equal. Then the sigmoid operation is applied to the fused features F_s to generate an attention map G. The values of G range from 0 to 1, indicating the per-pixel importance of the original context feature F_g. Through the multiplication operation, the network can adaptively filter out the redundant context features from F_g via the attention map G. Furthermore, the addition between F_s and the filtered F_g can be deemed residual learning.
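The gating described above can be sketched as follows (a NumPy toy version under our reading of the module, not the authors' implementation; the 1×1 convolution is represented by a channel-mixing matrix `w`, and the BN/ReLU of the channel-reduction step are omitted):

```python
import numpy as np

def dcp_module(f_s, f_g, w):
    """Dual Context Prior gate: G = sigmoid(w * F_s); out = F_s + G * F_g.
    f_s: fused low/high-level features, (C, H, W);
    f_g: multi-scale context features from the SPP module, (C, H, W);
    w:   (C, C) channel-mixing weights standing in for a 1x1 convolution."""
    logits = np.einsum('dc,chw->dhw', w, f_s)  # 1x1-conv analogue over channels
    g = 1.0 / (1.0 + np.exp(-logits))          # attention map G in (0, 1)
    return f_s + g * f_g                       # residual fusion of gated context

rng = np.random.default_rng(1)
f_s = rng.standard_normal((4, 3, 3))
f_g = rng.standard_normal((4, 3, 3))
w = rng.standard_normal((4, 4))
out = dcp_module(f_s, f_g, w)
print(out.shape)  # (4, 3, 3)
```

Note the residual form: when the gate suppresses all context (or F_g is absent), the module reduces to an identity over the fused features F_s.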

Online linear spatial propagation
An intuitive understanding of SPN (Liu et al. 2017a) is that it learns a semantic-aware affinity value for each pixel pair through an auxiliary network (Long, Shelhamer, and Darrell 2015). In other words, the affinity reveals the similarity of a pixel pair. Thus, for a pixel to be classified, a linear weighting operation is performed depending on the affinity values between the pixel and its adjacent pixels. Note that the linear weighting operation is anisotropic, which retains edge information. Since the computation of full local-to-global diffusion is expensive, SPN restricts linear spatial propagation to four directions (e.g. left to right), and one propagation step in one direction associates a pixel with only three adjacent pixels.
We modify this module so that it can be learned with online training. As shown in Figure 2, after the fusion of low-level and high-level features, we introduce features of object scales, poses and viewpoints to ensure local-to-global diffusion under a limited number of propagation steps. A feature map F_dc of size m × n, the output of the DCP module, serves as the input for learning the affinity matrix:

M_K = f_a(F_dc), K ∈ N(i, j),  (3)

where M_K denotes the affinity matrix to be learned, and N(i, j) indexes the affinity values of the pixel at position (i, j) with respect to its adjacent pixels; e.g. for top-to-down propagation, the column index runs from j − n + 1 to j. It defines the similarity between pixels based on high-level vision features. Meanwhile, the network outputs a coarse segmentation mask X based on the feature map F_dc:

X = f_c(F_dc).  (4)

We define a propagation hidden layer H over the feature map X, where h_{i,j} and x_{i,j} are the pixels at position (i, j) of the hidden layer and of the coarse prediction map, respectively. The 2D linear propagation from one direction can be described as:

h_{i,j} = (1 − Σ_{k∈N(i,j)} p_{i,k}) x_{i,j} + Σ_{k∈N(i,j)} p_{i,k} h_{k,j−1},  (5)

where (k, j − 1) is an adjacent pixel of (i, j) in the hidden layer. Therefore, each direction of propagation ensures that each pixel obtains information from its adjacent pixels along the same direction. By merging the four directions with node-wise max-pooling, each pixel can obtain information from the entire prediction map. As mentioned by (Liu et al. 2017a), this diffusion operation is stable under the condition:

Σ_{k∈N(i,j)} |p_{i,k}| ≤ 1.  (6)

The propagation in Eq. (5) is performed as column-wise transitions, which can be expressed by the following linear operation:

H_i = (I − D_i) X_i + W_i H_{i−1},  (7)

where H_i and X_i denote the ith column of the linear propagation layer and of the coarse prediction map, respectively, and D_i is a diagonal matrix collecting the row sums of the affinity weights in W_i.
Here h_0 = x_0, and W_i is a linear transition matrix. We define this propagation to repeat T times; (Liu et al. 2017a) proves that the hidden matrices in two adjacent states satisfy:

H_{t+1} = (I − L) H_t,  (8)

where L denotes a Laplacian matrix. Therefore, this linear propagation process is equivalent to a spatially anisotropic diffusion process, which can smooth non-boundary regions while responding to boundary details, a premise for high-quality segmentation. For the final result H_T, the softmax function is employed as a channel-wise normalization, followed by the cross-entropy loss function for the final prediction:

L_seg = CE(softmax(H_T), Y),  (9)

where Y denotes the ground truth.
In the overall process of refined prediction, except for a series of convolution operations used for affinity matrix learning, all other layers operate directly on the coarse prediction and are learned under Eq. (9). A motivation for this is that we consider the inter-class diffusion should be reciprocally inhibitive.
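A minimal sketch of the left-to-right case of Eq. (5) (our own simplified version, not the paper's code: scalar maps, a single direction, and the three neighbours (i−1, i, i+1) in the previous column; the affinity weights are rescaled to satisfy the stability condition of Eq. (6)):

```python
import numpy as np

def propagate_lr(x, p):
    """One-direction (left-to-right) linear spatial propagation on a scalar
    map. x: coarse prediction, (H, W); p: affinity weights, (H, W, 3), for
    the three neighbours (i-1, i, i+1) in column j-1. Weights are rescaled
    so that sum(|p|) <= 1 per pixel (the stability condition)."""
    H, W = x.shape
    norm = np.maximum(np.abs(p).sum(axis=2, keepdims=True), 1.0)
    p = p / norm
    h = x.copy()
    for j in range(1, W):                      # sweep columns left to right
        for i in range(H):
            acc, used = 0.0, 0.0
            for off, k in ((-1, 0), (0, 1), (1, 2)):
                ni = i + off
                if 0 <= ni < H:
                    acc += p[i, j, k] * h[ni, j - 1]
                    used += p[i, j, k]
            # blend the pixel's own evidence with diffused neighbour states
            h[i, j] = (1.0 - used) * x[i, j] + acc
    return h

x = np.arange(12, dtype=float).reshape(3, 4)
zero_p = np.zeros((3, 4, 3))                   # zero affinity: no diffusion
assert np.allclose(propagate_lr(x, zero_p), x)
```

With zero affinity the map passes through unchanged, while affinity concentrated on the same-row neighbour copies information across columns; a full four-direction version would run this sweep in all four orientations and merge the results with node-wise max-pooling.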

Overall network architecture
Based on the DCP module and the online linear spatial propagation module, we propose a simple yet effective deep semantic segmentation network (DSPNet), which integrates a cascaded DCP attention module and a prediction refinement module. As shown in Figure 2, we adopt a deep residual network followed by an SPP module as the backbone of our proposed network. The SPP module extracts multi-scale context information that is rich in semantic awareness. Skip connections are utilized to extract multi-scale features of object scales and local location information. After channel dimension reduction, each lateral branch has a 512-dimensional feature map, followed by a bilinear upsampling operation to restore the resolution to a quarter of the input size. This strategy is inspired by two prior works, the object detection network FPN (Lin et al. 2017) and the scene understanding network UPerNet (Xiao et al. 2018). The outputs of the aforesaid modules are concatenated into a feature map, which serves as the input of the DCP attention module that determines the fate of the context information. It then outputs a coarse mask with robust intra-class consistent segmentation. Meanwhile, an affinity matrix containing the pixel affinities is learned from the preceding feature map. The affinity matrix performs spatial propagation over the coarse segmentation mask to further sharpen boundaries and smooth intra-class regions, and then outputs the final semantic segmentation prediction. The network also supervises the coarse segmentation, which enables it to obtain stable predictions quickly and to learn affinity values of high confidence.

Experimental results
To evaluate our proposed approach, experiments are conducted on the PASCAL VOC 2012 (Everingham et al. 2015) and Cityscapes (Cordts et al. 2016) benchmarks. In this section, we first introduce the datasets and illustrate the implementation details. Thereafter, we evaluate each module of the proposed network in an ablation study. Finally, we present a performance comparison with other state-of-the-art methods.

PASCAL VOC 2012
The PASCAL VOC 2012 dataset (Everingham et al. 2015) is a well-known semantic segmentation dataset, which contains 20 object classes and one background class, involving 1464 images for training, 1449 for validation and 1456 for testing. The original dataset is augmented by the Semantic Boundaries Dataset (Hariharan et al. 2011), resulting in 10,582 training images.

Cityscapes datasets
The Cityscapes dataset consists of images collected from 50 different cities in Europe. 5000 images come with fine annotations, and 20,000 additional images come with coarse annotations only. These images capture urban street scenes, and the pixels are categorized into 19 classes for evaluation.

Metrics
To evaluate the segmentation performance of our proposed architecture, we resort to the standard Jaccard index, known as the mean intersection-over-union (mIoU) metric.
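For reference, a minimal NumPy implementation of the metric (our own sketch; it skips classes absent from both the prediction and the ground truth, one common convention):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union (Jaccard index) over classes.
    pred, gt: integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [1, 1, 1]])
gt   = np.array([[0, 0, 0], [1, 1, 1]])
# class 0: inter 2, union 3 -> 2/3; class 1: inter 3, union 4 -> 3/4
print(mean_iou(pred, gt, num_classes=2))  # (2/3 + 3/4) / 2
```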

Network implementation and training
Our approach is based on the ResNeXt network (Xie et al. 2017). For the ablation study, all parameters of the batch normalization layers are fixed.

Training
We train the network using a mini-batch stochastic gradient descent optimizer. The momentum is set to 0.9 and the weight decay to 0.0001. Similar to (Zhao et al. 2017), we take patches of size 512 and 720 as input for PASCAL VOC and Cityscapes, respectively. We also use the "poly" learning rate policy, where the learning rate is multiplied by (1 − iter/max_iter)^0.9, and the initial learning rate is set to 0.007 and 0.0035 with and without SPN, respectively.
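The "poly" schedule above can be written as a one-liner (a sketch; the base learning rate and iteration budget shown are only the values quoted in this section, not additional prescriptions):

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """'Poly' learning-rate policy: lr = base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - it / max_iter) ** power

# e.g. the PASCAL VOC setting with SPN, base_lr = 0.007
assert poly_lr(0.007, 0, 1000) == 0.007   # full rate at the first iteration
assert poly_lr(0.007, 1000, 1000) == 0.0  # decays to zero at the last one
```

With power = 0.9 the decay is nearly linear but slightly convex, keeping the learning rate higher than a linear ramp for most of training.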

Data augmentation
We apply data augmentation as recommended in the training process of (Zhao et al. 2017). The scale factor is sampled from the range (0.5, 2) and the rotation angle from (−10°, 10°).

PASCAL VOC 2012
In this section, we discuss the influence of the DCP attention module on segmentation quality. We also compare the segmentation results of DCP attention after incorporating different levels of low-level features. The results show that more low-level features lead to more severe intra-class inconsistent segmentation. Finally, we discuss the improvement that online linear spatial propagation brings to semantic segmentation, and the influence of the DCP attention module on the quality of linear spatial propagation.

Ablation study for dual context prior
We define the decoders with strides of 4 and 2 as decoder A and decoder B, respectively. Each lateral branch has a 512-dimensional feature map. To address intra-class inconsistent segmentation during the procedure of gradual decoding, we adopt the DCP module. The comparison results can be seen in Table 1. First, to establish the importance of the context prior: ResNeXt-101 with decoder B, which is similar to FCN-4s (Long, Shelhamer, and Darrell 2015) and has no context module, achieves 71.8% mIoU. It achieves 74.9% mIoU, a growth of 3.1%, after adding the SPP module for the context prior, as in PSPNet (Zhao et al. 2017). This result indicates that context aggregation is important for segmentation. Integrating decoder A brings a growth of 1.2% mIoU, while decoder B, which has more lateral branches, brings a growth of 1.6% mIoU; the PSPNet with decoder B corresponds to UPerNet (Xiao et al. 2018). This indicates that decoder B is better than decoder A, meaning that the low-level features are disturbing but useful. After introducing the dual context prior module with decoders A and B, we obtain growths of 1.2% and 1.4% compared with the SPP with decoders A and B, respectively. The network integrating DCP with decoder B again gives the better result, which supports our argument that integrating too many low-level features results in severe intra-class inconsistent segmentation that needs to be suppressed. When using ResNeXt-152, which has more layers, the SPP with decoder B achieves 80.48%, and the improvement from our methodology is still substantial, with a growth of 0.7%. Examples are shown in Figure 4, where the 1st and 2nd columns are images and ground truth, respectively.
After introducing DCP (5th column), compared with the SPP with decoder B (4th column) that integrates low-level features, the results are much smoother within intra-class regions, close to PSPNet (3rd column) but with more strongly refined boundaries.

Ablation study for refining prediction
As shown in Table 1, adopting the DCP module before low-level feature aggregation yields better segmentation results. Therefore, to verify the effectiveness of the linear spatial propagation network, we adopt the DCP module with decoder B as the basic network. As shown in Table 2, the network with SPN has better performance, with a growth of 1.6% on ResNeXt-101 and a growth of 0.9% on ResNeXt-152. The performance of our DSPNet grows with deeper networks, mainly due to their bias toward features of object scales, which also helps in learning the affinity values between pixels.

Ablation study for offline linear spatial propagation
To examine the impact of online versus offline linear spatial propagation learning on segmentation, we perform a series of 32-channel convolution operations on the preliminary prediction to output a 32-dimensional feature map. Then, we use FCN-4s (Long, Shelhamer, and Darrell 2015) as an auxiliary network to learn an affinity matrix.
After the affinity matrix propagates over the 32-dimensional feature map, a series of 64-channel convolution operations is employed for the final refined prediction. This method comes from SPN (Liu et al. 2017a). In Table 2, # denotes offline processing. Online and offline learning produce similar results. With deeper layers, offline learning has a growth of only 0.2%, which shows that the learning of semantic-aware affinity values can be shared within the same network. Although the offline SPN performs slightly better than the online version, the disadvantage of the original offline SPN is that it takes much more time to train and needs extra memory to store the features. Thus, we modify the offline SPN so that it trains with the whole network, which significantly reduces the demand for computation resources and memory.

PASCAL VOC 2012
For evaluation, we apply the multi-scale scheme to the inputs and also horizontally flip them to further improve performance. We further fine-tune our model on the PASCAL VOC 2012 train and val sets for evaluation on the test set. More performance details are listed in Table 3; our model achieves 82.5%, which is a competitive result. Examples can be seen in Figure 5.

Cityscapes datasets
In the previous sections, we discussed in detail how the decoder module impacts the performance of SPP.
In the process of progressive decoding, the problem of intra-class inconsistent segmentation shows up again, and it also exists in road scene datasets. As the examples in Figure 6 show, the front of the truck is falsely classified as bus due to the similar textures of the two classes. Therefore, we introduce the DCP module and the SPN module in the Cityscapes ablation study. We use ResNeXt-152 (Xie et al. 2017) as the base network and take patches of 720 pixels as input. We also use fixed BN operations for training, since the batch size is smaller than 16.
We apply multi-scale inputs with scales ranging from (0.5, 2.0) and also horizontally flip images to further improve performance. Without employing the coarse data, the DCP module brings 0.6%; with the linear spatial propagation, our final model achieves an mIoU of 81.2% on the Cityscapes val set, as seen in Table 4. With the fine set, which contains 3475 images, our final model achieves an mIoU of 80.4% on the Cityscapes test set, which outperforms state-of-the-art methods, as seen in Table 5.
To compare with other state-of-the-art methods, we train our network with the train-val and coarse sets, which include the extra 20,000 coarsely annotated images. We pretrain our model on the coarse data of Cityscapes and then fine-tune it on the fine data. The final mIoU is 82.2%, which is still competitive with other approaches. Examples can be seen in Figure 7.

Conclusions
We revisit a crucial problem that is ignored by most research: context prior information does effectively solve the problem of intra-class inconsistent segmentation, but it begins to degrade when the network incorporates multi-scale features for refining the prediction. For this purpose, this work proposes a simple yet effective network (DSPNet), which pays attention to context information again through an attention mechanism. Our network can perform robust intra-class consistent segmentation while inherently extracting rich semantic affinity features, which are utilized within the linear propagation network to sharpen boundaries and smooth intra-class regions for refined prediction.
The experimental results demonstrate the effectiveness of our method.

Wujing Zhan is with the School of Data and Computer Science, Sun Yat-Sen University, Guangzhou. He is a master's degree student and studies machine learning and deep learning in applications of scene recognition, scene segmentation, and scene flow.
Baoding Zhou received the PhD degree in photogrammetry and remote sensing from Wuhan University in 2015. He is currently an Assistant Professor with the College of Civil and Transportation Engineering, Shenzhen University. His research interests include indoor localization and mapping, mobile computing, and intelligent transportation.
Qingquan Li received the PhD degree in geographic information system and photogrammetry from Wuhan Technical University of Surveying and Mapping in 1998. He is currently a Professor with Shenzhen University, and Wuhan University. His research areas include 3-D and dynamic data modeling in GIS, location-based service, surveying engineering, integration of GIS, global positioning system and remote sensing, intelligent transportation system, and road surface checking.