Deep blur detection network with boundary-aware multi-scale features

Blur detection has recently become an active topic in computer vision. It aims to accurately segment the blurred areas of an image, which benefits subsequent image processing. Although many approaches based on hand-crafted features have been presented over the last decades, they are not robust to complex scenarios. To address this problem, we establish a boundary-aware multi-scale deep network in this paper. First, the VGG-16 network is used to extract deep features from multi-scale layers. Contrast layers and deconvolutional layers are then added to make the difference between blurred and clear areas more prominent. Finally, a new boundary-aware penalty is introduced, which makes the edges of our results much sharper. Our method takes about 0.2 s to evaluate an image. Experiments on a large dataset confirm that the proposed model performs better than competing models.


Introduction
Blur detection without any knowledge of the blur type, blur level or camera settings is a challenging problem. It aims to accurately segment the blurred areas of an image and is useful in many applications, such as image restoration (Zhang et al., 2016), defocus magnification (Bae & Durand, 2007), salient object detection (Jiang et al., 2013; Sun et al., 2017), and blur segmentation and deblurring (Shi et al., 2014, 2015). Blur detection differs from object detection: it aims to detect the blurred parts of an image, not only the objects. According to the cause of blurriness, blur can be divided into out-of-focus blur (Figure 1(a)) and motion blur (Figure 1(b,c)).
As shown in Figure 1, the blurred parts of an image can be an object, a region, or even everything except the clear object. In the case of Figure 1(a), since there is a clear object and the blurred regions cover almost the whole background, it is easier to get a good detection result by borrowing techniques from object detection. But in the cases of Figure 1(b,c), things are more difficult because the blurred regions are no longer limited to the background. The uncertainty and diversity of the blurred regions make blur detection more challenging than traditional detection tasks. Sometimes, blur is generated purposely by photographers: they use a wide aperture, which prevents light from distant scene points converging and thus makes the important objects more prominent. This photographic technique is very common in images captured by optical imaging systems. Based on studies of the cause of blurriness, many methods have been proposed to solve this problem. Levin (2007) tried to identify partial motion blur by analysing the statistical information of an image. Liu et al. (2008) used four local blur features to detect the blurred regions. Chakrabarti et al. (2010) first applied the local Fourier transform to this problem. Shi et al. (2014) analysed discriminative features for the blur detection task in gradient and Fourier space. All these methods are based on hand-crafted features. In simple scenarios they are convenient and often effective; however, they cannot handle complex scenarios well.
One possible reason is that it is hard to separate the blurred areas from the smooth areas inside a clear object. Blurred areas contain no structural information, such as clear boundaries and textures; neither do the smooth areas inside a clear object. Relying only on low-level cues, it is too hard to distinguish them. The second reason is that the traditional methods are sensitive to the edges of objects, so they often fail when an image contains no clear object, as in Figure 1(b,c). For these two reasons, in this paper we avoid designing a new hand-crafted feature. Instead, we set up a multi-scale deep network to learn which regions of an arbitrary image are blurred.
During the last five years, convolutional neural networks (CNNs) have developed very fast and have outperformed traditional methods in many applications, including object detection (Kang et al., 2016; Srivastava & Biswas, 2020; Xu & Zhang, 2020; Zhu et al., 2020), image classification (Wei et al., 2015), image denoising (Zhang, K. et al., 2017), image super-resolution, saliency detection (Zhang, P. et al., 2017), object tracking (Sun et al., 2018) and network optimisation (Liu et al., 2020; Wang et al., 2016; Xue & Wang, 2015; Ye et al., 2016). Inspired by these facts, we propose a boundary-aware multi-scale deep network to solve the blur detection problem. First, the image scale is one of the most important factors determining the blur confidence of an image (Shi et al., 2014). Motivated by this observation, we use the VGG-16 network (Simonyan & Zisserman, 2015) to obtain feature maps at different scales from an input image. Second, we use a series of contrast layers to strengthen the differences between the blurred areas and the clear areas, and we connect the corresponding deconvolution layers to obtain a final estimation result with the same resolution as the input image. Finally, a new loss function is added to make the blur boundary clearer. Since our model is not based on image patches, it is trained end-to-end, and it takes only 0.2 s to evaluate an input image. Experiments on a large dataset show that our model performs better than the other methods.
In a word, our contributions in this paper are summarised as follows:
• We present a deep blur detection network that achieves a good balance between fast running speed and high detection accuracy. Our network architecture is simple, clear and effective.
• We demonstrate that contrast layers are more useful than plain skip connections between the encoder network and the decoder network. They could also benefit other applications based on contrast features, such as salient object detection.
• We propose a boundary-aware loss function that penalises errors on the boundary. Benefiting from this new loss function, our blur detection results have clearer edges and much cleaner backgrounds.
The rest of the paper is arranged as follows. Section 2 summarises the related literature. The proposed model is described in detail in Section 3. Section 4 presents extensive experiments and comparison results. Finally, we conclude in Section 5.

Related work
Traditional blur detection models fall into two categories: gradient magnitude-based methods (Chen et al., 2020; Elder & Zucker, 1998; Su et al., 2011; Tai & Brown, 2009, etc.) and frequency space-based methods (Golestaneh & Karam, 2017; Lu et al., 2016; Shi et al., 2014; Tang et al., 2016; Wu et al., 2019, etc.). It is observed that a clear image patch contains more strong gradients. Based on this observation, Elder and Zucker (1998) and Tai and Brown (2009) used the ratio of strong gradient magnitudes to characterise the sharpness of an image patch. Su et al. (2011) combined the distributions of the singular values and the gradient pattern in the alpha channel to detect the blurred areas.
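As a rough illustration (not code from the cited papers), the gradient-based idea can be sketched as follows: a sharp patch contains more strong gradients, so the fraction of pixels whose gradient magnitude exceeds a threshold can characterise its sharpness. The threshold value 0.3 is an illustrative choice, not taken from the literature.

```python
import numpy as np

def strong_gradient_ratio(patch, thresh=0.3):
    """Fraction of pixels with a strong gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))  # finite-difference gradients
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return (mag > thresh).mean()

sharp = np.zeros((16, 16)); sharp[:, 8:] = 1.0  # patch with a sharp edge
flat = np.full((16, 16), 0.5)                    # structure-free patch
print(strong_gradient_ratio(sharp) > strong_gradient_ratio(flat))  # True
```

A real detector would evaluate this score over sliding windows and tune the threshold to the image statistics.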
In frequency space, a blurred image has fewer high-frequency components and more low-frequency components. Based on this fact, many frequency-space methods detect the blurred areas by calculating the ratio of high-frequency components in a given image. Golestaneh and Karam (2017) combined multi-scale blur detection maps in the high-frequency domain. Shi et al. (2014) proposed four effective descriptors, including a Fourier-domain descriptor. Tang et al. (2016) iteratively refined a coarse blur map using similar neighbouring image patches. Lu et al. (2016) used unsigned utility magnitudes to compute the blurred areas.
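The high-frequency ratio mentioned above can be sketched with a 2-D FFT; this is a hedged illustration of the general idea, not the method of any cited paper, and the low-frequency radius of 4 bins is an arbitrary choice.

```python
import numpy as np

def high_freq_ratio(patch, radius=4):
    """Share of spectral energy outside a small low-frequency disc."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)  # distance to DC
    high = spectrum[dist > radius].sum()
    return high / max(spectrum.sum(), 1e-12)

rng = np.random.default_rng(0)
sharp = rng.standard_normal((32, 32))  # texture-rich patch: energy spread out
blurred = np.ones((32, 32))            # detail-free patch: energy at DC only
print(high_freq_ratio(sharp) > high_freq_ratio(blurred))  # True
```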
In some simple scenarios, hand-crafted features show good performance, but they often fail in complex scenes, as shown in Figure 1(b,c). Blur detection therefore remains an open problem.

Network architecture
In this paper, we propose an end-to-end, fully convolutional, boundary-aware multi-scale network to segment the blurred areas of a given image. The entire network is shown in Figure 2; it contains five scales and three rows. The five scales extract features at multiple resolutions, and the three rows are used for feature extraction, contrast computation and up-sampling, respectively. The input image I is resized to 352 × 352. The estimated blur map is 176 × 176, and we resize it back to 352 × 352 with bilinear interpolation. The detailed network settings are listed in Table 1.
As mentioned in Section 1, multi-scale resolutions are important to the blur detection task. In our network, the first row includes five layers coming from the VGG-16 network (Simonyan & Zisserman, 2015); each of these five layers ends with a max-pooling operation. The second row contains five layers, named Conv6 to Conv10, which are connected to Conv1 to Conv5, respectively. These five layers compute the contrast features {C_1, C_2, ..., C_5}. These contrast features are local: they capture the loss of object details compared to their local neighbourhoods.
The last row contains five deconvolution layers, which fuse and up-sample the feature maps from the first row and the contrast feature maps from the second row. The up-sampling factor is 2, so the resolution grows from 11 × 11 to 176 × 176 step by step. The final estimated blur map is obtained after two convolution layers and one softmax operation.

Multi-scale features extraction
The importance of multi-scale perception in blur detection was noticed in Shi et al. (2014). From a single resolution, it is hard to accurately identify whether an area of an image is blurred. At a large scale an image may look clear, but at a smaller scale, if no further structural or textural details are revealed, it should be considered blurred. This is called the scale ambiguity (Lu et al., 2013; Yan et al., 2013).
In this paper, we use the VGG-16 network (Simonyan & Zisserman, 2015) as our pre-trained network. It has multiple stages of spatial pooling, which progressively down-sample the input image and yield multi-resolution feature maps. To transform the original VGG-16 model into a fully convolutional network, we keep only its first five down-pooling blocks as the multi-scale feature extraction layers, i.e. Conv1 to Conv5, and denote the multi-resolution feature maps as {F_1, F_2, ..., F_5}.
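The scale arithmetic above can be made concrete with a small sketch (not the authors' code): each of the five pooling stages halves the spatial resolution, so a 352 × 352 input yields feature maps F_1 to F_5 of sizes 176 down to 11.

```python
def vgg16_scale_sizes(input_size=352, num_scales=5):
    """Spatial size of each multi-scale feature map after pooling."""
    sizes = []
    size = input_size
    for _ in range(num_scales):
        size //= 2          # each max-pooling stage halves H and W
        sizes.append(size)
    return sizes

print(vgg16_scale_sizes())  # [176, 88, 44, 22, 11]
```

The coarsest map (11 × 11) matches the starting resolution of the decoder described in Section "Deconvolution layers".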

Skip connection using contrast layers
Skip connections have recently been used to associate low-level feature maps across resolutions. For example, they were used in U-Net (Milletari et al., 2016) and SharpMask (Pinheiro et al., 2016) for segmentation, in Recombinator networks (Honari et al., 2016) for face detection, and in Stacked Hourglass networks (Newell et al., 2016) for keypoint estimation. A skip connection is often implemented by a convolutional layer with a 1 × 1 kernel, denoted Conv(1*1), as shown in Figure 3(a). In this paper, we use a contrast layer as the skip connection instead of Conv(1*1), as shown in Figure 3(b), because we found it more effective to strengthen the contrast between the blurred and clear regions than merely to pass the features through.
Contrast features are often used in salient object detection (Achanta et al., 2009; Jiang et al., 2013; Yan et al., 2013). In the blur detection task, a clear image has clear edges, subtle textures and rich detail, whereas a blurred image has blurred edges or textures and often lacks detail. Therefore, contrast features are also useful for blur detection, but they are more difficult to extract. In the past, contrast features for blur detection were obtained from image statistics in the gradient distribution (Jiang et al., 2013; Liu et al., 2008; Su et al., 2011; Zhao et al., 2013; Zhuo & Sim, 2011) or in frequency space (Park et al., 2017; Shi et al., 2014; Tang et al., 2013, 2016; Vu et al., 2011). Instead of designing new hand-crafted features, we use a series of local contrast feature layers to capture the difference between clear and blurred regions.
We connect a convolutional layer to each scale feature map, i.e. Conv6 to Conv10. Each convolutional layer has a kernel size of 3 × 3 and 128 channels. Each contrast feature map C_i is then calculated by contrasting the convolutional output at each position with its local neighbourhood. Note that our contrast features are learned and not pre-defined. The size of each contrast feature map C_i is equal to that of the corresponding feature map F_i.
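The exact contrast equation is not reproduced above; a common formulation of a local contrast layer, sketched here as an assumption, subtracts the local average of a feature map from the map itself, so flat (detail-poor) regions give values near zero while detailed regions do not.

```python
import numpy as np

def local_contrast(feature, k=3):
    """C = F - avg_pool_kxk(F), same spatial size (edge padding).
    Hedged sketch of one plausible contrast-layer formulation."""
    h, w = feature.shape
    pad = k // 2
    padded = np.pad(feature, pad, mode="edge")
    avg = np.empty_like(feature, dtype=float)
    for i in range(h):
        for j in range(w):
            avg[i, j] = padded[i:i + k, j:j + k].mean()
    return feature - avg

flat = np.ones((8, 8))                      # detail-free region
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0  # sharp vertical edge

print(np.abs(local_contrast(flat)).max())   # ~0: no local contrast
print(np.abs(local_contrast(edge)).max())   # > 0: contrast at the edge
```

In the network this operation is learned by the convolutional layers rather than fixed as above.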

Deconvolution layers
Having obtained contrast feature maps at different scales, we need a fusion scheme to combine them and generate the final estimated blur map with the same size as the input image. We adopt five deconvolution layers, each connected to one contrast layer. At the same time, each lower deconvolution layer is connected to the next higher one. The feature map is up-sampled by a factor of 2 from the lowest deconvolution layer to the highest, one layer after another. At each deconvolution layer, the resulting up-pooled feature map U_i is computed by combining the corresponding contrast feature map C_i with the previous up-pooled feature map U_{i+1}. The final estimated blur map is the result of U_1 after two convolutional layers and one softmax layer. The detailed network parameters can be found in Table 1.
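The decoder recursion can be sketched as follows. Whether C_i and U_{i+1} are summed or concatenated before the deconvolution is not spelled out above, so this hedged sketch simply sums them after a factor-2 nearest-neighbour upsampling (in the real network the deconvolution layers are learned).

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def fuse(contrast_maps):
    """Fuse C_5..C_1 bottom-up: U_i combines C_i with upsampled U_{i+1}."""
    u = contrast_maps[-1]                 # start from the coarsest map C_5
    for c in reversed(contrast_maps[:-1]):
        u = c + upsample2x(u)             # combine at the finer resolution
    return u

# Contrast maps at 176, 88, 44, 22 and 11 pixels (fine to coarse).
maps = [np.ones((s, s)) for s in (176, 88, 44, 22, 11)]
print(fuse(maps).shape)  # (176, 176): resolution of the estimated blur map
```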

Boundary-aware loss function
Given the training set {(U, G)}, where U denotes the estimated blur maps and G denotes the ground truths, let (x, y) denote the coordinates of a pixel. The pixel-wise loss between the estimated blur maps U and the ground truths G is generally defined as the cross-entropy

L_ce(Θ) = − Σ_{(x,y)} [ G(x, y) log U(x, y) + (1 − G(x, y)) log(1 − U(x, y)) ],

where Θ denotes the parameter set of the network.
Motivated by progress on medical image segmentation (Milletari et al., 2016; Taha & Hanbury, 2015; Zou et al., 2004), we use a boundary-aware loss function to approximate a penalty on the boundary length. To compute the boundary pixels, we apply a Sobel operator followed by a tanh function to obtain the gradient magnitude of the estimated blur map; the tanh function projects the gradient magnitude into the probability range [0, 1]. The boundary penalty term compares Û_i, the gradient magnitude of the results, with Ĝ_i, the gradient magnitude of the corresponding ground truth. The final loss function of our model is the sum of the pixel-wise loss and the boundary penalty term. This final loss is boundary-penalised and the whole computation is end-to-end trainable. Its effect is demonstrated in the experimental section.
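The boundary term can be sketched as follows: a Sobel operator gives the gradient magnitude of a map and tanh squashes it into [0, 1). The exact penalty between the two boundary maps is not reproduced above, so a mean squared difference stands in for it here as an assumption.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_same(img, kernel):
    """'Same' 2-D correlation with zero padding (3x3 kernels only)."""
    h, w = img.shape
    padded = np.pad(img, 1)
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + 3, j:j + 3] * kernel).sum()
    return out

def boundary_map(m):
    """tanh of the Sobel gradient magnitude, in [0, 1)."""
    gx, gy = conv2d_same(m, SOBEL_X), conv2d_same(m, SOBEL_Y)
    return np.tanh(np.sqrt(gx ** 2 + gy ** 2))

def boundary_penalty(pred, gt):
    """Stand-in penalty: mean squared difference of boundary maps."""
    return np.mean((boundary_map(pred) - boundary_map(gt)) ** 2)

gt = np.zeros((16, 16)); gt[:, 8:] = 1.0   # ground-truth blur mask
print(boundary_penalty(gt, gt))            # 0.0: identical boundaries
print(boundary_penalty(np.zeros((16, 16)), gt) > 0)  # True: missed boundary
```

In the real network these operations run on tensors inside the training graph, so the penalty stays differentiable and end-to-end trainable.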
Figure 4 compares our approach with other methods that use gradient magnitude. Figure 4(a) shows an original image and the corresponding ground truth. In Figure 4(b), the left image is the gradient magnitude detected by a Sobel edge detector in Zhuo and Sim (2011); the blur amount at edge locations was then propagated to the entire image to obtain the final blur map on the right. In Figure 4(c), the left image is the boundary learned by our method after 200 epochs of training, and the right is our detected blur map. Clearly, the learned boundary is much closer to the boundary of the ground truth, which leads to a better blur map.

Dataset
The blur detection dataset proposed in Shi et al. (2014) is selected for evaluation. For convenience, it is named Shi's dataset 1 in this paper. As far as we know, it is the largest public blur detection dataset with pixel-wise ground truths. It contains 1000 images with pixel-wise ground-truth annotations of the blurred regions, produced by 10 volunteers. The dataset varies widely in content, including images with defocus blur or motion blur.

Experimental setup
Our blur detection model is based on the TensorFlow (Abadi et al., 2016) platform. The weights of the Conv1 to Conv5 layers were initialised with the pre-trained weights of the VGG-16 network (Simonyan & Zisserman, 2015). All weights of the new layers were randomly initialised from a truncated normal distribution (standard deviation 0.01). The Adam optimizer (Kingma & Ba, 2015) is adopted for training, with the initial learning rate set to 10^-6.
To avoid over-fitting, we selected the first 500 images of Shi's dataset as the training set, the following 100 images as the validation set, and the remaining 400 images as the test set. The training images were resized to 352 × 352.
Our model was trained on a personal computer with an Intel i7-8750H CPU, 8 GB of memory and an Nvidia GeForce GTX1060 GPU. Training took about 12 h for 20 epochs, and generating the estimated blur map for a 352 × 352 test image takes only 0.2 s.

Evaluation metrics
In the experiments, six metrics (Ma et al., 2018) are used: the precision-recall curves, the average of the Precision values, the average of the Recall values, the mean absolute error, the maximum F-measure score and the structural measure. For convenience, they are abbreviated as PR curves, Avg(Precision), Avg(Recall), MAE, Max-F and S-measure, respectively.
To illustrate how the different metrics evaluate different methods, Figure 5 gives an example. Given an input image and the corresponding ground truth (GT), the results of six methods are shown, and two metrics (MAE and S-measure) are used to evaluate their performance. In the first row, since a smaller MAE is better, the six methods are sorted from best to worst from left to right. The S-measure, in contrast, is better when larger; as shown in the second row, the best method is again placed leftmost. According to MAE, the Su et al. (2011) method is better than the Shi et al. (2014) method, but according to the S-measure, the Shi et al. (2014) method is better. Evaluating the performance from different views therefore gives a comprehensive comparison of all the methods. In Figure 5, our method achieves the highest S-measure and the lowest MAE, so its comprehensive performance is the best.
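Two of the metrics can be sketched directly: MAE is the mean absolute difference between the (continuous) blur map and the binary ground truth, and the F-measure combines precision and recall at a binarisation threshold. The beta^2 = 0.3 weighting below is common in this literature but is an assumption here, not taken from the text.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a blur map and the ground truth."""
    return np.mean(np.abs(pred - gt))

def f_measure(pred, gt, thresh=0.5, beta2=0.3):
    """Weighted F-measure of the binarised blur map against a boolean GT."""
    binary = pred >= thresh
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

gt = np.zeros((8, 8), dtype=bool); gt[:, 4:] = True
perfect = gt.astype(float)
print(mae(perfect, gt.astype(float)))   # 0.0
print(f_measure(perfect, gt))           # 1.0
```

Max-F is then the maximum of this F-measure over all binarisation thresholds.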

Effectiveness of the boundary penalty
To demonstrate the effectiveness of the newly added boundary penalty, we removed it from the loss function and kept the rest of the model unchanged. We then trained this "ours_1" network for 20 epochs and report the quantitative results in Table 2. In terms of MAE, the boundary penalty improves our method by about 8% on Shi's dataset. Figure 6 compares the results of our model without and with the boundary penalty: with the penalty, the result is clearer at the object boundaries and the background is also cleaner.

Effectiveness of the contrast layers
To demonstrate the effectiveness of the contrast layers, we removed them from the network and connected the output of the VGG-16 network to the deconvolutional layers with a Conv(1 * 1) skip connection. We then trained this "ours_2" network for 20 epochs and report the quantitative results in Table 2. In terms of MAE, the contrast layers improve our method by about 5% on Shi's dataset. Figure 7 compares the results of our model without and with the contrast layers: benefiting from the contrast layers, the result is more complete and closer to the ground truth.

Objective comparisons
We compared our model with seven other methods on Shi's dataset: FFT (Chakrabarti et al., 2010), Liu et al. (2008), Su et al. (2011), Shi et al. (2014), JNB (Shi et al., 2015), LBP (Yi & Eramian, 2016) and HiFST (Golestaneh & Karam, 2017). Table 3 shows the performance comparison on Shi's dataset (Shi et al., 2014) using the MAE, Max F-measure, Average Precision, Average Recall and S-measure metrics. As shown, our method achieves the best performance in terms of Average Precision, MAE, Max F-measure and S-measure. In particular, our method improves the MAE by over 30% and the S-measure by over 21% compared with the previous methods. In terms of Avg(Recall), the HiFST method scores higher than ours, but its Avg(Precision) is lower. In other words, HiFST recovers most of the truly blurred pixels, but many of its positive predictions are wrong, which suggests that it over-predicts the blurred regions. The Max F-measure balances Precision and Recall; in terms of Max F-measure, our method is comparable to HiFST, but our results have clearer boundaries and cleaner backgrounds, which matters greatly for applications. Figure 8 shows the PR curves of the different methods on Shi's dataset; the PR curves of the HiFST method and our method are clearly higher than the others. Furthermore, Figure 9 shows the F-measure, precision and recall values of the different methods on Shi's dataset. Although the HiFST method achieves the highest recall, its precision is lower. Our method obtains the highest precision while maintaining a high recall, and consequently its F-measure is also the highest.
Overall, based on these quantitative evaluations on Shi's dataset (Shi et al., 2014), our blur detection method achieves higher precision and recall, higher Max F-measure and S-measure scores, and a lower mean absolute error than the other methods.

Subjective comparisons
To further illustrate our advantages, subjective comparisons on Shi's dataset are shown in Figures 10 and 11.
In Figure 10, the blurriness of the images is caused by defocus or by motion. All these images have something in common: a clear object in front of a blurred background. For this kind of image, the blur detection task is much easier, and some traditional methods, such as Su et al. (2011), Liu et al. (2008) and JNB (Shi et al., 2015), can generate good results with hand-crafted features. But as shown, their common disadvantage is that the backgrounds in their results are messy and the detected objects are incomplete. In contrast, the clear objects detected by our method are more complete and the blurred backgrounds are much cleaner.
In Figure 11, all the original images contain blurred regions. Because the blurred parts are regions without specific shapes or structures, the performance of the traditional methods declines considerably, especially in the 3rd, 6th, 7th and 8th examples. In these difficult cases, our method still performs well: it locates the blurred regions and highlights them completely. Benefiting from the newly added boundary-aware loss function, our results have clearer boundaries and are much closer to the ground truths.

Running time comparison
Table 4 compares the running times of the methods published during the last four years with the highest performance. The proposed method keeps a good balance between fast running speed and high detection accuracy.

Conclusions
In this paper, we presented a novel boundary-aware multi-scale deep network for blur detection. Multi-scale feature extraction benefits from the VGG-16 network. Instead of Conv(1 * 1) skip connections, contrast layers are used to enlarge the margin between blurred and clear areas. To recover the resolution of the input image, step-by-step deconvolutional layers are introduced. Finally, a boundary-aware loss function refines the results. Experiments on a large public dataset demonstrate that our method achieves fast evaluation speed and superior performance to the other methods.

Note
1. http://www.cse.cuhk.edu.hk/leojia/projects/dblurdetect/dataset.html.

Disclosure statement
No potential conflict of interest was reported by the authors.