Visual inspection of surface defects of extreme size based on an advanced FCOS

ABSTRACT Surface defects of industrial products are generally detected through anchor-based object detection methods during manufacturing. However, these methods are prone to missed and false detections for ultra-elongated and ultra-fine defects. To address this problem, an advanced fully convolutional one-stage object detector (FCOS) is proposed, based on the anchor-free FCOS network model. First, a novel type of center-ness is proposed to reduce the suppression of off-centered positions of defects of extreme size. In addition, to eliminate background interference, a self-adaptive center sampling method is proposed as a replacement for the conventional center sampling method. The regularization method and the loss function are also improved according to the defect characteristics. Experimental results show that this advanced-FCOS-based method outperforms anchor-based methods on the surface defect dataset. The proposed method effectively detects defects of extreme size without affecting the detection of normal defects. The performance of the proposed method meets the requirements of real industrial applications.


Introduction
Surface defects not only affect the appearance of a product but may also cause serious safety problems during use. In recent years, machine vision-based detection methods (Wang et al. 2018; Kwon et al. 2015) have received extensive attention due to their high detection accuracy and fast detection speed. This type of method first collects product surface images via an industrial camera and then processes the images with conventional image processing or deep learning methods to obtain the corresponding results.
Compared with conventional image processing methods, deep learning-based defect detection methods have a wider range of adaptability. A network can be trained on samples with different types of defects for use in the detection of multiple types of defects. Liu and Kang (2005) proposed a neural-network-based method for cold rolled strips, but this method cannot effectively locate defects. With the introduction of object detection frameworks, deep learning methods can be used to accurately locate various defects. Ji, Du, and Peng et al. (2019) used the Faster region-based convolutional neural network (R-CNN) to detect defects in gears, which is faster and more accurate than previous methods. Zhang and Huang (2020) integrated Faster R-CNN and You Only Look Once (YOLO) v3 for the detection of aluminum surface defects.
The abovementioned methods are anchor-based. These methods can often effectively detect defects with a normal aspect ratio. However, they can only partially detect or even fail to detect defects of extreme size, such as elongated defects or microdefects. In this study, a defect with a label box aspect ratio greater than n (e.g., n > 5) or a label box area satisfying the ratio of the number of pixels in the label box to the number of pixels in the original image ≤ f (e.g., f = 1 × 10⁻⁴) is defined as a defect of extreme size. This type of defect often exists on large parts, such as engine blades or wind turbine blades. Normal defects with a large overall size are also considered to be extreme-size defects, for which anchor-based methods cannot enumerate all the label boxes. Therefore, an advanced-FCOS-based anchor-free detection method is proposed in this study to solve this problem.
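The definition of an extreme-size defect given above can be written as a small predicate. The following is a minimal sketch, not the authors' code: the function name `is_extreme_size` is hypothetical, the thresholds default to the example values from the text, and the aspect ratio is assumed to be taken as long side over short side so that horizontal and vertical defects are treated alike.

```python
def is_extreme_size(box_w, box_h, img_w, img_h, n=5.0, f=1e-4):
    """Return True if a label box counts as an extreme-size defect.

    n and f default to the example thresholds quoted in the text.
    """
    if min(box_w, box_h, img_w, img_h) <= 0:
        raise ValueError("all sizes must be positive")
    # Assumption: aspect ratio is long side / short side (orientation-free).
    aspect_ratio = max(box_w, box_h) / min(box_w, box_h)
    # Fraction of the image pixels covered by the label box.
    area_ratio = (box_w * box_h) / (img_w * img_h)
    return aspect_ratio > n or area_ratio <= f


# Illustrative boxes on a 2560 x 1920 image (sizes are made up).
print(is_extreme_size(600, 12, 2560, 1920))   # elongated scratch
print(is_extreme_size(20, 20, 2560, 1920))    # tiny spot
print(is_extreme_size(300, 200, 2560, 1920))  # ordinary defect
```

With these illustrative sizes the first two boxes qualify (aspect ratio 50, and an area ratio of about 8 × 10⁻⁵), while the third does not.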
The main contributions of this paper are as follows:
(1) An anchor-free detection method is proposed to solve the problem that extreme-size targets are difficult for existing networks to detect, without affecting the detection of normal-size defects.
(2) To alleviate excessive suppression of slender defects by the original center-ness, the center-ness index term is modified to improve adaptability to slender defects.
(3) Adaptive central sampling is proposed to reduce the loss of information for extreme-size defects caused by central sampling.
The rest of the paper is organized as follows: the recent research advances in this field are described in Section 2. An overview of the methodology of the advanced FCOS is presented in Section 3. Experimental results are presented and discussed in Section 4. Conclusions are presented in Section 5.

Related Work
Object detection in machine vision refers to finding the position of an object of interest in an image and classifying the object. This task is highly challenging because of the large variety and complexity of shapes of objects and the presence of background interference in industrial scenes.
Conventional object detection uses a sliding window in conjunction with a classifier. Each time the sliding window slides to a region, the classifier determines the category of the region. Chen and Liu (2007) and Han and Liao (2009) used the Haar feature and the AdaBoost classifier to detect human faces. Bauer, Köhler, and Doll et al. (2010) proposed a pedestrian detection method based on a support vector machine (SVM). Wang, Jia, and Huang et al. (2008) and Gan and Cheng (2011) investigated object detection methods based on a histogram of oriented gradients (HOG) for pedestrian detection. These methods require feature representation to be manually designed based on experience.
Deep learning methods were first used in image classification. AlexNet (Krizhevsky, Sutskever, and Hinton 2017), VGG (Simonyan and Zisserman 2014), and ResNet (He, Zhang, and Ren et al. 2016) were shown to far outperform other conventional image classification methods on the ImageNet dataset. Subsequently, an R-CNN (Girshick, Donahue, and Darrell et al. 2014) was used to locate objects using a selective search algorithm with an SVM classifier, and a deep convolutional network was used for end-to-end object detection. Generally, deep-learning-based object detection methods are classified into two categories. One category includes two-stage methods, such as Fast R-CNN (Girshick 2015) and Faster R-CNN (Ren, He, and Girshick et al. 2015). The other category includes one-stage methods, such as YOLO (Redmon and Farhadi 2017; Redmon, Divvala, and Girshick et al. 2016) and single-shot detection (SSD) (Liu, Anguelov, and Erhan et al. 2016). Two-stage methods are based on R-CNN. These methods first generate the object candidate box, then classify the candidate box, and perform regression on the coordinate offset of the candidate box. These methods are more accurate but less efficient. In contrast, the one-stage methods directly regress the object position and the probability of its category. Although the one-stage methods generally have lower accuracy than the two-stage methods, their detection speed is faster. A feature pyramid network (FPN) (Lin, Dollar, and Girshick et al. 2017) was proposed to simultaneously make predictions on multiple scales, thus improving the performance of the network for detecting small objects. The methods mentioned above are all anchor-based.
An anchor is a prediction box with a fixed shape and size obtained by clustering ground truth labeled boxes, which can avoid blind searching during model training and help the model converge quickly. For example, Faster R-CNN needs manual specification of the anchor parameter ratio and scale and obtains a series of anchors through different combinations of these two parameters, while YOLOv3 (Redmon and Farhadi 2018) obtains the widths and heights (dimension clustering) of representative shapes to form the anchors by clustering the ground truth of all samples in the training set via the k-means algorithm. These anchors mark the detection object in a rectangular box, and the horizontal and vertical axes of the box are parallel to the horizontal and vertical directions of the image. The shape and size of the rectangular box need to be determined by prior knowledge from the ground truth. Therefore, when the size and aspect ratio of the object change drastically, the performance of the model declines sharply. For example, when there are defects such as elongated scratches and microspots, the anchor-based method is prone to missed detection and misdetection. In addition, the exhaustive number of rectangular boxes must be increased to improve the detection accuracy for such defects, which increases the detection time. The experiments in this paper confirm that Faster R-CNN and YOLO have limitations in detecting defects of extreme size.
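To make the mismatch between fixed anchors and elongated defects concrete, the sketch below builds nine anchor shapes from three scales and three aspect ratios and measures how well the best one overlaps a long, thin box. The scale and ratio values are the defaults of the original Faster R-CNN paper and are an assumption here, since the text does not list them. The 400 × 10 scratch is an illustrative size, and `make_anchor_shapes` and `shape_iou` are hypothetical helper names.

```python
import math


def make_anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (w, h) for every scale/ratio pair; ratio = h / w, area = scale**2."""
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s / math.sqrt(r), s * math.sqrt(r)))
    return shapes


def shape_iou(wh_a, wh_b):
    """IoU of two boxes sharing the same center, so only the shapes matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


anchors = make_anchor_shapes()
scratch = (400.0, 10.0)  # illustrative elongated defect, aspect ratio 40
best = max(shape_iou(scratch, a) for a in anchors)
print(len(anchors), round(best, 3))
```

Even under the most favorable alignment, the best of the nine anchors overlaps this scratch with an IoU of roughly 0.1. That is well below the overlap thresholds commonly used to assign positive anchors, and it is consistent with the missed detections described above.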
Therefore, to achieve better detection results, the anchor values should be set properly or a customized anchor-based architecture should be built. DefectDet (Duje et al. 2022) modified the detection head to improve the detection of objects with extreme aspect ratios, which are common in UT images. Another way is to free the network from the constraint of anchors. DenseBox (Huang, Yang, and Deng et al. 2015), the first anchor-free model, introduced the concept of a fully convolutional network (FCN) (Shelhamer, Long, and Darrell 2017) into object detection. The model directly predicts a bounding box and a confidence score for each pixel and obtains the final result through NMS. CenterNet (Zhou, Wang, and Krahenbühl 2019) and FCOS (Tian, Shen, and Chen et al. 2019) are the most representative anchor-free methods. CenterNet represents the object by its center point and ultimately returns the position of the center point and the object size. FCOS is similar to CenterNet, but it returns a series of points close to the center point of the object and the distances from each point to the sides of the object bounding box. In CenterNet, a target corresponds to a local peak point on the feature map output by the network, and the network does not provide an effective solution for overlapping targets. Comparing the two methods, FCOS is more likely to be able to detect objects of extreme size. Therefore, FCOS is used in this study as the base framework for the detection of defects in industrial products, and the original network structure is modified to improve the detection performance for extreme-size defects.

Original FCOS
For the anchor-based methods, it is necessary to generate a large number of anchors and gradually fit an object through these anchors, while an FCOS predicts the sampling points by using the concept of FCN to obtain the corresponding category of each sampling point and the distances from each sampling point to the four sides of the corresponding object bounding box, as shown in Figure 1.
Let $F_i \in \mathbb{R}^{H \times W \times C}$ represent the feature map of the ith layer in the convolutional neural network (CNN) backbone and s represent the total stride from the input image to this layer. In the $F_i$ layer, a position (x, y) corresponds to an area centered at (s/2 + xs, s/2 + ys) on the input image.
For the FCOS, if the position (x, y) falls in a certain ground truth box, it is considered a positive sample belonging to this category; otherwise, it is regarded as a negative sample, that is, background. In the FCOS, C binary classifiers (C is the total number of categories) are trained rather than one multiclassifier. Moreover, to describe the prediction results, in addition to the classification label of the position, the FCOS regresses a four-dimensional vector (L, T, R, B), where L, T, R, and B represent the distances from the position to the four sides of the bounding box, as shown in Figure 1. The network structure of the FCOS in Figure 2 shows that regression and classification are output as two branches. As the distance obtained by the regression is always positive, the output of the model is mapped through the exponential function at the top of the regression branch. If a position exists in multiple ground truth boxes at the same time, its category attribution is ambiguous, that is, a position belongs to multiple categories at the same time; however, the final output of the classifier should be one category.
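The mapping from a feature-map position to the input image and the four-dimensional regression target can be sketched as follows. This is an illustrative sketch with hypothetical function names. It assumes integer pixel coordinates (so s/2 is taken as an integer division) and boxes given as (x0, y0, x1, y1).

```python
def location_to_image(x, y, stride):
    """Map feature-map position (x, y) to the center of its area on the input image."""
    return (stride // 2 + x * stride, stride // 2 + y * stride)


def regression_target(px, py, box):
    """Distances (L, T, R, B) from image point (px, py) to the four box sides.

    Returns None when the point lies outside the box, i.e. a negative sample.
    """
    x0, y0, x1, y1 = box
    l, t, r, b = px - x0, py - y0, x1 - px, y1 - py
    if min(l, t, r, b) <= 0:
        return None
    return (l, t, r, b)


# Position (10, 13) on a stride-8 feature map and an illustrative thin box.
px, py = location_to_image(10, 13, stride=8)
print((px, py), regression_target(px, py, (50, 100, 450, 110)))
```

For this example the position maps to image point (84, 108), which lies inside the box, so it is a positive sample with target (34, 8, 366, 2).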
To eliminate the ambiguity of overlapping objects, the bounding box with minimal area is chosen as the regression target. Moreover, the FPN structure is introduced in the FCOS, with different levels of the FPN predicting objects of different sizes. The FPN has a total of five feature levels, namely, P3, P4, P5, P6, and P7. By limiting the regression results of each layer (i.e., L, R, T, and B in Figure 1), the model assigns the task of predicting objects of a different size to each layer, thereby avoiding the problem of overlap between objects of different sizes. Specifically, if $\max(L, T, R, B) > m_i$ or $\max(L, T, R, B) < m_{i-1}$, this position is set as the background in this layer. Here, $m_i$ represents the maximum regression distance of the ith feature layer, and $m_2$, $m_3$, $m_4$, $m_5$, $m_6$, and $m_7$ are generally set to 0, 64, 128, 256, 512, and ∞, respectively.
In addition to the output classification and regression common to the object detection models, the FCOS outputs the center-ness to suppress the bounding boxes that are close to the edge of the object. The center-ness measures how close a position is to the center of the object. The center-ness of the object center is 1, and the greater the distance of the position from the center, the smaller its center-ness. During inference, the center-ness is combined with the category confidence to calculate a final score. At the nonmaximum suppression (NMS) stage, filtering is performed based on the final score so that the prediction boxes that are far from the center can be filtered out. The expression of the center-ness $c_{x,y}$ is as follows:

$$c_{x,y} = \sqrt{\frac{\min(L, R)}{\max(L, R)} \times \frac{\min(T, B)}{\max(T, B)}} \quad (1)$$

In Equation (1), the square root operation reduces the attenuation rate of the center-ness. Since the value range of the center-ness is [0, 1], binary cross-entropy loss is used for training.
In the testing phase, the ranking score used in the NMS is taken as follows:

$$s_{x,y} = \sqrt{p_{x,y} \times c_{x,y}} \quad (2)$$

where $p_{x,y}$ represents the classification score. The loss function of the FCOS is as follows:

$$L = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(x, y) + \frac{\lambda}{N_{pos}} \sum_{(x,y) \in \mathrm{pos}} L_{reg}(x, y) \quad (3)$$

where $L_{cls}$ is the classification loss, which is represented by the focal loss (Lin et al. 2017), and $L_{reg}$ is the regression loss, which is summed over the positive samples only. $N_{pos}$ represents the number of positive samples, and positive samples are the sample points that fall within the label box. λ is a hyperparameter with a default value of 1.
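The two assignment rules described above, choosing the smallest enclosing box for an ambiguous position and sending each position to the FPN level whose range contains its largest regression distance, can be sketched as below. The function names are hypothetical. A distance that falls exactly on a boundary satisfies two adjacent ranges, and this sketch simply returns the lower level.

```python
import math

# m_2 ... m_7 from the text; level P_i handles m_{i-1} <= max(L, T, R, B) <= m_i.
M = (0, 64, 128, 256, 512, math.inf)
LEVELS = ("P3", "P4", "P5", "P6", "P7")


def assign_level(l, t, r, b):
    """Return the FPN level responsible for a position with target (L, T, R, B)."""
    d = max(l, t, r, b)
    for i, name in enumerate(LEVELS):
        if M[i] <= d <= M[i + 1]:
            return name
    return None  # not reachable for non-negative distances


def pick_box(px, py, boxes):
    """Among boxes (x0, y0, x1, y1) containing the point, pick the smallest area."""
    inside = [b for b in boxes if b[0] < px < b[2] and b[1] < py < b[3]]
    if not inside:
        return None  # background
    return min(inside, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


print(assign_level(34, 8, 366, 2))  # P6, since 256 <= 366 <= 512
print(pick_box(84, 108, [(0, 0, 500, 500), (50, 100, 450, 110)]))
```

Note that a thin scratch is routed to a coarse level (here P6) because of its long side alone, even though its short side is only a few pixels.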
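As a small numerical illustration, the sketch below computes the center-ness of the original FCOS (Tian et al. 2019), namely the square root of the product of the min/max ratios of opposite side distances, together with the NMS ranking score of Equation (2). The function names and the example values are illustrative.

```python
import math


def centerness(l, t, r, b):
    """Original FCOS center-ness of a position with side distances (L, T, R, B)."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def ranking_score(p, c):
    """NMS ranking score: geometric mean of classification score and center-ness."""
    return math.sqrt(p * c)


c_center = centerness(200, 5, 200, 5)  # exactly at the box center
c_off = centerness(34, 8, 366, 2)      # off-center position in a thin box
print(c_center, round(c_off, 3))
print(round(ranking_score(0.81, c_center), 3), round(ranking_score(0.81, c_off), 3))
```

With the same classification score of 0.81, the off-center position is ranked far lower than the centered one, so its prediction box is more likely to be discarded.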

Improved Center-Ness
For elongated defects, such as the scratches shown in Figure 3(a), a very large aspect ratio causes the center-ness defined in (1) to be very sensitive to changes along the short side but much less sensitive to changes along the long side. Thus, a large number of bounding boxes close to the ground truth are suppressed in the NMS stage, resulting in missed detection of elongated defects. As shown in Figure 3(b), the predicted values in the narrow red range at the center are retained, while the predicted values in the surrounding large range are suppressed and filtered in the NMS stage. In fact, these positions can also describe the defect well and are not the so-called low-quality prediction boxes. Consequently, a novel definition of center-ness is proposed: where h and w denote the height and the width of the bounding box, respectively. When the aspect ratio increases, α decreases, and in this manner, the suppression of the prediction boxes of elongated defects can be weakened. For nonelongated defects, the improved center-ness can also suppress the off-center bounding boxes as the original center-ness does. As shown in Figure 3(c), the improved center-ness has a better tolerance in the width direction, which weakens the excessive suppression of the predicted values in the width direction of elongated defects by the original center-ness. In addition, the center-ness and classification modules in the conventional FCOS share parameters, as Figure 2 shows. In this study, the center-ness branch and regression branch are implemented together, as in Tian et al. (2020), after the center-ness is improved. The advanced FCOS network structure is shown in Figure 4.
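The sensitivity of the original center-ness to the short side can be checked with a short worked example. It assumes the original FCOS center-ness (the square root of the product of the min/max ratios of opposite side distances) and an illustrative 400 × 10 label box; the improved center-ness is not evaluated here.

```latex
\begin{align*}
% (a) position at the box center: L = R = 200, T = B = 5
c_{(a)} &= \sqrt{\tfrac{200}{200} \cdot \tfrac{5}{5}} = 1 \\
% (b) shifted 3 px along the short side: T = 2, B = 8
c_{(b)} &= \sqrt{\tfrac{200}{200} \cdot \tfrac{2}{8}} = \sqrt{0.25} = 0.5 \\
% (c) shifted 3 px along the long side: L = 197, R = 203
c_{(c)} &= \sqrt{\tfrac{197}{203} \cdot \tfrac{5}{5}} \approx 0.985 \\
% (d) shifted 120 px along the long side: L = 80, R = 320
c_{(d)} &= \sqrt{\tfrac{80}{320} \cdot \tfrac{5}{5}} = \sqrt{0.25} = 0.5
\end{align*}
```

A 3-pixel shift along the short side halves the center-ness, whereas a 120-pixel shift along the long side is needed for the same drop. Positions only a few pixels from the long axis of the scratch are therefore penalized as heavily as positions far along it.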

Self-Adaptive Center Sampling
In the conventional FCOS, the sample points that fall into the ground truth labeled box are treated as positive samples, which causes many positive samples to deviate from the center of the object. These samples introduce a large quantity of background information and affect the detection results, as shown in Figure 5(a). Tian et al. (2020) used center sampling to mitigate this problem, as shown in Figure 5(b). Specifically, only the points in the central region of the object are treated as positive samples. The central region is defined as $(c_x - rs,\ c_y - rs,\ c_x + rs,\ c_y + rs)$, where $c_x$ and $c_y$ represent the abscissa and ordinate of the center point, respectively, s represents the stride of the FPN layer, and r is a hyperparameter with a default value of 1.5. However, this sampling method loses most of the information for elongated objects. Therefore, a self-adaptive center sampling method is proposed in this study to redefine the center region as $(c_x - rs,\ c_y - \rho rs,\ c_x + rs,\ c_y + \rho rs)$ when ρ ≤ 1, or $(c_x - \rho rs,\ c_y - rs,\ c_x + \rho rs,\ c_y + rs)$ when ρ > 1, where ρ represents the aspect ratio of the label box and the definitions of the remaining symbols remain unchanged, as shown in Figure 5(c). The region thus adapts to the direction in which the object is elongated according to ρ. After the improvement, as the aspect ratio of the label box changes, the central area changes accordingly so that the part that deviates in the length direction can also be used as a positive sample instead of the background.
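The two sampling regions can be compared with a short sketch that follows the definitions in the text as written. Two points are assumptions rather than statements from the paper: ρ is taken to be width divided by height, and the region is clipped to the label box, as is common practice in center sampling implementations. The function name and the example box are illustrative.

```python
def center_region(box, stride, r=1.5, adaptive=True):
    """Return the positive-sample region (x0, y0, x1, y1) for one label box."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_x = half_y = r * stride  # conventional center sampling
    if adaptive:
        rho = (x1 - x0) / (y1 - y0)  # assumption: rho = width / height
        if rho <= 1:
            half_y = rho * r * stride
        else:
            half_x = rho * r * stride
    # Assumption: clip the region so it never extends beyond the label box.
    return (max(cx - half_x, x0), max(cy - half_y, y0),
            min(cx + half_x, x1), min(cy + half_y, y1))


scratch = (0, 100, 400, 110)  # illustrative horizontal scratch, rho = 40
print(center_region(scratch, stride=8, adaptive=False))
print(center_region(scratch, stride=8, adaptive=True))
```

For this scratch the conventional region covers only 24 of the 400 pixels along the length, whereas the adaptive region extends over the whole length of the box.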

Using the GIoU Loss Function
The regression branch of the conventional FCOS uses the intersection over union (IoU) loss as the loss function (Yu, Jiang, and Wang et al. 2016). However, there are two problems with using IoU loss as the loss function. First, when two bounding boxes do not intersect, the IoU is 0 regardless of the distance between them. The closer the bounding boxes are, the more accurate the prediction of the model should be considered, but the IoU loss cannot reflect this trend. Second, when two bounding boxes intersect, the prediction accuracy is related not only to the intersection area but also to the intersection position, which is also not reflected by the IoU loss. Therefore, this study uses the generalized intersection over union (GIoU) loss (Rezatofighi, Tsoi, and Gwak et al. 2019) instead of IoU loss.
Assuming that there are two bounding boxes A and B, then

$$IoU = \frac{|A \cap B|}{|A \cup B|}$$

The minimum closure region of A and B, that is, the smallest box enclosing both, is defined as C; then, the GIoU formula is as follows:

$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$$

The GIoU loss is hence as follows:

$$L_{GIoU} = 1 - GIoU$$

Through the definition of GIoU, it can be seen that GIoU takes into account both the overlapping and nonoverlapping regions of two intersecting bounding boxes as well as the situation of two bounding boxes that do not intersect, which makes up for the deficiency of using IoU as the loss function.
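The behaviour described above can be checked with a minimal sketch of IoU and GIoU for axis-aligned boxes, following the standard GIoU definition of Rezatofighi et al. (2019): IoU minus the fraction of the enclosing box not covered by the union. The function names and the box coordinates are illustrative.

```python
def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes given as (x0, y0, x1, y1)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Intersection; width and height are clamped at zero for disjoint boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest axis-aligned box C enclosing both A and B.
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    iou = inter / union
    return iou, iou - (c_area - union) / c_area


def giou_loss(a, b):
    """GIoU loss, 1 - GIoU, which lies in the range [0, 2]."""
    return 1.0 - iou_and_giou(a, b)[1]


box = (0, 0, 1, 1)
near, far = (2, 0, 3, 1), (5, 0, 6, 1)  # both disjoint from box
print(iou_and_giou(box, near), iou_and_giou(box, far))
print(round(giou_loss(box, near), 3), round(giou_loss(box, far), 3))
```

Both disjoint boxes have an IoU of 0, but the GIoU loss is smaller for the nearer one, so the loss still distinguishes between predictions that do not overlap the target.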

Group Normalization
The original FCOS performs batch normalization of the dataset (Ioffe and Szegedy 2015) and limits the results to a specific range to exclude singular data. In industrial detection, most collected images contain more than one million pixels. In training, only a small batch size can be used, and the data are often highly imbalanced. In this case, batch normalization not only affects the network performance but also causes the mean and variance of the data to deviate from the original values, which affects the training results. Group normalization (GN) can solve the problems encountered with using batch normalization for small batches (Wu and He 2018). The data dimension of the neural network is generally expressed in the form of [N, C, H, W]
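The reason group normalization is insensitive to batch size can be seen from a minimal sketch that normalizes a single sample of shape [C, H, W]: the statistics are computed per sample over groups of channels, so the batch dimension N never enters. This is an illustrative pure-Python version that omits the learnable per-channel scale and shift. The function name, the group count, and the toy values are assumptions.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample given as C channels, each a flat list of H*W values."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    per_group = num_channels // num_groups
    out = []
    for g in range(num_groups):
        channels = sample[g * per_group:(g + 1) * per_group]
        values = [v for ch in channels for v in ch]
        # Mean and variance over all channels and pixels of this group.
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in channels])
    return out


# Toy sample with C = 4 channels of 2 pixels each, split into 2 groups.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normed = group_norm(sample, num_groups=2)
print([[round(v, 3) for v in ch] for ch in normed])
```

Because the function takes one sample at a time, its output is identical whether that sample is trained in a batch of 1 or of 64. This property is what makes the approach attractive when large images force a small batch size.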

Experiments and Results
The software and hardware platforms used in our experiment include an NVIDIA GTX 1080Ti as the GPU, Ubuntu 18.04 as the operating system, and PyTorch as the deep learning framework.

Experimental Dataset
The data used in this paper are from the Tianchi aluminum surface defect dataset (Tianchi 2018). The image resolution is 2560 × 1920. Since there are many types of defects in the original dataset and most of them are defects of common size that are easy to detect, the low accuracy and low recall rate of defects of extreme size are masked in the evaluation of the results. Hence, we modify the original dataset by considering two representative types of defects, that is, those with the largest aspect ratios and smallest resolutions (scratches and spots), and use two types of typical defects (wrinkles and bumps) as controls. These training data are enhanced by flipping and rotating. Eventually, 1600 images are obtained. Each of these 1600 images is manually labeled. Sixty percent of the dataset is used as the training set, and the validation and test sets each account for 20%. Sample images of the four types of defects are shown in Figure 6.
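The 60/20/20 split described above can be reproduced with a few lines. This is a sketch under stated assumptions: the paper does not say how images were assigned to the subsets, so the shuffled split, the seed value, and the placeholder file names are all hypothetical.

```python
import random


def split_dataset(items, train=0.6, val=0.2, seed=0):
    """Shuffle items deterministically and split into train / validation / test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # the seed is an arbitrary choice
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Placeholder names standing in for the 1600 labeled images.
images = [f"img_{i:04d}.jpg" for i in range(1600)]
train_set, val_set, test_set = split_dataset(images)
print(len(train_set), len(val_set), len(test_set))
```

For 1600 images this yields 960 training, 320 validation, and 320 test images.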

Evaluation Indicators for Detection Results
The precision, recall, and mean average precision (mAP) are common indicators used to evaluate object detection performance. The formulas for the precision and recall are as follows:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

where the true positive (TP) represents the number of defective areas detected as defects, the false positive (FP) represents the number of nondefective areas detected as defects, and the false negative (FN) represents the number of defective areas detected as nondefective areas (Ren and Xue 2020; Wang et al. 2020). The precision-recall curve for a defect is plotted, and the area under the curve is taken as the average precision (AP) of the defect. The mean of the AP values of all defects is taken as the mAP.
In this study, the AP and mAP are used to evaluate the detection results.
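The indicators above can be sketched in a few lines. The AP here is computed as the area under the precision-recall curve by summing precision times the recall increment over detections sorted by confidence. This is one common convention and not necessarily the exact interpolation used in the paper. Matching detections to ground-truth boxes is outside the sketch, so each detection is assumed to be already marked as a hit or a miss.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def average_precision(detections, num_gt):
    """detections: list of (confidence, is_true_positive); num_gt: labeled defects."""
    ap, tp, fp, prev_recall = 0.0, 0, 0, 0.0
    for _, hit in sorted(detections, key=lambda d: d[0], reverse=True):
        if hit:
            tp += 1
        else:
            fp += 1
        rec = tp / num_gt
        ap += precision(tp, fp) * (rec - prev_recall)  # area of one PR step
        prev_recall = rec
    return ap


# Toy example: three detections for a class with two labeled defects.
dets = [(0.9, True), (0.8, False), (0.7, True)]
ap = average_precision(dets, num_gt=2)
print(round(ap, 4))


def mean_ap(ap_values):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_values) / len(ap_values)
```

In the toy example the first hit contributes 1.0 × 0.5 and the second contributes (2/3) × 0.5, giving an AP of about 0.833.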

Analysis of Experimental Results
During training, the batch size is set to 16, the total number of training iterations is 20,000, and stochastic gradient descent (Theodoridis 2015) is used for optimization. The momentum factor μ is 0.9, and the weight decay coefficient ω is 0.0001. In addition, ResNet101 is used as the model backbone. The training loss curve is shown in Figure 7. The loss curve in the figure is obtained by sampling once every 200 iterations.

Comparison of IoU Loss and GIoU Loss
To verify the effectiveness of the GIoU loss in place of the IoU loss, the GIoU loss and IoU loss are substituted into the original FCOS framework for comparison. The results are shown in Table 1. The AP values of all four types of defects (especially those of spots and bumps) evidently increase. The experimental results show that the use of GIoU loss can increase the mAP by 2.9%.

Comparison of Batch Normalization and Group Normalization
For microdefects, such as spots in the dataset, the use of compressed images inevitably leads to the loss of defect information. Therefore, it is necessary to input the original-size image into the network. As a result, only a small batch size can be used for training, which makes batch normalization inferior to group normalization. Based on the advanced FCOS model discussed in 3.3.1, we replace batch normalization with group normalization. The comparative experimental results are shown in Table 2. The AP of spot detection is significantly improved, the AP of the detection of the other three types of defects is also improved, and the overall mAP is increased by 1.5%.

Comparison Between the Improved Center-Ness and Original Center-Ness
In the dataset, most scratches are defects of extreme size that lie near the widest part of the image. The improved center-ness and the original center-ness are compared. The results show that after the center-ness is improved, the AP of scratch detection is increased by 4.4%. As shown in Table 3, the improved center-ness optimizes the detection of elongated defects by FCOS without affecting the detection of other types of defects, and the overall mAP is increased by 1.3%.

Experimental Analysis of Self-Adaptive Center Sampling
The comparative experimental results of self-adaptive center sampling and center sampling are shown in Table 4. Since self-adaptive center sampling is also an improvement targeting elongated defects, it effectively improves the AP of scratch detection from 58.2% to 62.0%. The overall mAP increases from 74% to 75.4%.

Overall Analysis of Experimental Results
The experimental results are visualized in Figure 8. The results of the entire ablation experiment are presented in Table 5 and show that, based on the original FCOS framework, the performance for detecting defects of extreme size can be optimized for each improvement experiment and the accuracy of detection of normal-size defects can be improved to some extent. For ultraelongated scratches, the AP can reach 62%, with an increase of 11.8%, while the AP of detecting microspots can even exceed 80%, with an increase of 10.5%. For the detection of wrinkles and bumps (defects of common size), the AP increases by 1.8% and 4.2%, respectively.

Comparison of Advanced FCOS and Other Anchor-Based Methods
Figure 8. Visualization of the detection results, where defects 1 to 4 represent scratches, spots, wrinkles, and bumps, respectively.
For the anchor-based methods, it is necessary to define the anchors based on the defects. Faster R-CNN defines nine anchors of fixed size. YOLOv3 and YOLOv5 obtain the appropriate anchors through clustering. YOLOv5x is the version with the strongest detection capability among the YOLOv5 series. Figure 9 is a scatter plot of the defect size distribution in the aluminum-surface-defect dataset. In the figure, the sizes of the four types of defects are widely distributed: both the width and length follow a nearly random distribution in the interval (1, W), where W denotes the width of the input image. Neither the predefined anchors nor the anchors obtained by clustering can fit this distribution well. In fact, the anchor-based methods generally rely on increasing the number of anchors to solve these problems, which seriously affects the detection efficiency and can even make the network difficult to train. As shown in Figure 10, when the angle of an ultra-elongated defect is rotated from horizontal to 45° and 90°, the aspect ratio of the bounding box changes sharply, which corresponds to the overly large Euclidean distance between the two points in Figure 9. Consequently, the detection performance of the anchor-based methods drops. Therefore, the anchor-based methods are not suitable for the detection of defects of extreme size.
Advanced FCOS and the anchor-based methods are compared on the test set. The AP (%) values of the detection of the four types of defects are shown in Table 6. Among these methods, Faster R-CNN also uses ResNet101 as the backbone and introduces the FPN structure. YOLOv3 uses DarkNet as the backbone. The hyperparameters for both YOLOv3 and YOLOv5x were set to the default values proposed by their creators. The number of anchors in both models is set to the default value of 9.
It can be seen that the mAP of our model for the detection of the four types of defects is significantly higher than that of Faster R-CNN, by 8.3%, and that of YOLOv5x, by 9.4%. If we look only at the two types of defects of extreme size, i.e., scratches and spots, the improvements are even more obvious. For example, the AP for spot detection using our model is higher than that of Faster R-CNN by 12.3% and that of YOLOv5x by 14.9%. The visualized examples of the scratch detection results of the four models are shown in Figure 11. The figure shows that the ranges detected by Faster R-CNN, YOLOv3 and YOLOv5x differ considerably from the actual values, resulting in lower AP values.
The detection speeds are compared in Table 7. YOLOv5x has a detection speed of approximately 28 f/s and the best real-time performance, whereas Faster R-CNN has a detection speed of approximately 0.8 f/s. The detection speed of the proposed method (6 f/s) lies between those of Faster R-CNN and the YOLO methods, even though its input image is approximately 28 times the size of that of YOLOv3. The advanced FCOS method is therefore suitable for real industrial detection.
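The effect of rotation on the label box of an ultra-elongated defect, discussed with Figure 10, can be reproduced with elementary geometry. The sketch below computes the axis-aligned bounding box of a thin rectangle rotated by a given angle. The 400 × 10 defect size and the function name are illustrative assumptions.

```python
import math


def rotated_bbox(length, thickness, angle_deg):
    """Width and height of the axis-aligned box around a rotated thin rectangle."""
    a = math.radians(angle_deg)
    w = length * abs(math.cos(a)) + thickness * abs(math.sin(a))
    h = length * abs(math.sin(a)) + thickness * abs(math.cos(a))
    return w, h


for angle in (0, 45, 90):
    w, h = rotated_bbox(400, 10, angle)
    print(angle, round(w, 1), round(h, 1), round(w / h, 3))
```

The same physical scratch yields a label box with an aspect ratio of 40 when horizontal, about 1 at 45°, and 0.025 when vertical. A fixed or clustered anchor set would have to cover this entire range to match it.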

Conclusions
Ultra-elongated and ultra-fine defects are prone to missed and false detection during manufacturing. In this study, an advanced fully convolutional one-stage object detector is proposed to solve this problem. We improved the original FCOS framework, proposed an improved center-ness and self-adaptive center sampling to prevent center suppression, and improved the regularization method and the loss function based on the defect characteristics. Experimental results show that the proposed method significantly improves the performance of the network in detecting defects of extreme size, including elongated defects and microdefects, without affecting the detection of normal-size defects. The proposed method outperforms Faster R-CNN, YOLOv3 and YOLOv5x on the aluminum-surface-defect dataset. In addition, the proposed method can detect large images (>2K × 2K) at 6 f/s, which meets the requirements of real-time industrial detection.