An improved YOLOv3 model based on skipping connections and spatial pyramid pooling

The cascaded deep-learning network of YOLOv3 emphasizes layer-wise feature extraction. It neglects the sequential influence among layers that contributes subtle features to object detection. An improved YOLOv3 model with skipping connections is proposed in this paper for sufficient utilization of the layer-wise features. Firstly, a DenseBlock network is adopted as the fourth and fifth down-sampling layers of YOLOv3. The DenseBlock is characterized by a parallel architecture and is capable of transmitting features backwards among the extraction layers. The features of preceding layers are thereby incorporated in a skipping fashion into subsequent layers. Secondly, a spatial pyramid pooling (SPP) module is introduced at the neck of the object detector. It realizes the size-tuning of the model input, and the multi-scaled region features are generated after the pooling and concatenation operations of the SPP. Finally, validation experiments have been conducted on a dataset of helmet objects. The results show that the proposed YOLOv3 model improves the accuracy effectively. It yields a mean average precision of 88.6% on helmet detection, which is 3.5% higher than the original YOLOv3 network.


Introduction
Most classical object detection methods relied on colour histogram information, texture features, etc. (Fei-Fei & Perona, 2005). The AlexNet model showed a high detection accuracy on the ImageNet dataset, which attracted many scholars to study object detection based on deep learning (Krizhevsky et al., 2012). In general, deep learning models have two kinds of structure. Two-stage networks use sliding windows and efficient selective search to obtain region proposals, e.g. Fast R-CNN (Ross, 2015) and Faster R-CNN (Ren et al., 2017), while SSD (Liu et al., 2016) and the series of YOLO models (Redmon & Farhadi, 2017; Redmon et al., 2016) are typical representatives of one-stage networks that do not use region proposals. In particular, YOLO obtains the detection result in one step. It directly generates the target information without region proposals, including the classification, confidence and position coordinates. The SSD model continues YOLO's idea and integrates multi-scaled feature adaption and the detection of small targets (Liu et al., 2016). Further, the YOLOv3 model (Redmon & Farhadi, 2018) uses Darknet-53 as the backbone network. It provides a rather higher mAP than the SSD and YOLO models by means of residual connections and multi-scale detection. From YOLOv1 to YOLOv3, it can be seen that parallel structures such as ResNet (He et al., 2016) and FPN (Lin et al., 2017) contribute to the accuracy improvement of detection models. Tian et al. (2019) changed the YOLOv3 backbone network to a dense connection to combine features from different layers. In Zhou et al. (2018), DenseNet-169 was adopted as the backbone network of STDN to enhance the parallel relation.
With the enhancement of feature transmission between layers, the models yielded a significant improvement in detection accuracy. At the same time, it is noted that the spatial pyramid pooling network (SPPNet), widely used in multi-scale scenes (Gao et al., 2017; He et al., 2015; Lecun et al., 2015; Özdemir et al., 2010), equivalently adds a parallel structure to the cascade network. SPPNet imposes pooling windows of different sizes at the same layer. It enlarges the receptive fields of the network through concurrent use of the features. In Liu et al. (2020), the atrous spatial pyramid pooling (ASPP) block utilized dilated convolutions of different rates in parallel to capture multi-scale features, and achieved a high segmentation performance. In this paper, skipping connections are introduced into YOLOv3 to improve the utilization of features among layers. Meanwhile, an SPP is adopted to enhance the adaption of the model to multi-sized features. The remainder of the paper is organized as follows. Section 2 details the method of YOLOv3 with a comparison to YOLOv1. Section 3 describes the realization of the skipping connections and Section 4 the operation of the involved SPP block. Then, the improved YOLOv3 model is presented in Section 5, and the validation results are demonstrated in Section 6. Finally, Section 7 gives the concluding remarks and discusses possibilities for future work.

Methodologies
The comparison of the YOLOv1 and YOLOv3 detection models is given in Figure 1. In YOLOv1, as shown in Figure 1(a), the convolution layers and pooling layers are strung together to form the backbone network. It takes a head-to-tail feedforward structure as a whole, i.e. the output of the current extraction layer is exactly the input of the next layer. The input of the current layer does not act as an input component of, and exerts no direct effect on, the next layer. This leads to insufficient utilization of the extracted features in YOLOv1. Unexpected gradient disappearance can occur due to the non-smoothness of the activation function during model training. Compared with YOLOv1, YOLOv3 takes Darknet-53 as the backbone network. As shown in Figure 1(b), YOLOv3 improves the network by inserting a ResBlock, a parallel structure, at the frontend of Darknet-53. It deepens the network structure and avoids gradient disappearance. Meanwhile, YOLOv3 changes the network to multi-scale detection at the backend. There are 5 down-sampling blocks in Darknet-53 to achieve the scale diversity. Each down-sampling block is composed of 3 × 3 convolutional layers. Because of the scale diversity, the feature maps contain both general information and semantic information.
However, YOLOv3 implements down-sampling through convolutional layers in series. Without fusion of the current feature map into each subsequent layer, it suffers a deficiency in the extraction of multi-layer semantic information. Moreover, the receptive field becomes rather limited because of the small-sized convolution kernels used to pool the current layer.
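As a quick check of the scale diversity, the spatial sizes after the five stride-2 down-samplings can be traced (a minimal sketch assuming the common 416 × 416 YOLOv3 input size, which is an assumption here):

```python
# Spatial size after each of the five stride-2 down-sampling blocks
# in Darknet-53, assuming a 416x416 input.
def downsample_sizes(input_size, num_blocks=5):
    sizes = [input_size]
    for _ in range(num_blocks):
        sizes.append(sizes[-1] // 2)  # each stride-2 block halves H and W
    return sizes

print(downsample_sizes(416))  # [416, 208, 104, 52, 26, 13]
```

The last three sizes (52, 26, 13) correspond to the three detection scales of YOLOv3.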

Skipping connections
Densely connected convolutional networks connect each layer to every other layer in a feed-forward fashion. More feature maps of inconsistent dimensions can be obtained through the skipping connections. The operation of the DenseBlock is shown in Figure 2. The involved feature maps naturally contain both shallow detail features and deep semantic features. The output feature map is described as

x_l = H_l(x_0 ⊕ x_1 ⊕ ⋯ ⊕ x_{l−1}),

where x_i is the corresponding output of layer i; ⊕ is the concatenation operator for the feature maps; H_l uses the function BN-ReLU-Conv(1 × 1)-BN-ReLU-Conv(3 × 3), which is composed of batch normalization (BN), rectified linear units (ReLU) and convolution (Conv). As shown in Figure 2, each layer takes all preceding feature maps as its input. The feature map x, sized W × H × C, initially passes through a convolutional layer with a 3 × 3 kernel and a stride of 2, but with half the original number of filters. Subsequently, the derived feature map x_0, sized W/2 × H/2 × C/2, is input to H_1 to generate the feature map x_1. Through the skipping connections, the input of H_2 is the concatenation of the preceding feature maps x_0 and x_1, i.e. x_0 ⊕ x_1. The later operations are similar. Finally, the output of the DenseBlock network with l feature layers is

x_out = x_0 ⊕ x_1 ⊕ ⋯ ⊕ x_l.

The computational consumption of a network is measured in billions of floating-point operations (BFLOPs), i.e.
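The concatenation behaviour described above can be sketched with placeholder layers (a shape-only sketch: H is stubbed out, and the growth rate of 32 channels per layer is an illustrative assumption, not a value from the paper):

```python
import numpy as np

# Shape-only sketch of DenseBlock skip connections: each layer H_l
# receives the channel-wise concatenation (the ⊕ operator) of all
# preceding feature maps. Arrays are (C, H, W).
def dense_block(x0, num_layers=4, growth=32):
    def H(x):  # stand-in for BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3)
        c, h, w = x.shape
        return np.zeros((growth, h, w))  # a real H would convolve; only shapes matter here
    feats = [x0]
    for _ in range(num_layers):
        x = H(np.concatenate(feats, axis=0))  # input = x_0 ⊕ x_1 ⊕ ... ⊕ x_{l-1}
        feats.append(x)
    return np.concatenate(feats, axis=0)      # output concatenates every layer

out = dense_block(np.zeros((64, 26, 26)))
print(out.shape)  # (192, 26, 26): channels grow to 64 + 4*32
```

The channel count grows linearly with the number of layers, which is why DenseBlocks can stay narrow while still exposing every earlier feature map to later layers.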
BFLOPs = 2 × l_n × l_size² × (l_c / l_groups) × l_out_h × l_out_w × 10⁻⁹,

where l_n and l_c are the numbers of convolution kernels and filters within the current layer; l_size is the size of the convolution kernel; l_out_h and l_out_w represent the size of the feature map after convolution. The default value of l_groups is 1. The BFLOPs of the original model in Figure 1 and of the model with the fourth down-sampling block replaced by the DenseBlock can both be computed from this formula. As discussed above, the skipping connections between layers alleviate the insufficient feature utilization and the gradient disappearance caused by the cascade structure. Also, the structure has a narrower network width and fewer parameters, so it is faster than the original model. To improve YOLOv3 with parallel structures, the fourth and fifth down-samplings are realized by DenseBlocks instead of the original convolutional layers with a 3 × 3 kernel and a stride of 2.
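The BFLOPs formula above can be written directly as a small helper (the example layer configuration is illustrative, not a layer from the paper):

```python
# BFLOPs of one convolutional layer, following the formula above
# (the factor 2 counts one multiply and one add per weight application).
def conv_bflops(l_n, l_size, l_c, l_out_h, l_out_w, l_groups=1):
    return 2.0 * l_n * l_size**2 * (l_c / l_groups) * l_out_h * l_out_w / 1e9

# e.g. a 3x3 conv with 64 filters over 32 input channels on a 208x208 map
print(round(conv_bflops(64, 3, 32, 208, 208), 3))  # ~1.595 BFLOPs
```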

Feature pyramid structure
Spatial pyramid pooling (SPP) provides a structure that maps feature vectors of arbitrary size to a fixed size. In each spatial bin, SPP pools the input by max pooling. The introduced feature pyramid structure and the general SPP block are shown in Figure 3. The input feature map is W × H × C. After entering the SPP block, it is split into four branches. Three of them use pooling kernels of different sizes, namely 5 × 5, 9 × 9 and 13 × 13. The fourth branch is concatenated with the other branches without any pooling operation. Finally, the network gives an output with a scale of W × H × 4C.
This method changes the size of the filters in the pooling layer. It partitions the image into divisions from finer to coarser levels and aggregates the local features within them. Pooling kernels of different sizes have different receptive fields and extract multi-scale features. Spatial features are extracted separately to obtain local information, and these features are concatenated to obtain global features.
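The four-branch pooling and concatenation can be sketched as follows (stride-1 max pooling with "same" padding, so a W × H × C input yields W × H × 4C; the plain-Python loop is for clarity, not speed):

```python
import numpy as np

# Max pooling with kernel k, stride 1 and "same" padding on a (C, H, W) array.
def max_pool_same(x, k):
    c, h, w = x.shape
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = padded[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

# SPP block: identity branch plus 5x5, 9x9 and 13x13 poolings, concatenated
# along the channel axis, giving 4C output channels.
def spp_block(x):
    return np.concatenate([x] + [max_pool_same(x, k) for k in (5, 9, 13)], axis=0)

x = np.random.rand(512, 13, 13)
print(spp_block(x).shape)  # (2048, 13, 13): channels grow from C to 4C
```

Because all branches keep the spatial size, the block can be dropped into the network without disturbing the surrounding layer shapes.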

The improved network architecture
The improved YOLOv3 network replaces the down-sampling layer scaled 26 × 26 with the DenseNet structure. As shown in Figure 4, the DenseNet consists of 4 concatenation operations and 8 convolutions. Each dense block is composed of a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a concatenation layer, and four dense blocks of the same structure make up the DenseNet. Before the YOLO detector, an SPP module is added. The module contains three max-pooling layers, and its last layer concatenates the results of the three poolings. The SPP module can extract features that reflect objects of various sizes, so it increases the detection rate for both small and large objects.
After the addition of the DenseNet structure, the features of multiple layers are combined and more feature information is incorporated. This solves the lack of feature information in the feature-extraction process. On the other hand, the SPP provides pooling kernels of different sizes to enhance the adaption of the model to multi-sized features. Placing the module at different locations in the network has different effects: the deeper the layer, the richer the semantic information and the more valuable its transmission (Lecun et al., 2015). The module is therefore placed in a deep layer of the network. The 13 × 13 × 512 feature map is fused with shallow features through the skip connections, and the SPP structure is placed only before this feature map, so that the enhanced features are output to each detection scale.

Loss function
The loss function evaluates the modelling error between predicted and real values. In object detection, the total loss is composed of the position loss (L_loc), classification loss (L_cls) and confidence loss (L_conf), i.e.

L = L_loc + L_cls + L_conf.

The position loss includes the coordinate loss and the width-height loss:

L_loc = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²] + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²],

where x_i, y_i, w_i, h_i are the coordinates of each prediction box; x̂_i, ŷ_i, ŵ_i, ĥ_i are the coordinates of each ground-truth box; λ_coord is a coordination coefficient used to balance the inconsistent contributions of different-sized rectangles to the loss function. An input picture is divided into an S × S grid, each grid cell predicts B bounding boxes, and I_{ij}^{obj} indicates whether the j-th box of cell i is responsible for an object. The remaining two terms are

L_cls = Σ_{i=0}^{S²} I_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²,

L_conf = Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} (C_i − Ĉ_i)² + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{noobj} (C_i − Ĉ_i)²,

where p_i(c) and p̂_i(c) are the predicted and ground-truth values of the object category; C_i is the probability of an object in the prediction box and Ĉ_i is the ground-truth value; λ_noobj is a weight representing the proportion of the confidence error in the loss function when the prediction box contains no object, usually set to 0.5.
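As a rough illustration, the three loss terms can be sketched in a few lines (a minimal sketch: the square root on w, h and the cross-entropy variants used by YOLOv3 are omitted for brevity, and the S × S × B predictor slots are flattened into rows):

```python
import numpy as np

# Illustrative MSE-form YOLO loss over flattened predictor slots.
# pred/true columns: x, y, w, h, confidence C, class probability p.
# `obj` is 1.0 for slots responsible for a ground-truth object, else 0.0.
def yolo_loss(pred, true, obj, lambda_coord=5.0, lambda_noobj=0.5):
    d = pred - true
    loc = lambda_coord * np.sum(obj[:, None] * d[:, 0:4] ** 2)              # L_loc
    conf = (np.sum(obj * d[:, 4] ** 2)
            + lambda_noobj * np.sum((1.0 - obj) * d[:, 4] ** 2))            # L_conf
    cls = np.sum(obj * d[:, 5] ** 2)                                        # L_cls
    return loc + conf + cls

pred = np.array([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0]])
true = np.zeros((1, 6))
print(yolo_loss(pred, true, np.array([1.0])))  # 20 (loc) + 1 (conf) + 1 (cls) = 22.0
```

Note how λ_noobj down-weights the confidence error of the many empty slots, which would otherwise dominate the total loss.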

Experiments and discussion
The detection models were trained and tested on Ubuntu 16.04 with an Intel Core i7-8700K 3.70 GHz CPU, a GeForce GTX 1080Ti 11 GB GPU, CUDA 10.1 and cuDNN 7.5. The network initialization parameters are shown in Table 1. The number of training steps was 15,000 and the batch size was set to 2. The learning rate decreased by 90% after 10,000 steps and to 0.00001 after 13,000 steps.
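The stated schedule can be expressed as a step function (the initial rate of 0.001 is an assumption inferred from the two stated drops, since Table 1 is not reproduced here):

```python
# Step learning-rate schedule: drop by 90% at step 10,000 and to 1e-5 at
# step 13,000. The base rate of 1e-3 is an assumed value, not from Table 1.
def learning_rate(step, base_lr=1e-3):
    if step >= 13000:
        return 1e-5
    if step >= 10000:
        return base_lr * 0.1
    return base_lr

print(learning_rate(5000), learning_rate(11000), learning_rate(14000))
```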

Evaluation indicators selection
The evaluation indicators of the object detection model include recall, precision, Precision-Recall curve (P-R curve), Average-Precision, mAP and so on (Özdemir et al., 2010).
A certain confidence level is set as a threshold; when the confidence is greater than this threshold, a P-R curve can be drawn for a given class. The area enclosed by this curve and the coordinate axes is the average precision (AP) of that class. When there are N classes of objects, the mAP is the mean of the average precisions over all classes.
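The per-class AP described above can be sketched as the area under the interpolated P-R curve (the all-point interpolation used here is one common convention, an assumption rather than the paper's stated choice):

```python
import numpy as np

# AP as the area under the P-R curve with all-point interpolation:
# precision is made monotonically non-increasing (the envelope), then
# the area is summed over recall increments.
def average_precision(recall, precision):
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):        # envelope: p[i] = max(p[i], p[i+1])
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]         # recall steps
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP over N classes is then simply the mean of the per-class APs.
ap = average_precision(np.array([0.2, 0.6, 1.0]), np.array([1.0, 0.8, 0.5]))
print(round(ap, 3))
```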

Image data preparation
The dataset contained 665 pictures of people wearing helmets. Since a small amount of training data causes over-fitting in a deep network, the existing dataset was expanded. In order to enhance the generalization ability and robustness of the model, the collected images were preprocessed by data-enhancement methods including rotation, translation, scaling, random occlusion, horizontal flip and noise addition; the main methods used here were jitter, hue, saturation, exposure and flip. In order to compare the performance of different algorithms, the images in the training set were annotated in PASCAL VOC format. After using a script to number the images, we used the labelling tool labelImg to tag them and obtain the XML files. If positive samples were too small or too blurred, we did not label them, to prevent over-fitting of the neural network.
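Two of the listed augmentation operations can be sketched on a raw image array (horizontal flip and additive Gaussian noise; the noise level sigma = 10 is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Horizontal flip of an HxWx3 image: reverse the column axis.
def horizontal_flip(img):
    return img[:, ::-1, :]

# Additive Gaussian noise, clipped back into the valid 8-bit range.
def add_noise(img, sigma=10.0):
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, (416, 416, 3), dtype=np.uint8)
aug = add_noise(horizontal_flip(img))
print(aug.shape, aug.dtype)  # (416, 416, 3) uint8
```

Each such transform yields a new training sample with an unchanged label geometry (flip requires mirroring box x-coordinates accordingly).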

Improved YOLOv3 detection results
We used the improved YOLOv3 to detect candidate helmet regions in the input images. The loss curve and P-R curve obtained after training the improved network are shown in Figure 5. They show that the loss value converges quickly within a small number of iterations, which is attributable to the weight sharing of transfer learning in the shallow convolutional layers. In addition, with a more powerful graphics card, the model convergence and detection speed would be further improved. Figure 6 shows the comparison between the improved YOLOv3 network and the original YOLOv3 network. The pictures in the left column are the detection results of the original YOLOv3, while the three pictures in the right column are the results of the improved YOLOv3 model; for instance, Figure 6(a) missed an object compared with its counterpart from the improved model. In order to verify the performance of the proposed model, four algorithms, i.e. the original YOLOv3 network, SSD, Faster R-CNN and the improved YOLOv3, were trained and tested on the dataset. The mAP values of these algorithms are shown in Table 2. The experiments showed that the improved YOLOv3 had the highest detection accuracy for helmets.

Conclusions
This paper proposed an improved YOLOv3 algorithm for object detection. Firstly, a dataset was created by a web crawler and annotated, and the dataset was then clustered to optimize the anchor parameters. Since the DenseNet and SPP structures were introduced, deep features were fused with shallow features and the network accuracy improved. Finally, a multi-objective loss function combining mean-square-error loss and cross-entropy loss was used to regress and correct the prediction boxes, which improved the detection accuracy. Experimental results show that the improved YOLOv3 algorithm can effectively improve the detection accuracy for helmets. Because this paper presents only a theoretical model, the next step is to deploy it on an embedded system for further research.