Small object detection in UAV image based on improved YOLOv5

In unmanned aerial vehicle (UAV) images, detection objects are small relative to the entire image, backgrounds are complex, and objects are dense, so the object detection network obtains little effective information. In response to these difficulties, this paper proposes a small object detection method for UAV images based on an improved YOLOv5 algorithm. First, the space-to-depth (SPD) conv module is introduced into the basic feature extraction network to mitigate the significant loss of image information during downsampling. Then, several attention mechanisms are added to intensify the acquisition of regions of interest in UAV images. Finally, the multiscale detection module is improved to enhance the network's ability to detect small objects in UAV images. Experiments on the VisDrone-DET2019 dataset show that the improved algorithm achieves a mean average precision (mAP) of 41.8%, which is 7.8% better than the baseline network. In addition, its detection performance exceeds most current mainstream object detection algorithms, giving it practical value.


Introduction
As the cost of manufacturing UAVs continues to fall, they are becoming more and more popular. At the same time, deep learning-based object detection algorithms have achieved great performance in recent years. Combining the two can therefore play an important role in many areas.
The object detection problem is defined as determining the position of an object in a given image (object localization) and the class to which each object belongs (object classification) (Xu et al., 2022). It is widely used in autonomous driving, image classification, face detection, visual search, object tracking, and medical diagnosis (Duan et al., 2020). Deep learning-based object detection algorithms fall into two main categories. One is two-stage object detection algorithms, typified by Fast R-CNN (Girshick, 2015), Faster R-CNN (Ren et al., 2015), and Mask R-CNN (He et al., 2017). The other is single-stage object detection algorithms, typified by SSD (Liu et al., 2016), YOLOv3 (Redmon & Farhadi, 2018), and YOLOv4 (Bochkovskiy et al., 2020). Single-stage object detection algorithms have end-to-end performance advantages but lower detection accuracy than two-stage ones. Two-stage object detection algorithms use a localize-then-recognize approach, which gives better detection accuracy than single-stage algorithms but poorer real-time performance.
Aerial detection is an important object detection task for UAVs: through their onboard cameras, UAVs can easily capture images in various environments. UAVs are therefore increasingly used in civilian and military settings. However, UAV object detection faces the following challenges, which Figure 1 shows more intuitively.
(1) The objects in images captured by UAVs are mostly small, and small object detection is a major pain point in the field of object detection. The main reason is that small objects carry little detailed information, and the network keeps losing detail during downsampling, so that even less feature information can be extracted (Lin et al., 2017).
(2) UAV images contain many objects and are highly susceptible to environmental interference, so a single-pixel shift in the model's prediction can have a large impact on the detection results.
(3) In real environments where the objects in UAV images are too dense, the network may remove correct boxes during non-maximum suppression (NMS) at prediction time, resulting in missed detections, or the detection boxes may be so close to each other that the model ultimately struggles to converge (Shi et al., 2022).
Several works respond to the above problems. A new model structure, the multibranch parallel feature pyramid network (MPFPN), is proposed (Liu et al., 2020) for extracting richer information about small object features; a supervised spatial attention module (SSAM) is also added to reduce the effect of background noise using an attention mechanism. However, the network suffers from unbalanced training samples, which can lead to false detections. The Trident-FPN backbone structure and a new attention mechanism are proposed (Liu et al., 2022) to improve multi-scale object prediction in UAV images, but the model's detection speed is slow and its real-time performance does not meet detection demands well. A multiscale feature fusion method is proposed (Xun & Li, 2021) to improve imprecise object localization and thus enhance detection accuracy, but it performs poorly on dense objects. The TPH-YOLOv5 model (Zhu et al., 2021) adds a prediction head to YOLOv5, replaces the original prediction heads with Transformer prediction heads, and introduces the attention module CBAM, achieving good results on small objects in UAV images; however, this makes the network deeper and more computationally intensive, which hinders UAV deployment. A feature fusion and scaling-based single shot detector (FS-SSD) is proposed (Liang et al., 2019): a scaling branch with a deconvolution module is added to the SSD network and the feature fusion branch of the original network is adjusted, making the algorithm more competent at small object detection; however, it does not handle missed detections well when objects are occluded in complex backgrounds. The efficient end-to-end detector SPB-YOLO (Wang et al., 2021) designs a new band bottleneck (SPB) module and employs a path aggregation network (PANet) based upsampling strategy to improve UAV image object detection, but it can suffer from missed and false detections against complex backgrounds.
As single-stage object detection algorithms, the YOLO series always trades off well between detection speed and accuracy and is therefore widely used in UAV object detection. The YOLOv5 model has high flexibility and detection speed; its flexible parameter configuration allows it to be applied to different scenarios, and its false and missed detection rates are significantly lower for small objects. YOLOv5 has five versions, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which differ in the depth and width of the network. This paper chooses the lower-complexity YOLOv5s detection model, which ensures high detection accuracy while maintaining a small memory footprint and fast recognition speed, meeting the needs of UAV object detection.
In brief, the main contributions of the research in this paper are summarized as follows.
(1) The SPD-conv module (Sunkara & Luo, 2022) is introduced into the backbone feature extraction network to reduce the loss of detail caused by strided convolution and/or pooling layers during feature extraction, so that the network obtains more detailed information.
(2) The CA module (Hou et al., 2021) is added to the network; by capturing location and channel information, it enhances the saliency of the objects to be detected. The Transformer module (Vaswani et al., 2017) is also introduced to strengthen the network's feature extraction and global information acquisition capabilities through its self-attention mechanism.
(3) The detection head module is improved by adding a small object detection head, removing the large object detection head, and introducing the Transformer module into the medium object detection head to improve detection performance.
(4) Experiments are conducted on the VisDrone-DET2019 dataset (Cao et al., 2019). The algorithm proposed in this paper improves mAP by 7.8% compared to YOLOv5.

YOLOv5 algorithm
The structure diagram of YOLOv5 is shown in Figure 2. First, feature information is extracted from the input image by a convolutional neural network; second, the obtained feature information is fused; finally, the network outputs three detection heads for predicting objects. The YOLOv5 network comprises four parts: input, backbone, neck, and head. The input section applies Mosaic data augmentation to the input image. The backbone consists of five downsampling convolutional layers, four C3 modules, and one fast spatial pyramid pooling layer (SPPF) to extract feature information from the image. The neck uses an FPN + PAN structure to achieve multi-scale fusion of the extracted features. The head takes feature maps at three scales, 80 × 80, 40 × 40, and 20 × 20, from the neck for prediction and finally outputs the detection results.
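For reference, a minimal PyTorch sketch of the SPPF layer mentioned above is shown below: three chained 5 × 5 max-pools replace the parallel 5/9/13 pooling of the original SPP at lower cost. The class and parameter names are ours, and the batch normalization and activation wrappers around the convolutions in the actual YOLOv5 code are omitted for brevity.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Fast spatial pyramid pooling: three sequential 5x5 max-pools whose
    outputs are concatenated with the input, then fused by a 1x1 conv."""
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, kernel_size=1, stride=1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, kernel_size=1, stride=1)
        # stride-1 pooling with "same" padding keeps the spatial size fixed
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)       # receptive field ~5x5
        y2 = self.pool(y1)      # ~9x9
        y3 = self.pool(y2)      # ~13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```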

Improved YOLOv5 algorithm
To realize small object detection for UAVs, the network structure of YOLOv5 needs to be improved accordingly. The structure of the improved YOLOv5 algorithm proposed in this paper is shown in Figure 3.
The ideas for improvement in this paper are as follows: (1) From the perspective of image information acquisition, multiple attention models are used: the Transformer module and the CA module are introduced, which use attention mechanisms to learn regions of interest in images, so that the network can better acquire feature information with limited computing power while suppressing the interference of complex background information on feature acquisition. (2) From the perspective of feature information extraction, the SPD-conv module is introduced into the backbone feature network to overcome the loss of image detail information and the less efficient feature learning caused by strided convolution and/or pooling layers, and to improve the extraction of detailed information of small objects in UAV images. (3) From the perspective of multi-scale detection, a small object detection head is added and the large object detection head is removed, which improves the detection accuracy for small objects in UAV images while also speeding up detection. Meanwhile, the Transformer module is introduced in the detection head to obtain global information effectively using the self-attention mechanism; the multi-head attention maps features to multiple subspaces, which mitigates the problem of the low resolution of images at the end of the network.

Invoking coordinate attention to improve feature information acquisition
SE (Hu et al., 2018) and CBAM (Woo et al., 2018) are widely used attention modules. SE considers only internal channel information and ignores the importance of location information; CBAM pools globally by taking the maximum and average values over multiple channels, a weighting that accounts for only a local range of information. The more recent CA module embeds location information into channel attention, a simple and efficient plug-and-play approach. It enables the network to acquire information over a wider area without extra computational load, outperforming attention modules such as SE and CBAM.
When the network model acquires feature information from low-resolution images, information is lost as the network deepens, and the information of small, dense objects is ignored. This paper therefore introduces the CA module to reduce the loss of image information; because location and channel information act simultaneously, the module captures salient information and helps the network better identify objects.
The CA mechanism consists of two main steps: coordinate information embedding and coordinate attention generation. The information embedding step is given by

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \quad (1)$$

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \quad (2)$$

where $z_c^h(h)$ is the output of the $c$-th channel at height $h$, $z_c^w(w)$ is the output of the $c$-th channel at width $w$, and $H$ and $W$ are the height and width of the input feature map of the current attention module. These two transformations aggregate features along the two spatial directions, yielding a pair of direction-aware feature maps. This formulation enables the attention module to capture long-range dependencies along one spatial direction while preserving precise location information along the other, helping the network locate objects of interest more accurately. The attention generation step is given by

$$f = \delta(F_1([z^h, z^w])) \quad (3)$$

where $F_1$ is a convolutional transform function, $[z^h, z^w]$ denotes the stacking of the two feature maps along the spatial dimension, $\delta$ is a nonlinear activation function, and $f$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions. Two further $1 \times 1$ convolutional transforms turn $f^h$ and $f^w$ into tensors with the same number of channels as the input $x_c$:

$$g^h = \sigma(F_h(f^h)) \quad (4)$$

$$g^w = \sigma(F_w(f^w)) \quad (5)$$

where $g^h$ and $g^w$ are the attention outputs of the feature maps split along the two spatial dimensions, $F_h$ and $F_w$ are the convolution functions, and $\sigma$ is the sigmoid function. The final output of the CA module is given by equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \quad (6)$$

The structure of the CA module is shown in Figure 4.
As the above equations describe, the input image acts simultaneously on attention along the horizontal and vertical directions, yielding two directional attention maps. Each element of these maps reflects whether the object of interest is present in the corresponding row or column. This lets the CA module take full advantage of the captured position information so that the region of interest is accurately located, while also effectively capturing the relationships between channels, helping the model better obtain information about the image objects.
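As an illustration, a minimal PyTorch sketch of the CA module described by Eqs. (1)-(6) follows. The class and parameter names are ours; the reduction ratio `r` and the use of ReLU for the nonlinearity $\delta$ are assumptions (the original paper uses a hard-swish activation).

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention (Hou et al., 2021): pool along H and W separately
    (Eqs. 1-2), encode jointly (Eq. 3), split into two directional attention
    maps (Eqs. 4-5), then reweight the input (Eq. 6)."""
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B,C,H,1): average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B,C,1,W): average over H
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)  # F_1
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)               # nonlinearity delta (assumed)
        self.conv_h = nn.Conv2d(mid, channels, 1)      # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)      # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.size()
        zh = self.pool_h(x)                            # (B,C,H,1), Eq. 1
        zw = self.pool_w(x).permute(0, 1, 3, 2)        # (B,C,W,1), Eq. 2
        f = self.act(self.bn1(self.conv1(torch.cat([zh, zw], dim=2))))  # Eq. 3
        fh, fw = torch.split(f, [h, w], dim=2)
        gh = torch.sigmoid(self.conv_h(fh))                      # (B,C,H,1), Eq. 4
        gw = torch.sigmoid(self.conv_w(fw.permute(0, 1, 3, 2)))  # (B,C,1,W), Eq. 5
        return x * gh * gw                                       # Eq. 6
```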

Enhance global information acquisition with Transformer
Recently, the Transformer module has achieved great success in vision; the Vision Transformer (ViT) (Dosovitskiy et al., 2020) was the first Transformer applied to a visual recognition task and achieved good detection results. In this paper, the last C3 module in the backbone extraction network is replaced with a C3 module combined with the Transformer. Compared with the C3 module of the original network, the improved module has better global information acquisition capability and richer contextual feature information, making it better suited to high-density object detection in UAV aerial images. The Transformer structure, shown in Figure 5, contains two sublayers: the first is a multi-head attention layer and the second is a fully connected layer, with residual connections between the sublayers.
The Transformer attention computation is given in equation (7):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \quad (7)$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_k$ is the input feature dimension. $K$ is transposed and multiplied with $Q$, scaled by a factor of $1/\sqrt{d_k}$; the subsequent softmax activation yields the correlation matrix between each pixel and all other pixels; this correlation matrix is then applied to $V$, so that the feature vector of each pixel is fused with the features of the other pixels relative to the input matrix. Finally, the output of the multi-head attention mechanism is obtained by concatenating the results from the different subspaces, as in equation (8):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \quad (8)$$

where $W^O \in \mathbb{R}^{h d_v \times d_{model}}$ is the output parameter matrix and $d_{model}$ is the vector dimension.
The output of the feedforward layer of the Transformer module is given in equation (9):

$$\mathrm{FFN}(z') = \mathrm{FC}_2(\mathrm{ReLU}(\mathrm{FC}_1(z'))), \quad z' = \mathrm{LN}(\mathrm{MultiHead}(Q, K, V) + z) \quad (9)$$

where $z$ denotes the layer-normalized input sequence $x$, and $z'$, the input of the feedforward layer, is the output of the multi-head attention mechanism summed with the shortcut-connected features and then layer-normalized; the feedforward layer is a multilayer perceptron containing two fully connected layers. The final output of the overall Transformer module is given in equation (10):

$$y = \mathrm{LN}(\mathrm{FFN}(z') + z') \quad (10)$$

The self-attention mechanism, as the core module of the Transformer, makes each pixel of the image attend to the features of the other pixels: by generating correlation weights between pixels, the model establishes connections between different objects according to the weights, extracting more effective global features and obtaining richer contextual information. The multi-head mechanism enriches the diversity of feature subspaces, improving the algorithm's ability to capture information from different locations without additional computation. Introducing the Transformer module into the backbone extraction network therefore enhances the backbone's feature extraction capability for objects in complex backgrounds.
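For concreteness, a minimal PyTorch sketch of the two-sublayer block in Figure 5 follows, implementing Eqs. (7)-(10) with `nn.MultiheadAttention`. The head count, hidden width, and exact placement of the layer normalizations are assumptions based on the description above, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Two-sublayer Transformer encoder: multi-head self-attention (Eqs. 7-8)
    and a two-layer MLP (Eq. 9), each followed by a residual connection and
    layer normalization (Eq. 10)."""
    def __init__(self, d_model: int, n_heads: int = 4, d_ff: int = None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ln3 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(inplace=True),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d_model), e.g. a C3 feature map flattened to N = H*W tokens
        z = self.ln1(x)                      # layer-normalized input sequence
        attn_out, _ = self.attn(z, z, z)     # Q = K = V = z (self-attention)
        z2 = self.ln2(attn_out + z)          # Eq. 9: LN(MultiHead(Q,K,V) + z)
        return self.ln3(self.ffn(z2) + z2)   # Eq. 10: LN(FFN(z') + z')
```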

Adding SPD-conv to improve feature extraction network
When convolutional networks are used for object detection in common scenes, regular images have good resolution and medium-sized objects (Tan et al., 2019a), so strided convolution and/or pooling can skip redundant information and the model can still extract feature information well. In UAV images, however, the information to be extracted is relatively less redundant, and the use of strided convolution and/or pooling layers leads to loss of small object information and less efficient learning of feature representations. For this reason, the SPD-conv module is introduced in this paper to replace strided convolution. For a given feature map $X$ of size $(S, S, C_1)$, the space-to-depth operation slices $X$ into $scale^2$ sub-maps, each downsampling $X$ by the scale factor; Figure 6 gives an example for scale = 2. The SPD layer concatenates these sub-maps along the channel dimension, transforming the feature map into an intermediate feature map $X'(\frac{S}{scale}, \frac{S}{scale}, scale^2 C_1)$. After this feature transformation layer, a non-strided (i.e. stride = 1) convolution layer with $C_2$ filters is added, which further transforms $X'$ into $X''(\frac{S}{scale}, \frac{S}{scale}, C_2)$. In this way, all feature information of the image is preserved as much as possible (Tan et al., 2020).
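To make the operation concrete, a minimal PyTorch sketch of an SPD-conv layer following the space-to-depth description above is given below. The class and parameter names are ours, and the 3 × 3 kernel size is an assumption; the essential point is that the convolution has stride = 1. Recent PyTorch versions also provide the same rearrangement as `nn.PixelUnshuffle`.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """SPD-conv (Sunkara & Luo, 2022): a space-to-depth rearrangement that
    moves every scale x scale spatial block into the channel dimension,
    followed by a non-strided convolution, so downsampling discards no pixels."""
    def __init__(self, c_in: int, c_out: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # stride-1 convolution with C2 = c_out filters after the rearrangement
        self.conv = nn.Conv2d(c_in * scale ** 2, c_out,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        # (B, C, S, S) -> (B, s*s*C, S/s, S/s): take each sub-grid of the
        # feature map and stack the sub-maps along the channel dimension
        x = torch.cat([x[..., i::s, j::s] for i in range(s) for j in range(s)],
                      dim=1)
        return self.conv(x)
```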

Improved multi-scale detection
As shown in Figure 2, the YOLOv5 algorithm has three detection layers at different scales. When the input image size is 640 × 640, after the backbone, the neck performs two upsampling steps, fusing the results with layers 4 and 6 of the backbone respectively, followed by downsampling. The head extracts the 80 × 80, 40 × 40, and 20 × 20 feature maps from the neck for output, which are used to detect small, medium, and large objects in the image. However, UAV images mostly contain small objects, which leaves the original YOLOv5 model with poor detection capability on them (Yan et al., 2022). In this paper, the head structure is improved by adding a detection head that outputs a 160 × 160 feature map, increasing the model's sensitivity to small objects and thus enhancing the algorithm's ability to detect small objects in UAV images. Since this paper focuses on small objects in UAV images, large objects are rarely detected, and each additional detection head increases the computational cost; therefore, of the four resulting detection heads, the large object detection head is removed and the heads for tiny, small, and medium objects are retained, as sketched below. This reduces unnecessary computation and increases detection speed (Tan et al., 2019b). Because the image resolution is low at the end of the network, the Transformer module is applied only on the low-resolution feature maps, reducing the expensive computation and memory costs.
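A small sketch of the resulting head scales for a 640 × 640 input, assuming the usual stride/feature-map relationship (grid size = input size / stride); the actual layer wiring lives in the model configuration.

```python
# Head scales before and after the modification, for a 640x640 input.
input_size = 640
original_strides = [8, 16, 32]  # 80x80, 40x40, 20x20 (small/medium/large)
modified_strides = [4, 8, 16]   # 160x160, 80x80, 40x40 (20x20 head removed)

for name, strides in [("original", original_strides),
                      ("modified", modified_strides)]:
    grids = [input_size // s for s in strides]
    print(name, [f"{g}x{g}" for g in grids])
# original ['80x80', '40x40', '20x20']
# modified ['160x160', '80x80', '40x40']
```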

Experimental environment, data and evaluation index
The experiments in this paper were conducted under Windows 10 using the PyTorch 1.10.2 deep learning framework with CUDA 11.3. The hardware configuration is an NVIDIA GeForce RTX 3060 with 12 GB of video memory, an Intel i7-10700, and 16 GB of RAM. The training parameters are set as follows: input size 640 × 640; SGD optimizer with an initial learning rate of 0.01 and a weight decay coefficient of 0.005; the learning rate is decayed by cosine annealing.
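A minimal sketch of the described optimizer and schedule in PyTorch follows; the model and epoch count are placeholders, not values reported in the paper, and the training loop itself is elided.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the detector network
epochs = 300                     # assumed; the paper does not state the epoch count

# SGD with the reported initial learning rate and weight decay
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.005)
# cosine annealing of the learning rate over the training run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... forward/backward passes over the 640 x 640 training batches ...
    optimizer.step()
    scheduler.step()
```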
The experiments in this paper are trained and tested on the VisDrone-DET2019 dataset, collected by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University. This large-scale dataset has 10 categories and is divided into three parts, a training set, a validation set, and a test set, containing 10,209 static images in total (6,471 for training, 548 for validation, and 3,190 for testing).
To evaluate the effectiveness of the proposed UAV small object detection method, P (precision), R (recall), and mAP (mean average precision) are used as the measurement indexes of the model, calculated as follows.
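The referenced formulas do not survive in this version; the standard definitions, consistent with how the metrics are used in the tables, are

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where TP, FP, and FN denote true positives, false positives, and false negatives, AP is the area under the precision-recall curve for one class, and N is the number of classes (10 for VisDrone).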

Experimental results and analysis
To verify the effects of the improvements in this paper on the model detection performance and the effectiveness of the improved modules, ablation experiments are conducted with the original YOLOv5 as the benchmark, and the experimental results are shown in Table 1.
Effectiveness of the improved backbone feature extraction module. To demonstrate its effectiveness, the improved backbone module is added to the original version for experiments. As shown in Table 1, on the VisDrone2019 dataset mAP improves by 2.5 percentage points and FPS drops by 15 f·s⁻¹ compared with the original version. This shows that the improved backbone feature extraction module effectively improves the extraction of small object information, proving the effectiveness of the module.
Effectiveness of the improved detection head module. To demonstrate its effectiveness, the improved detection head module is added to the original version for experiments. As shown in Table 1, compared with the original model and with the improved backbone module, mAP improves by 6.7 and 4.2 percentage points, and FPS drops by 17 and 2 f·s⁻¹, respectively, on the VisDrone2019 dataset. This demonstrates that the improved detection head module, through the small object detection head, effectively improves detection of small objects in UAV images, and that the added Transformer module improves detection of medium-scale objects to a certain extent, improving overall detection performance. The indispensability of the improved detection head module is thus demonstrated.

Effectiveness of the overall structure. To demonstrate the effectiveness of the overall structure, the full scheme is applied to the original version. As shown in Table 1, compared with the original version on the VisDrone2019 dataset, mAP improves by 7.8 percentage points and FPS drops by 28 f·s⁻¹; mAP improves by 5.3 and 1.1 percentage points over the two individually improved modules, respectively. This shows that the proposed algorithm achieves better results in suppressing the influence of background information, acquiring small object feature information, and detecting small objects, thus improving the object detection performance on UAV images. The effectiveness of the proposed algorithm is demonstrated, as is the complementarity and inseparability of the two modules.
Table 1 also shows that FPS decreases as each module is added compared to the original version; the proposed algorithm runs at 54 f·s⁻¹, which still meets UAV object detection requirements.
This paper also details the detection accuracy of each improved model on the 10 object classes of the VisDrone2019 dataset; the results are shown in Table 2. Table 2 shows that every improved module proposed in this paper increases mAP in all categories compared to the original YOLOv5. For the original YOLOv5 network, mAP for the bicycle and awning categories reaches only 13.5% and 10.8%, because the VisDrone2019 dataset contains small object images near the size limit, and the background, field of view, and UAV flight altitude of the images are uncertain and variable. The improved network raises the mAP of the bicycle and awning categories by 3.7% and 4.9%, while other small-size categories also improve significantly. This shows that the proposed network model has strong multi-scale detection and generalization capabilities and is suitable for UAV image object detection.
The ablation experiments above show that the proposed algorithm effectively improves the network's mAP, and fewer missed and false detections are observed in the experiments. Figure 7 shows the detection results of the two algorithms on the same two images: the left side shows the results of the YOLOv5 algorithm and the right side those of the proposed algorithm. Comparing panels a and b of Figure 7, the original YOLOv5 produces a false detection that the proposed algorithm avoids. Panels c and d of Figure 7 show a complex detection environment: in the boxed area, the proposed algorithm not only detects the trucks missed by the original version but also avoids falsely detecting the car. The experiments prove that the proposed algorithm mitigates YOLOv5's missed and false detections, with better robustness in small object scenarios and better detection accuracy against complex backgrounds.
To verify the superiority of the proposed algorithm, representative deep learning object detection models are selected for comparison experiments: SSD512, Faster-RCNN, RetinaNet, and YOLOv4 are tested alongside the proposed algorithm, with results shown in Table 3.
Table 3 shows that the SSD512 and RetinaNet networks are less effective on the UAV object dataset; Faster-RCNN is slightly better, but its frame rate is lower due to the complex structure of the network model. YOLOv4's detection performance improves on the above networks and exceeds that of YOLOv5s, which this paper chose as the baseline for its speed at some cost in accuracy. However, YOLOv4 still detects slightly worse than the proposed model, at a similar frame rate. This shows the superiority of the proposed method for UAV object detection over these comparison methods.
To further verify the performance of the proposed algorithm for UAV object detection, deep learning models for UAV object detection and the latest YOLO algorithm are selected for comparison experiments: YOLOv3_ReSAM (Liu et al., 2022), mSODANet (Chalavadi et al., 2022), TPH-YOLOv5, UCGNet (Liao et al., 2021), and YOLOv8 are tested, with results shown in Table 4.
The results in Table 4 show that the improved algorithm proposed in this paper achieves higher mAP than the four network models YOLOv3_ReSAM, mSODANet, TPH-YOLOv5, and UCGNet. The YOLOv3_ReSAM model uses a SAM attention mechanism with a residual structure to enhance backbone information acquisition and help feature information fuse better, and its object detection accuracy in several categories exceeds that of the proposed algorithm; however, it is less effective on tiny objects, specifically the pedestrian and people categories, mainly because this paper outputs a prediction head for tiny objects, making the network more favourable for detecting them. The mSODANet model uses a hierarchical dilated network to fully learn contextual information and acquire features of harder-to-detect objects, achieving 24.67% mAP and improving by 8.97% mAP on awning, the category where the proposed algorithm is weakest; however, this network outputs only two prediction heads and cannot detect tiny objects as accurately as the improved detection end in this paper. The TPH-YOLOv5 model also outputs tiny object prediction heads at the detection side, but its accuracy on tiny objects is still slightly inferior to this paper's, owing to this paper's improvement of the backbone extraction network, which lets the network extract more object feature information and use it for prediction. UCGNet proposes an LLM to predict the distribution map of objects and an unsupervised clustering module to generate dense sub-regions, but its backbone consists of three BottleNeckCSP modules and an SPP module, which does not extract enough detailed information for small objects, leading to lower detection performance than the proposed algorithm. Finally, the latest YOLOv8 is selected for comparison: its detection accuracy is still slightly lower than that of the proposed algorithm, mainly because this paper's detection heads are more favourable for detecting small objects.
Finally, to verify the universality of the proposed improvements, the improvement method is added to YOLOv3; since the two networks differ, the improvement cannot be applied in full. The results are shown in Table 5.
Table 5 shows that after adding the proposed improvement method to YOLOv3, its mAP on the VisDrone2019 dataset improves substantially, and the mAP of every class in the dataset improves as well. This proves the universality of the proposed method: it helps the network model better extract the detailed information of small objects and thus improves small object detection accuracy.
The above comparisons confirm the advantages of the proposed algorithm. The improved algorithm effectively strengthens feature extraction for small objects while maintaining a degree of real-time performance, making the model more advantageous for UAV object detection tasks.

Conclusion
For the problem of small objects in UAV images, this paper proposes an improved object detection algorithm based on YOLOv5. The CA module, Transformer module, and SPD-conv module are introduced into the backbone feature extraction network to enhance the extraction of feature information from UAV images. The detection end of the network is improved so that it outputs three new detection heads, and the Transformer module is added to the medium object detection head to overcome the low resolution of the image at the end of the network. To verify the performance of the proposed models, ablation and comparison experiments are conducted, and the experiments show that the proposed models achieve better performance. Finally, it should be noted that only small object detection has been studied in this paper. In practice, objects in UAV images are oriented in arbitrary directions, and a horizontal bounding box will contain too much background information, which can also lead to missed detections. In the future, we will investigate this problem further using a rotated-box detection method to detect objects oriented in arbitrary directions.

Figure 7. Detection results of the proposed algorithm and the baseline on the test dataset.

Table 1. Comparison of the results of ablation experiments.

Table 2. Performance of each category of the VisDrone2019 dataset with mAP@0.5.

Table 3. Comparison of detection performance of different algorithms.

Table 4. UAV detection algorithm performance comparison.

Table 5. Comparison of the improved YOLOv3 algorithm.