TO–YOLOX: a pure CNN tiny object detection model for remote sensing images

ABSTRACT Remote sensing and deep learning are being widely combined in tasks such as urban planning and disaster prevention. However, owing to interference caused by density, overlap, and coverage, tiny object detection in remote sensing images has always been a difficult problem. Therefore, we propose a novel TO–YOLOX (Tiny Object–You Only Look Once) model. TO–YOLOX possesses a MiSo (Multiple-in-Single-out) feature fusion structure that incorporates a spatial-shift operation; the model balances positive and negative samples and enhances the information interaction within local patches of remote sensing images. TO–YOLOX utilizes an adaptive IOU-T (Intersection over Union-Tiny) loss to enhance the localization accuracy of tiny objects, and it applies the attention mechanism Group-CBAM (group-convolutional block attention module) to enhance the perception of tiny objects in remote sensing images. To verify the effectiveness and efficiency of TO–YOLOX, we utilized three aerial-photography tiny object detection datasets, namely VisDrone2021, Tiny Person, and DOTA–HBB, on which the following mean average precision (mAP) values were recorded, respectively: 45.31% (+10.03%), 28.9% (+9.36%), and 63.02% (+9.62%). With respect to recognizing tiny objects, TO–YOLOX exhibits a stronger ability than Faster R-CNN, RetinaNet, YOLOv5, YOLOv6, YOLOv7, and YOLOX, and the proposed model exhibits fast computation.


Introduction
Recently, satellite remote sensing technology has advanced considerably (Guo et al. 2021; Guo, Hackmann, and Gong 2021a), and the spatial resolution of remote sensing images has been immensely enhanced compared with that of previous technologies; accordingly, an increasing number of studies on remote sensing image target detection have been conducted (Ge et al. 2021; Y. Li et al. 2021; Vasu et al. 2020). Using remote sensing image target detection, the targets in remote sensing images can be accurately and efficiently identified, and the class and location of the targets can be determined (Gao et al. 2021; Guo, Hackmann, and Gong 2021b). Thus, the technology is crucial for fields such as disaster monitoring (Z. Chen et al. 2022; Mohan et al. 2021; Z. Yu, Chang, and Chen 2022; Z. Yu et al. 2022), urban planning (Chang et al. 2023; Zhang et al. 2022), traffic management (Tao et al. 2023), and resource exploration (Z. Chen et al. 2022). Furthermore, target detection algorithms have evolved from the traditional machine learning era, which was characterized by numerous practical applications, to the current stage, which is characterized by the widespread application of deep learning (Z. Chen et al. 2023; Liang et al. 2022). First, R-CNN (region convolutional neural networks) pioneered the utilization of deep learning for target detection (Cai and Vasconcelos 2018; Girshick et al. 2014); second, the classic two-stage detection model Faster R-CNN (Ren et al. 2017) emerged; third, the YOLO and SSD (Single Shot Detector) model series, which focus on efficiency and computational speed (W. Liu et al. 2016), became prevalent; and, finally, transformer-based detection models (Devlin et al. 2019; Dosovitskiy et al. 2021) are gradually pervading the research domain. However, with respect to small-scale target detection tasks, numerous problems remain to be solved; for instance, many of the current advanced target detection models find it difficult to obtain excellent detection results for small-scale targets (Chang et al. 2022; Youliang Chen et al. 2023). In the MS COCO (Microsoft Common Objects in Context) dataset, tiny objects are defined as targets smaller than 32 pixels or occupying less than 1% of the image area (Krishna and Jawahar 2018; Lin et al. 2014). Because small-scale targets exhibit lower resolution than normal-scale targets, less pixel information can be utilized, and the feature maps corresponding to small targets shrink as the convolutional neural network downsamples. When the downsampling stride exceeds the size of the small target, a large amount of location information may be lost. In addition, small targets in complex scenarios are usually accompanied by occlusion and dense distribution, and, for numerous reasons, the detection of small targets becomes more difficult in practical applications. The difficulty associated with small target detection can generally be attributed to the following categories: (1) small targets exhibit fewer usable features in the image; (2) small targets require high localization accuracy; (3) the presence of large-scale targets in the image weakens the model's focus on small targets; (4) small targets are usually accompanied by aggregation problems; (5) the structure of many object detection models based on CNNs or Transformers is not optimal for detecting small targets; and (6) existing small target datasets remain imperfect, and most application scenarios contain only a few small targets.
Small-scale targets in images are difficult to extract. To solve this problem, Kong et al. (2016) proposed HyperNet, which enhanced the fusion of shallow location and deep semantic information as well as the recall rate of tiny object detection. Lim et al. (2021) proposed a multi-scale approach based on contextual connectivity; as context, the researchers utilized features of different depths in the network, and the features were combined with an attention mechanism to focus on the tiny objects in images. Similarly, the FPN proposal (Lin et al. 2017) made detection models more amenable to tiny object detection. Gong et al. (2021) proposed a novel concept (i.e. the fusion factor) for controlling the information that deep layers supply to shallow layers, thereby adapting FPN to the detection of tiny objects. Kisantal et al. (2019) proposed a copy-augmentation method that increases the number of training samples for tiny objects by copying and pasting them multiple times in an image, which enhances detection performance. Chen et al. (2020) scaled and spliced the images during training to shrink the objects in the dataset, thus enhancing the model's ability to recognize tiny objects from the data perspective. Similarly, some studies have applied GANs (Generative Adversarial Networks) (Creswell et al. 2018) to tiny object detection tasks; Bai et al. (2018) addressed the difficult detection of small targets by proposing super-resolution processing for the small-scale candidate features generated by a two-stage detection algorithm, and they utilized a GAN to enhance the useful information of tiny objects. Generally, anchor-free models, which can directly predict the position of objects, are considered more suitable for tiny object detection tasks (Tong, Wu, and Zhou 2020). Such models exhibit fewer parameters in the detection head, are faster to compute, and are more adaptable to variations in target scale than anchor-based models (Duan et al. 2019). For small object detection, the research directions are mainly divided as follows: (1) context learning, which incorporates more detailed features to enrich the model's ability to detect small objects; (2) multi-scale learning, which utilizes a divide-and-conquer strategy to enhance the detection of small objects; (3) data augmentation methods, which increase the number of times the model learns from small objects; (4) the utilization of generative adversarial networks to enhance the resolution of small objects; and (5) the utilization of anchor-free object detection models.
The current study proposes the TO-YOLOX model and utilizes the VisDrone2021 (X. Yu et al. 2020), Tiny Person (P. Zhu et al. 2022), and DOTA-HBB (Dataset for Object Detection in Aerial Images-Horizontal Bounding Boxes) (Xia et al. 2018) datasets to verify its effect. The main contributions are as follows: (1) It is demonstrated that the Multiple-in-Multiple-out (MiMo) (Q. Chen et al. 2021) structure may introduce more negative samples when the model performs classification calculations, and that the Multiple-in-Single-out (MiSo) structure with spatial-shift can alleviate the imbalance between positive and negative samples when extracting high-level information from remote sensing images. (2) An adaptive loss function, namely IOU-T, is proposed; for the tiny objects in remote sensing images, IOU-T can enhance the localization performance of the model, and it can be extended to other loss functions, such as GIoU, in a straightforward manner. (3) Because tiny objects carry little spatial pixel information and the downsampling of the convolutional neural network can cause the loss of position information, a novel feature fusion method that enhances the positioning accuracy of the detection model is proposed. (4) To enhance the tiny object perception ability of the detection model in remote sensing images, the study proposes a group-convolutional block attention module (Group-CBAM) derived from CBAM attention (Woo et al. 2018); the proposed module exhibits stronger robustness, fewer parameters, and faster calculation speed. (5) The novel approach proposed herein can be easily and quickly implemented in other models, and it is equally adaptable to other tiny object detection tasks.
In the subsequent sections, we address the five aspects of the study in depth, in the following order: (1) the dataset selected for this study is introduced; (2) the proposed innovations for rectifying the difficult detection of tiny objects are illustrated in detail; (3) the experimental results of this study are presented; (4) related issues are highlighted and discussed; and (5), finally, the conclusions are offered.

Material
The current study mainly aims to enhance tiny object detection performance in remote sensing images (Luo et al. 2011; H. Li et al. 2021). To ensure the robustness and stability of the experimental results, we choose three remote sensing image datasets for tiny object detection, namely VisDrone2021, Tiny Person, and DOTA-HBB; moreover, the remote sensing image sources include aerial remote sensing and UAV remote sensing. To preserve the small-scale information of the targets and to reduce the computational complexity of the model, the images utilized for model training, validation, and testing are all maintained at a 640 × 640 pixel size.

Visdrone2021
The VisDrone2021 dataset (Cao et al. 2021) includes 263 video clips, with a total of 179,264 frames, and 10,209 static images; 6,471 remote sensing images are selected as the training set, 548 images as the validation set, and 3,190 images as the test set. Image categories include pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. Moreover, in the VisDrone2021 dataset, many remote sensing images contain a large number of dense, small targets; thus, the detection results of deep learning-based models are affected.

Tiny person
In the Tiny Person dataset, long-distance images are taken by drones, and most objects, which exhibit a variety of poses and viewing angles, are smaller than 20 × 20 pixels. The labelled Tiny Person dataset contains 1,610 images and 72,651 annotations, including 794 remote sensing images in the training set and 816 images in the validation set (Yu et al. 2020); moreover, it comprises two categories, namely sea person and earth person.

DOTA-HBB
DOTA-HBB is mainly utilized for object detection in remote sensing images, including both rotated and horizontal detection. A total of 403,318 objects are included, and the size of each image ranges from 800 × 800 to 4,000 × 4,000 pixels (Xia et al. 2018). If the DOTA images were directly compressed to 640 × 640 for training, object information would be lost. Therefore, the images with a side length exceeding 1,920 pixels are cropped into tiles with a 320-pixel overlap. The new dataset contains 6,285, 700, and 776 remote sensing images in the training, validation, and test sets, respectively.
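The tiling described above can be sketched as follows. This is an illustrative sketch rather than the authors' preprocessing code; the 640-pixel window matches the training resolution stated earlier, and the 320-pixel overlap follows the text, but the exact windowing policy is an assumption.

```python
def tile_positions(length, window=640, overlap=320):
    """Return start offsets of windows covering `length` pixels,
    with consecutive windows overlapping by `overlap` pixels."""
    stride = window - overlap
    starts = list(range(0, max(length - window, 0) + 1, stride))
    # Ensure the final window reaches the image border.
    if starts[-1] + window < length:
        starts.append(length - window)
    return starts

# A 1,920-pixel side yields five overlapping 640-pixel windows.
print(tile_positions(1920))  # [0, 320, 640, 960, 1280]
```

Cropping each image at the Cartesian product of the row and column offsets (and clipping annotations to each tile) produces the enlarged training set.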

Methods
Herein, the performance and detection speed of the model were comprehensively considered, and the YOLOX model was chosen as the benchmark model (Huang et al. 2023; Yuan et al. 2022). All improvement experiments are completed on the YOLOX-s model, hereinafter referred to as YOLOX. YOLOX is an enhanced, high-speed YOLOv3-based model that exhibits satisfactory performance. Regarding its main enhancements, YOLOX (i.e. an anchor-free model with Focus in its backbone) exhibits features including decoupled heads and SimOTA. YOLOX utilizes CSPDarknet to enhance learning ability, and it exhibits fewer parameters and faster calculation speed. For feature fusion, YOLOX utilizes a feature pyramid network (FPN) with a bottom-up structure; thus, it can efficiently promote the fusion of detailed features and semantic features at different scales. Moreover, YOLOX, which utilizes SimOTA to assign positive and negative samples, can observably enhance the detection capability. Finally, for tiny objects in remote sensing images, the current study proposes a novel YOLOX-based object detection algorithm (TO-YOLOX), which utilizes fewer parameters than models such as YOLOX, YOLOv5, and YOLOv6; thus, a more suitable algorithm is proposed (Figure 1).

Multiple-in-Single-out (MiSo)
In YOLOX, the outputs of the feature fusion module are 20 × 20, 40 × 40, and 80 × 80 in size (Figure 2), corresponding to large, medium, and tiny object detection, respectively. YOLOX obtains the coordinate relationship between the GT (Ground Truth) centre point and the grids' centre points and calculates the cost matrix; thus, it determines the positive and negative samples. Consequently, when a tiny object is trained, the 20 × 20 and 40 × 40 detection heads contain more negative samples than the 80 × 80 detection head, and the 20 × 20 and 40 × 40 detection heads exhibit a higher cost in the cost matrix. Moreover, the same scenario occurs in SOTA multi-head detection models such as YOLOv6 and YOLOv7.
Almost all contemporary SOTA object detection algorithms utilize the MiMo structure; to verify this hypothesis, the study randomly chooses 7 batches comprising 56 annotated images, and the ratios of positive samples matched in the three-scale detection heads are calculated respectively (Figure 3). It can be observed that the 20 × 20 and 40 × 40 detection heads match only a few positive samples; in batch A, the 20 × 20 detection head does not match any positive sample at all. In the positive and negative sample matching process of the YOLO series models, if a tiny object is matched by a low-resolution detection head (i.e. 20 × 20 or 40 × 40), the object will usually also be matched by the high-resolution detection head (i.e. 80 × 80).
Therefore, in the experimental stage, we first train and test the three scenarios depicted in Figure 2(a-c). The experimental results prove that the MiSo structure is the most suitable for tiny object detection in remote sensing images and substantially reduces the computational cost compared to the MiMo structure. Because the model with the MiSo structure exhibits greater adaptation to tiny object detection tasks and a lower computational cost, the later experiments of the current study are based on this model (Figure 2(c)).

An adaptive IOU-T loss for tiny object detection
The IoU loss function is widely utilized in many contemporary object detection models (Zheng et al. 2020), such as YOLOv5 and YOLOX, as indicated in Formulas (1) and (2), where A denotes the GT and B denotes the prediction result (PR) assigned to the corresponding GT.
Because a tiny object occupies few pixels and is susceptible to prediction error, its positioning accuracy is easily degraded (Wang et al. 2021). Therefore, the IOU-T loss function is proposed. This study adaptively adjusts the position loss weight through the GT pixel area, as depicted in Formulas (3)-(6), where GTs represents the pixel area of each GT in a patch, and k denotes the coefficient that controls the position loss weight of the model during training. When calculating the position loss weight, the result is divided by mean(1 − (GTs/100²)^k) to maintain the ratio between the overall position loss and the classification loss. Figure 4 illustrates that the utilization of IOU-T loss enables the model to impose a greater penalty when detecting tiny objects. To verify the validity of IOU-T, k is set to 0.5, 1, and 2.
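The weighting described above can be sketched as follows, assuming the per-GT weight is 1 − (area/100²)^k normalized by its batch mean, as in Formulas (3)-(6); this is a minimal interpretation of the text, not the authors' released code, and details such as clamping are assumptions.

```python
import torch

def iou_t_loss(iou, gt_areas, k=1.0):
    """Adaptive IOU-T loss sketch.
    iou:      (N,) IoU between predictions and their assigned GTs.
    gt_areas: (N,) pixel areas of the GT boxes in the patch."""
    # Smaller GT area -> weight closer to 1; ~100x100 GT -> weight near 0.
    w = 1.0 - (gt_areas / 100.0 ** 2).clamp(max=1.0) ** k
    # Normalize by the mean so the position/classification loss ratio holds.
    w = w / w.mean()
    return ((1.0 - iou) * w).mean()
```

With this weighting, a mislocalized 16 × 16 object incurs a far larger penalty than an equally mislocalized 96 × 96 object, which is precisely the behaviour Figure 4 illustrates.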

Enhancing the model's performance by introducing shallow information and adding an up-sampling layer
The VisDrone2021, Tiny Person, and DOTA datasets contain a large number of tiny objects; therefore, the model requires more detailed representation information. To enhance the detection ability for tiny objects, this study utilizes two methods. (1) To optimally fuse the deep semantic information and shallow detailed information, the output of each backbone network stage is fused with the feature map output by the MiSo structure.
(2) Because the images in DOTA comprise a large number of overlapping tiny objects, a detection feature map of 80 × 80 size may still lead to poor detection. This study therefore upsamples the MiSo output from 80 × 80 to 160 × 160.
Because we utilize the MiSo framework in the overall detection structure, upsampling the feature map at the detection head does not introduce excessive computation, and for detecting dense, tiny, and overlapping objects, it enhances the generalization ability of the model.
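The upsampling step can be sketched as follows (an assumed minimal form, with a placeholder channel width and nearest-neighbour interpolation; the paper does not specify these details):

```python
import torch
import torch.nn as nn

# Upsample the single MiSo output from an 80x80 to a 160x160 grid so that
# dense, overlapping tiny objects fall on a finer prediction grid.
upsample = nn.Upsample(scale_factor=2, mode="nearest")

miso_out = torch.randn(1, 256, 80, 80)   # fused multi-scale feature map
head_in = upsample(miso_out)             # shape: (1, 256, 160, 160)
```

Because there is only one detection head, doubling its resolution quadruples the number of grid cells at that head alone, which is far cheaper than keeping the full three-head MiMo pyramid.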

Group-CBAM
The attention mechanism can enhance the global perception ability of CNN-based detection models, and it can further enhance the detection performance for tiny objects (C. Zhu, He, and Savvides 2019; Z. Zhu et al. 2019). CBAM attention focuses on both spatial perception and channel perception (Figure 5). For tiny, dense, and overlapping object detection tasks, the spatial perception ability exerts a more crucial impact on the model than the channel perception ability.
To further enhance spatial perception ability, the study proposes Group-CBAM, which requires less computation and fewer parameters than CBAM (by analogy with grouped convolution; Figure 6). It is assumed that the input dimensions are B, C, H, and W. The spatial descriptor computed by CBAM has size B × H × W × 2, whereas that obtained by Group-CBAM has size B × H × W × 2 × G, where G denotes the number of groups. Therefore, Group-CBAM exhibits a stronger spatial perception ability.
To indicate the difference between CBAM and Group-CBAM, we utilize PyTorch and Thop to calculate the number of parameters and the amount of computation, respectively; with regard to the input scale (B, C, H, W), B denotes the batch size with a value of 50, C denotes the number of channels with a value of 512, and H and W denote the input height and width, respectively, both set to 80. After analysis (Table 1), the current study utilized Group-CBAM with a group number of 4.
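The grouped spatial-attention idea can be sketched as follows. This is a reconstruction from the description above, not the authors' module: the channels are split into G groups, each group gets its own mean/max spatial descriptor (as in standard CBAM), and a grouped convolution produces G spatial attention maps instead of one. The 7 × 7 kernel follows CBAM's convention and is an assumption here.

```python
import torch
import torch.nn as nn

class GroupSpatialAttention(nn.Module):
    """Sketch of the spatial branch of a Group-CBAM-style module."""
    def __init__(self, groups=4, kernel_size=7):
        super().__init__()
        self.groups = groups
        # One 2->1 mapping per group, implemented as a grouped convolution.
        self.conv = nn.Conv2d(2 * groups, groups, kernel_size,
                              padding=kernel_size // 2, groups=groups)

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.groups
        xg = x.view(b, g, c // g, h, w)
        avg = xg.mean(dim=2)                        # (B, G, H, W)
        mx = xg.max(dim=2).values                   # (B, G, H, W)
        # Interleave [avg, max] per group -> (B, 2G, H, W).
        desc = torch.stack([avg, mx], dim=2).view(b, 2 * g, h, w)
        attn = torch.sigmoid(self.conv(desc))       # (B, G, H, W)
        return (xg * attn.unsqueeze(2)).view(b, c, h, w)
```

With G = 4 (the setting chosen in Table 1), each quarter of the channels receives its own spatial map, giving a richer spatial descriptor at negligible parameter cost.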

Spatial-shift
In the YOLO series models, a strict correspondence is implemented between the detected target and the grid position of the final output structure. Herein, to enhance the connectivity between different windows, the shifted windows proposed in the Swin Transformer are considered, and a spatial-shift calculation within the MiSo structure is proposed (Figure 7); thus, the interconnection between different spatial locations and the final detection of the model can be enhanced. In the spatial-shift, given an input tensor, a small portion of the channels is shifted along four spatial directions, namely left, right, up, and down, whereas the rest of the channels remain unchanged. Moreover, the spatial-shift involves no parameters or arithmetic calculation (Figure 8). Finally, the novel model (i.e. TO-YOLOX) is constructed (Figure 1); its structure is simpler and exhibits fewer parameters than YOLOX.
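The parameter-free shift can be sketched as follows (a common formulation of spatial shift; the fraction of channels moved per direction, here C/8 per direction, is an assumption not stated in the text):

```python
import torch

def spatial_shift(x):
    """Shift four small channel slices left/right/up/down; the rest of the
    channels are left unchanged. No parameters, no arithmetic, pure copies."""
    b, c, h, w = x.shape
    s = c // 8                                        # channels per direction
    out = x.clone()
    out[:, 0*s:1*s, :, 1:] = x[:, 0*s:1*s, :, :-1]    # shift right
    out[:, 1*s:2*s, :, :-1] = x[:, 1*s:2*s, :, 1:]    # shift left
    out[:, 2*s:3*s, 1:, :] = x[:, 2*s:3*s, :-1, :]    # shift down
    out[:, 3*s:4*s, :-1, :] = x[:, 3*s:4*s, 1:, :]    # shift up
    return out
```

After the shift, each grid cell's feature vector mixes information from its four neighbours, so the strict cell-to-target correspondence of the detection head gains local context at zero parameter cost.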

Evaluation metrics
To measure the performance of the model, this article mainly utilizes AP (average precision) as depicted in Formula (7), mAP (mean average precision) as illustrated in Formula (8), and FPS (frames per second). TP, FP, and FN denote True Positive, False Positive, and False Negative, respectively. Precision is depicted in Formula (9), whereas Recall is illustrated in Formula (10). AP represents the integral of Precision over Recall, with the confidence threshold varied from 0 to 1 in steps of 0.01, whereas mAP represents the average AP value over all categories. A higher mAP indicates better detection performance.
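The metrics in Formulas (7)-(10) can be sketched as follows (a simplified all-point AP, accumulating precision over recall increments; the exact interpolation scheme used by the authors is not specified):

```python
def precision(tp, fp):
    """Formula (9): fraction of detections that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Formula (10): fraction of ground truths that are found."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(recalls, precisions):
    """Formula (7), discretized: sum precision times recall increments,
    with (recall, precision) pairs swept over confidence thresholds."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# mAP (Formula (8)) is then the mean of AP over all object categories.
```

In practice the (recall, precision) pairs come from sorting detections by confidence and sweeping the threshold from 0 to 1 in steps of 0.01, as stated above.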

Performance of the proposed model
The study utilizes three remote sensing datasets (i.e. VisDrone2021, Tiny Person, and DOTA-HBB) to evaluate the performance of the proposed TO-YOLOX model, and mAP is utilized as the main evaluation indicator. To facilitate tabulation, the YOLOX-based model variants are defined as follows: A, YOLOX reserving two detection heads (40 × 40 and 80 × 80); B, YOLOX reserving only the 80 × 80 detection head; C, B + IOU-T loss (k = 0.5); D, B + IOU-T loss (k = 1); E, B + IOU-T loss (k = 2); F, D + feature fusion and up-sampling; G, F + CBAM; H, F + Group-CBAM (Group = 4); and TO, H + spatial-shift (i.e. the full TO-YOLOX).

Visdrone2021
Herein, to verify the effectiveness and efficiency of the TO-YOLOX model proposed for tiny object detection in remote sensing images, the VisDrone2021 dataset was first utilized for validation. Moreover, we compared SOTA models such as RetinaNet and YOLOv5 with TO-YOLOX. For tiny object recognition, it can be observed that TO-YOLOX exhibits the best performance among the compared detection models, and that the proposed model exhibits a better balance between detection performance and detection speed (Table 2) (C. Liu et al. 2023; Yang et al. 2023).
To analyse TO-YOLOX, an ablation study was conducted; thus, the contribution of the key components with different hyperparameters was evaluated.Finally, for the VisDrone2021 dataset, TO-YOLOX exhibited a 10.03% mAP increase compared to the YOLOX model (Table 3).
In addition, to vividly observe the differences between YOLOX and TO-YOLOX, two UAV remote sensing images containing multiple tiny objects are randomly selected from the validation set for testing the performance of the models. Figure 9(a and b) indicates that for denser targets, TO-YOLOX exhibits a better detection effect than YOLOX, and Figure 9(c and d) indicates that TO-YOLOX can detect the pedestrians in the lower left part of the image that YOLOX misses. Apparently, TO-YOLOX exhibits a better detection capability.

Tiny person
To verify the TO-YOLOX recognition effect for small targets in remote sensing images, Tiny Person is selected as the second dataset. Table 4 presents the results of TO-YOLOX and those of several SOTA object detection models.
Furthermore, for Tiny Person, the relevant ablation experiments evaluating TO-YOLOX and the other variants are illustrated in Table 5. In addition, to vividly observe the differences between YOLOX and TO-YOLOX, two UAV remote sensing images containing multiple tiny objects are randomly selected from the validation set to test the performance of the models. Figure 10 indicates that TO-YOLOX performs better both for detection in dense areas and for smaller-scale person detection. Apparently, TO-YOLOX exhibits the better detection capability.

DOTA-HBB
DOTA-HBB represents the final dataset selected herein to validate the effectiveness of TO-YOLOX for recognizing tiny objects in remote sensing images. We can clearly observe that the proposed TO-YOLOX model obtains the best small target recognition results among the compared target detection algorithms, with a 63.02% mAP on DOTA-HBB (Table 6). Similarly, the relevant ablation experiments evaluating TO-YOLOX and the other variants on DOTA-HBB are depicted in Table 7, where TO-YOLOX exhibits an mAP value that exceeds that of the benchmark model by 9.62%. In addition, to vividly observe the differences between YOLOX and TO-YOLOX, two UAV remote sensing images containing multiple tiny objects are randomly selected from the validation set to test the performance of the models. It is observed that TO-YOLOX obtains a better detection capability (Figure 11).

The performance of the MiMo-and MiSo-based model
Many contemporary SOTA target detection algorithms, such as YOLOv3-v7, YOLOX, PP-YOLO, SSD, and Cascade R-CNN, utilize multiple heads for object detection. By contrast, this study proposes a TO-YOLOX model that utilizes the MiSo structure; thus, single-head small target detection in remote sensing images is achieved, which represents a notable development for small target recognition in remote sensing images. A model with the MiMo structure introduces numerous negative samples: in YOLOX, SimOTA assigns positive samples to each GT, and because the 20 × 20 and 40 × 40 resolution detection heads incur a higher cost, the optimization usually favours the 80 × 80 resolution detection head instead. The proposed single-head model is therefore practically robust.
For the detection of small targets in remote sensing images, to comprehensively verify the scientific soundness and rationality of MiSo, we visualize the prediction results of YOLOX and its MiSo-based counterpart (Figure 12). It can be observed that the 20 × 20 and 40 × 40 resolution detection heads (Figure 12(a and b)) in YOLOX detect only a small number of targets; indeed, the 20 × 20 resolution detection head does not detect any targets. Moreover, the objects detected by the 40 × 40 resolution detection head were also successfully detected by the 80 × 80 resolution detection head (Figure 12(c)). To solve the positive and negative sample imbalance occasioned by applying the MiMo structure to the detection of small targets in remote sensing images, the current study re-trained the MiSo structure-based model with the same training hyperparameters, and the test results are illustrated in Figure 12(d). Moreover, this study preliminarily postulates that, in the small target recognition scenario, detection algorithms based on adaptive positive and negative sample matching are well suited to the MiSo detection structure, whereas detection algorithms that match the three-scale detection results to the ground truth according to IoU are better suited to the MiMo detection structure. To verify this proposal, this study trains YOLOv5 and YOLOv7 with MiMo (20 × 20, 40 × 40, 80 × 80) and MiSo (80 × 80) structures on DOTA-HBB and calculates the correlation metrics (Table 8). Figure 13 indicates that the detection performance of YOLOv5 degrades with the utilization of the MiSo structure; by contrast, although YOLOv7 with MiSo obtains a substantial reduction in the number of model parameters and in computation, its detection performance is not affected by the utilization of a single detection head.

The generalization of the IOU-T loss function
Optimization functions for deep learning-based object detection algorithms usually include a category loss and a localization loss; the localization loss has transitioned from the initial use of L1 and L2 losses to Smooth L1 loss, and subsequently to IoU loss and related variants. Herein, the IOU-T loss function is proposed on the basis of IoU loss, and its main working principle entails enhancing the loss weight coefficient of small objects and reducing that of large objects, which improves the model's detection of small objects. The effectiveness of the IOU-T loss function was verified in the aforementioned experiments, and its generalization is now discussed and verified. Three loss functions, namely CIoU, DIoU (Zheng et al. 2020), and GIoU (Rezatofighi et al. 2019), are selected; Formulas (3)-(6) and (11)-(14) express the loss functions of CIoU-T, DIoU-T, and GIoU-T for small target detection, where IoU_{c,d,g} denotes the CIoU, DIoU, or GIoU formula.
Loss = Loss_IoU × Weight (14)
The generalization performance of the IOU-T principle is also verified using the VisDrone2021, Tiny Person, and DOTA-HBB datasets, and Table 9 indicates that the IoU-T principle generalizes well over CIoU, DIoU, and GIoU, with the k-factor set to 1.

The perception of TO-YOLOX for tiny objects in remote sensing images
Owing to its lack of interpretability, a neural network-based model is usually categorized as a black-box model. The current study utilized the Grad-CAM method and randomly selected a remote sensing image from the test set; thus, the feature maps were visualized during the network model's computation (Figure 14). A brighter colour indicates a region to which the model attends more strongly. Unlike semantic segmentation algorithms, target detection algorithms that utilize the grid centre to determine target location, such as the YOLO series models and RetinaNet, base their predictions on the centre of the prediction box. Therefore, a higher level of attention on the targets is usually preferable. It can be observed that the attention region obtained by the proposed TO-YOLOX visualization is more accurate; moreover, the proposed model's attention to small targets in remote sensing images is sharper than that of other algorithms such as YOLOX, and its actual detection effect is the best. Therefore, the proposed TO-YOLOX model exhibits crucial research significance and practical application value for the recognition of small targets in remote sensing images.

The effectiveness of TO-YOLOX for tiny object
Herein, we propose an end-to-end TO-YOLOX algorithm based on the YOLO series of target detection models, incorporating the following innovations: the MiSo structure and the IoU-T loss function. Moreover, the study introduces the notion that spatial-shift can enhance the model's perception of small targets in remote sensing images, and proposes Group-CBAM based on the CBAM attention mechanism. The proposed Group-CBAM requires fewer parameters and less computation than the CBAM attention mechanism, yet exhibits a stronger spatial perception capability. Compared with many contemporary SOTA target detection algorithms, the final TO-YOLOX algorithm achieves the best detection results on three remote sensing image datasets containing a large number of small targets, namely VisDrone2021, Tiny Person, and DOTA-HBB.
Because the development of computer hardware devices is still limited, computational complexity is also a crucial indicator of a model's application value. The TO-YOLOX proposed herein utilizes a single detection head, and to further enhance the perception of small targets in remote sensing images, the feature map at the single detection head is upsampled by a factor of two; therefore, some speed reduction exists. For scenarios that require a high detection speed, the upsampling calculation can be eliminated. We experimentally verified that TO-YOLOX without upsampling achieves a 43.67% VisDrone2021 mAP, a 24.59% Tiny Person mAP, and a 62.31% DOTA-HBB mAP; compared with RetinaNet, YOLOv5, YOLOv6, and YOLOv7, this still represents an immense improvement, and TO-YOLOX exhibits a faster detection speed and lower computational complexity.
In practical applications, the proposed TO-YOLOX model is suitable for the detection of tiny objects in various scenarios, including remote sensing images and medical images. It is worth mentioning that, owing to the MiSo detection structure utilized in TO-YOLOX, the model is not always fully suitable for scenarios containing numerous large objects and only a few small objects. To solve this problem, a 32× downsampling detection head can be added to the current structure, which contains only an 8× downsampling detection head; thus, the model becomes more suitable for multi-scale target detection. For small-target detection tasks, the precision of the model is often higher than the recall. In future TO-YOLOX-based studies, more novel detection structures can be utilized for small target recognition: the full-resolution outputs of semantic segmentation can be fused to achieve the detection of small targets (Yu et al. 2022), super-resolution models can be incorporated into single-stage detection models, or attention mechanisms and losses that are more suitable for the detection of small objects can be identified.

Conclusion
This study takes the difficult detection of tiny and dense objects in remote sensing images as its starting point and selects three remote sensing tiny object datasets (i.e. VisDrone2021, Tiny Person, and DOTA-HBB) for validation. The study proposes a novel model (i.e. TO-YOLOX) with the MiSo structure, which alleviates the positive and negative sample imbalance of the model; a novel adaptive loss that enhances the model's tiny object performance (i.e. IoU-T); an enhanced attention mechanism with faster computation and fewer parameters, named Group-CBAM; and a structure for enhancing the interconnection between objects in different regions (i.e. spatial-shift). It can be observed that TO-YOLOX is an effective approach for tiny object detection in remote sensing images, with higher detection accuracy and faster computation. TO-YOLOX achieves 45.31% mAP (+10.03%) on VisDrone2021, 28.91% mAP (+9.36%) on Tiny Person, and 63.02% mAP (+9.62%) on DOTA-HBB, and it expresses better tiny object detection results than models such as YOLOv5 and YOLOv6.

Figure 1. The overall TO-YOLOX structure, which utilizes CSPDarknet as a backbone and the MiSo detection framework.

Figure 3. The ratio of the matched positive samples of different detection heads, where A, B, C, D, E, F, and G represent the different batches.

Figure 4. The loss variation graph with different k values: (a) when the regression loss ratio and classification loss ratio are not maintained; (b) when the regression loss ratio and classification loss ratio are maintained.

Figure 5. The CBAM spatial attention module and channel attention module.
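As background for the CBAM modules shown in Figure 5, the following is a minimal sketch of a standard CBAM block: channel attention from average- and max-pooled descriptors passed through a shared MLP, followed by a 7 × 7-conv spatial attention. The grouped modification that yields Group-CBAM is not specified in this excerpt and is therefore not reproduced; the reduction ratio `r` is an illustrative choice.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM block (Woo et al. 2018): channel attention, then
    spatial attention. Group-CBAM in the paper modifies this block with
    grouped operations whose exact form is not given in this excerpt."""
    def __init__(self, c, r=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        n, c, h, w = x.shape
        # Channel attention: shared MLP over avg- and max-pooled vectors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise avg/max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

y = CBAM(8)(torch.randn(2, 8, 16, 16))
```

Both attention maps are multiplicative gates in (0, 1), so the block reweights features without changing the tensor shape.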
Setting of the experiments
This study utilizes the DOTA-HBB, VisDrone2021, and Tiny Person remote sensing image tiny object datasets to verify the detection performance of the TO-YOLOX model. The experimental environment is Ubuntu 18.04 with Python 3.7, PyTorch 1.7.1, and CUDA 11.1, using a 12th Gen Intel(R) Core(TM) i7-12700KF CPU and an NVIDIA GeForce RTX 3090 GPU. All training runs of 100 epochs are divided into two stages. In the first stage, the batch size is set at 8, the initial learning rate is 0.001, and the learning rate is updated each epoch with a 0.92 decay rate. In the second stage, the batch size is set at 4, the initial learning rate is 0.0001, and a 0.92 decay rate is likewise utilized to update the learning rate each training epoch. All training utilized Mosaic augmentation, including flipping, scaling, and colour field transformation, fusing four images into one input for training. Moreover, in the last 10 epochs, the training eliminates Mosaic data augmentation and utilizes simple flipping, scaling, hue transformation, saturation transformation, and luminance transformation. In the testing stage, we similarly resize the image to 640 × 640 resolution as the model input and apply only normalization to the input array, with a batch size of 1 and an NMS threshold of 0.3. Moreover, to measure the performance of the model, the AP (average precision), mAP, and FPS (frames per second) are utilized.
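The two-stage learning-rate schedule described above can be sketched as follows. The split point between the two stages is an assumption (this excerpt does not state where stage 2 begins); the base rates and the 0.92 per-epoch decay are taken from the text.

```python
def lr_at(epoch, stage1_epochs=50):
    """Learning rate at a given epoch of the 100-epoch run.
    Stage 1: base lr 1e-3, decayed by 0.92 each epoch.
    Stage 2: restarts from 1e-4 with the same per-epoch decay.
    stage1_epochs=50 is an illustrative assumption."""
    if epoch < stage1_epochs:
        return 1e-3 * 0.92 ** epoch
    return 1e-4 * 0.92 ** (epoch - stage1_epochs)

schedule = [lr_at(e) for e in range(100)]
```

Such an epoch-wise exponential decay is what `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.92)` produces when stepped once per epoch.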

Figure 8. Spatial-Shift, which does not contain any parameters that require training.

Figure 9. The YOLOX and TO-YOLOX detection results on the VisDrone2021 dataset: (a) YOLOX detection results in the first scenario; (b) TO-YOLOX detection results in the first scenario; (c) YOLOX detection results in the second scenario; and (d) TO-YOLOX detection results in the second scenario.

Figure 10. The YOLOX and TO-YOLOX detection results on the Tiny Person dataset: (a) YOLOX detection results in the first scenario; (b) TO-YOLOX detection results in the first scenario; (c) YOLOX detection results in the second scenario; and (d) TO-YOLOX detection results in the second scenario.

Figure 11. Detection results of YOLOX and TO-YOLOX on the DOTA-HBB dataset: (a) YOLOX detection results in the first scenario; (b) TO-YOLOX detection results in the first scenario; (c) YOLOX detection results in the second scenario; and (d) TO-YOLOX detection results in the second scenario.

Figure 12. Detection effects of YOLOX and MiSo structure-based YOLOX: (a) detection results of the 20 × 20 resolution detection head in YOLOX; (b) detection results of the 40 × 40 resolution detection head in YOLOX; (c) detection results of the 80 × 80 resolution detection head in YOLOX; and (d) detection results of the MiSo-structured YOLOX.

Figure 13. The detection results of YOLOv5 and YOLOv7 using MiMo and MiSo structures, respectively.

Figure 14. YOLOX, RetinaNet, YOLOv5, and TO-YOLOX heat maps and detection results: (a) YOLOX detection results and heat map; (b) RetinaNet's detection results and heat map; (c) YOLOv5 detection results and heat map; and (d) TO-YOLOX detection results and heat map.

Table 1. Computational and parametric quantities of CBAM and Group-CBAM.
Figure 7. The matrix transformation of shifted windows and spatial-shift.

Table 2. Performance of TO-YOLOX and other models on VisDrone2021.

Table 4. Performance of TO-YOLOX and other models on Tiny Person.

Table 5. Ablation experiments on Tiny Person.

Table 6. Performance of TO-YOLOX and other models on DOTA-HBB.
It can be observed that the detection results of the MiSo model are more accurate; the model computation is decreased by approximately 22%, and the model parameters are reduced by approximately 26%.

Table 9. The generalized performance of IoU-T loss on CIoU, DIoU, and GIoU.
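Table 9 evaluates IoU-T on top of standard IoU variants. As background, the following is a minimal sketch of the plain IoU and GIoU quantities that these variants share; the adaptive reweighting that defines IoU-T itself is paper-specific and not reproduced here.

```python
def iou_and_giou(a, b):
    """IoU and GIoU for axis-aligned boxes in (x1, y1, x2, y2) form.
    GIoU subtracts a penalty based on the smallest enclosing box,
    so it stays informative even for non-overlapping boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    iou = inter / union
    # Smallest enclosing box for the GIoU penalty term.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (c - union) / c
```

The corresponding losses are 1 − IoU and 1 − GIoU; DIoU and CIoU add centre-distance and aspect-ratio penalties to the same base IoU.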