Arbitrary-angle bounding box based location for object detection in remote sensing images

ABSTRACT Object location is a fundamental yet challenging problem in object detection. In remote sensing images, different imaging projection directions give the same object various rotation angles, and in some scenes the objects are densely distributed. Most existing deep learning-based object detection algorithms use horizontal bounding boxes to locate objects, which yields inaccurate locations for densely distributed or arbitrarily oriented objects and thus leads to missed detections. In this paper, we propose an arbitrary-angle bounding box based object location method and embed it into the Faster R-CNN, developing a new framework called Rotated Faster R-CNN (R-FRCNN) for object detection in remote sensing images. In R-FRCNN, we extend the anchor ratios to fit objects such as ships with large aspect ratios, and we increase the weight of the horizontal bounding box regression to reduce the interference of the arbitrary-angle bounding box on the horizontal bounding box prediction. Comprehensive experiments on a public dataset and a self-assembled dataset (which we make publicly available) show the superior performance of our method compared to state-of-the-art object detectors.


Introduction
With the continuous development of modern remote sensing technology, a large number of remote sensing images with higher spatial resolution and richer content have emerged, providing important analysis conditions and resources for research in various fields (Ye et al., 2020). Object recognition from images means automatically finding the object(s) of interest and returning their category and location information. Due to the continuous improvement of the spatial resolution of satellite images, the information contained in a single image keeps growing. The currently existing automated image interpretation approaches are unable to meet the needs of many real-world applications in the remote sensing community. Therefore, how to accurately and quickly extract the position and type of objects from satellite images has come into the focus of research.
Object detection from satellite images is widely used in civil (Chen et al., 2015) and military fields (Zhong et al., 2018). Among the many possible objects to detect, airports, ports, oil depots, ships, and airplanes are usually key concerns. The recognition of such objects can be used for airport flight analysis and adjustment, for assistance to rescue ships in distress, for port and maritime traffic monitoring, and it can also provide data for urban planning (Zhong et al., 2018; Hua et al., 2020; Khan et al., 2019; Orengo et al., 2017; Hu et al., 2019). It also plays an important role in modern military systems, providing information for combat deployment, for monitoring the dynamics of airports, ports, and sea areas, and for analyzing enemy combat strength (Bachagha et al., 2020; Melillos et al., 2018). In summary, object recognition from satellite images is an important research topic with broad application prospects. Unlike close-range imaging, each satellite image covers a wide area and has a complex background as well as different projection directions. Hence, object detection faces two challenges: (1) Satellite images are shot downward from high altitude. Different projection directions lead to different rotation angles of the same object, resulting in morphological differences. Large object rotation angles can cause misses and misjudgments in object detection. (2) Some scenarios exhibit high object densities, e.g., aircraft in airports, oil depots, and ships in ports. Closely located objects often interfere with each other in detection algorithms, making them difficult to locate accurately and easy to miss.
Due to their ability to learn high-level semantic features, deep learning and especially Convolutional Neural Networks (CNNs) (LeCun et al., 2015) have been widely and successfully used for object detection. Existing deep learning-based object detection algorithms mostly adopt horizontal bounding boxes to locate objects and apply non-maximum suppression to all bounding boxes output by the network. The purpose of non-maximum suppression is to remove the redundant bounding boxes of an object so as to retain the optimal bounding box as that object's detection result. However, when densely distributed objects appear in an inclined state, non-maximum suppression not only filters the redundant bounding boxes of an object itself but may also filter the bounding boxes of other objects, removing many useful bounding boxes and resulting in poor object detection performance on satellite images.
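To make this failure mode concrete, the following minimal NumPy sketch (ours, not the paper's code) implements standard greedy NMS on horizontal boxes. Two tilted ships lying side by side produce heavily overlapping horizontal boxes, so the lower-scoring one is suppressed even though it is a distinct object:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        order = order[1:][[iou(boxes[i], boxes[j]) < thresh for j in order[1:]]]
    return keep

# Horizontal boxes of two tilted ships side by side overlap heavily
# (IoU ~ 0.68), so NMS keeps only the higher-scoring one.
boxes = np.array([[0, 0, 100, 100], [10, 10, 110, 110]], dtype=float)
scores = np.array([0.9, 0.8])
print(nms(boxes, scores, thresh=0.5))  # only box 0 survives
```

With tight arbitrary-angle boxes the same two ships would barely overlap, which is exactly the motivation for the location scheme proposed below.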
Such missed detections occur frequently where objects are densely distributed. As shown in Figure 1: (a) the ships are nearly horizontal with a slight inclination; the overlap between most object bounding boxes is relatively small, but a few bounding boxes still overlap heavily, so one ship is missed; (b) the ships have a large inclination angle and many object bounding boxes overlap substantially, so non-maximum suppression filters them and two ships are missed.
In recent years, due to the complexity of text distribution in natural scenes, text detection algorithms increasingly use arbitrary-angle bounding boxes, and many excellent text detection algorithms, such as SegLink (Shi et al., 2017), R2CNN, and RRPN, have been proposed. Among them, R2CNN is a milestone for text detection. Its basic framework is based on the Faster R-CNN. By adding an arbitrary-angle bounding box prediction layer at the end of the Faster R-CNN network, R2CNN jointly predicts the coordinates of the arbitrary-angle bounding box. As there is no need to change the main framework of the Faster R-CNN, R2CNN is simple and efficient, and it is widely used in the field of text detection.
Following the idea of R2CNN, we propose an arbitrary-angle bounding box based object location algorithm and embed it into the Faster R-CNN for object detection in remote sensing images. We call this new framework Rotated Faster R-CNN (R-FRCNN). Based on the Faster R-CNN, R-FRCNN introduces the arbitrary-angle bounding box to locate objects without redundancy through two improvements. On the one hand, in the anchor generation stage, we add anchor ratios with larger aspect ratios to generate candidate boxes that fit elongated objects in remote sensing images. On the other hand, we add an arbitrary-angle bounding box prediction layer to the last layer of the object detection network and assign different weights to the predictions of the horizontal bounding box and the arbitrary-angle bounding box. We additionally build a remote sensing image dataset, namely Tiny Object Detection for Remote Sensing Images (TODRS) (Sun & Wu, 2020). It consists of 4615 images with aircraft, ships, and oil depots, labeled with arbitrary-angle bounding boxes.
The contributions of this paper are: (1) We propose an arbitrary-angle bounding box based object location method and add anchor ratios to fit objects like ships with large aspect ratios. With this, we improve the location accuracy of densely distributed and arbitrarily directed objects.
(2) We embed the new object location method into the Faster R-CNN and obtain the Rotated Faster R-CNN (R-FRCNN) for object detection in remote sensing images. (3) We construct, present, and publish the dataset TODRS to aid the development of robust methods for object detection from remote sensing images. The rest of this paper is organized as follows. Section 2 provides a brief overview of related work on object detection from remote sensing images. In Section 3, we introduce our novel object location algorithm along with a thorough theoretical analysis. Section 4 then details the performance evaluation and discussion on two remote sensing object detection datasets. Finally, conclusions and an outlook on future work are given in Section 5.

Related work
In recent years, with the development of feature extraction and machine learning-based classification, object detection has commonly been regarded as a classification problem. Figure 2 shows the basic flow chart of object detection in remote sensing images, which is mainly divided into three steps: (1) Region proposal generation: a strategy is used to generate several object candidate regions in the remote sensing image. Commonly used candidate region generation methods include sliding window search (Felzenszwalb et al., 2009; Zhang et al., 2013; Zhu et al., 2010) and region of interest (ROI) selection (Harel et al., 2007; Hou & Zhang, 2007; Itti et al., 1998).
Deep learning is a new technology that has emerged in the field of artificial intelligence in recent years. Among deep models, CNNs (LeCun et al., 2015), thanks to their powerful learning and feature expression abilities, have achieved better results than traditional methods in many computer vision applications. In the field of object detection in natural scenes, the representative breakthrough is the Region-based Convolutional Neural Network (R-CNN), which uses Selective Search (Uijlings et al., 2013) to obtain a group of region proposals, that is, candidate regions. A CNN then extracts the features of these region proposals, and finally an SVM classifies them. R-CNN is the first successful attempt of deep learning in object detection, and its detection accuracy far exceeds that of traditional detection algorithms. Since then, many deep learning-based object detection algorithms have emerged, mainly divided into two categories: two-stage and one-stage object detection.
In response to the problems of repeated feature extraction and inaccurate object location of candidate regions, Fast R-CNN takes the entire image as the CNN input to obtain the corresponding feature map, and then maps the generated candidate regions onto the feature map using the RoI pooling operation. This way, a large number of repeated feature extraction calculations are avoided, and the time for object detection is reduced. At the same time, Fast R-CNN uses a softmax layer for classification and bounding box regression, which improves the accuracy of object detection. However, Fast R-CNN still uses Selective Search to generate region proposals, which remains very time-consuming. To solve this problem, Faster R-CNN (Ren et al., 2017) introduced a Region Proposal Network (RPN) on top of Fast R-CNN to automatically generate a small number of highly efficient candidate regions, greatly reducing the time consumption and obtaining higher detection accuracy than R-CNN and Fast R-CNN. Since then, two-stage object detection has followed the "RPN + CNN" pattern, and Faster R-CNN has become the most effective representative of two-stage object detection. Subsequent two-stage object detection algorithms, such as FPN (Lin et al., 2017) and Mask R-CNN, are basically improvements on Faster R-CNN. In addition, the representative approaches for one-stage object detection are YOLO (Redmon et al., 2016) and SSD (Liu et al., 2016). Such approaches regard object detection as a regression problem, and the most classic is YOLO. YOLO uses a single end-to-end network that divides the entire image into a fixed grid and predicts candidate bounding boxes, location confidence, and all category probabilities for all grid cells at once, completing the end-to-end process from input image to output object category and location.
Since there is no need to generate candidate regions, the one-stage object detection algorithm is faster than the two-stage object detection algorithm, but the blind meshing also makes the one-stage object detection algorithm detection accuracy lower.
Encouraged by the successful application of deep learning methods on natural scene images, several researchers have tried to use them for detecting objects from remote sensing images. (Cao et al., 2016) and (Yao et al., 2017) used the R-CNN and the Faster R-CNN, respectively, and found that these greatly improved the performance of object detection compared with traditional algorithms in this field. (Deng et al., 2017) proposed the Enhanced Faster R-CNN (E-FRCNN), which uses multi-layer network feature information for feature extraction and further improves the performance. Wu et al. provide a performance analysis of state-of-the-art aircraft type recognition and deep learning approaches on remote sensing images. A large-scale and publicly available object detection benchmark dataset has also been proposed; this new dataset can help the earth observation community to further explore and validate deep learning-based methods.
These success stories proved the feasibility of deep learning-based methods in this field. At the same time, due to the different imaging principles behind remote sensing and natural scene images, there are still many problems when directly applying deep learning to remote sensing images.
One challenging problem is how to locate objects in satellite images, as they can appear at any angle and are often not aligned with the image axes. Moreover, in some scenes, the objects are densely distributed.
Existing deep learning-based algorithms mostly use horizontal bounding boxes to locate objects. As a result, the overlapping bounding boxes of dense objects interfere with each other: in the subsequent non-maximum suppression processing, many useful bounding boxes are removed and the objects they contain are overlooked.
In (Chen et al., 2019), besides minimizing the classification error, Cheng et al. impose a rotation-invariant regularizer and a Fisher discrimination regularizer on the FC7 layer of VGGNet-16 (Simonyan & Zisserman, 2014) to enforce rotation-invariant and highly discriminative CNN features. A unified object detection framework has also been proposed that combines the RPN with a contextual feature fusion network to extract proposals and simultaneously locate geospatial objects. By adding multi-angle anchors in the RPN, that framework can effectively handle geospatial object rotation variations.
Differing from the above approaches, this article introduces an arbitrary-angle bounding box based object location method and embeds it into the Faster R-CNN detection framework. By adding anchor ratios that fit objects with large aspect ratios, and by increasing the weight of the horizontal bounding box regression to reduce the interference of the arbitrary-angle bounding box on the horizontal bounding box prediction, the network can better locate objects in remote sensing images.

Proposed method
We illustrate the framework of the proposed R-FRCNN in Figure 3. The feature extraction network (FEN) and RPN are the same as those of the Faster R-CNN. Differing from the Faster R-CNN, we introduce a new object location scheme based on arbitrary-angle bounding boxes tailored to the characteristics of objects in remote sensing images. Specifically, we change the anchor settings in the RPN and add an arbitrary-angle bounding box prediction layer at the last layer of the object detection network (ODN) to perform coordinate regression of the arbitrary-angle bounding box. We describe the main idea and specific process of the R-FRCNN in detail below.

Arbitrary-angle bounding box location
As shown in Figure 4, (a) is the commonly used object location based on the horizontal bounding box. The coordinates of the object bounding box are (x, y, w, h), where (x, y) are the coordinates of the center point of the rectangular bounding box, and w and h are its width and height. (b) is the object location based on the arbitrary-angle bounding box. The coordinates of the object bounding box are (x, y, l, s, θ), where (x, y) is the center point of the rectangular box, l and s are the long and short sides of the rectangular bounding box, and θ is the counterclockwise rotation angle between the long side and the horizontal line.
Obviously, once an object is tilted relative to the horizontal direction, the horizontal bounding box contains considerable redundant area in addition to the object itself and, at the same time, causes more overlap between the bounding boxes of dense objects. In contrast, the arbitrary-angle bounding box tightly encloses the object: there is little redundant area, the bounding boxes of dense objects do not interfere with each other, and the overlap is small. Therefore, the arbitrary-angle bounding box based object location is more suitable for densely distributed objects with arbitrary directions in remote sensing images.
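The five-parameter representation above can be made concrete with a short sketch (function name and sign conventions are ours): converting (x, y, l, s, θ) to corner points shows how much larger the enclosing horizontal box is than the rotated one for a tilted, elongated object.

```python
import math

def rbox_to_corners(cx, cy, l, s, theta_deg):
    """Corners of an arbitrary-angle box given its center (cx, cy),
    long side l, short side s, and counterclockwise angle theta_deg
    of the long side with respect to the horizontal."""
    t = math.radians(theta_deg)
    ux, uy = math.cos(t), math.sin(t)      # unit vector along the long side
    vx, vy = -math.sin(t), math.cos(t)     # unit vector along the short side
    hl, hs = l / 2.0, s / 2.0
    return [(cx + dx * hl * ux + dy * hs * vx,
             cy + dx * hl * uy + dy * hs * vy)
            for dx, dy in [(-1, -1), (1, -1), (1, 1), (-1, 1)]]

# A ship 100 long and 20 wide, rotated 30 degrees: the rotated box hugs
# the hull, while the enclosing horizontal box spans l*cos + s*sin ~ 96.6
# horizontally, far wider than the ship itself.
corners = rbox_to_corners(cx=50, cy=50, l=100, s=20, theta_deg=30)
xs, ys = zip(*corners)
print(max(xs) - min(xs))
```

The horizontal extent grows with the tilt, which is exactly the redundant area discussed above.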

Proportions of anchor points
In the R-FRCNN, the arbitrary-angle bounding box is used to locate objects. The intuitive idea is to generate multiple anchors with different rotation angles in the RPN. However, this requires setting several different angles, which not only greatly increases the number of generated anchors but also cannot cover all angles; when the preset angles are not suitable, it cannot adapt to objects at arbitrary angles.
Therefore, this method is not suitable for arbitrary-angle objects in remote sensing images. In fact, for a given object, its arbitrary-angle bounding box and horizontal bounding box are not independent of each other; the former is contained in the latter. As shown in Figure 5, the object can lie within the horizontal bounding box as follows: (a) the object lies along the horizontal direction of the horizontal bounding box; (b) the object lies along the vertical direction of the horizontal bounding box; (c) the object lies along the diagonal direction of the horizontal bounding box. In cases (a) and (b), the arbitrary-angle bounding box and the horizontal bounding box coincide, while in case (c), the arbitrary-angle bounding box lies along the diagonal of the horizontal bounding box. Therefore, exploiting this connection between the two boxes, the algorithm still generates horizontal bounding boxes at the RPN stage.
In the original Faster R-CNN, there are three aspect ratios (1:1, 1:2, 2:1) and three scales (64×64, 128×128, 256×256). Under this setting, nine anchors centered on the current pixel are generated for each pixel on the feature map. In R2CNN, different RoI pooling sizes are used on the generated horizontal bounding boxes to capture features with large aspect ratios. This approach is adopted because text objects in natural scenes often lack a fixed aspect ratio, making it difficult to design suitable anchor ratios for them. However, objects in remote sensing images (such as aircraft, ships, and oil depots) have relatively fixed aspect ratios. Therefore, we consider designing anchor ratios of different sizes to suit the object aspect ratios of remote sensing images.
For most objects in remote sensing images, such as airplanes, oil depots, baseball fields, tennis courts, basketball courts, ground track fields, and vehicles, the aspect ratio is usually 1:1, 2:1, or close to these proportions. Ships usually have a large aspect ratio and are densely distributed, so this approach focuses on the aspect ratio of ships. Ships mainly include aircraft carriers and warships. Table 1 lists the sizes of some aircraft carriers, and Table 2 lists the sizes of some warships. According to Tables 1 and 2, the aspect ratio of ships lies between 3:1 and 10:1. Obviously, we cannot set all possible ratios; otherwise, an excessive number of anchors would be generated and detection would become very time-consuming. Therefore, to adapt to the aspect ratio of ships, we add four anchor ratios (1:4, 4:1, 1:8, 8:1) to the original three.
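The resulting anchor set can be sketched as follows. This is a minimal illustration, not the paper's code; the equal-area parameterization (each anchor keeps the area of its base scale) is a common convention that we assume here, since the paper does not state how the ratios are realized.

```python
import numpy as np

def anchor_shapes(scales=(64, 128, 256),
                  ratios=(1 / 1, 1 / 2, 2 / 1, 1 / 4, 4 / 1, 1 / 8, 8 / 1)):
    """(w, h) anchor shapes for one feature-map location. The last four
    ratios (1:4, 4:1, 1:8, 8:1) are the ones added for elongated ships;
    each anchor keeps the area scale**2 of its base scale."""
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s * np.sqrt(r), s / np.sqrt(r)))
    return np.array(shapes)

shapes = anchor_shapes()
print(shapes.shape)  # 3 scales x 7 ratios -> 21 anchors per location
```

With 3 scales and 7 ratios, 21 anchors are generated per feature-map location instead of the original 9.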

Network training
The loss function of the RPN consists of the classification loss L_cls(p_i, p_i*) and the bounding box regression loss L_reg(t_i, t_i*), as defined in Equation (1):

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*) (1)

The classification loss is a binary cross-entropy loss, defined in Equation (2), and the position regression loss is a Smooth_L1 loss, defined in Equation (3):

L_cls(p_i, p_i*) = -[p_i* log p_i + (1 - p_i*) log(1 - p_i)] (2)

Smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise (3)
In Equation (1), i indexes the samples in a mini-batch, p_i is the classification prediction for the sample, and p_i* is its true category: if the sample is positive, p_i* = 1, otherwise p_i* = 0. For location prediction, four values are predicted, i.e. (x, y, w, h), where (x, y) is the center point of the bounding box and (w, h) are its width and height. t_i is the position offset of the sample's bounding box, and t_i* is the position offset of the corresponding ground-truth bounding box. Their definitions, relative to the anchor box (x_a, y_a, w_a, h_a), are given in Equation (4):

t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a) (4)

with t_i* defined analogously from the ground-truth box. Only positive samples incur a position regression loss; for negative samples, only the classification loss is computed. The R-FRCNN adds an arbitrary-angle bounding box regression layer to the ODN. According to the object location model of the arbitrary-angle bounding box defined above, this layer predicts five coordinate values. In summary, the loss function of the ODN is defined in Equation (5):

L_ODN = L_cls + β_1 L_reg(h) + β_2 L_reg(r) (5)

The third term in Equation (5) is the regression loss of the arbitrary-angle bounding box. It is defined in the same way as the regression loss of the horizontal bounding box, i.e. the second term, and we also use Smooth_L1 for it.
Here, β_1 and β_2 are the respective weights of the horizontal bounding box regression loss and the arbitrary-angle bounding box regression loss. R2CNN proved that both the horizontal and the arbitrary-angle bounding box predictions are indispensable to the detection results. However, it neglects the relative importance of the two to the final detection and gives the same weight to the horizontal and arbitrary-angle bounding box regressions. In our experiments, we found that adding the prediction of the arbitrary-angle bounding box affects the prediction of the original horizontal bounding box. As shown in Figure 6: (a) is the detection result of the original Faster R-CNN, which predicts only the horizontal bounding box, and all ships are correctly detected; (b) is the detection result of R2CNN with the arbitrary-angle bounding box; (c) is the detection result of R2CNN with the horizontal bounding box. When the prediction of the arbitrary-angle bounding box is added, the prediction and positioning of the horizontal bounding box become inaccurate, resulting in misses and false alarms. The reason is that during training, the network must attend to the regression losses of the horizontal and arbitrary-angle bounding boxes at the same time, which affects the prediction of the horizontal bounding box. To solve this problem, we propose to redistribute the weights of the horizontal and arbitrary-angle bounding box losses to obtain the best prediction results. The specific weight allocation is discussed in the experimental part. After defining the loss function, we train the network to output predicted values close to the true values by minimizing the loss function. The actual training process is shown in Algorithm 1.
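The weighted ODN loss can be sketched in NumPy as follows. This is an illustrative sketch, not the training code: the offset encodings, anchor matching, and batching are omitted, and the function names are ours.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth_L1 loss averaged over all offsets."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

def odn_loss(cls_probs, labels, h_pred, h_gt, r_pred, r_gt,
             beta1=2.0, beta2=1.0):
    """Cross-entropy classification loss plus weighted horizontal
    (4 offsets) and arbitrary-angle (5 offsets) regression losses.
    beta1 > beta2 favours the horizontal box, the setting the
    experiments below find best (beta1 = 2, beta2 = 1)."""
    n = len(labels)
    loss_cls = -np.log(cls_probs[np.arange(n), labels] + 1e-12).mean()
    pos = labels > 0  # regression losses only for positive samples
    loss_h = smooth_l1(h_pred[pos], h_gt[pos]) if pos.any() else 0.0
    loss_r = smooth_l1(r_pred[pos], r_gt[pos]) if pos.any() else 0.0
    return loss_cls + beta1 * loss_h + beta2 * loss_r
```

Raising beta1 makes errors in the horizontal box costlier relative to errors in the arbitrary-angle box, which is how the weight redistribution steers training.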

Algorithm 1. Details of the R-FRCNN Training Procedure
Stage 1: Use the large-scale classification dataset ImageNet (Deng et al., 2009) to pre-train the FEN. The subsequent stages follow an alternating training scheme in which the RPN and ODN are each trained twice. The first alternation fine-tunes the parameters of the original FEN; in the second alternation, the RPN and ODN completely share the FEN. Through these stages, the training of the R-FRCNN is completed.

Experiment and analysis
The experimental environment is an Intel Core i7-7700 processor, 32 GB of Kingston RAM, an NVIDIA GTX 1080Ti graphics card, and the Ubuntu 16.04 operating system. Next, we conduct an experimental study covering the datasets, evaluation indicators, and experimental results and analysis.

Experiment dataset
We conducted experiments on two datasets, NWPU VHR-10 (Cheng et al., 2016) and TODRS-3, of which NWPU VHR-10 is a publicly available remote sensing dataset for object detection. We relabeled NWPU VHR-10 with arbitrary-angle bounding boxes. To further verify the validity of the proposed approach, we constructed and labeled TODRS-3, a new remote sensing dataset for small object detection.
NWPU VHR-10 contains a total of 800 remote sensing images, of which 715 are derived from Google Earth, with spatial resolutions of 0.5 to 2 m. The remaining 85 images are from the Vaihingen dataset (Rottensteiner et al., 2012) and have a spatial resolution of 0.08 m. NWPU VHR-10 contains 10 object classes: aircraft, ships, oil depots, baseball fields, tennis courts, basketball courts, ground track fields, ports, bridges, and vehicles. The dataset is divided into two sub-datasets: a positive set and a negative set. The positive set contains 650 images, each containing at least one of the above object categories. The negative set contains the remaining 150 images, none of which contain any of the above object categories. Therefore, only the positive images are used in the actual detection; some positive examples of NWPU VHR-10 are shown in Figure 7. For each object in an image, the coordinates of the upper-left and lower-right corners of the bounding box and the corresponding object category are labeled.
TODRS-3 is a remote sensing dataset of 4,615 images of small objects, with spatial resolutions of 0.6 m to 3 m and three object categories: aircraft, ships, and oil depots. All images in both datasets are derived from Google Earth. Images have no fixed size but are at most 1000×1000 pixels. We manually annotated the images with bounding boxes. Some examples are illustrated in Figure 8.
We divide the NWPU VHR-10 and TODRS-3 datasets into training and test sets at a 4:1 ratio. For NWPU VHR-10, 520 images are used for training and 130 for testing. For TODRS-3, 3692 images are used for training and 923 for testing. The learning rate during fine-tuning was set to 0.001 and reduced to 10% of its previous value after 5000 iterations, and the momentum was set to 0.9. The weight decay is set to 0.0005, the batch size to 44, and the number of iterations depends on the convergence of the network model.
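The split sizes and learning-rate schedule can be sanity-checked with a few lines (we assume the tenfold decay repeats every 5000 iterations, which the text does not state explicitly):

```python
def step_lr(iteration, base_lr=1e-3, step=5000, gamma=0.1):
    """Step learning-rate schedule: base_lr scaled by gamma for each
    completed block of `step` iterations (assumed to repeat)."""
    return base_lr * gamma ** (iteration // step)

# The 4:1 train/test split reproduces the stated image counts.
for total, train, test in [(650, 520, 130), (4615, 3692, 923)]:
    assert total * 4 // 5 == train and total - train == test

print(step_lr(0), step_lr(5000))  # lr drops by 10x at iteration 5000
```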

Evaluation index
Recall and precision are the basic performance evaluation indexes of object detection, normally defined as:

Recall = TP / (TP + FN) (6)

Precision = TP / (TP + FP) (7)

Here, TP is the number of correctly detected objects of this class; FP is the number of misjudged objects of this class in the detection results; FN is the number of undetected objects of this class; TN is the number of correctly detected objects of other classes. Based on recall and precision, we use four commonly used object detection indicators, the Missing Alarm Rate (MAR), False Alarm Rate (FAR), F1 index, and test time, to evaluate the performance. MAR, FAR, and the F1 index are defined in Formulas (8), (9), and (10), respectively:

MAR = FN / (TP + FN) = 1 - Recall (8)

FAR = FP / (TP + FP) = 1 - Precision (9)

F1 = 2 × Precision × Recall / (Precision + Recall) (10)

The test time refers to the average test time for a single image.
The MAR reflects the proportion of objects that are missed among all objects. The FAR reflects the proportion of detections that are misjudged among all detected objects. There is no absolute relationship between the FAR and the MAR; the F1 index considers both. The lower the FAR and the MAR, and the higher the F1 index, the better the detection effect.
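These indicators reduce to a few lines of arithmetic on the per-class counts (the function name and signature are ours, for illustration):

```python
def detection_metrics(tp, fp, fn):
    """MAR, FAR, and F1 from per-class detection counts, following the
    verbal definitions above."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    mar = 1.0 - recall      # fraction of ground-truth objects missed
    far = 1.0 - precision   # fraction of detections that are wrong
    f1 = 2 * precision * recall / (precision + recall)
    return mar, far, f1

# 90 correct detections, 10 false alarms, 10 misses:
mar, far, f1 = detection_metrics(tp=90, fp=10, fn=10)
print(round(mar, 3), round(far, 3), round(f1, 3))  # 0.1 0.1 0.9
```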

Anchor scale
We give the detection results of the R-FRCNN with different anchor aspect ratios in Tables 3 and 4, respectively. On NWPU VHR-10, when the four ratios (1:4, 4:1, 1:8, 8:1) are added, the MAR is slightly reduced, the FAR is slightly increased, and the F1 index is slightly reduced. This is because the added anchor aspect ratios mainly target ships with large aspect ratios, but NWPU VHR-10 contains few such ship objects, so the improvement for ships is not obvious, while for other objects the added anchors may cause interference, reducing the F1 index. On TODRS-3, there are more ship objects with larger aspect ratios. After adding the four ratios (1:4, 4:1, 1:8, 8:1), the detection of ships improves: the FAR and the MAR are both reduced, and the final F1 index increases. In addition, increasing the number of anchor ratios, and thereby the number of anchors, increases the time consumption accordingly.
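The extra time cost follows directly from the anchor count; a back-of-the-envelope check (ours, not the paper's figures):

```python
# Anchors per feature-map location before and after adding the four ratios.
scales = 3
before = scales * 3        # ratios 1:1, 1:2, 2:1        -> 9 anchors
after = scales * (3 + 4)   # plus 1:4, 4:1, 1:8, 8:1     -> 21 anchors
print(after / before)      # anchor count grows by a factor of ~2.33
```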

Weight distribution
In the R-FRCNN, we perform both horizontal bounding box regression and arbitrary-angle bounding box regression. To explore the influence of the two on the final detection result, we set up a group of comparative experiments, whose results are shown in Tables 5 and 6. The tables show that when β_1 = 2 and β_2 = 1, i.e., when the regression of the horizontal bounding box is given more weight, the lowest MAR and FAR and the highest F1 index are obtained. Note that when β_1 = 1 and β_2 = 2, the MAR and FAR are slightly higher than when β_1 = 2 and β_2 = 1. This shows that paying too much attention to the arbitrary-angle bounding box leads to inaccurate positioning of the horizontal bounding box; since the prediction of the arbitrary-angle bounding box depends on the prediction of the horizontal bounding box, the final F1 index is reduced.

Algorithm comparison
We now compare the performance of our method with four state-of-the-art approaches: Faster R-CNN (Ren et al., 2017), RIFD-CNN (Chen et al., 2019), YOLOv4 (Bochkovskiy et al., 2020), and R2CNN. We present the detection results of the different object detection algorithms on the two datasets in Tables 7 and 8. Overall, compared to the other algorithms, our R-FRCNN has the lowest MAR and FAR on both datasets, and its F1 index is the highest. We can also see that the detection results of R2CNN and R-FRCNN are better than those of the original Faster R-CNN, which shows that introducing arbitrary-angle bounding boxes can effectively reduce the MAR and FAR and improve object recognition performance on remote sensing images. In terms of test time, YOLOv4 (Bochkovskiy et al., 2020) performs best on both datasets.
Compared with R2CNN, thanks to the added anchor ratios for ships with relatively large aspect ratios and the increased weight of the horizontal bounding box regression, the proposed R-FRCNN achieves better detection results. Therefore, our proposal has better applicability for object detection in remote sensing images. In addition, since the number of anchors increases and the arbitrary-angle bounding box prediction is added, the test time of the R-FRCNN is slightly longer than that of Faster R-CNN and R2CNN, but the increase is not significant and remains within an acceptable range. Figures 9 and 10 show examples of the detection results of the R-FRCNN on the two datasets. Whether the objects are densely or non-densely distributed, the algorithm obtains satisfactory detection results.

Conclusion
In this paper, we focus on object location and detection in remote sensing images. The horizontal bounding box location used in most current deep learning-based object detection algorithms locates densely distributed and arbitrarily directed objects in remote sensing images inaccurately, leading to missed detections. To solve this problem, we propose an object location method based on arbitrary-angle bounding boxes and embed it into the Faster R-CNN, forming the new detection framework R-FRCNN. In the R-FRCNN, we add anchor ratios for ships with relatively large aspect ratios and increase the weight of the horizontal bounding box regression to reduce the interference of the arbitrary-angle bounding box on the horizontal bounding box prediction. Finally, we conduct experiments on two remote sensing object detection datasets, NWPU VHR-10 and TODRS-3. We study the impact of different anchor ratios on detection performance and analyze the degree to which horizontal and arbitrary-angle bounding box regression affect the detection results. Compared with other deep learning-based methods, the R-FRCNN effectively reduces the MAR and FAR of object detection in remote sensing images.
Our R-FRCNN cannot completely eliminate the influence of the arbitrary-angle bounding box on the prediction of the horizontal bounding box. Therefore, we will continue to improve the R-FRCNN in the future so as to locate objects in any direction more accurately.