Optimizing Sector Ring Histogram of Oriented Gradients for injured human detection from drone images

Abstract Developing a system for emergency response and rescue that can find injured people within images has attracted many researchers. The key challenge is to define a proper feature to describe the human body's appearance. Various features are often extracted from low-level data, such as texture and colour (Zhang et al., Sensors. 19(5):1005, 2020). One of the strong features is the Sector Ring Histogram of Oriented Gradients (SRHOG), which has been successfully applied to human detection tasks. Despite good accuracy in finding humans, an SRHOG detection method produces a large number of false labels. Locating an injured body in drone images after a disaster remains a challenge. This study presents a new extension to SRHOG, called AdSRHOG, to reduce the number of false labels. In our approach, the gradient filters used by SRHOG are defined adaptively depending on the corresponding pixel location. The proposed feature was used with the Support Vector Machine (SVM) algorithm to detect humans in drone images. The experiments showed a significant improvement of up to 54.3% in reducing the false labels. It was also found that the overall accuracy of the human detection process had a notable improvement of 13.1% over the traditional SRHOG detection technique.


Introduction
Today, natural disasters, including earthquakes, floods, landslides, and avalanches, take place in all parts of the world. In this regard, the most important task is the rescue of injured people. According to the Red Crescent (IRAN Red Crescent 2004), the first step of any search and rescue operation is to locate the injured people. Unfortunately, this can be a very difficult and time-consuming process, where only a few extra minutes may be all that is needed to save a victim's life. To this end, the use of drones has attracted the attention of many researchers (Blondel et al. 2013; Sun et al. 2016), as drones can quickly survey large and hard-to-reach areas. Edge features are useful for finding the shape of an object because objects can be represented by their edges. In contrast to colour and texture, edge features describe objects mainly based on their geometry (Nguyen et al. 2011). Ioffe and Forsyth (2001) used rectangular contours to model human body parts. The technique works best when other objects do not occlude the human body. To describe more complex shapes, template matching techniques have also been used (Gavrila and Philomin 1999). These techniques are often trained using many sample images and, thus, depend mainly on the completeness of the training data.
Another alternative is the definition of body parts with curves and basic shapes, called Edgelets (Wu 2008). The Edgelet is a small part of a line or curve (Wu and Nevatia 2005). It includes information on the shape and brightness of an object and describes gradients. This approach may encounter problems if the object is so far away from the camera that its details cannot be properly recognized. A more complex gradient-based feature is called Shapelet (Sabzmeydani and Mori 2007). It uses a series of gradients with different directions and magnitudes to represent different parts of the human body (e.g. head, hand, and shoulders). Shapelets are used by unsupervised classifiers and are often prone to background complexities.
The Scale-Invariant Feature Transform (SIFT) is one of the most widely used features and applies the Histogram of Oriented Gradient (HOG) (Lowe 2004; Dalal and Triggs 2005). HOG was introduced in 2005 for pedestrian detection but has since been used in numerous applications (Blondel et al. 2013; Mihçioğlu and Alkar 2019; Bahri et al. 2020). Nowadays, it has variations of its own. For instance, the Compressed Histogram of Oriented Gradient (CHOG) (Munder and Gavrila 2006) and SRHOG (Liu et al. 2017) are two examples of improving HOG's resistance to brightness and rotation changes, respectively. Unlike SIFT, which uses the gradients at selected keypoints, HOG calculates the histograms of pixel gradients in a block. To calculate the HOG, first, the horizontal and vertical gradients have to be calculated; then, the magnitude and direction of the gradient vectors are determined to form a gradient image from which most redundant background detail has been omitted. This is achieved by generating a gradient vector orientation histogram (Dalal and Triggs 2005) (Figure 1). In the studies above, the images used are mostly taken in front of the object or from a certain view angle (Gavrila 1999; Aggarwal and Cai 1999; Liu et al. 2017). However, the drone images used to identify the location of a person after a disaster are taken by a platform that is prone to irregular 3D movements. As a result, an individual can appear in various poses, which in turn complicates the detection process.
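As an illustration of the HOG computation just described — gradients, unsigned orientations, and magnitude-weighted votes into 9 bins of 20° — the per-cell histogram can be sketched as follows (a minimal NumPy sketch, not the exact implementation used in the cited works):

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Compute an orientation histogram for one image cell.

    Gradients are taken with simple [-1, 0, 1] central differences;
    orientations are folded into the unsigned range [0, 180) degrees,
    and each pixel votes with its gradient magnitude.
    """
    cell = cell.astype(float)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]   # horizontal gradient
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]   # vertical gradient
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    hist = np.zeros(n_bins)
    for m, a in zip(mag.ravel(), ang.ravel()):
        hist[int(a // (180.0 / n_bins)) % n_bins] += m  # magnitude-weighted vote
    return hist
```

A cell containing a vertical edge, for example, votes almost entirely into the 0° bin.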
To overcome the above problems, Liu et al. (2017) proposed SRHOG (an extension to HOG) that uses a dynamically defined local coordinate system r, t (Figure 2) instead of the Cartesian coordinate system. SRHOG uses a deformed Sobel filter, the Approximated Radial Gradient Transform (ARGT), to calculate the gradients (Takacs et al. 2010; Takacs et al. 2013).
The gradient components (Gr, Gt) are defined in r and t directions shown in Figure 2. Therefore, the resulting vector does not change when this local coordinate system is rotated. This makes the feature vector resistant to the image rotations.
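The rotation invariance comes from expressing each Cartesian gradient in the pixel's local (r, t) frame. A minimal sketch of that projection, assuming the local frame is obtained by rotating the axes by the pixel's polar angle θ (the names and signature are illustrative, not the paper's implementation):

```python
import numpy as np

def to_radial_tangential(gx, gy, x, y, cx, cy):
    """Project a Cartesian gradient (gx, gy) at pixel (x, y) onto the
    local radial/tangential frame centred at (cx, cy)."""
    theta = np.arctan2(y - cy, x - cx)             # polar angle of the pixel
    gr = gx * np.cos(theta) + gy * np.sin(theta)   # radial component Gr
    gt = -gx * np.sin(theta) + gy * np.cos(theta)  # tangential component Gt
    return gr, gt
```

A gradient pointing away from the centre yields a purely radial component regardless of where on the circle the pixel sits, which is the property that makes the feature rotation resistant.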
The experiments carried out by us (Ghasemi et al. 2020) have shown that drone images containing humans can be successfully and correctly labelled using a classifier that takes SRHOG as input. However, the results also include many false image labels. This study presents a new SRHOG-based feature, called Adaptive SRHOG (AdSRHOG), to overcome this problem. The overall structure of this feature is the same as that of SRHOG; however, the shape of the gradient filter changes with respect to the location of the pixel. As will be shown in the Evaluations section, the proposed feature outperforms SRHOG by reducing the number of false labels. It can be used successfully and correctly for labelling drone images containing humans.
The rest of the paper is structured as follows. The Methodology section explains how to calculate the new feature. The evaluation of the feature and the experiments are described in the Implementation and Evaluation methodology sections. The results are discussed in the Results and discussion section, and finally, recommendations and suggestions are delivered in the Conclusions.

Methodology
Detecting humans in UAV images involves two main steps: forming image features and classifying the images using the formed features. In this study, a new feature, so-called AdSRHOG, is developed to reduce the false labels produced by the classification technique. In the following section, the methodology of computing the AdSRHOG feature is described. Concerning the classification process, different classifiers could be used. However, since HOG was introduced, most researchers have used the Support Vector Machine (SVM) classifier. The same is true with SRHOG. Therefore, in this study, it was also decided to use SVM as the classifier.
SVM is a supervised Machine Learning (ML) algorithm used in many classification and regression problems. It works by finding an optimal separating boundary, called a 'hyperplane', to accurately separate two or more classes in the classification problem. The goal is to find the optimal separating hyperplane through a training process that tries to separate the data linearly. In case the data cannot be separated linearly, SVM creates a hyperplane in a higher dimensional space. A good separation of classes is achieved by having a hyperplane with the largest distance (margin) to the nearest training data points (Hsu and Lin 2002).
The classification is done in two main steps: training and testing. In the beginning, a pyramid is created for each image. The images are then scanned using a search window. For each window, the desired feature is calculated and used for training the classifier. The result of the training is a model, which is then used to label the test images. The result is the set of test images, each labelled with one class.
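These training and testing steps can be sketched as a small pipeline. The sketch below uses scikit-learn's LinearSVC; `extract_feature` and the window stride are hypothetical placeholders standing in for the AdSRHOG computation and the paper's actual scan settings:

```python
import numpy as np
from sklearn.svm import LinearSVC

def sliding_windows(image, size=128, stride=32):
    """Yield square search windows scanned over the image."""
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield image[y:y + size, x:x + size]

def train_and_label(train_windows, labels, test_windows, extract_feature):
    """Train a linear SVM on window features, then label the test windows."""
    clf = LinearSVC()
    clf.fit([extract_feature(w) for w in train_windows], labels)
    return clf.predict([extract_feature(w) for w in test_windows])
```

In the actual system the feature extractor would be AdSRHOG and the windows would come from every level of the image pyramid.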
In the following, first, it is shown how the AdSRHOG features are computed. Then, in the Implementation section, the way the SVM classifier uses them to detect humans on the UAV images is described.

Computing AdSRHOG feature
As already mentioned, this study aims to improve the SRHOG feature to reduce the number of false labels produced by the classifier. In this section, it is shown how the AdSRHOG features of an image are computed. Figure 3 shows the steps of how to calculate the proposed feature.
The stages in this diagram are mostly the same as those of SRHOG. However, the filters used to convolve the search window are uniquely defined depending on the pixel's angular location within the search window. The overall process is divided into three main steps. Similar to SRHOG (Liu et al. 2017), first, the image is scanned by 128 × 128 search windows. At each location, the search window is convolved with two gradient filters whose shape changes depending on the polar angle θ (Figure 4).
θ is the anti-clockwise polar angle of the pixel. Once the gradients are calculated, the search window is divided into sector ring blocks to compute the gradients' histogram. In each block, the collection of bin heights forms its feature vector. Putting these vectors together creates the final feature vector of the search window. Also, Mi and Mj are the gradient magnitude vectors at pixel positions i and j in the image's XY coordinate system.
The following algorithm summarizes how to calculate the AdSRHOG feature of a given search window:

1. Calculate the radial and tangential gradients of the search window pixels using the adaptive filters.
2. Form the histogram of the computed gradients for each sector ring block.
3. Join the vectors together to form the feature vector of the search window.
In summary, the main steps of computing the AdSRHOG feature of any search window are calculating the gradients, forming the histogram of the computed gradients, and forming the search window's feature vector. These steps are described in detail as follows.

Calculating the gradient of search window pixels
To compute the AdSRHOG feature, the first step is to calculate the gradient of the search window pixels. For this, first, the search window is divided into eight 45° angular sections. Then, for each pixel, two filters, aligned with and perpendicular to the pixel's radial bearing, are used to calculate its gradient values. The main contribution of this study is, in fact, the new way of defining these filters. Two filters are proposed, one of which is radial, and the other is tangential. The radial filters are shaped according to θ, the pixel's angular bearing with respect to the centre of the search window (see Figure 4).
If θ is equal to 0° or 180°, the filters are the same as the Sobel filters in the x and y directions. However, if 0° < θ ≤ 45°, 45° < θ ≤ 90°, and so on, the radial filter is calculated using the values shown in Figure 3. To calculate the tangential filter, the radial filter is rotated by 90°. An important note is that, if θ is not an integer multiple of 45°, the filter coefficients are computed by averaging the neighbouring filters.
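One way to realise such θ-dependent filters is to rotate the 3 × 3 Sobel kernel in 45° steps by cycling its border ring, and to average the two neighbouring filters for intermediate angles. The sketch below follows that idea; the exact coefficient values of the paper's filters (Figure 4) may differ:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# border positions of a 3x3 kernel, listed as a ring
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def rotate_45(kernel, steps=1):
    """Rotate a 3x3 kernel by steps * 45 degrees by cycling its border ring."""
    out = kernel.copy()
    vals = [kernel[p] for p in RING]
    vals = vals[-steps:] + vals[:-steps] if steps else vals
    for p, v in zip(RING, vals):
        out[p] = v
    return out

def radial_filter(theta_deg):
    """Radial gradient filter for polar angle theta (degrees).

    At multiples of 45 degrees the filter is a rotated Sobel kernel; in
    between, the two neighbouring filters are averaged, as in the
    adaptive scheme described above.
    """
    lo = int(theta_deg // 45) % 8
    frac = (theta_deg % 45) / 45.0
    f_lo, f_hi = rotate_45(SOBEL_X, lo), rotate_45(SOBEL_X, (lo + 1) % 8)
    return (1 - frac) * f_lo + frac * f_hi
```

Rotating the ring by two steps (90°) turns the Sobel x filter into the Sobel y filter, and the tangential filter is obtained the same way, by rotating the radial one a further 90°.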
Once the search window is convolved with the filters defined as explained, the magnitude and direction of the gradient vector can be computed by the following equations:

M = √(Gr² + Gt²)
α = arctan(Gt / Gr)

where Gr and Gt are the radial and tangential gradients, respectively.
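The magnitude and direction computation can be written directly in code (a trivial NumPy sketch):

```python
import numpy as np

def gradient_polar(gr, gt):
    """Magnitude and direction of the gradient vector from its radial (Gr)
    and tangential (Gt) components."""
    magnitude = np.hypot(gr, gt)                 # sqrt(Gr^2 + Gt^2)
    direction = np.degrees(np.arctan2(gt, gr))   # arctan(Gt / Gr), in degrees
    return magnitude, direction
```

For example, a gradient with Gr = 3 and Gt = 4 has magnitude 5 and direction ≈ 53.13°.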

Computing the histogram of gradients
Once the gradients of all search window pixels are computed, the next step is to compute the histogram of the gradients. As described, this part is similar to that of SRHOG and is briefly explained in this section. In this step, the search window is divided into 15 sectors, each of which is also divided into 16 cells along the sector radius. These numbers, i.e. 15 and 16, are the optimum values for calculating the histogram of gradients (Liu et al. 2017). Each set of four adjacent cells forms a block (see the green part in Figure 5). To compute the feature vector of the search window, we need to calculate the histogram of gradients of these blocks. As can be seen, each block has some overlap with the preceding one (Figure 5a). This is done to make the feature more resistant against illumination changes. As shown in Figure 5(c), the horizontal axis of the histogram corresponds to the gradient directions ranging from 0° to 160° (i.e. 9 bins of 20° each). To compute the histogram of gradients for each block, we need to define some bins and assign each block pixel to one of them. The bins are considered at angles 0°, 20°, 40°, 60°, and so on up to 160°. To fill in the bins, any block pixel whose gradient direction value is in the first half of a bin is put in that bin; otherwise, it is assigned to the following bin (Takacs et al. 2010). The vertical axis contributes to the height of the histogram bar and is proportional to the weighted sum of the pixel gradients. The weight of each pixel gradient is calculated based on its distance to the centre of the search window and its angle to the x-axis.
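Stated differently, the bin-assignment rule amounts to rounding each gradient direction to the nearest 20° bin centre, wrapping at 180°. A sketch (the distance- and angle-based weighting is omitted here):

```python
import numpy as np

def assign_bin(direction_deg, bin_width=20.0, n_bins=9):
    """Assign a gradient direction to one of 9 bins at 0, 20, ..., 160 deg.

    A direction in the first half of a bin's interval stays with that bin;
    otherwise it moves to the following bin (wrapping at 180 degrees).
    """
    d = direction_deg % 180.0
    return int(np.round(d / bin_width)) % n_bins
```

So a direction of 9° stays in the 0° bin, while 11° moves to the 20° bin.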

Forming the feature vector
Once the histograms of all the search window blocks are computed, the final step is to define the window's feature vector. For this, the vector is generated by putting the block feature vectors together. It should be noted that the blocks have some overlaps. Therefore, the number of feature vectors that are appended together is much higher than the number of window cells.
In the following sections, first, it is explained how the AdSRHOG feature is used by SVM to label UAV images (the Implementation section). Then, various tests carried out to evaluate the efficiency of the proposed feature for human detection in drone images are described.

Implementation
As stated in the introduction, this study aims to develop a method for human detection utilizing drone images after natural disasters. Figure 6 illustrates the proposed process. As already mentioned, depending on their distance from the drone, humans appear in different shapes and dimensions. The detection process must, therefore, be carried out at different scales. For this purpose, an image pyramid is formed in four levels, each level of which is obtained by scaling the preceding level by a factor of 1.05.
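A pyramid of this kind can be sketched as follows. This is a dependency-free sketch using nearest-neighbour resampling, and it assumes that "scaling by a factor of 1.05" means each level is shrunk by that factor; the paper's resampling method is not specified:

```python
import numpy as np

def image_pyramid(image, levels=4, scale=1.05):
    """Build a multi-level pyramid, shrinking each level by `scale`."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        img = pyramid[-1]
        h, w = img.shape[:2]
        nh, nw = int(h / scale), int(w / scale)
        # nearest-neighbour resampling keeps the sketch dependency-free
        ys = (np.arange(nh) * h / nh).astype(int)
        xs = (np.arange(nw) * w / nw).astype(int)
        pyramid.append(img[np.ix_(ys, xs)])
    return pyramid
```

Each pyramid level is then scanned with the same 128 × 128 search window, so that humans of different apparent sizes fit the window at some level.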
In this research, the Matrix Laboratory (MATLAB) platform and Python are used to program the proposed algorithm. The supervised classification technique SVM was applied to detect humans. There are three main advantages of SVM: (i) the use of an optimised hyperplane to distinguish the two classes, (ii) the establishment of the maximum boundary margin between the classes, and (iii) minimal noise effects when the quality of the data is low (Cortes and Vapnik 1995). As a result, it has been traditionally used in most HOG-based classifications to detect humans within images (Dalal and Triggs 2005; Pang et al. 2011; Cao et al. 2011). SVM consists of two stages: training and testing. During the training stage, a classification model is built to define whether an image (i.e. the test image) has to be labelled positive or negative. Positive samples contain a human being (either in whole or in part), whereas a negative sample indicates no human presence in the image. The image here refers to the 128 × 128-pixel samples, which could be either positive or negative.
As shown in Figure 6, the whole process begins by extracting the AdSRHOG features of the training samples, which are then fed into the SVM classifier. The result is a training model that includes the classification parameters. At this point, we can evaluate any sample to see if it is positive or negative. For this, similar to the training stage, the test image is scanned using a sliding window. The dimensions of this window are equal to those of the training samples. To define the label of the sliding window, its AdSRHOG features are extracted and passed to the trained SVM. If the test sample contains a body, either in whole or in part, a positive label is given; otherwise, a negative label is given.
In the experiments carried out, unfortunately, the authors could not get hold of images taken after a real disaster. However, different image sets that closely resemble real search and rescue conditions have been used. The first dataset is the INRIA dataset, which has been commonly used along with the HOG feature as a comprehensive benchmark for testing human and pedestrian detection algorithms in several studies (Ojala et al. 2002; Dalal and Triggs 2005; Benenson et al. 2014). The dataset includes ground images of humans in standing, walking, and upright views only. In addition, since a person is viewed mostly from a top-down/oblique angle from a drone, many additional images taken by several drones at different angles were acquired and used. These include images taken by a DJI Tello from an oblique angle at low altitude, and those taken by a Phantom 4 Pro, an Inspire 2, and a Parrot ArDrone 2.0, all taken nadir at different heights. Among these, in one of the sets, students were spread over a relatively large area. They took different poses, including standing, lying down, and sitting, to closely resemble UAV images used in a real search and rescue operation. A Phantom 4 Pro took these images at a height of 60 m. Details of these images are given in the following lines.
INRIA is a dataset (Figure 7) from which 2164 positive and 432 negative samples are used for training, and 1126 positive and 453 negative samples for testing. The dimensions of the samples are all 64 × 128 pixels. However, as mentioned above, the image samples must be 128 × 128 pixels for our AdSRHOG feature extraction. Therefore, the border parts of the samples were duplicated to produce 128 × 128 pixel samples (Figure 7a).
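This border duplication can be sketched with NumPy's edge-replicating padding (a sketch for single-channel samples; the paper does not specify the exact padding routine it used):

```python
import numpy as np

def pad_to_square(sample, size=128):
    """Duplicate the border rows/columns of a 2-D sample until it is
    size x size, as done to turn 64 x 128 INRIA crops into 128 x 128
    windows."""
    h, w = sample.shape[:2]
    pad_h, pad_w = size - h, size - w
    return np.pad(sample,
                  ((pad_h // 2, pad_h - pad_h // 2),
                   (pad_w // 2, pad_w - pad_w // 2)),
                  mode='edge')   # 'edge' replicates the border pixels
```

A 64 × 128 crop thus gains 32 replicated columns on each side while its original content is preserved in the centre.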
Also, 592 positive and 200 negative samples were acquired from the authors' drone images. Figures 8 and 9 show some examples used in these evaluations.
The Tello images were taken at lower altitudes to enrich the training dataset with images having different view angles. Table 1 shows the number of samples taken from various sources to test the proposed feature. In the following section, the tests carried out to evaluate the efficiency of AdSRHOG for classifying UAV images taken in SAR operations are described (Table 2).

Evaluation methodology
The Recall index shows the ratio of the correctly identified positive samples over the total number of positive windows and is computed by:

Recall = TP / (TP + FN)

where TP and FN are the numbers of true positives and false negatives, respectively. In effect, Recall shows how strong the proposed feature is in identifying the positive samples.
The second index is the Recall value for negative samples, named Recall_neg. It refers to the ratio of correctly identified negative samples over the total number of negative samples and is calculated by:

Recall_neg = TN / (TN + FP)

where TN and FP are the numbers of true negatives and false positives, respectively. A bigger Recall_neg suggests that the procedure is stronger and makes fewer mistakes.
Precision refers to the ratio of correctly identified positive windows over all of the windows labelled positive, and it is computed by:

Precision = TP / (TP + FP)

Precision shows the overall accuracy of the method.
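The three indices can be written directly in terms of per-window confusion counts. Plugging in the SRHOG counts reported later (846 of 1126 positives and 151 of 453 negatives correctly classified) reproduces the tabulated Recall, Recall_neg, and Precision values:

```python
def recall(tp, fn):
    """Correctly identified positives over all positive windows."""
    return tp / (tp + fn)

def recall_neg(tn, fp):
    """Correctly identified negatives over all negative windows."""
    return tn / (tn + fp)

def precision(tp, fp):
    """Correctly identified positives over all windows labelled positive."""
    return tp / (tp + fp)
```

For SRHOG on the INRIA test set: Recall = 846/1126 ≈ 75.1%, Recall_neg = 151/453 ≈ 33.3%, and Precision = 846/(846 + 302) ≈ 73.7%.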

Results and discussion
In order to evaluate the performance of the improved feature, five experiments are carried out.
1. Comparing the overall accuracy of the proposed feature to that of SRHOG: This was done twice; first, the INRIA dataset is used to examine the strength of AdSRHOG in detecting humans, and then the test is repeated on nadir drone images.
2. Evaluating the effect of the equivalent angle on the accuracy and time of computation: As indicated above, the gradient filter arrays are computed using the corresponding pixel bearing angle. This means that a unique filter must be established for each pixel, which involves a lot of calculations and is not cost-effective. As an alternative, the search window can be divided into equal angular sections, all pixels of which use the same θ, the so-called equivalent θ, for filter computations. In other words, all pixels within a 45° or 22.5° section use the same θ to calculate their filter elements (see Figure 10). This experiment aims to find the optimum number of sections.
3. Evaluating the efficiency of AdSRHOG against SRHOG in detecting humans appearing in different/more complex situations: As already mentioned, the primary purpose of this study is to detect injured humans in drone images. In general, humans can appear in different poses, such as standing, sitting, and lying down. Other objects can also occlude them. Besides, there is no guarantee that the images will be taken under proper lighting conditions. Thus, in a third test, the proposed feature's performance is studied for human detection under such circumstances. Depending on the system configuration, these steps can be time-consuming. This issue is more crucial in SAR operations, as saving the life of an accident victim could depend on a few extra minutes (Silvagni et al. 2017; Jurecka and Niedzielski 2017). Thus, the calculation times are also studied in this section.
4. Test on real data: This test was performed in conditions close to the actual conditions of natural disasters. It aimed to determine the potential of the proposed approach under real conditions.
5. Studying the impact of height on human identification: In this experiment, the performance of AdSRHOG is compared against that of SRHOG using two different datasets taken at different heights. The aim is to get a better insight into the impact of the drone height on the human identification results.
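The equivalent-θ idea from the second experiment can be sketched as a simple quantisation of the pixel's polar angle. Choosing the section mid-angle as the shared representative value is an assumption of this sketch; the text only states that all pixels of a section use one equivalent θ:

```python
def equivalent_theta(theta_deg, n_sections=16):
    """Snap a pixel's polar angle to the representative angle of its
    angular section (e.g. 16 sections -> 22.5-degree-wide sections).

    All pixels falling in the same section share one equivalent theta,
    so only n_sections filter pairs need to be precomputed per window.
    """
    width = 360.0 / n_sections
    section = int(theta_deg % 360.0 // width)
    return section * width + width / 2.0   # section mid-angle (assumed)
```

With 16 sections, only 16 filter pairs are needed instead of one pair per pixel, which is what makes the equivalent-angle variant cheaper to compute.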
The results of the experiments are presented in the following sections.

Comparing the overall accuracy of AdSRHOG against SRHOG
The purpose of this test was to determine whether the change made to the gradient calculation led to an improvement in the performance of the SRHOG method or not. Each algorithm was individually trained with the INRIA training samples. The algorithms were evaluated on the 1126 positive and 453 negative test windows of the INRIA dataset, and the evaluation indices were calculated for them. Table 3 illustrates the results of this study. As shown in Table 3, the precision of SRHOG is 73.69%, while the precision of the proposed method is 94.21%, a difference of 20.52%. Although the precision index increased significantly, the Recall value, which indicates the number of correctly identified humans, has not improved much (1.1%). In other words, out of 1126 positive windows, 846 were correctly classified using SRHOG, while only 16 more windows (862 in total) were correctly identified using AdSRHOG. However, out of 453 negative samples, SRHOG classified 151 cases correctly, whereas AdSRHOG correctly labelled 400. This means the Recall value of the negative data has significantly increased (from 33.33% to 88.30%). Therefore, the improvement in precision is the result of a reduction in misclassifications (shown by Recall_neg in Table 3).
To evaluate the ability of the proposed feature to detect humans from the top view, 60 positive and 60 negative samples taken by a Parrot ArDrone were used. The drone image samples (Figure 8) were then tested by applying SRHOG and AdSRHOG. Table 4 depicts the results of the test.
From the 60 positive samples, 52 cases were correctly classified by SRHOG, while AdSRHOG correctly classified only 48. However, among the 60 negative samples, only 20 cases were correctly classified by SRHOG, while 51 were correctly classified by the improved method. Therefore, achieving a precision of 84.21%, AdSRHOG performed better than SRHOG (56.52%).

Comparison of different estimates of AdSRHOG
As mentioned above, dividing the search window into equal angular sections (Figure 10) can reduce the computation time. However, the question is which number of sections, in addition to reducing the time, has a minimal impact on the accuracy of the results. To answer this, we need to see how the performance of the proposed feature is affected, from the speed and accuracy points of view, when the number of sections is changed. To define the optimal figure, the search window was divided into different numbers of angular sections. In each case, the proposed feature was computed using the equivalent θ and used to classify the images. Besides, the test was also carried out using the actual θ values, i.e. without dividing the search window into angular sections. All tests were performed on the INRIA dataset, including 1126 positive and 453 negative samples. The results are shown in Table 5. The feature vector computation speed of each scenario (last column) was calculated using 100 samples. The computer used in these experiments had an Intel Core i7-7700HQ CPU and 16 GB RAM. As can be seen, there is no significant difference in the calculation times or in the Precision and Recall_neg values. However, the Recall value for 45° sections is 76.55%, whereas that for 22.5° or finer sections is 90.76% or more. This means that, although dividing the search window into 16 sections or more does not significantly impact the computation time, it leads to an improvement of 21.14% in the accuracy of detecting positive samples.
Overall, it was revealed that the algorithm's performance is particularly improved by dividing the central angle into 16 sections. Therefore, this number of sections is used in the following experiment (see next section).

Evaluation of AdSRHOG in various human states
Injured people may appear in various positions, such as standing, sitting, or lying down, with part of the body occluded or in shadow. There is also no guarantee that a human being will be imaged under proper lighting conditions. In this section, the performance of SRHOG is compared to that of AdSRHOG using images in which humans appear in different poses. In this experiment, the gradient sections are computed at 22.5° angles, as justified above. For each situation, 500 positive samples were used (Figure 11).
Only the Tello images are used in this experiment. This dataset captures different situations: standing, sitting, lying down, occluded, and imperfect lighting conditions. The experiment was carried out using 500 images. Figure 11 depicts examples of the Tello images.
The images have been selected in such a way that humans appear in different angles of view, including sitting, standing and lying down poses. In some situations, the upper or lower half of the person's body is occluded by other objects, such as trees or shadows. Also, in some images, the lighting conditions are not very good, as shadows partially cover the human body. The images were once again used to test the proposed feature for human detection. Table 6 illustrates the results of the third experiment.
As can be seen, both features perform relatively well in the standing and lying down situations. In all situations, however, the AdSRHOG-based classification has a higher Recall value than SRHOG; in the lying down position, the margin is only 3%. As shown, a human being has a better chance of being detected by both features when in a standing or lying down position. The Recall values for SRHOG in the standing and sitting states are 73.42% and 51.63%, respectively; thus, in sitting cases, the accuracy is 21.79% less than that of standing poses.
In weak lighting conditions, the SRHOG Recall value is 55.48%, a relatively low value compared with the standing cases. The Recall values for AdSRHOG in the standing and sitting situations are 90.60% and 77.20%, respectively. In weak lighting conditions, this value is 69.40%, which is also relatively low. Perhaps improving the contrast could help the results.
The low percentage of the Recall value for occluded human detection is worth noting. Only 26.2% of images containing humans were detected using the SRHOG algorithm, while this figure increased to 36.21% with the proposed feature. Despite the 10% increase, this is not an acceptable figure for an image classification technique. Perhaps the low accuracy of the results is due to the lack of an adequate training dataset. In our experiments, the algorithms were trained mainly using images containing the entire human body. As a result, the classification gave poor results on test data that only partially contained humans.

Evaluation of AdSRHOG with images similar to those taken in real SAR operations
As mentioned before, the authors were not able to get access to images taken after a real disaster. To this end, nadir images taken by a Phantom 4 Pro at a height of 60 m of several people spread over a relatively large area are used. The test area is located around the city of Qazvin in Iran, at longitude 49.85° and latitude 36.9°. Figure 12 illustrates examples of these images. As in previous tests, 128 × 128 windows were extracted from 546 images at different pyramid levels of each image frame. Half of the windows did not contain a human, while the other half did. In this test, the people presented different postures, including standing, sitting, and lying down.
At this stage, 1400 positive and 1400 negative windows were used for training, and 600 positive and 600 negative windows for testing. The results are given in Table 7. As Table 7 illustrates, the precision of the AdSRHOG method is 88.63%, while that of SRHOG is 56.12%. This gap of 32.51% is consistent with the tests in the first experiment. Similar to that test, the difference between the precision of the two approaches results from SRHOG's misclassification of negative samples; AdSRHOG correctly classified 531 out of 600 negative samples. Regarding the positive samples, the Recall value of AdSRHOG is 89.67%, an improvement of 4.17% compared to SRHOG's 85.50%. Therefore, confirming the previous tests' findings, this experiment also suggests that the proposed feature is effective, even in situations close to real SAR applications.

Study the impact of height on human identification
In this experiment, the performance of AdSRHOG is compared against that of SRHOG using two different datasets taken at different heights. For this, images taken by a Phantom 4 Pro at heights of 60 m and 100 m are used. In both datasets, 300 samples (200 positive and 100 negative) were used for training and 120 samples (60 positive and 60 negative) for testing. In each case, both training and test samples were from the same image set, i.e. of either 60 m or 100 m height. Then, the images were classified once using AdSRHOG and once with SRHOG. The results are shown in Table 8.
As can be seen, at different heights, the Recall values of AdSRHOG and SRHOG are almost identical. Also, in both cases, the Recall_neg of AdSRHOG is around 30% bigger than that of SRHOG. Therefore, the precision obtained using AdSRHOG is nearly 30% better than that achieved using SRHOG. This means changing the height has not had a notable effect on the relative performance of SRHOG and AdSRHOG. In other words, even when the height is changed, similar to the previous tests, AdSRHOG is still the preferred feature for human detection tasks on UAV images. However, when moving from the 60 m to the 100 m images, the precision slightly decreased, by around 1.5%, for both features. Overall, changing the height does not seem to have a great effect on the accuracy of the classification, provided that both training and test images are taken at the same height.

Conclusions
We introduced a new feature, AdSRHOG, in this paper and concluded that it works well with human detection classification techniques on drone images. By replacing the Sobel filter with an adaptively defined one, the proposed feature proved more accurate than SRHOG in reducing the number of false detections. To compare the performance of the proposed feature with that of SRHOG, several experiments were carried out in which humans appeared in drone images under different posing and lighting conditions, using different datasets. Table 9 summarizes the results discussed in the previous sections. As this table suggests, despite some weakness in detecting humans in difficult cases, the proposed feature outperforms SRHOG in all scenarios. This study shows a 54.3% improvement in Recall_neg, which means the number of false labels is substantially lower than the figure obtained using SRHOG. Also, the Recall value, which represents the overall accuracy of an algorithm, increased by 13.1%. It was observed that some pose and lighting conditions are the main bottlenecks. The results relating to the former may be enhanced by extending the training dataset, while contrast enhancement techniques could improve the latter. Both issues are proposed to be explored in future studies. Besides, in the experiments presented, the effect of the drone view angle on the results was not studied. This is also an important and interesting issue to be looked at in future research activities.

Data and code availability
Data and code are available upon request.

Disclosure statement
The authors declare no conflict of interest.