A point and density map hybrid network for crowd counting and localization based on unmanned aerial vehicles

ABSTRACT Crowd counting and localisation are essential tasks in crowd analysis and are vital to ensure public safety. However, performing these tasks with UAVs brings new obstacles compared with video surveillance (e.g. viewpoint and scale variations, background clutter, and small target scales). To overcome these difficulties, this research presents a novel network named PDNet. It employs a multi-task learning approach to combine point regression and density map regression. PDNet includes a backbone to extract multi-scale features, a Dilated Feature Fusion module (DFF), a Density Map Attention module (DMA), a density map branch and a point branch. The aim of DFF is to address the difficulties of small targets and scale variations by establishing relationships between targets and their surroundings. DMA is created to address the challenge of complicated backgrounds, allowing PDNet to focus on the targets' locations. In addition, the density map branch and point branch are designed for density map regression and point regression, respectively. Experiments on the DroneCrowd dataset demonstrate that our proposed network outperforms state-of-the-art approaches in terms of localisation, L-mAP (53.85%), L-AP@10 (59.14%), L-AP@15 (63.64%), and L-AP@20 (66.21%), and we improve counting performance and significantly reduce inference time. In addition, ablation experiments are conducted to prove the modules' effectiveness.


Introduction
Crowd counting and localisation are essential tasks in crowd analysis, and they are vital to ensure public safety. In recent years, because UAVs can fly freely and cover a larger area than video surveillance, crowd counting and localisation tasks are increasingly realised through drone surveillance (Du et al., 2020; Liu et al., 2021). Utilizing drones for crowd counting and localisation can provide more comprehensive security coverage.
In recent years, a lot of crowd counting and localisation research based on video surveillance has been conducted. Some researchers (Liang, Xu, Zhu, et al., 2022; C. Liu et al., 2019; Sam et al., 2017) have proposed using the density map in the field of crowd counting to increase network performance, but just knowing the number of pedestrians in a scene is insufficient for further tasks such as crowd tracking. Several efforts, such as Liang, Xu, Zhu, et al. (2022), have been made to combine crowd counting and crowd localisation by first obtaining a density map of the same size as the image and then using the non-maximum suppression (NMS) approach to acquire the locations of local maxima on the density map. Although these works can achieve better performance in counting, the presence of long-tailed distributions in the dense regions of the density map can lead to inaccuracies in the localisation task. Some works, such as Liang, Xu, Zhu, et al. (2022), attempted to improve the density map, but the extent of the NMS's local maximum is determined manually, lowering the localisation performance of these approaches.
Some authors have proposed regression-based methods, such as pseudo box-based methods and point-based methods (Song et al., 2021; Wen et al., 2021), to circumvent the issue of long-tailed distributions induced by density maps as intermediate representations. Since the annotations of most current crowd counting datasets (Wen et al., 2021; Zhang et al., 2016) are points, they cannot provide box size information for pseudo-box-based methods, which inevitably leads to poor performance. In contrast, point-based methods remove the intermediate representation (e.g. the density map) and directly regress the locations of targets, which simplifies the network and achieves better results, as in Song et al. (2021).
UAV-based crowd counting and localisation faces different problems than video surveillance (viewpoint and scale changes, background clutter, and small target scales), as shown in Figure 1. Consequently, employing the point-based technique directly on aerial images will degrade performance.
Based on the above analysis, the density map-based approach helps the network achieve better counting performance, while the point-based method helps it achieve better localisation performance. Therefore, we propose a point-based and density map-based hybrid multi-task network called PDNet. STNNet (Wen et al., 2021) has a similar structure to PDNet: it takes the sum of the predicted density map as the counting result and the predicted points as the localisation result, which means that not all counted targets have location information, so STNNet is less reasonable in practice. In contrast, PDNet uses point regression as the primary task and density map regression as the secondary task, and the counting result is the number of predicted points. This mechanism makes PDNet more useful in practice. In addition, we introduce the predicted density map output by the auxiliary task into the network as spatial attention, for which we designed the density map attention (DMA) module. It does not require a complex structure but improves the network's ability to discover targets, which distinguishes it from other spatial attention techniques. To overcome the problems of small targets and scale variations in aerial images, we designed a new FPN-based feature fusion method, dilated feature fusion (DFF). Most areas in aerial imagery are buildings, and only small parts of an aerial image contain targets; establishing relationships between targets and the global context may introduce too much erroneous information and degrade detection performance (Li et al., 2019), so DFF only builds the relationship between targets and their surroundings. We picked EfficientnetV2-S (Tan & Le, 2021) instead of VGG16 as the backbone to give PDNet a higher inference speed.
Experimental results on DroneCrowd show that PDNet outperforms existing methods on all localisation metrics (L-mAP, L-AP@10, L-AP@15, L-AP@20); we also improve the accuracy of crowd counting in some scenarios and significantly reduce inference time. Because PDNet is a UAV-based method, it is more flexible than methods based on traditional video surveillance, and its better counting and localisation performance and higher speed make it practical to deploy. In summary, this paper has three contributions:
• We propose a new framework called PDNet, which uses point regression as the primary task and density map regression as the secondary task. Unlike STNNet, PDNet uses the number of located targets as the counting result.
• We propose a new feature fusion method called DFF, which builds relationships between targets and their surroundings in the multi-scale feature fusion stage. The principle of DFF makes it easier to detect small targets.
• Motivated by the similarities between the density map and spatial attention, we propose a new spatial attention module called DMA. It does not require designing complex network structures and can point out target areas like other spatial attention mechanisms.
The rest of this paper is organised as follows. We present related work in Section 2. In Section 3, we detail the proposed method. In Section 4, we report experimental results. Section 5 concludes this paper. The code of this paper will be released at https://github.com/cracknum/PDNet.

Density map-based methods
The most common supervised method in crowd counting is to employ the density map, first proposed by Lempitsky and Zisserman (2010), which sums the predicted density map to obtain the counting result. In recent years, the accuracy of crowd counting methods has been enhanced through innovations in network construction and density maps. Zhang et al. (2016) employs a multi-branch network structure with different convolutional kernel sizes to address the scale variations of the crowd. Sam et al. (2017) uses a multi-branch network similar to that of Zhang et al. (2016), but includes a switch layer to determine which branch generates the predicted density map. Li et al. (2018) and Wang et al. (2022) use dilated convolutions with different dilation rates to obtain deeper features and larger receptive fields to adapt to scale variations. Cao et al. (2018) employs a multi-branch convolutional module at each stage to extract features at various scales and then uses deconvolutions to generate the predicted density map. Jiang and Jin (2019) and Fan et al. (2020) proposed improvements to density map generation in the feature extraction and loss supervision stages, respectively.
Because the density map represents the spatial distribution of targets but does not capture their specific locations, density map-based methods cannot provide accurate localisation information (Liang, Xu, Zhu, et al., 2022; C. Liu et al., 2019). This problem is also explained in Bai et al. (2020), Ma et al. (2019) and others. Therefore, the proposed method considers all targets as a collection of points and regresses them directly, which avoids the inherent drawbacks of density maps as intermediate representations.

Localization based methods
Due to the inherent inaccuracy of density map-based methods in crowd localisation, some researchers have proposed bounding box-based techniques (Basalamah et al., 2019; Sam et al., 2021) for detection. Basalamah et al. (2019) randomly annotates heads and then generates a scale map using the sizes of the annotations. Sam et al. (2021) uses the distance between a target and its nearest neighbouring target to calculate the size of a pseudo bounding box. A later work used the same strategy as Sam et al. (2021) to initialise pseudo bounding boxes and proposed an online update method to ensure their precision. However, because the annotations of most recent datasets (Zhang et al., 2016) are points, these methods are not universal. Point-based regression algorithms (Song et al., 2021; Wen et al., 2021) have therefore been proposed; these methods remove the intermediate representation and directly regress the locations of objects. Song et al. (2021) proposed P2PNet, a pure point-based approach with no intermediate states that regresses objects directly. Another work presents a KMO-based Hungarian method, which reconsiders label assignment from the standpoint of contextual instances rather than isolated examples. STNNet, a multi-task framework presented by Wen et al. (2021), employs a density map-based sub-network for the counting task and a point-based sub-network for localisation.
Unlike existing methods, the proposed method treats the density map-based sub-network as an auxiliary network that provides information about the spatial distribution of targets to the point-based sub-network, and treats the counting result as a by-product of the localisation task.

Network construction
In this section, we detail our proposed model, PDNet, which includes five modules: the backbone, DFF (Section 3.2), DMA (Section 3.3), the density map branch (Section 3.4), and the point branch (Section 3.5). Finally, we introduce the loss functions of the density map branch and the point branch (Section 3.6). The overall framework is shown in Figure 2. Firstly, we restructured the first seven stages of EfficientnetV2-S as our backbone to generate multi-scale features; the structural comparison between EfficientnetV2-S and our backbone is shown in Table 1.
After inputting an image I with width I_w and height I_h into the backbone for feature extraction, it produces the features F_s, s ∈ {4, 8, 16, 32}, where s denotes the downsampling stride. The width and height of F_s are w_s and h_s, respectively, and the area of F_s is A_s = w_s × h_s, where each element F_s^i encodes the semantic information of an s × s region of the original image.

DFF
Aerial images have problems such as a broad field of view and complex backgrounds, and using general spatial attention would introduce too much unnecessary correlated information and degrade the model's performance. Therefore, we hope to build an appropriate correlation between an object and its surroundings. On the other hand, target scale variation is another factor that affects the model's performance. To introduce relevant correlation information between the object and its surroundings and reduce the impact of scale variations on the network, we designed a feature fusion method based on FPN named DFF. The structure is shown in Figure 2.
DFF contains three sub-modules: DFF3, DFF4, and DFF5. DFF3 is a multi-branch structure composed of three 3 × 3 convolutional layers with different dilation rates (1, 2, 5) and a 1 × 1 convolutional layer. Firstly, DFFx (x ∈ {3, 4, 5}) employs the multi-branch structure to extract multi-scale local contextual information. Secondly, DFFx adjusts the number of channels through the 1 × 1 convolutional layer after channel concatenation. Since DFF3, DFF4 and DFF5 are designed to adapt to different receptive fields, DFF4 and DFF5 are very similar in structure to DFF3; the difference is that the multi-branch structure of DFF4 uses two 3 × 3 convolutional layers with different dilation rates (1, 2), while DFF5 replaces the multi-branch structure with a single convolutional layer with a dilation rate of 1.
Unlike Wang et al. (2022) and Li et al. (2019), which use parallel convolutional layers with different dilation rates to extract multi-scale information from the features output by the last layer, or another work that applies the same set of dilated convolutional layers to features of different sizes, we employ DFF3, DFF4, and DFF5 at different stages of the FPN to adapt to the changing receptive field.
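The DFF3 sub-module described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the channel count and the absence of normalisation/activation layers are assumptions; only the branch layout (parallel dilated 3 × 3 convolutions, concatenation, then a 1 × 1 channel-adjusting convolution) follows the text.

```python
import torch
import torch.nn as nn

class DFF3(nn.Module):
    """Sketch of DFF3: three parallel 3x3 convolutions with dilation
    rates 1, 2 and 5, channel concatenation, and a 1x1 convolution
    that restores the channel count. DFF4 would use rates (1, 2) and
    DFF5 a single rate-1 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = dilation keeps the spatial size unchanged for 3x3 kernels
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 5)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(multi_scale)
```

Because padding equals the dilation rate, every branch preserves the feature map's spatial size, so the outputs can be concatenated directly.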

DMA
Density maps can illustrate crowd distribution and smooth point annotations. However, directly employing the density map to supervise the localisation task cannot provide accurate positioning information. In Figure 3, (a) is the visualisation of the annotated points in the image, (b) is the visualisation of the density map converted from these annotated points using the density map generation method proposed by Zhang et al. (2016), and (c) is the visualisation of the points obtained by applying non-maximum suppression (NMS) to the generated density map. A comparison of (a) and (c) shows that some of the annotated points in (a) have disappeared. Therefore, using only the density map to supervise the localisation task cannot achieve better performance.
We need to rethink the meaning of the density map. Zhang et al. believe that the density map describes the spatial distribution of the crowd in a given image, and the spatial attention mechanism in computer vision can be summarised as indicating which parts we want the model to focus on. Therefore, the density map and spatial attention are similar. With this intuitive idea, we designed a plain module, DMA, which uses a sigmoid function to map the predicted density map obtained from the density map branch into the range (0, 1) and then multiplies the 8× feature map from DFF with it to highlight target regions. Finally, the above result and the 8× features are added element-wise to enrich the features. The structure of DMA is shown in Figure 4.
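The two operations described above (sigmoid-gated multiplication plus a residual addition) can be sketched in a few lines. This assumes the predicted density map has already been resized to the resolution of the 8× feature map; how the resolutions are aligned is not specified in the text.

```python
import torch
import torch.nn as nn

class DMA(nn.Module):
    """Sketch of DMA: the predicted density map is squashed into
    (0, 1) with a sigmoid and used as spatial attention over the
    8x feature map, followed by an element-wise residual addition."""
    def forward(self, feat: torch.Tensor, density: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) 8x feature map from DFF
        # density: (B, 1, H, W) predicted density map (assumed same H, W)
        attn = torch.sigmoid(density)   # map density values into (0, 1)
        return feat * attn + feat       # highlight target regions, keep original features
```

The residual addition means regions with near-zero predicted density still pass through unchanged features, so the attention only boosts likely target areas rather than suppressing the rest.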

Density map branch
We employ the adaptive density map generation method proposed by Zhang et al. (2016) to convert the annotated points in an image into a ground truth density map. For each target p_i in the image, let d_i^j denote the distance from p_i to its jth nearest neighbour, j ∈ {1, 2, . . . , m}; the average distance from p_i to its nearest neighbours is then d̄_i = (1/m) Σ_{j=1}^{m} d_i^j. The density map is generated as D(x) = Σ_i δ(x − p_i) ∗ N_{σ_i}(x), where δ(·) is the delta function, N_{σ_i}(·) is the two-dimensional Gaussian function, σ_i = β d̄_i is proportional to the average distance from p_i to its nearest neighbours, and β = 0.1. The ground truth density map is represented as D = {M_i | i = 1, 2, . . . , I_h × I_w}, where M_i is the value of the ground truth density map at pixel i.
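The adaptive generation procedure above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: m = 3 neighbours and a brute-force nearest-neighbour search (an efficient implementation would use a k-d tree); each Gaussian is normalised so that every target contributes exactly one unit of mass, so the map sums to the crowd count.

```python
import numpy as np

def adaptive_density_map(points, h, w, m=3, beta=0.1):
    """Sketch of adaptive density map generation (after Zhang et al., 2016):
    place a 2-D Gaussian at each annotated point, with bandwidth
    sigma_i = beta * (average distance to the m nearest neighbours)."""
    density = np.zeros((h, w), dtype=np.float64)
    pts = np.asarray(points, dtype=np.float64)   # (K, 2) as (x, y)
    ys, xs = np.mgrid[0:h, 0:w]
    for i, (px, py) in enumerate(pts):
        # distances from p_i to all other points; drop the zero self-distance
        d = np.sort(np.sqrt(((pts - pts[i]) ** 2).sum(axis=1)))[1:m + 1]
        sigma = beta * d.mean() if len(d) else 1.0   # fallback for a lone point
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        g /= g.sum()   # each target contributes total mass 1
        density += g
    return density
```

Summing the resulting map therefore recovers the number of annotated targets, which is what makes the density map usable for counting supervision.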
To reduce the performance degradation caused by a sharp reduction in the number of channels, the density map branch (DMB) is composed of four consecutive 3 × 3 convolutional layers. It adjusts the number of channels of the features to generate density maps according to c → c/2 → c/4 → c/8 → 1, where c is the number of channels of the input features.
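The tapering channel schedule above can be sketched as follows. The four 3 × 3 convolutions and the c → c/2 → c/4 → c/8 → 1 progression follow the text; the ReLU activations between layers are an assumption, as the paper does not specify them here.

```python
import torch
import torch.nn as nn

def make_dmb(c: int) -> nn.Sequential:
    """Sketch of the density map branch: four consecutive 3x3
    convolutions tapering the channels c -> c/2 -> c/4 -> c/8 -> 1."""
    widths = [c, c // 2, c // 4, c // 8, 1]
    layers = []
    for cin, cout in zip(widths[:-1], widths[1:]):
        layers.append(nn.Conv2d(cin, cout, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.pop()  # no activation after the final 1-channel density output
    return nn.Sequential(*layers)
```

Halving the channels at each step, rather than collapsing straight from c to 1, is what the text means by avoiding a "sharp reduction" in channel count.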

Point branch
We set up a reference point R_i = (r_i^x, r_i^y) at the centre of the patch represented by F_s^i; thus, the number of reference points is A_s. All reference points can be represented as R = {R_i | i ∈ {1, 2, . . . , A_s}}. For each R_i, we generate one prediction point p̂_i = (x̂, ŷ), so there are A_s prediction points in total. The set of prediction points is P̂ = {p̂_i | i ∈ {1, 2, . . . , A_s}}. Each ground truth point is represented as p_i = (x_i, y_i), and there are K ground truth points in total, so the set of ground truth points is P = {p_i | i ∈ {1, 2, . . . , K}}.
The point branch contains a regression branch and a classification branch. Firstly, in the regression branch, we take the boosted feature map BF_s from DMA and generate the offset Δp_i = (Δx_i, Δy_i) of each point. The offsets of all points are ΔP = {Δp_i | i ∈ {1, 2, . . . , A_s}}. ΔP is then multiplied by a coefficient α and added to the reference points R, so the coordinates of the proposed points are p̂_i = R_i + α Δp_i. Meanwhile, in the classification branch, the boosted feature map BF_s is used to generate the confidences of the proposed points, Ĉ = {ĉ_i | i ∈ {1, 2, . . . , A_s}}.
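The reference-point construction and the two heads can be sketched as follows. The single-convolution heads and the value of α are illustrative assumptions; only the overall scheme (per-location offset regression added to stride-centred reference points, plus a confidence head) follows the text.

```python
import torch
import torch.nn as nn

class PointBranch(nn.Module):
    """Sketch of the point branch: a regression head predicts offsets
    from reference points placed at the centre of each stride x stride
    patch; a classification head predicts per-point confidences."""
    def __init__(self, channels: int, alpha: float = 100.0):
        super().__init__()
        self.alpha = alpha
        self.reg_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)  # (dx, dy)
        self.cls_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # confidence

    def forward(self, bf: torch.Tensor, stride: int = 8):
        b, _, h, w = bf.shape
        # reference points R_i at the centre of each stride x stride patch
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=torch.float32),
            torch.arange(w, dtype=torch.float32),
            indexing="ij",
        )
        refs = torch.stack(((xs + 0.5) * stride, (ys + 0.5) * stride), dim=-1)  # (h, w, 2)
        offsets = self.reg_head(bf).permute(0, 2, 3, 1)     # (b, h, w, 2)
        points = refs + self.alpha * offsets                # p_hat = R + alpha * delta_p
        conf = torch.sigmoid(self.cls_head(bf)).squeeze(1)  # (b, h, w)
        return points.reshape(b, -1, 2), conf.reshape(b, -1)
```

For an 8× feature map of size h × w, this yields A_s = h · w proposed points with one confidence each, matching the notation in the text.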

Match strategy
We use the matching strategy proposed in P2PNet. After obtaining P̂ and Ĉ, we use the Hungarian matching algorithm to obtain one-to-one matching results. The matching cost is a weighted combination of cost_point, the distance matrix between predicted points and ground truth points (weight λ_1 = 1), and cost_cls, a term based on each proposal's confidence (weight λ_2 = 0.05). Finally, we obtain the optimal matching result ξ ∈ {1, . . . , K} and the set of positive proposals P̂_pos.
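The matching step can be sketched with SciPy's Hungarian solver. The exact combination of the two cost terms is an assumption here (distance minus weighted confidence, so confident proposals are preferred, following P2PNet's convention); the paper only specifies the weights λ_1 = 1 and λ_2 = 0.05.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_points, pred_conf, gt_points, lam1=1.0, lam2=0.05):
    """Sketch of one-to-one matching between proposals and ground
    truth points via the Hungarian algorithm. cost_point is the
    Euclidean distance matrix; the confidence term rewards proposals
    the classifier is confident about (sign is an assumption)."""
    dist = np.linalg.norm(pred_points[:, None, :] - gt_points[None, :, :], axis=-1)
    cost = lam1 * dist - lam2 * pred_conf[:, None]
    row, col = linear_sum_assignment(cost)  # minimises the total cost
    return row, col  # proposal row[k] is matched to ground truth col[k]
```

With more proposals than ground truth points, only K proposals are matched; the matched ones become the positive set and the rest are negatives for the classification loss.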

DMB loss
Most existing density map estimation methods use the Euclidean loss function, defined as L_E = (1/2N) Σ_{i=1}^{N} ||D̂_i − D_i||_2^2, where D̂_i is the predicted density map of the ith sample, D_i is the corresponding ground truth, N is the number of samples, and ||·||_2^2 denotes the squared Euclidean distance.
In density map estimation models, the better the generated density map, the higher the achievable performance. Therefore, structural similarity (SSIM) (Wang et al., 2004) is used in this branch to measure the similarity between the predicted and ground truth density maps at multiple scales. SSIM is defined as SSIM(D_s, D̂_s) = [(2 μ_{D_s} μ_{D̂_s} + C_1)(2 σ_{D_s D̂_s} + C_2)] / [(μ_{D_s}^2 + μ_{D̂_s}^2 + C_1)(σ_{D_s}^2 + σ_{D̂_s}^2 + C_2)], where μ_{D_s}, μ_{D̂_s}, σ_{D_s} and σ_{D̂_s} are the local means and standard deviations of the ground truth density map D_s and the predicted density map D̂_s, respectively, σ_{D_s D̂_s} is their local covariance, and C_1 and C_2 are constants.
Thus, the loss of the density map branch is defined as a combination of the Euclidean loss and the multi-scale SSIM loss.

Point loss
In this part, we detail the losses of the regression and classification branches.

Regression branch loss function. We use the MSE loss to measure the distance between the matched predicted points P̂_ξ(i) and the ground truth points P_i of the ith sample.

Classification branch loss function. The reference point system introduces a large number of negative samples, including many hard samples. To overcome this disadvantage, we introduce the focal loss (Lin, Goyal, et al., 2017), where ĉ_{i,ξ(j)} is the confidence of p̂_{ξ(j)} in the ith sample, α = 0.65, and γ = 2.0. The integrated total loss is a weighted sum of the point losses and the density map branch loss, with λ_1 = 0.0002 and λ_2 = 0.01.
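The focal loss used above down-weights easy examples so that the many easy negatives introduced by the reference-point system do not dominate training. A minimal sketch with the paper's α = 0.65 and γ = 2.0, assuming sigmoid confidences and a boolean mask marking the matched (positive) proposals:

```python
import torch

def focal_loss(conf, is_positive, alpha=0.65, gamma=2.0):
    """Sketch of the focal loss (Lin et al., 2017) over proposal
    confidences: positives are penalised for low confidence,
    negatives for high confidence, with the (1 - p)^gamma factor
    suppressing easy examples."""
    conf = conf.clamp(1e-6, 1 - 1e-6)  # numerical safety for log()
    pos = -alpha * (1 - conf) ** gamma * torch.log(conf)
    neg = -(1 - alpha) * conf ** gamma * torch.log(1 - conf)
    return torch.where(is_positive, pos, neg).mean()
```

A confident, correctly classified positive (e.g. conf ≈ 0.9) contributes almost nothing, while a positive the model misses (conf ≈ 0.1) dominates, which is exactly the behaviour needed with heavy class imbalance.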

Crowd localisation
We use the evaluation metrics proposed by Wen et al. (2021), including L-mAP, L-AP@10, L-AP@15 and L-AP@20, to compare localisation performance, where 'L' means localisation and the numbers represent the distance threshold between the ground truth point and the prediction. Most of the calculation processes of these metrics are consistent with the AP of general object detection, but they use the Euclidean distance as the criterion instead of the IoU. To fairly compare the performance of all models, we use the toolkit proposed by Zhu et al.

Crowd counting
We use the Mean Absolute Error (MAE) and Mean Squared Error (MSE) as evaluation metrics to compare the counting performance of PDNet and other methods. They are defined as MAE = (1/N) Σ_{i=1}^{N} |E_i − Ê_i| and MSE = sqrt((1/N) Σ_{i=1}^{N} (E_i − Ê_i)^2), where E_i is the number of ground truth points in P_i and Ê_i is the number of predicted points in P̂_i.
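The two counting metrics above can be sketched directly; note that, as is conventional in crowd counting, "MSE" here is the root of the mean squared error.

```python
import numpy as np

def counting_metrics(gt_counts, pred_counts):
    """Sketch of the counting metrics: MAE is the mean absolute
    counting error over test images; MSE is the root mean squared
    counting error (crowd-counting convention)."""
    gt = np.asarray(gt_counts, dtype=np.float64)
    pred = np.asarray(pred_counts, dtype=np.float64)
    mae = np.abs(gt - pred).mean()
    mse = np.sqrt(((gt - pred) ** 2).mean())
    return mae, mse
```

For PDNet, the predicted count Ê_i is simply the number of points whose confidence exceeds the 0.5 threshold, since counting is a by-product of localisation.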

Experiment setting
Our backbone uses EfficientnetV2-S pre-trained on ImageNet, and the other convolutional layers use the default initialisation method of PyTorch. Inspired by ConvNeXt, we used the AdamW (Loshchilov & Hutter, 2019) optimiser for model training and a warm-up CosineLRScheduler (Loshchilov & Hutter, 2017) as the learning rate scheduler. In the preprocessing stage, we first use the adaptive method to generate a density map, and then the image and density map are cropped with the same uniform sampling distribution, yielding four 256 × 256 patches of the image and four patches of the same size from the density map. In the test phase, we use 0.5 as the confidence threshold to filter prediction points. All experiments are implemented with the PyTorch 1.10.0 framework on a single NVIDIA 3080 Ti GPU.

Dataset
DroneCrowd is a large UAV-based crowd analysis dataset that can be used in crowd counting, localisation and tracking studies. It contains 112 video clips of 70 different scenes in 4 cities in China, totalling 33,600 frames, and includes different target sizes (small, large), diverse backgrounds (cloudy, sunny, night) and target density changes (crowded, sparse). Since the temporal information in the DroneCrowd dataset is not required for this study, we take one of every ten consecutive frames from the original training set, obtaining 2,460 images as the training set. In addition, we use the original testing set and validation set for performance evaluation. The statistical information of DroneCrowd is shown in Table 2.

Accuracy
To demonstrate the accuracy of our method, we evaluated counting and localisation performance on DroneCrowd and compared it with other state-of-the-art methods. For models without localisation capability, we extract the peaks of the predicted density map as localisation points. The visualisation of the localisation performance of P2PNet, STNNet, and PDNet is shown in Figure 5.

Crowd localisation comparison
As shown in Table 3, our method achieves 53.85% L-mAP, which is 13.4% higher than the second best, STNNet, indicating that PDNet produces more accurate localisation information. MCNN, CAN, CSRNet and DM-Count are density map-based methods, so their localisation results do not perform well. Although P2PNet is slightly worse than STNNet and PDNet, it is still better than MCNN, CAN, CSRNet, and DM-Count, which indicates that point-based methods outperform density map-based methods in localisation. The gap between L-AP@10 and L-AP@15 of P2PNet is relatively large, reaching 20.66%, which indicates that P2PNet cannot identify the features of small objects or handle scale changes well. In contrast, the difference between L-AP@10 and L-AP@15 of PDNet is only 4.28%, which shows that PDNet is better than pure point-based methods in localisation; a similar pattern is observed for STNNet.

Crowd counting comparison
As presented in Table 4, CSRNet, CAN, DM-Count, and STNNet achieved the best counting results in some scenes, indicating that density map-based methods have better counting capability. Meanwhile, PDNet also achieved the best counting results in some scenarios, which demonstrates the feasibility of the hybrid framework. Improving the counting performance of the hybrid framework is an important part of our future research.

Discussion
Although PDNet achieved state-of-the-art counting performance in some scenarios, we find that it is not competitive with STNNet in the night scene. As shown in Figure 5, due to obstacles such as target shadows and poor lighting at night, PDNet is unable to locate targets well. STNNet's counting result is obtained by summing the density map, which gives it better counting performance in this scene thanks to the density map's characteristics and the tracking sub-network providing information on pedestrian trajectories.

Computation complexity
PDNet aims to provide sufficient localisation and counting performance at a higher inference speed. As shown in Tables 3 and 4, the inference speed of MCNN is the fastest, reaching 28.98 fps, but it does not match the localisation and counting performance of several stronger models, such as CAN, DM-Count, and STNNet; its relatively simple structure explains its speed. Although STNNet achieves better counting and localisation performance and the best results in some scenarios, its complex structure and three sub-networks lead to slow inference, limiting its practicality. DM-Count also counts relatively quickly with high counting accuracy, but its localisation accuracy is poor. PDNet not only has relatively high counting and localisation accuracy but also a high inference speed of 17.7 fps. Therefore, PDNet is more practical.

Ablation experiment
To validate the effectiveness of the modules proposed in PDNet, we conduct ablation experiments on DroneCrowd. We use the part of PDNet that remains after removing DFF, DMA and DMB as the base model, named BM, and use the 8× feature map to generate the prediction points and confidences. Figure 6 shows the structure of BM.

Backbone selection
To obtain a network with not only high counting and localisation performance but also high speed, we performed experiments on BM using different backbones: VGG16 (Simonyan & Zisserman, 2015), Resnet50 (He et al., 2016), Densenet161 (Huang et al., 2017) and EfficientnetV2-S (Tan & Le, 2021). For a fair comparison, because some models, such as VGG16, cannot produce 32× features, we copy their last downsampling layer and convolutional module as a 32× feature extraction module. As shown in Table 5, VGG16 and Densenet161 achieve the best counting and localisation performance, but their inference speeds are far behind EfficientnetV2-S and their parameter sizes are much larger. Although the inference speeds of Resnet50 and EfficientnetV2-S are very close, the counting and localisation performance of Resnet50 is lower. To sum up, EfficientnetV2-S is the most suitable backbone for our network.

Crowd counting
As shown in Table 6, we constructed four ablation experiments on crowd counting to demonstrate the effectiveness of our modules. After adding DFF to BM, counting performance improved in some scenes compared with BM, such as the small and night scenes, because DFF establishes the association between targets and their surroundings and alleviates the inability to extract useful features. In the third experiment, we added DMB and DFF to BM. Compared with the second experiment, the counting performance improved in all scenarios, demonstrating that a multi-task learning approach combining point regression and density map regression can improve the model's performance. In the last experiment, we added DFF, DMB, and DMA to BM; its result shows the best counting performance in the overall, small, cloudy, crowded, and sparse scenarios.

Crowd localisation
As shown in Table 7, when we add the DFF, DMB and DMA modules to the base model BM one by one, the localisation performance gradually improves on all metrics. This demonstrates the effectiveness of our proposed modules.

Conclusion
In this paper, we propose PDNet, a novel practical method for UAV-based crowd counting and localisation. It is a point and density map hybrid multi-task network. To introduce appropriate contextual information at different stages, and motivated by the similarity between the density map and the spatial attention mechanism, we designed DFF and DMA, respectively. Experimental results on a large UAV crowd analysis dataset, DroneCrowd, demonstrate that PDNet outperforms state-of-the-art methods on the localisation metrics (L-mAP, L-AP@10, L-AP@15, L-AP@20) and improves counting performance. The inference speed of PDNet is also faster than most state-of-the-art methods. Although PDNet achieves the best results in localisation, there is still a gap in counting with STNNet in some scenarios, especially the night scenario. We argue that this is due to the tracking sub-network of STNNet, which, however, causes its inference speed to become very slow. In future work, we hope to find a practical structure that provides the model with information about target trajectories while maintaining a high inference speed.

Disclosure statement
No potential conflict of interest was reported by the author(s).