A news picture geo-localization pipeline based on deep learning and street view images

ABSTRACT Numerous news and event pictures are taken and shared on the internet every day; they contain abundant information worth mining, but only a small fraction of them are geotagged. The visual content of a news image offers clues to its geographical location because such images are usually taken at the site of the incident, which provides a prerequisite for geo-localization. This paper proposes an automated deep-learning pipeline for the geo-localization of news pictures in a large-scale urban environment using geotagged street view images as a reference dataset. The approach obtains location information by constructing an attention-based feature extraction network. Then, the image features are aggregated, and candidate street view results are retrieved by a selective matching kernel function. Finally, the coordinates of the news images are estimated by a kernel density prediction method. The pipeline is tested on news pictures from Hong Kong. In the comparison experiments, the proposed pipeline shows stable performance and generalizability in the large-scale urban environment. In addition, the performance analysis of the pipeline's components shows its ability to recognize localization features in partial areas of pictures and the effectiveness of the proposed solution for news picture geo-localization.


Introduction
A large number of news and event pictures are created every day, containing valuable content to be analyzed. Especially with the emergence of social media such as Twitter, Instagram, and Sina Weibo, people are willing to take photos and share popular events happening around them. Massive amounts of event-related content can easily be obtained from the internet, making it possible to track and analyze current trends or public sentiment (Lin et al. 2022). Event pictures can also be used in fake news detection, caption generation (Feng and Lapata 2013) and content recommendation (Yang, Long, et al. 2020). However, only a limited portion (1% to 4%) of messages are geotagged in the case of Twitter (Nizzoli et al. 2020). Due to absent or inaccurate locations, it is difficult to perform spatial visualization and analysis.
Geo-localization of news or event pictures plays a vital role in analyzing the content of the images, such as quickly detecting and monitoring the location, scope, and number of events. The method of extracting location information has a wide range of applications, including tourism, politics, crime, environment, disaster management, and transport (Lock, Bednarz, and Pettit 2021).
Further applications usually require a combination of the semantic features reflected in media images or the events described in the news. For example, natural disaster events have been analyzed through information posted on the web (Bhuvana and Arul Aram 2019;Zou et al. 2019;Skripnikov et al. 2021). The local government can respond quickly according to the situation of the incident and issue safety announcements. The analysis results can also be used as a reference for the government to formulate emergency plans. The images with geotags are also used in urban environmental analysis (Cheng et al. 2017;Meng et al. 2020;Jing et al. 2021), such as building audits (Kelly et al. 2013;Ning et al. 2022), greening analysis (Li 2021;Xia, Yabuki, and Fukuda 2021), and urban landscape analysis (Li, Ratti, and Seiferling 2017;Li et al. 2021).
For various reasons, it is difficult to obtain accurate locations directly from event pictures, and finding image locations manually is inefficient. Out of security or privacy concerns, users often share content without geotag information or with only a vague semantic location (Shang et al. 2022). For example, removing EXIF metadata is an effective way to avoid revealing the image location. Coordinate information may also be lost during upload, compression, or copying. The lack of location information creates obstacles to analyzing the content of the images.
Previous methods have focused primarily on street view image geo-localization and have achieved impressive performance. Most of this work takes clear street view images with obvious visual features as query images, while images with noisy content have not been widely analyzed in the literature. Furthermore, most media or crowdsourced images are taken with non-professional equipment in complex environments, making it difficult to recognize semantic features. To overcome these problems, this paper proposes an automated pipeline for urban-scale news picture geo-localization based on deep learning and street view images. A difficult query dataset on the theme of 'Hong Kong's Extradition Turmoil' is collected, in which most images depict clashes with extreme viewpoint changes and some are taken at night. The geo-localization pipeline includes three steps. First, a CNN with an attention mechanism is used to extract high-level features representing the location information of local regions in news images. Second, the features extracted from the CNN are aggregated, and a similarity function is used to implement the news image retrieval task. Finally, a kernel density prediction (KDP) method is proposed for estimating location, which integrates the similarity ranking and the spatial distribution of the results. This paper is structured as follows. Section 2 introduces related work on current geo-localization methods. Section 3 describes the study area, datasets, the proposed news picture geo-localization pipeline, and the evaluation metrics. Section 4 presents the experimental results. The discussion is in Section 5. Finally, Section 6 presents the conclusions.

Related work
There are currently three main methods for geo-localization: 2D ground-to-ground image-based, 2D ground-to-satellite image-based and 3D structure-based methods.
The 2D ground-to-ground image-based methods utilize large-scale geo-tagged datasets taken at ground level for geo-localization. The methods (Torii et al. 2015;Cheng et al. 2018;Ge et al. 2020) compare the similarity of visual content in both the query images and dataset images by extracting compact features (Babenko and Lempitsky 2015;Tolias, Ronan, and Hervé 2015;Kalantidis, Mellina, and Osindero 2016;Radenović, Tolias, and Chum 2019) or local features (Noh et al. 2017;Tian et al. 2020;Tolias, Tomas, and Ondřej 2020;Yang, Nguyen, et al. 2020;Zhi et al. 2021). The query images are located by the best similarity result location (Kim, Dunn, and Frahm 2017;Liu and Li 2019). To retrieve a large-scale dataset, features are aggregated into compact vectors (Sivic and Zisserman 2003;Perronnin and Dance 2007;Jégou et al. 2010) for fast similarity computation.
2D ground-to-satellite image-based methods cast geo-localization as a cross-view image retrieval problem (Toker et al. 2021;Shi et al. 2022;Wang et al. 2022;Zhu et al. 2022) in which high-resolution satellite images are constructed as a reference dataset to query ground-level images. This makes it possible to geo-localize areas not covered by ground images, because high-resolution satellite images cover almost the whole earth and are easily accessible, for example through Google Earth. However, ground and satellite images have very different visual appearances (Wilson et al. 2021). Although recent works (Zheng, Wei, and Yang 2020;Wang et al. 2022) have used drones to capture side views of streets and buildings, issues remain, such as the geometric distortion of objects and inconsistent orientation between query and reference images.
3D structure-based methods (Sattler et al. 2018;Sarlin et al. 2021) calculate the 6-DOF pose of the camera in the world coordinate system, which requires capturing 3D data and is often used in augmented reality (Park et al. 2022), 3D reconstruction (Tancik et al. 2022), and navigation (Bruno et al. 2022). It is also possible to align a 2D query image against a 3D point cloud or model (Lehtola et al. 2015) after establishing efficient 2D-3D matches. This approach can achieve higher accuracy but requires additional computation. 3D models are available from LiDAR sensors, and 3D reconstruction from street view images (Cheng et al. 2018) is another way to obtain 3D geo-reference data.
News images are usually taken at the location of the event, and the visual content of the image implies clues to the geographical location, which provides a prerequisite for geo-localization. However, news images have complex content, varying in shooting angle and illumination, in which buildings are often obscured. In addition, dynamic targets such as vehicles, pedestrians, and billboards in urban scenes can also impair matching. These problems make it difficult to apply 3D structure-based methods to construct 3D models and geo-localize news images. Moreover, the visual content of satellite imagery differs significantly from media images on the internet. Thus, this paper employs a 2D ground-to-ground image-based method to estimate where news pictures were taken by matching geotagged images, which can extract local location clues from crowdsourced news pictures. Street view images of the city where the incident occurred are used as a reference dataset because city street view images are densely and widely distributed (Zhang and Liu 2021) and share a ground-level perspective with news pictures. The street view images also provide richer clues, such as building textures and street layouts.

Study area and data
The study area of Hong Kong Island is the political, economic and commercial center of Hong Kong, China. The study area covers approximately 50 km². The northern part of the island is one of the core urban areas of Hong Kong. There are many landmarks, such as the Central Government Offices, the Legislative Council Complex and Golden Bauhinia Square. The main central business districts, such as Central, Admiralty, Wan Chai and Causeway Bay, are also located in this area, making it an active area for assemblies and tourist gatherings. This paper used 833,664 perspective images projected from 34,736 Google Street View equirectangular panoramas of the Hong Kong area as a reference dataset, as shown in Figure 1(a). Panoramas with a 2:1 aspect ratio are transformed into perspective images by gnomonic projection. Figure 1(b) displays the local distribution of street view images. These images have longitude and latitude information, are spaced at intervals of 10-12 m, and are evenly distributed along the roads. As shown in Figure 1(c), each 3,332 × 1,666 pixel panorama is split into 24 perspective images of 640 × 480 pixels with 8 yaw directions (45° intervals: 0°, 45°, 90°, …, 315°), a 60° horizontal field of view (FOV) and pitch directions of 5°, 20° and 35°. The perspective images overlap each other by approximately 80%. This paper collected 81 news query pictures about 'Hong Kong's Extradition Turmoil' from the internet (Google, Wikipedia, Flickr and news websites), as shown in Figure 2. The following strategy was used to sample these pictures. Pictures relevant to the topic that contain visual information about streets were chosen from keyword search results from multiple sources. Pictures of people, indoor scenes, and locations outside the study area were removed manually. Then, pictures whose latitude and longitude could be determined were selected, while pictures whose exact locations could not be recognized were ignored.
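The panorama-to-perspective splitting described above can be sketched as follows. This is a minimal numpy implementation of a gnomonic projection with nearest-neighbour sampling, not the exact code used to build the dataset; the function name and axis conventions are illustrative assumptions.

```python
import numpy as np

def perspective_from_pano(pano, yaw_deg, pitch_deg, fov_deg, out_w, out_h):
    """Sample a perspective (gnomonic) view from an equirectangular panorama.

    pano: (H, W, C) array with W = 2H (2:1 aspect ratio).
    Nearest-neighbour sampling keeps the sketch dependency-free.
    """
    h, w = pano.shape[:2]
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels

    # Pixel grid on the virtual image plane, camera looking along +x.
    u = np.arange(out_w) - (out_w - 1) / 2
    v = np.arange(out_h) - (out_h - 1) / 2
    uu, vv = np.meshgrid(u, v)
    dirs = np.stack([np.full_like(uu, f), uu, -vv], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate by pitch (around y) then yaw (around z).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    dirs = dirs @ Ry.T @ Rz.T

    # Ray direction -> spherical angles -> panorama pixel coordinates.
    lon = np.arctan2(dirs[..., 1], dirs[..., 0])    # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 2], -1, 1))   # [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * (w - 1)).astype(int)
    py = ((0.5 - lat / np.pi) * (h - 1)).astype(int)
    return pano[py, px]
```

Calling this 24 times per panorama (8 yaws × 3 pitches, 60° FOV, 640 × 480 output) reproduces the splitting scheme of Figure 1(c).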
The marked images often contain recognizable visual content, such as landmarks or distinctive street layouts. In the end, a total of 81 images were tagged for the experiment. These images vary in illumination, perspective and scene. They are distributed across the northern part of the island, clustered mainly around government buildings, commercial centers, etc. Most query pictures have no EXIF or location information, and the context of a picture usually only mentions the event that occurred at that place. To obtain the exact location of each image, the approximate capture area was first confirmed by retrieving street view images with the SIFT features of the news pictures and inspecting the results. Then, the images were manually located using Google Street View Maps as a reference.
Considering that it is difficult to generate a noise-free training set from street view images that carry only location information, landmark datasets are used to train the networks, which also makes the network robust in large-scale urban environments. Moreover, since the purpose of this paper is to geo-localize news or crowdsourced images, previous models trained on existing geo-localization datasets are difficult to apply to other regions, particularly for media pictures taken in the complex environments covered in this paper. This paper used part of the Google Landmarks Dataset (GLD) (Noh et al. 2017) to train the network, which contains human-made and natural landmark images, as depicted in Figure 3. To clean the noisy data, an image-retrieval-based method was used to eliminate abnormal images in each class. GLD was first used to train ResNet101. Then, the fully connected layer of the network was removed, and the output of the average pooling layer was extracted as the image feature. The feature clustering center of each class was calculated, the top N images closest to the centroid were selected, and this step was repeated once. Finally, the remaining images were regarded as the benchmark images of each class. The cleaned GLD training dataset (a total of 1,500 classes) was constructed using random selection.
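The centroid-based cleaning step can be sketched as below, assuming the per-class global descriptors have already been extracted with the truncated ResNet101; the function name and the two-round loop are illustrative.

```python
import numpy as np

def clean_class(features, top_n, rounds=2):
    """Keep the top_n images closest to the class centroid, repeated `rounds` times.

    features: (m, d) array of global descriptors for one landmark class.
    Returns sorted indices (into the original array) of the retained images.
    """
    idx = np.arange(len(features))
    for _ in range(rounds):
        centroid = features[idx].mean(axis=0)               # class feature center
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        keep = np.argsort(dists)[:top_n]                    # closest to centroid
        idx = idx[keep]
    return np.sort(idx)
```

Running this per class and then randomly sampling classes yields a cleaned training set in the spirit of the procedure described above.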

Pipeline architecture
The architecture of the pipeline is shown in Figure 4. First, the feature extraction network is constructed to extract image features. Second, the features are aggregated, and a similarity function is employed to retrieve street view image results. Finally, the KDP method is used to estimate the location.

Network architecture
An end-to-end CNN is built to extract features from the street view images and the news pictures, as shown in Figure 5. The first part of the network is an FCN used to extract dense features from images. In this paper, the FCN is constructed by removing the final pooling and fully connected layers of ResNet101 (He et al. 2016). The feature selection module (HOW module) (Tolias, Tomas, and Ondřej 2020) after the FCN extracts and selects distinguishing features based on the score of the attention layer; it comprises attention w(·), smoothing h(·) and whitening o(·). In the training and feature extraction steps, different strategies are used to represent the features at the output of the network.
For an input image I ∈ R^(W′ × H′ × 3), the output of the FCN is a 3D tensor of activations F ∈ R^(W × H × D), which can also be viewed as a set of W × H local features v ∈ R^D. The attention function w(v) = ||v|| is used to weight each l2-normalized local feature v, and its parameters are fixed. In feature extraction, the first n features with the highest scores are kept, and weaker features are discarded. The smoothing function h(·) aligns large activation values of the features across channels; it is applied as M × M average pooling, and its parameters are fixed without training. The whitening function o(·) integrates mean subtraction, whitening, and dimensionality reduction (R^D → R^(D′)) and is implemented by a 1 × 1 convolution with bias; its parameters are estimated from a set of local features before network training and kept fixed during training. The average pooling kernel of the smoothing layer is set to 3 × 3, and the number of channels in the whitening layer is 128, so the network outputs W × H D′-dimensional features (D′ = 128).
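Under the definitions above, the HOW-style selection can be sketched in numpy as follows. The exact ordering of attention scoring relative to smoothing, the zero-padding choice, and the names `how_select`, `P` and `b` are our assumptions for illustration.

```python
import numpy as np

def how_select(fmap, P, b, n):
    """HOW-style local feature selection on an FCN activation map.

    fmap: (H, W, D) activations; P: (D, Dp) whitening matrix; b: (Dp,) bias.
    Attention w(v) = ||v|| scores each location; smoothing h(.) is a fixed
    3x3 average pool; whitening o(.) is a 1x1 conv, i.e. a matrix P plus bias b.
    Returns the n strongest whitened descriptors and their (row, col) positions.
    """
    h, w, d = fmap.shape
    # 3x3 average pooling with zero padding (the fixed smoothing layer h(.)).
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))
    smooth = np.zeros_like(fmap)
    for dy in range(3):
        for dx in range(3):
            smooth += padded[dy:dy + h, dx:dx + w]
    smooth /= 9.0

    scores = np.linalg.norm(smooth, axis=-1)           # attention w(v) = ||v||
    flat = np.argsort(scores, axis=None)[::-1][:n]     # top-n locations
    pos = np.stack(np.unravel_index(flat, (h, w)), axis=1)
    desc = smooth[pos[:, 0], pos[:, 1]] @ P + b        # whitening o(.)
    return desc, pos
```

In the real network these operations run as fixed layers on GPU; the sketch only shows the data flow.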

Training strategies
In this step, the output of the network is aggregated by sum-pooling of convolutional features (SPoC) for efficient training, which is given by:

ψ(F) = Σ_v w(v) · o(h(v))

where v is a local feature, w(v) is the attention weight, and o(h(v)) is the corresponding feature after the smoothing and whitening layers. The SPoC descriptor turns the local features into a 1 × 1 × D′ vector used to compute the loss. The contrastive loss used to optimize the network parameters is given by:

L(d, y) = y · d² + (1 − y) · max(0, margin − d)²

where d is the feature distance between samples and y labels whether the samples match; y is 1 if they belong to the same category and 0 otherwise. The margin threshold is set to 0.7. A network trained only on the street view dataset lacks the ability to recognize the location information of images with confusing content, so GLD is used as the training dataset to optimize the network parameters and improve generalization performance. The aim of the training process is to transfer the model to the task of news picture localization rather than to improve accuracy on GLD. Figure 6 shows the loss curve on the training dataset, and the hyperparameters are set as follows to avoid over-fitting. The optimizer is Adam with an initial learning rate of 1 × 10⁻⁵ and a weight decay of 1 × 10⁻⁴. An exponential learning rate decay with a factor of 0.99 is employed, and the networks are trained for 100 epochs.
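The contrastive loss on a pair of SPoC descriptors, in its simple form without constant factors, can be written directly as:

```python
import numpy as np

def contrastive_loss(f_q, f_s, y, margin=0.7):
    """Contrastive loss for a pair of descriptors with match label y (1/0).

    Matching pairs are pulled together (loss d^2); non-matching pairs are
    pushed beyond the margin (loss max(0, margin - d)^2).
    """
    d = np.linalg.norm(f_q - f_s)                       # Euclidean distance
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2
```

Some formulations include a factor of 1/2 on each term; that only rescales gradients and does not change the optimum.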
Before each epoch of network optimization, a series of tuples is randomly generated based on the image class labels to feed into the network and calculate the loss, as shown in Figure 7. Each tuple consists of a query image, a positive sample and several negative samples. The query image is randomly selected from the training dataset as the benchmark of the tuple, and the positive sample is randomly selected from images with the same label. The top n images most similar to the query image are selected from the negative sample pool as negative samples, where the pool is composed of randomly selected negative candidates. To match query images, SPoC descriptors are extracted for the images in the negative sample pool and sorted by Euclidean distance to the query.
The number of tuples can be adjusted according to the time needed for extracting features and mining negative samples; a range of 2,000-5,000 is reasonable (Tolias, Tomas, and Ondřej 2020;Berton et al. 2022). In this paper, a total of 2,500 tuples were generated, and 20,000 images were randomly selected before each iteration to extract SPoC descriptors as the negative sample pool. Each tuple is generated by selecting, from the pool, the first 5 images most similar to the query image that belong to different classes as negative samples. The batch size of each iteration is 5. Before each iteration, the images in the negative sample pool are randomly selected again to reconstruct the tuples. The network is trained on GLD, which has annotated category labels: landmark images of the same class are used as positive samples, and images of different classes are used as negative samples.
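The hard-negative mining step can be sketched as follows. `mine_tuple` is an illustrative helper that assumes SPoC descriptors and class labels are precomputed; enforcing at most one negative per class is one common convention, whereas the paper only states that negatives belong to different classes.

```python
import numpy as np

def mine_tuple(q_idx, spoc, labels, pool_idx, n_neg=5):
    """Pick the n_neg hardest negatives for one query from a random pool.

    spoc: (N, d) SPoC descriptors; labels: (N,) class ids; pool_idx: indices
    of the candidate negatives. Returns the indices of the pool images closest
    to the query in Euclidean distance that belong to a different class,
    at most one image per class.
    """
    d = np.linalg.norm(spoc[pool_idx] - spoc[q_idx], axis=1)
    order = pool_idx[np.argsort(d)]          # pool sorted by distance to query
    negs, used = [], {labels[q_idx]}
    for i in order:
        if labels[i] not in used:            # different class than query/negatives
            negs.append(i)
            used.add(labels[i])
        if len(negs) == n_neg:
            break
    return negs
```

The full tuple is then (query, random positive from the same class, mined negatives), regenerated from a fresh random pool each iteration.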

Feature extraction
In this step, each image is resampled at multiple scales and fed into the network to generate features, with the scale factors set to [2.0, 1.414, 1.0, 0.707, 0.5, 0.353, 0.25]; the dense features of the image are then extracted. The first 1,000 local features, each with 128 dimensions, are selected in descending order of attention score. The feature scale, feature location and attention value corresponding to each local feature are also recorded. The feature location is calculated as the center pixel of the receptive field of the FCN corresponding to the local feature.
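The multi-scale bookkeeping can be sketched as below. The per-scale outputs are assumed to come from the HOW module, and mapping positions back to original-image pixels by dividing by the scale factor is a simplification of the receptive-field-center computation.

```python
import numpy as np

SCALES = [2.0, 1.414, 1.0, 0.707, 0.5, 0.353, 0.25]

def top_local_features(per_scale, n=1000):
    """Merge per-scale feature outputs and keep the n highest-attention ones.

    per_scale: list of (descs, positions, scores, scale) tuples, one per image
    scale. Positions are mapped to original-image pixels by dividing by the
    scale factor. Returns descriptors, positions, scales and scores, each
    ordered by descending attention score.
    """
    descs = np.concatenate([d for d, _, _, _ in per_scale])
    pos = np.concatenate([p / s for _, p, _, s in per_scale])
    scores = np.concatenate([a for _, _, a, _ in per_scale])
    scales = np.concatenate([np.full(len(a), s) for _, _, a, s in per_scale])
    keep = np.argsort(scores)[::-1][:n]       # descending attention order
    return descs[keep], pos[keep], scales[keep], scores[keep]
```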

Feature aggregation
The feature aggregation method transforms local features into compact vectors for fast matching. In this paper, VLAD (Arandjelović and Zisserman 2013) is implemented to aggregate a k × d feature for each image. The local features of an image are denoted by n d-dimensional vectors X = {x₁, . . . , xₙ}. The features are quantized by a K-means quantizer q(x) into k centroids X_c = {x_c1, . . . , x_ck}, producing a k × d matrix of summed residuals. In other words, each local feature of the image is assigned to its closest centroid. Then, the sum of the residuals between the local features and the corresponding centroid is calculated to obtain a 1 × d vector, and the vectors of the k centroids are stacked to form the k × d aggregated feature. The VLAD descriptor can be represented as:

V(X) = [V₁, . . . , V_k], with V_j = Σ_{x: q(x) = x_cj} (x − x_cj)

where x − x_cj are the residuals between the original local features and the K-means centroids. To obtain the k centroids, the local features of 1/10 of the images in the reference dataset are randomly selected to train the feature clustering, and the number of centroids is set to 262,144. This paper uses Faiss (Johnson, Douze, and Jégou 2019) to implement this training process and generate the feature codebook composed of k centroids, which is used to aggregate the features of news pictures and of the street view images in the reference dataset. Since the result of feature retrieval is a feature ranking according to similarity, an inverted index table is constructed to query the corresponding street view images by feature. This paper generated a dictionary of 'feature': 'image' key-value pairs to build the inverted index file.
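A minimal VLAD aggregation consistent with this description (brute-force assignment in place of Faiss, and a tiny k) might look like:

```python
import numpy as np

def vlad(local_feats, centroids):
    """Aggregate n d-dim local features into a k x d VLAD matrix.

    Each feature is assigned to its nearest centroid, and the residuals to
    that centroid are summed per cell. The codebook (k-means centroids,
    trained with Faiss in the paper) is assumed given as `centroids` (k, d).
    """
    k, d = centroids.shape
    # Nearest centroid for every local feature (brute-force L2 assignment).
    d2 = ((local_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    V = np.zeros((k, d))
    for i, c in enumerate(assign):
        V[c] += local_feats[i] - centroids[c]   # sum of residuals per centroid
    return V
```

With k = 262,144 as in the paper, the assignment step would of course be done with an approximate nearest-neighbour index rather than this O(nk) loop.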

Image retrieval
This paper employs a similarity function to measure feature distance. For the aggregated features X_c and Y_c of two images, the similarity function is implemented by the aggregated selective match kernel (ASMK) (Tolias, Avrithis, and Jegou 2016), which is given by:

s_a(X_c, Y_c) = Σ_c σ_α(u_c), with σ_α(u) = sign(u) · |u|^α if u > τ and 0 otherwise

where α = 3 and τ ≥ 0; s_a(·) is the similarity kernel between the two images, in which the matching kernel is applied to the features of each centroid, and u_c is the dot product of the normalized aggregated vectors of centroid c between the two images. The similarity function removes mismatched features and enhances the impact of similar features through the power operation. The nonlinearity keeps only positive values, meaning that the angle between the descriptors of that centroid lies in [0°-90°] and they are more similar in vector space. The normalized aggregated features V̂(X_c) are computed offline. For each aggregated news picture feature, the similarity function is evaluated against the street view feature dataset to return results ranked by score, and the street view image results and locations are then returned according to the inverted index table.
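The per-centroid selectivity of ASMK can be sketched as follows; `asmk_similarity` is an illustrative name, and the rows of the inputs are assumed to be the L2-normalized aggregated vectors produced in the offline step.

```python
import numpy as np

def asmk_similarity(Vx, Vy, alpha=3, tau=0.0):
    """ASMK-style similarity between two per-centroid aggregated features.

    Vx, Vy: (k, d) matrices whose non-empty rows are L2-normalized.
    For each centroid the dot product u is passed through the selectivity
    function sigma(u) = sign(u)|u|^alpha if u > tau else 0, which suppresses
    weak matches and boosts strong ones.
    """
    u = (Vx * Vy).sum(axis=1)                 # per-centroid dot products
    sel = np.where(u > tau, np.sign(u) * np.abs(u) ** alpha, 0.0)
    return sel.sum()
```

With alpha = 3 a near-perfect match (u close to 1) keeps almost all of its weight, while a half-match (u = 0.5) contributes only 0.125, which is the selective behaviour described above.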

Image location prediction
For most news picture queries, a series of correct results is returned by retrieval, but not necessarily as the Top 1 result. Some of these results are generated from the same panorama and therefore have duplicate locations. Other positive retrieval results are spatially close due to the similarity of adjacent street views. Using the Top 1 retrieval result as the location of the query image ignores this situation. Another method is to sum the similarity values at the same location and select the top-ranked result as the location of the query image, but this cannot account for spatially adjacent correct results. Moreover, results generated by interpolation depend strongly on the original points, and the local maxima of the interpolation surface largely repeat the retrieval results. As shown in Figure 8, the kernel density prediction method, which considers both the spatial distribution and the similarity values of the retrieval results, is implemented in this paper to estimate the position of the query image from the retrieval results. The method is based on the assumption that the retrieval results contain correct matches. The KDP method aggregates spatially adjacent points:

S(x, y) = Σᵢ sᵢ, for √((x − xᵢ)² + (y − yᵢ)²) < r

where S(x, y) is the accumulated similarity at coordinates (x, y), sᵢ is the similarity score of result i at (xᵢ, yᵢ), and r denotes the search radius. The Top 100 retrieval results are considered in the location estimation, and the query radius r is set to 150 m.
The local maxima of the kernel density method are extracted by focal statistics as the location of the query images.
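The KDP surface can be sketched as below. The paper does not specify the kernel shape, so a quartic kernel within the search radius is used here as one plausible choice; peak extraction by focal statistics is left out.

```python
import numpy as np

def kdp_surface(points, sims, grid_x, grid_y, r=150.0):
    """Similarity-weighted kernel density surface over retrieval results.

    points: (n, 2) result coordinates in a projected system (metres);
    sims: (n,) similarity scores; grid_x, grid_y: 1D grid axes.
    Each result contributes its similarity, attenuated by a quartic kernel,
    to all grid cells within radius r. Peaks of the returned grid serve as
    predicted locations.
    """
    xx, yy = np.meshgrid(grid_x, grid_y)
    S = np.zeros_like(xx, dtype=float)
    for (px, py), s in zip(points, sims):
        d2 = (xx - px) ** 2 + (yy - py) ** 2
        inside = d2 < r ** 2
        S[inside] += s * (1 - d2[inside] / r ** 2) ** 2   # quartic kernel
    return S
```

Because duplicate and spatially adjacent positives all fall within one radius, their contributions stack into a single peak, which is exactly the aggregation behaviour the KDP method relies on.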

Evaluation metrics
This paper adopts the distance-based method from Zheng, Han, and Sun (2018) and further defines a ranking-based accuracy to evaluate the predicted location accuracy. Let q = {q₁, . . . , qₙ} be the set of all news pictures for prediction. The method being evaluated predicts a location l(qᵢ) for picture qᵢ, which is expected to be as close as possible to the ground-truth location l*(qᵢ). The error distance (ED) is the Euclidean distance between l(qᵢ) and l*(qᵢ):

ED(qᵢ) = ||l(qᵢ) − l*(qᵢ)||

where the original coordinate system of the street view images is WGS-1984, as is that of the manually located news pictures; ED(qᵢ) is computed after transforming the coordinates into the Hong Kong 1980 Grid System (EPSG:2326). For the distance-based evaluation, a threshold d on the error distance is predefined, reflecting the tolerance for geo-localization errors. If ED(qᵢ) is less than d, the prediction l(qᵢ) is considered correct. The distance-based accuracy (Acc@d) is the proportion of correct predictions:

Acc@d = (1/n) Σᵢ 1[ED(qᵢ) < d]

The method in this paper predicts a ranking list L(qᵢ) rather than treating the Top 1 result as the only predicted location and ignoring the other locations on the list; the ranking list can provide candidate locations for further applications. Thus, a ranking-based accuracy within distance d (Acc@d, k) is further defined. If the ED between the ground-truth location and any of the Top k results L_k(qᵢ) is less than d, the prediction is regarded as correct:

Acc@d, k = (1/n) Σᵢ 1[min_{l ∈ L_k(qᵢ)} ||l − l*(qᵢ)|| < d]
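The ranking-based metric can be computed as in this sketch, assuming the predicted candidates and ground truths are already projected into a metric coordinate system such as EPSG:2326:

```python
import numpy as np

def acc_at_d_k(pred_lists, gts, d, k):
    """Ranking-based accuracy Acc@d,k.

    pred_lists: per-query list of (x, y) candidates ranked by score, in
    projected coordinates (metres); gts: per-query ground-truth (x, y).
    A query counts as correct if any of its top-k candidates lies within
    d metres of the ground truth.
    """
    correct = 0
    for preds, gt in zip(pred_lists, gts):
        eds = [np.hypot(px - gt[0], py - gt[1]) for px, py in preds[:k]]
        if eds and min(eds) < d:
            correct += 1
    return correct / len(gts)
```

Setting k = 1 reduces this to the plain distance-based accuracy Acc@d.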

Results
This paper performed the geo-localization experiment on 81 news pictures on the theme of 'Hong Kong's Extradition Turmoil'. First, the feature extraction network is constructed to obtain local image features with location information. Second, aggregated features are computed, and the similarity function is employed to retrieve street view image results. Finally, the KDP method is applied to the top 100 retrieval results to obtain the predicted location. The CNNs are built with PyTorch, and the experiments were run on an Ubuntu 20.04 system equipped with an RTX 3090 GPU, an Intel i9-10900K CPU and 64 GB of RAM. Table 1 shows the results of the proposed pipeline on the full street view dataset, with the kernel density radius r set to 150 m. The local maxima of the KDP are sorted and composed into a list L_k, whose locations are used to calculate ED. Approximately half of the news pictures return correct Top 1 results with a 50 m threshold d (Acc@50, 1). A total of 22.22% of the image location results have errors between 50 and 300 m, and 27.16% have errors of more than 300 m. The accuracy improvement from Acc@d, 1 to Acc@d, 20 increases as d increases from 50 m to 300 m: Acc@50, 20 is 7.4% better than Acc@50, 1, while Acc@300, 20 is 24.69% better than Acc@300, 1.
Because the query images are distributed mainly in northern Hong Kong Island, the street view images in this area were selected for a subexperiment to analyze performance at different dataset sizes. The subexperimental area includes 6,650 panoramas, as shown in Figure 9. Table 2 shows the subexperimental results of the proposed pipeline, with the kernel density radius again set to 150 m. Compared with the results in Table 1, the pipeline shows stable performance in the large-scale urban environment. For example, the Acc@300, 1 in Table 1 (72.84%) is only 1.23% lower than the Acc@300, 1 in Table 2 (74.07%).
In addition, a comparison experiment with GFS (Chu et al. 2020), a method for the geo-localization of street view images, was performed on the subdataset, considering that the model takes a long time for feature extraction on large-scale datasets. Table 3 shows the subexperimental results of the GFS method. Overall, the pipeline in this paper improves significantly over GFS for k ≤ 10; for example, Acc@50, 1 improves by 17.29%, from 33.33% to 50.62%. The proposed pipeline is slightly higher than or equal to the GFS method for 10 < k ≤ 20. Moreover, experiments using the global descriptor methods GeM (Radenović, Tolias, and Chum 2019) and NetVLAD (Arandjelović et al. 2016) were also performed on the subdataset; Tables 4 and 5 show the results. The performance of the global descriptor methods is inferior because they are affected by noisy objects in the image, which prevents features with location information from being represented in the overall descriptor. Table 6 compares the statistics of the ED of the Top 1 results of the methods. Compared with GFS, the median error is reduced by 82.27 m, from 130.04 m to 47.77 m. The mean and standard deviation of the ED of this paper's pipeline are also lower than those of the GFS method. As shown in Figure 10, the improvement in ED is computed for each image as the residual between GFS and our method:

ED_r(qᵢ) = ED_GFS(qᵢ) − ED_ours(qᵢ)

and the images are reordered according to the residuals. For 34.57% of the images, the ED_r improvement is insufficient (ED_r < 0); for 27.16%, ED_r lies in [0, 50]; and for 35.80%, ED_r shows a significant improvement (ED_r > 50). Figure 10 also shows examples of query images for the different ED distribution intervals. The ED_r > 50 examples contain nighttime or low-light images, which also demonstrates that HOW-KDP is more robust to changes in illumination.
In addition, the mean and standard deviation of the ED are strongly influenced by extreme values. As shown in Figure 10, a small portion of images are localization failures with extremely high ED, resulting in a high mean and standard deviation of the ED.

The feature extraction network analysis
The performance of the network used in this paper was analyzed from two aspects. First, the feature map of the attention layer was visualized, which shows that the network can extract local features of images with location information. Figure 11 shows the attention layer visualizations of example news images at the scales [2.0, 1.0, 0.5, 0.25]. The yellow highlighted parts of the images indicate high attention scores. Affected by the shooting perspective, the upper part of the image is more prominent, corresponding to visual content related to buildings, while roads receive lower attention. Images (1) and (2) correspond to recognizable buildings, and when a building is the main visual content of the image, the network focuses more on the building outline, as in images (3) and (4). In the feature maps of small-scale images, local features tend to represent larger regions rather than building edges or small portions of landmarks. Next, the recall capability of the network was analyzed. In image-retrieval-based geo-localization methods, identifying and retrieving positive results is the basis for subsequent localization; only if positive results can be retrieved can subsequent approaches such as KDP improve the localization. Table 7 shows the results of the retrieval method from Step 2 of Section 3, where the coordinates of the street view images are regarded directly as the locations of the results for computing ED. The locations in

Kernel density prediction analysis
The performance of different query radii in the KDP method is analyzed first. Figure 12 shows Acc@d, k for different kernel density radii with the threshold d taken from 50 m to 300 m; the overall trend is that Acc@d, k first increases and then decreases with increasing kernel density radius. The highest Acc@d, k is reached when the radius is between 50 m and 200 m. A radius that is too large or too small may harm Acc@d, k. One reason lies in the characteristics of the street view images: the landmark buildings in the query images may appear in street view images at a range of 100 m or even farther, so the correct retrieval results are distributed along extended roads. Since the distance between adjacent street view images is 10-12 m, a small radius can hardly aggregate spatially adjacent retrieval results. This paper then analyzes the improvement of the KDP method in terms of ED. Compared with the retrieval method in Table 7, the Acc@d, 1 of the KDP method in Table 1 improves further for thresholds d from 50 m to 300 m. In addition, the Acc@d, k of the KDP method improves less with increasing k and is lower than or equal to that of the retrieval method for some k. The KDP clusters multiple spatially adjacent images into single points, of which there are finitely many in the prediction plane, resulting in insufficient correct candidate points in the ranking list L(qᵢ). Figure 13 illustrates Top 1 positive result samples of query images, with the ED of the retrieval method and the KDP method shown below the result images. The purpose of the KDP method is to estimate the location, so the Top 1 images shown are the street view images nearest to the predicted location. The location given by the retrieval method is the street view coordinate, so ideally the minimum ED is the distance between the query image and the nearest street view image. However, adjacent street view images with similar scenes are also returned, which can increase the geo-localization error.
Thus, the KDP method extracts the spatial focus of the retrieved results as the predicted location. In general, although the ED of the KDP method is larger than that of the retrieval method in some cases, the ED of most predicted locations is tolerable.
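The idea behind the KDP method can be sketched as a weighted kernel density estimate over the coordinates of the retrieved street view images, with the predicted location taken at the density maximum. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the similarity-score weighting, and the Gaussian kernel form are assumptions for demonstration.

```python
import numpy as np

def kdp_predict(coords, scores, radius=150.0):
    """Estimate a query location from retrieved street view coordinates.

    coords : (n, 2) candidate (x, y) positions in metres.
    scores : (n,) retrieval similarities used as kernel weights (assumed).
    radius : kernel bandwidth in metres (50-200 m performs best in the paper).
    Returns the candidate position with the highest weighted density.
    """
    coords = np.asarray(coords, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Pairwise squared distances between all candidate positions.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    # Gaussian kernel density at each candidate, weighted by similarity.
    density = (scores[None, :] * np.exp(-d2 / (2 * radius ** 2))).sum(axis=1)
    return coords[int(np.argmax(density))]

# Three candidates cluster near (100, 100); one high-scoring outlier far away.
coords = [(95, 102), (105, 98), (100, 110), (900, 50)]
scores = [0.9, 0.8, 0.7, 0.95]
print(kdp_predict(coords, scores))  # a point inside the (100, 100) cluster
```

The mutually reinforcing cluster outweighs the isolated outlier even though the outlier has the highest individual score, which mirrors how spatially aggregated retrieval results stabilize the predicted location.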

Error distance analysis and pipeline limitations
As shown in Figure 14, the number of images is most concentrated at ED < 50 m and then gradually decreases until another increase at ED ≥ 1000 m. The query images with ED < 50 m account for approximately half of the total, with high geo-localization accuracy, and 77.78% of the query images have ED < 750 m. The images with ED ≥ 1000 m can be regarded as localization failures, caused by retrieval failure in most cases. The example results in Figure 14 also show that the failed query images have less obvious building features in the visual content, or the marching crowd interferes with and obscures the scene. Another reason for failure is the difference in shooting angles. Some of the images were taken from the upper floors of a building or from the side of a building, which differs greatly from the street view images in the reference dataset. Figure 14 shows that 22.22% of the images failed to localize at k = 1 (ED > 1000 m). These images can be localized by manual recognition of subsequent candidate images or by improving the accuracy of the retrieval method in future work. For example, the robustness of the model to view changes and its localization ability can be improved by generating a multiview building training dataset from spatially adjacent street view images.
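The accuracy statistics above follow from the Acc@d,k measure: the fraction of query images for which at least one of the top-k candidates lies within distance d of the ground truth. A minimal sketch of this metric, under the assumption that each query's candidates are ranked and their error distances known:

```python
import numpy as np

def acc_at_d_k(error_dists, d, k):
    """Fraction of queries with at least one top-k result within d metres.

    error_dists : (n_queries, n_ranked) matrix of error distances, each row
                  ordered by retrieval rank (rank 1 first).
    d           : distance threshold in metres.
    k           : number of top-ranked candidates considered.
    """
    top_k = np.asarray(error_dists, dtype=float)[:, :k]
    return float((top_k.min(axis=1) <= d).mean())

# Two queries: the first succeeds at rank 1, the second only at rank 2.
ed = [[30.0, 400.0, 900.0],
      [600.0, 45.0, 1200.0]]
print(acc_at_d_k(ed, d=50, k=1))  # 0.5
print(acc_at_d_k(ed, d=50, k=2))  # 1.0
```

Raising k can only increase Acc@d,k for a fixed ranking, which is why the analysis separates failures at k = 1 from images recoverable via lower-ranked candidates.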
There are still some limitations in the proposed pipeline. Neither the existing ground-to-ground localization datasets nor the Hong Kong street view imagery is used as the training set. Models trained on the former perform impressively on street view datasets of other regions, but they do not generalize well to other image types, i.e. news pictures. The mediocre performance of the latter may be related to image quality and the sampling strategy used in the training process. Therefore, the GLD is used to train the network. In the future, general methods for generating training datasets from the urban landmark categories and street view images of the study area will be studied, with the expectation of learning building features in a weakly supervised manner. Another approach is to improve the generalizability of models trained on existing geo-localization datasets so that they can be applied to other scenarios. A further limitation is that KDP performance relies on the image retrieval results, because the location is derived from them. The pipeline can be optimized further, and an end-to-end model will also be explored in the future. In addition, since the time to extract image features increases proportionally with the number of images, the street view reference dataset will be thinned in future work to reduce the offline time spent on feature extraction.

Conclusions
This paper proposes a geo-localization pipeline based on deep learning, which uses street view images as the reference dataset to assign geotags to news pictures. The proposed pipeline employs CNNs with HOW modules to extract local image features, which are aggregated into a compact vector by VLAD for fast retrieval. ASMK is used as the similarity function for image matching in the large-scale news image retrieval task. Finally, the KDP method estimates the location of the news pictures. Experiments are conducted on news pictures about 'Hong Kong's Extradition Turmoil'. In the comparison experiments, the pipeline shows stable performance in the large-scale urban environment and improves the geo-localization accuracy of news pictures under 50 m from 33.33% to 50.62% compared with the previous method (GFS). The median error distance is reduced from 130.04 m to 47.77 m. The average and standard deviation of the errors are also lower than those of the GFS.
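The VLAD aggregation step mentioned above can be sketched in a few lines: local descriptors are assigned to their nearest visual word, residuals are accumulated per word, and the result is flattened and L2-normalized. This is a generic VLAD sketch, not the paper's exact configuration; the descriptor and codebook sizes are illustrative.

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors into a single compact VLAD vector.

    descriptors : (n, d) local features (e.g. from the HOW module).
    centroids   : (k, d) visual-word codebook learned offline.
    Returns an L2-normalized vector of length k*d.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    k, d = centroids.shape
    # Assign each descriptor to its nearest centroid (visual word).
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = np.argmin(d2, axis=1)
    # Accumulate residuals (descriptor minus centroid) per visual word.
    v = np.zeros((k, d))
    for i, c in enumerate(assign):
        v[c] += descriptors[i] - centroids[c]
    v = v.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

rng = np.random.default_rng(0)
emb = vlad(rng.normal(size=(200, 8)), rng.normal(size=(16, 8)))
print(emb.shape)  # (128,)
```

The fixed-length output makes nearest-neighbor search over the street view reference dataset efficient regardless of how many local features each image produces.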
The main work of this paper is the geo-localization of news pictures with complex content in an urban environment. The portability of the model to other types of regions and images will be investigated in the future, especially in non-landmark street views with similar buildings. Moreover, applications of the method will be further explored. For example, the urban canyon effect often occurs around urban landmarks or building clusters and degrades GPS accuracy; localization in these weak-GPS areas could be improved by the proposed method. For non-landmark street views, the localization results often include a series of visually similar locations that can be used to analyze association or similarity patterns within a city.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.