Coupling ground-level panoramas and aerial imagery for change detection

Abstract Geographic landscapes all over the world may undergo rapid changes induced, for instance, by urban, forest, and agricultural evolution. Such changes are usually monitored through remote sensing. However, obtaining regular and up-to-date aerial or satellite images is costly, which prevents regular updating of land cover maps. As an alternative, we propose in this paper a low-cost solution based on ground-level, geo-located panoramic landscape photos that provide high spatial resolution information about the scene. Such photos can be acquired from various sources: digital cameras, smartphones, or even web repositories. Furthermore, since the acquisition is performed at ground level, the user's immediate surroundings, as sensed by a camera device, can provide information at a very high level of precision, making it possible to update the land cover type of the geographic area. In the proposed method, we use inverse perspective mapping (inverse warping) to transform the geo-tagged ground-level 360° photo into a top-down view, as if it had been acquired from a nadir aerial view. Once re-projected, the warped photo is compared to a previously acquired remotely sensed image using standard techniques such as correlation. Wide differences in orientation, resolution, and geographical extent between the top-down view and the aerial image are addressed through specific processing steps (e.g. registration). Experiments on publicly available data-sets made of both ground-level photos and aerial images show promising results for updating land cover maps with mobile technologies. Finally, the proposed approach contributes to crowdsourcing efforts in geo-information processing and mapping, providing hints on the evolution of a landscape.


Introduction
The concept of Volunteered Geographic Information (VGI) refers to involving human volunteers in gathering photo collections that can be further used to feed geographical information systems. In fact, every human is able to act as an intelligent sensor, equipped with such simple aids as GPS and camera or even the means of taking measurements of environmental variables. As stated by Goodchild (2007), the notion that citizens might be useful and effective sources of scientifically rigorous observations has a long history, and it is only recently that the scientific community has come to dismiss amateur observation as a legitimate source.
Over the past few years, VGI has become more widely available, for instance through web services. A range of new applications is being enabled by the georeferenced information contained in "repositories" such as blogs, wikis, social networking portals (e.g. Facebook or MySpace), and, more relevant to the present work, community-contributed photo collections (e.g. Flickr (https://www.flickr.com/) or Panoramio (http://www.panoramio.com/)). The main advantage of VGI is its temporal coverage, which is often better than that of traditional sources in terms of both frequency and latency. However, this comes with a loss in data quality, since user inputs are usually made available without review and without metadata (e.g. data source and properties). Georeferenced photo collections are enabling a new form of observational inquiry, termed "proximate sensing" by Leung and Newsam (2010). This concept denotes the use of ground-level images of close-by objects and scenes rather than images acquired from airborne or satellite sensors. Since large collections of georeferenced photos have become available through photo-sharing websites, researchers have investigated how these collections can support a number of applications. According to Leung and Newsam (2010), research in this context can be classified into two main categories: (i) using location to infer information about an image and (ii) using images to infer information about a geographical location. In the first category, methods for clustering and annotating photos have been proposed (Moxley, Kleban, and Manjunath 2008; Quack, Leibe, and Van Gool 2008). Images are labeled based on their visual content as depicting events or objects (landmarks). Other approaches, such as Hays and Efros (2008), attempt to estimate the unconstrained location of an image based solely on its visual characteristics and on a reference data-set.
CONTACT Sébastien Lefèvre sebastien.lefevre@irisa.fr
In the second category, some researchers have addressed the problem of describing features of the surface of the Earth. Examples of work in this area include: using large collections of georeferenced images to discover interesting properties of popular cities and landmarks, such as the most photographed locations (Crandall et al. 2009); creating maps of developed and undeveloped regions (Leung and Newsam 2010), where the difficulty lies in the non-uniform spatial coverage of image collections; computing country-scale maps of scenicness based on the visual characteristics and geographic locations of ground-level images (Xie and Newsam 2011); and, more recently, recognizing geo-informative attributes of a photo (e.g. elevation gradient, population density, demographics) using deep learning (Lee, Zhang, and Crandall 2015; Workman, Souvenir, and Jacobs 2015). Although Xie and Newsam (2011) demonstrated the feasibility of geographic discovery from georeferenced social media, they also reported that the obtained results were noisy.
The work presented in this paper is closely related to Leung and Newsam (2010) and Xie and Newsam (2011), since we explore the content of georeferenced photos to infer geographic information about the locations at which they were taken. However, we do not exclude available aerial or satellite images. Instead, we propose to use them in conjunction with recently available ground-level images. The purpose of this work is therefore to update and check existing maps (built with standard remote sensing techniques) based on change detection performed with available ground-level images. We thus investigate the application of proximate sensing to the problem of land cover classification. Rather than using only airborne/satellite imagery to determine the distribution of land cover classes for a given geographical area, we explore whether ground-level images can be used as a complementary data source. To do so, we present preliminary work that compares recently acquired ground-level images to a previously acquired remotely sensed image using standard computer vision and image analysis techniques. In this context, our work shares some similarity with Murdock, Jacobs, and Pless (2013), who propose to use webcam videos together with satellite imagery to estimate cloud maps. However, in contrast to these previous studies, we do not use aerial imagery solely for training the ground-based image recognition process. Our goal is rather to compare the two available data sources.
The remainder of this paper is organized as follows: Section 2 describes the study area and the data-set considered in the experiments. The technical approach is presented in Section 3. We report the experiments and discuss the obtained results in Section 4 before providing some conclusions and directions for future research.

Study area and data set
The study area covers several cities in France: Vannes, Rennes, and Nantes in Brittany, and Dijon in Burgundy. For the sake of conciseness, we provide visual illustrations for the city of Vannes only, but the reported accuracy covers the whole data-set. Experiments on Vannes focused on the surroundings of the Tohannic Campus, which hosts Université Bretagne Sud and the IRISA research institute, with which the authors are affiliated. This choice is motivated by: (i) the availability of ground truth that can be assessed by in situ observations and (ii) the appearance of many new buildings over the last few years (with data available both before and after these changes). It covers a 1 km² area. The geographical extent is provided in Figure 1.
Ground-level images were either retrieved from Google Street View (https://www.google.com/maps/streetview/) or taken in situ by people involved in this work and equipped with a mobile camera. Both kinds of images are panoramic views covering a 360° (resp. 180°) field of view horizontally (resp. vertically). We assume here the following scenario: given that the acquired image is georeferenced, it is possible to download an associated map from existing sources (Bing Maps, Google Maps, and OpenStreetMap). We consider here maps of 150 × 150 m² downloaded through a Bing Maps (https://www.bing.com/maps/) request according to the measured GPS position.
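As an illustration of how a 150 × 150 m extent can be derived from a measured GPS position, the following minimal sketch converts the side length into latitude and longitude offsets. It assumes a spherical Earth and a small-area approximation; the function name `map_extent`, the constant `EARTH_RADIUS_M`, and the example coordinates are hypothetical and are not taken from the original work, which does not detail how the Bing Maps request is built.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumption: mean Earth radius, spherical model


def map_extent(lat_deg, lon_deg, side_m=150.0):
    """Return (south, west, north, east) in degrees for a square of
    side_m metres centred on the given position (small-area approximation)."""
    half = side_m / 2.0
    # Angular half-size along a meridian.
    dlat = math.degrees(half / EARTH_RADIUS_M)
    # Along a parallel, the circle radius shrinks with cos(latitude).
    dlon = math.degrees(half / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg))))
    return (lat_deg - dlat, lon_deg - dlon, lat_deg + dlat, lon_deg + dlon)


# Illustrative coordinates only (roughly in the Vannes area).
south, west, north, east = map_extent(47.64, -2.75)
```

The resulting bounding box could then be passed to whichever map service is used; the exact request format depends on that service and is not covered here.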
For the sake of clarity, we denote the images with the following terms in the sequel:
• A: Aerial image, or high-flying UAV image (dimensions m × n).
• P: Panoramic image, or wide field-of-view image from a user's mobile device or Google Street View (dimensions p × q).
• T: Top-down image, or bird's eye view of the ground (dimensions r × r).
Beyond the city of Vannes, for which it was possible both to build ground truth from in situ observation and to acquire panoramic photos through crowdsourcing activities, we also consider three other data-sets to evaluate the robustness of our method. For these other cities, panoramic photos were retrieved from Google Street View and subsequently used to build ground truth through visual interpretation.

Proposed method
Since the images were taken by up to three different kinds of sensors (Google Street View's car, user's camera, and aerial vehicle), several image preprocessing steps are required before change detection can be performed. The flowchart of the proposed method including these different preprocessing steps is given in Figure 2.

Top-down view construction
The panorama images used in this work cover a (360°, 180°) field of view in the (horizontal, vertical) dimensions. For a given scene, the panorama image P is warped to obtain a bird's eye view T (as shown in Figure 3) following the method proposed by Xiao et al. (2012). First, the world coordinates of the panoramic image P are computed using the inverse perspective mapping technique. To do so, a 3D model of the image P is generated using the equations of Xiao et al. (2012), which involve the radial distance $r = c_h |\cot \theta_y|$, where $\theta_y = \pi y / H$ is the angle between the optical axis and the horizon and $\theta_x = 2\pi x / W$ is the angle between the projection of the optical axis on the flat plane (y = 0) and the z axis. Finally, $c_h$ denotes the camera height (in meters) above the ground plane.
In the next step, the obtained 3D model is projected onto the plane Y = 0 to get a new image T representing the top-down view of the original image. This remapping process produces the image T(u, v) by recovering the texture of the ground plane using the equations of Muad et al. (2004), which depend on the camera angular aperture $\alpha$, the image dimensions W × H, and the angles $\theta = \tan^{-1}\left(c_h / \sqrt{X^2 + Z^2}\right)$ and $\gamma = \tan^{-1}(Z / X)$. The color of each ground location is obtained by bilinear interpolation from the panorama pixels.
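The warping step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the formulation of Xiao et al. (2012) or Muad et al. (2004): it assumes an equirectangular panorama whose top row is the zenith and whose middle row is the horizon, a flat ground plane, and a grayscale image stored as a list of rows. The names `top_down_view`, `bilinear`, `cam_height`, `extent_m`, and `size` are hypothetical. Each output pixel is mapped to a ground position (X, Z), then to an azimuth and a depression angle, and finally to a panorama location sampled with bilinear interpolation.

```python
import math


def bilinear(img, x, y):
    """Sample a grayscale image (list of rows) at real-valued (x, y).
    Columns wrap around (360 degrees); rows are clamped."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    x0 %= w
    x1 = (x0 + 1) % w
    y0 = min(max(y0, 0), h - 1)
    y1 = min(y0 + 1, h - 1)
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom


def top_down_view(pano, cam_height=2.5, extent_m=20.0, size=64):
    """Build a size x size bird's eye view covering extent_m metres,
    centred on the camera, from an equirectangular panorama."""
    h, w = len(pano), len(pano[0])
    out = [[0.0] * size for _ in range(size)]
    for v in range(size):
        for u in range(size):
            # Ground coordinates (metres) of the output pixel centre;
            # Z points "forward" (up in the output image), X to the right.
            X = ((u + 0.5) / size - 0.5) * extent_m
            Z = (0.5 - (v + 0.5) / size) * extent_m
            rho = math.hypot(X, Z)
            # Azimuth measured from the z axis, in [0, 2*pi).
            azimuth = math.atan2(X, Z) % (2 * math.pi)
            # Angle below the horizon at which the ground point is seen.
            depression = math.atan2(cam_height, rho)
            x = azimuth / (2 * math.pi) * w
            y = (0.5 + depression / math.pi) * h
            out[v][u] = bilinear(pano, x, y)
    return out
```

The inverse formulation (looping over output pixels and looking up the panorama) avoids holes in T, which is why interpolation is applied on the panorama side.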

Ground-level image to aerial view registration
The next step aims to detect the area occupied by the top-down view in the aerial image. It is considered as a fine localization problem that can be formulated as matching image descriptors of the warped ground-level image T with descriptors computed over the aerial map A. The proposed solution (see Figure 4) is inspired by the work of Augereau, Journet, and Domenger (2013). Various image descriptors are available to perform this matching. A recent study (Viswanathan, Pires, and Huber 2014) comparing the performance of SIFT, SURF, FREAK, and PHOW in matching ground images onto a satellite map has shown that SIFT obtains the best overall performance, even with increasing complexity of the satellite map. We thus rely here on the SIFT descriptor (Lowe 2004) in the matching process.
First, SIFT keypoints are detected and the associated descriptors (feature vectors) are extracted for both the aerial map A and the top-down view T. Then, the similarity between each descriptor vector $q_i$ of the ground sample T and each descriptor $p_j$ from A is computed. Each match $m(q_i, p_j)$ is considered correct or incorrect based on the Euclidean distance $\|q_i - p_j\|_2$.
In order to select the best match among the candidates, we adopt the common approach relying on k-nearest-neighbor (k-NN) search. An exhaustive search, however, has a cost that is quadratic in the number of keypoints. The multiple randomized k-d trees algorithm (Silpa-Anan and Hartley 2008) speeds up k-NN search. We therefore use the FLANN library (Muja and Lowe 2009), which provides an implementation of this algorithm where multiple k-d trees are searched in parallel. Note that, for the randomized k-d trees, the split dimension is chosen randomly from the 5 dimensions with the highest variance.
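To make the matching criterion concrete, the sketch below performs the exhaustive nearest-neighbor search that the randomized k-d trees are meant to accelerate. It is not the FLANN-based implementation used in this work: descriptors are plain tuples (real SIFT descriptors have 128 dimensions), and the acceptance threshold `max_dist` as well as the function name `match_descriptors` are hypothetical.

```python
import math


def match_descriptors(queries, references, max_dist):
    """For each query descriptor q_i, find the nearest reference descriptor
    p_j in Euclidean distance and keep the pair if the distance does not
    exceed max_dist. Brute force: the cost grows with
    len(queries) * len(references)."""
    matches = []
    for i, q in enumerate(queries):
        best_j, best_d = -1, float("inf")
        for j, p in enumerate(references):
            d = math.dist(q, p)  # Euclidean distance ||q - p||_2
            if d < best_d:
                best_j, best_d = j, d
        if best_d <= max_dist:
            matches.append((i, best_j, best_d))
    return matches


# Toy 3-D descriptors for illustration only.
descriptors_t = [(0, 0, 0), (10, 10, 10), (100, 0, 0)]
descriptors_a = [(9, 10, 10), (0, 1, 0), (50, 50, 50)]
matches = match_descriptors(descriptors_t, descriptors_a, max_dist=2.0)
```

With a few thousand keypoints per image, the nested loop quickly becomes the bottleneck, which motivates the approximate tree-based search described above.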
In order to find the geometric transformation between matched keypoints, the homography matrix H is computed following Agarwal, Jawahar, and Narayanan (2005). The RANSAC algorithm (Fischler and Bolles 1981) is used at this stage to split the set of matches into good matches (inliers) and mismatches (outliers), and to discard the latter. To estimate the 9-parameter transformation matrix H between the keypoints of T, denoted $P_1$, and their correspondences in A, denoted $P_2$, the most representative transformation among all matches is sought. The matrix H has the following shape:

$$P_2 = H P_1, \qquad H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix},$$

where the locations of $P_1$ and $P_2$ are represented by homogeneous coordinates. Finally, if at least t inliers are validated, T is considered to be located in the aerial image A. We chose here t = 4, which is the minimum number of point correspondences necessary to compute a homography. An illustration of this process is given in Figure 5. We can observe that the technique is robust to a certain level of change between the content visible in T and A.
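The inlier test at the heart of this step can be illustrated as follows. The sketch covers only the verification of one candidate homography (the part that RANSAC repeats for each hypothesis), not its estimation. The reprojection tolerance `tol` (in pixels) and the function names are hypothetical; only the threshold t = 4 comes from the text.

```python
import math


def apply_homography(H, pt):
    """Map a 2-D point through a 3x3 homography using homogeneous
    coordinates, then divide by the third component."""
    x, y = pt
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    wh = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / wh, yh / wh)


def count_inliers(H, pts_t, pts_a, tol=3.0):
    """Count matches whose keypoint in T, once mapped by H, falls within
    tol pixels of its correspondence in A."""
    inliers = 0
    for p1, p2 in zip(pts_t, pts_a):
        if math.dist(apply_homography(H, p1), p2) <= tol:
            inliers += 1
    return inliers


def is_localized(H, pts_t, pts_a, t=4, tol=3.0):
    """T is considered located in A if at least t inliers are validated."""
    return count_inliers(H, pts_t, pts_a, tol) >= t


# Toy example: scaling by 2 followed by a translation of (10, 20).
H = [[2.0, 0.0, 10.0], [0.0, 2.0, 20.0], [0.0, 0.0, 1.0]]
pts_t = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
pts_a = [(10, 20), (12, 20), (10, 22), (12, 22), (100, 100)]
```

In this toy example the first four correspondences agree with H and the last one is a mismatch, so the top-down view would be accepted with t = 4 and rejected with any stricter threshold.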

Ground-level image and aerial view comparison
Several change indices have already been proposed for estimating the change of appearance at two identical locations, from simple image difference or ratio to more elaborate statistics such as the Generalized-Likelihood Ratio Test (GLRT) (Shirvany et al. 2010) or the local Kullback-Leibler divergence (Xu and Karam 2013).
For the sake of illustration, we have chosen here to rely on the well-known correlation index between the top-down view and the portion of the aerial map corresponding to it (see Section 3.2). The correlation coefficient r between two images a and b of size N is computed as follows:

$$r = \frac{N \sum_{i=1}^{N} I_{a_i} I_{b_i} - \sum_{i=1}^{N} I_{a_i} \sum_{i=1}^{N} I_{b_i}}{\sqrt{\left(N \sum_{i=1}^{N} I_{a_i}^2 - \left(\sum_{i=1}^{N} I_{a_i}\right)^2\right)\left(N \sum_{i=1}^{N} I_{b_i}^2 - \left(\sum_{i=1}^{N} I_{b_i}\right)^2\right)}},$$

where i is the pixel index, and $I_{a_i}$ and $I_{b_i}$ are the intensities of the two images at pixel position i. In the correlation image, a low correlation value indicates a change. However, Liu and Yamazaki (2011) pointed out that, even where there is no change, some areas might be characterized by a very low correlation value. For this reason, they propose a new factor z to represent changes, which combines the correlation coefficient r with the image difference d. The latter is the difference between $\bar{I}_{a_i}$ and $\bar{I}_{b_i}$, the values of the two images averaged over a window of M = k × k pixels surrounding the ith pixel. We follow here a standard setting, where the window size is set to 9 × 9 pixels.
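A minimal sketch of the correlation coefficient, written with explicit sums over N pixels, is given below. The images are assumed to be flattened into equal-length lists of intensities; the function name `correlation` is hypothetical, and returning 0 for a constant image (zero variance) is a convention adopted here, not something specified in the text.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists
    of pixel intensities."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_aa = sum(x * x for x in a)
    sum_bb = sum(y * y for y in b)
    denom = math.sqrt((n * sum_aa - sum_a ** 2) * (n * sum_bb - sum_b ** 2))
    if denom == 0:
        return 0.0  # assumption: undefined correlation reported as 0
    return (n * sum_ab - sum_a * sum_b) / denom


# Identical structure up to a gain factor gives r = 1.
r_same = correlation([1, 2, 3, 4], [2, 4, 6, 8])
# Inverted structure gives r = -1.
r_inverted = correlation([1, 2, 3, 4], [8, 6, 4, 2])
```

Because r is invariant to gain and offset, it tolerates global brightness differences between the two sources, but it also reacts to texture differences that are not actual land cover changes.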
The factor z is then expressed by:

$$z = \frac{|d|}{\max_i(|d|)} - c \cdot r,$$

where $\max_i(|d|)$ is the maximum absolute value of the difference d among all pixel coordinates i, and c is the weight between the difference and the correlation coefficient. Following Liu and Yamazaki (2011), we weight the difference 4 times as much as the correlation in order to ignore subtle changes, which means that c is set to 0.25. A high value of z indicates a high likelihood of change. We adopt here the threshold value used by Liu and Yamazaki (2011) and consider the areas with z > 0.2 as changed areas.
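The combination of difference and correlation can be sketched as follows, assuming that z takes the form |d| / max|d| − c·r, which is consistent with the description above (normalized difference, correlation weighted by c = 0.25, high z meaning change). The per-pixel values of r and d are assumed to be already computed and stored as lists; the function names are hypothetical.

```python
def change_factor(r, d, c=0.25):
    """Per-pixel change factor z_i = |d_i| / max_i |d_i| - c * r_i."""
    d_max = max(abs(v) for v in d)
    if d_max == 0:
        # No intensity difference anywhere: only the correlation term remains.
        return [-c * ri for ri in r]
    return [abs(di) / d_max - c * ri for ri, di in zip(r, d)]


def changed_mask(z, threshold=0.2):
    """Flag pixels whose change factor exceeds the threshold."""
    return [zi > threshold for zi in z]


# Toy values: high correlation and small difference (unchanged),
# then two pixels with low or negative correlation and large difference.
z = change_factor(r=[0.9, 0.1, -0.2], d=[1.0, 10.0, -20.0])
mask = changed_mask(z)
```

Here the first pixel yields z = 0.05 − 0.225 = −0.175 and is kept as unchanged, while the other two exceed the 0.2 threshold.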

Experiments and results
We recall that our method was evaluated with preliminary experiments on several cities in France (see Section 2), namely Vannes, Nantes, and Rennes in Brittany, and Dijon in Burgundy.
Aerial images have been extracted from Bing Maps. Ground-based imagery has been either downloaded from Google Street View or captured in situ by volunteers involved in these experiments (for the Vannes site only). Aerial images date from 2011 to 2012, while ground-level data were taken either in 2013 (Google Street View) or in 2015 (Google Street View or in situ observations). One hundred significant locations were selected across the study sites, and the related ground-level images P and aerial images A were included in the experiments. Figure 6 shows the 100 panoramic images used in our study, while Figure 7 shows the 100 corresponding aerial images. As we can see, the data-set shows significant differences in landscapes and visual content.
Let us recall that our goal is to explore how ground-based (possibly crowdsourced) imagery could help to perform change detection in terms of land cover/land use. Visual interpretation has thus been conducted on the whole set of images (i.e. both a ground-based panorama and an aerial image for each of the 100 locations) to label each location as changed or unchanged. The z values obtained for these 100 images range between 0.00528137 and 0.417598, with the change threshold set to 0.20 as in Liu and Yamazaki (2011).
Experimental results were analyzed through standard statistical measures. The confusion matrix is provided in Table 1, from which producer accuracy (recall) and user accuracy (precision) are derived, as well as overall accuracy (ratio of correctly classified elements among all elements). We can observe that the proposed method achieves an overall accuracy of 54%, but with a significant difference between the recall and precision rates for the two classes. More interestingly, the method shows a rather high recall rate (71%) for the changed class, at the cost of a lower precision (34%). In other words, a location at which a change in land cover/land use occurs is unlikely to be missed by our method. This is highly preferable to the opposite situation, i.e. missing changes that can be recovered only by manual analysis of the whole data-set. Here, by contrast, manual refinement of the results consists in filtering out false positives only.
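For reference, the three measures can be derived from a binary confusion matrix as sketched below. The counts used in the example are hypothetical and chosen for readability only; they are not those of Table 1. The function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (recall, precision, overall accuracy) for the positive
    ("changed") class of a 2 x 2 confusion matrix.
    tp: changed locations detected as changed
    fp: unchanged locations detected as changed
    fn: changed locations detected as unchanged
    tn: unchanged locations detected as unchanged"""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if tp + fn else 0.0        # producer accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0     # user accuracy
    accuracy = (tp + tn) / total if total else 0.0
    return recall, precision, accuracy


# Hypothetical counts, NOT the values of Table 1.
recall, precision, accuracy = binary_metrics(tp=8, fp=4, fn=2, tn=6)
```

A high recall with a lower precision, as in this toy example, corresponds to the situation discussed above: few changes are missed, and the remaining work is to discard false alarms.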
In order to assess the behavior of the method more precisely, we have performed a finer classification, in which we distinguish between structured areas containing built-up objects and unstructured ones. Each ground-level image is manually assigned to one of these two classes based on its visual content. This reference classification is then compared against an automatic procedure inspired by Leung and Newsam (2010). To do so, we compute line descriptors on each image to quantify the distribution of edges at different orientations. Indeed, images of structured areas have a higher proportion of horizontal and vertical lines than images of unstructured scenes. Hough transform [...] (Quattoni, Collins, and Darrell 2004) classifier to label individual images based on their line descriptors. We report in Table 2 the confusion matrix for this second classification experiment. Note that unchanged unstructured areas have been extracted from the cities of Vannes and Dijon, while unchanged structured areas and changed structured areas both come from the cities of Vannes, Rennes, and Nantes.
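One possible form of such a line descriptor is sketched below: a length-weighted histogram of line orientations, from which the share of near-horizontal and near-vertical lines is read. This is only an assumption about what a line descriptor may look like, not the descriptor actually used here or in Leung and Newsam (2010). Line segments are assumed to have been extracted beforehand (e.g. with a Hough transform) and are given as pairs of endpoints; `orientation_histogram`, `hv_ratio`, and `n_bins` are hypothetical names.

```python
import math


def orientation_histogram(segments, n_bins=8):
    """Length-weighted, normalized histogram of segment orientations.
    Bins are centred on 0, 180/n_bins, ... degrees, so bin 0 gathers
    near-horizontal lines and bin n_bins // 2 near-vertical ones
    (for even n_bins)."""
    width = 180.0 / n_bins
    hist = [0.0] * n_bins
    for (x1, y1), (x2, y2) in segments:
        length = math.hypot(x2 - x1, y2 - y1)
        if length == 0:
            continue
        # Orientation is direction-independent: fold into [0, 180).
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        index = int(((angle + width / 2.0) % 180.0) / width) % n_bins
        hist[index] += length
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


def hv_ratio(hist):
    """Share of line length that is near-horizontal or near-vertical."""
    return hist[0] + hist[len(hist) // 2]


# Toy segments: two horizontal (one drawn right-to-left) and one vertical.
segments = [((0, 0), (10, 0)), ((0, 0), (0, 10)), ((5, 5), (0, 5))]
hist = orientation_histogram(segments)
```

A histogram of this kind could serve as the feature vector passed to a classifier; a high `hv_ratio` would be expected for built-up scenes.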
Again, we focus on changed (structured) areas. We can observe that including a structured/unstructured preclassification step yields better accuracy. Indeed, considering only images containing built-up structures, the recall for changed areas reaches 80%, for a precision of 42%. More generally, we can see that misclassification between changed and unchanged areas is more frequent for structured areas than for unstructured ones.
Beyond accuracy evaluation, we have also measured the computational efficiency of the proposed approach. The goal is to assess its usability in a crowdsourcing context. To do so, we have averaged computation time over 100 runs, considering a standard PC workstation (CPU: i7-4600@2.10 GHz, RAM: 8 GB). Results are reported in Table 3. We can observe that CPU times are very low, the overall process being performed in 3-4 s. This makes the proposed approach a realistic crowdsourcing solution for change detection.

Discussion
An in-depth analysis of the situations in which the proposed method failed to identify land use/land cover changes was then performed. We thus observe the strong [...] Moreover, since buildings are seen from their roofs in the aerial view and from their sides or facades in the ground-level images, many unchanged structured areas also lead to false positives, being classified as changed by the proposed method (Figure 8, third row). In future work, this kind of error could be reduced by considering methods for aerial-to-ground building matching (Bansal et al. 2011).

Conclusions
In the work presented here, land use/land cover changes are detected by comparing newly acquired ground-level images to less recent aerial images. To do so, we propose to transform the geo-tagged panoramic photo into a top-down view, as if it had been acquired from a nadir aerial view. Once re-projected, the warped photo is compared to a previously acquired remotely sensed image using a technique combining the correlation coefficient and the image difference. We have conducted an experiment including 100 images from four different cities in France. The obtained results show a high recall rate for the changed areas, albeit with a lower precision rate. Let us underline that recall is more important than precision here, since it is always possible to proceed with further manual inspection of potential changes. This supports the feasibility of change detection by comparing ground-level to aerial views. In addition, a more careful analysis distinguishing between structured and unstructured areas has been performed to understand the current bottlenecks of the proposed method.
To improve the current results, we will now consider more advanced image comparison methods and complete our preprocessing pipeline with other steps such as photometric correction. Comparing ground-based and aerial imagery is still a challenging issue, as noted in a recent study by Loschky et al. (2015). Other future work includes enlarging the geographic extent of the study area and increasing the volume of test data and the number of metrics. The final goal would be to perform land cover updating with our method, to illustrate the strength of crowdsourcing as an ancillary but important information source for geoinformation management.

Notes on contributors
Nehla Ghouaiel is a research fellow at the Institute of Research in Computer Science and Random Systems (IRISA), Université de Bretagne-Sud. Her current research interests include image analysis, machine learning, and computer vision.
Sébastien Lefèvre is a professor at the Institute of Research in Computer Science and Random Systems (IRISA), Université de Bretagne-Sud. His current research interests include image analysis and machine learning for remote-sensing data.