Height estimation from single aerial imagery using contrastive learning based multi-scale refinement network

ABSTRACT Height map estimation from a single aerial image plays a crucial role in localization, mapping, and 3D object detection. Deep convolutional neural networks have been used to predict height information from single-view remote sensing images, but these methods rely on large volumes of training data and often overlook geometric features present in orthographic images. To address these issues, this study proposes a gradient-based self-supervised learning network with momentum contrastive loss to extract geometric information from unlabeled images in the pretraining stage. Additionally, novel local implicit constraint layers are used at multiple decoding stages in the proposed supervised network to refine high-resolution features in height estimation. A structure-aware loss is also applied to improve the robustness of the network to positional shifts and minor structural changes along boundary areas. Experimental evaluation on the ISPRS benchmark datasets shows that the proposed method outperforms other baseline networks, with minimum MAE and RMSE of 0.116 and 0.289 for the Vaihingen dataset and 0.077 and 0.481 for the Potsdam dataset, respectively. The proposed method also shows around threefold data efficiency improvements on the Potsdam dataset and domain generalization on the Enschede dataset. These results demonstrate the effectiveness of the proposed method in height map estimation from single-view remote sensing images.


Introduction
Elevation information is of paramount importance in the analysis of fine-grained 3D structures of ground objects. To this end, obtaining height information from digital surface models (DSMs) has proven to be useful for various remote sensing applications, including mapping, digital terrain analysis, and 3D object detection, as demonstrated in previous studies (Bittner et al. 2018a; Ding et al. 2020; Na et al. 2018). Moreover, DSMs have been shown to be beneficial for more challenging tasks, such as semantic labeling and change detection, as evidenced by recent research (Carvalho et al. 2019; Dong, Zhao, and Wang 2021; Steinnocher et al. 2019). DSMs are commonly acquired using a light detection and ranging (LiDAR) laser scanner or an interferometric synthetic-aperture radar (InSAR), through a structure-from-motion (SfM) methodology, or from stereo image pairs (Bittner et al. 2018b; Krauß, d'Angelo, and Wendt 2019). However, DSM estimation using these approaches may require difficult and sophisticated acquisition techniques or suffer from poor performance in the presence of complex reflective and refractive bodies, resulting in high costs associated with DSM acquisition.
With the recent and rapid advancement of sensor technologies, monocular vision-based reconstruction methods are gradually becoming more widely used (Fu et al. 2018; Garg et al. 2016; Naderi et al. 2022; Stucker and Schindler 2022). Emerging studies are exploring methods for extracting 3D information from single images, such as shape from shading and shape from texture, which can be followed by the use of stereo or temporal sequences of 2D images. However, in the absence of environmental assumptions such as lighting models, establishing a clear mathematical relationship between color, gray scale, and 3D coordinates in a 2D space becomes unfeasible. Consequently, estimating 3D information from a single-view image still poses a significant challenge. Recently, with the wide application of convolutional neural networks (CNNs) in computer vision, a growing number of deep learning methods have been developed for the task of monocular depth estimation (Eigen and Fergus 2015; Hu et al. 2019). These approaches utilize the nonlinear expressive capability of CNNs to infer the implicit relationship between color pixels and depth. Similarly, height estimation using single-view remote sensing images is gradually gaining the attention of researchers thanks to the expanding availability of open datasets and advancements in deep learning networks (Mou and Zhu 2018; Srivastava, Volpi, and Tuia 2017; Zhao, Persello, and Stein 2021; 2022; Zhu et al. 2017). Scholars have investigated various solutions for this task, which can be classified into three subcategories: those based on CNNs, generative adversarial networks (GANs), and multitask learning (Amirkolaee and Arefi 2019; Carvalho et al. 2019; Paoletti et al. 2020; Xing, Dong, and Hu 2021), respectively. Estimating height from single-view remote sensing images is challenging due to the ambiguity of mapping intensity or color measurements to height values, and the limited availability of contextual information in low spatial resolution orthographic images covering large areas.
Height estimation from a single-view image is highly influenced by the availability of large-scale DSMs as label data. To improve the performance of CNNs for this task, it is becoming increasingly important to leverage unlabeled remote sensing images. Self-supervised learning has recently emerged as a promising approach for different tasks in computer vision, as it enables the learning of meaningful representations from large amounts of unlabeled data. Earlier self-supervised learning approaches, such as rotation or exemplar pretext tasks and contrastive learning (Gidaris, Singh, and Komodakis 2018; He et al. 2020), focused on extracting semantic properties from images in order to enhance the performance of downstream tasks such as classification, object detection, or semantic segmentation. Recent studies have shown that spatial structure plays a fundamental role in understanding scene depth (Chen et al. 2021; Naderi et al. 2022). By leveraging the inherent structure and redundancy in the data, self-supervised learning can effectively learn to estimate depth from images without requiring explicit depth annotations (Godard et al. 2019; Zhou et al. 2017). Geometric features are often inherently present in monocular aerial images, and we hypothesize that these features can serve as important image descriptors for height prediction, enhancing the performance of aerial image analysis models (Li et al. 2021; Xing, Dong, and Hu 2021). Therefore, we propose to leverage a contrastive learning pre-training strategy to extract the geometric representation of aerial images, which enables more precise estimation of height information with high data efficiency.
Effectively learning fine-grained, shape-preserving features is another significant challenge when applying CNN-based methods to predict depth information from 2D images. Typically, depth/height estimation methods consist of two main components: an encoder for dense feature extraction and a decoder for prediction. Following dense feature extraction, a range of techniques are used to aggregate feature maps at higher resolutions, including multi-scale networks, skip connections, dilated convolutions, and atrous spatial pyramid pooling (ASPP) (Chen et al. 2017; Eigen and Fergus 2015; Godard, Mac Aodha, and Brostow 2017). Current research on depth estimation does not rely solely on reducing the size of the receptive field by removing the final few pooling layers; instead, it often involves reconstructing the network with atrous convolutions to leverage previously learned weights. As a result, these approaches use denser features and execute the majority of the decoding process at that resolution, followed by a simple upsampling to regain the original spatial resolution. Inspired by recent advances in 3D reconstruction using implicit neural representations (Chen, Liu, and Wang 2021), we propose novel local implicit constraint (LIC) layers positioned at different stages throughout the decoding phase to establish explicit relationships for resolving to full resolution. This enhancement module enables the input layers to learn 4D plane coefficients and jointly use them to achieve full-resolution height estimates.
In summary, we propose a two-stage CNN framework for height estimation from a single-view aerial image. The first stage is a momentum contrastive learning method for encoder pretraining. A self-supervised representation learning scheme is introduced to learn the geometric features of an image from unlabeled, independent aerial images for performance improvement. In particular, we generate gradients as unique geometric representation fields of the image to compose the positive and negative pairs for momentum contrastive learning. The second stage is a supervised network for height estimation, which involves a coarse-to-fine refinement decoder for the prediction. We utilize effective LIC layers at the decoding stages to preserve high-resolution features. Specifically, we add LIC layers to the decoder at multiple stages with resolutions of 1/8, 1/4, and 1/2. These layers guide feature maps to the desired depth based on local planar assumptions and are combined to predict depth at full resolution. Our approach differs from other decoding methods in that the proposed layers learn 4D plane coefficients and reconstruct depth estimates at full resolution. Nonlinear combination during training leads to distinctive training of individual spatial cells at each resolution. The effectiveness of our method is validated by experiments on two public open-access datasets, the ISPRS Vaihingen and Potsdam datasets (Rottensteiner et al. 2012). We also demonstrate the robustness and data efficiency of the model by testing its generalization ability on the Enschede dataset.
The remainder of this paper is organized as follows. We first briefly review related studies in Section 2. The method is described in Section 3. Section 4 introduces the experimental materials. Section 5 discusses the outcomes. Section 6 concludes the paper.

Monocular depth estimation and height estimation
Monocular depth estimation is a challenging task in the field of computer vision, as it involves estimating the depth information of a scene from a single 2D image. Various approaches have been proposed to address this challenge, including supervised learning, unsupervised learning, and self-supervised learning (Eigen and Fergus 2015; Fu et al. 2018). Supervised learning methods typically require large amounts of labeled data for training, which can be difficult and expensive to obtain. Unsupervised learning methods, on the other hand, aim to learn depth representations without the need for labeled data, but their performance may be limited by the lack of supervision. Self-supervised learning methods use auxiliary tasks to train the model, such as predicting image rotations or colorization, and have shown promising results in monocular depth estimation. Additionally, researchers have explored different techniques for improving the accuracy and robustness of monocular depth estimation, such as multi-scale approaches, attention mechanisms, and geometric constraints (Chen et al. 2021; Karatsiolis, Kamilaris, and Cole 2021).
Height information from remote sensing images has traditionally been extracted using techniques such as stereo-pair photogrammetry, SAR interferometry, and LiDAR processing. However, these methods can be time-consuming, labor-intensive, and expensive. In recent years, deep learning networks such as CNNs and generative adversarial networks (GANs) have shown promise in estimating height information from a single-view image. However, the problem of single-image depth estimation is inherently ill-posed in the photogrammetric community, as a single input image can correspond to numerous possible depth maps.
The primary approaches taken by researchers in aerial image height estimation using deep learning are as follows: (a) training with additional data, (b) parallelizing auxiliary tasks with the depth estimation, (c) using deeper models with skip connections between layers, and (d) using generative models (such as GANs) with conditional settings. Alidoost, Arefi, and Tombari (2019) used additional structural information about the buildings in the image, such as lines from structure outlines, to perform knowledge-based 3D building reconstruction. Mou and Zhu (2018) presented the IM2HEIGHT encoder-decoder convolutional architecture, which employs a single, yet functional, skip connection between the first residual block and the second-to-last block. Ghamisi and Yokoya (2018) were the first to propose using a conditional GAN to simulate elevation data from a single color image. Similarly, Paoletti et al. (2020) propose image-to-image translation using variational autoencoders (VAEs) and GANs to produce DSMs from optical images. However, given the high cost and limited availability of nDSM data, there is a growing need to develop more data-efficient methods for aerial image height estimation using deep learning. Increasing the efficiency of these methods can significantly reduce the reliance on ground truth data, making it possible to train models with limited or no access to DSM data. This can lead to more accurate and reliable height estimation from single-view aerial images, without the need for expensive and time-consuming data acquisition processes.

Self-supervised and contrastive learning
Because the majority of images are unlabeled, much research has been conducted to optimize neural network training using a small number of annotated datasets. Self-supervised learning creates a pretext task using only unlabeled input so that the network acquires valuable visual representations from the image prior to performing task-specific supervised learning operations such as classification or object detection. Doersch, Gupta, and Efros (2015) divide an image into many non-overlapping patches and train neural networks to estimate their relative placements. In Gidaris, Singh, and Komodakis (2018), a rotation pretext task is added that rotates the input image, and the network predicts the amount of rotation applied. Gao, Sun, and Liu (2022) proposed a self-supervised approach for scene classification, which leverages the spatial relationship between object proposals and their context. Similarly, Li et al. (2022) proposed a contrastive learning approach for semantic segmentation of aerial images, which learns to segment objects based on their visual similarity and spatial context.
As a subfield of self-supervised learning, contrastive learning aims to learn representations of data by contrasting similar and dissimilar instances (Hadsell, Chopra, and LeCun 2006). Positive and negative pairs help the model acquire new information by contrasting the target patch with a different position within the same image or with patches from distinct images. In contrastive learning, the network is trained to maximize the similarity between positive pairs (instances of the same class) and minimize the similarity between negative pairs (instances of different classes). This approach allows the network to learn rich and discriminative representations of the data without the need for labels. To maximize the effectiveness of contrastive learning, many strategies have proposed increasing the proportion of negative sample pairs used throughout the learning process. SimCLR (Chen et al. 2020) is a state-of-the-art contrastive learning method that learns representations by maximizing the similarity between augmented views of the same image. MoCo (He et al. 2020) is another popular contrastive learning approach that uses a memory bank to store and retrieve negative samples for contrastive training. Henaff (2020) separates the image into many non-overlapping patches, with the aim of predicting the pixel values of the next unseen patch. In the remote sensing field, Jain, Wilson, and Gulshan (2022) use a patch-based approach to learn local features and a global aggregation module to capture the overall context of the scene or object. Wang, Zhong, and Zhang (2023) propose a self-supervised contrastive learning method for change detection in remote sensing images.
Despite recent advances in contrastive learning, there are still challenges and research gaps that need to be addressed, particularly in the application of this technique to complex and high-dimensional data such as remote sensing imagery. Moreover, developing effective strategies for constructing positive and negative samples is critical for improving the performance and applicability of contrastive learning for downstream tasks. Motivated by the concept of contrastive learning, we have developed a gradient-based self-supervised learning network with momentum contrastive loss to enable deep networks to extract geometric information and enhance the accuracy of height estimation from high-resolution remote sensing images.

Methodology
The proposed height estimation framework is illustrated in Figure 1. In the initial stage of unsupervised pretraining, two encoders with identical structures, known as the query encoder and the key encoder, are utilized. Back-propagation with contrastive loss is used to update the query encoder for geometric representation learning, while the key encoder is updated via momentum. The second stage is the supervised learning stage. We apply the query encoder trained in the first stage as the feature extractor for height prediction. The corresponding decoder is refined at multiple scales using the LIC layers.

Momentum contrastive learning with gradient field
In order to improve the training of encoders under unsupervised conditions, we adopt contrastive learning, which aims to generate an embedding space in which the distances between similar positive pairs are reduced and the distances between dissimilar negative pairs are increased. The core of contrastive learning is the construction of positive and negative samples. Usually, positive samples are obtained through various transformations such as rotation, cropping, or Gaussian noise, and negative samples are other data of different categories within a batch. As shown in Figure 1, our method aims to train an encoder that can learn a geometric visual representation of an image I along with its gradient field G. The encoder maps the image and its gradient field to the feature space Z, and the head then projects each feature to the low-dimensional head space H to prevent overfitting. Through the contrastive loss, the representation of a query image I_q in the head space H is encouraged to become closer to the representation of the positive pair G+ and farther from the representation of the negative pair G− from another image I−. For contrastive learning, two encoders with the same structure are used. Back-propagation with contrastive loss updates the query encoder, whereas momentum updates the key encoder. Following geometric representation contrastive pre-training, the query encoder is used as the height estimation network's feature extractor.
The encoder begins by mapping the positive and negative pairings into the feature space Z using the latent vector z extracted from the input image. Then, a head module projects z onto a lower-dimensional feature space H. After that, the contrastive loss is determined by comparing the projected latent vectors h of positive and negative pairings in the low-dimensional feature space.
We follow the design of MoCo (He et al. 2020), which uses a dynamic dictionary as a queue of latent vectors h of input data K to reduce the batch-size dependence of contrastive learning. During training, the encoded key value h_k is pushed into the dictionary to produce large numbers of negative pairs. Whenever the number of latent vectors in the dictionary surpasses the maximum size of the dynamic dictionary K, the oldest values are removed. MoCo splits the query and key data into two encoders to avoid quick changes in the key value h_k from the same image I. The key encoder is updated using momentum rather than contrastive loss to keep the key latent vector h_k consistent.
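As a minimal sketch of this mechanism (pure Python with illustrative names; the actual implementation operates on encoder weight tensors), the momentum update and the FIFO key dictionary behave as follows:

```python
def momentum_update(query_params, key_params, m=0.9):
    """EMA update of the key encoder, as in MoCo:
    theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class KeyQueue:
    """Fixed-size FIFO dictionary of key latent vectors h_k used as negatives."""
    def __init__(self, max_size):
        self.max_size = max_size
        self.keys = []

    def enqueue(self, h_k):
        self.keys.append(h_k)
        if len(self.keys) > self.max_size:  # evict the oldest keys
            self.keys = self.keys[-self.max_size:]
```

With a momentum of 0.9, the key encoder drifts slowly toward the query encoder, keeping the queued keys consistent across iterations.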
As opposed to MoCo, which uses two augmented (e.g. color-distorted) views of the same image as key and query, we define the aerial image I as the query data and its gradient field G as the key data. This selection aims to train the encoder to learn structural invariance in images. Structure information has proven to be effective in depth prediction (Chen et al. 2021). We hope to improve the effectiveness of the network in height estimation during the testing phase by learning the relationship between remote sensing image blocks and their gradient fields. The gradient field G of image I is generated by a modified Canny detector (Canny 1986) combined with the Sobel operator, which extracts the magnitude of the dominant gradient and its location. In this manner, pixels of the gradient field G carry different intensity values according to edge dominance. Since the magnitude range of the gradient field may differ from that of the input remote sensing image, the field is normalized to [0, 1] to ensure faster and more stable learning. The contrastive loss L_q by InfoNCE (Oord, Li, and Vinyals 2018) is applied to measure the similarity between the query and key latent vectors h_q and h_k: L_q = −log [ exp(h_q · h_k+ / t) / Σ_i exp(h_q · h_ki / t) ], where the subscript '+' denotes the positive pair, and t is a temperature hyperparameter controlling the distribution density. The pretrained query encoder is then used as the feature extractor of the proposed height estimation network, which follows an encoder-decoder scheme. To summarize, with the modified Canny approach and momentum-based contrastive learning, we pre-train the height estimation network's encoder to learn the image's geometric representation.
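The two ingredients of this stage can be sketched in NumPy as follows; note that we substitute a plain Sobel gradient magnitude for the modified Canny step, and all function names are illustrative, not the paper's implementation:

```python
import numpy as np

def sobel_gradient_field(img):
    """Gradient-magnitude field of a 2D image via Sobel filters,
    normalized to [0, 1] (stand-in for the modified Canny/Sobel step)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    mag = np.hypot(gx, gy)
    return mag / mag.max() if mag.max() > 0 else mag

def info_nce(h_q, h_kpos, h_knegs, t=0.07):
    """InfoNCE loss for one query vector: -log softmax similarity
    between the query and its positive key, against negative keys."""
    sims = np.array([h_q @ h_kpos] + [h_q @ k for k in h_knegs]) / t
    sims -= sims.max()  # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())
```

A query embedding aligned with its positive key yields a lower loss than one aligned with a negative, which is exactly the gradient signal used to train the query encoder.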

Multiscale local implicit constraint refinement
Existing methods typically employ skip connections and simple nearest-neighbour upsampling layers to restore features to their original resolution during the decoding stage. Different from these methods, we introduce novel LIC layers based on a local planar assumption to restore features to their original full resolution. The basic goal of this design is to effectively describe direct and clear relationships between internal features and the final output. As shown in Figure 2, we placed the LIC layers at each decoding phase to locate the geometric guidance that leads to the desired height estimation. Additionally, we add a 1 × 1 reduction layer to provide the finest estimate ĉ_{1×1} ∈ R^{h×w×1} after the final upconv layer. Finally, the outputs of the proposed layers are concatenated and fed into the final convolutional layer to obtain the estimated height d.
To be more precise, for a feature map with spatial resolution H/k, the LIC layers estimate 4D plane coefficients for each spatial cell that correspond to a locally determined k × k patch at the full resolution H. The final convolutional layers aggregate these predictions in order to produce the output. With the help of the final convolutional layers, each output is combined with the 1 × 1 reduction output and the other LIC layer outputs for global interpretation. Consequently, they may have discrete ranges that can be used as a base or accurately adjusted with respect to the base value at a specific spatial location. To direct features under the local planar assumption, we use ray-plane intersection to transform each predicted 4D plane coefficient into k × k local height cues: c_i = n4 / (n1 u_i + n2 v_i + n3), where n = (n1, n2, n3, n4) represents the plane coefficients predicted by the model, and (u_i, v_i) represents the k × k patch-wise normalized coordinates of pixel i. We employ base network skip connections to link internal outputs with suitable spatial resolutions.
Figure 3. The local implicit constraint (LIC) layer. We use a series of 1 × 1 convolutions to obtain 4-dimensional coefficient estimates (i.e. H/k × H/k × 4). These channels are then split to undergo two separate activation mechanisms, which helps to ensure that the plane coefficients satisfy certain constraints. Finally, the coefficients are input into the planar guidance module, where locally defined relative depth estimates are computed.
Figure 3 shows the details of the proposed layer. The 1 × 1 convolutions first repeatedly reduce the channel count by a factor of 2 until it reaches 3, yielding a feature map of size H/k × H/k × 3 under a square-input assumption. Then, the feature map is processed in two different ways to estimate the local plane coefficients. The first branch performs the conversion to a unit normal vector (n1, n2, n3), while the other branch calculates the perpendicular distance n4 between the plane and the origin. The n4 estimation is done by a sigmoid function, after which the output is multiplied by the maximum distance k to obtain the actual depth. A unit normal vector can only deviate from its predefined axis by two degrees of freedom (i.e. the polar angle θ and the azimuthal angle φ). Therefore, the two angles are taken as the first two channels of the given feature map. They are then converted to a unit-normalized vector by: (n1, n2, n3) = (sin θ cos φ, sin θ sin φ, cos θ). (4) Finally, the vectors are concatenated again, and the value is used to estimate c_{k×k} by the ray-plane intersection described above.
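A small NumPy sketch of the two coefficient branches, assuming the standard spherical-coordinate conversion for the unit normal and the ray-plane form given above (function names are illustrative):

```python
import numpy as np

def angles_to_unit_normal(theta, phi):
    """Convert polar/azimuthal angles to a unit normal (n1, n2, n3)."""
    n = np.array([np.sin(theta) * np.cos(phi),
                  np.sin(theta) * np.sin(phi),
                  np.cos(theta)])
    return n / np.linalg.norm(n)

def local_height_cues(n, k):
    """Ray-plane intersection: expand one 4D plane coefficient (n1..n4)
    into a k x k patch of local height cues c_i = n4 / (n1*u + n2*v + n3)."""
    n1, n2, n3, n4 = n
    u, v = np.meshgrid(np.linspace(0.0, 1.0, k), np.linspace(0.0, 1.0, k))
    return n4 / (n1 * u + n2 * v + n3)
```

For a horizontal plane (normal pointing straight up), every pixel in the patch receives the same height cue n4, while tilted normals produce a linear ramp across the patch.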
In the local regions of k × k patches, the local depth cue is intended as an additive depth definition. The final depth is predicted by combining features in the same spatial area at various stages. In the last convolutional layer, the concatenated outputs are combined through corresponding linear transform operations W_j, j ∈ {1, 2, 3, 4}, followed by an activation function f(·). We use exponential linear units (Clevert, Unterthiner, and Hochreiter 2015) as the activation function and employ nearest-neighbour upsampling followed by a 3 × 3 convolutional layer for upconvolution. Figure 4 illustrates how the LIC layers behave. The boundaries of the building from LIC 8 × 8 (yellow rectangle) and LIC 4 × 4 (green rectangle) are adjusted in the final clear estimates by the outputs from LIC 2 × 2 (blue rectangle) and reduc 1 × 1 (black rectangle). By adding LIC layers, the network can learn details of regions with sharp curvatures at the finer scales and the major structures at the coarser scales during training.
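The activation and upconvolution choices above are standard operations; a tiny illustrative NumPy version:

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit (Clevert, Unterthiner, and Hochreiter 2015)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def nearest_upsample(x, scale=2):
    """Nearest-neighbour upsampling of a 2D feature map."""
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)
```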

Structure-aware loss function
To further improve the network's learning of structural information in images, we adopt the losses designed in Hu et al. (2019), which consider depth loss, gradient loss, and normal loss. The majority of previous research has used the l1 loss, i.e. the difference between the depth estimate d_i and its ground truth g_i. In order to balance the contribution of pixels at different distances to the prediction, a depth-balanced Euclidean loss is used here.
L_depth = (1/n) Σ_i F(e_i), (6) where e_i = ||d_i − g_i||_1, and F(x) = ln(x + a); a is a hyperparameter we define. Additionally, to address the issue arising from step-edge structures in depth maps, we incorporate a loss function that penalizes errors in depth gradients around edges. We first define a discrete scale-invariant gradient g as g_h[d](i, j) = ((d(i + h, j) − d(i, j)) / (|d(i + h, j)| + |d(i, j)|), (d(i, j + h) − d(i, j)) / (|d(i, j + h)| + |d(i, j)|)). (7) Based on this, the gradient loss used to penalize relative depth errors between neighbouring pixels is obtained as L_grad = Σ_h Σ_i ||g_h[d](i) − g_h[g](i)||_1. Five different spacings h ∈ {1, 2, 4, 8, 16} are used to cover gradients at different scales. This loss stimulates the network to compare depth values within a local neighbourhood for each pixel, which also relieves the problem of distortion and blur at edges.
Furthermore, to handle small-scale depth structures and improve the detail of depth maps, we explore an additional training loss that assesses the accuracy of the estimated depth map's surface normals relative to the ground truth. This loss is computed from the difference between the surface normal of the estimated depth map (denoted by n^d_i) and that of the ground truth (denoted by n^g_i), and is defined as follows: L_normal = (1/n) Σ_i (1 − ⟨n^d_i, n^g_i⟩ / (√⟨n^d_i, n^d_i⟩ √⟨n^g_i, n^g_i⟩)), where ⟨·, ·⟩ signifies the vectors' inner product. This loss is sensitive to minor depth structures since it measures the angular difference between the normals of two surfaces. The three losses are orthogonal to each other and, ultimately, the total loss is defined as their weighted sum: L = L_depth + λ L_grad + μ L_normal, where λ and μ are weighting coefficients.
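The three loss terms can be sketched in NumPy as follows; this is a simplified version of the formulations above (plain forward differences instead of the scale-invariant gradient, finite-difference surface normals), and all names are illustrative:

```python
import numpy as np

def depth_loss(d, g, a=0.5):
    """Depth-balanced term: mean of F(e) = ln(|d - g| + a)."""
    return np.mean(np.log(np.abs(d - g) + a))

def gradient_loss(d, g, spacings=(1, 2, 4, 8, 16)):
    """Penalize differences of depth gradients at several spacings h."""
    total = 0.0
    for h in spacings:
        if h >= min(d.shape):
            continue  # spacing larger than the map
        gx = (d[:, h:] - d[:, :-h]) - (g[:, h:] - g[:, :-h])
        gy = (d[h:, :] - d[:-h, :]) - (g[h:, :] - g[:-h, :])
        total += np.mean(np.abs(gx)) + np.mean(np.abs(gy))
    return total

def normal_loss(d, g):
    """One minus the cosine between finite-difference surface normals."""
    def normals(z):
        n = np.stack([-np.gradient(z, axis=1), -np.gradient(z, axis=0),
                      np.ones_like(z)], axis=-1)
        return n / np.linalg.norm(n, axis=-1, keepdims=True)
    return np.mean(1.0 - np.sum(normals(d) * normals(g), axis=-1))

def total_loss(d, g, lam=1.0, mu=1.0):
    """Weighted sum L = L_depth + lam * L_grad + mu * L_normal."""
    return depth_loss(d, g) + lam * gradient_loss(d, g) + mu * normal_loss(d, g)
```

For a perfect prediction the gradient and normal terms vanish, and the depth term settles at ln(a), its minimum.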

Dataset
Two publicly accessible ISPRS benchmark datasets, Vaihingen and Potsdam, were used for the performance assessment of our proposed networks (Figure 5). Both datasets are commonly used in semantic segmentation and terrain modeling tasks, and are available at https://www.isprs.org/education/benchmarks/UrbanSemLab/semantic-labeling.aspx.

Experimental setup
We implemented our height estimation networks in PyTorch. Both the contrastive learning and height estimation experiments were re-run under controlled settings to eliminate factors affecting network performance. For momentum-based contrastive learning of the encoder, DenseNet-161 with four stages is selected as the baseline (Huang et al. 2017) for the encoder. To ensure the consistency of the key latent vector, we update the key encoder with a momentum of 0.9. We set the batch size to 64 and the size of the dynamic dictionary to 16,384. The temperature parameter t is set to 0.07. The learning rate is 0.0015 with a weight decay of 0.0001. We also experimented with different sigma values for the Canny detector from 1 to 5 and found that a sigma value of 3 provided a good balance between noise reduction and detail preservation for our task.
After the output of the dense feature extractor, contextual information is added by atrous spatial pyramid pooling to capture the large-scale variations among the observations. Sparse convolutions with various dilation rates (r ∈ {3, 6, 12, 18, 24}) are applied. In all trials, the weights of the loss function terms are set to λ = 1 and μ = 1, and the mapping function's a parameter is set to 0.5. In fine-tuning the supervised height estimation network, an input size of 512 × 512 was used together with the Adam optimizer at a learning rate of 0.0001. We set the batch size to 4 and 1 for training and testing, respectively.

Evaluation metrics
To ensure a fair comparison, we use three evaluation indicators to quantify the height estimation performance: mean absolute error (MAE), root mean square error (RMSE), which indicates the degree of absolute error at each pixel in meters, and zero-mean normalized cross correlation (ZNCC), which quantifies the spatial correlation between the predicted height and the ground truth height (Xing, Dong, and Hu 2021).
MAE = (1/N) Σ_i |H_e(i) − H_r(i)|, RMSE = √((1/N) Σ_i (H_e(i) − H_r(i))²), and ZNCC = (1/N) Σ_i (H_e(i) − μ_e)(H_r(i) − μ_r) / (σ_e σ_r), where H_r denotes the reference height, H_e denotes the estimated height, and N denotes the number of evaluated pixels. μ and σ are the means and standard deviations of H_r and H_e, respectively.
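These three indicators translate directly into NumPy (illustrative helper names):

```python
import numpy as np

def mae(h_e, h_r):
    """Mean absolute error between estimated and reference height maps."""
    return np.mean(np.abs(h_e - h_r))

def rmse(h_e, h_r):
    """Root mean square error between estimated and reference height maps."""
    return np.sqrt(np.mean((h_e - h_r) ** 2))

def zncc(h_e, h_r):
    """Zero-mean normalized cross correlation between two height maps."""
    a = (h_e - h_e.mean()) / h_e.std()
    b = (h_r - h_r.mean()) / h_r.std()
    return np.mean(a * b)
```

Note that ZNCC is invariant to affine rescaling of the heights, so a prediction that is correct up to a constant offset and scale still scores 1, whereas MAE and RMSE penalize it.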

Experimental results
The experimental results on the Potsdam (Figure 6) and Vaihingen (Figure 7) datasets are visually evaluated to show the performance of the proposed method. A qualitative analysis indicates that the visualization results are close to the ground truth. For a qualitative performance assessment, a comparison with three published height estimation methods, namely U-IMG2DSM (Paoletti et al. 2020), DCNet (Amirkolaee and Arefi 2019), and PLNet (Xing, Dong, and Hu 2021), was also conducted in the experiment. Compared with the three competitors in terms of predicted height values and detailed information, our proposed method clearly achieves better height estimation, especially for the boundaries and rooflines of buildings. The results obtained by the competing approaches are more distorted and blurred.

Quantitative comparison with other methods
Among the three published height estimation methods, except for the GAN-based U-IMG2DSM, our method and the other two comparators all employ an encoder-decoder structure and incorporate features at different levels. However, in terms of layer connection, the DCNet method only uses dilated convolution to capture context information in the skip connection, whereas our method employs the LIC layers to gradually refine the low- and high-level features. To ensure fairness in evaluation, we keep the original experimental settings as reported in the corresponding literature for the comparators, except for the training and testing datasets. Table 1 reports the metrics used to assess the performance of the different methods.
As shown in Table 1, our method obtains better height estimation results than the compared methods on all metrics by a large margin. More specifically, PLNet performs best among the comparators; our proposed method still improves on it by 12.7%, 18.5%, and 1.1% on the Vaihingen dataset and 6.4%, 20.4%, and 2.1% on the Potsdam dataset in terms of MAE, RMSE, and ZNCC, respectively. The Potsdam dataset has a higher spatial resolution than the Vaihingen dataset, so each patch covers a smaller area and contains less salient height-related structure, which poses additional challenges for network learning. As shown in Table 1, our method nevertheless achieves significantly better height estimation accuracy on both datasets, indicating consistency across datasets. Our method is also robust, as it exploits the geometric features within the raw images for height estimation.

Ablation studies
We also conducted thorough ablation studies on the components used in the framework. As the baseline model, we chose DenseNet-161 with deconvolution operations and no pretraining stage. As mentioned in Section 3, our proposed network adds three main enhancements to this baseline: pretraining with the gradient field, LIC refinement, and the structural-aware loss function. We therefore augment the network with these core modules one by one to evaluate how each component affects performance. All experiments were run for 50 epochs on the Potsdam dataset, and the evaluation metrics were calculated as shown in Table 2.

Impact of pretraining with gradient field
We tested the effect of adding a contrastive learning pre-training phase. Training and evaluation were conducted using the original MoCo and our gradient field method (Figure 8), respectively. Table 2 indicates that adding the original MoCo pre-training already improves the results. With the geometric representation pre-training, we achieved a further significant reduction in MAE and RMSE of 0.399 and 0.949, respectively, compared to the baseline model. The results indicate that our method effectively extracts geometric information from non-labeled aerial images in the pretraining stage and significantly improves the height estimation task.
For applications like height estimation, pre-training the network to extract semantic information for classification or object identification does not always yield a better initialization. How we pre-train the network's encoder thus determines its dominant representation of captured images, regardless of task-specific fine-tuning. The LIC layers then establish a relationship between internal feature maps and the target prediction, allowing for more efficient network training.

Impact of structural-aware loss function
The numerical results given in Table 2 show that the height estimation results are improved by adding the structural-aware loss function. According to image statistics, real-world scenes can be divided into smooth surfaces and the sharp discontinuities between them, the latter corresponding to the borders of objects in the scene. Accurately rebuilding the discontinuities that appear in depth as step edges requires precise reconstruction capability. The structural-aware loss is therefore designed to make the network robust to positional shift and minor structural changes along these boundaries.
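The exact form of the structural-aware loss is defined in Section 3; as a rough illustration of the idea, a common choice in depth and height estimation combines a per-pixel L1 term with an SSIM-based structural term. The sketch below, including the α = 0.85 weighting, is an illustrative assumption on our part, not the paper's exact formulation:

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM between two maps (equals 1.0 for identical inputs)."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2))

def structural_aware_loss(pred, target, alpha=0.85):
    """Blend a structural (SSIM) penalty with a per-pixel L1 penalty."""
    l1 = np.abs(pred - target).mean()
    return alpha * (1.0 - ssim_global(pred, target)) / 2.0 + (1.0 - alpha) * l1
```

Because SSIM compares local statistics rather than raw pixel values, the structural term tolerates small misalignments that would dominate a pure per-pixel loss near object boundaries.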

Discussion
This paper explores the application of self-supervised contrastive learning to remote sensing image-based height estimation. Deep learning models typically require a large number of training samples, while labeled data are usually difficult to obtain, which has prompted substantial research on the use of unlabeled data. In this work, we employ a self-supervised pre-training model that constructs positive and negative sample sets from the dataset to derive the prior knowledge distribution of the image for the given task (DSM estimation). Pre-training enables the encoder to reach beneficial results in formal training and prediction. The experimental results demonstrate that our strategy effectively improves on the baseline model, and we consider that this idea also has potential in other remote sensing applications. On this basis, we further discuss model generalization and the data-efficiency advantage of this scheme, where data efficiency refers to the ability of a model to achieve high performance with a relatively small amount of training data.

Labeled data efficiency
Our model is also evaluated on data subsets with annotated height labels and compared with random initialization (He et al. 2015). We gradually limit the amount of labeled data used to fine-tune the height estimation network, from 10% down to 1% of the Potsdam dataset. As shown in Table 3, our method achieves good performance with only 1% or 5% of the labeled data. Furthermore, training with either 1% or 5% of the data outperforms random initialization trained with 10% of the labeled data on all accuracy metrics. This indicates that the data-efficiency gain of our method becomes progressively more pronounced as the amount of labeled data decreases: here, data efficiency is roughly three times higher with 1% labeled data and two times higher with 5%. This finding can be explained by the fact that the network is trained by gradient descent, so parameter initialization becomes more important as the labeled training data are reduced.

Evaluation on domain generalization
We also test the domain generalization ability of our method on real-scene images. We adopt the height estimation network with the DenseNet-161 encoder trained on the Potsdam dataset and evaluate it on the study area of Enschede without any training or further fine-tuning. For the custom Enschede dataset, we selected one aerial image tile of 5120 × 5120 pixels (Figure 9) with a spatial resolution of 25 cm. The image was acquired with the help of the Netherlands' Cadastre, Land Registry and Mapping Agency (Kadaster). The aerial image is not ortho-corrected by a dense matching process, and shadows and oblique angles remain in the image tile; it is thus closer to real data than the ISPRS benchmark. The tile covers part of the urban area of Enschede, the Netherlands. The nDSM data are obtained from Public Services on the Map (Publieke Dienstverlening Op de Kaart, PDOK, available at https://www.pdok.nl/), and we resampled the nDSM tile from its original 50 cm to the same 25 cm spatial resolution as the image. We crop the image into 512 × 512 patches. Table 4 confirms that the proposed pre-training strategy outperforms the MoCo initialization on the Enschede dataset. As seen in Figure 10, a height estimation network pre-trained with our method gives considerably more plausible height output and retains more features, such as building edges and tree branches, than MoCo and SimSiam initialization, or even than the noisy ground truth nDSM. This is because our proposed method trains the network to extract robust geometric aspects of the image, such as vertical or horizontal structural information.
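For reference, tiling the 5120 × 5120 Enschede image into non-overlapping 512 × 512 patches, and nearest-neighbor upsampling of the 50 cm nDSM by a factor of two, can be done along the following lines (the helper names are ours; the paper does not specify its resampling kernel, so nearest-neighbor is an assumption):

```python
import numpy as np

def crop_patches(tile, size=512):
    """Split an H×W(×C) array into non-overlapping size×size patches."""
    h, w = tile.shape[:2]
    return [tile[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def upsample_nearest_2x(ndsm):
    """Resample an nDSM from 50 cm to 25 cm GSD (2x along each axis)."""
    return ndsm.repeat(2, axis=0).repeat(2, axis=1)
```

With these helpers, a 5120 × 5120 tile yields 10 × 10 = 100 patches.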
The primary goal of the geometric embedding is to exploit the similarities between the aerial image and the DSM map near geometric edges in the remote sensing scene. In other words, we want to direct the encoder to generate more precise spectral features for each spatial point using the DSM map, which may help the encoder acquire knowledge of the image's structural invariance.

Conclusion
This study investigates CNN-based height estimation from single-view aerial images. To help the network learn the geometric representation of the remote sensing image, we used a gradient field-based momentum contrastive learning pre-training approach. In addition, we utilized effective LIC layers in the decoding stage to refine the high-resolution features in a coarse-to-fine manner for height estimation. Compared with the other methods, ours improves by at least 12.7%, 18.5%, and 1.1% on the Vaihingen dataset and 6.4%, 20.4%, and 2.1% on the Potsdam dataset in terms of MAE, RMSE, and ZNCC, respectively. Qualitative results also indicate that our method obtains finer structural details, and it shows advantages in data efficiency and domain generalization on the Potsdam and Enschede datasets. Moving forward, we aim to investigate various self-supervised learning transformation methods to achieve greater generalization and optimization of height estimation techniques within an unsupervised learning framework.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1 .
Figure 1. Overview of the proposed architecture, which consists of Stage I: contrastive pre-training and Stage II: height estimation. Our method aims to train an encoder that learns a geometric visual representation of an image I along with its gradient field G. The encoder maps the image and its gradient field to the feature space Z, and the head then projects the features to the low-dimensional head space H to prevent overfitting. Through the contrastive loss, the representation of a query image I q in the head space H is encouraged to move closer to the representation of the positive pair G + and farther from the representation of the negative pair G − from another image I −. Two encoders with the same structure are used for contrastive learning: back-propagation with the contrastive loss updates the query encoder, whereas momentum updates the key encoder. After geometric representation contrastive pre-training, the query encoder is used as the height estimation network's feature extractor.
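The two update rules in the figure can be sketched in a few lines: an InfoNCE-style contrastive loss over one positive gradient-field key and a set of negative keys, plus the exponential moving-average update of the key encoder. This is a generic MoCo-style sketch under our own naming, not the paper's implementation:

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key encoder's parameters from the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def info_nce_loss(q, k_pos, k_negs, tau=0.07):
    """Pull the query embedding toward its positive gradient-field key and
    push it away from keys derived from other images."""
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    k_negs = k_negs / np.linalg.norm(k_negs, axis=1, keepdims=True)
    logits = np.concatenate(([q @ k_pos], k_negs @ q)) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive sits at index 0
```

Only the query encoder receives gradients; the slow EMA of the key encoder keeps the negative keys consistent across training steps, which is the core idea of MoCo.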
where $G_x, G_y \in \mathbb{R}^{h\times w}$ denote the horizontal and vertical gradients of $I$, $E$ denotes the gradient magnitude calculated by the Sobel operator, and $B_{\mathrm{Canny}}$ refers to the binary result generated by the Canny algorithm. The operator $\otimes : (\mathbb{R}^{h\times w}, \mathbb{R}^{h\times w}) \mapsto \mathbb{R}^{h\times w}$ represents pixel-wise multiplication.
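Concretely, the gradient field can be formed by masking the Sobel magnitude E with a binary edge map, as sketched below in plain NumPy. To stay dependency-free, the sketch accepts a precomputed edge mask and, absent one, falls back to thresholding E instead of running a full Canny detector; that fallback is a simplifying assumption on our part:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)

def conv2_same(img, k):
    """3x3 'same' cross-correlation with zero padding (no SciPy dependency)."""
    p = np.pad(img, 1)
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + h, j:j + w]
    return out

def gradient_field(img, edge_mask=None):
    """G = E (x) B: Sobel gradient magnitude masked by a binary edge map.
    The paper uses a Canny mask B_Canny; if none is supplied here, we
    approximate it by thresholding E at its mean (an assumption)."""
    gx = conv2_same(img, SOBEL_X)       # horizontal gradient G_x
    gy = conv2_same(img, SOBEL_X.T)     # vertical gradient G_y
    e = np.sqrt(gx ** 2 + gy ** 2)      # gradient magnitude E
    if edge_mask is None:
        edge_mask = (e > e.mean()).astype(np.float64)
    return e * edge_mask                # pixel-wise product G
```

Masking E with the binary edge map suppresses low-magnitude texture noise while preserving the strong structural edges the contrastive pre-training is meant to emphasize.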

Figure 2 .
Figure 2. Overview of the encoder-decoder network for height estimation. The network comprises an ASPP module, LIC layers, and their dense connections for final height estimation. The LIC layers' outputs have the full spatial resolution H, allowing decoding shortcuts. We employ skip connections from the base network to link internal outputs with suitable spatial resolutions.

Figure 4 .
Figure 4. Examples demonstrating the LIC layers' behavior. The boundaries of the building from LIC 8 × 8 (yellow rectangle) and LIC 4 × 4 (green rectangle) are adjusted in the final clear estimates by the outputs from LIC 2 × 2 (blue rectangle) and reduc 1 × 1 (black rectangle).
(1) The Vaihingen dataset was produced by the German Association of Photogrammetry and Remote Sensing (Deutsche Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation, DGPF). It includes 33 image tiles of different sizes with three bands: red, green, and near-infrared. The DSM was generated via dense image matching. A normalized DSM with 9 cm ground sampling distance (GSD) is used as ground truth, and the true orthophoto also has a spatial resolution of 9 cm. In this study, 16 tiles were used for training and the other 17 for testing. (2) The Potsdam dataset contains 38 image tiles of a fixed size of 6000 × 6000 pixels with four bands: red, green, blue, and near-infrared. The height information was also generated by dense image matching with Trimble INPHO 5.6 software and then mosaicked and rasterized into a true orthophoto (TOP) with Trimble INPHO OrthoVista software. Void data ('holes') were interpolated in the final released TOP and DSM products. Three different channel combinations of the TOP are provided, namely IRRG (IR-R-G), RGB (R-G-B), and RGBIR (R-G-B-IR); in this study, we only use RGB. The true orthophoto and normalized DSM have a 5 cm GSD. We select 24 tiles for training and the other 14 tiles for testing.

Figure 6 .
Figure 6. Sample results of height estimation on the Potsdam dataset. From left to right: aerial image, U-IMG2DSM, DCNet, PLNet, our method, and ground truth.

Figure 7 .
Figure 7. Sample results of height estimation on the Vaihingen dataset. From left to right: aerial image, U-IMG2DSM, DCNet, PLNet, our method, and ground truth.

Figure 8 .
Figure 8. Gradient field map. From left to right: input image patch, binary mask generated by B Canny, the gradient magnitude E, and the generated gradient field G of the image.

Table 1 .
Model comparison on the ISPRS benchmark dataset.
Impact of LIC layers
The results in Table 2 demonstrate that the LIC layers further reduce MAE and RMSE by 0.191 and 0.064, respectively, and improve ZNCC by 0.043. The results indicate that the LIC layers can effectively refine low-level and high-level features and obtain a better height estimation result.
Figure 9 .
Figure 9. Image tile and corresponding nDSM of the Enschede dataset.

Table 4 .
Generalization results on test images of the Enschede dataset.

Figure 10 .
Figure 10. Sample results of height estimation on the Enschede dataset.