Strategies in training deep learning models to extract buildings from multisource images with small training sample sizes

ABSTRACT Building extraction from remote sensing data is an important topic in urban studies, and deep learning methods play an increasing role due to their ability to reach outstanding performance with minimal training data. We aimed to investigate the original U-Net architecture's efficiency in building segmentation with different numbers of training images, and the role of data augmentation, based on multisource remote sensing data with varying spatial and spectral resolutions (WorldView-2 [WV2] and WorldView-3 [WV3] images and an aerial orthophoto [ORTHO]). When training and prediction were conducted on the same image, U-Net provided good results with very few training images (validation accuracies: 94-97%; 192 images). Combining the ORTHO's and WV2's training data for prediction on WV3 provided poor results with a low F1-score (0.184). However, the inclusion of only 48 WV3 training images significantly improved the F1-score (0.693); thus, most buildings were correctly identified. Accordingly, using only independent reference data (i.e. from images other than the target image) is not enough to train an accurate model. In our case, the reference data from the WV2 and ORTHO images did not provide an acceptable basis for training a good model, but a minimal number of training images from the targeted WV3 improved the accuracy (F1-score: 69%).


Introduction
Growing urbanization has significantly changed the natural environment in recent decades (Ahmed et al. 2020; Uttara, Bhuvandas, and Aggarwal 2012). According to the 2018 Revision of World Urbanization Prospects produced by the United Nations, more than half (55%) of the world's population lives in urban areas, and this proportion is expected to reach 68% by 2050 (UN Report 2019). In addition, global population growth and the emergence of megacities (cities with more than 10 million inhabitants) exert increasing pressure on Earth's ecosystems (Ao et al. 2016; Koop and Leeuwen 2017; Li et al. 2019). Changes on this scale are generating a number of economic, social and environmental challenges in an increasing number of countries (Beaverstock, Smith, and Taylor 1999; Riffat, Powell, and Aydin 2016; Taubenbock et al. 2009). Addressing these issues has become a top priority in many cities and, as a result, sustainability and sustainable urbanization have become key principles of urban development strategies (Kadhim, Mourshed, and Bray 2016; Roy 2009; Shen et al. 2012). Although urban regions are highly complex, advances in remote sensing have made a wide range of applications available to help urban planners overcome difficult challenges, such as monitoring climate change or the extent of the built environment (Avtar et al. 2020; Breunig et al. 2009; Griffiths et al. 2010; Maktav et al. 2005; Miller and Small 2003; Oșlobanu and Alexe 2021; Wellmann et al. 2020).
One of the most common applications of urban remote sensing is object detection. Both government agencies and the private sector have an increasing need for accurate and reliable spatial data (Eslami and Mohammadzadeh 2016; Rose et al. 2015). The purpose of urban object detection can vary widely depending on the quality of the data: very high-resolution images or video feeds provided by UAS or traffic cameras can be used, e.g., to distinguish different means of transportation (cars, buses, bikes etc.) (Leung et al. 2019; Peppa et al. 2018; Zhou et al. 2017), while datasets with coarser resolution (most commonly airborne or satellite imagery) are used to map entire cities for different purposes: to detect buildings, road networks or vegetated areas (Fischer et al. 1998; Gavankar and Ghosh 2018; Rottensteiner et al. 2014; Ao et al. 2016). Accordingly, object detection, such as finding buildings, needs very high spatial resolution, preferably <2 m, where the edges and the texture can be resolved.
Many factors make automatic building extraction a complex task. Urban surfaces are highly heterogeneous, with many different materials in close proximity to each other, so the contrast between buildings and other objects is relatively low (Jaynes, Riseman, and Hanson 2003; Xu et al. 2018). Moreover, buildings in a given area are usually very diverse regarding shape, roofing material, and the condition and age of roofs. In addition, shadows cast by different objects can mislead segmentation algorithms and bias the result (Han et al. 2022; Jung, Lee, and Lee 2022; Schlosser et al. 2020; You et al. 2018). Thus, traditional pixel-based classification techniques usually have thematic accuracy issues or require large amounts of training data. Reference data are not always available at the city level, and their collection is time-consuming and requires substantial field work.
Deep learning (DL) is an increasingly popular method in fundamental remote sensing applications. The increase in the quality and quantity of available data has led to the development of DL approaches that extract complex features more effectively than traditional ones, resulting in higher accuracies (He et al. 2018; Ma et al. 2019; Yuan et al. 2020; Zhang, Zhang, and Du 2016). The significant and continuous increase in computing power has made it possible to develop state-of-the-art techniques, such as convolutional neural networks (CNNs). In the field of remote sensing, CNNs have been successfully applied to many tasks, e.g. change detection (Mou, Bruzzone, and Zhu 2019; Wang et al. 2019), semantic segmentation (Kampffmeyer, Salberg, and Jenssen 2016; Panboonyuen et al. 2019; Yuan, Shi, and Gu 2021), and image enhancement (Hoque et al. 2019; Hu et al. 2021). Accordingly, numerous new CNN architectures are emerging. AlexNet was introduced by Krizhevsky, Sutskever, and Hinton (2017); it enabled the processing of huge amounts of data relatively quickly by utilizing the computer's graphics processing unit (GPU). Another very significant milestone was the proposal of fully convolutional networks (FCNs) by Long, Shelhamer, and Darrell (2015). In order to label every pixel, the FCN architecture uses an encoder/decoder structure, and because it does not contain any fully connected layers, images of any size can be used as input. Based on the FCN, Ronneberger, Fischer, and Brox (2015) proposed the U-Net. One of the key differences between the two architectures is that U-Net implements skip connections in its structure to achieve better results. Furthermore, U-Net has been modified in ways that allow precise segmentation with less training data (Bardis et al. 2020).
Deep neural networks are prone to overfitting (Antoniou, Storkey, and Edwards 2017; Srivastava et al. 2014) due to several factors, but the most common source of error lies in the training data. If the amount of data available is insufficient, not representative enough, or too noisy, the model fits the training data closely, learning only the characteristics of the inputs, but generalizes poorly to test data previously unseen by the model (Brownlee 2018; Rice, Wong, and Kolter 2020; Ying 2019). Novel deep convolutional neural networks rely heavily on big data. However, given the limited reference data available in many application areas, it is necessary to expand the data to avoid overfitting. The earliest and most straightforward form of data augmentation is based on simple image transformations, creating duplicates of the original images using traditional methods such as flipping, rotating, cropping, shifting etc. (Shorten and Khoshgoftaar 2019; Perez and Wang 2017). Another increasingly popular augmentation technique is the use of GANs (Generative Adversarial Networks) (Goodfellow et al. 2020), which create synthetic copies of existing training images while retaining their main characteristics (Antoniou, Storkey, and Edwards 2017; Chlap et al. 2021).
Although deep learning has emerged as a powerful tool in many application areas, the performance of these models relies heavily on large amounts of annotated training data (Yang et al. 2022). In certain domains (such as remote sensing), such data are often limited or difficult to obtain, as high-quality annotation requires qualified experts. Moreover, the same object can vary widely across different images depending on the sensor, the time of acquisition, and the angle of imaging (Jia et al. 2021; Sun et al. 2021). To address this issue, researchers have proposed various approaches in the field of few-shot learning (FSL), which aims to help models quickly adapt to new tasks while minimizing the need for substantial training data (Dong and Xing 2018; Sun et al. 2019; Wang et al. 2020b). One approach to FSL is transfer learning (Weiss, Khoshgoftaar, and Wang 2016), which involves using a model pre-trained in a source domain to improve prediction on a target domain, i.e. transferring knowledge to the target domain where labeled data may be scarce (Song et al. 2022); hence, transfer learning has been widely utilized in the field of remote sensing (Thepade and Dindorkar 2022; Thirumaladevi, Veera Swamy, and Sailaja 2023; Zhang, Liu, and Shi 2020). We also employed transfer learning to enhance the accuracy of predictions on the target domain with limited available annotated data. Another increasingly popular approach to FSL-related problems is meta-learning, also referred to as 'learning to learn', as it aims to learn new tasks more quickly and effectively (Hospedales et al. 2022; Huisman, van Rijn, and Plaat 2021). Meta-learning has shown promising results in improving the efficiency of deep learning models and reducing the amount of data required for training (Cha et al. 2023; Gella et al. 2023; Tseng et al. 2021).
Building segmentation using different deep learning networks is a widely researched area, with the primary goal of previous studies most commonly being to develop novel model variants and thereby increase accuracy (He, Fang, and Plaza 2020; Wang et al. 2020a; Wang and Miao 2022). In addition, other researchers have often used existing, large datasets in their work, e.g. the WHU (Ji, Wei, and Lu 2019) and Inria (Maggiori et al. 2017) datasets (Bischke et al. 2017; Wang and Miao 2022; Yu et al. 2022); thus, few-sample learning utilizing multisource remote sensing data with varying spatial and spectral resolutions is often overlooked.
The aim of this study was to investigate how effectively the original U-Net network can be applied to segment buildings in densely built-up urban areas with a limited number of training images, under different circumstances: (i) when different regions of the same imagery were used for training and prediction; (ii) prediction using data from multiple sensors of different spectral and spatial resolutions, but without training data from the given image; (iii) incorporating additional data of the predicted imagery for the modeling.

Study area
The study area was an urban environment in Debrecen, Hungary's second largest city with a population of around 200,000 (Figure 1). We designated a smaller area in the northeastern part of the city where reference data were collected. The development of the selected area started in the 1970s, and it consists mainly of detached houses with roofs predominantly made of one of three materials: tile, asphalt or asbestos (older buildings); the newer ones have almost exclusively tile roofing.

Datasets
Three different remote sensing data types were studied: a WorldView-2 (WV2) and a WorldView-3 (WV3) satellite image, and an aerial orthophoto (ORTHO). The WV2 image was acquired on 24 July 2016, consisting of 8 spectral bands with a spatial resolution of 2 m. Since the imagery also included a 350 nm wide panchromatic band with a resolution of 0.5 m, we applied the Gram-Schmidt pan-sharpening method to increase the spatial resolution of the multispectral bands. The WV3 image was captured on 16 September 2019. Although the WV3 image has additional bands (SWIR, CAVIS) compared to the WV2, we had access only to the panchromatic and VNIR bands. While these bands had similar spectral properties to the WV2, the spatial resolution of the panchromatic band was better for the WV3 (0.3 m). The aerial orthophoto used in this study was provided by the Lechner Knowledge Center (a Hungarian institution for architecture, land registry and GIS); it was acquired in 2011 with an Ultracam X digital aerial camera system with 4 bands (blue, green, red and NIR) and a spatial resolution of 0.4 m (Table 1). Although several years elapsed between the acquisition dates of the three images, the development of the area was active in the 1970s, and there has been no significant development or change in land use in recent years.
The reference dataset (training and validation) for all images was based on field observation (located with GPS, with attribute data) followed by visual interpretation of the given images. In the case of the orthophoto and the WV2 image, all houses were vectorized within an area of 2 km², while for the WV3 this was 0.3 km². The difference between the two areas is that, for the WV3, we were interested in how the model performs with less training data. Reference data were split into training and validation sets at an 80:20 ratio. Validation accuracy and loss were measured at each epoch during the modeling. In addition to the reference data, a test area of 0.3 km² was selected from the WV3 image. The houses in this area were vectorized similarly to the training data; the resulting database consisted of 600 buildings. The classified images produced by the modeling were evaluated against this test dataset using confusion matrices. Model performances were compared using the precision (positive predictive value) and recall (true positive rate) obtained from the confusion matrix. The F1-score was also determined as the harmonic mean of precision and recall.
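As a quick illustration of how these metrics relate, the F1-score can be computed directly from the building-class pixel counts of a confusion matrix. The counts below are hypothetical, chosen only to show the calculation, not taken from the paper's test set:

```python
def metrics_from_confusion(tp, fp, fn):
    """Class-level metrics for the 'building' class: precision (positive
    predictive value), recall (true positive rate), and the F1-score as
    the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical pixel counts for illustration only:
p, r, f1 = metrics_from_confusion(tp=6800, fp=3200, fn=2800)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.68 0.71 0.694
```

Because the F1-score is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a stricter summary than overall accuracy for imbalanced building/background classes.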

U-Net based image segmentation
The U-Net architecture consists of two main paths: (i) the encoder or contraction path and (ii) the decoder or expanding path. (i) The encoder section consists of convolutional and max pooling layers, through which the model captures the main features and context of the images but, at the same time, loses information about their location; during this contraction phase, the spatial size of the feature maps is reduced while the number of dimensions (bands) is increased. (ii) Since our goal is to classify all pixels of the original image as a result of the semantic segmentation, it is necessary to recover localization information as well as to restore the original image size. This is achieved by the decoder path of the U-Net, which uses transposed convolution for up-sampling (Abdollahi, Pradhan, and Alamri 2020; Du et al. 2020; Fan et al. 2022; Ronneberger, Fischer, and Brox 2015; Yan et al. 2022).
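A minimal numpy sketch of this shape flow, using nearest-neighbour upsampling as a stand-in for the learned transposed convolution, illustrates how one encoder step halves the spatial size, the decoder restores it, and a skip connection concatenates the matching encoder feature map (doubling the channel count):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on an (H, W, C) feature map: halves H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Stand-in for a stride-2 transposed convolution: doubles H and W
    by nearest-neighbour repetition (no learned weights here)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(512, 512, 4)           # one 512 x 512 tile with 4 bands
enc = max_pool_2x2(x)                     # encoder step: 256 x 256
dec = upsample_2x(enc)                    # decoder step: back to 512 x 512
skip = np.concatenate([dec, x], axis=-1)  # skip connection doubles channels
print(enc.shape, dec.shape, skip.shape)   # (256, 256, 4) (512, 512, 4) (512, 512, 8)
```

In the real network this contraction/expansion is repeated several times with learned convolutions in between; the sketch only shows why the output can be classified pixel-by-pixel at the original 512 × 512 size.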

Data augmentation
Manual collection of the training data is a very time-consuming task; in our case, more than 3000 buildings were vectorized across the images. Although U-Net is known to provide good results even with limited data, data augmentation was a necessary step to obtain better thematic accuracy, as it provided sufficient data for model training. We used traditional data augmentation methods: the original images were flipped horizontally and vertically, and rotated by 90, 180 and 270 degrees. Thus, besides the original image, we obtained 5 additional images through augmentation.
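These transformations can be sketched with numpy (a minimal version; in practice the identical transforms must also be applied to the corresponding label masks so that images and annotations stay aligned):

```python
import numpy as np

def augment(tile):
    """Traditional augmentation as described: horizontal and vertical flips
    plus 90-, 180- and 270-degree rotations -> 5 extra images per tile."""
    return [
        np.fliplr(tile),      # horizontal flip
        np.flipud(tile),      # vertical flip
        np.rot90(tile, k=1),  # rotate 90 degrees
        np.rot90(tile, k=2),  # rotate 180 degrees
        np.rot90(tile, k=3),  # rotate 270 degrees
    ]

tile = np.arange(512 * 512, dtype=np.float32).reshape(512, 512)
variants = [tile] + augment(tile)
print(len(variants))  # 6: each original tile yields 5 augmented copies
```

This sixfold multiplication is what turns, e.g., 32 original tiles into the 192 training images reported later in the paper.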

Implementation details and model evaluation
Models were trained for a maximum of 50 epochs. We implemented an early stopping mechanism, which stopped the training process when the model performance did not improve for 15 epochs. For all models, we chose an initial learning rate of 0.001 and used the Adam method (Kingma and Ba 2014) as the optimization algorithm, with a batch size of 20. Images of 512 × 512 pixels were used for the training process. In total, 40 images were used for the WV2 and the aerial orthophoto, and 10 for the WV3. Combining the two different data sources (aerial and satellite imagery) caused problems in the modeling process, as the spectral resolution of the data differs (4 and 8 channels, respectively). In order to build models using the training data from the two sources together, the 4 bands (blue, green, red, near-infrared) closest to the spectral range of the aerial orthophoto were selected from the 8 bands of the WV images.
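Two of these steps can be sketched in Python. The first is the stopping rule (patience of 15 epochs, capped at 50 epochs); the second is the band harmonization. The WorldView band order used below is an assumption for illustration and should be checked against the actual product metadata:

```python
import numpy as np

def epochs_run(val_losses, patience=15, max_epochs=50):
    """Early stopping sketch: training halts once the validation loss has
    not improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch + 1  # number of epochs actually run
    return min(len(val_losses), max_epochs)

# Loss improves twice, then stalls: training stops 15 epochs after the best.
print(epochs_run([0.5, 0.4] + [0.4] * 48))  # 17

# Band harmonization: keep the 4 WV bands closest to the orthophoto's
# blue/green/red/NIR channels (band order here is assumed).
WV_BANDS = ["coastal", "blue", "green", "yellow",
            "red", "red_edge", "nir1", "nir2"]
wv_tile = np.random.rand(512, 512, 8)
keep = [WV_BANDS.index(b) for b in ("blue", "green", "red", "nir1")]
print(wv_tile[:, :, keep].shape)  # (512, 512, 4)
```

In a Keras implementation, the same stopping behavior would typically be delegated to a built-in early-stopping callback rather than hand-coded; the sketch only makes the rule explicit.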
Implementation was conducted in Python using TensorFlow (Abadi et al. 2016). All training experiments were run on an Nvidia RTX 3090 with 24 GB of VRAM (Figure 2 and Table 2).

Results

Assessing the accuracy of models when predicting on the same imagery
Both the WV2 and the ORTHO ensured similarly good results when the predictions were performed on the same image in a different area (e.g. the WV2 training data was used to predict a separate region of the WV2 image), with validation accuracies above 94% and 97%, respectively (Figure 3a,b). In both cases the learning curve showed that, after a few epochs, the validation accuracy tracked the training accuracy without exceeding it during the learning phase, so overfitting was not observed for these models.
In the case of the WV3 (Figure 3c,d), the learning curves of the models for both the 5- and 10-image training instances showed overfitting after a few epochs due to insufficient training data; thus, the early stopping mechanism halted the training process before the full 50 epochs. The segmentation results of the four models (Figure 4) were consistent with the tendencies of the learning curves.
When validation accuracy was high and the model did not overfit (WV2, ORTHO; Figure 4a,b), segmentation was also successful: almost all buildings were identified, with only a few misclassified pixels. Buildings that were incorrectly identified as background (false negatives) were almost exclusively smaller outbuildings (garages, tool sheds etc.).
We obtained poor segmentation results with the two models of the WV3 image (3-P5-3; 3-P10-3). When only 24 WV3 training images were used for the modeling process (Figure 4c), the segmentation did not result in contiguous areas and most of the buildings were omitted. Although the modeling with 48 training images already showed some of the individual buildings as contiguous patches (Figure 4d), the segmentation was still poor and subject to many errors: much of the bare soil and the roads were incorrectly segmented as buildings.

Accuracies when ORTHO was trained with WV2 and WV3 images
Prediction on the ORTHO with WV2 and WV3 training images yielded poor results (Table 3). The class-level metrics showed that although the precision values were relatively high (0.62 and

Accuracies when WV2 was trained with WV3 and ORTHO images
For both model combinations, segmentation of the WV2 image based on the WV3 and ORTHO data resulted in F1-scores above 0.5 (Table 4). Including 10 WV3 training images (3O-P50-2) instead of 5 (3O-P45-2), the precision value decreased slightly, but the increase in recall was almost 0.1; thus, modeling with more images resulted in a higher F1-score.

Accuracies when WV3 was trained with WV2 and ORTHO images
In each case when only the WV2 and ORTHO images were used to train the models (and the WV3 dataset was excluded from training), only low F1-scores were obtained (Table 5).
The highest F1-score of the three models was obtained by combining the training images of the WV2 and the orthophoto (2O-P80-3), but even here the F1-score was very low (0.184) and the segmentation result was also inadequate: the model completely omitted several houses, while others were only partially segmented, and it also misclassified roads and bare soil areas as buildings (Figure 6).

Accuracies when WV3 was trained with WV3, WV2 and ORTHO images
We obtained considerably higher F1-scores in the cases where training images from the WV3 image were included in the model building process (Table 6). Although WV3 images alone as training data (3-P5-3; 3-P10-3) provided a higher F1-score than the combined model of the ORTHO and the WV2 (2O-P80-3), the results showed that, despite the higher (but still low) F1-score, there were many errors in the segmentation. Comparison of the 3-P5-3 and 3-P10-3 models revealed that by adding only 5 additional WV3 training images, the recall increased significantly (from 0.2 to 0.72), but the precision remained almost unchanged; i.e. the model correctly identified more buildings, but the number of false positives was still very high, as reflected in the low precision value.
The best results were achieved when the WV2 and ORTHO data were merged with the WV3. We used 192 images from the WV2 and the orthophoto, as well as 24 and 48 from the WV3, respectively (23O-P85-3; 23O-P90-3). Using all the available training data (23O-P90-3), the F1-score of the resulting model was 0.693, providing the most accurate segmentation for the WV3 image (Figure 7). Although misclassifications also occurred in this model, their proportion was significantly reduced and most of the buildings were included in the segmentation.

Discussion
Although many methods have been developed for building segmentation, reaching the highest thematic accuracy with the least effort (i.e. collecting reference data) is still a challenging task due to the inhomogeneity of densely built-up areas: the presence of vegetation, shadows and other obstacles can substantially bias the results. Furthermore, while technological development keeps increasing the amount of available input data, images taken by Uncrewed Aerial Vehicles (UAVs), aircraft and satellites differ in the applied sensor's spatial and spectral resolution and in the time of acquisition, which leads to significant differences among the possible output maps. Although deep learning segmentation solutions are popular in image processing, they have a high requirement for annotated data, while producing these reference data is very time-consuming and their availability is often limited (Brigato and Iocchi 2020; Jia et al. 2021). Thus, one of the greatest challenges is the fusion of these data sources and their combined use in analyses, e.g. in extracting or identifying objects (Liu et al. 2020; Meng et al. 2020), which is an important step when we intend to use previously collected reference data on a new (independent) image. In our work we also encountered these problems: when fusing the reference data, we followed the approach of Li et al. (2021), that is, we reduced the bands of the WV images so that the number and the spectral range of the remaining bands were as close as possible to the channels of the orthophoto. As several previous studies have shown, U-Net is fundamentally well suited to building segmentation procedures (Alsabhan, Alotaiby, and Chaudhary 2022; McGlinchy et al. 2019; Yu et al. 2022; Wang et al. 2020a). Although many different variants of the U-Net have been developed in recent years to achieve even higher accuracies (He, Fang, and Plaza 2020; Hui et al.
2019; Rastogi, Bodani, and Sharma 2020; Wang and Miao 2022), we applied the original U-Net proposed by Ronneberger, Fischer, and Brox (2015), as our research focused on how well the basic network can be used for segmentation with multi-source input data. Thus, we were able to focus on the impact of different data processing and training strategies on segmentation performance, as the original network provided a suitable baseline for comparison. We found that when the prediction was performed on the same image as the training, a very limited amount of data gave sufficiently good results. In the case of the WV2 and the orthophoto, we achieved validation accuracies over 94% with only 192 images (including the augmented images) used for training. The spatial resolution was 0.5 and 0.4 m for the WV2 and the ORTHO, respectively, with an image size of 512 × 512 pixels. Other researchers have achieved similarly high accuracies for building segmentation using U-Net based models, but in many cases they have used considerably more training images. Wang et al. (2022) conducted their research on the WHU (Ji, Wei, and Lu 2019) and Inria (Maggiori et al. 2017) datasets, with images of the same size as ours and a spatial resolution of 0.3 m; however, the numbers of images used for training were 4736 and 10,000, respectively, achieving validation accuracies above 96%. Abdollahi and Pradhan (2021) used the AIRS dataset (Chen et al. 2019) with a larger tile size (1536 × 1536) but finer resolution (7.5 cm), which contains more than 220,000 buildings: they reported a validation accuracy above 95% with their proposed U-Net based network.
We reduced the number of training data in the case of the WV3: after augmentation, the models were trained with 24 and 48 images (3-P5-3; 3-P10-3). The amount of data used for these models proved to be insufficient: overfitting was observed in the learning curves (Figure 3c,d), and severe misclassifications were visible in the segmentation results (Figure 4c,d). In addition to the errors of 3-P5-3 and 3-P10-3, the F1-scores were low (Table 6). Using only the WV2 and orthophoto training data for the prediction of the WV3 yielded even worse results (F1-score of 0.184; Table 5): although the number of combined training images was 384 (2O-P80-3), the F1-score was lower than in the case where only 24 images from the WV3 were used (3-P5-3). So, although we had 16 times more data, building the model with data from different sources resulted in lower prediction performance. A significant improvement was achieved when the WV3 data were included alongside the WV2 and the ORTHO (23O-P85-3; 23O-P90-3): by adding as few as 24 images to the modeling process, the F1-score increased by nearly 0.5 (from 0.184 to 0.661; Tables 5 and 6). The inclusion of 24 further images from the WV3 increased the accuracy of the prediction further, with this model (23O-P90-3) yielding the highest F1-score (0.693). Furthermore, the precision and recall values of this model were similar (0.68 and 0.71, respectively), indicating that the model performed consistently.
The difference was considerable when WV2 images were used to predict the WV3 (e.g. 2O-P80-3) and vice versa (e.g. 3O-P50-2): in the former case (WV2 → WV3) the F1-score was 0.184, while in the latter (WV3 → WV2) it was 0.567, despite having less training data. The explanation lies in the quality of the two images: the WV2 was taken in summer, with a smaller off-nadir angle (18°) and better lighting conditions, while the WV3 was taken in autumn, with a higher off-nadir angle (30°). As a result, the buildings are more distorted in the WV3 image, which also contains a larger proportion of shadows. Using the lower-quality WV3 images for training, the U-Net generalized better when predicting on the WV2 image, but when trying to segment the WV3 image with the higher-quality WV2 training data, the model failed. Our results are consistent with the study of Weir et al. (2020), who analyzed images with off-nadir angles ranging from 7.8° to 54° and found a difference of 0.4 between the F1-scores.
Several deep learning techniques have been developed besides the U-Net, and successful applications have been reported on the topic of building extraction. Shi, Li, and Zhu (2020) compared different deep convolutional neural networks (e.g. U-Net, SegNet and different variants of the FCN architecture) on both medium-resolution (PlanetScope; 3 m) and very-high-resolution (ISPRS Potsdam benchmark dataset; 5 cm) imagery, and found that in the case of the PlanetScope imagery, U-Net outperformed several networks such as FCN-32s, FCN-16s and SegNet. At medium resolution, they found that FC-DenseNet performed best, but the difference in F1-score compared to U-Net was only 3%. Furthermore, they revealed that while U-Net provides sharp building boundaries at very high resolution (5 cm), the completeness of the segmentation result decreases in the presence of finer details. Wang et al. (2022) proposed a novel Vision Transformer (ViT) architecture designed for building extraction and conducted experiments to compare its performance with state-of-the-art convolutional neural networks (CNNs) widely used for the same task (DeepLab V3+, different U-Net variants etc.). Although they found that their ViT-based approach outperformed traditional CNNs in building segmentation, the difference was only 2-3% in F1-score when compared to different variants of the U-Net. Although these previous studies examined many models, they primarily focused on benchmark datasets (WHU, Inria, Massachusetts), utilizing a very large number of buildings (>10,000) to train the models. The results of recent studies also show that although novel solutions generally provide better segmentation results, the classic U-Net and its variants are still relevant today. Although the accuracy measures of our experiments were lower, we used images with significant off-nadir angles, while the benchmark datasets usually contain optimal images that can ensure better results. In addition, we only used a limited number of training images instead of
several thousand, but the validation accuracies obtained with 32 (192 with augmentation) training images were comparable to the results reported in the above studies.
Our approach can serve as a good starting point in deep learning object segmentation cases where the goal is to monitor a particular region, since remote sensing images from different sensors can be incorporated into an existing dataset with minimal effort yet good accuracy. A limitation of our approach is that some training data must be selected from the newly included image in order to obtain good results with few data.

Conclusion
Although many deep learning-based building extraction methods have been developed, the combination of remote sensing data from different sources, often with varying resolutions, and limited training data still pose a major challenge in image processing. In this research, we investigated the efficiency of the U-Net deep learning segmentation technique for building extraction using remote sensing data from multiple sources, learning with few samples. We aimed to help data experts with quantified results of different training approaches (including cases where there was no training data from the given imagery). Our results revealed that:
- When the source (i.e. the sensor) of the images used for model training and prediction was the same, validation accuracies were above 94% even with a small amount of training data (2-P40-2 and O-P40-O; 32 images). The segmentation results were visually satisfactory, with few misclassified pixels. For the WV3, when fewer images were used for modeling (3-P5-3 and 3-P10-3; 4 and 8 images), the learning curves showed overfitting and the segmentation results were poor.
- Low F1-scores were obtained in the segmentation of the WV3 when only the orthophoto and WV2 images were used for training. The best results were achieved when images from these two sources were merged (2O-P80-3; 64 images), but the F1-score was only 0.184.
- The inclusion of 4 (24 with augmentation) WV3 training images alongside the ORTHO and WV2 (23O-P85-3) substantially increased the accuracy: the F1-score rose to 0.661 and the segmentation results were also visually improved. The addition of 8 (48 with augmentation) WV3 training images further increased the accuracy, but to a lesser extent (23O-P90-3; F1-score of 0.693).
The main conclusion is that U-Net can provide high accuracies in building segmentation even with training data from different sources, but at least a small amount of training data is needed from the target image. Small and medium-sized enterprises working in the field of image processing often do not have the resources to produce large amounts of training samples (due to labor-intensive data preparation); however, utilizing the findings of this research, i.e. using only a few training images and including a minimal number of training samples from the target image itself, good thematic accuracy can be obtained. Thus, building reference datasets of an area from multisource images is reasonable, because more training images increase the classification accuracy.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 2. The modeling workflow with the abbreviated names of the models. First and last tags of the labels: 2 = WorldView-2; 3 = WorldView-3; O = Orthophoto. Second tag = number of training images used for the prediction (before augmentation).

Table 5.

Figure 6. Segmentation result based on the combined training data of the WorldView-2 and the orthophoto (2O-P80-3). (a) original image; (b) segmentation result. First and last tags of the label: 2 = WorldView-2; 3 = WorldView-3; O = Orthophoto. Second tag = number of training images used for the prediction (before augmentation).

Figure 7. Segmentation result based on the combination of all available training data (23O-P90-3). (a) original image; (b) segmentation result. First and last tags of the labels: 2 = WorldView-2; 3 = WorldView-3; O = Orthophoto. Second tag = number of training images used for the prediction (before augmentation).

Table 1. Spectral and spatial properties of the input datasets.

Table 2. Number of training and validation images per model (first and last tags of the labels: 2 = WorldView-2; 3 = WorldView-3; O = Orthophoto. Second tag = number of training images used for the prediction, before augmentation).

Table 6. Prediction results on the WorldView-3 image based on F1-scores in the cases where WorldView-3 images were included in building the models (first and last tags of the labels: 2 = WorldView-2; 3 = WorldView-3; O = Orthophoto. Second tag = number of training images used for the prediction, before augmentation).