Large-Scale LoD2 Building Modeling using Deep Multimodal Feature Fusion

Abstract In today's rapidly urbanizing world, accurate 3D city models are crucial for sustainable urban management. The existing technology for 3D city modeling still relies on an extensive amount of manual work, and the provided solutions may vary depending on the urban structure of different places. According to the CityGML 3.0 standard of 3D city modeling, in LoD2, the roof structures need to be modeled, which is a challenging task due to the complexity and diversity of roof types. While high-resolution images can be utilized to classify roof types, they have difficulties in areas with poor contrast or shadows. This study proposes a deep learning approach that combines RGB optical and height information of buildings to improve the accuracy of roof type classification and automatically generate a 3D city model. The proposed methodology is divided into two phases: (1) classifying roof types into the nine most popular roof types in New Brunswick, Canada, using a multimodal feature fusion network, and (2) generating a large-scale LoD2 3D city model using a model-driven approach. The evaluation results show an overall accuracy of 97.58% and a Kappa coefficient of 0.9705 for the classification phase and an RMSE of 1.03 m for the 3D modeling.


Introduction
The United Nations' population division reports that 55% of the world's population lives in urban areas (United Nations 2018; Department of Economic and Social Affairs of the United Nations 2018). Accurate 3D city models are crucial to sustainably managing and planning the growing urban population. 3D city models provide a comprehensive understanding of a city's layout, infrastructure, and resources, empowering engineers, urban planners, and decision-makers to have a better understanding of the current state of the city and make decisions that align with the city's long-term objectives. 3D city models can be used in different areas, such as smart city applications, disaster management, urban planning, tourism, navigation, and facilities management (Doulamis and Preka 2016; Peters et al. 2022).
According to the Open Geospatial Consortium (OGC) standard of 3D city modeling, CityGML 3.0, 3D city models can be produced in different Levels of Detail (LoD) to address different applications. This standard allows for detailed modeling of buildings at four LoDs, LoD0 to LoD3. These levels progressively increase the complexity and accuracy of the building model. In LoD0, buildings have a 2D representation of the building's footprint, and LoD1 is a block-shaped model of buildings. In LoD2, buildings include the structure of the roof, and finally, detailed architectural elements such as windows, doors, and the full exterior are included in LoD3.
Hence, in LoD2+ models, buildings have adequately modeled roof structures and thematically differentiated surfaces. The majority of the studies in the literature can be divided into data-driven or model-driven approaches. Data-driven approaches extract geometric components from a building and use them to model the building. On the other hand, model-driven approaches select the model that best fits the building data from a predefined library of models (Krafczek and Jabari 2022). Although data-driven approaches can be more flexible in modeling roof types, if the shape of the majority of roofs in an area follows basic formats like gable and pyramid, model-driven methods can be faster and simpler to implement (Partovi et al. 2014).
Thus, in model-driven LoD2 city generation methods, the type of building roofs plays a crucial role in fitting a 3D model to each building (Buyukdemircioglu et al. 2021). As a result, accurate estimation of building roof types is a key step toward 3D city modeling in model-driven approaches. In general, the accuracy of the roof type classification process directly affects the final precision of the LoD2 3D building model.
Due to the high amount of information in urban areas and the diversity of building roof types, the classification of building roof types is an active research area in photogrammetry, remote sensing, and computer vision.
Although classification algorithms have been developed, creating a 3D city model based on fully automatic roof-type classification is still a challenging task. With the rapid development of big data and high-performance computers, deep learning, especially Convolutional Neural Networks (CNNs), can help with classification tasks, e.g., image classification, segmentation, and scene understanding (LeCun et al. 2010).
High-resolution RGB optical airborne/satellite images provide rich information content that can be used for roof type classification. However, semantic classification and extracting 3D information using 2D optical images suffer from difficulty in distinguishing objects in areas with poor contrast and shadows. Digital elevation models can overcome this limitation since each type of roof has its own height pattern. Flat roofs, for example, have a constant height across their surface, while gable roofs have a decreasing height from the peak to the bottom. 3D point cloud data containing elevation and intensity measurement information can be used as an independent source of information for roof-type classification.
To improve the accuracy of recognizing roof types and generating a large-scale LoD2 3D city model, we propose a multimodal network that combines deep RGB optical and height features of each building. To the best of our knowledge, this is the first study in the literature that fuses these features for roof type classification based on the model proposed in Figure 4. Our proposed method aims to increase the accuracy of LoD2 3D city modeling by improving the overall accuracy of roof type classification.
The presented methodology consists of two phases. The first phase focuses on building roof type classification by fusing the RGB optical features extracted from high-resolution orthophotos with height information of buildings extracted from LiDAR data. The concatenated features are fed to a fully connected classifier to recognize the roof types. To the best of our knowledge, existing CNN-based roof classification training datasets, such as those of Alidoost and Arefi (2018), Buyukdemircioglu et al. (2021), and Wang et al. (2022), have up to seven roof types, including flat, hip, half-hip, pyramid, gable, and complex. The study area for this work includes Fredericton and Moncton, which are major cities in New Brunswick, located in the Atlantic region of Canada. In New Brunswick, like many other urban/suburban areas in Canada, the majority of rooftops can be classified into nine groups, namely flat, gable, hip, cross-hip, gable-flat, pyramid, cross-gable, gambrel, and Dutch. As the existing datasets could not cover roof types specific to this area, it was necessary to develop a roof type dataset that included common roof types in Eastern Canada. A small number of buildings do not fit into the aforementioned groups, which we classify as complex roofs. We employed a decision tree method to separate complex building roof types from all the buildings. Then, the remaining buildings were classified into the nine roof types using the proposed deep feature fusion DL network. In total, this paper identifies 10 types of building roofs.
The second phase of this work focuses on large-scale 3D city modeling using a model-driven approach by fitting a 3D model to the point clouds. This part can be divided into two steps: (1) extracting the eave and ridge heights of each building using the Digital Elevation Model (DEM) and Digital Surface Model (DSM); (2) assigning a 3D model to each building using a preexisting roof type library.
In summary, this paper contributes to the literature in three respects: (1) designing a multimodal feature fusion solution to classify roof types; this solution utilizes both RGB optical and LiDAR data features, which are fused through a double deep learning network; (2) optimizing the fit of a 3D model to the building points in the point cloud by using a model-driven approach; and (3) creating a roof-type classification dataset using high-resolution orthophotos (72 mm) and LiDAR data (6 points/m²). The dataset will be published to be accessed by others and can be employed in other CNN-based roof-type classification applications.

Related works
There is extensive literature on building roof type detection and LoD2 3D city model reconstruction. This section briefly addresses some of the related work in these two areas. These works are dedicated to building recognition and building roof type classification using machine learning and deep learning approaches. Qian et al. (2022) proposed a deep learning network called Deep Roof Refiner for refining the delineation of roof structure lines using satellite imagery, which could then be used to model the roofs. Alidoost and Arefi (2018) developed a model-based approach for automatic building detection and roof type classification using a single aerial image. They classified three different roof types, including flat, gable, and hip shapes, with an accuracy of 92%. Partovi et al. (2017) classified roof types into seven classes using WorldView-2 pan-sharpened multispectral satellite images and the VGG-Net model. In another study, Bittner et al. (2019) proposed a multi-task conditional generative adversarial network for DSM refinement and roof-type classification. Their network creates a refined DSM, which is then used for dense pixel-wise rooftop classification, assigning object class labels to each pixel in the DSMs with 80.03% precision. Buyukdemircioglu et al. (2021) classified six roof types using a shallow CNN model. They fine-tuned their model with three well-known pre-trained networks, including EfficientNetB4. The results showed that after fine-tuning the network, the accuracy of the model increased by 3% to 6%. Using machine learning approaches, Assouline et al. (2017) classified roof types as well as aspect (azimuth) classes and slope (tilt) classes for large-scale solar photovoltaic (PV) deployment and obtained an accuracy of 67%.
Conventional methods of 3D city model generation can be divided into three categories: Data-driven, model-driven and hybrid techniques.
1. Data-driven techniques, also called bottom-up approaches, are used to detect roof planes and extrude roof shapes based on geometric components such as lines, edges, and points (Park and Guldmann 2019). There are various methods for segmenting LiDAR point clouds and determining roof planes, including edge-based methods (Jiang and Bunke 1994), region-growing methods (Alharthy and Bethel 2004), random sample consensus (RANSAC) methods (Hartley and Zisserman 2003), and clustering methods (Shan and Toth 2018), as well as combinations of two or more algorithms (Dorninger and Pfeifer 2008). Huang et al. (2011) introduced generative modeling of building roofs with an assembly of primitives allowing overlapping, using the Reversible Jump Markov Chain Monte Carlo algorithm. Huang et al. (2022) and Li et al. (2022) presented methodologies for reconstructing 3D models of buildings from airborne LiDAR point clouds using a data-driven approach. In both works, point clouds were segmented into planar patches. Then, a 3D optimization technique was applied to create a topologically consistent 3D building model from its compositional primitives.
2. Model-driven approaches, also known as top-down approaches. Lafarge et al. (2010) and Huang et al. (2013) proposed methods to reconstruct buildings from a digital surface model. This process involved breaking down the building footprints into components either manually or automatically and then utilizing a Gibbs model to fit the 3D block models onto the building footprints. A Bayesian decision was taken to find the most appropriate roof primitives from the predefined library to represent the point clouds, utilizing a Markov Chain Monte Carlo sampler and original proposition kernels.
3. Hybrid methods were developed as a result of the inherent weakness of model-driven approaches in modeling complicated buildings and the complexity of data-driven methods. Model-driven and hybrid approaches are reviewed here, as they are related to the adopted workflow. Pepe et al. (2021) and Tripodi et al. (2020) used stereo satellite imagery to build the digital surface model and extract the height of each object using the DSM. The latter used deep learning to extract the contour polygons of the buildings and then the digital terrain model. Zhao et al. (2021) proposed a reconstruction framework to recover a 3D model containing a complete shape and accurate scale from a single image. The proposed method uses two convolutional neural networks to create watertight mesh models and optimizes them using another CNN network. Krafczek and Jabari (2022) proposed a decision-tree-based methodology for generating LoD2 3D city models. They decomposed the building footprints into building primitives to obtain a better estimation of the height of each building's parts.
3D city models can also be directly generated from 3D point clouds. These methods use Terrestrial Laser Scanners (TLS) (Akmalia et al. 2014) to generate dense point clouds and then perform segmentation to detect building façades and features. Li et al. (2022) utilized deep learning to model building roofs from raw LiDAR data automatically. They extracted PointNet++ deep features from the input building roof point clouds to detect the roof corners. The corners are clustered to make a set of accurate vertices. The vertices are fed to a graph algorithm to find the valid edges between vertices, providing the results to make the final roof model. Similarly, Dehbi et al. (2021) presented a new method for reconstructing 3D buildings from LiDAR data. They used an active sampling strategy that combines a series of filters to focus on promising samples. The filters are based on prior knowledge represented by density distributions. The method uses surflets (3D points with normal vectors) to provide parameters for model candidates, such as azimuth, inclination, and ridge height. Building footprints are derived in a preprocessing step using machine learning methods. Kada (2022) also used LiDAR point clouds to reconstruct a simple 3D model. He extracted the geometrical features of buildings needed for 3D modeling using a DL network.

Data preparation
In this research, we developed a roof-type dataset consisting of 2483 buildings from nine common roof types. To create this dataset, we used four input layers; the details of the input data are provided in Table 1.
LiDAR point clouds were used to create a DSM with a spatial resolution of 72 mm after resizing. Next, high-resolution orthophotos were mosaicked and used to manually digitize, modify, and label building footprints. This process resulted in extracting nine groups of roof types from high-resolution orthoimages and DSM images. Each dataset includes building images for each class, along with the corresponding labels.
In the first scenario, the training dataset was created using the DSM and orthophoto of Fredericton, while in the second scenario, the training dataset was created using the DSM and orthophoto of Moncton. To evaluate the performance of the model, separate testing and validation datasets were generated by utilizing the DSM and orthophoto of Moncton in the first scenario and Fredericton in the second. For each scenario, the whole city building data was used for training; 25% of the test dataset was dedicated to validating the model performance, while the remaining 75% of the data was used to test the model.
The test and train datasets were then cropped and resized. Figure 1 shows the preprocessing and preparation diagram. To reduce the over-fitting problem in deep learning, the training data was augmented. The augmentation process involved horizontal and vertical flipping of the training images, as well as rotations by 45, 60, and 90 degrees clockwise. These augmentations ensured an equal number of images for each class. As a result, there were 1000 samples for each class in the training datasets. A sample of the RGB optical image and height layer for each building is shown in Table 2.
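The flip and right-angle rotation parts of this augmentation can be sketched with NumPy as below. This is a minimal illustration rather than the exact pipeline used; the 45- and 60-degree rotations would additionally require an interpolating rotation (e.g., `scipy.ndimage.rotate`), which is omitted here.

```python
import numpy as np

def augment(img):
    """Return simple augmented variants of an H x W x C image patch:
    horizontal flip, vertical flip, and a 90-degree rotation."""
    return [
        np.flip(img, axis=1),   # horizontal flip (mirror left-right)
        np.flip(img, axis=0),   # vertical flip (mirror top-bottom)
        np.rot90(img, k=1),     # rotate 90 degrees in the image plane
    ]
```

Applying the same set of transforms to both the RGB patch and its DSM patch keeps the two modalities aligned for each building.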

Methodology
As shown in Figure 2, this work is divided into two phases. Phase 1 focuses on the classification of roof types. In this phase, we first determine complex roof types, which have irregular geometries that cannot be used in model-driven methods. We intend to detect these roof types and mark them for manual 3D reconstruction. In the next step, we classify the roof types of the rest of the buildings. Phase 2 of this work involves the reconstruction of a large-scale LoD2 3D city model. The following subsections provide details about each phase.

Phase 1: Roof type classification based on multimodal feature fusion network
Recognizing complex roof types
This type of building includes multiplane roof structures with multiple peaks and edges. We utilized a decision tree method, depicted in Figure 3, to detect complex roof types. The first step in this process is to find the number of roof edges and planes for each building. To calculate the number of edges, we simply counted the number of vertices of each building footprint.
To determine the planes of each roof, the RANdom SAmple Consensus (RANSAC) method was used (Derpanis 2010). RANSAC is an algorithm used to identify and extract multiple planes from a LiDAR point cloud dataset. The algorithm works by iteratively selecting a random subset of points from the point cloud and using them to estimate a plane that fits the data. The algorithm then uses a distance metric to evaluate how well the estimated plane fits the remaining points in the point cloud. Points within a certain distance threshold from the plane are considered inliers and assigned to that plane. Then, in a repetitive process, new subsets of points are selected until the remaining planes are extracted. Deep CNNs can suffer from the vanishing gradient problem as network depth grows; ResNet addresses this issue by incorporating skip connections, allowing gradients to flow freely through deep layers without becoming attenuated to small values (Sarwinda et al. 2021; Tan et al. 2018). In this study, we used ResNet as the CNN body of our baseline networks and the proposed method.
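The sample-fit-count loop described above for a single dominant plane can be sketched in NumPy as follows. The threshold and iteration count are illustrative, not the settings used in the paper; extracting multiple planes would repeat the call on the remaining outlier points.

```python
import numpy as np

def ransac_plane(points, n_iters=300, dist_thresh=0.1, seed=None):
    """Fit one dominant plane n . x + d = 0 to an (N, 3) point array
    with RANSAC. Returns (normal, d, inlier_mask)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = (None, None)
    for _ in range(n_iters):
        # 1. sample three random points and fit an exact plane
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ sample[0]
        # 2. score the plane by its number of inliers
        dist = np.abs(points @ normal + d)
        inliers = dist < dist_thresh
        # 3. keep the plane with the most inliers so far
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane[0], best_plane[1], best_inliers
```

On roof point clouds, each extracted plane corresponds to one candidate roof facet, and the facet count feeds the complex-roof decision tree.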
Traditionally, CNNs are trained with a random initial set of weight parameters, but this approach requires a large amount of training data and a significant amount of memory. In this work, a ResNet pre-trained on the ImageNet dataset is employed to avoid the overfitting problem (Deng et al. 2009). We utilized this pre-trained model as the baseline or starting point for the classification task and then fine-tuned its parameters specifically for our target dataset using transfer learning principles (Bengio 2012; Donahue et al. 2013).
To customize the pre-trained ResNet for building roof type classification, we replaced the last fully connected layer of these pre-trained models with a fully connected layer matching the number of roof type classes. Additionally, we utilized transfer learning techniques to fine-tune our model further and optimize its performance. The rest of each model is used as a fixed feature extractor to extract features from our two sets of datasets. Next, the SoftMax classifiers of these networks are trained on the new datasets using discriminative learning rates.

Deep feature extraction and fusion
After creating the baseline networks pre-trained on the ImageNet dataset, we proceeded to develop the third network, which extracts meaningful RGB optical and height features of each building. While the CNN body of this network is ResNet, it is pre-trained on the RGB optical and height datasets from the previous step (see Figure 4). After feeding the weight parameters from the previous step to the multimodal feature fusion network, the last layer of this network is replaced with a newly developed classifier head with the roof type classes. The multimodal feature fusion network extracts the RGB optical and height features of each building and then concatenates them. This process results in a feature map, which is subsequently fed to the flatten layer. Figure 4 represents a schematic diagram of the multimodal feature fusion network.
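The fusion step itself reduces to concatenating the two per-building descriptors and passing the result through a classifier head. The NumPy sketch below assumes the two ResNet branches have already produced feature vectors; the linear weights stand in for the trained fully connected head and are not the paper's actual parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_classify(rgb_feat, height_feat, W, b):
    """Concatenate per-building RGB and height descriptors, then apply
    a linear + SoftMax head (stand-in for the fusion network's
    classifier head). rgb_feat, height_feat: (batch, d) arrays."""
    fused = np.concatenate([rgb_feat, height_feat], axis=-1)  # (batch, 2d)
    return softmax(fused @ W + b)                             # (batch, n_classes)
```

In the actual network, `W` and `b` are learned jointly with the two ResNet branches during fine-tuning.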

Phase 2: Model-driven 3D city model reconstruction
In Level of Detail 2 (LoD2), 3D building reconstruction using a model-driven approach requires roof attributes such as roof type, ridge height, and eave height. Even though roof types are recognized in phase 1, the ridge height and eave height need to be retrieved from the point cloud. To obtain the ridge height, a zonal maximum statistics tool is applied to the nDSM (Normalized Digital Surface Model), which calculates the maximum height for each building. To generate the nDSM, the LiDAR point cloud data is first used to extract both the Digital Surface Model (DSM) and the Digital Terrain Model (DTM), which represents the ground surface. Next, the DTM is subtracted from the DSM, generating the nDSM, which represents the height of buildings above the ground surface.
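The nDSM and ridge-height computation can be sketched as below. This is a simplified raster stand-in for the GIS zonal-maximum tool; the array and parameter names are illustrative.

```python
import numpy as np

def ndsm_and_ridge(dsm, dtm, footprint_mask):
    """Compute nDSM = DSM - DTM (building height above ground) and take
    the ridge height as the maximum nDSM value inside one building
    footprint (a zonal maximum over the footprint mask)."""
    ndsm = dsm - dtm
    ridge_height = float(ndsm[footprint_mask].max())
    return ndsm, ridge_height
```

In practice the same zonal maximum is evaluated once per footprint polygon to attribute every building with its ridge height.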
The eave height of a building is defined as the minimum height of its largest sloping roof plane. To determine this, the following steps are taken, as shown in Figure 5: (1) calculate the slope of each building and assign zero if it falls outside the minimum and maximum slope range; (2) create the minimum height threshold DSM, where the threshold is defined as follows: if nDSM >= Min Roof Height, the pixel is assigned 1, else 0; (3) calculate the aspect of this layer; (4) reclassify the aspect map into a number of classes; (5) convert the classes into polygons; and (6) calculate the area of each polygon and use the one with the largest area (larger than the minimum sloping roof area) as the roof plane. As a result of determining the roof plane, the largest sloping roof plane was obtained. The minimum height of this plane is considered the eave height. Then, the roof height can be calculated by subtracting the eave height from the ridge height of each building.
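The steps above can be condensed into a small raster sketch, assuming the roof planes have already been labelled; the label raster below plays the role of the reclassified aspect polygons of steps 3 to 5, and the threshold value is illustrative.

```python
import numpy as np

def eave_height(ndsm, plane_labels, min_roof_height=2.0):
    """Pick the largest labelled roof plane above the height threshold
    and return its minimum height as the eave height (simplified
    version of steps 2-6)."""
    valid = ndsm >= min_roof_height            # step 2: height threshold
    labels, counts = np.unique(plane_labels[valid], return_counts=True)
    largest = labels[np.argmax(counts)]        # step 6: largest-area plane
    return float(ndsm[(plane_labels == largest) & valid].min())
```

The roof height then follows as ridge height minus eave height for each building.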
Key attributes, including ridge height, eave height, and roof type, have been acquired to fit 3D models to point clouds and reconstruct 3D buildings. Using Computer Generated Architecture (CGA) code, these attributes are utilized within the ESRI City Engine for 3D creation purposes. Buildings are extruded to their eave height, and then the roof shape and roof height of each building are applied.

Experiments
Phase 1: Roof type classification
Based on visual inspections, it can be observed that the majority of the buildings in Atlantic Canada comprise ten different types, including complex, as specified earlier. Using the decision-tree-based approach, building footprints that have more than six planes and 13 edges have been classified as complex. These thresholds were obtained by trial and error according to the structure of buildings in Eastern Canada (Table 3).
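The thresholding rule amounts to a one-line test per building. The sketch below assumes both thresholds must be exceeded (one reading of the decision tree); the default values are the trial-and-error thresholds stated above.

```python
def is_complex_roof(n_planes, n_edges, max_planes=6, max_edges=13):
    """Flag a building as complex (routed to manual 3D reconstruction)
    when it has more than six roof planes and more than 13 footprint
    edges, per the decision-tree thresholds."""
    return n_planes > max_planes and n_edges > max_edges
```

Buildings passing this test are excluded from the model-driven pipeline; the rest proceed to the nine-class fusion network.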
After recognizing complex roof types, we created two baseline deep networks to recognize non-complex building roof types (see Figure 4). The first network was based on high-resolution orthoimage data, and the second on digital surface model data. To extract meaningful RGB optical and height features of each building, we tested three pre-trained versions of ResNet, ResNet-18, ResNet-50, and ResNet-101, as the backbone of our baseline networks. We were able to effectively transfer the knowledge gained from training on the ImageNet dataset to our baseline models, resulting in better accuracy for our specific classification tasks. Each network was trained for 200 epochs.
In the next step, we saved the weight parameters of the two baseline DL models and later used them as initial weights for our proposed deep multimodal feature fusion network. This approach allowed us to leverage the preexisting knowledge of the baseline networks trained on the roof type dataset, which is more relevant to our specific task of classifying roof types, and helped to ensure that the initial parameters of our fusion network were related to the target domain.
Next, we extracted two sets of descriptors from these two DL baseline networks. These descriptors are concatenated together using the proposed multimodal feature fusion network, and the fused descriptors are fed into the SoftMax classifier head. This deep feature fusion approach enabled the generation of more robust features (Dai et al. 2021).
We fine-tuned the proposed network to adapt the initial weight parameters from the previous step to the proposed classification network. For fine-tuning, all layers except the last fully connected layer were frozen, and the network was trained for 100 epochs. Thus, the linear classifier was trained from scratch. Then, all layers were unfrozen, the network was trained for another 100 epochs, and the RGB optical and height features of each building were extracted.
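The two-stage schedule can be expressed as a small plan of per-layer trainable flags. This is a framework-agnostic sketch; the layer names are illustrative, not the actual ResNet module names.

```python
def two_stage_finetune_plan(layer_names, head_epochs=100, full_epochs=100):
    """Two-stage fine-tuning schedule: stage 1 freezes everything except
    the classifier head for `head_epochs`; stage 2 unfreezes all layers
    for `full_epochs`. Returns [(trainable_flags, epochs), ...]."""
    stage1 = ({name: name == "head" for name in layer_names}, head_epochs)
    stage2 = ({name: True for name in layer_names}, full_epochs)
    return [stage1, stage2]
```

In a deep learning framework, the flags would be applied by toggling each layer's parameter gradients before the corresponding training stage.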
To find an optimal learning rate, we trained the network with a range of learning rates, including 0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001, each for just four epochs. The learning rate that results in the minimum loss is selected as the starting point, and we then fine-tune the learning rate by exploring three decimal places below and above the chosen value. The one that yields the minimum loss is chosen as the optimal learning rate. The learning rates of these networks are shown in Table 4.
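The coarse stage of this search amounts to picking the learning rate whose short training run produced the smallest loss; the loss values in the usage note below are made up for illustration.

```python
def pick_learning_rate(losses_by_lr):
    """Return the learning rate with the lowest short-run (few-epoch)
    loss; the paper then refines by exploring values around this
    starting point."""
    return min(losses_by_lr, key=losses_by_lr.get)
```

For example, `pick_learning_rate({0.1: 2.3, 0.01: 0.9, 0.001: 0.4, 0.0001: 0.6})` selects 0.001, around which the finer sweep is then performed.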
We employed two scenarios to ensure the reliability of the baseline networks and the proposed method. In the first scenario, we trained the method on buildings in Fredericton, New Brunswick, Canada, and tested it in Moncton, the largest city in the province. We conducted a second scenario to further validate the proposed method's accuracy. In this scenario, we trained the networks on buildings in Moncton and tested them on buildings in Fredericton. Table 5 demonstrates the number of training, validation, and test data samples for each scenario (individual building roof types).
To quantify the performance of the proposed network for roof type classification, we used two metrics, namely Overall Accuracy (OA) and the Kappa coefficient. The Overall Accuracy represents the proportion of correctly classified test samples to all the test samples, and the Kappa coefficient determines the agreement between two raters. The formulas for the overall accuracy and Kappa coefficient are presented in Equations (1) and (2).

OA = (Σ correctly classified roof types / total number of buildings) × 100 (1)

Kappa = (P_o - P_e) / (1 - P_e) (2)

where P_o is the relative observed agreement among raters and P_e is the hypothetical probability of chance agreement, calculated by using the observed data to estimate the probabilities of each observer randomly seeing each category (Cohen 1960).
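Both metrics can be computed directly from predicted and reference labels; the NumPy sketch below is a straightforward implementation of Equations (1) and (2).

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Equation (1): percentage of correctly classified samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean() * 100.0

def kappa(y_true, y_pred):
    """Equation (2): Cohen's Kappa from observed agreement P_o and
    chance agreement P_e (product of per-class marginals)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_o = (y_true == y_pred).mean()
    classes = np.union1d(y_true, y_pred)
    p_e = sum((y_true == c).mean() * (y_pred == c).mean() for c in classes)
    return (p_o - p_e) / (1.0 - p_e)
```

Here `y_true` holds the reference roof-type labels and `y_pred` the network's predictions for the test buildings.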
Phase 2: Model-driven 3D city model reconstruction
After classifying the roof types and extracting building attributes such as eave height and ridge height, we used this information in the City Engine program through the ESRI CGA rule for 3D representation purposes. Next, we assigned the extracted roof types to the corresponding buildings in the model, utilizing the ESRI buildings library. Since the library does not contain any models for complex and cross-hip roofs, we postponed modeling those to another study. Therefore, an LoD2 3D city model of flat, gable, hip, cross-gable, pyramid, gambrel, and Dutch roof buildings was created for the cities of Moncton and Fredericton, using a model-driven approach.
It is essential to assess whether the proposed LoD2 3D city modeling method performs properly on the eave height, ridge height, and roof type detection tasks; thus, we needed to assess the final 3D model. The accuracy of the final 3D model depends on the accuracy of the roof type classification and building decomposition steps. While CityGML 3.0 does not prescribe any fixed values, according to the CityGML 2.0 standard, the geometric error of 3D models should not exceed two meters.
To evaluate the accuracy of the final 3D model, we used the digital surface model as the ground truth and calculated the root mean square error (RMSE) (Chai and Draxler 2014) between each 3D building model and the DSM. The formula for the RMSE is presented in Equation (3).

RMSE = sqrt( (1/N) × Σ_{i=1..N} (x_i - x̂_i)² ) (3)

where N is the number of buildings, x_i is the DSM-derived height of building i, and x̂_i is the height of the 3D model of building i.
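Equation (3) in code, as a direct NumPy implementation over per-building heights:

```python
import numpy as np

def rmse(dsm_heights, model_heights):
    """Root mean square error between DSM-derived building heights and
    the heights of the fitted 3D models (Equation 3)."""
    diff = np.asarray(dsm_heights) - np.asarray(model_heights)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Evaluated over all reconstructed buildings, this yields the reported city-wide RMSE; restricting the inputs to one roof class yields the per-class values.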

Result and discussion
We report the results of each phase and discuss them in the following sections.

Roof type classification
With the limited roof type training samples in existing datasets (up to seven classes, to the best of our knowledge), DL methods cannot satisfactorily classify building roof types in our study area. Therefore, we created a roof type dataset based on the nine common roof types existing in New Brunswick. This dataset can be used for further CNN-based classification purposes and is published through the University of New Brunswick library (https://www.unb.ca/).
The quality indices for the RGB optical-based and DSM (height)-based baseline deep learning networks in each scenario are presented in Table 6. The number following each ResNet name indicates its number of layers.
In the first scenario, the OA of the proposed network with ResNet-18, ResNet-50, and ResNet-101 is 91.96%, 97.03%, and 97.58%, respectively, while the Kappa coefficient for the same network is 0.9030, 0.9639, and 0.9705 for ResNet-18, ResNet-50, and ResNet-101, respectively. In addition, the overall accuracy of the first DL network on the RGB optical dataset with ResNet-18, ResNet-50, and ResNet-101 is 89.54%, 91.74%, and 92.62%, respectively, while the Kappa coefficient for the same network is 0.8727, 0.8996, and 0.9102, respectively. The results also show that the OA of the DSM (height)-based network with a CNN body of ResNet-18, 50, and 101 is 90%, 94.96%, and 95.26%, respectively, while the Kappa coefficient for the same network is 0.8430, 0.9533, and 0.9426, respectively.
The proposed method employing ResNet-101 as the CNN body demonstrated a significant increase of approximately 5% in OA and 0.06 in Kappa coefficient compared to using only RGB optical images. When compared to the height (DSM)-based network, the proposed method shows improvements of 2.3% in OA and 0.03 in Kappa metrics. Further, using ResNet-50 and ResNet-18 bodies enhanced the OA by approximately 6% and 2%, respectively, compared to the RGB optical networks, along with a Kappa enhancement of approximately 0.06 and 0.03. Additionally, the proposed method using these two bodies improved the accuracy of the height-based network, resulting in an increase in OA for ResNet-50 and ResNet-18 of approximately 3% and 2%, respectively, and an improvement in Kappa metrics of approximately 0.01 and 0.06 for the ResNet-50 and ResNet-18 bodies, respectively.
In the second scenario, the proposed network using ResNet-18 achieved an OA of 91.70% and a Kappa coefficient of 0.8980, while ResNet-50 achieved an OA of 95.02% and a Kappa coefficient of 0.9387. They were outperformed by ResNet-101, which had an OA of 96.30% and a Kappa coefficient of 0.9544. On the RGB optical dataset, the overall accuracy of the first DL network with ResNet-18, ResNet-50, and ResNet-101 was 82.35%, 84.04%, and 87.10%, respectively, while the corresponding Kappa coefficients were 0.7986, 0.8039, and 0.8416. Moreover, using the DSM (height)-based dataset, ResNet-18 achieved an OA of 87.66% and a Kappa coefficient of 0.8532, ResNet-50 an OA of 88.59% and a Kappa coefficient of 0.8616, and ResNet-101 an OA of 94.89% and a Kappa coefficient of 0.9371.
We observed a similar pattern in the second scenario. The results show that the OA of the proposed method utilizing ResNet-18, ResNet-50, and ResNet-101 increased by around 9%, 11%, and 9%, respectively, compared to the RGB optical network, and by 4%, 7%, and 2% compared to the height (DSM)-based network. In addition, the Kappa metrics of the proposed method improved by around 0.10, 0.13, and 0.11 for the ResNet-18, 50, and 101 bodies compared to the RGB optical-based network, and by 0.04, 0.07, and 0.02 compared to the DSM (height)-based network.
Based on the interpretation of these numbers, presented in Table 6, the results can be categorized into three main aspects. Figure 7 shows the confusion matrices of the proposed network. As the figure illustrates, many of the roof types misclassified with ResNet-18 are cross-hip and gable, which are confused with hip and flat, respectively. The structures of these roof types share similarities, as shown in Figure 8. Hip and cross-hip roofs, for instance, have similar edges, and even flat roofs can have parts at different heights, similar to gable roofs. The ResNet-18 backbone could not classify them accurately, whereas both ResNet-50 and ResNet-101 achieved significantly higher accuracy on these two classes. This result indicates that deeper networks have higher resolving power for similar roof types.
As shown in Figure 9, the DSM (height)-based network performs better than the RGB optical-based network in most classes but struggles with certain roof types such as hip or Dutch gable. The RGB optical-based network, on the other hand, performs better on these classes. Therefore, combining each building's RGB optical and DSM (height) features leads to improved classification accuracy.

LoD2 3D reconstruction
The final 3D city model is generated from the roof type classes extracted by the proposed multimodal method. Snapshots of the LoD2 3D model of the city of Moncton with these different roof types are displayed in Figure 10.
The RMSE of the final 3D model and of each roof type is reported in Table 7. As shown in the table, the RMSE of the large-scale 3D city model is 1.03 meters, and the RMSEs of individual roof types such as flat, gable, hip, cross-gable, pyramid, gambrel, and Dutch gable range from 1.02 to 1.86 meters. According to the CityGML 3.0 standard, the accuracy of a 3D city model should be better than 2 meters. The results demonstrate high accuracy for the presented model on 1208 buildings, though its performance varies across roof types (Table 7). Furthermore, to better illustrate how the model correlates with the LiDAR point cloud, the overlay of the LiDAR point cloud on the 3D city model is shown in Figure 11, together with the Google Earth 3D representation of the same building for comparison.
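The per-building RMSE reported above compares LiDAR point heights against the reconstructed roof surface. A minimal sketch of the vertical RMSE computation, assuming the model surface has already been sampled at the LiDAR points' (x, y) locations (a simplifying assumption for illustration):

```python
import numpy as np

def vertical_rmse(lidar_z, model_z):
    """RMSE between LiDAR point heights and model surface heights
    sampled at the same (x, y) locations."""
    lidar_z = np.asarray(lidar_z, dtype=float)
    model_z = np.asarray(model_z, dtype=float)
    return float(np.sqrt(np.mean((lidar_z - model_z) ** 2)))

# Toy example: model eave at 5.0 m, LiDAR heights scattered around it.
lidar = np.array([5.2, 4.9, 5.1, 4.8])
model = np.full(4, 5.0)
print(f"RMSE = {vertical_rmse(lidar, model):.3f} m")  # RMSE = 0.158 m
```

Because squared errors are averaged before the square root, a few poorly fitted complex roofs can dominate a building's RMSE, which is consistent with the per-class variation seen in Table 7.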
While the results demonstrate high accuracy in generating a 3D city model, the proposed method still has a limitation regarding complex roof types: buildings with complex structures could not be modeled using a model-driven approach. In future studies, we will develop a hybrid method to reconstruct 3D buildings with complex roof types.

Conclusion
In this study, we proposed a multimodal feature fusion deep learning network for classifying roof types into nine standard roof classes and constructing a large-scale LoD2 3D city model. The methodology takes high-resolution orthophotos, LiDAR point cloud data, and a building footprint layer as input and follows two phases to create the large-scale 3D model. In the first phase, a decision tree-based method recognizes complex roof types; the roof type of each building is then classified by fusing RGB optical and DSM (height) features through a deep multimodal feature fusion network, whose initial parameters come from the DSM (height)-based and RGB optical-based baseline networks. In the second phase, a 3D model is fitted to each footprint using the building's roof information, such as roof type and eave and ridge heights. As shown in this paper, our roof type classification network confirmed that utilizing both the RGB optical and height features of each building improves roof type classification accuracy, thereby enhancing the overall accuracy of 3D building reconstruction.

Figure 1 .
Figure 1. Preprocessing and preparing the training and testing datasets.
selected and refitted to a plane, until a satisfactory number of planes is found that explains the majority of the data. Once the RANSAC algorithm has identified multiple planes, it can be used to classify and assign the remaining points in the point cloud to those planes (Derpanis 2010; Zeineldin and El-Fishawy 2017).

Pretraining baseline networks
Residual neural network (ResNet) is a CNN architecture used for image classification, designed to solve the vanishing gradient problem, in which the gradient becomes vanishingly small during backpropagation due to sequential multiplication. ResNet solves
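The RANSAC plane extraction described above can be sketched as follows for a single plane; the distance threshold, iteration count, and toy data are illustrative assumptions, and extracting multiple roof facets would repeat this after removing each plane's inliers:

```python
import numpy as np

def fit_plane(pts):
    """Least-squares plane through 3+ points: unit normal n and offset d
    such that n . p + d = 0 for points p on the plane."""
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]                         # normal = smallest singular vector
    return n, -n @ centroid

def ransac_plane(points, threshold=0.15, iters=200, seed=0):
    """Single RANSAC plane: sample 3 points, fit, count inliers, keep best."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n, d = fit_plane(sample)
        dist = np.abs(points @ n + d)  # perpendicular distance to plane
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers for a more stable final estimate.
    return fit_plane(points[best_inliers]), best_inliers

# Toy roof facet: noisy points near the plane z = 0.5*x + 2.
rng = np.random.default_rng(1)
xy = rng.uniform(0, 10, size=(200, 2))
z = 0.5 * xy[:, 0] + 2 + rng.normal(0, 0.03, 200)
pts = np.column_stack([xy, z])
(_, _), inliers = ransac_plane(pts)
print(inliers.sum(), "of", len(pts), "points assigned to the plane")
```

Because the plane is re-estimated from all of its inliers at the end, a single noisy 3-point sample does not determine the final facet geometry.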

Figure 2 .
Figure 2. Overall workflow of the multimodal feature fusion network and 3D city model reconstruction.

Figure 3 .
Figure 3. Complex roof type distinguishing using a decision-tree based method.

Figure 4 .
Figure 4. Multimodal feature fusion framework. After generating two ImageNet-based pretrained ResNet networks and fine-tuning them on the RGB optical and height datasets in steps 1 and 2, the weight parameters of these two networks are fed to the proposed multimodal feature fusion network in step 3.

Figure 5 .
Figure 5. Flowchart of the eave height calculation process.
Figure 6. Snapshots of poor contrast or shadow areas in the RGB optical image.

Figure 7 .
Figure 7. Confusion matrices of the proposed method.

Figure 8 .
Figure 8. Most commonly misclassified roof types. Some cross-hip roofs have been misclassified as hip roofs; this figure shows the similarities between them. (a) Samples of misclassified cross-hip roofs. (b) A sample of a hip roof.

Figure 10 .
Figure 10. Snapshots of the 3D city model of Moncton.

Table 1 .
Specification of the data used in this study.

Table 2 .
RGB optical and DSM sample images of the roof types.

Table 3 .
Average number of edges and planes for each building roof type.

Table 4 .
Learning rate and epoch values.

Table 5 .
Training, testing and validation datasets.

Table 6 .
Overall accuracy and Kappa coefficient of the proposed method.

Table 7 .
RMSE result of the 3D city model of Moncton.