Classification of the qilou (arcade building) using a robust image processing framework based on the Faster R-CNN with ResNet50

ABSTRACT Qilou (arcade building) is a particular type of Chinese historical architecture combined with western and eastern building elements, which plays a significant role in the history of modern Chinese architecture. However, the recognition and classification of the qilou mainly rely on manual inspection, suppressing the cultural dissemination and protection of qilou relics. In this paper, we present a new framework that adopts multiple image processing algorithms and a deep learning network to automate qilou classification. First, image dataset of the qilou is enhanced based on the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm. Then, an improved Faster R-CNN with ResNet50 (Faster R-CNN-R) is deployed for qilou image recognition. A total of 760 images captured in Guangzhou were used for training, validation, and accuracy check of the proposed framework and several contrastive networks under the same conditions. Compared to other networks, the proposed framework works better than Faster R-CNN with VGG16 (Faster R-CNN-V) and FCOS. The accuracy of the proposed framework embedded with the Faster R-CNN-R, Faster R-CNN-V, and FCOS are 80.12%, 65.17%, and 66.35%, respectively. Based on digital images captured under different lighting conditions, the proposed framework can be used to classify nine different types of qilous, with high robustness.


Introduction
Qilou, also known as arcade, arcade building, or veranda in some literature, is a common type of historical architecture found in Southeast China, especially in the Guangdong Province (Lin 2006).Qilou has a colonialveranda style combined with the European and South-East Asian styles (Chen and Wang 2013;Chen et al. 2011).A qilou has more than two floors: the first floor is a commercial unit with sidewalks front, and the residential area is overhang (Zhang 2015).Qilou was named differently in Asia until its official nomenclature was released in 1912 (Li 2013).It had various historical backgrounds during the 1920s-1930s, but a decline afterwards (Li 2018).With more measures brought by the Guangdong provincial government in China since 1998, the qilou was being protected formally for cultural heritage purposes (Li 2018).
With state-of-the-art sensing and artificial intelligence technology, parts of the conservation of qilou can be achieved digitally and automatically.For example, the structure of the qilou can be recorded by using digital photogrammetry and laser scanners (Lerma et al. 2010;Liang et al. 2018).As shown in Figure 1, the details of the qilou blocks, including façades consisting of pediments, parapets, decorations, and balconies, are captured with cameras.The qilou block's design patterns and building types can be analyzed by inspecting the photo, which is useful for qilou's conservations and documentations (Yilmaz et al. 2007).However, manual inspection of qilou photos is labour intensive and time consuming (Lin 2002(Lin , 2006)).Besides, qilou façade style classification may be misclassified even by experienced experts (Lin 2006;Yang, Hu, and Pan 2012).As a result, a convenient and robust recognition framework for the qilou façade style is always desired to document and thus conserve qilou.
To recognize qilou styles, it could be possible to employ the Computer Vision (CV) algorithm to identify radiometric and geometric features for heritage purposes (Pintus et al. 2016).In recent years, Machine Learning (ML) methods and Deep Learning (DL) methods have been used to detect features and structures of buildings (Maskeliūnas et al. 2022;Xia et al. 2020;Xu et al. 2014;Zhao et al. 2018).DL was also broadly used for building structure monitoring, such as crack detection, corruptions, and delamination detected by CONTACT Ting On Chan chantingon@mail.sysu.edu.cnSchool of Geography and Planning, Sun Yat-sen University, 135 Xigangxi Road, Haizhu District, Guangzhou, China Convolutional Neural Network based (CNN-based) methods (Beckman, Polyzois, and Cha 2019;Cha et al. 2018;Cha, Choi, and Büyüköztürk 2017).For cultural heritage purposes, the CV, ML, and DL methods were used for historical building classification (Fiorucci et al. 2020;Llamas et al. 2017).The CNN-based methods were proven to be robust in feature recognition and building classification (Llamas et al. 2016;Wang et al. 2022;Ye 2022;Yi, Zhang, and Myung 2020).However, these methods focused on recognizing distinctive styles on decorations adhered to building façades, rather than the entire façades.Since the qilou can be broadly classified into more than nine architectural styles, manual classification is not trivial.The CNNbased networks can be adapted to reduce manual intervention for cultural heritage and other related studies.
In recent years, many CNN-based deep learning networks such as Faster Regions with CNN features (Faster R-CNN), You Only Look Once (YOLO), and Single Shot Multibox Detector (SSD) have become more popular for building classification (Liu et al. 2016;Redmon et al. 2016;Ren et al. 2015).These networks focus on different strategies in feature extraction.For example, the Residual Network (ResNet) or Visual Geometry Group (VGG) is used in Faster R-CNN, and Darknet or CSP Darknet is used for YOLO.Faster R-CNN is a two-stage network that uses a more accurate strategy for generating proposals.The two-stage network has lower speed in training and is more complex than the one-stage network (Huang et al. 2017).The performance in classification varies with the building styles (Chen et al. 2022;Jiang, Noguchi, and Ahamed 2022;Yang et al. 2022).
In this paper, we propose a framework based on the Faster R-CNN along with the backbone of ResNet50 (Faster R-CNN-R) to classify qilou's styles accurately using façade images captured by digital cameras.The framework involves a preprocessing stage that improves the capability of the Faster R-CNN and also the accuracy of feature detection and style classification, with recourse to the CLAHE algorithm.The CLAHE reduces the impact caused by different lighting conditions during image capturing, so that the features can be more accurately extracted from the images for training the networks.Compared to the NASNet (Yoshimura et al. 2019) and SB-GoogleNet (Obeso et al. 2017), which classify images of regular buildings, the proposed framework focused only on qilou image classification with an improved feature extraction and enhancement method.A recent study using Faster R-CNN with the tubularity flow field (TuFF) and the distance transform method (DTM) algorithms for crack detection and segmentation had achieved a high level of accuracy (Kang et al. 2020), inspiring our work for the qilou classification.It has a structure that builds more network layers to process the classified results.
Several other networks, the Faster R-CNN with the backbone of VGG16 (Faster R-CNN-V) and the Fully Convolutional One-Stage Object Detector (FCOS) are compared with the proposed framework.The paper has been organized as follows: Section 2 describes the study area and the qilou style for the classification, Section 3 presents the framework's main components, Section 4 delineates the experiment.Section 5 describes the classification results, and Sections 6 and 7 are the discussion and conclusions, respectively.

Study area
As our chosen area, Guangzhou has the largest number and styles of the qilou in China.According to statistics, thousands of the qilous exist in Guangzhou, with 10-14 architectural styles.Those qilous are mainly gathered within Haizhu, Yuexiu, and Liwan Districts (bounded by blue lines), as shown in Figure 2.

Styles of the qilou
Among all the qilou styles, the Veranda style combined with the Chinese style of qilou first appeared in the Guangdong Province of China, having diffused and independent architectural styles.According to Lin (Lin 2006), there were two main styles of qilou characterized by their façade, including the Early Modern style and Modern style.Then, five medium categories of qilou styles were defined as the Traditional Chinese style, Western style, Chinese-Western style, Chicago School style, and Modern style, comprising the two main styles and nine categories illustrated below.Similar to Lin, Yang (Yang, Hu, and Pan 2012) classified the qilous into seven styles: Ancient Greek style, Ancient Roman style, Gothic style, Baroque style, Nanyang style, Traditional Chinese style, and Modern style.Styles of the qilou can be recognized and classified based on different features in each section described in Figure 3. Pediments and parapets in the upper section can be for recognition of the Nanyang style.In the middle section, columns can be found in the Classicism style, and windows are exceptionally decorated in the Gothic style.The lower section can be inspected to distinguish the Ancient Roman style.As for the Baroque style, features such as balconies and pediments appear in both the upper and middle sections.Nevertheless, the proposed framework mainly deals with the nine styles of qilou.Those styles are introduced in Sections 2.2.1 to 2.2.5, and their features are summarized in Table 1.

Traditional Chinese style
The Traditional Chinese style can be further classified as the Chinese-palace and Chinese-dwelling styles.The unique feature of the Chinese-palace style is that a Chinese-palace roof is prominent in the qilou's upper section.As for the Chinese-dwelling style, red or grey bricks on the qilou façades can be identified as one of the features, and their roof is often decorated as an overhanging or flush gable roof (Lin 2006).

Western style
The Western styles of qilou can be further categorized as the Gothic, Renaissance, Baroque, and Classic styles, depending on their architectural features.The Gothic style of qilou mostly lies in narrow walls, which are unique features in qilou façades (Smith 1890).The Renaissance style emphasizes symmetry, proportion, geometry, and the regularity of the qilou's components, as demonstrated in the architecture of classical antiquity and ancient Roman architecture (Anderson 2013).The Renaissance styles of qilou often have classical orders combined with ancient Roman styles, which show in the structure of the façade.Besides, the Baroque style is a highly decorative and theatrical style that first appeared in Italy in the early seventeenth century and gradually spread across Europe (Martin 2018).Baroque style is usually shown in the structure of pediments, balconies, and columns in qilou (Yang, Hu, and Pan 2012).Lastly, the Classicism style of qilou is often accompanied by large columns between the second and third floors, which shows elegance for Classicism (Babinovich and Sitnikova 2021).• The roof is like a Chinese palace.
• Some Chinese palace elements are on top.

Chinese-dwelling Style
• Red or grey bricks.
• Overhanging gable roof or flush gable roof on top.
• Nearly no artistic decoration.

Gothic Style
• Narrow walls between windows.
• Pinnacle on top.
• Elaborate flying buttresses and arch windows if the building is for commercial.

Renaissance Style
• Classical order is used in columns, blank windows, decorations, and so on.
• Arches and triangles appear in parapets and pediments

Baroque Style
• Baroque style on upper sections.
• Balconies, columns, and some decorations on façades can be partially Baroque.

Classicism Style
• Columns in the middle often have two storeys in height.
• Three sections are often very distinguishable.
• Strong geometry could be found on the façade.

Nanyang Style
• One or multiple (circular) holes on parapets or pediments.
• Well ventilation design.

Chinese-Western style
The Nanyang style is predominantly the Chinese-Western (Li and Ying 2021).This type of qilou is usually two or three floors in height.A unique feature of the Nanyang style is that one or multiple (circular) holes are carved on the parapets and pediments to allow wind to pass through, reducing typhoons' influence (Yang, Hu, and Pan 2012).The shape of these holes is not restricted, they can be arbitrary.Circular shape is almost the most popular though.

Chicago school style
Chicago's architecture has a representative style, namely, the "Chicago School" (Condit 1964).When the qilou became popular in the 1930s, the Chicago School style was learned and replicated as the basic type.The structure of Chicago windows may help distinguish them from other styles.Besides, the Chicago School style qilou is mostly the tallest among all other types of qilou.Most of the Chicago School style qilou are hotels, such as the Oi Kwan Hotel and the East Asia Hotel in Guangzhou (Lin 2006).

Modern style
The Modern style of qilou is the most commonly found nowadays as most of them are built in a more recent period (e.g., within 30-40 years).These qilou façades are as concise as other modern architecture, with rare decorations and similar structures (Zhao 2020).Moreover, the Modern style is the most popular qilou style in Guangzhou.

Methods
The proposed method improved the qilou classification accuracy by blending CLAHE-RGB and Faster CNN with ResNet50, inspired by the RCNN network previously developed for structural monitoring (Kang

Preprocessing methods
The CLAHE method (Hitam et al. 2013) is mainly used to reduce the contrast between the images as the images are contaminated with shadows to different extents.This method is robust and straightforward for the masked images.To handle RGB images, a CLAHE-based method, named CLAHE-RGB, is implemented in this paper.The main steps are as follows: • Preprocess the input RGB images.First, normalize all images into either portrait or landscape format (e.g., 4000 × 3000 or 6000 × 4000 pixels).Then, split images into multiple small pieces (e.g., 8 × 8 pixels for each piece) under each color channel.

Chicago School Style
• Iconic Chicago windows.
• Height can go up to 64 m, about 15 storeys.

Modern Style
• Beams and columns are often integrated.
• Rare or no decorations.
• No height limit was found.
called ClipLimit.As shown in Figure 5, the underneath part of the gray value (GV) will be evenly expanded along the count-axis and used to translate the entire histogram upward.The ClipLimit will be recalculated cyclically for each piece using the following equation: Where S represents the square of each piece.The initial value of ClipLimit is set as 2 based on empirical knowledge and will be recalculated after each clip until it equals 2.
After the histogram balances the gray value of each piece, combine all pieces into a complete image using the Bilinear Interpolation to smooth the gray value between two pieces in each color channel (Castleman 1996).The Bilinear Interpolation in CLAHE method can be implemented as follows: where f (x, y) is the gray value of the target pixel, and its coordination is (x, y).f represent the gray value of center pixel in lower left piece, upper left piece, lower right piece, upper right piece, respectively.The coordinations of Q 11 , Q 12 , Q 21 , Q 22 are (x 1 , y 1 ), (x 1 , y 2 ), (x 2 , y 1 ), (x 2 , y 2 ), respectively.With the bilinear interpolation, each image will be restored to their original pixel shape.
Finally, in order to equalize the pixels of each photo (to 6000 × 4000 pixels), the Bicubic Interpolation method (Keys 1981) is used to improve the quality of those images.Similar to Bilinear Interpolation, the color value of the target pixel is calculated by color values of pixels around, but 16 pixels in Bicubic Interpolation.The Bicubic Interpolation proves a better image processing method in upscaling an image, but it requires higher computational power (Han 2013).

Faster R-CNN algorithm
Due to the numerous features and noise in qilou façade images, state-of-art CV methods embedded with DL and ML are used to achieve high accuracy classification.A CNN-based method can be trained to  recognize features by leveraging their robustness in handling images with complex information.Furthermore, the effectiveness of CNN-based method in building classification has been well established in the recent decade (Abed, Al-Asfoor, and Hussain 2020; Ćosović and Janković 2020).We utilize the Faster R-CNN (Ren et al. 2015) for qilou façade image classification for this work.
The Faster R-CNN is a two-stage network combined with Region Proposal Network (RPN) and Fast R-CNN (Girshick 2015).The flowchart for the adapted network in our framework is shown in Figure 6.In this research, the Faster R-CNN algorithm can be divided into four parts: feature extraction layers, RPN, ROI pooling layers, and classification layers (Liu, Zhao, and Sun 2017).
To ensure that most of the qilou's façade features are detected and classified, the feature extraction layer is indispensable.In this regard, a Residual Network with 50 layers (ResNet50) model is employed to extract features from qilou facade images.ResNet50 is a deep convolutional neural network that incorporates residual connections, as depicted in Figure 6.This structure enables ResNet to learn from both the input layer and output layer, instead of just from the output layer, resulting in a more comprehensive representation of qilou façade images.For qilou classification, since qilou's features in each style are complicated, the ResNet50 model is better equipped to extract meaningful information from qilou images.Besides, the ResNet50 model has a better performance compared with VGG16 model in Faster R-CNN's feature extraction layers (Theckedath and Sedamkar 2020;Wen, Li, and Gao 2019).
Within the proposed framework, the Faster R-CNN-V is used as follows.Firstly, all the images are resized to 900 × 600, the ResNet50 model is used to extract features in images.Secondly, the feature map's data flows to RPN for a better proposal.After the data passes the 3 × 3 convolution layer in RPN, it calculates bounding box regression by using a 1 × 1 convolution layer while using softmax (a type of activation function) and reshape features for positive or negative classifications.Results of the RPN and image information ("im info" in Figure 6, denoting the ratio resolution between the resized image and the original image) decide the proposal layer.The feature maps and proposals are combined for the ROI (region of interest) pooling that will evaluate the features of targets in the images and resize them to the same shape.After the fully connected layers (FCs in Figure 6), the classification and regression are performed to calculate the result and label the best suitable area according to the confidence percentage.

Data acquisition
About 1000 qilou images were captured in the study area described in Section 2.1 in May 2022, with different shot distances spanning from 10 to 20 m, using a NIKON D5600 camera (Figure 7a) with a photograph size of 6000 × 4000 pixels and a Samsung Galaxy S21 Ultra cell phone (Figure 7b) of 4000 × 3000 pixels.

Data processing
About 760 images were selected randomly as training and validation datasets from all the collected images.A total of 684 images were used as the training dataset, while 76 images were used for validation.The experiment was conducted on a personal computer with an AMD core Ryzen 9 5950X and an NVIDIA RTX 3070 (8GB), having the CUDA 11.3 and CUDNN 10.8 for the software environment.
For labeling a qilou image, façades of qilou were marked on a web page called "MakeSense" developed by Piotr (Skalski 2019).Whole façades of all single qilou in images were labeled as boxes and saved as text or an XML file.Different networks might need a different label format.Transformations among these networks were required in this work.
After labeling the images, a dataset of qilou façades was fully prepared.The Faster R-CNN with VGG16 and FCOS was chosen to compare with Faster R-CNN-R.The FCOS network is a one-stage anchor-free network (Tian et al. 2019).The anchor box of FCOS is provided by a point and four distances predicted by FCOS which are between boundaries and the point.The ResNet50 was used in FCOS backbone in this study for comparison.The VGG model is a fully convolutional network with multiple layers (Simonyan and Zisserman 2014).Faster R-CNN was first proposed with the backbone of VGG16.Then, the VGG model was replaced by ResNet, a deep residual network, which mostly worked well in different situations (He et al. 2016).In this study, we provided the backbone of VGG16 in Faster R-CNN for another experiments.
The implement of the Faster R-CNN and FCOS was based on the famous open-source framework for DL called Pytorch, which is operated as a Python project for users (Paszke et al. 2019).The optimizer of the Faster R-CNN was Adam, and the batch size was set as 4. The learning rate of the Faster R-CNN equalled to 10 −4 , which would go down by using Cosine Annealing Learning Rate.The parameters are set based on empirical knowledge and preliminary testing (depending on the capacity of the computer's RAM).The same parameters were used in Faster R-CNN-V and FCOS, as well as their framework with CLAHE method.

Evaluations
In order to measure the performance of training in DL classification tasks, series evaluation metrics, such as Precision, Recall, Average Precision (AP), and mean Average Precision (mAP), are used in this research.The Precision can evaluate the probability among all detected positive predictions (including true positive and false positive), while Recall reflects the probability of true positives among all actual positive instances (including true positive and false negative).Higher Precision and Recall imply that the models occupied in this task perform better, which may lead to an increment in the overall performance.These evaluation metrics (Precision, Recall, AP, and mAP) are, respectively, defined as follows: where TP (True Positive) is the number of predictions that accurately recognize the ground truths.TN (True Negative) represents predictions that are true but do not belong to this variation.FP (False Positive) is the predictions that are not true but are shown in pictures.FN (False Negative) is denoting that the network cannot identify the ground truth.AP (Average Precision) is an evaluation indicator to the single class prediction, calculated by precision and recall, which can be directly visualized by fitting the Precision-Recall Curve (P-R Curve).P(k) is the precision, and R(k) is the recall.mAP (mean Average Precision) is the valuation of a network, calculated by using average precision.
After the evaluations, a group of test images that do not belong to any of the training data is used to examine the accuracy of each framework.The accuracy in this part of the test is calculated as:

Image preprocessing
The image contrast improvement for the original images can be realized in Figures 8 and 9 after applying the CLAHE-RGB method.Figure 8(a,b) show that a very high GV is caused by sunlight and shadow in the histogram, corresponding to the black shadow appearing in the image.After using the CLAHE-RGB method, Figure 8(c,d) show that the shadow and sunlight are equalized, and the color transition between pixels becomes more natural for the visualization.The chiaroscuro caused by light radiation is weakened.Differences in the contrast in images captured on cloudy days and sunny days were reduced.Similar results can be found in Figure 9.

Results for the neural networks
The networks used in this research were trained 100 epochs, respectively; 50 epochs were frozen, and 50 epochs were unfrozen.Training loss of each network is shown in Figure 10.It can be seen that the "loss curves" of all networks are lower than one after 100 epochs, which represents well training in each network.The reason for the existence of the "loss curve" is that there is a small difference found between the same network with different preprocessing methods used by the common structures.The FCOS network's losses are different from the Faster R-CNN network.FCOS had a lower loss in the beginning but higher in the end.These results show that the loss is mainly determined by the network, then the backbone, after that preprocessing.
Precisions and recalls can be combined and shown in a graph, namely, the Precision-Recall Curve (P-R Curve).The P-R curve is a vital chart describing the relationship between precisions and recalls, and calculating average precision.Average precision is usually calculated by using the area beneath the P-R curve.The P-R curves of the Faster R-CNN-R, Faster R-CNN-V, and FCOS are compared in Figure 11.It can be seen that P-R curve lines in the network with CLAHE-RGB method were more than lines in network without preprocessing.Three networks' P-R curves seem quite the same, but a few various curves are different.
Meanwhile, the corresponding AP/mAP for each network is shown in Table 2.The table shows that the framework combined with the CLAHE-RGB method and the Faster R-CNN-R has the best performance in classification, as the mAP is the highest.The mAP of Faster R-CNN with CLAHE-RGB seems inferior compared to Faster R-CNN mAP in UA-DETRAC datasets (58.45%).The mAP comparison between Faster R-CNN with ResNet50 and VGG16 shows that the   R-CNN-R and FCOS perform well in Chinese-palace style with CLAHE-RGB, then Faster R-CNN-V with preprocessing method.It seems suitable for Faster R-CNN with CLAHE-RGB for detecting Classicism style and FCOS with CLAHE-RGB method in Gothic style according to Figure 12(e,f).In Figure 12(g-i), Faster R-CNN-V framework has larger area in Modern style, Nanyang style, and Renaissance style.
Both FCOS and Faster R-CNN-R framework have their own merits in recognizing Modern style, Nanyang style, and Renaissance style.

Results of the overall framework
One hundred images from all the collected images were selected randomly to verify the factual accuracy of style classifications.As can be seen in Table 3, the Faster R-CNN-R can detect qilou more than Faster R-CNN-V and FCOS with CLAHE-RGB method.The ResNet50 model has fewer misidentifications or misses than VGG16, which is a crucial factor in calculating accuracy.This is likely due to the fact that the sum of the false negatives and positives is much less than that of the Faster R-CNN-V.As for FCOS, with the same backbone as ResNet50, detection and classification of qilou façades performed mediocrely.The number of correctly recognized qilou in FCOS is less than Faster R-CNN, indicating that the classification of qilou façade  is challenging for the one-stage networks due to the complicated qilou's geometric features, but less misguidance was found in the results.
Figures 13 and 14 show the detections and classifications of qilou façades using different methods.In Figure 13, these four qilous are Modern style, Chinesedwelling style, and another two Modern styles (from left to right in Figure 13).As in Figure 14, all of them are Modern styles.It can be seen in Figures 13(a) and 14(a) that the styles of qilou can be recognized correctly, but two missed detections in 13(b) and a misidentification in 14(b).Though Faster R-CNN-V with CLAHE-RGB method can recognize correctly in Figures 13(c

Discussion
In this study, Faster R-CNN-R, Faster R-CNN-V, and FCOS were applied for qilou style classifications, and a preprocessing CLAHE-RGB method for enlighting the features of qilou was provided.The results demonstrated that networks did approve the accuracy with the CLAHE method, for the pediments, parapets, columns, and decorations were more specific than before.Faster R-CNN-R showed superior performance in qilou classification due to its deeper architecture and advantages in recognizing more complex features by using ResNet50.The two-stage mechanism of Faster R-CNN allowed it to create proposals of potential features on qilou façades, and then the second stage of network classified the matched features.Compared to the result of another Faster R-CNN applications, the accuracy of our results proved to be useful for qilou classification.It was also shown that Faster R-CNN algorithm could handle diverse features on single qilou façade, and higher accuracy than others (Xia et al. 2020;Yi, Zhang, and Myung 2020).The role of ResNet50 is for qilou feature extraction.ResNet50 is a model with unique residual blocks, which is more generalized compared to VGG model.However, qilou classification was more complicated than we thought.Some qilou façades were partly covered with pipelines, electric lines, or hidden behind woods, as Figure 15(a).Some façades of qilou in commercial streets were blocked by enormous advertising boards, as shown in Figure 15(b).If the height of the façade was higher than the bridge in certain streets, features were not able to be recorded on the front, a weird angle of photo was shot, as shown in Figure 15  (c,d); thus, the accuracy could not be guaranteed.

Conclusion
In this paper, we proposed a qilou image classification framework based on the Faster RCNN-R for which its input is first processed by the CLAHE-RGB algorithm that reduces the discrepancy of contrast for the images.The qilou styles were mainly classified into nine categories in Guangzhou, China.The detection results for the nine main styles of qilou from the digital images using the proposed framework integrated the CLAHE-RGB.For training and evaluation, the Faster R-CNN-R with CLAHE-RGB framework had less training loss and better mAP than other frameworks or networks.With the CLAHE-RGB image processing method, the mAP of Faster R-CNN-R, Faster R-CNN-V, and FCOS were 51.66%, 51.15%, and 47.30%, respectively.The mAP of Faster R-CNN-R, Faster R-CNN-V, and FCOS trained by original images was 18.46%, 21.51%, and 20.69%, respectively.
The test also investigated the effectiveness of the CLAHE-RGB preprocessing method and ResNet50 feature extraction model for qilou façade image detection using Faster R-CNN-R, Faster R-CNN-V, and FCOS.The results demonstrated a significant improvement in detection accuracy when employing the CLAHE-RGB method, with Faster R-CNN-R, Faster R-CNN-V, and FCOS achieving accuracies of 80.12%, 65.17%, and 66.35%, respectively.In contrast, without the application of CLAHE-RGB, the accuracies were notably lower at 62.24%, 53.65%, and 50.63%, respectively.These findings highlight the importance of the CLAHE-RGB preprocessing technique in enhancing the performance of the detection algorithms.The use of a deep residual network (e.g.ResNet50) as the backbone proved to be more effective than a deep convolutional network (e.g.VGG) in this research context.Furthermore, the study establishes the superiority of the Faster R-CNN-R algorithm in the proposed framework, emphasizing the benefits of a two-stage network over a one-stage network (e.g.FCOS) for qilou façade style classification.As a direction for future research, the exploration of qilou façade style classification using images or videos captured by Unmanned Aerial Vehicles (UAV) has the potential to attract widespread interest and further advance the field, as the UAV has been widely used for structural inspections (Ali et al. 2021;Kang and Cha 2018;Waqas, Kang, and Cha 2023).

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Ming Ho Li, who just finished his master degree in Sun Yatsen University, specializing in the structural of modern cultural architectures.
Yi Yu, an Associate Professor in East China Normal University, who experts in emotional labor, care ethics, and biopolitics.
Hongni Wei, a Lecturer in Guangdong University of Foreign Studies, who specializes in culture and geography.
Ting On Chan, is an Associate Professor in Sun Yat-sen University.His research area is on the field of LiDAR and photogrammetry applications.

Figure 2 .
Figure 2. Research area (bounded by blue lines in the figure), among the Haizhu, Yuexiu, and Liwan Districts in Guangzhou, Guangdong Province, China.

Figure 3 .
Figure 3.Some features for classification of different styles of qilous.
et al. 2020), to form a new image processing framework, making use of the advantage that the CLAHE-RGB stabilizes the image contrast to improve the robustness of the overall framework.The proposed framework's classification process can be divided into four parts, as illustrated in Figure 4. First, digital images of the qilou are captured and stored.Then, a CLAHE-RGB (Contrast Limited Adaptive Histogram Equalization-red-green-blue) method is used to minimize the difference in contrast of digital images, while the Bicubic Interpolation method is used to reduce the differences of pixels in images.Next, the preprocessed images are labelled.After that, a deep learning method based on Faster R-CNN is trained and then used to classify styles of qilou with the processed digital images.Advanced DL semantic segmentation methods such as SDDNet, IDSNet, and STRNet (Ali and Cha 2022; Choi and Cha 2019; Kang and Cha 2022) focused on processing large image size input videos in real time, which provided inspiration for the current study.

Figure 4 .
Figure 4. Workflow of the proposed framework.

Figure 5 .
Figure 5. Transformation of the histogram using the ClipLimit value.

Figure 6 .
Figure 6.Layout of the adapted Faster R-CNN in the proposed framework.

Figure 8 .
Figure 8. CLAHE-RGB method was applied to Image 1 in the datasets.(a) Original image, (b) histogram of the original image, (c) CLAHE-RGB processed image, (d) histogram of the CLAHE-RGB processed image.

Figure 9 .
Figure 9. CLAHE-RGB method application to Image 2 in the datasets.(a) Original image, (b) histogram of the original image, (c) CLAHE-RGB processed image, (d) histogram of the CLAHE-RGB processed image.

Figure 10 .
Figure 10.Epochs of different neural networks in this study.
detection in Figures13(d) and 14(d)  shows that four false positives are detected and one qilou is missed.In Figures13(e) and 14(e), the false positives in FCOS with CLAHE-RGB are less than Faster R-CNN-V with CLAHE-RGB, but false negatives are more than those.The FCOS algorithm can recognize Chinese-dwelling style and Modern style in Figure13(f), but none of qilou is detected in Figure14(f).It is worth noting that the high detection and innovations are mainly related to the structure of the two-stage network.The two-stage network is more suitable than the one-stage network, whatever the model of the feature map used.The onestage network also has the process of feature extraction, but confusion may occur when training with many heterogeneous qilou images.Another reason for the better performance is that the structure of the two-stage network can extract features more effectively as it is associated with the CLAHE-RGB method that increased the entropy of the images.

Figure 15 .
Figure 15.Examples of complicated scene in qilou classification tasks.(a) features are hidden behind woods and electric lines, (b) façades are inundated with advertisements, (c) the bridge is in close proximity to the façades, (d) a shot for recording qilou upper sections.

Table 1 .
Details of different styles of the qilou.
• Find the histogram equalization of each piece, and clip the histogram at a user-defined value

Table 2 .
AP and mAP of the tested neural networks for model training (best results highlighted in bold font).

Table 3 .
Accuracy of the tested neural networks for the qilou classification (bold font represents our method's results).