Seismic vulnerability assessment at urban scale using data mining and GIScience technology: application to Urumqi (China)

Abstract Seismic vulnerability assessments play a significant role in comprehensive risk mitigation efforts and seismic emergency planning, especially for urban areas with a high population density and a complex construction environment. Traditional approaches such as in situ fieldwork are accurate for conducting seismic vulnerability assessments of buildings; however, they are too time-consuming and costly, especially in moderate to low seismic hazard regions. To address this issue, an integrated approach for a macroseismic vulnerability assessment composed of data mining methods and GIScience technology was presented and applied to Urumqi, China. First, vulnerability proxies were established via in situ data of buildings in the Tianshan District with an EMS-98 vulnerability classification scheme and two data mining methods, namely, support vector machine and association rule learning methods. Then, the vulnerability proxies were applied to the Urumqi database, and the accuracy was validated. Finally, seismic risk maps were constructed through data consisting of direct damage to buildings and human casualties. The results indicated that the two data mining methods could achieve desirable accuracies and stabilities when estimating the seismic vulnerability. The seismic risk of Urumqi was estimated as Slight with a predicted number of 61,380 homeless people for a seismic intensity scenario of VIII.


Introduction
Natural hazards represent an ever-present threat to human life as well as to the physical infrastructure and global economy; among them, earthquakes constitute extremely serious and deadly natural disasters worldwide (Wei et al. 2013; Wei et al. 2017). China, which is highly prone to earthquakes, frequently suffers from destructive and catastrophic earthquake activity that results in the serious loss of both life and property. Approximately 1,600,000 deaths have been caused by [...] (Geiß and Taubenböck 2013; Kaushik and Dasgupta 2013; Geiß et al. 2014; Moradi et al. 2014; Geiß et al. 2015a; Geiß et al. 2015b; Su et al. 2015; Costanzo et al. 2016; Dhar et al. 2016; Klotz et al. 2016; Ghorbanzadeh et al. 2017). Data mining methods have also been developed to ascertain the best proxy that links the features of buildings, which are easily assessed using remote sensing and civil engineering methods, with their seismic vulnerabilities (Şen 2010, 2011; Chen et al. 2012; Siraj et al. 2014; Wu H et al. 2014; Riedel et al. 2015; Campostrini et al. 2017; Ghorbanzadeh et al. 2017; Guettiche et al. 2017). The synergistic use of in situ fieldwork, data mining methodologies and GIScience technology constitutes a promising yet challenging approach for reducing the costs of in situ building surveys, which could become more efficient and accurate through an association with vulnerability index methods (VIM) in addition to macroseismic vulnerability and risk assessments in moderate-risk, seismic-prone regions.
Here, an integrated approach combining data mining, remote sensing interpretation, GIS-based mapping, and VIMs is tested and validated through a macroseismic vulnerability assessment of the city of Urumqi. First, the city of Urumqi is presented along with the Urumqi database. Second, the support vector machine (SVM) and the association rule learning (ARL) methods are presented and applied to in situ data to derive two vulnerability proxies for Urumqi. Based on these two vulnerability proxies and the Urumqi database, vulnerability maps of the city are generated within a GIS environment. Finally, the seismic damage within the city is further presented in association with the distribution of the expected damage grade throughout Urumqi in addition to the estimated number of human casualties.

Study area
The city of Urumqi, which is situated in Northwest China, has experienced considerable changes to its urban development over the last 20 years, and it currently serves as a regional, cultural, political, and commercial centre. By the end of 2014, the city hosted approximately 3,550,000 inhabitants and covered an administrative area of 10,989 square kilometres (2015 Census). As presented in Figure 1, the study area of this article, which encompasses the centre of Urumqi, exhibits a probabilistic seismic intensity of VIII and constitutes an urban area where comprehensive disaster prevention methods are crucial.
Since the beginning of the twentieth century, Urumqi has witnessed seven earthquakes with magnitudes of $M_s \geq 5.0$. The strongest recorded event was the $M_s$ 6.0 earthquake that struck in 1934 (Shen and Song 2008). Recent GPS measurements indicated that the compressional deformation of the Tianshan area is accommodated primarily by a series of W-E-trending fold-thrust belts with a slip rate of 1-2 mm/a and that Urumqi is moving toward the NE at a rate of 4-5 mm/a (Yang et al. 2008; Wu et al. 2017). The city is currently characterized by moderate to low seismic activity, and no major earthquake has occurred recently. However, no detailed quantitative assessments have been conducted within the city to date.

Data
Building inventories are considered the most important factors in seismic vulnerability and risk assessments. The database used in this study consists of two parts: the Urumqi database and the in situ data of the Tianshan District.

Urumqi database
The Urumqi database refers to the Urumqi earthquake emergency foundation database, which comprises social, economic, and population information in addition to a building inventory and city map as well as natural geographic landforms, key object locations, rescue team information, relief communication data, and earthquake preplanning data (Zhong et al. 2002; Xu et al. 2016). The building inventory and population information were employed in this study to conduct a seismic vulnerability assessment; meanwhile, the other elements, which would only slightly affect the results of the assessment, were not included.
In China, urban expansion processes are closely related to national macroscopic policies. Moreover, urban residential communities typically refer to various types of relatively independent residential areas throughout the city (Liu et al. 2016; Lixin et al. 2017). Considering the above characteristics and the aims of the macroseismic vulnerability assessment in this study, the basic Urumqi database is subdivided into multiple sections according to several elements: urbanization, urban planning, land use and land cover, district limits, and other factors regarding building construction. Each section is characterized by high building and population densities, unified structures, and low mobility. Only residential dwellings are considered in this study; the information regarding the period of construction and number of floors for each building is available within the Urumqi database. Ultimately, 1,289 residential sections were identified and characterized as homogeneous for the following seismic vulnerability assessment.

In situ data
As demonstrated in any seismic vulnerability assessment approach, the classification of the seismic vulnerability fundamentally depends on the material and type of a structure while taking into consideration a variety of additional factors (e.g. constructional and architectural features, quality of material, period of construction, and state of preservation) that may impact the seismic performance of the building (Santos et al. 2013; Geiß et al. 2014; Geiß and Taubenböck 2017). Therefore, the in situ data in this study include various parameters, namely the type of material, number of floors, period of construction, and roof type (Matsui 2007; Riedel et al. 2014). Based on these parameters, each residential section was categorized into typical vulnerability classes according to the European Macroseismic Scale 98 (EMS-98) (Grünthal 1998). The EMS-98 standard was primarily developed to analyze typical structural susceptibilities that could be expected throughout Europe; however, since the building vulnerability was considered when defining the intensity, the EMS-98 vulnerability classification could be used to represent the seismic damage in the target region for a given intensity (Riedel et al. 2015). This standard has been widely utilized as a useful vulnerability classification in other areas around the world (Tertulliani et al. 2011). According to the EMS-98, different types of buildings are classified into one of six vulnerability classes ranging from A (highest vulnerability) to F (lowest vulnerability). These different building types experience different levels of damage at a particular earthquake intensity depending on their inherent characteristics. According to the EMS-98, the damage to each building is labeled with a damage grade ranging from DG1 to DG5, where DG1 denotes the lightest damage and DG5 denotes the most severe damage.
The in situ data consisting of seismic vulnerabilities were collected in the Tianshan District in August 2017. A building-to-building survey was performed in a small area (approximately 1530 × 1200 m), including all buildings within the surveyed area (Figure 2). During the in situ fieldwork, we investigated the following attributes: the type of structure, period of construction, number of floors, and land use and land cover. In general, the basic census data and the field survey conducted in Urumqi revealed the following dominant building types as presented in Figure 3 (Lagomarsino and Giovinazzi 2006; Su et al. 2015; Xu et al. 2016): adobe (M2), wooden slabs (M3.1), unreinforced masonry (M3.3), reinforced masonry (M4), reinforced concrete (RC1), and concrete shear walls (RC2). Based on the historic urban planning and urbanization, the period of construction was divided into four categories, namely, pre-1960, 1960-1981, 1982-2002, and 2002+. The number of floors was defined according to the interval given in the RISK-UE method (Lagomarsino and Giovinazzi 2006; Mouroux and Le Brun 2008): low-rise (1-2 stories), mid-rise (3-5 stories), and high-rise (>6 stories). Through the use of high-resolution remote sensing images (from Google Earth) and manual image interpretation, the roof types were identified and categorized into two groups: flat and sloped. The building classes were divided into residential and non-residential types. Soil amplification factors were deliberately excluded from consideration in this study.
In total, 703 buildings were surveyed in the study area and their detailed construction and conservation properties were recorded. The buildings were then grouped according to the following attributes: the type of structure, period of construction, number of floors, land use and land cover, and roof type. Nearly all of the 703 buildings were constructed subsequent to 1982; only 1 building (0.1% of the total) was built prior to 1960, whereas 12 (1.7%) were constructed in 1960-1981, 360 (51.2%) were constructed during 1982-2001, and 330 (46.9%) were built after 2002. A total of 179 (25.5%) buildings were low-rise structures (no more than 2 floors), while 424 were mid-rise structures (3-6 floors), and only 100 buildings were high-rise (7 or more floors). Through the manual interpretation of remote sensing imagery, 46 of the buildings were identified as having sloped roofs, whereas 657 of the buildings had flat-type roofs. During the field survey in the study area, 616 buildings were identified as residential while the remaining 87 buildings were non-residential. The distribution and

Methods
Data mining, which is an interdisciplinary subfield of computer science, is a computational process used to discover patterns within large datasets through a combination of machine learning, statistics, and database systems (Linoff and Berry 1997). The overall goal of the data mining process is to extract information from a dataset and transform it into an understandable structure for further use. This technology is important for various research fields, including genetics, mathematics, cybernetics, and marketing (Riedel et al. 2014). Various types of learning algorithms such as association rules, Bayesian classifiers, decision rules, decision trees, neural networks, and SVMs can be used for data mining (Fayyad et al. 1996).
Two data mining algorithms and VIMs are briefly introduced in this section. The schematic workflow of this study is presented in Figure 6.

Support vector machine (SVM)
The SVM model, which was proposed by Cortes and Vapnik (1995), was originally defined for the classification of linearly separable classes of objects. Subsequently, the SVM model has been widely employed in decision-making and prediction endeavors due to its considerable efficiency in dealing with linearly inseparable and high-dimension datasets (Chen et al. 2012; Kavzoglu et al. 2014). Consider a set of training vectors $x_i$ ($i = 1, 2, \dots, n$) consisting of two classes denoted as $y_i = \pm 1$. The SVM model seeks an n-dimensional hyperplane that separates the classes of labeled samples with the maximum gap between them. The data points of the two classes are placed at the farthest possible distance from the classifier hyperplane, which can be a flat or curved surface and can be mathematically expressed as follows:

$$w \cdot x + b = 0$$

where $w$ is the normal (weight) vector of the hyperplane and $b$ is its offset. The selection of the kernel function in an SVM model is crucial for the classification results (Xu and Xu 2012). In this work, radial basis function (RBF) kernels were implemented as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

where $\gamma$ is the kernel function parameter that is entered manually during the calculation.
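The behavior of the RBF kernel can be illustrated with a short sketch. It assumes the standard Gaussian form of the kernel, writes the manually entered kernel parameter as `gamma`, and uses hypothetical integer encodings of the four building attributes; none of these encodings come from the study itself.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """Standard Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

# Hypothetical encodings: (type of structure, floors class, period class, roof type)
building_a = (3, 1, 2, 0)
building_b = (3, 1, 3, 0)  # differs from building_a only in the period class

same = rbf_kernel(building_a, building_a, gamma=0.5)   # identical vectors
close = rbf_kernel(building_a, building_b, gamma=0.5)  # squared distance of 1
print(same, close)
```

Identical attribute vectors give a kernel value of 1, and the value decays toward 0 as buildings become less similar; a larger `gamma` makes that decay faster.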
Lastly, we established the SVM model with the RBF kernel function, which involves training and testing sets. Each instance in the training set has one class label (the EMS-98 vulnerability class) and four attributes (type of structure, number of floors, period of construction, and roof type). Within a supervised classification framework, an SVM statistical learning algorithm is used on the dataset to label the buildings according to the desired EMS-98 standard for seismic vulnerability classes. Once the best hyperplane has been found (using only the training set), the accuracy is measured on the testing set by creating a confusion matrix and calculating the ratio of the sum of the diagonal values (correct classifications) to the sum of all the elements in the matrix.
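The accuracy measure described above can be written in a few lines. The following is a minimal sketch with a made-up three-class confusion matrix (rows are reference classes, columns are predicted classes); the counts are illustrative and are not taken from the study.

```python
def overall_accuracy(confusion):
    """Sum of the diagonal divided by the sum of all matrix elements."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

# Illustrative counts only: 24 of 30 test buildings lie on the diagonal.
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
print(overall_accuracy(toy_confusion))
```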

Association rule learning (ARL)
ARL is a rule-based data mining method used to reveal interesting relationships among variables within a large database (Piatetsky-Shapiro 1991) using some measures of interestingness (Agrawal et al. 1993). Recently, some researchers (Riedel et al. 2014; Riedel et al. 2015; Guettiche et al. 2017) applied this method to conduct seismic vulnerability assessments since it establishes correlations using mathematical algorithms (i.e. if/then statements) between the basic attributes of buildings and the EMS-98 vulnerability classes.
In this study, we applied a simplified ARL method to the in situ Urumqi database. The conditional probabilities between the EMS-98 vulnerability classes (target $Y_i$ = A, B, C, D, and E) and the basic information regarding the buildings (attribute $X$) are derived to obtain the in situ vulnerability proxy. ARL takes the form $X \rightarrow Y_i$, where $X$ (antecedent) and $Y_i$ (consequent) represent two sets of independent items. Each relationship between $X$ and $Y_i$ can be expressed as a value in the interval [0, 1]: knowing the building attribute $X$, the probability of belonging to the class $Y_i$ is defined as:

$$P(Y_i \mid X) = \frac{P(X \cap Y_i)}{P(X)}$$

In practice, $P(Y_i \mid X)$ can be calculated as follows:

$$P(Y_i \mid X) = \frac{N_{XY}}{N_X}$$

where $N_X$ is the total number of buildings with the attribute $X$, and $N_{XY}$ is the number of buildings with the attribute $X$ belonging to the class $Y_i$. One limitation of the ARL method is the significant risk of false associations that might arise when searching through massive numbers of possible associations, as the results would consequently include inconsistencies (Guettiche et al. 2017).
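The counting behind the simplified ARL proxy can be sketched as follows: for each attribute combination, the share of buildings in each class is the number of buildings with that attribute and class divided by the number of buildings with that attribute. The records below are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter

def conditional_probabilities(records):
    """Return {(attribute, cls): N_XY / N_X} from (attribute, cls) pairs."""
    records = list(records)
    n_x = Counter(attribute for attribute, _ in records)  # buildings per attribute
    n_xy = Counter(records)                               # buildings per (attribute, class)
    return {key: count / n_x[key[0]] for key, count in n_xy.items()}

# Invented survey records: ((period of construction, floors class), EMS-98 class)
survey = [
    (("1982-2001", "mid-rise"), "C"),
    (("1982-2001", "mid-rise"), "C"),
    (("1982-2001", "mid-rise"), "C"),
    (("1982-2001", "mid-rise"), "D"),
    (("2002+", "high-rise"), "E"),
    (("2002+", "high-rise"), "E"),
]
proxy = conditional_probabilities(survey)
for (attribute, cls), probability in sorted(proxy.items()):
    print(attribute, cls, probability)
```

For each attribute combination the probabilities over the classes sum to one, which is what allows the table to be used as a vulnerability proxy.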

Vulnerability index method
The vulnerability index method (VIM) adopted in this work to assess the seismic vulnerabilities of buildings in Urumqi is based on the RISK-UE project (Lagomarsino and Giovinazzi 2006), which was launched in 1999 and involved seven cities throughout Europe and around the Mediterranean Sea, namely, Barcelona (Spain), Bitola (Macedonia), Bucharest (Romania), Catania (Italy), Nice (France), Sofia (Bulgaria), and Thessaloniki (Greece). Due to its compatibility with the EMS-98 standard, the VIM has been widely applied to other areas worldwide, including Iran (Omidvar et al. 2012), Grenoble in France (Riedel et al. 2014; Riedel et al. 2015), Faro in Portugal, Horta in Portugal, Sion and Martigny in Switzerland (Lestuzzi et al. 2016), and Al Hoceima in Morocco (Cherif et al. 2016). The VIM requires the seismic vulnerability of a structure to be expressed as a vulnerability index with values ranging from 0 (least vulnerable) to 1 (most vulnerable); the damage is then determined from the macroseismic intensity and the resistance of the structure. The procedure for determining the damage can be divided into three steps as discussed hereafter.
Step 1: estimation of the vulnerability index $V_I$
A building typology matrix (BTM) is a typological classification that reflects the differences among various types of structures that are expected to display similar seismic performances (Table 1). The vulnerability index is obtained as follows:

$$V_I = V_I^* + \Delta V_R + \Delta V_m$$

where $V_I^*$ is the vulnerability index corresponding to the building classification, $\Delta V_R$ is a regional vulnerability factor considering the characteristics of the region or building period, and $\Delta V_m$ is the seismic behavior modifier that includes all other aspects of the seismic performance. The regional vulnerability modifier and the seismic behavior modifiers, which could characterize the regional difference of seismic vulnerability, would help to improve the accuracy of the seismic vulnerability assessment. For the macroseismic vulnerability assessment at the urban scale in this study, the regional vulnerability modifier and the seismic behavior modifiers were taken to be equal to zero.
Step 2: estimation of the mean damage grade $\mu_D$
A mean damage grade $\mu_D$ is defined to characterize the expected damage using the following equation:

$$\mu_D = 2.5\left[1 + \tanh\left(\frac{I + 6.25\,V_I - 13.1}{\phi}\right)\right]$$

where $I$ is the macroseismic intensity, which usually ranges from V (5) to XII (12), and $\phi$ is the ductility index, which is evaluated by considering the building typology and constructive features (Lagomarsino 2006; Lagomarsino and Giovinazzi 2006). For residential buildings, $\phi$ takes a value of 2.3 (Lantada et al. 2009).
The weighted mean damage index $\mu_D$ can be calculated using the following equation:

$$\mu_D = \sum_{k=0}^{5} k\,p_k$$

where the integer $k$ ranges from 0 to 5 for the damage states, and $p_k$ represents the corresponding probability of occurrence for each damage state.
Step 3: estimation of the damage distribution
The expected probability of occurrence of a damage grade for any degree of seismic intensity is assumed to follow a binomial distribution. The probability of each damage grade can be calculated as follows:

$$p_k = \frac{5!}{k!\,(5-k)!}\left(\frac{\mu_D}{5}\right)^{k}\left(1 - \frac{\mu_D}{5}\right)^{5-k}$$

where ! indicates the factorial operator.
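The three steps can be chained in a short numerical sketch. It assumes the commonly quoted RISK-UE closed form for the mean damage grade (a hyperbolic tangent of the intensity and the vulnerability index) and a binomial distribution over the six damage grades 0 to 5; the function names are hypothetical, and the input values (intensity VIII, vulnerability index 0.51, ductility index 2.3) are used purely as an illustration.

```python
import math

def vulnerability_index(v_star, dv_regional=0.0, dv_modifier=0.0):
    """Step 1: typological index plus regional and behavior modifiers."""
    return v_star + dv_regional + dv_modifier

def mean_damage_grade(intensity, v_index, ductility=2.3):
    """Step 2: assumed RISK-UE closed form; the result lies between 0 and 5."""
    return 2.5 * (1.0 + math.tanh((intensity + 6.25 * v_index - 13.1) / ductility))

def damage_distribution(mu_d):
    """Step 3: binomial probabilities p_0..p_5 for damage grades 0..5."""
    p = mu_d / 5.0
    return [math.comb(5, k) * p ** k * (1.0 - p) ** (5 - k) for k in range(6)]

v_i = vulnerability_index(0.51)   # modifiers set to zero, as in this study
mu_d = mean_damage_grade(8, v_i)  # intensity VIII
p_k = damage_distribution(mu_d)

print(round(mu_d, 2))
print([round(p, 3) for p in p_k])
# The weighted mean of the distribution reproduces the mean damage grade.
print(round(sum(k * p for k, p in enumerate(p_k)), 2))
```

With these inputs the mean damage grade comes out at roughly 0.8, and most of the probability mass falls on grades 0 and 1. The last line is a consistency check: the mean of a binomial distribution with five trials and success probability $\mu_D/5$ is $\mu_D$ itself.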

Training samples
After the buildings in the in situ database were assigned to the EMS-98 classes based on their basic attributes, we applied the SVM classification. The entire database was divided into two disjoint groups: training sets and testing sets (training sets are not included within the testing sets). To quantify the uncertainties in the prediction results, we randomly selected a training set and repeated each experiment 1000 times (i.e. with 1000 independent training and testing divisions). Subsequently, we computed the median and overall accuracy of each experiment and created an accuracy histogram. First, the four attributes of the buildings and the five EMS-98 classes were utilized to identify the optimal size of the training sets. The numbers of training buildings are 62, 123, 185, 247, 308, 370, and 431, which respectively correspond to 10, 20, 30, 40, 50, 60, and 70% of the entire in situ dataset. Our results clearly demonstrate that the accuracy initially increased and the dispersion initially decreased as the training set size percentage increased. The maximum attainable accuracy was reached when the training set size was increased to 40%, at which point the benefit of a larger training set levelled off. Beyond 40%, the accuracy gradually diminished and the deviation increased (Figure 7). Hence, a training size of 40% was determined to represent the best training set size, and thus, this size is adopted for the calculations in the following experiments.
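The repeated random division into training and testing sets can be sketched as below. To keep the example self-contained, the SVM is replaced by a simple stand-in classifier (the most frequent class per attribute combination), so the sketch illustrates the resampling procedure only, not the classifier used in this study. All names and records are hypothetical.

```python
import random
import statistics
from collections import Counter, defaultdict

def fit_lookup(train):
    """Stand-in classifier (not an SVM): most frequent class per attribute tuple."""
    by_attribute = defaultdict(Counter)
    for attributes, label in train:
        by_attribute[attributes][label] += 1
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    table = {a: counts.most_common(1)[0][0] for a, counts in by_attribute.items()}
    return lambda attributes: table.get(attributes, fallback)

def repeated_holdout(records, train_fraction, n_repeats, seed=0):
    """Accuracy on the held-out part for n_repeats independent random splits."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(records))
    accuracies = []
    for _ in range(n_repeats):
        shuffled = records[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]  # disjoint sets
        predict = fit_lookup(train)
        hits = sum(predict(attributes) == label for attributes, label in test)
        accuracies.append(hits / len(test))
    return accuracies

# Invented records: ((period, floors class), EMS-98 class)
toy = ([(("2002+", "high-rise"), "E")] * 12
       + [(("1982-2001", "mid-rise"), "C")] * 9
       + [(("1982-2001", "mid-rise"), "D")] * 3)
accuracies = repeated_holdout(toy, train_fraction=0.4, n_repeats=1000)
print(statistics.mean(accuracies), statistics.median(accuracies))
```

Running the same function for several values of `train_fraction` and comparing the mean and spread of the returned accuracies mirrors the training-size experiment described above.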
The classification accuracy depends on the different combinations of datasets employed. Therefore, different combinations of building attributes were tested with a 40% training size. The mean accuracy was then calculated after 1000 iterations with each combination, the results of which are presented in Figure 8. Figure 8a demonstrates that the mean accuracy is approximately 65.9% when considering only two attributes (the period of construction and number of floors) and then sharply falls to 41.1% for the combination of two different attributes (the roof type and the number of floors) (Figure 8b). By adding the roof type, which was obtained through the manual interpretation of remote sensing imagery, to the period of construction and number of floors, the mean accuracy is slightly increased to 66.4% (Figure 8c). Through a comparison of Figure 8a, b, and c, it is clear that the roof type is not completely independent since the associated accuracy is not substantially improved. In contrast, after replacing the roof type with the type of structure, the mean accuracy is enhanced drastically to 78.9% (Figure 8d). This indicates that the type of structure is a fundamental factor for a seismic vulnerability assessment, which is in accordance with the findings of previous studies (Lang and Bachmann 2004; Wu et al. 2014; Riedel et al. 2015; Guettiche et al. 2017), further revealing that high-resolution remote sensing imagery can allow for the extraction of building details and aid the accurate assessment of building seismic vulnerabilities. Figure 8e shows that the mean accuracy of the classification when considering all four attributes (i.e. the number of floors, period of construction, type of structure, and roof type) and all five EMS-98 classes climbs to 79.7%. Since the SVM classifier could not separate the classes with very high accuracy and often failed to differentiate adjacent classes clearly, the effect of merging the classes was studied by reducing the six classes to only three.
After Classes A and B were combined into Class 1, C and D were combined into Class 2, and E and F were combined into Class 3, the accuracy rose from 79.7 to 83.8% (Figure 8f). In summary, Figure 8 shows that a greater amount of building information can help improve the overall accuracy of the learning phase.
In addition, we also calculated an example of the confusion matrix using four attributes and five classes from the in situ database (Table 2). The values on the diagonal are the buildings for which the classes were assigned correctly. The overall accuracy of the classification was 79.7%. Meanwhile, the accuracy was lower for Classes A and C; two Class A buildings were correctly classified, but two buildings were wrongly classified into Class B. In addition, 55 Class C buildings were correctly classified while 36 buildings were misclassified as Class B and a further 11 were assigned to other classes. Previous research (Riedel et al. 2015) noted that it is difficult to distinguish between buildings of Class A and B because of their equivalent vulnerabilities according to the EMS-98 standard. However, considering the fact that Urumqi is a relatively newly constructed city where Class A buildings are rare, we can neglect the influence of an incorrect classification of Class A structures. The other three classes display better classification results. For Class B, 87 buildings were correctly classified while 12 buildings were classified as Class C. For Class D, 44 buildings were correctly classified while 14 buildings were classified as Class E. In addition, 105 Class E buildings were correctly classified while 1 building was classified as Class D. The classification accuracies for Class A, B, C, D, and E were 50, 87.9, 53.9, 75.9, and 99.1%, respectively.
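The per-class accuracies quoted above can be reproduced from the reported counts, since each is the diagonal entry divided by the row total. In the sketch below, rows are the reference classes and columns the predicted classes; the column in which the 11 remaining Class C misclassifications are placed (Class D here) is an assumption, and it does not affect the per-class accuracies.

```python
classes = ["A", "B", "C", "D", "E"]

# Rows: reference class, columns: predicted class (counts quoted in the text).
confusion = [
    [2, 2, 0, 0, 0],     # Class A
    [0, 87, 12, 0, 0],   # Class B
    [0, 36, 55, 11, 0],  # Class C (placing the 11 under D is an assumption)
    [0, 0, 0, 44, 14],   # Class D
    [0, 0, 0, 1, 105],   # Class E
]

def per_class_accuracy(matrix):
    """Diagonal entry divided by the row total, for each reference class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

rates = [round(100 * r, 1) for r in per_class_accuracy(confusion)]
print(dict(zip(classes, rates)))  # A: 50.0, B: 87.9, C: 53.9, D: 75.9, E: 99.1
```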

Validation and application to Urumqi
A vulnerability proxy was defined after the SVM training phase by considering only two building attributes (the period of construction and number of floors). Table 3 displays the conditional classification value in the EMS-98 vulnerability classes, including all 12 combinations of the two attributes (i.e. for four types of construction periods and three ranges for the number of floors). For instance, a randomly selected building within the in situ area of the Tianshan District that is known to have been built after 2002 and has more than seven floors has a 25% probability of being in Class D and a 75% probability of being in Class E. Then, we estimated the seismic vulnerability of the entire city by applying the SVM vulnerability proxy to the Urumqi database. The need for only two attributes eliminates one of the main difficulties related to any vulnerability study while remaining relevant at the urban scale. All of the available data were collected and integrated into a GIS environment, which can perform building analyses and present the associated distributions. Figure 9 shows the different vulnerability distributions throughout Urumqi calculated using the vulnerability proxy in Table 3. The results clearly indicate that the whole city is generally less vulnerable since most buildings fall into classes less vulnerable than Class C. This suggests that the city has a generally high resistance to earthquakes. The lowest vulnerability is found both within the city centre, since the building structures therein change and update rapidly, and around the periphery of the city, especially in the Toutunhe District and Xinshi District, where an earthquake resistance project was carried out along with the building renewal. In contrast, the highest vulnerabilities are mainly detected in the urban and rural areas, where the building structures are unconfined and there are fewer urban processes than in the city centre.

ARL application
Using two attributes (the period of construction and number of floors) and ARL with 616 buildings from the in situ database, we derived the ARL vulnerability proxy (Table 4).
From Table 4, we can conclude that all of the buildings built prior to 1960 with no more than two floors belong to Class A. No buildings with three or more floors were constructed prior to 1960. This pattern corresponds to the urban planning and construction history of Urumqi.
After the establishment of the vulnerability proxy, we then utilized the following formula (Riedel et al. 2015) to obtain the distribution of the EMS-98 classes throughout the whole city:

$$P_j(X) = \sum_i \frac{N_{ji}}{N}\,P(X \mid Y_i)$$

where $P_j(X)$ is the probability of containing a vulnerability class $X$ ($X$ = A, B, C, D, E) in each homogeneous area, $N_{ji}$ is the number of buildings with the attribute $Y_i$ in section $j$, $N$ is the total number of buildings in $j$, and $P(X \mid Y_i)$ is the value of the probability given by the vulnerability proxy for the $X \rightarrow Y_i$ association in Table 4. Figure 10 shows the vulnerability distributions in Urumqi calculated using the ARL vulnerability proxy. Similar seismic vulnerability distributions and trends can be observed by comparing Figures 9 and 10, indicating that both data mining methods can achieve a desirable accuracy and stability for an estimation of the seismic vulnerability.
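Applying a vulnerability proxy to one homogeneous section amounts to weighting the proxy probabilities by the share of buildings with each attribute combination. The sketch below is illustrative: the entry for buildings built after 2002 with a high number of floors reuses the 25%/75% example quoted earlier for the SVM proxy, whereas the second proxy entry and the building counts of the section are hypothetical.

```python
def section_class_probabilities(attribute_counts, proxy,
                                classes=("A", "B", "C", "D", "E")):
    """Weight proxy probabilities by the share of buildings per attribute."""
    total = sum(attribute_counts.values())
    if total == 0:
        raise ValueError("section contains no buildings")
    result = {cls: 0.0 for cls in classes}
    for attribute, count in attribute_counts.items():
        for cls, probability in proxy[attribute].items():
            result[cls] += count / total * probability
    return result

proxy = {
    ("2002+", "high-rise"): {"D": 0.25, "E": 0.75},    # example quoted in the text
    ("1982-2001", "mid-rise"): {"C": 0.60, "D": 0.40},  # hypothetical values
}
# Hypothetical section: number of buildings per attribute combination.
section = {("2002+", "high-rise"): 30, ("1982-2001", "mid-rise"): 10}

distribution = section_class_probabilities(section, proxy)
print({cls: round(p, 4) for cls, p in distribution.items()})
```

Because each row of the proxy sums to one, the resulting class shares for the section also sum to one and can be mapped directly in a GIS layer.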
To further investigate the differences between the SVM and ARL methods, we calculated the ratio between the two methods using the estimated results of the vulnerability distributions. As illustrated in Figure 11, the ARL/SVM ratios for Class A, B, and C are all above 1, indicating that the ARL method returns a greater number of more vulnerable buildings. In contrast, ARL calculated fewer buildings for the less vulnerable classes. This suggests that a greater degree of seismic damage would be estimated using the ARL method than with the SVM method for a particular seismic intensity over a given large-scale region.

Seismic risk scenarios for Urumqi
Seismic risk assessments incorporate both vulnerability assessments and hazard analyses. To reduce the seismic vulnerabilities of buildings and lower the loss of human life, we attempted to assess the percentages of the degrees of damage and loss of life in Urumqi.

Overall vulnerability map of Urumqi
The distribution of the mean vulnerability index is exhibited in the overall vulnerability map (Figure 12). The vulnerability index ranges from 0.3 to 0.8 for the Urumqi database with a mean value of 0.51. In general, the central urban areas have the lowest overall vulnerability with values of less than 0.55. In contrast, higher vulnerabilities are distributed along the peripheral zones of the Midong, Tianshan, and Saybagh Districts, where the buildings were erected with a lower quality of construction and fewer seismic reinforcements. Some particularly vulnerable sections with vulnerability values exceeding 0.7 can be found scattered throughout the city. This is primarily due to the old age of the buildings and the fact that they have not been renovated.

Direct physical damage
Once the vulnerability has been defined, the mean damage grade can be calculated for the different macroseismic intensities using formula 8. Table 5 shows the mean damage index values and their corresponding damage states. Only the intensity is considered in this study, while other additional factors such as site effects, soil nonlinearity or triggered effects (e.g. landslides) that might affect the seismic hazard behavior are not considered.
Figure 11. Ratio between the ARL and SVM methods.
GIS spatial analysis is employed to spatially represent the damage distributions of human settlements, thereby enabling the identification of more vulnerable areas and buildings, which can be crucial for urban management and planning in addition to protection strategies. Figure 13 displays the mean damage grade distribution of Urumqi for an earthquake scenario with a given intensity of VIII. The values of physical damage range from 0.44 to 4.24 with a mean value of 0.83. This corresponds to a Slight damage state according to Table 5. Figure 14 shows the proportion of damage states among the residential districts of Urumqi; the differences among the damage states in each district are not obvious, as 80% of the sections in each district exhibit values between None and Slight. According to the distribution of residential areas, fewer sections in Urumqi County exhibit a better seismic performance.

Human casualties and homelessness
The main purpose of an earthquake protection programme is to ensure human safety and reduce injuries. Human losses are estimated by considering building damage as the root cause for fatalities and injuries (Coburn and Spence 2003); however, this is not consistent with the results presented in Section 4. Therefore, no casualties are expected for a seismic intensity scenario of VIII in this study. The HAZUS methodology presented by FEMA was adopted in this study to estimate the probabilities of casualties and homelessness. This method considers 100% of the residential units located in buildings within the Very Heavy and Destruction damage grades and 90% of those units within the Substantial to Heavy damage grade to be uninhabitable.
The total number of uninhabitable residential units due to structural damage, denoted here as $U_{\text{uninhabitable}}$, is computed using the following equation:

$$U_{\text{uninhabitable}} = U_{MF}\,(D_{MF} + VH_{MF} + 0.9\,H_{MF})$$

where $U_{MF}$ represents the total number of multi-family residential units, and $D_{MF}$, $VH_{MF}$, and $H_{MF}$ represent the damage grade probabilities for the Destruction, Very Heavy and Substantial to Heavy structural damage states, respectively. The total number of persons relocated from each building with a typology ($P_{UND}$) is then obtained using the following relationship:

$$P_{UND} = U_{\text{uninhabitable}} \times P_h$$

where $P_h$ is the number of persons assumed to live in each household of the building. Figure 15 presents the amount and distribution of homeless people expected following a seismic intensity scenario of VIII. The total number of homeless people estimated in this scenario is 61,380. A large number of homeless people are located in the Xinshi, Saybagh, and Tianshan Districts, which are densely populated and rarely contain reinforced buildings. In contrast, the number of homeless people is relatively small in both the Toutunhe District and Urumqi County.
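The homelessness estimate can be sketched numerically under the habitability weights stated above: 100% of the units in the Destruction and Very Heavy damage grades and 90% of the units in the Substantial to Heavy damage grade are treated as uninhabitable. The function names and all input values below are hypothetical and serve only to show the arithmetic.

```python
def uninhabitable_units(total_units, p_destruction, p_very_heavy, p_heavy):
    """Units lost to structural damage, using weights of 1.0, 1.0 and 0.9."""
    return total_units * (1.0 * p_destruction + 1.0 * p_very_heavy + 0.9 * p_heavy)

def displaced_persons(units, persons_per_household):
    """Persons relocated: uninhabitable units times household size."""
    return units * persons_per_household

# Hypothetical section: 1000 units, damage grade probabilities of 1%, 4% and 10%.
units = uninhabitable_units(1000, p_destruction=0.01, p_very_heavy=0.04, p_heavy=0.10)
homeless = displaced_persons(units, persons_per_household=3)
print(round(units), round(homeless))
```

In this invented case, about 140 units are uninhabitable and about 420 persons are displaced; summing such values over all sections gives a city-wide total.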

Conclusions
Rapid socioeconomic growth in earthquake-prone areas can cause rapid changes in exposure, seismic vulnerability and risk of loss. Accordingly, seismic vulnerability and risk assessments at the urban scale have proven to be significant and effective not only for urban planners and decision-makers in the development of corresponding strategies for earthquake disaster reduction and immigration but also for raising public awareness with regard to earthquake prevention. Detailed and real-time building inventories, however, are not easily obtained, especially for developing countries characterized by dynamic urban growth with unplanned and highly vulnerable settlements. To address this problem, an integrated approach for a macroseismic vulnerability assessment combining data mining methods with GIScience technology was tested and applied to the city of Urumqi, China. This combination aspires to take full advantage of existing data and extract 'hidden' information from datasets. In this study, the possibility of using relatively few building attributes to represent and analyze the seismic risk at the urban scale was demonstrated.
Using the information available in the in situ area of the Tianshan District with two data mining methods (i.e. the SVM and ARL methods), two vulnerability proxies were derived to create relationships between combinations of building attributes and their most likely EMS-98 vulnerability classes. These proxies were derived during the learning phase and then applied to the whole basic Urumqi database. The accuracies of the SVM and ARL methods were evaluated, and the results showed that both approaches provide similar results. This suggests that both data mining methods were successfully applied to the database of Urumqi and achieved desirable accuracies and stabilities for the estimation of the seismic vulnerability. For a particular seismic intensity within a given large-scale region, the ARL method would predict greater seismic damage than the SVM method. Meanwhile, the building structures for which RISK-UE is suitable were considered to differ little from the existing structures of Urumqi. A seismic risk assessment was performed for Urumqi using the seismic hazard scenarios, and the results revealed that the seismic risk of Urumqi is Slight with a predicted number of 61,380 homeless people for a seismic intensity scenario of VIII, since Urumqi is not exposed to soil effects and most of the buildings (especially those built before 2002) have low vulnerability indices. Future research will conduct in situ field surveys in other districts of Urumqi to expand the training and testing sets in order to improve the reliability and minimize the uncertainty of these methods. Additionally, updates to the Urumqi database are critical to obtaining more precise results from the seismic vulnerability assessments.
In conclusion, we confirmed that a seismic vulnerability assessment using data mining methods and GIScience technology can provide a relevant estimation of the seismic damage at a far lower cost than that of a conventional method (e.g. in situ fieldwork). The results of this study represent a potential guide for earthquake protection and risk management endeavors for the city of Urumqi as well as a powerful tool for urban development. Moreover, the original results described in this study, in conjunction with the complete and detailed technical information concerning the expected building damage and the impact on the populace, will be delivered to the civil protection services of the municipality of Urumqi, which will utilize these findings to update the emergency plans for the city.