The use of maximum entropy and ecological niche factor analysis to decrease uncertainties in samples for urban gain models

ABSTRACT Uncertainty is a common problem in spatial modeling and geographical information systems (GIS), and urban gain modeling (UGM) contains uncertainties in various dimensions and components. Data sampling is important in UGM because it can introduce many uncertainties into the results of the models and affect their precision and accuracy. A poorly sampled or biased dataset can lead to inaccurate predictions and decreased performance of the models. This paper aims to present and develop novel strategies for sampling and building training datasets that can enhance the performance of data-driven models. Specifically, the present study used maximum entropy (ME) and ecological niche factor analysis (ENFA) models to select pure non-change samples with minimal uncertainty for training datasets in UGM of Isfahan and Tabriz cities in Iran. The urban gain of two time intervals, 1992–2002 and 2002–2012, was used for Tabriz City, and that of two time intervals, 1994–2004 and 2004–2014, for Isfahan City. Nine and 14 urban gain drivers were used in the UGM of Isfahan and Tabriz cities, respectively. After the ME and ENFA models produced a training dataset with change and non-change samples with the lowest uncertainty, three well-known models, namely random forest (RF), artificial neural network (ANN), and support vector machine (SVM), were used for the modeling. Moreover, the ME and ENFA models that were used to investigate the uncertainty of the sampling procedure were also used as one-class prediction models. Compared to extant studies, the proposed ME-based sampling strategy increased the area under the receiver operating characteristic curve (AUROC), figure of merit, producer's accuracy, and overall accuracy by 5.5%, 5%, 5%, and 3%, respectively, in the validation phase of Isfahan City and by 5%, 6%, 14%, and 17%, respectively, for Tabriz City. For Isfahan, the accuracies of the ME (AUROC = 0.649) and ENFA (AUROC = 0.661) one-class models were close to those of the ANN-ME (AUROC = 0.646), ANN-ENFA (AUROC = 0.619), and RF-ENFA (AUROC = 0.631) models but differed significantly from that of the RF-ME (AUROC = 0.737) model. For Tabriz, the accuracies of the ME (AUROC = 0.657) and ENFA (AUROC = 0.688) one-class models were lower than those of the two-class RF-ME (AUROC = 0.852) and ANN-ME (AUROC = 0.778) models. The results showed that the ME model was able to identify relatively pure non-change samples and properly remove impure non-change samples from the training dataset. This study found that binary models are preferable to one-class models, and showed that an optimal sampling strategy is an essential step in UGM as it can decrease uncertainty. As such, modelers must adopt efficient sampling methods.


Introduction
In various scientific fields, uncertainty analysis and management is an important step in efficient modeling (Aven 2010). Urban gain models (UGMs) help policy- and decision-makers to adopt vital policies that avert environmental issues (Matthews et al. 2007). These include policies that can decrease the potential threats of urban gain such as environmental degradation (El Araby 2002), loss of biodiversity (Hansen, DeFries, and Turner 2012; McDonald et al. 2020), changes in land surface temperature (Nurwanda and Honjo 2020; Ullah, Jing, and Wadood 2020), destruction of farmlands (Liu et al. 2014; Surjan, Ara Parvin, and Shaw 2016), and changes in water quality (Dong, Liu, and Chen 2014; Zhao et al. 2015). Therefore, UGMs can help urban planners and decision-makers predict future urban gain areas, plan land usage, create basic urban infrastructure and amenities, and adopt the necessary policies to protect the environment (Bakker et al. 2008; Don, Schumacher, and Freibauer 2011; Martin et al. 2013; Van Minnen et al. 2009). Although a plethora of statistical, machine learning, and data mining models have been used to create UGMs, the resulting UGMs all contain uncertainties in various dimensions and components. Uncertainty may arise from the data and from the models used in the modeling (Tayyebi, Tayyebi, and Khanna 2014). If left unaddressed, these uncertainties lead to an inaccurate depiction of the relationship between independent and dependent variables, which in turn causes the UGMs to produce erroneous results. Therefore, the various dimensions of uncertainty in urban gain modeling (UGM) must be examined.
UGM using data-driven models such as artificial neural networks (ANNs) is conducted by considering two time intervals, t1–t2 and t2–t3 (Ahmadlou, Karimi, and Pontius 2021; Shafizadeh-Moghadam et al. 2017). The UGM uses all or a part of the data from the first time interval, including the urban gain drivers and the urban gain variable, as the training dataset. The models are then validated using the data from the second time interval, including the urban gain drivers and the urban gain variable. Significant uncertainties have been noted in the input data, such as in the predictor variables and even the dependent variable, as well as in the models themselves (Tayyebi, Tayyebi, and Khanna 2014). Tayyebi et al. (2014) concluded that uncertainties in the input data were more damaging than uncertainties in the parameters of the model. Therefore, the various dimensions and components of uncertainty in the input data of a model must be examined. As such, the present study analyzes the uncertainty of a sampling strategy that is proposed for creating a training dataset for UGM, under the simplifying but unrealistic assumption that the input data are error-free and that the parameters of the models are uncertainty-free.
In UGM, the first time interval contains both change and non-change samples, with typically more non-change samples than change samples. This is known as the imbalance problem (Ahmadlou, Karimi, and Pontius 2021; Gu et al. 2008), one of the biggest challenges plaguing UGM training datasets, even in highly researched fields (Ahmadlou, Karimi, and Pontius 2021; Pirizadeh et al. 2021). More specifically, as change samples are often surrounded by a large number of non-change samples, an imbalance problem occurs when selecting samples for the modeling (Ahmadlou, Karimi, and Pontius 2021).
Machine learning models are built using training datasets (Jaydhar et al. 2022; Ruidas et al. 2021, 2022). Although multiple methods have been examined for creating training datasets for UGMs using samples from the first time interval, they all contain different degrees of uncertainty. When creating a training dataset for UGM, it is common to randomly select 70% of all the samples, including change and non-change samples, from the first time interval (Shafizadeh-Moghadam et al. 2017; Parvinnezhad et al. 2021; Tayyebi and Pijanowski 2014; Tayyebi et al. 2014). Although this approach is the most frequently used, it produces the highest level of uncertainty as the first time interval contains significantly fewer change samples than non-change samples (imbalance problem). As such, the training dataset contains more non-change samples (Ahmadlou, Karimi, and Pontius 2021). As machine learning and data mining models are more likely to learn the non-change samples that are present in higher quantities, they fail to model the change samples, which is the primary goal. The second approach is to select equal quantities of change and non-change samples (Karimi et al. 2019; Pal et al. 2022). Ahmadlou, Karimi, and Pontius (2021) and Karimi et al. (2019) randomly selected equal quantities of the change and non-change samples from the first time interval and discovered that the non-change samples contained a significant amount of uncertainty. In recent studies, it has been assumed that these non-change samples do not have the potential to change in the future, and they were entered into the modeling as the opposite of the change samples. In other words, randomly selecting non-change samples from samples that have the potential to change in the future causes significant uncertainty to arise in the modeling. This is because the model may encounter many non-change samples that have the same spatial drivers and features as change samples but have not changed. Thus, another severe modeling challenge is the selection of pure non-change samples with no to low potential to change. The third approach is to use all the data from the first interval for modeling (Ahmadlou, Karimi, and Pontius 2021). The challenges of this approach are the differing degrees of class inequality (imbalance problem) and the presence of non-change samples that have the potential to change, both of which affect the accuracy of the models. As these uncertainties in the training dataset make it challenging to build a UGM, efficient approaches are needed to manage and overcome these problems.
A balanced training dataset that contains change and pure non-change samples should be used to develop UGMs. It is easy to select change samples for the training dataset of the UGMs as uncertainties only ensue from errors in extracting and providing these samples. However, it is very difficult to select pure non-change samples with no or low potential to change, because the number of urban gain samples is much smaller than the number of non-change samples and random sampling therefore places significantly more non-change samples than change samples in the training dataset. As a result, random sampling causes the model to be biased in favor of the non-change samples. Such models have good accuracy in modeling non-change samples (the frequent class) and poor accuracy in modeling change samples (the infrequent class), whereas the goal is to model urban gain samples. Multiple studies have faced these modeling issues (Parvinnezhad et al. 2021; Pontius et al. 2018; Shafizadeh-Moghadam et al. 2017; Ahmadlou, Karimi, and Pontius 2021). Pontius et al. (2018) used various models to simulate land use changes (LUCs) in 13 study areas with varying change rates. They found that low LUC rates lead to lower predictive accuracy. Therefore, the novelty of this paper lies in proposing and examining the efficacy of two approaches, maximum entropy (ME) and ecological niche factor analysis (ENFA), for creating balanced training datasets containing equal quantities of change samples and pure non-change samples for UGM of Tabriz and Isfahan cities in Iran, which have different urban gain rates.

Study areas and data sets
The urban gain of two major megacities in Iran, Tabriz and Isfahan, was used to model and evaluate the proposed sampling strategies due to their non-linear and complex urban gain patterns as well as their differing urban gain rates (Figure 1).

Study areas
The urban gain of two periods was examined for each city: 1992 to 2002 and 2002 to 2012 for Tabriz City, and 1994 to 2004 and 2004 to 2014 for Isfahan City. Isfahan and Tabriz are two old, large, and industrial cities in Iran with numerous tourist attractions. Their growing immigration rates have increased the demand for residential sites and industrial developments, leading to the urban expansion and growth of Tabriz and Isfahan.
Isfahan is located in Central Iran at 32.38° N and 51.38° E with an average elevation of 1587 m above sea level. Its northern and southern halves are divided by the Zayandeh Rud River, which played an important role in the urban gain of Isfahan in the past. In 1994 and 2004, agricultural and open lands were the dominant types of land use. However, by 2014, most of the city had been urbanized, which points to Isfahan's high urban gain rate.
Tabriz is a major city in North-Western Iran located at approximately 38.08° N and 46.29° E, with an average elevation of 1500 m above sea level. As the industrial center of the northwest, its population is expected to increase to 1,940,000 by 2030. The population of the city increased 6-fold and its urban growth increased 18-fold between 1956 and 2011. In 1992, its lands were predominantly barren or urbanized.

Dataset
Landsat images of Isfahan in 1994, 2004, and 2014 were obtained from the United States Geological Survey (USGS) and used to prepare land use maps and identify urban gain areas. The area had five land-use classes, namely croplands, open lands, built-up areas, water bodies, and salt marshes. They were classified using the maximum likelihood classification (MLC) with an overall accuracy of 84%, 86%, and 87% for 1994, 2004, and 2014, respectively. The urban gain maps of 1994–2004 and 2004–2014 were obtained by comparing the land-use maps of 1994 with 2004 and of 2004 with 2014. Nine significant urban gain factors of Isfahan were used for the modeling procedure (Table 1). The distance maps were obtained using Euclidean distance analysis in the geographical information system (GIS) environment, the elevation map was obtained from the 30 m digital elevation model (DEM) of the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), and the slope map was derived from the elevation map in the GIS environment.
Landsat images of Tabriz in 1992, 2002, and 2012 were obtained from the USGS and used to prepare land use maps. The MLC was used to classify the satellite images into urban, vegetation, and open lands. Fourteen significant urban gain factors of Tabriz were used for the modeling procedure (Table 1). The altitude and slope maps of Tabriz were obtained from the DEM of the ASTER. The urban gain drivers differed between the two cities because they were selected using the available expert opinions and extant studies on each city. Moreover, to compare the modeling results of the present study to other studies that have been carried out in these two cities, the urban gain of two time intervals was used for each city: 1992–2002 and 2002–2012 for Tabriz and 1994–2004 and 2004–2014 for Isfahan.

Methodology
Figure 2 depicts the modeling process and the proposed sampling strategy. The method used to create the training dataset for the various data-driven models should solve the class imbalance problem and include non-change samples with no to low potential for change. The study involves several steps, which are outlined below:
Step 1: After preparing the necessary land use maps, the urban gain cells of the selected time intervals, 1992 to 2002 and 2002 to 2012 for Tabriz and 1994 to 2004 and 2004 to 2014 for Isfahan, were extracted in the GIS environment. The urban gain drivers were then prepared for the two cities.
Step 2: The first time interval of both study areas was used for the sampling procedure and to build the models. The urban gain potential of the non-change samples was calculated using the ENFA and ME models as well as the urban gain samples and drivers of the first time interval. The proposed sampling strategy preserves the change samples of the first time interval and enters them into the training dataset, and then selects an equivalent number of non-change samples with the lowest ME and ENFA values to create the training dataset (Figure 2).
Step 3: This training dataset was then used to construct three well-known machine learning models, random forest (RF), artificial neural network (ANN), and support vector machine (SVM). These models have been used in various fields (Das and Chandra Pal 2020; Saha et al. 2022).
Step 4: The models were validated using the urban gain that occurred in the second time interval and the urban gain drivers at t2. The total operating characteristic (TOC), figure of merit (FoM), producer's accuracy (PA), and overall accuracy (OA) were used for validation of the models; they were calculated using the Hits, False Alarms, and Misses entries of the confusion matrix (Table 2).
Step 5: The suitability maps obtained from the ME and ENFA were entered directly into the validation phase and compared to those of the three machine learning models (Figure 2).
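The selection rule described in Step 2 can be illustrated with a minimal sketch. It assumes that a suitability (change potential) value between 0 and 1 is already available for every cell; the function name `build_balanced_training_set` and the toy arrays are hypothetical and are not taken from the study.

```python
import numpy as np

def build_balanced_training_set(suitability, is_change):
    """Keep every change cell and add the same number of non-change
    cells with the lowest suitability (lowest change potential)."""
    suitability = np.asarray(suitability, dtype=float)
    is_change = np.asarray(is_change, dtype=bool)

    change_idx = np.flatnonzero(is_change)
    non_change_idx = np.flatnonzero(~is_change)

    # Rank the non-change cells from lowest to highest suitability.
    order = np.argsort(suitability[non_change_idx], kind="stable")
    n_keep = min(len(change_idx), len(non_change_idx))
    pure_non_change_idx = non_change_idx[order[:n_keep]]

    indices = np.concatenate([change_idx, pure_non_change_idx])
    labels = np.concatenate([np.ones(len(change_idx), dtype=int),
                             np.zeros(n_keep, dtype=int)])
    return indices, labels

# Toy example (hypothetical values): 8 cells, 3 of which changed.
suitability = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.05, 0.4])
is_change = np.array([1, 0, 1, 0, 0, 1, 0, 0], dtype=bool)
indices, labels = build_balanced_training_set(suitability, is_change)
```

The returned indices and labels could then be used to extract the driver values of the selected cells and train any binary classifier.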

Maximum entropy (ME)
The Shannon entropy is a basic concept in information theory that Claude Shannon developed in 1948 to assess uncertainty in a random process (Gray 2011). The ME principle identifies a probability distribution that meets the constraints in the data (Berger, Della Pietra, and Della Pietra 1996). Given a series of constraints on urban gain cells, UGMs aim to identify the unknown distribution (P) over a set of urban gain drivers. The information available for this distribution is the mean of the features (X) under P in each change cell. The goal is to identify the distribution P(X) as an approximation of the actual distribution in the change cells. According to the ME principle, of all the possible distributions that satisfy the constraints, the one with the highest entropy is selected, where the entropy is computed using the natural logarithm (ln). Based on convex duality theory and to maximize the entropy under the given constraints, the Gibbs distribution is the only distribution with the smallest Kullback-Leibler divergence that satisfies all the constraints without additional presumptions (Della Pietra, Della Pietra, and Lafferty 1997; Nasser and Cessac 2014). This distribution function is proportional to the conditional probability of being positive. Refer to Phillips, Anderson, and Schapire (2006) and Phillips, Dudík, and Schapire (2004) for more details.
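For reference, the constraint, the entropy, and the Gibbs form mentioned in this subsection are commonly written as follows in the maximum entropy literature. This is a sketch in assumed notation rather than the authors' own typesetting: $X_j$ denotes the $j$-th urban gain driver, $c_1, \dots, c_m$ the change cells, $\lambda_j$ the fitted weights, and $Z_\lambda$ the normalizing constant.

```latex
% Constraint: the mean of each driver under P matches its mean over the change cells
\sum_{x} P(x)\, X_j(x) \;=\; \frac{1}{m} \sum_{i=1}^{m} X_j(c_i), \qquad j = 1, \dots, n

% Entropy to be maximized (natural logarithm)
H(P) \;=\; -\sum_{x} P(x) \ln P(x)

% Gibbs distribution that solves the constrained maximization
P_{\lambda}(x) \;=\; \frac{\exp\!\left( \sum_{j=1}^{n} \lambda_j X_j(x) \right)}{Z_{\lambda}},
\qquad
Z_{\lambda} \;=\; \sum_{x'} \exp\!\left( \sum_{j=1}^{n} \lambda_j X_j(x') \right)
```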

Ecological niche factor analysis (ENFA) model
Ecological niche factor analysis (ENFA) is a multivariate method that uses factor analysis and ecological niche theory to study the distribution of species according to environmental variables and presence-only locations without the need for absence locations (Hirzel et al. 2002). An ecological niche is formed in the model via habitats that are available and used by the species (Brotons et al. 2004). The present study equates the urban gain cells with a species that spreads across various regions over time. The ENFA model calculates the difference between the predictor variables in the change cells and the other cells of the entire studied area (Basille et al. 2008). The model does not require any information on the non-change cells and calculates the suitability of the cells using the change cells and the predictor variables. Refer to Basille et al. (2008) for more details on the ENFA model.
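The idea of comparing the drivers in the change cells with those of the whole study area can be illustrated with a simplified, per-driver marginality score in the spirit of Hirzel et al. (2002). This sketch is an assumption made for illustration: it does not perform the factor analysis of a full ENFA, and the function name and toy driver values are hypothetical.

```python
import numpy as np

def marginality_per_driver(drivers, is_change):
    """Per-driver marginality: absolute difference between the global mean
    and the mean over change cells, scaled by 1.96 global standard deviations."""
    drivers = np.asarray(drivers, dtype=float)
    is_change = np.asarray(is_change, dtype=bool)

    global_mean = drivers.mean(axis=0)
    global_std = drivers.std(axis=0)
    change_mean = drivers[is_change].mean(axis=0)
    return np.abs(global_mean - change_mean) / (1.96 * global_std)

# Toy example (hypothetical): column 0 = distance to roads (m), column 1 = slope (%).
drivers = np.array([[100.0, 2.0],
                    [200.0, 4.0],
                    [900.0, 2.0],
                    [1000.0, 4.0]])
is_change = np.array([True, True, False, False])
marginality = marginality_per_driver(drivers, is_change)
```

In this toy case the change cells sit much closer to roads than the study area as a whole, so the first driver has a clearly positive marginality, while the slope of the change cells matches the global mean and yields zero.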

Validating the proposed sampling strategies
The data from the second time interval were used to evaluate the quality of the two proposed sampling strategies. Their outputs were converted to maps with values of 0 and 1 using a threshold of 0.5. Cells with values above 0.5 are considered to have the potential to change use, while cells with values below 0.5 are considered not to have this potential (Tayyebi and Pijanowski 2014). The strategy with the highest consistency was chosen as the best sampling strategy.

Applying the proposed sampling strategies in UGM
Three well-known and widely used machine learning algorithms, RF, ANN, and SVM, were used to test the proposed sampling strategies. Multiple studies on UGM have used these models (Jun 2021; Shafizadeh-Moghadam et al. 2017). As the purpose of the present study was to examine the ability of the two proposed sampling strategies to efficiently control and manage uncertainty in training datasets, the ANN, SVM, and RF models are not discussed in detail. After preparing the urban gain maps of Tabriz and Isfahan, the urban gain samples from the first time interval were used to develop the ME and ENFA models. The outputs of these models are maps with values between 0 and 1, which indicate the change potential by considering the change samples in the first interval. In these maps, a cell with a value close to 1 has driver values that are close to those of the change samples. By setting a threshold of 0.5, non-change cells with values that exceed this threshold are removed from the training dataset. An equal number of non-change samples as change samples was then selected from the cells with values below 0.5 according to the clustering-based sampling approach that Ahmadlou, Karimi, and Pontius (2021) proposed. The ANN, SVM, and RF models were then developed using this training dataset and evaluated using the urban gain of the second time interval. The predictions of these three models were compared to the real values. The models were evaluated using the TOC (Pontius and Kangping 2014) and the Hits, Misses, and False Alarms entries in the confusion matrix.
This study primarily uses the ME and ENFA models to examine the uncertainty in the urban gain training dataset. Nonetheless, the outputs of these models can also be used for UGM. These models, also known as one-class algorithms (Moya and Hush 1996), are developed solely using change samples. The outputs of these models were compared with the urban gain of the second time interval. The results were evaluated using the TOC, the FoM (Eq. 3), the PA (Eq. 4), and the OA (Eq. 5):

$$\mathrm{FoM} = \frac{\text{Hits}}{\text{Hits} + \text{Misses} + \text{False Alarms}} \tag{3}$$

$$\mathrm{PA} = \frac{\text{Hits}}{\text{Hits} + \text{Misses}} \tag{4}$$

$$\mathrm{OA} = \frac{\text{Hits} + \text{Correct Rejections}}{\text{Hits} + \text{Misses} + \text{False Alarms} + \text{Correct Rejections}} \tag{5}$$
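As a worked illustration of the three measures, the following sketch computes the FoM, PA, and OA from observed and predicted change maps using their usual definitions in terms of Hits, Misses, False Alarms, and Correct Rejections. The function names and the toy maps are hypothetical.

```python
import numpy as np

def confusion_entries(observed, predicted):
    """Count the four confusion matrix entries for binary change maps."""
    observed = np.asarray(observed, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    hits = int(np.sum(observed & predicted))
    misses = int(np.sum(observed & ~predicted))
    false_alarms = int(np.sum(~observed & predicted))
    correct_rejections = int(np.sum(~observed & ~predicted))
    return hits, misses, false_alarms, correct_rejections

def validation_metrics(hits, misses, false_alarms, correct_rejections):
    """Figure of merit, producer's accuracy, and overall accuracy."""
    total = hits + misses + false_alarms + correct_rejections
    return {
        "FoM": hits / (hits + misses + false_alarms),
        "PA": hits / (hits + misses),
        "OA": (hits + correct_rejections) / total,
    }

# Toy example (hypothetical): 10 cells, 3 observed changes, 3 predicted changes.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
entries = confusion_entries(observed, predicted)
metrics = validation_metrics(*entries)
```

Because non-change cells usually dominate, the OA tends to be high even for weak models, which is one reason the FoM and PA are reported alongside it.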

Sampling of the urban gain modeling using ME and ENFA
The urban gain drivers for Isfahan in 1994 and Tabriz in 1992 were used as the predictor variables, and the urban gain of Isfahan between 1994 and 2004 and the urban gain of Tabriz between 1992 and 2002 were used as the target or dependent variable. The ME and ENFA values of all the non-change cells were calculated using the change samples and the urban gain drivers of the first time interval. Figures 3 and 4 depict the ME and ENFA maps calculated for the non-change cells for Isfahan in the first time interval. These maps were generated using only the urban gain samples of the first time interval and without the use of any of the non-change samples. As seen, the ME and ENFA values range from 0 to 0.96 and from 0 to 1, respectively. The higher and closer to one the ME value, the higher the potential of a non-change sample to change. Conversely, the lower the ME value, the lower the potential for urban gain.
An equal number of non-change samples as change samples was generated from the first time interval (black points in Figures 3 and 4) to depict the uncertainty of the sampling approaches that extant studies have adopted. As seen in Figure 3 (A1–A6), a large number of the randomly generated non-change samples were placed in cells with high ME and ENFA values. The ME and ENFA values of all the cells were also calculated using the urban gain samples and drivers of the first time interval for Tabriz (Figures 5 and 6). Here too, an equal number of non-change samples as change samples was randomly generated, and a large number of them were placed in cells with high ME and ENFA values. To create the training dataset, non-change cells with ME and ENFA values above 0.5 were removed from the study areas. Then, as Ahmadlou, Karimi, and Pontius (2021) recommend, a clustering-based approach was used to select an equal number of non-change samples as change samples from the remaining cells.
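The contrast between random sampling and threshold-based filtering can be sketched as follows. The suitability values are synthetic and drawn at random purely for illustration, and the final draw uses simple random selection as a stand-in for the clustering-based selection of Ahmadlou, Karimi, and Pontius (2021), which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic suitability values for 1000 non-change cells (assumed, not real data).
suitability = rng.random(1000)
n_change = 100  # hypothetical number of change samples to match

# Conventional approach: draw non-change samples at random from all cells.
random_idx = rng.choice(suitability.size, size=n_change, replace=False)
impure_share = float(np.mean(suitability[random_idx] > 0.5))

# Filtered approach: drop cells above the 0.5 threshold, then draw samples.
candidate_idx = np.flatnonzero(suitability <= 0.5)
filtered_idx = rng.choice(candidate_idx, size=n_change, replace=False)
filtered_impure_share = float(np.mean(suitability[filtered_idx] > 0.5))
```

Here `impure_share` is the fraction of randomly drawn non-change samples that fall in cells with high change potential, whereas the filtered draw contains none by construction.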

Binary classifiers
These models were developed using the binary labels 0 and 1, where 0 indicates that the examined phenomenon did not occur while 1 indicates that it did occur. In urban gain modeling, these values refer to the non-change (0) and change (1) samples. Binary models are the most common examples of UGMs. After using ENFA and ME to create the training datasets for Isfahan and Tabriz, three well-known binary models, RF, ANN, and SVM, were used in UGM. More specifically, the ME and ENFA models were used to create two training datasets for Tabriz and two training datasets for Isfahan. Six hybrid models, ME-ANN, ME-RF, ME-SVM, ENFA-ANN, ENFA-RF, and ENFA-SVM, were then developed for each case. The data of the second time interval, including the urban gain variable and the drivers at t2, were used to validate the developed models. Figures 7 and 8 depict the error maps as well as the values of the four entries in the error matrix (Table 2) for Isfahan and Tabriz, respectively. They also provide the prediction success rates of the models.

ENFA and ME models as one-class classifiers for UGM
Apart from the binary ANN, RF, and SVM models, the outputs of the ME and ENFA models as one-class classifiers were directly used as suitability maps. To this end, the cells with the highest ME and ENFA values, equal in number to the urban gain cells of the second time interval, were selected and considered the predicted urban gain cells for the second time interval. Figures 9 and 10 depict the error maps of these models for the two cities in the second time interval.
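The quantity-matched selection described above can be sketched as a top-k rule, where k is the number of observed urban gain cells in the second time interval. The function name and the toy values are hypothetical, and the sketch assumes that the suitability array only contains cells that are candidates for urban gain.

```python
import numpy as np

def predict_top_k(suitability, k):
    """Mark the k cells with the highest suitability as predicted change."""
    suitability = np.asarray(suitability, dtype=float)
    predicted = np.zeros(suitability.shape, dtype=bool)
    if k > 0:
        # Stable sort on the negated values ranks cells from high to low.
        top = np.argsort(-suitability, kind="stable")[:k]
        predicted[top] = True
    return predicted

# Toy example (hypothetical): 6 candidate cells, 2 of which actually changed.
suitability = np.array([0.2, 0.9, 0.4, 0.8, 0.1, 0.7])
observed_change = np.array([0, 1, 0, 0, 0, 1], dtype=bool)
k = int(observed_change.sum())
predicted_change = predict_top_k(suitability, k)
```

Matching the predicted quantity to the observed quantity means that the comparison with the second interval reflects only allocation errors, not errors in the amount of change.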

Validating the models using TOC and the four entries in the confusion matrix
Figures 10 and 11 show the TOC curves of the six hybrid binary models (i.e., ME-ANN, ME-SVM, ME-RF, ENFA-ANN, ENFA-SVM, and ENFA-RF) together with the two one-class models of ME and ENFA for Isfahan and Tabriz, respectively. For Isfahan, the proposed ME-based sampling approach outperformed the proposed ENFA-based sampling approach. As seen in Figure 10, the RF-ME model (AUC = 0.737) was the most accurate, followed by the ANN-ME (AUC = 0.646), RF-ENFA (AUC = 0.631), ANN-ENFA (AUC = 0.619), SVM-ENFA (AUC = 0.512), and SVM-ME (AUC = 0.509) models. Furthermore, the SVM-based models were overfitted. The accuracies of the ME (AUC = 0.649) and ENFA (AUC = 0.661) one-class models were close to those of the ANN-ME, ANN-ENFA, and RF-ENFA models but differed significantly from that of the RF-ME model. Compared to the ANN (AUC = 0.682), SVM (AUC = 0.481), and RF (AUC = 0.661) models constructed by balanced sampling without the ME and ENFA models, the proposed RF-ME (AUC = 0.737) increased the area under the receiver operating characteristic curve (AUROC) by 5.5% in the validation phase of Isfahan City.
The proposed ME-based sampling approach outperformed the proposed ENFA-based sampling approach for Tabriz as well. As seen in Figure 12, the RF-ME model (AUC = 0.852) was the most accurate, followed by the ANN-ME (AUC = 0.778), RF-ENFA (AUC = 0.504), SVM-ENFA (AUC = 0.503), ANN-ENFA (AUC = 0.502), and SVM-ME (AUC = 0.449) models. The RF-ENFA, SVM-ENFA, ANN-ENFA, and SVM-ME models were overfitted. The accuracies of the ME (AUC = 0.657) and ENFA (AUC = 0.668) one-class models were lower than those of the two-class RF-ME and ANN-ME models. Compared to the ANN (AUC = 0.71), SVM (AUC = 0.521), and RF (AUC = 0.791) models constructed by balanced sampling without the ME and ENFA models, the proposed RF-ME (AUC = 0.852) increased the AUROC by 6% in the validation phase of Tabriz City. Table 3 provides the PA, OA, and FoM of all the models. As seen, the PA, OA, and FoM of the ME-RF and ME-ANN models were higher than those of the other models.
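The reported AUROC gains appear to be absolute differences (percentage points) between the RF-ME model and the best baseline model built without ME or ENFA sampling, rather than relative improvements. The short check below reproduces this reading from the AUC values quoted in the text; the dictionary layout and function name are illustrative assumptions.

```python
# AUC values as quoted in the text for the validation phase.
reported_auc = {
    "Isfahan": {"RF-ME": 0.737, "ANN": 0.682, "SVM": 0.481, "RF": 0.661},
    "Tabriz": {"RF-ME": 0.852, "ANN": 0.71, "SVM": 0.521, "RF": 0.791},
}

def gain_over_best_baseline(aucs, proposed="RF-ME"):
    """Absolute AUC difference between the proposed model and the best baseline."""
    best_baseline = max(v for name, v in aucs.items() if name != proposed)
    return round(aucs[proposed] - best_baseline, 3)

gains = {city: gain_over_best_baseline(aucs) for city, aucs in reported_auc.items()}
```

Under this reading, the gains are 0.055 for Isfahan and 0.061 for Tabriz, which matches the stated 5.5% and roughly 6%.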

Discussion
Urban gain is very important due to its impact on ozone concentration, water quality, pollution, food security, and so on. Although multiple studies have developed models to depict urban gain behaviors, very few studies have examined the uncertainties in the training datasets that are used in these models, even though these datasets are often plagued by imbalance issues and the impurity of non-change samples.
The class imbalance problem in the training dataset can be overcome by using equal quantities of change and non-change samples. Past studies have randomly selected change and non-change samples to create balanced training datasets. The impurity of the non-change samples in the training dataset is another significant issue. More specifically, UGMs may encounter samples that have identical features but of which some are labeled change and others non-change. As there was no logical approach for selecting non-change samples from the available cells in the past, UGMs contained samples that had been randomly chosen from the available cells. The findings of the present study indicate that randomly selecting cells for non-change samples yields samples that, in reality, may have a high potential for change, so that these samples are erroneously labeled non-change and entered into the training datasets. Therefore, the present study proposed a balanced sampling approach that uses two models, namely ME and ENFA, to select cells with the lowest potential for change as non-change samples and enter them into the training dataset. Conway and Wellen (2011) used the ENFA model to examine the purity of the non-change samples that they used to model the urban gain of the Barnegat Bay watershed in New Jersey, United States. However, their one-class ENFA model failed to outperform the logistic regression model. Conversely, the present study found that the one-class ENFA model outperforms the ENFA-based ANN, RF, and SVM binary models in both study areas. This could be because Conway and Wellen (2011) created their logistic regression model using non-change samples with the lowest urban gain potential that their ENFA-based urban gain suitability map had overestimated. The ENFA-based binary models of the present study, however, had reasonable results. The present study also used the ME model to select pure non-change samples and build a one-class model. As one-class models, the ENFA model outperformed the ME model in both Tabriz and Isfahan. Nevertheless, the binary ANN and RF models constructed using the non-change samples that the ME probability map selected outperformed the one-class models as well as the other models built on the ENFA probability map. Although there were no significant differences between the ANN-ME model for Isfahan and the one-class models of ME and ENFA, the ME-based binary ANN and RF models outperformed the other models in both study areas. Similar to the findings of Ahmadlou, Karimi, and Pontius (2021), the SVM-based models of the present study were overfitted in both study areas. Therefore, they fail to model the urban gain of both study areas.
The use of samples with the greatest variety in the training dataset for UGMs is also a significant challenge. Therefore, after using the ENFA and ME models to remove non-change samples with change potential from the training dataset, the present study used the framework proposed by Ahmadlou, Karimi, and Pontius (2021) to diversify the non-change samples.
As the urban gain patterns of Isfahan and Tabriz cities are very complex, multiple studies have attempted to model the urban gain in these areas. Most of these studies have focused on developing new hybrid models. More specifically, Parvinnezhad et al. (2021) proposed using support vector regression to integrate an adaptive neural fuzzy inference system and a fuzzy rough set to model the urban gain of Tabriz City. A comparison of the accuracy of the modeling results of that study and that of the present study indicates that focusing on sampling can improve the performance of a model more than developing hybrid models. Apart from that, Shafizadeh-Moghadam et al. (2017) developed a model that used the Land Transformation Model (LTM) and cellular automata to model the urban gain of Isfahan City. A comparison of the accuracy of the modeling results of that study and that of the present study also indicates that using a suitable sampling approach is more important than developing hybrid models.
The present study examined the use of ME and ENFA models for sampling as well as one-class classifiers and discovered that, as one-class classifiers, they were less accurate than binary models. Therefore, binary models are preferable to one-class models. This finding is in line with the study of Zhu et al. (2018), which compared two one-class models, namely one-class SVM and kernel density estimation, and two binary models, namely ANN and SVM. A similar finding was reported by Pandey, Reza Pourghasemi, and Chand Sharma (2020), who compared one-class ME and binary SVM.
The imbalance issue and the impurity of non-change samples are even more complex in multiple LUC modeling as, apart from imbalances between the change and non-change classes, imbalances also occur between the change classes. Therefore, future studies may endeavor to overcome these issues in multiple LUC modeling. Imbalance issues in the training datasets of study areas that contain various interclass imbalance ratios also warrant further study. To address the impurity of non-change samples in multiple LUC modeling, researchers can produce suitability maps using ME for each type of LUC class to select the cells with the lowest potential for change as non-change samples. Also, a simple solution to overcome the imbalance between the change classes is to select, from each type of LUC class, a number of change samples equal to that of the land use class with the smallest number of change samples.
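The suggestion for balancing the change classes in multiple LUC modeling can be sketched as a simple undersampling step in which every change class is reduced to the size of the smallest one. The function name, the class names, and the random selection within each class are illustrative assumptions.

```python
import numpy as np

def balance_change_classes(class_labels, seed=0):
    """Select, for every change class, as many samples as the smallest class has."""
    rng = np.random.default_rng(seed)
    class_labels = np.asarray(class_labels)
    classes, counts = np.unique(class_labels, return_counts=True)
    n_per_class = int(counts.min())

    selected = []
    for c in classes:
        idx = np.flatnonzero(class_labels == c)
        selected.append(rng.choice(idx, size=n_per_class, replace=False))
    return np.sort(np.concatenate(selected))

# Toy example (hypothetical change classes with 6, 3, and 2 samples).
labels = np.array(["crop_to_urban"] * 6 + ["open_to_urban"] * 3 + ["crop_to_open"] * 2)
selected = balance_change_classes(labels)
selected_counts = np.unique(labels[selected], return_counts=True)[1]
```

After this step, each change class contributes two samples, the size of the smallest class in the toy data.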

Conclusion
This present study explored the uncertainties that arise in samples that are used in UGM, namely imbalance problem and impurity of the nonchanges samples.Sampling is one of the most important steps when UGM using data-driven models as it may result in many uncertainties in the model outputs and affect its precision and accuracy.As such, this present study used two balanced MEand ENFA-based sampling approaches for UGM.Three well-known and widely used data mining models, namely ANN, SVM and RF and six hybrid models, namely ME-ANN, ME-SVM, ME-RF, ENFA-ANN, ENFA-SVM, and ENFA-RF that had been constructed using proposed sampling strategies were used to evaluate the efficacy of the proposed sampling approaches Two ME-and ENFA-based oneclass models were also developed and compared with proposed two-class hybrid models.The urban gain of Isfahan City at the two time intervals of 1994-2004and 2004-2014and that of Tabriz City between 1992to 2002and 2002to 2012 were used to evaluate the proposed sampling approaches.The proposed sampling approaches were found to significantly increase the accuracy of the data mining models and decrease the size of the training dataset and computational load of these models.Furthermore, the non-change samples that are selected for use in a training dataset should have the lowest potential for change and differ completely from the change samples.Therefore, the concept of "garbage in, garbage out" is important in data mining and selecting the correct samples for the training dataset significantly affects the success rate of machine learning and data mining models.
The binary data mining models developed in this study also outperformed the one-class models.
This study provided a new perspective on sampling strategy and proposed two sampling approaches, one ME-based and one ENFA-based, for creating the training dataset for urban gain models. As data sampling is one of the most significant preprocessing steps in the data mining process, researchers and modelers may use the sampling strategy adapted and proposed in the present study to improve the accuracy of the many other machine learning and data mining techniques, such as decision trees. Moreover, future studies may investigate applying the proposed sampling approaches in other study areas with different rates of urban gain.

Figure 1. Location of both study areas.

Figure 2 depicts the modeling process and the proposed sampling strategy. The method used to create the training dataset for the various data-driven models should solve the class imbalance problem and include non-change samples with no to low potential for change. The study involves several steps, which are outlined below:
Step 1: After preparing the necessary land use maps, the urban gain cells of the selected time intervals (1992 to 2002 and 2002 to 2012 for Tabriz; 1994 to 2004 and 2004 to 2014 for Isfahan) were extracted in the GIS environment. The urban gain drivers were then prepared for the two cities.
Step 2: The first time interval of both study areas was used for the sampling procedure and to build the models. The urban gain potential of the non-change samples was calculated using the ENFA and ME models together with the urban gain samples and drivers of the first time interval. The proposed sampling strategy preserves the change samples of the first time interval, enters them into the training dataset, and then selects an equivalent number of non-change samples with the lowest ME and ENFA values to create the training dataset (Figure 2).
Step 3: This training dataset was then used to construct three well-known machine learning models: random forest (RF), artificial neural network (ANN), and support vector machine (SVM). These models Nasser and Cessac 2014). This distribution function is proportional to the conditional probability of being positive. Refer to Phillips, Anderson, and Schapire (2006) and Phillips, Dudík, and Schapire (2004) for more details.
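The selection rule of Step 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name, the cell identifiers, and the potential values are hypothetical, and the potential scores are assumed to have been produced beforehand by an ME or ENFA model.

```python
def build_balanced_training_set(change_cells, nonchange_potential):
    """Keep all change cells and add an equal number of the purest non-change cells.

    change_cells: list of identifiers of cells that changed in the first interval.
    nonchange_potential: dict mapping each non-change cell identifier to its
        urban gain potential (assumed to come from an ME or ENFA model).
    Returns a list of (cell_id, label) pairs, with label 1 = change, 0 = non-change.
    """
    n_change = len(change_cells)
    if n_change > len(nonchange_potential):
        raise ValueError("not enough non-change cells to balance the dataset")
    # Rank non-change cells from the lowest to the highest potential for change.
    ranked = sorted(nonchange_potential, key=nonchange_potential.get)
    purest = ranked[:n_change]  # the cells least likely to change
    return [(c, 1) for c in change_cells] + [(c, 0) for c in purest]


# Hypothetical example: two change cells and four candidate non-change cells.
change_cells = [101, 102]
potential = {1: 0.90, 2: 0.05, 3: 0.40, 4: 0.10}
training_set = build_balanced_training_set(change_cells, potential)
```

Here the two change cells are paired with cells 2 and 4, the two non-change candidates with the lowest potential, which gives a balanced training set of four samples.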

Figure 2. The flowchart of the urban gain modeling.

Figure 3. The maximum entropy of non-change samples for Isfahan City.

Figure 4. The ecological niche factor analysis of non-change samples for Isfahan City.

Figure 5. The maximum entropy of non-change samples for Tabriz City.

Figure 6. The ecological niche factor analysis of non-change samples for Tabriz City.
training datasets. However, as urban gain datasets contain significantly fewer change samples than non-change samples, random sampling causes the training dataset to contain more non-change samples, which skews the model toward these samples. Therefore, urban gain modelers should consider using equal quantities of change and non-change samples in the training dataset, as machine learning and statistical models require balanced training datasets. This study used under-sampling, which some extant studies have used, to build a balanced training dataset. Apart from the class imbalance problem, the impurity of the non-change samples used in the training data

Figure 9. The spatially distributed errors of the urban gain predictions of the (a) ME and (b) ENFA models for Isfahan City in the second time interval (2004-2014).

Figure 10. The spatially distributed errors of the urban gain predictions of the (a) ME and (b) ENFA models for Tabriz City in the second time interval (2002-2012).

Figure 11. The TOC and AUC of the proposed models for Isfahan City.

Figure 12. The TOC and AUC of the proposed models for Tabriz City.

Table 1. List of spatial drivers of urban gain in Isfahan and Tabriz.

Table 2. The error matrix.

Table 3. The validation of the eight models using FOM, OA, and PA.
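The three metrics named in Table 3 can be computed from the four cells of an error matrix such as the one in Table 2. The sketch below uses the definitions commonly applied in land change modeling (hits, misses, false alarms, and correct rejections). These formulas, the function name, and the example counts are assumptions made for illustration, since the definitions used in this study are not reproduced in this section.

```python
def validation_metrics(hits, misses, false_alarms, correct_rejections):
    """Compute FOM, PA, and OA from the cells of a binary error matrix.

    hits: observed change predicted as change
    misses: observed change predicted as non-change
    false_alarms: observed non-change predicted as change
    correct_rejections: observed non-change predicted as non-change
    """
    total = hits + misses + false_alarms + correct_rejections
    if total == 0 or hits + misses == 0:
        raise ValueError("the error matrix must contain observed change cells")
    fom = hits / (hits + misses + false_alarms)  # figure of merit
    pa = hits / (hits + misses)                  # producer's accuracy of change
    oa = (hits + correct_rejections) / total     # overall accuracy
    return {"FOM": fom, "PA": pa, "OA": oa}


# Hypothetical counts for a 1000-cell validation map.
metrics = validation_metrics(
    hits=40, misses=10, false_alarms=30, correct_rejections=920
)
```

With these hypothetical counts, FOM is 40/80 = 0.5, PA is 40/50 = 0.8, and OA is 960/1000 = 0.96. The large gap between OA and FOM illustrates why overall accuracy alone can be misleading when non-change cells dominate the map.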