Integrating geographical information systems, remote sensing, and machine learning techniques to monitor urban expansion: an application to Luanda, Angola

ABSTRACT According to many previous studies, application of remote sensing for the complex and heterogeneous urban environments in Sub-Saharan African countries is challenging due to the spectral confusion among features caused by diversity of construction materials. Resorting to classification based on spectral indices that are expected to better highlight features of interest and to be prone to unsupervised classification, this study aims (1) to evaluate the effectiveness of index-based classification for Land Use Land Cover (LULC) using an unsupervised machine learning algorithm Product Quantized K-means (PQk-means); and (2) to monitor the urban expansion of Luanda, the capital city of Angola in a Logistic Regression Model (LRM). Comparison with state-of-the-art algorithms shows that unsupervised classification by means of spectral indices is effective for the study area and can be used for further studies. The built-up area of Luanda has increased from 94.5 km2 in 2000 to 198.3 km2 in 2008 and to 468.4 km2 in 2018, mainly driven by the proximity to the already established residential areas and to the main roads as confirmed by the logistic regression analysis. The generated probability maps show high probability of urban growth in the areas where government had defined housing programs.


Introduction
Land Use Land Cover (LULC) maps are one of the bases representing the dynamic of urban environments demanding innovative concepts and techniques to obtain spatio-temporal information for urban planning (Cavur et al. 2015;Madasa, Orimoloye, and Ololade 2021;Huang et al. 2017;Orimoloye and Ololade 2020;Shao et al. 2021).LULC classification is proven to be a challenging task in remote-sensing-based urban studies due to the spectral confusion among features caused by complexity and heterogeneity of urban environments (Carranza-García, García-Gutiérrez, and Riquelme 2019; Simwanda and Murayama 2017).Very often, a single pixel of a satellite image represents multiple cover types generating confusion during classification: a mixed pixel problem that depends on the spatial resolution of the satellite image and the spatial distribution of the cover type (Choodarathnakara, Kumar, and Patil 2012).The mixed pixel problem is mainly observed in sub-Saharan African cities due to the highly complex spatial structure of the urban environment, a wide range of construction materials (Simwanda and Murayama 2017;Huang et al. 2021) along with the limited coverage of high-resolution satellite images.Although several approaches have been developed to resolve the mixed pixel in medium-resolution images, obtaining good results is still a "black box" (Trinder and Liu 2020).Most of the land-cover information extraction algorithms employ pixel-based methods, but when this approach is applied using high-resolution images, a "salt and pepper" effect is produced, contributing to a bad classification result (Huang et al. 2017).Pixel-based methods are gaining more attention for research and development with the increase of open data available on a global scale despite the emerging Very High Resolution (VHR) imagery having led to the development of object-based approaches (Lennert et al. 2019).There is no advantage using object-based classification over pixel-based classification when only medium-spatial resolution images are used (Weih and Riggan 2010;Mallinis, Plenioub, and Koutsiasb 2010;Adam, Clsapovics, and Elhaja 2016;Tassi et al. 2021).Also, the lack of spectral information necessary to discriminate spectrally similar pixels enforces the inaccuracy of the classification (Weih and Riggan 2010), which can be solved by a subpixel-based approach (MacLachlan et al. 2017).In addition, the selection of the classifier may also influence the accuracy of the classification (Cheţan, Dornik, and Urdea 2017).All the land-cover information extraction approaches mentioned above mainly resort to supervised classification.Supervised classification requires training samples and is, therefore, timeconsuming, and its accuracy depends on the users' skills.Furthermore, external factors, such as the size and quality of training samples, usually affect the performance of popular classifiers (Yang, X., and D. Shi. 2016;Huang et al. 2021).Studies in Remote Sensing (RS) often resort to band combinations that highlight features of interest (Hurd 2015).However, the land-cover information extraction from spectral indexes have also been implemented in many studies (e.g.Mwakapuja, Liwa, and Kashaigili 2013;Patel and Mukherjee 2015).Spectral indices are combinations of spectral reflectance from two or more wavelengths that indicate the relative abundance of features of interests (Harris Geospatial Solutions n.d.a).Classification based on spectral indices (Indexbased classification) does not require training samples and endmembers (Li et al. 2015), reduces spectral confusion, and increases spectral contrast among different land-use classes (Xu 2007).Index-based classification also has some limitations of highlighting one specific land-cover type and confusion in discriminating some land-cover types (Li et al. 2015).To overcome this problem, Xu (2007) and Mwakapuja, Liwa, and Kashaigili (2013), by combining spectral indices, and resorting to supervised classification, extracted built-up features, vegetation, and water.
The final result of remote-sensing products can further be used to solve various problems in society.For example, from historical records of satellite images, predictive models can be generated based on change detection to understand a particular phenomenon over time, such as urban expansion and its influence in forest cover change and habitat loss.Eyoh and Nihinlola (2012), after classifying three Landsat images in Lagos, Nigeria from the years 1984, 2000, and 2005 using the K-means unsupervised algorithm, modeled, and predicted future urban expansion by employing logistic regression, considering 10 explanatory variables: distance to water, residential structures, industrial and commercial centers, major roads, railway, Lagos Island, international airport, international seaport, University of Lagos, and distance to Lagos State University.Alsharif and Pradhan (2014), by selecting distance to main active economic centers, central business district, the nearest urbanized area, educational area, roads, and to urbanized areas, easting, and northing coordinates, slope, restricted area, and population density, as possible drivers of urban sprawl, generated probability maps using data from 1984 to 2002 in the calibration phase and data from 2002 to 2010 in the validation phase resulting in an accuracy rate of 0.86.Achmad et al. (2015) predicted LULC for the year 2029 based on Banda Aceh's (Indonesia) spatial plan regulation 2009 and 2029, by analyzing a total of seven potential driving factors. .population density, distance to central business district, green open space, historical area, river, highway, and coastal area.Traore and Watanabe (2017), integrating a Logistic Regression Model (LRM), GIS and RS, analyzed and quantified urban growth patterns and investigated its relationship with driving factors to simulate an urban growth probability map for Conakry.The results of the LRM indicated that the variables elevation and distance to major roads resulted in the best fit and the highest regression coefficients.
The synergy between GIS and RS has benefited from advances in computing, Global Positioning Systems (GPS) and Artificial Intelligence (AI) methods, particularly Machine Learning (ML), allowing massive amounts of imagery to be analyzed rapidly, outperforming more traditional tools of analysis, developed before the Big Data era (Merchant and Narumalani 2008;Vopham et al. 2018;Dangermond and Goodchild 2020) requiring higher computational costs.Cloud-based geospatial processing platforms are one of such advances that have been used for reducing computational and memory costs since the data is stored in the clouds and small portions of the data (maps, images) can be accessed upon request, manipulated on the fly resorting to built-in programming languages and ML algorithms, and the processing is done in servers owned by a big tech company or across multiple devices distributed around the world.The geospatial data-oriented cloud platform Google Earth Engine (GEE) has become the first choice for geospatial data analyses in different domains (Tamiminia et al. 2020;Yang and Huang 2021;Mao et al. 2021;Laipelt et al. 2021;Coelho et al. 2021;Junting et al. 2021) and has been used as a repository and to produce large scale LULC maps (Buchhorn et al. 2020;Karra et al. 2021;Gong et al. 2020;Midekisa et al. 2017) despite its limitations to a small number of classification and regression algorithms and inability to perform Complex machine/deep learning algorithms that need to be performed outside of its environment due to computational restrictions (Amani et al. 2020).
Remote sensing allows the collection of data from upper space and ML algorithms can help to extract knowledge from its products that can later be manipulated by overlaying with a cartographic dataset (e.g.road, river, sample points) derived from a GIS environment to produce a merged product that can be visualized for further interpretation (Merchant and Narumalani 2008;Guobin, Krakover, and Blumberg 2003;Abdollahnejad et al. 2019).ML allows systems to learn and improve automatically from experience (Awad and Khanna 2015).Experiments have shown that ML algorithms outperform parametric techniques for RS studies (with Random Forest (RF) and Support Vector Machine (SVM) achieving the best average accuracy) but, use, and implementation uncertainties limit their usage (Maxwell, Warner, and Fang 2018).In this research, to solve the mixed pixel problem, a Thematic-oriented Index Combination (TOIC) (Xu 2007;Mwakapuja, Liwa, and Kashaigili 2013) is implemented to improve the discrimination among features, which is expected to be prone to unsupervised classification.Furthermore, ML algorithms are used for: (1) classification of the satellite image using a memory and computationally efficient unsupervised classifier, Product Quantized K-means (PQk-means) (Matsui et al. 2017), to evaluate the effectiveness of index-based classification for LULC; and (2) regression (logistic regression) to monitor the urban expansion of Luanda, the capital city of Angola, for the periods of 2000-2008 and 2008-2018, having into account factors that may have driven Luanda's urban expansion.

Study area
Urban growth in Luanda has increased significantly, which forced the government to expand its area in 2011 by altering the political-administrative division of Luanda and Bengo provinces (Info-Angola n.d).It is now composed of seven municipalities: Cacuaco, Belas, Cazenga, Ícolo e Bengo, Luanda, Kissama, and Viana (Figure 1).

Urbanization in luanda
From 1948 to 1975, a series of plans failed to regulate urban expansion by the capital's fast growth (Jenkins, Robson, and Cain 2002;Viegas 2012).After independence in 1975, the flow of rural population to the city led to a rapid expansion of informal properties.In 2002, the end of the civil war had led to an increasing search for opportunities toward improving living standards.The diagnosis of the city's urbanization had been carried out empirically, and all proposals and interventions had been made randomly (Viegas 2012).A metropolitan plan was approved in November 2015 by the Angolan government to be developed over the next 15 years starting from 2015 (IPGUL (Luanda Urban Planning and Management Institute [Instituto de Planeamento e Gestão Urbana de Luanda]) 2015).The government intended to establish extents of 2030 (expected population 12.9 million) urban area and to impose and enforce a new urban boundary and keep sprawl in check.

Data and methods
The satellite images used in this study were obtained from different sources: NASA's Landsat 5 Thematic Mapper (L5 TM), Landsat 7 Enhanced Thematic Mapper Plus (L7 ETM +), Landsat 8 Operational Land Imager (L8 OLI) (USGS n.d.); ESA's Sentinel 1B (S1B) Synthetic Aperture Radar (SAR); and JAXA's ALOS Digital Surface Model (DSM) (JAXA n.d.) (Table 1).The date of acquisition was based on availability of the satellite data that also depends on cloud coverage (trying to select images from the same month) that was also crucial for determining the period of study.

Preprocessing and classification
The satellite images were re-projected to the WGS84 UTM zone 33S projection system, clipped to the extent of the study area using vector data and modified according to Luanda's new administrative division (mapcruzin n.d; Info-Angola n.d; IPGUL n.d).The Sentinel1 SAR images were radiometrically calibrated, filtered and, geometrically corrected in SNAP software using the sentinel1 toolbox.ALOS DSM (v1.1, v1.2) was used as a predictor variable for the LRM and to generate the slope raster.
The Landsat images were atmospherically corrected in QGIS 2.18.For the year 2018, Fast Line-of-sight Atmospheric Analysis of Hypercubes (FLAASH) atmospheric correction in ENVI software resulted in a better performance of the classifier (see the comparison in supplementary material).BQA bands from the three Landsat data were used to remove clouds from the scenes masking out values higher than 672 for Landsat 5 and 7 and 2720 for Landsat 8 according to convenience, since getting rid of clouds is paramount for the proposed classification approach.

Calculation of spectral indexes from Landsat 8 data
Five spectral indexes, Normalized Difference Built-up Index (NDBI), Moisture Enhancement Index (MEI) (Liau 2016), Vegetation Index considering Green and Shortwave Infrared (VIGS) (Hede et al. 2015), Dry Built-up index (DBI) (Rasul et al. 2018) and an empirical built-up index QzCal were calculated and used as input for the unsupervised classification.Additionally, SAR data (VV and VH polarized bands) were added to improve the classification.
For the 2018 image, the main initial idea was to select a built-up/bareland index, a water index, and a vegetation index, to discriminate the four LULC classes, having preference to spectral indexes that outperform conventional spectral indexes: NDBI; MEI instead of Normalized Difference Water Index (NDWI [Gao 1996]); and VIGS instead of Normalized Difference Vegetation Index (NDVI) or Soil Adjusted Vegetation Index (SAVI [Huete 1988]) (Figure 2(a,b)).
MEI combines three spectral indexes to handle their weak differences (Equation ( 2)).The first two indexes, NDWI (McFeeters 1996) and Modified Normalized Difference Water Index (Xu 2007(Xu , 2008)), enhance the water category by eliminating vegetation objects (higher reflectance in NIR and SWIR1 bands (Landsat 8) compared to the Green band while water reflects few differences).The third index eliminates urban/bareground reflectance by applying the difference between band 1 of Landsat 8 (coastal aerosol) and the green band (higher reflectance in the green band compared to the coastal aerosol band) while water reflects few differences.
Atmospheric and ground-surface conditions, such as clouds and soils, often distort NDVI accuracy in detecting vegetation (Shishir and Tsuyuzaki 2018).SAVI is more sensitive than NDVI in detecting vegetation in the low-plant covered areas as low as 15%, while the NDVI can only work effectively in the area with plant cover above 30% (Xu 2008).Given NDVI and SAVI's limitations, VIGS,  developed to discriminate unhealthy from healthy vegetation stressed by heavy metals in an area covered by thick vegetation, was chosen.VIGS combines 1) two spectral indexes, Green-Red and NDVI, to discriminate vegetation from other backgrounds; and 2) one spectral index NDWI (Gao 1996) to discriminate water from other backgrounds (Equation ( 3)).Since SWIR reflectance is sensitive to leaf water content (Hede et al. 2015), following the same idea as MEI, there is a hypothesis that VIGS in an area with non-thick vegetation may enhance vegetation and water categories (higher values) while eliminating urban/ bareground (lower values).
An experimental unsupervised classification for the band combination NDBI, MEI, VIGS is illustrated in Figure 3(a).The classified map still had some misclassification when discriminating bareground from builtup areas (Figure 3(a) in Figure 3).A fieldwork in Luanda aimed to visit the misclassified points (Figure 3(b,d) in Figure 3(a)) to understand the diversity of the surrounding features and how they respond to spectral indexes.
Trying to discriminate sand beach from built-up, which has been misclassified in the first experimental classification, an empirical spectral index was calculated: Quartz and Calcite are principal constituents of beaches in tropical and sub-tropical seas (Vincent 1997, 14).The spectral reflectance for Quartz and Calcite in TIR (range 10.6-11.19µm wavelength for Landsat 8 TIR1 band) is higher compared to their spectral reflectance in SWIR (both, 2.1 µm wavelength, which is within the Landsat 8 SWIR2 band range).Based on this assumption, the normal difference between  Landsat 8 TIR1 and SWIR2 bands, here referred to as QzCal spectral index (eEquation (4)), was calculated to discriminate sand beach from other backgrounds.An additional spectral index, dry built-up index (DBI) (Equation ( 5)) (Rasul et al. 2018) was calculated which together with QzCal, improved the classification result (Figure 3(d)) but built-up class was still misclassified (figures (a-c) in Figure 3(b)).
Since the result was still not satisfactory, 10 m resolution Sentinel-1 C-Bands (VV+VH) were added to improve the classification (Corbane et al. 2008;Abdikan et al. 2016;Sinha, Santra, and Mitra 2018).The reason is to take advantage of the double-bounce effect on built-up features from SAR data (Tavares et al. 2019).The final selection of spectral indexes was based on their performance regarding the diversity of spectral reflectance in the study area: NDBI, MEI, VIGS, DBI, QzCal, and two sentinel SAR bands resulting in 7 input data.

Classification algorithm
For the Landsat 8 OLI image, which provides unique specifications, such as radiometric resolution of 16 bits, and the aerosol band that is important for the calculation of one of the spectral indexes, a memory and computationally efficient unsupervised classifier, i.e.PQk-means was chosen.PQk-means first compresses input vectors into memory-efficient short codes by product quantization and then clusters the resultant product-quantized (Pq) codes in the compressed domain (Matsui et al. 2017).As with K-means (Harris Geospatial Solutions n.d.b), PQk-means finds the nearest center from each code and updates each center using a proposed sparse voting scheme (Matsui et al. 2017).The optimization process consisted of keeping the same dimension of the input because the dimension represents an odd number (seven input bands) (num_subdim = 7); setting the Pq-code to eight bits (Ks = 256); quantizing the original dataset (7 × 52,535,364-pixel) to only 5,000,000-pixel; and finally setting the number of classes to eight, a higher number than the expected number of classes desired for the classification, that were in a later stage merged into four classes.
The resulting PQk-means LULC map was compared with classified maps from state-of-the-art unsupervised K-means classifier available in python scikit-learn library, and supervised SVM classifier in ArcGIS Pro.The comparison with the supervised approach SVM was performed using the band combination NIR, SWIR, and red.For previous Landsat mission data (Landsat 7 for the year 2000 and Landsat 5 for the year 2008), which lack some important specifications mentioned earlier, supervised Maximum Likelihood Classification (MLC) and SVM respectively, were performed in ArcGIS Pro, also using the band combination NIR, SWIR, and red.The choice of the classifier for each case was based on performance when compared to other classification techniques (Huang et al. 2018).

Accuracy assessment
The accuracy assessment was conducted in QGIS using the Accuracy Assessment of Thematic Maps (AcATaMa) plugin (Corredor 2018).Despite being easy to use, the AcATama plugin does not provide the K coefficient, which is an important parameter to test interrater reliability (McHugh 2012).For this reason, the confusion matrix tool in ArcGIS Pro was used to determine the K coefficient.In total, 160 stratified random points (40 per class) were generated for the 2000 and 2008 classified images (Corredor 2018;Olofsson et al. 2013).For the 2018 image, 480 stratified random points (120 per class) were generated.To the random points generated, ground truth labels were assigned through visual interpretation having assistance from TOIC (NDBI, MEI, and VIGS), true-color composite (red, green, and blue) and false-color composite (NIR, SWIR, and red) of the same Landsat image used for classification each year (see supplementary material Figures 16).

Predictive model (logistic regression)
Different urban growth models consider different indicators to predict urban growth phenomena (Musa, Hashim, and Reba 2016).No single model appears to perform consistently well when applied to different geographical locations (Xie et al. 2005).Logistic regression empirically uses statistical techniques to model the relationship between LULC changes and drivers based on historical data (Atak et al. 2014).The logit Y ð Þ function (Equation ( 6)), a linear combination function of the explanatory variables was, used to carry out the logistic regression model (Li 2017).
where x i is the explanatory variable, β i is the regression coefficient to be estimated, and β 0 is the intercept.

Independent variables.
There is no rule or known global formula for selecting land-use drivers (Eyoh and Nihinlola 2012;Akinwande, Dikko, and Samson 2015).Predictors were selected a priori based on previous studies in African cities (Eyoh and Nihinlola 2012; Traore and Watanabe 2017), the current knowledge of the urbanization processes in the study area, as well as the results from hypothesized driving forces of urban growth (Atak et al. 2014).
Seven initial drivers were chosen: (see supplementary material Figure S5.2), but only the first five were selected to generate the models (Figure 5).

Multicollinearity.
Before any regression analysis, multicollinearity should be checked.Eliminating a variable involved in collinearity runs the risk of omitted variable bias (Lesschen, Verburg, and Staal 2005).Variance inflation factor (VIF) (Equation ( 7)) is the most widely used diagnostic for multicollinearity that may be calculated for each predictor by doing a linear regression of that predictor on all the other predictors and obtaining the R 2 from the regression (Allison 2012).VIF determines how much the variance of an estimated regression coefficient increases when predictors are correlated (Akinwande, Dikko, and Samson 2015).A Pearson correlation analysis among predictor variables was conducted prior to the logistic regression modeling, followed by VIF analysis (Tavares 2017).

Model evaluation.
The coefficient of determination R 2 is used as a standard to measure the overall strength of a linear regression model, with zero indicating a model with no predictive value and one indicating a perfect fit (Hu, Palta, and Shao 2006).Several goodness-of-fit indexes, also known as "pseudo-R 2 " analogs of R 2 as used in Ordinary Least-Squares (OLS) regression, exist to assess the predictive capacity of the logistic regression model.One such index, McFadden pseudo-R 2 (Equation ( 8)), is perhaps the most popular pseudo-R 2 index (Smith and McKenna 2013;Williams 2015).
where LL(Full) is the full-model log-likelihood, and LL(Null) is the intercept-only log-likelihood.For two models performed on the same data, the greater the likelihood, the higher McFadden's pseudo-R 2 (UCLA 2011).
Another index, Receiving Operating Characteristic (ROC), represented by the area under the curve (AUC), measures the relationship between expected and real changes (Alsharif and Pradhan 2014).ROC graphically represents "true positive" and "false positive" classification rates as a function of different classification cutoff values for the predicted probabilities resulting from the logistic regression (Smith and McKenna 2013).

Unsupervised classification
A major challenge in unsupervised learning is evaluating if the algorithm learned something useful, and often, the only way to evaluate its result is to inspect it manually (Muller and Guido 2017).
A comparison of the PQk-means classified map with maps from state-of-the-art classification techniques, K-means and SVM (see also RF and MLC in supplementary material Figure 7.2(a, b)) was made by performing accuracy assessment using the same ground truth.For the 2018 images, 93% and 95% of overall accuracy were achieved corresponding to PQk-means and K-means unsupervised algorithms, respectively, and from the supervised classification, 89%, 87%, and 78% of overall accuracy were achieved using SVM, RF, and MLC algorithms, respectively (Figures 6(a,b), 7 and Figure S7.2 in supplementary material; Table 2 and Table S2.2 in supplementary material).
The K-means algorithm was also used to validate the PQk-means computational efficiency, although PQk-means memory efficiency from this comparison could not be proved: While K-means took more than half a day to classify a large scale 7 × 52,535,364-pixel image, PQk-means only took approximately 16 minutes running on a machine with Intel Core i5 of 3.3 GHz and 8 GB memory.
Below, the spectral profiles of the four classes, are represented in parallel coordinates for the bands, NIR, SWIR, red, green (Figure 8(a)), and the spectral indexes used in unsupervised classification (Figure 8(b)).
While the first spectral profile only shows the discrimination in four bands, the second one shows a more diverse feature discrimination.It can easily be seen how well the separability was in both spectral profiles.Although the second one still has some confusion in  the SAR polarized bands, it allows us to see the unseen in seven dimensions.For the built-up class, false alarms on SAR image correspond to areas where backscattering greatly varies, like fields with rough bare soils (subfigure (d) in Figures 6(a,b) and 7) or with high soil moisture content.The build-up underestimation in the classified maps is essentially related to the presence of some low-density urban areas (sub-figure (a) in Figures 6(a) and 7) that do not face the radar beam and accordingly have a low return signal (Corbane et al. 2008).A higher underestimation of the built-up class can be observed on the SVM map in which only optical data were used (sub-figure (a) in Figure 7).
It is common that after improvement methods, built-up areas still get confused with everything since urban is extremely heterogeneous spectrally (NASA (National Aeronautics and Space Administration) 2017).The goal of this study was not necessarily to outperform conventional classification techniques.Instead, the interest was to find a solution for the spectral confusion of built-up areas in the Sub-Saharan African countries and find an alternative for the time-consuming supervised classification.2 and  3).Let's clarify that the bareground here referred does not correspond to pixels representing exclusively bareground.With 30 m resolution satellite images, individual pixels covering suburban area will represent a mixture of buildings, vegetation, and bareground (Smith and Hoffmann 2000), and the similarity of the spectral reflectance of soil and nonphotosynthetic vegetation in the Visible-NIR (VNIR) spectral region may underestimate regional scale total vegetation cover (Gill and Phinn 2008).Bareground pixels may contain vegetation but were assigned to the class label with the maximum proportion of the pixel (Chen et al. 2018).The discrimination of bareground and vegetation is probably the drawback of the indexbased unsupervised classification, which is not a big problem for proceeding to the second objective of this study focused on built-up class.

Built-up area estimation
After classification and assessment of accuracy, the different LULC classes were quantified (Table 2; Figure 10  (a and b)).It is important to clarify that the total study area from the different classification results slightly differs because of removed clouds' pixels.Apart from the overall accuracy, the user's accuracy should be paid attention to understand which classes resulted in fewer false alarms (Cao et al. 2018).The user accuracy of the built-up class for 2018 is high and slightly differs among the three classification approaches: 97% for PQk-means and K-means and 96% for SVM (Table 2).
The estimated built-up area for the year 2000 is around 94 km 2 , while the Atlas of Urban Expansion (AUE) of Angel et al. (2016) estimated the built-up area to be 171.75km 2 in 2000.The AUE differentiates the built-up pixels classified in the Landsat imagery into three types: (1) urban, when the majority of built-up pixels (50% or more) are within the Walking Distance Circle (WDC), defined as a circle with area 1 km 2 and radius 584 m; (2) suburban, (25-50% of built-up pixels in their WDC); and (3) rural (less than 25% of build-up pixels in their WDC).The threshold for this three-fold division is somewhat arbitrary, that might be the reason for such a significant difference between the classified map for the year 2000, and the AUE built-up estimation since the share of a class in each pixel within the WDC is not taken into account in this study.
For the year 2008, the estimated built-up area is around 197 km 2 (Figure 11(a)) while the map based on Envisat MERIS fine resolution 300 m image of 2009 (Figure 11(b)), (Arino et al. 2012) estimates 207 km 2 (check geodatabase Globcover dataset in supplementary material).
For 2018, the estimated built-up area is 468 km 2 (Figure 12(a)) that is comparable with the 407 km 2 of the 2015/2016 built-up area estimated by ESA (2017) on its high-resolution prototype map of Africa (Figure 12(b)).

Logistic regression
The results of the Pearson correlation analysis for the models corresponding to the periods 2000-2008 (model 1) and 2008-2018 (model 2) show a very high correlation among the variables of distance to residential areas, to industrial areas, and to the primary center (supplementary material Tables 3.2 and 4.2).After checking for multicollinearity, some researchers consider 2.5 as a threshold for VIF (Allison 2012), and others only include values close to 5.0 if the variable in case is a suppressor variable, i.e. a variable that is not  LULC 2000, 2008, 2018 (PQk-means), 2018 (K-means) and 2018 (SVM).
The AUC shows a very high power for distinguishing the two classes (built-up and non-built-up) in model 1 (97%), while in model 2 it shows a moderate discriminatory power (92%) (Figure 13(a and b)).In general, ROC curves with an AUC ≤ 75% are not clinically useful, and an AUC between 75% and 97% has moderate discriminatory power, while a ROC curve with an AUC of 97% has a very high clinical value (Fan, Upadhye, and Worster 2006).
All the P-values lower than 0.05 and its associated z values greater than 1.96 for positive z and less than −1.96 for negative z, show the importance of the predictor variables used in the two models in explaining urban expansion (Tables 5 and 6), which means that the null hypothesis was not satisfied for any predictor variable, allowing us to infer that urban expansion in Luanda is mainly driven by the proximity to (1) already established residential areas and (2) to the main roads that are confirmed by their higher coefficients in both models (negative relationship with distance to main roads and to residential areas).
From the probability maps, especially the one corresponding to 2018, areas with a higher probability of urban growth can be observed in places where the government had defined housing programs: Talatona in Camama district, Kilamba Kiaxi, Zango, Cacuaco (GPL (Government of the Province of Luanda The variable elevation changes from a positive and high influence on urban expansion to a negative and low influence, while the variable slope changes from a negative and low influence to a positive and still low influence during the period 2000-2018 (Tables 5 and  6).This trend might be because during the period 2000-2008 with a small urban area concentrated near the shore (west), where the elevation is low, people were more likely to build far from these areas, avoiding areas with high densities of people and areas at risk most affected by slope such as Sambizanga and Maianga from the old administrative division of Luanda (Mendelsohn, Mendelsohn, and Nakanyete 2010) (Figures 14(a) and 15(a-d), having preference for already built residential areas near the main roads for easy access to services.During the period 2008-2018, with the urban growth, inhabitants were forced to shape their priorities within the metropolitan area, having chosen to fill low elevation and high-risk areas (higher slope) rather than already built and populated (negative relationship) areas and near the main roads (Figures 14(b),15(a,b and d)).
For the future metropolitan plan, the proposed transport network was designated as development corridors reserved for new dense and high-quality development (Mobility in chain 2015) and is expected to become the main driver of the urban development since we have confirmed with our  (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008).Const -intercept in the logit function, Coef -coeficient of the intercept (in the case of the first value) and the predictor variables, Std.Err -standard error, P>|s| -p value associated with z-score z, [025 0.975] -confidence interval; Road for the year 2000 (Rd_2000), Residencial area for the year 2000 (Rsd_2000), DSM, Slope and Population Density for the year 2000 (PD_2000).results that government-imposed rules have determined the way the built-up stain looks like today.The result of this study will be beneficial for environmentalists and urban planners, to understand and monitor the urban dynamic since the methods can easily be reproduced with new input data, and to inspire researchers toward machine learning application to GIS/RS.

Limitations of the study
Data availability has limited the methodology of this study to be implemented with previous satellite data, for example, the usage of SAR data for the years 2000 and 2008.There are commercial SAR data providers, but relying on commercial datasets can consume monetary resources, since it does not guarantee that the data will certainly be useful.Future built-up area expansion could not be predicted from the logistic regression because it is believed that future prediction may lead to errors.This is because urban expansion drivers constantly change their predictive power, and some drivers can be removed, or new drivers be introduced over time.An approach that considers previous models, i.e. sequence models such as Recurrent Neural Network (RNN), could be a better approach for future prediction.

Conclusions
The results of this study show that unsupervised classification utilizing spectral indexes can effectively classify LULC if appropriate spectral indices and classifiers are chosen for a specific area.For unsupervised LULC classification, the K-means algorithm is preferred.However, considering a large number of inputs (spectral indexes), PQkmeans, that to the best of found knowledge, was implemented for the first time in RS studies, is an alternative selection because it is memory and computationally more efficient, and gives good accuracy similar to K-means.
The built-up area of Luanda during the period 2000-2018 was mainly driven by the proximity to the already established residential areas and to the main roads, with other drivers playing an important role in urban expansion when changing from a positive to negative influence or vice-versa, as a consequence of newly dense built-up areas and rules imposed by the government.In future research, it would be interesting to see the predictive performance of the emerging deep learning algorithms such as Long Short Term Memory (LSTM) using the same approach of this study.

Figure 1 .
Figure 1.Geographic location of the study area.

Figure 3 .
Figure 3. LULC map 2018 from PQk-means algorithm.(a) using the spectral indexes NDBI, MEI, VIGS.(b) using the spectral indexes NDBI, MEI, VIGS, DBI, and QzCal.In (a, b) sub-figures a and c show the Sand beach misclassified as built-up, b shows the saline area misclassified as built-up, d shows the built-up misclassification in agriculture field.d in Figure (b) shows improved result.
LULC classified maps of 2000, 2008, and 2018 were used to generate the dependent variables for the models.A binary map representing the change in built-up during the period 2000-2008 was used as input for the first model (Figure 4(a)).The same procedure was used for the model corresponding to the period 2008-2018 (Figure 4(b)).

Figure 6 .
Figure 6.LULC map 2018 using the spectral indexes NDBI, MEI, VIGS, DBI, QzCal, and two SAR polarized bands.(a) PQk-means classification.(b) K-means classification.In Figure (a) and, (b), sub-figure a shows the improved classification of built-up area; b shows the improved classification in saline area; c shows the improved classification of built-up in sand beach area; d shows the reduced overestimation of built-up in a rough surface area.

Figure 7 .
Figure 7. LULC map 2018 from SVM algorithm using the Landsat 8 band combination NIR, SWIR1, red (564).(a) underestimated built-up area (b) saline area classified as built-up, (c) improved classification of built-up in a sand beach area, (d) overestimation of built-up in a rough surface area.

Figure 9 (
Figure 9(a and b) represent the results from the supervised classification whose accuracy assessment resulted in 82% and 93% for the images corresponding to the years 2000 and 2008, respectively (Tables2 and 3).Let's clarify that the bareground here referred does not correspond to pixels representing exclusively bareground.With 30 m resolution satellite images, individual pixels covering suburban area will represent a mixture of buildings, vegetation, and bareground(Smith and Hoffmann 2000), and the similarity of the spectral reflectance of soil and nonphotosynthetic vegetation in the Visible-NIR (VNIR) spectral region may underestimate regional scale total vegetation cover(Gill and Phinn 2008).Bareground pixels may contain vegetation but were assigned to the class label with the maximum proportion of the pixel(Chen et al. 2018).The discrimination of bareground and vegetation is probably the drawback of the indexbased unsupervised classification, which is not a big problem for proceeding to the second objective of this study focused on built-up class.

Figure 15 .
Figure 15.(a) The slope in the areas of Maianga and Sambizanga.(b) detail of the corresponding probability map for the year 2008.(c) detail of population density for the year 2008.(d) detail of population density for the year 2018.

Table 1 .
Satellite images used in this study (L = Landsat; S = Sentinel).

Table 2 .
Accuracy assessment and quantification of

Table 3 .
Correlation of the predictor variables for model 1 (2000-2008) after removing predictors with high VIF.Road for the year 2000 (Rd_2000).Residential area for the year 2000 (Rsd_2000), DSM, Slope and Population Density for the year 2000 (PD_2000).

Table 4 .
Correlation of the predictor variables for model 2 (2008-2018) after removing predictors with high VIF.

Table 5 .
Logistic regression results for model 1

Table 6 .
Logistic regression results for model 2 (2008-2018).Const -intercept in the logit function, Coef -coefficient of the intercept (in the case of the first value) and the predictor variables, Std.Err -standard error, P>|s| -p value associated with z-score z, [025 0.975] -confidence interval; Road for the years 2008 and 2018 (Rd_2008_2018), Residential area for the year 2008 (Rsd_2008), DSM, Slope and Population Density for the year 2008 (PD_2008).