Object-Based Image Analysis (OBIA) and Machine Learning (ML) Applied to Tropical Forest Mapping Using Sentinel-2

Abstract The purpose of this research was to distinguish and estimate natural forest areas in Paraná, Brazil. Forest plantations (silviculture) and natural forests have high vegetative vigor throughout the year, as do agricultural areas during crop seasons, which can cause classification errors among these Land Use and Land Cover (LULC) classes. These classes have similar spectral signatures but distinct textures, which can be separated in the supervised classification process by combining object-based and pixel-based classification approaches. Thus, image segmentation through Object-Based Image Analysis (OBIA) combined with Machine Learning (ML) made forest mapping possible over a large territorial extension. The Google Earth Engine (GEE) platform was used to calculate the vegetation indices (VIs) and the Spectral Mixture Analysis (SMA) spectral fractions from Sentinel-2 images, and to create spectrally homogeneous regions for supervised classification based on phytoecological regions and mesoregions. The mappings reached a Kappa Index (KI) of 0.94 and an Overall Accuracy (OA) of 96%, which indicates high performance in large-scale forest mapping. The proposed dataset, source codes and trained models are available on GitHub (https://github.com/Cechim/simepar-brazil/), creating opportunities for further advances in the field.


Introduction
The global forest area is 4.06 billion hectares (ha), which represents 31% of the planet's total land area. More than half of the world's forests (54%) are concentrated in five countries: Russia (20%), Brazil (12%), Canada (9%), the USA (8%), and China (5%) FAO (2020). The mapping of natural forests plays a strategic role in forest management and policies, conservation, and environmental licensing. The estimation of natural forest area using remote sensing depends on the method and on the spatial resolution of the images.
The new series of satellites have sensors with high spatial and temporal resolution, the latter being the capability of revisiting the same area Segarra et al. (2020). Furthermore, the adoption of new technologies and several alternative image providers, such as Planet (nanosatellites), AIRBUS, Digital Globe, and ESA (European Space Agency), reduced the image cost per km². These providers offer several operating satellite alternatives, such as Pleiades A, Pleiades B, SPOT 7, SPOT 6, SPOT 5, WorldView 2, WorldView 3, and Sentinel-2. The Sentinel-2 images are produced from passive optical sensors, and the temporal resolution is generally weekly due to the use of satellite constellations Vrdoljak and Kilić Pamuković (2022).
Sentinel-2A and 2B satellites generate multispectral images (bands 2 to 4 and 8) with a 10-meter spatial resolution and a 5-day temporal resolution, with the advantage that these images have no acquisition cost for the user. Due to their multispectral bands, Sentinel-2 images are used for forest type classification Chen et al. (2018), biomass estimation Duan et al. (2019), urban forest spatial distribution Eskandari et al. (2020), forest removal Pałaś and Zawadzki (2020), mapping of land cover and land use Zeng et al. (2020), and forest and mangrove mapping Cissell et al. (2021).
Therefore, the temporal and spectral resolution of Sentinel-2 images have great potential for projects based on mapping, area estimation, and the identification of changes in LULC. There are also various initiatives and projects involving public and private institutions focused on the development of LULC maps both in Brazil and in Paraná state, the focus of this study.
This large database can be processed for vast territorial extensions using the almost real-time processing of the GEE cloud platform, which offers computing capabilities that can be applied to several high-impact social issues, including deforestation, drought, catastrophes, diseases, food security, water management, climate monitoring, global water surface changes Pekel et al. (2016), and environmental protection Gorelick et al. (2017). In the study of forests, this platform has been used for the analysis of forest cover and forest loss Hansen et al. (2013), estimation of crop harvest Lobell et al. (2015), generation of land cover products (Sousa et al. (2020) and Zeng et al. (2020)), coniferous forest classification Kaplan (2021), biomass estimation of forest plantations Theofanous et al. (2021), mangrove mapping Mondal et al. (2019), forest cover mapping Ganz et al. (2020), forest estimation and detection of forest change Zulfiqar et al. (2021), and analysis of forest species distribution Xie et al. (2021).
The GEE platform offers ML algorithms, and many works indicate their efficiency in different mapping and supervised classification applications. Some examples are shoreline mapping to differentiate substrate types Banks et al. (2017), estimation of terrestrial latent heat flux Wang et al. (2017), mapping of land cover dynamics Huang et al. (2017), land use classification Hird et al. (2017), mapping of wetlands Brovelli et al. (2020), and forest mapping and monitoring Waśniewski et al. (2020).
Therefore, the mapping of natural forests with a new methodological approach employing multispectral images from the Sentinel-2 satellite series and ML in GEE will contribute to the enforcement and oversight of public policies in any process of environmental licensing or authorization for forest suppression. This is especially important to support the monitoring of natural forest resources and of large-scale deforestation. The aim of this work was the development of a methodological approach for the mapping of natural forests.
Thus, the main objective of this work was to map and estimate the area of the natural forest formations of the Atlantic Forest biome in Paraná state, Southern Brazil, using an OBIA process with ML supervised classification of Sentinel-2 satellite images, implemented in the GEE computing environment with spectral fractions and VIs applied to each phytoecological region of the biome.

Organization of research steps
The article was divided into the following steps

Study area
The state of Paraná is located in Southern Brazil, between the parallels 22°29'S and 26°43'S latitude, and between the meridians 48°2'W and 54°38'W longitude (Figure 1). Paraná has an annual average precipitation between 1100 mm and 1920 mm, an average temperature between 15 °C and 24 °C Aparecido et al. (2016), and an annual average evapotranspiration of 700 mm to 1600 mm Caviglione et al. (2000).
The state's climatic characteristics have regional variations, yet the Köppen and Geiger (1928) climate classification system establishes Cfa (warm temperate with hot summer) and Cfb (humid temperate with moderate hot summer) as the predominant classes. However, Cwa (humid temperate with dry winter) and Aw (humid tropical savanna) are also present in the northern portion of the state Aparecido et al. (2016).
As a result of the interaction between biotic (vegetation and animals) and abiotic (climate, rock, topography, and soil) components, the state has a great diversity of landscapes and numerous types of vegetation. The phytoecological regions are spaces defined by typical floristic genera and characteristic biological forms that are recurrent within the same climate, occurring on land of varied lithology, but with defined topography IBGE (2012) (Figure 1).

Data selection
Sentinel-2 satellite images from June to December 2016 were used for natural forest classification. The selection criteria were periods without predominant agricultural crops, such as soybean and corn (summer) and wheat (winter), and low cloudiness at the time of image acquisition (Table 1).

Methodological procedures
The research method was divided into the following steps (Figure 2).

Object-based image analysis (OBIA)
The supervised classification was based on samples acquired from the images using different Sentinel-2 color composites to discriminate forest types, namely natural (native) forest formation from forest plantation (silviculture). For the natural forest class, 1,885 samples distributed throughout the state of Paraná were selected. The classification method was segmentation with OBIA in the eCognition software. Different parameters were tested for the segmentation using the "Multiresolution Segmentation" algorithm, and the base parameters "Scale", "Shape" and "Compactness" were set to 150, 0.1, and 0.5, respectively.
To assist in the classification process, different official reference cartographic bases from the state of Paraná were used as auxiliary information. The Normalized Difference Vegetation Index (NDVI), the Normalized Difference Water Index (NDWI), and the Haralick texture index (GLCM Homogeneity) Haralick et al. (1973), Haralick (1979) and Conners and Harlow (1980) were generated and used as additional bands.
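The additional bands can be illustrated with a minimal sketch. It assumes the usual normalized-difference definitions (NDVI from near-infrared and red; NDWI in the green/near-infrared form, since the text does not state which NDWI variant was used) and the standard GLCM homogeneity formula for a single pixel offset. The reflectance values and grey-level windows are hypothetical, and the function names are illustrative rather than eCognition functions.

```python
from collections import Counter


def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir, red):
    # NDVI = (NIR - Red) / (NIR + Red); for Sentinel-2 these are bands B8 and B4.
    return normalized_difference(nir, red)


def ndwi(green, nir):
    # Assumed green/NIR form: (Green - NIR) / (Green + NIR); bands B3 and B8.
    return normalized_difference(green, nir)


def glcm_homogeneity(levels, dx=1, dy=0):
    """Homogeneity of a symmetric GLCM for one offset: sum p(i, j) / (1 + (i - j)^2)."""
    rows, cols = len(levels), len(levels[0])
    pairs = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = levels[r][c], levels[r2][c2]
                pairs[(i, j)] += 1  # count the pair in both directions
                pairs[(j, i)] += 1  # so that the matrix is symmetric
    total = sum(pairs.values())
    if total == 0:
        return 0.0  # window too small for the chosen offset
    return sum(n / (1 + (i - j) ** 2) for (i, j), n in pairs.items()) / total


# Hypothetical surface reflectances for one vegetated pixel.
print(round(ndvi(nir=0.45, red=0.05), 3))
print(round(ndwi(green=0.08, nir=0.45), 3))

# Hypothetical 3 x 3 windows of quantized grey levels.
smooth_window = [[3, 3, 3], [3, 3, 3], [3, 3, 3]]
striped_window = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(glcm_homogeneity(smooth_window), glcm_homogeneity(striped_window))
```

A uniform window yields the maximum homogeneity, while a window with alternating grey levels yields a lower value, which is the kind of textural contrast the text relies on to separate classes with similar spectral signatures.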
The classification process is iterative. Thus, some land use classes were segmented with different parameterizations and reclassified more than once until a better result was achieved regarding performance and the desired classification. Segmentation quality was considered through both a qualitative assessment based on visual interpretation and a quantitative assessment using reference data Costa et al. (2018) and measures that report on the overall accuracy of segmentation Clinton et al. (2010). We used the Nearest Neighbor (NN) classification algorithm from eCognition Developer, a supervised classification method derived from statistical learning theory Cover and Hart (1967). The result was submitted to class editing, a post-classification process to correct possible inconsistencies in the classification, which are mainly related to wrongly labeled classes.

Topological analysis
After the OBIA classification, topological analyses were applied to all classes using ArcGIS 10.8.1. As topological tools, "Intersect" was used to identify overlapping areas or edges between classes, and "Symmetrical difference" and "Erase" to identify areas not classified by the NN algorithm. The "Dissolve" function was applied to the Forest class, and the "Check geometry" and "Repair geometry" functions were applied to correct inconsistencies in the geometries or the attribute table. After these topological analyses, the "Explode" tool was applied to break the forest class into individual polygons, which were then cut and separated by mesoregion and imported into GEE.

Supervised classification in GEE
The OBIA supervised classification was imported into GEE as an Asset. In the ML supervised classification process, the MapBiomas stable samples for the year 2016 were used as reference; these are training and calibration samples extracted from classes that did not change their values during all years of collection 6.0 (from 1985 to 2020). The digital classification was performed for each phytoecological region contained in each mesoregion using the Random Forest (RF) algorithm available in GEE, running 70 iterations Breiman (2001). This classifier is less sensitive to the quality of training samples and to overfitting due to the large number of decision trees produced by the random selection of a training sample subset Belgiu and Drăguţ (2016).
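The idea behind RF (many trees, each trained on a random subset of the samples and of the variables, combined by majority vote) can be sketched with a deliberately simplified toy. This is not the GEE classifier: each "tree" here is a single-split stump, the two features and all sample values are hypothetical, and only the count of 70 trees is taken from the text.

```python
import random


def fit_stump(samples, feature):
    """Pick the threshold on one feature that minimizes training errors.

    The stump predicts class 1 when the feature value exceeds the threshold.
    """
    values = sorted({x[feature] for x, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = min(
        candidates,
        key=lambda t: sum((x[feature] > t) != bool(y) for x, y in samples),
    )
    return feature, best


def fit_forest(samples, n_trees=70, seed=0):
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw a training subset with replacement.
        bootstrap = [rng.choice(samples) for _ in samples]
        # Random variable selection: one variable per split.
        feature = rng.randrange(n_features)
        forest.append(fit_stump(bootstrap, feature))
    return forest


def predict(forest, x):
    """Majority vote over all stumps; returns 0 or 1."""
    votes = sum(x[feature] > threshold for feature, threshold in forest)
    return int(votes * 2 > len(forest))


# Hypothetical training set: two features per sample, class 0 = non-forest,
# class 1 = natural forest. The value ranges are invented for illustration.
data_rng = random.Random(42)
training = [((data_rng.uniform(0.1, 0.3), data_rng.uniform(0.1, 0.3)), 0) for _ in range(20)]
training += [((data_rng.uniform(0.7, 0.9), data_rng.uniform(0.7, 0.9)), 1) for _ in range(20)]

forest = fit_forest(training, n_trees=70)
print(predict(forest, (0.80, 0.85)), predict(forest, (0.20, 0.15)))
```

Because every stump sees a different bootstrap sample, a few poorly placed thresholds are outvoted by the rest, which is the property the paragraph above attributes to the large number of trees.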
A total of 23 variables, comprising Sentinel-2 visible and infrared spectral bands at 10 m spatial resolution, Spectral Mixture Analysis (SMA) spectral fractions and Vegetation Indices (VIs), were used in the classification process (Table 2). The SMA was generated from the calculation of the Green Vegetation (GV), Non-Photosynthetic Vegetation (NPV), Soil and Shade fractions implemented in GEE. SMA is a physically based form of image processing that aids in the repeated and accurate derivation of quantitative subpixel information (Smith et al. 1990). Some studies have already used SMA for the estimation and mapping of agricultural crop residues (Bannari et al. (2006); Pacheco and McNairn (2010)).
SMA works under the assumption that a spectrum recorded by a sensor is a linear combination of the spectra of all components within the pixel, weighted by the spectral proportions of the endmembers, which reflect the proportions of the area occupied by defined features on the Earth's surface Adams et al. (1995); Lu et al. (2004).
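The linear mixing assumption can be written compactly as follows. The notation is introduced here for illustration only: $R_b$ is the pixel reflectance in band $b$, $E_{i,b}$ the reflectance of endmember $i$ in that band, $f_i$ its fraction, and $\varepsilon_b$ the residual. The sum-to-one and non-negativity constraints are the usual ones for constrained unmixing and are an assumption, since the text does not state which constraints were applied.

```latex
R_b = \sum_{i=1}^{n} f_i \, E_{i,b} + \varepsilon_b ,
\qquad \sum_{i=1}^{n} f_i = 1 ,
\qquad 0 \le f_i \le 1 .
```

With the four fractions named above, $n = 4$ and $i \in \{\mathrm{GV}, \mathrm{NPV}, \mathrm{Soil}, \mathrm{Shade}\}$.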
The VIs were calculated in GEE based on the median, following Ganz et al. (2020) (Table 2).
To improve accuracy, a new RF classification was performed with the following parameters. The number of trees in the random forest classifier was varied from 50 to 100, and the number of variables per split was varied starting from 1. The classifier with 70 trees was adopted for the Sentinel-2 data resolution. The sample set consisted of the stable samples from the MapBiomas project for the year 2016 Souza et al. (2020).
After the supervised classification process, a spatial filter based on the "connectedPixelCount" function was applied to avoid unwanted changes at the edges of pixel groups. This function, available in the GEE platform, locates connected components (neighbors) that share the same pixel value. Thus, only pixels that do not share connections with a predefined number of identical neighbors are considered isolated. In this filter, at least six connected pixels are required to reach the minimum connection value. Consequently, the minimum mapping unit is directly affected by the applied spatial filter, and it was defined as 6 pixels (the equivalent of approximately 0.5 ha) Souza et al. (2020).
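The logic of this filter can be sketched outside GEE as a plain connected-component count. The sketch below is an illustration under stated assumptions, not the GEE implementation: it uses 8-neighbour connectivity, a hypothetical 4 × 5 classified grid, and helper names invented here. It also converts a pixel count to hectares, since the area equivalent of a six-pixel unit depends on the pixel size at which the filter is run.

```python
from collections import deque


def connected_pixel_count(grid, eight_connected=True):
    """Return a grid holding, for each pixel, the size of its same-valued patch."""
    rows, cols = len(grid), len(grid[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if eight_connected:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    counts = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if counts[r][c]:
                continue  # already assigned to an earlier patch
            patch, queue = [(r, c)], deque([(r, c)])
            counts[r][c] = -1  # temporary "visited" marker
            while queue:
                y, x = queue.popleft()
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and counts[ny][nx] == 0
                            and grid[ny][nx] == grid[y][x]):
                        counts[ny][nx] = -1
                        patch.append((ny, nx))
                        queue.append((ny, nx))
            for y, x in patch:
                counts[y][x] = len(patch)
    return counts


def isolated_mask(grid, min_pixels=6):
    """Flag pixels whose patch has fewer than min_pixels connected pixels."""
    return [[n < min_pixels for n in row] for row in connected_pixel_count(grid)]


def patch_area_ha(n_pixels, pixel_size_m):
    """Area in hectares of n_pixels square pixels of the given side length."""
    return n_pixels * pixel_size_m ** 2 / 10_000


# Hypothetical classified grid: 1 = forest, 0 = non-forest.
classified = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
counts = connected_pixel_count(classified)
mask = isolated_mask(classified, min_pixels=6)
print(counts[0][0], counts[2][4], mask[2][4])
print(patch_area_ha(6, 10), patch_area_ha(6, 30))
```

In this example the six-pixel forest patch is kept and the single forest pixel is flagged as isolated. The last line shows that six pixels cover 0.06 ha at a 10 m pixel size and 0.54 ha at a 30 m pixel size.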

Validation methodology
The methodology for validating the natural forest classification by ML was divided into two steps: grid generation within the study area, and the generation of point samples within each grid. The number of grids sampled was defined according to the methodology described in the Technical Specification Standard for Quality Control of Geospatial Data (ET-CQDG) of DSG (2015), which adopts the sampling plans described in ISO (International Organization for Standardization) standards.
The thematic accuracy validation was based on the number of points generated from a regular grid (sampling is uniform, non-proportional, and non-random). The grid definition is based on the scale of the generated product, which, in this case, has a spatial resolution of 10 m, so the scale would be 1:60,000. Spatial sampling is done by partitioning the area into cells of 4 × 4 cm at the scale of the product to be evaluated, using integer values in the form of a grid, while the number of cells depends on the scale and size of the study area (Table 3).
Following the scale definition, the grid and the number of points to be used for validation were generated with the "Fishnet" tool from ArcGIS, while the population point set was generated from the grid cell centroids. The resulting population set was 34,708 points for the entire state of Paraná, i.e., one reference point every 2.4 km. From the attribute table, the classes corresponding to each of the points were defined based on the Sentinel-2 images with a spatial resolution of 10 m, and Planet images of 5 m. In this way, all points in the population set were visually identified via satellite image interpretation, thus defining the standardized reference matrix for two classes: Natural forest and Non-forest.
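The relation between the 4 × 4 cm cell, the 1:60,000 scale and the 2.4 km point spacing is simple arithmetic, sketched below. The helper name is illustrative; the only inputs are the figures quoted in the text.

```python
def ground_distance_km(map_distance_cm, scale_denominator):
    """Convert a distance on the map (cm) to a ground distance (km)."""
    return map_distance_cm * scale_denominator / 100_000  # 100,000 cm per km


spacing_km = ground_distance_km(4, 60_000)  # side of one grid cell on the ground
cell_area_km2 = spacing_km ** 2             # ground area represented by one cell
covered_km2 = 34_708 * cell_area_km2        # area represented by all grid points

print(spacing_km, round(cell_area_km2, 2), round(covered_km2))
```

A 4 cm cell at 1:60,000 corresponds to 2.4 km on the ground, so each centroid represents about 5.76 km², and the 34,708 points together represent roughly 199,900 km².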

Accuracy analysis and area estimation
The classifier performance was evaluated using metrics such as Kappa Index (KI), Overall Accuracy (OA), Inclusion Errors (IE), and Omission Errors (OE) (Congalton (1991), Congalton and Green (2019)) and Global Disagreement (Allocation and Quantity components) Pontius and Millones (2011). The forest area was estimated by counting the pixels contained in each municipality using the QGIS "Zonal Statistics" tool Sherman et al. (2011). The area estimated from the Sentinel-2 mapping was then compared with the area obtained from the 2016 MapBiomas mapping and with the area from the IAT (Institute of Water and Earth of Paraná) mapping generated with WorldView images from 2012 to 2016.
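These metrics can all be derived from a single confusion matrix, as in the sketch below. It assumes rows hold the mapped class and columns the reference class, treats inclusion error as commission error, and uses sample counts directly as proportions. The two-class matrix is hypothetical and is not taken from the study's tables.

```python
def accuracy_metrics(matrix):
    """matrix[i][j] = number of samples mapped as class i with reference class j."""
    n = sum(map(sum, matrix))
    k = len(matrix)
    p = [[v / n for v in row] for row in matrix]
    mapped = [sum(p[i]) for i in range(k)]                          # row totals
    reference = [sum(p[i][j] for i in range(k)) for j in range(k)]  # column totals
    oa = sum(p[i][i] for i in range(k))
    chance = sum(mapped[i] * reference[i] for i in range(k))
    return {
        "OA": oa,
        "KI": (oa - chance) / (1 - chance),
        # Inclusion (commission) error: mapped as class i but not class i in reference.
        "IE": [1 - p[i][i] / mapped[i] for i in range(k)],
        # Omission error: class i in reference but mapped as something else.
        "OE": [1 - p[i][i] / reference[i] for i in range(k)],
        # Quantity and allocation disagreement (Pontius and Millones 2011).
        "quantity": sum(abs(mapped[i] - reference[i]) for i in range(k)) / 2,
        "allocation": sum(
            2 * min(mapped[i] - p[i][i], reference[i] - p[i][i]) for i in range(k)
        ) / 2,
    }


# Hypothetical matrix: class 0 = Natural forest, class 1 = Non-forest.
example = [
    [45, 5],
    [3, 47],
]
metrics = accuracy_metrics(example)
for name, value in metrics.items():
    print(name, value)
```

Quantity and allocation disagreement always sum to the total disagreement, 1 − OA, which is why they complement the KI as a description of where the error comes from.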
The Albers Equal Area projection and the SIRGAS2000 datum were the standard used to compute the areas. The normality of the estimations was tested using the Shapiro-Wilk and Anderson-Darling tests, and Spearman's correlation coefficient (rs) was used to verify the data dispersion when comparing the estimated municipal areas with those obtained from the other mappings (IAT and MapBiomas). The Refined Index of Agreement (dr) of Willmott et al. (2012) (Equation (1)), which measures the agreement of the estimated values with the 1:1 line, and the Mean Error (ME) (Equation (2)), which measures the mean of the errors, were used as statistical indicators.
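As a hedged illustration of these two indicators, the sketch below uses the commonly cited form of Willmott's refined index (with scaling constant c = 2) and takes the Mean Error as the average of estimated minus reference values. Both are assumptions about the form of Equations (1) and (2), and the municipal areas are hypothetical.

```python
def mean_error(estimated, reference):
    """Mean of the signed errors (estimated minus reference)."""
    return sum(e - r for e, r in zip(estimated, reference)) / len(reference)


def refined_index_of_agreement(estimated, reference, c=2.0):
    """Willmott's refined index of agreement, bounded between -1 and 1."""
    mean_ref = sum(reference) / len(reference)
    abs_error = sum(abs(e - r) for e, r in zip(estimated, reference))
    if abs_error == 0:
        return 1.0  # perfect agreement
    scaled_dev = c * sum(abs(r - mean_ref) for r in reference)
    if abs_error <= scaled_dev:
        return 1 - abs_error / scaled_dev
    return scaled_dev / abs_error - 1


# Hypothetical forest areas (ha) for four municipalities.
reference_ha = [100.0, 200.0, 300.0, 400.0]
estimated_ha = [110.0, 190.0, 320.0, 400.0]

print(mean_error(estimated_ha, reference_ha))
print(refined_index_of_agreement(estimated_ha, reference_ha))
```

Values of dr close to 1 indicate that the summed absolute errors are small relative to the spread of the reference values around their mean.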

Mapping of LULC
For OBIA mapping, samples for 9 thematic classes were selected for the NN supervised classification of the Sentinel-2 images from 2016. The forest cover classification for the state of Paraná was produced with the supervised OBIA classification and topological correction and then imported into GEE. For the natural forest class, a KI of 0.87 and an OA of 91.05% were obtained (Figure 3).

Supervised classification in GEE
The stable samples obtained from the Savanna phytoecological region through the MapBiomas project, year 2016 Souza et al. (2020), were used to test the performance of different ML supervised classification algorithms on GEE. The Random Forest (RF) and Gradient Tree Boost (GTB) classifiers showed better and similar performance, with better definition and smoothing between the mapping classes (Table 4).
The Classification and Regression Trees (CART) classifier showed greater spectral confusion among the land use and land cover classes. The accuracy evaluation was done by means of a confusion matrix in GEE according to Stehman (1997), with 70% of the data sampled for training and 30% for validation testing. The highest spectral confusions were between the Natural forest formation and Silviculture (Planted forest) classes, and between the Pasture and Agriculture classes. Among the classifiers used, RF obtained the best accuracy in LULC classification. No spatial filter was evaluated at this stage; only the performance of the classification algorithms Support Vector Machine (SVM), RF, GTB and CART was compared, using the same training sample.
After defining the best ML algorithm (RF), all LULC classes were imported into GEE as an Asset, and a new classification was made with a new sample set for training and validation. This improved the thematic accuracy, which can be verified by the Correctly Classified Pixels (CCP%) and OA% according to the matrix adapted from Richards (1993) (Table 5).
The Accuracy score metric indicates how close the predicted values are to the true values. The Macro average is the unweighted mean of the per-label scores, and the Weighted average is the support-weighted mean of the per-label scores (Table 6).
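The difference between the macro and weighted averages can be shown with a small sketch. It assumes precision and recall as the per-label scores, rows as mapped classes and columns as reference classes, and support as the number of reference samples per label. The matrix is hypothetical and does not reproduce Table 6.

```python
def per_label_scores(matrix):
    """Return (precision, recall, support) lists for a square confusion matrix.

    matrix[i][j] = number of samples mapped as label i with reference label j.
    """
    k = len(matrix)
    mapped = [sum(matrix[i]) for i in range(k)]
    support = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    precision = [matrix[i][i] / mapped[i] for i in range(k)]
    recall = [matrix[i][i] / support[i] for i in range(k)]
    return precision, recall, support


def macro_average(scores):
    """Unweighted mean: every label counts the same."""
    return sum(scores) / len(scores)


def weighted_average(scores, support):
    """Mean weighted by the number of reference samples of each label."""
    return sum(s * w for s, w in zip(scores, support)) / sum(support)


# Hypothetical matrix with unbalanced labels (100 vs 50 reference samples).
example = [
    [80, 10],
    [20, 40],
]
precision, recall, support = per_label_scores(example)
accuracy = sum(example[i][i] for i in range(2)) / sum(support)

print(accuracy)
print(macro_average(precision), weighted_average(precision, support))
print(macro_average(recall), weighted_average(recall, support))
```

When labels are unbalanced, the weighted average is pulled toward the score of the label with more reference samples, while the macro average treats both labels equally.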

Mapping of natural forest formation
It is important to highlight that, in this mapping, the forests' spatial distribution is evident, especially in the South, Southeast, and Metropolitan mesoregions.This map was generated from the initial mapping done by OBIA with supervised classification (NN) and refined by a reclassification process in GEE by ML with RF (Figure 4).

Spatial accuracy assessment
According to the classification proposed by Landis and Koch (1977) for the KI value, the spatial accuracy analysis of the classifications showed excellent thematic quality, with KI values between 0.80 and 1.00 (Table 7). The lowest values, in turn, were obtained for the Mixed Ombrophilous Forest, which can be explained by the large number of reforestation areas (Table 7).
Similar results were found in the mapping of Souza et al. (2020) on GEE with Landsat images of the Atlantic Forest biome in 2016 using RF for the MapBiomas project. It resulted in an OA of 91.4%, an Allocation disagreement of 6.5%, and a Quantity disagreement of 2.9%. The Quantity component refers to the classification of incorrect proportions of pixels in the classes, and the Allocation component refers to the incorrect spatial distribution of pixels among the classes.
Similar results were also obtained by Waśniewski et al. (2020), with accuracies of 92.6% and 98.5% using RF and Sentinel-2 images for forest mapping in Northwestern Gabon, and by Niculescu et al. (2018), who obtained an OA of 93% while monitoring vegetation with Sentinel-2 and SPOT-6 data in France in 2017.
The values for IE and OE obtained in this study were 6.4% and 5.4%, respectively, for the entire state of Paraná. These results were lower than the OE and IE obtained by Souza et al. (2020), 14% and 6.2% respectively, in the mapping of the forest formation class for the MapBiomas project in 2016. This improvement in the mapping performance with Sentinel-2 images can be explained by the use of the Paraná state mesoregions (10 homogeneous regions). In addition, the new (refined) classification process performed by RF implemented in GEE was done by forest type and considering the phytoecological regions.
Because the KI computes accuracy relative to chance agreement, the analysis of the Global Disagreement (GD), with its Allocation and Quantity components, arises as an alternative. These components provide additional information that helps explain the error in the mappings. The contribution of the allocation component (2.0%) and the proportion of the quantity component (0.1%) indicate the effectiveness of this new approach developed for large-scale forest mapping associating OBIA and ML (Figure 5).
The determination of GD and of the intensity of omission and commission uses the reference sample set of the classes under study, compared with the population set of pixels of the generated mapping. In the global disagreement graph formed by the percentages of the allocation and quantity components, the contribution of the allocation component to the total disagreement was greater than that of the quantity component, which implies an incorrect spatial distribution of pixels among classes (Figure 6).

Estimated forest area
The estimation indicated that the forest area obtained by RF mapping with Sentinel-2 images was underestimated, with a difference of 213,164 ha (3.4%) when compared to the mapping from MapBiomas collection 5 for 2016. In contrast, when compared with the IAT mapping done with WorldView images from 2012 to 2016, the area was overestimated, which can be explained by the spectral mixing of some areas contained in the 10 m pixels of Sentinel-2 images. The area estimations were calculated considering the original spatial resolution of each mapping technique (Table 8). As the estimated forest area data do not follow a normal probability distribution, Spearman's correlation coefficient (rs) was used. A value of rs = 0.99 was obtained in the comparisons with the MapBiomas/Landsat and IAT/WorldView-2 mappings, indicating a strong correlation with the data obtained in the mapping done with Sentinel-2 images using the methodology associating OBIA and RF (Figure 7).
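Spearman's coefficient is the Pearson correlation computed on ranks, which is why it suits data that are not normally distributed. A minimal sketch follows, with tied values sharing their mean rank. The municipal areas below are hypothetical and do not come from Table 8.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rs: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical forest areas (ha) per municipality from two mappings.
mapping_a = [120.0, 340.0, 80.0, 560.0, 410.0]
mapping_b = [130.0, 300.0, 95.0, 380.0, 600.0]

print(spearman(mapping_a, mapping_b))
```

Only the ordering of the municipalities matters: the two largest areas swap places between the hypothetical mappings, which lowers rs slightly below 1 regardless of how large the differences in hectares are.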
The comparison by statistical metrics indicated that the estimated area had a Mean Error (ME) of 534.25 ha in relation to the MapBiomas mapping, and an ME of 576.03 ha when compared to the IAT mapping. This dissimilarity can be explained by the difference in spatial resolution of the sensors on the different satellites (Figure 5). Willmott's refined index of agreement (dr) measured the agreement between the area estimated with the Sentinel-2 mapping and the areas from the MapBiomas mapping and the IAT mapping using WorldView images. Values of 0.94 (MapBiomas) and 0.93 (IAT) were found, indicating an optimal performance, i.e., high agreement among the area estimations.

Conclusions
The results indicate the potential of applying OBIA and ML techniques for supervised classification of natural forests using VIs and spectral fractions from Sentinel-2 images. Moreover, it was possible to map and estimate the forest area for large territorial extensions such as the entire state of Paraná.
The division of the state by phytoecological regions and mesoregions implemented in GEE enabled a better spectral homogeneity of the regions for mapping. This facilitated the selection of images from days without cloud cover, and from periods with agricultural crops at low vegetative vigor, which allowed for a significant improvement in mapping. As a result, there was a decrease in spectral confusion among the LULC classes.
The use of vegetation indices and texture improved the performance and accuracy of the classifier.Furthermore, the use of OBIA facilitated the post-classification editing and enabled the reduction of spectral confusion between classes, and consequently increased the thematic spatial accuracy.
The high performance of this approach demonstrated its methodological efficiency, based on the analysis of the IE and OE, the high OA, the area estimation, and statistical indicators such as Spearman's correlation coefficient (rs), the Mean Error, and the refined index of agreement, which showed excellent performance in comparison with mappings from other sensors.
Therefore, the methodology can be applied in projects involving forest mapping in large territorial extensions that require high thematic precision and area estimation.This makes possible the monitoring and generation of reliable area estimations for subsequent years.

Figure 5. Errors of commission and omission by domain, and allocation and quantity components.

Figure 6. Intensity of omission and commission errors and global disagreement.

Table 1. Characteristics of Sentinel-2 MSI bands used for OBIA.
Figure 2. Schematic overview of natural forest mapping methodological procedures.

Table 3. Satellite scale compatibility from ET-CQDG spatial sampling from DSG 2015.

Table 4. Spatial accuracy, KI and OA (%) of the classifiers tested by ML in GEE.

Table 5. Confusion matrix among the classes of LULC.

Table 6. Accuracy assessment criteria for the natural forest class.

Table 7. Evaluation statistics of supervised classification by Phytoecological Region.

Table 8. Estimation of natural forest area by different satellites.