Development of an algorithm for identification of sown biodiverse pastures in Portugal

ABSTRACT  Sown biodiverse pastures (SBP) are a pasture system developed in Portugal. Until 2014, farmers were supported in installing and maintaining SBP, but the locations of these pastures have not been systematically tracked. To survey the country, remote sensing tools with machine learning were used. Here, we developed the first algorithm that combines remote sensing data with machine learning algorithms to identify SBP areas. The algorithm combines Landsat-7 and night-light spectral data with terrain and bioclimatic data. Remotely sensed data offer higher spatial resolution than bioclimatic data and also capture interannual variability. Gradient-boosted decision trees (XGB) and artificial neural networks (ANN) were the machine learning methods used. The overall classification accuracy, on an independent validation dataset, was 94%, with 82% producer accuracy and 85% user accuracy. The total estimated area of SBP in the Portuguese region of Alentejo was 1300 km2 in 2013, which is similar to the total known installed area (approximately 1000 km2). The estimated spatial distribution is in accordance with the known distribution at the municipal level. These results are a critical first step towards the future development of remote systems for assessing the state of SBP and for compliance checks of farmer commitments.


Introduction
Land use and land cover, which are largely influenced by human actions, constitute the primary drivers of environmental changes such as alterations in the carbon cycle, energy balance, and biodiversity (Hosonuma et al., 2012; Phiri et al., 2020). Consequently, it is essential to undertake land cover mapping to preserve and improve the quality of relevant ecosystems (Burkhard et al., 2009; De Araujo Barbosa et al., 2015). With the advancement of remote sensing (RS) technologies and data, including missions such as Landsat (United States Geological Survey (USGS), 2023) and Sentinel (ESA, 2023), the expense of repeated Earth surface data acquisition has significantly decreased, facilitating the creation of time series based on remotely acquired data. As a result, land cover mapping with high spatial and temporal coverage has become feasible (Hansen et al., 2000; Song et al., 2018; Venter & Sydenham, 2021).
RS data have been utilized to extract information regarding land cover types in agroforestry ecosystems located in Mediterranean regions, which are diverse ecosystems comprising trees, shrubs, pastures, and bare soil (Allen et al., 2018; Pereira et al., 2009). The models available for using RS data may have different objectives, such as identifying pastures or grasslands, cork oak woodlands, and other agricultural fields (Allen et al., 2018; De Luca et al., 2019; Ribeiro et al., 2019; Suess et al., 2018). Machine learning (ML) is an automated method used to explore hypotheses that explain data, and it is increasingly used to identify and describe land. The methods and data used are dependent on the purpose of the study and can have a significant impact on accuracy. Other critical factors include the sensor used (e.g. satellite or unmanned aerial vehicle), class nomenclature (e.g. aggregated or individual classes), models (e.g. decision trees or support vector machines), spatial aggregation (e.g. pixel-based or object-based classification), and even the training/testing data sets. In combination, RS data and ML methods provide opportunities to further understand and describe understudied land systems (T.G. Morais, Teixeira, et al., 2021; T. Morais et al., 2022).
Sown Biodiverse Pastures (SBP) are a type of pasture system developed in Portugal and are predominantly found in the agroforestry Montado/dehesa ecosystem, which is crucial for preserving biodiversity (Moreno et al., 2021; Pereira et al., 2009). Presently, over 90% of Portuguese SBPs are situated in the NUTS II (Nomenclature of Territorial Units for Statistics, level 2) region of Alentejo, Portugal (Teixeira et al., 2015). This pasture system was gradually developed and introduced, starting in the 1960s, to address the typically low pasture yield in Mediterranean conditions (Smit et al., 2008). SBPs have high yields that aid in decreasing the environmental impact of grazing systems, mainly through carbon sequestration and soil protection (Morais, Ricardo, & Domingos, 2018; Morais et al., 2022). Significant research has been carried out to enhance knowledge of this system, including insights on nutrient cycling, land use, biodiversity, and estimation of production characteristics.
CONTACT Tiago G. Morais, tiago.g.morais@tecnico.ulisboa.pt, MARETEC − Marine, Environment and Technology Centre, LARSyS, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1, Lisbon 1049-001, Portugal
Due to their potential for sequestering soil carbon, between 2009 and 2014, the Portuguese government provided support for the establishment and maintenance of SBP through the Terraprima - Portuguese Carbon Fund (PCF) projects to meet the goals of the Kyoto Protocol, particularly the Agriculture, Forestry, and Other Land Uses activities of Article 3.4 (APA, 2018). This involved payments for carbon sequestration services, and during this period, SBP is estimated to have sequestered 1.5 million tons of CO2 as soil carbon (Teixeira et al., 2014). Over 1000 farmers adopted SBP during this scheme, resulting in SBP covering over 4% of the country's agricultural land in 2014. Spatial data on the installed SBP area was collected during the Terraprima - PCF projects, but there is no information on the maintained and newly installed area since 2013, making it necessary to develop an efficient procedure or model for identifying SBP areas and continuing to monitor this important pasture system. One option for consistently surveying the entire country and assessing where SBP are located would be through the use of RS data. This would enable a one-time identification of SBP areas in a time-efficient manner, without resorting to field surveys or farmer questionnaires, which are expensive and time-consuming. RS data has already been used for land cover/use mapping at various scales and for various purposes (e.g. Tassi et al., 2021; Venter & Sydenham, 2021). However, there is still potential for further exploration of RS in the context of the SBP system. While RS data has been utilized to identify SBPs and optimize management practices, these applications have been limited to small scales. Currently, the only available model for identifying SBPs relies on UAV data (Vilar et al., 2020), which is not feasible for broader scale implementation, such as the entire Alentejo region, due to the high costs involved. Hence, a model to recognize SBP systems that can be applied extensively is still absent from the literature.
In this study, we used data from the PCF project on known areas of SBP installed between 2009 and 2012 to develop the first algorithm for identifying SBP using remotely sensed data from satellite sources. We utilized Landsat-7 (L7) RS spectral data as it was the only satellite mission available from 2009 to 2014 and is currently still in operation. Individual bands were employed along with the normalized difference vegetation index (NDVI), while terrain, climate, and night-light (i.e. artificial illumination during night-time hours) data were utilized as covariates. We obtained all RS and climate data from Google Earth Engine (GEE - Gorelick et al., 2017) to reduce processing time and storage space. Two ML models were tested: 1) gradient boosted decision trees (XGB) and 2) artificial neural networks (ANN). A cross-validation approach was employed to improve the generalizability of the models. To obtain an appropriate measure of the estimation error we used an independent test set. The total estimated area of SBP in the Alentejo region was validated against the total known installed area up until 2013.

Study area
We defined the scope of our study area to include the entire Alentejo NUTS II region in Portugal, located in the center-south of the country. This region covers approximately 32,000 km2 and is home to around 705,000 inhabitants according to the 2021 census. The climate in the region is classified as Mediterranean according to the Köppen climate classification system (Rubel & Kottek, 2010), with hot and dry summers, limited precipitation, and cold winters with heavy rainfall. The annual mean precipitation in the region is 552 mm and the annual mean temperature is 18°C. The Montado ecosystem is the typical landscape in Alentejo, characterized by multifunctional holm oak (Quercus rotundifolia Lam.) and cork (Quercus suber L.) forests coexisting with crop and livestock production (Pimenta, 2014). About 53% of the total utilized agricultural area (UAA) in Portugal is located in Alentejo, although the largest number of agricultural farms is in the North and Center regions of Portugal (INE, 2015).

Land cover data
During the payment for environmental services PCF project, Terraprima (https://www.terraprima.pt/en), a Portuguese company, was responsible for paying the farmers to install and maintain SBP. Here, spatially explicit data for the maintained and installed areas of SBP in the Alentejo region were obtained from Terraprima through personal communication and used to train the models to identify SBP systems. In total, approximately 51,000 hectares were installed between 2009 and 2012, with 3,891 polygons corresponding to unique agricultural parcels where SBP were installed. The majority of the installed SBP were rainfed, but about 10% of the installed SBP parcels were irrigated, which influences the temporal dynamic of the pastures. The year with the highest SBP installation was 2010, with approximately 15,000 hectares installed (1027 polygons), while the year with the lowest SBP adoption was 2011, with only 9,000 hectares installed (792 polygons). Approximately 13,000 hectares (1007 polygons) were installed in the first year of the project (2009), and approximately 14,000 hectares (930 polygons) were installed in the last year of the project (2012). Data for SBP areas installed before 2009 were obtained through personal communication in Teixeira (2010) from the main supplier of SBP mixtures in Portugal, but they only include information at the municipality level and not at the agricultural parcel level. Therefore, these data were only used to validate the obtained models. The total installed area until 2009 was approximately 40,000 hectares in the Alentejo region.
The Portuguese Institute for Agricultural and Fisheries Financing (IFAP) is responsible for publishing a land parcel identification system that covers the entire country. Each homogeneous portion of land is classified according to its land cover, and there are 38 unique land cover classes considered. For this study, we used the land parcel data for the year 2013 in order to include all installed SBP areas during the PCF project. In the total area of the Alentejo region, there are 691,466 unique polygons. Data was obtained from IFAP's WMS server (IFAP, 2021). IFAP parcels were the basic unit of analysis, i.e. for each parcel the dependent variable used in the study was binary, with a value of 1 if the parcel was an SBP in 2013 and 0 otherwise. We assumed that all area installed between 2009 and 2012 was still SBP in 2013 because the PCF project farmers had the obligation of maintaining the SBP in good agricultural conditions and optimum management either until 2012 (for pastures sown in 2009 and 2010) or 2014 (for pastures sown in 2011 and 2012). During the PCF project, farmers were visited by field technicians who ensured that the SBP were properly managed. Here, we aim to identify all parcels that adopted or installed sown biodiverse pasture during the period from 2009 to the end of 2012. Farmers who installed pastures during this period committed to maintaining the system in good condition until 2013, and the pastures were checked in the field to ensure that they existed and were well managed.

Remote sensing and covariates data
We utilized surface reflectance data from Landsat 7 Enhanced Thematic Mapper+ (L7) for our study. L7 data has been available since April 1999 and has a 16-day repeat cycle (USGS, 2023). The images from L7 include eight spectral bands, consisting of four visible and near-infrared (VNIR) bands, two short-wave infrared (SWIR) bands processed to orthorectified surface reflectance, and one thermal infrared (TIR) band processed to orthorectified surface temperature. These products are presented as 8-bit images with 256 grey levels and a resolution of 30 m. For band 6, the GEE platform provides a merged band from Band 6 High (6H) and 6 Low (6L). We used a total of seven unique bands (B1-B7) in our analysis. Additionally, we computed the normalized difference vegetation index (NDVI) (Tucker, 1979).
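As an illustration of the index mentioned above, the following minimal sketch computes NDVI for a single pixel as (NIR - Red) / (NIR + Red). It assumes that L7 band 4 is the near-infrared band and band 3 is the red band; the function name and the reflectance values are hypothetical and are not taken from the study.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index for one pixel.

    nir: near-infrared reflectance (assumed here to be L7 band 4)
    red: red reflectance (assumed here to be L7 band 3)
    """
    denominator = nir + red
    if denominator == 0:
        # NDVI is undefined when both reflectances are zero.
        return float("nan")
    return (nir - red) / denominator


# Hypothetical reflectance values for a densely vegetated pixel.
print(ndvi(nir=0.40, red=0.10))
```

Values close to 1 indicate dense green vegetation, while values near or below 0 are typical of bare soil, water, or built surfaces.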
On 31 May 2003, the scan line corrector (SLC) of L7 failed, resulting in a zigzag pattern along the satellite image and a duplicated area. However, this duplicated area was corrected or removed during Level-1 processing, and about 78% of the pixels can be normally used. To minimize the loss of usable data due to the remaining 22% of the area, we created a composite image that is the median of all available images between 1 January and 30 June 2013, which includes 21 scenes.
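The compositing idea can be sketched as a per-pixel median over the available scenes, skipping pixels that are missing in a given scene (for example because of the SLC gaps). This is a simplified illustration in plain Python, not the GEE workflow used in the study; the data layout (a list of scenes, each a flat list of pixel values with `None` for missing data) is an assumption made for the example.

```python
from statistics import median


def median_composite(stack):
    """Per-pixel median over a stack of scenes.

    stack: list of scenes; each scene is a list of pixel values of equal
    length, with None marking a missing or masked observation.
    Returns one value per pixel, or None where no scene has valid data.
    """
    n_pixels = len(stack[0])
    composite = []
    for i in range(n_pixels):
        valid = [scene[i] for scene in stack if scene[i] is not None]
        composite.append(median(valid) if valid else None)
    return composite


# Three hypothetical scenes covering three pixels.
scenes = [
    [10, None, 30],
    [20, None, None],
    [60, None, 50],
]
print(median_composite(scenes))
```

A pixel that is missing in one scene is filled from the others, which is why a multi-scene composite reduces the data loss caused by the scan line gaps.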
To eliminate or minimize the effects of clouds and cloud shadows on the L7 data, we implemented a two-step approach. Firstly, we selected only tiles where the total cloud cover was less than 50% using the "CLOUD_COVER" metadata field of each tile obtained from the GEE. After applying this filter, we obtained 19 scenes. However, it is important to note that the tiles that passed the first step may still contain areas with cloud shadows, clouds, and cloud confidence. Therefore, in the second step, we utilized the "Quality Assessment (QA)" band. This band provides information on surface, atmospheric, and sensor conditions for each pixel of the L7 data, allowing for per-pixel filtering. Cloud shadows, clouds, and cloud confidence are identified based on the bit values of 3, 5, and 7, respectively. All data and preprocessing steps were conducted using the GEE.
At a regional scale, local characteristics such as vegetation and climate significantly influence surface reflectance. To address the variations in spectral responses, additional covariates are typically utilized (Brown et al., 2020; Pflugmacher et al., 2019; Venter & Sydenham, 2021). In this study, we used elevation data from a third-party source, Epic WebGIS Portugal (Magalhães et al., 2018), which has a spatial resolution of 25 meters. We also used bioclimatic variables from WorldClim. The bioclimatic variables are averages for the time period between 1950 and 2000. The spatial resolution of this data is 1 km (WorldClim BIO Variables V1 - Hijmans et al., 2005), obtained from GEE. The list of 19 bioclimatic variables includes annual mean temperature (bio01), mean diurnal range (bio02), isothermality (bio03), temperature seasonality (bio04), maximum temperature of the warmest month (bio05), minimum temperature of the coldest month (bio06), temperature annual range (bio07), mean temperature of the wettest quarter (bio08), mean temperature of the driest quarter (bio09), mean temperature of the warmest quarter (bio10), mean temperature of the coldest quarter (bio11), annual precipitation (bio12), precipitation of the wettest month (bio13), precipitation of the driest month (bio14), precipitation seasonality (bio15), precipitation of the wettest quarter (bio16), precipitation of the driest quarter (bio17), precipitation of the warmest quarter (bio18), and precipitation of the coldest quarter (bio19).
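Returning to the second step of the cloud screening, per-pixel filtering on a bit-packed quality band can be sketched as below. The bit positions (3, 5, and 7) are taken from the text; the constant and function names are hypothetical, and the exact bit layout of a given Landsat product should be checked against its documentation before reuse.

```python
# Bit positions as stated in the text (assumed layout, verify per product).
CLOUD_SHADOW_BIT = 3
CLOUD_BIT = 5
CLOUD_CONFIDENCE_BIT = 7


def is_clear(qa_value: int) -> bool:
    """Return True if none of the cloud-related bits is set in the QA value."""
    flagged_bits = (CLOUD_SHADOW_BIT, CLOUD_BIT, CLOUD_CONFIDENCE_BIT)
    return all((qa_value >> bit) & 1 == 0 for bit in flagged_bits)


# Hypothetical QA values: no flags, cloud bit set, and an unrelated bit set.
for qa in (0, 1 << CLOUD_BIT, 1 << 2):
    print(qa, is_clear(qa))
```

Pixels for which the function returns False would be masked before the median composite is computed.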
Finally, we also included data on night-light sources with the objective of differentiating artificial surfaces from pasture surfaces. The data was obtained from the Earth Observation Group, Payne Institute (Mills et al., 2013) with approximately 500 m resolution. As with the L7 data, we calculated the median value for the period between 1 January and 30 June 2013.

Classification approaches
To identify SBP areas, we utilized two classification models, namely a gradient boosting tree-based method (XGB) and an artificial neural network (ANN). Both models used all input variables, including individual L7 bands, NDVI, elevation, bioclimatic, and night-light data, equally. Decision trees, including XGB, and ANN are well known for their capability to handle large datasets and to take into account non-linear relationships between explanatory and response variables, and for their robustness against overfitting (Gómez et al., 2016; Phiri et al., 2020). Introduced in 2016, XGB is a gradient boosting tree-based method that is trained by making sequential predictions and combining weak predictive tree models to learn from obtained errors. XGB has shown significant performance improvements and computational efficiency compared to traditional gradient boosting methods. On the other hand, ANN is a multi-layer network structure consisting of an input layer with input/explanatory variables, an output layer with the dependent/objective variable, and one or more hidden layers with nodes or artificial neurons. Each hidden layer receives a signal, processes it through a transfer function, and passes the processed signal to the neurons connected to the next layer. In a single hidden layer structure, neurons in the hidden layer receive a signal from the neurons in the input layer, process it, and transfer the processed signal to the output layer.
The hyperparameters of both models were optimized using a Bayesian optimization approach. For XGB, we tested several hyperparameters, including the number of estimators (ranging from 1000 to 10,000 with an interval of 50), the learning rate (ranging from 0.01 to 1, with intervals of 0.015), the maximum depth of trees (ranging from 1 to 30), and L1 and L2 regularization. The objective function was set to "binary:logistic". For ANN, we tested the number of hidden layers (one or two), the number of neurons in each hidden layer (ranging from 100 to 1000, with an interval of 50), the learning rate (ranging from 0.01 to 1, with intervals of 0.015), and the activation function (sigmoid, tanh, or softmax). The Adam optimizer was used in the ANN, and the binary cross-entropy objective function was used.
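The XGB search space described above can be written out explicitly, as in the sketch below. Only the hyperparameters whose ranges are stated in the text are included (the L1 and L2 regularization ranges are not given, so they are omitted). Random sampling is used here purely as a simple stand-in for the Bayesian optimization applied in the study, and all function names are hypothetical.

```python
import random


def xgb_search_space():
    """Discrete grids for the XGB hyperparameters with ranges stated in the text."""
    return {
        # 1000 to 10,000 in steps of 50
        "n_estimators": list(range(1000, 10001, 50)),
        # 0.01 to 1 in steps of 0.015 (67 values)
        "learning_rate": [round(0.01 + 0.015 * k, 3) for k in range(67)],
        # 1 to 30
        "max_depth": list(range(1, 31)),
    }


def sample_configuration(space, rng):
    """Draw one candidate configuration (random search, not Bayesian optimization)."""
    return {name: rng.choice(values) for name, values in space.items()}


space = xgb_search_space()
config = sample_configuration(space, random.Random(0))
config["objective"] = "binary:logistic"  # fixed, as stated in the text
print(config)
```

A Bayesian optimizer would replace `sample_configuration` with a procedure that proposes the next candidate based on the validation errors of the configurations already evaluated.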
For the XGB method, input data normalization was not required (Chen & Guestrin, 2016). However, for ANN, we used a "min-max" normalization approach where values were normalized between 0 and 1. This was done to increase the learning rate and ensure faster convergence. A model with large weights tends to be unstable, suffer from poor performance during learning, and be sensitive to input values, resulting in higher generalization error (Bishop, 1995; Goodfellow et al., 2016).
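The min-max normalization mentioned above maps each input variable to the interval [0, 1] using x' = (x - min) / (max - min). A minimal sketch follows; the function name and the example values are hypothetical, and returning zeros for a constant variable is a choice made for this example only.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical elevation values in metres.
print(min_max_normalize([2, 4, 6]))
```

In practice the minimum and maximum would be computed on the training set only and then reused for the validation and test sets, so that no information leaks from held-out data.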

Validation approach
To train and evaluate the models, all SBP polygons were used but, to reduce computational requirements, only 30% of non-SBP polygons were used. To minimize the sensitivity of results to data selection effects, 10 different random selections of non-SBP polygons were considered. For each of the 10 selections of the non-SBP polygons, we used the following structured multi-fold approach. First, the data was split into independent training (75%) and test (25%) sets. Then, the training set was split into 10 approximately equal folds. For each choice of fold, 9/10 of the training set (all the other folds) was used to train the model and the remaining 1 part (the fold itself, taken as a hold-out sample) was used as the validation set. For each selection of the non-SBP polygons, the error on the validation set was used by the Bayesian optimization approach to select the best hyperparameters. The performance of each model (XGB and ANN) was then measured on the independent test set for each of the 10 selections of the non-SBP polygons. Finally, the identification error of each model (XGB and ANN) is taken as equal to the average of these 10 error values.
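The sampling and splitting scheme described above can be sketched as follows. This is an illustration under simplifying assumptions: parcels are represented by hypothetical integer identifiers, the split is a plain random split (the text does not state whether it was stratified), and the function name and defaults are introduced only for this example.

```python
import random


def make_splits(sbp_ids, non_sbp_ids, seed,
                non_sbp_fraction=0.3, test_fraction=0.25, n_folds=10):
    """One random selection of non-SBP parcels, then a train/test split and folds."""
    rng = random.Random(seed)
    # Keep all SBP parcels and a random 30% of the non-SBP parcels.
    n_sampled = round(len(non_sbp_ids) * non_sbp_fraction)
    data = list(sbp_ids) + rng.sample(list(non_sbp_ids), n_sampled)
    rng.shuffle(data)
    # Independent test set (25%) and training set (75%).
    n_test = round(len(data) * test_fraction)
    test, train = data[:n_test], data[n_test:]
    # Ten approximately equal folds of the training set.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return train, test, folds


sbp = list(range(100))              # hypothetical SBP parcel ids
non_sbp = list(range(1000, 2000))   # hypothetical non-SBP parcel ids
for seed in range(10):              # 10 random selections of non-SBP parcels
    train, test, folds = make_splits(sbp, non_sbp, seed)
    for k, validation_fold in enumerate(folds):
        training_part = [p for j, f in enumerate(folds) if j != k for p in f]
        # Fit on training_part, evaluate on validation_fold (model code omitted).
print(len(train), len(test), [len(f) for f in folds])
```

The test set is never touched during hyperparameter selection, so the error averaged over the 10 selections is an estimate on data the model has not seen.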
To compare the performance of the obtained models (XGB and ANN), we used multiple evaluation metrics, including the confusion matrix, overall accuracy (ratio of the number of correctly classified parcels to the total number of parcels), user's accuracy (ratio of the number of true positives to the sum of true positives and false positives), producer's accuracy (ratio of the number of true positives to the sum of true positives and false negatives), and the F1-score (harmonic mean of user's and producer's accuracy).
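The metrics defined above follow directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts that are not results from the study; the function name is likewise introduced only for illustration.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy metrics for a binary (SBP vs. non-SBP) confusion matrix."""
    overall = (tp + tn) / (tp + fp + fn + tn)
    users = tp / (tp + fp)       # user's accuracy (precision)
    producers = tp / (tp + fn)   # producer's accuracy (recall)
    f1 = 2 * users * producers / (users + producers)
    return {
        "overall_accuracy": overall,
        "users_accuracy": users,
        "producers_accuracy": producers,
        "f1_score": f1,
    }


# Hypothetical counts for 1000 parcels, 100 of which are SBP.
print(classification_metrics(tp=80, fp=20, fn=20, tn=880))
```

With a rare positive class, overall accuracy is dominated by the true negatives, which is why user's and producer's accuracy are reported alongside it.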

Remote sensing data availability
Using all available L7 images, we calculated composite images for individual bands and NDVI. The number of clear sky observations (CSO) depends on the location within the Alentejo region (Figure 1), as it is covered by different L7 tiles. This excludes cloudy pixels and those that cannot be used due to the failure of the L7 scan line corrector. The number of CSO ranges from 1 to 17, with more than 70% of CSO occurring between April and June. The composite values for each band and NDVI are thus highly influenced by this period. In more than 70% of the Alentejo region, over five images were used (see Figure 1(a)), with the highest number of pixels belonging to the 3-image class (20%). The SBP areas show similar trends to the Alentejo region, with the 3-image class having the highest number of observations (Figure 1(b)). However, only 40% of SBP parcels have more than five images available.

Optimal structure and accuracy assessment
The XGB model outperformed the ANN in identifying SBP in the independent test set in all performance metrics (Figure 2). Both methods had similar performances in terms of overall accuracy (XGB: 93%, ANN: 91%). However, XGB had significantly higher producer's and user's accuracy. The highest difference was in user's accuracy, which is the ratio of true positives to true positives plus false positives, where XGB had a performance that is 52% better than ANN (XGB: 82%, ANN: 54%). The producer's accuracy, which is the ratio of true positives to true positives plus false negatives, of XGB was 25% higher than that of ANN (XGB: 80%, ANN: 63%).
Regarding the individual importance of independent variables, no significant variation was observed among random selections of the non-SBP data. Out of the 10 random selections of non-SBP areas for XGB, the bioclimatic variable "precipitation of driest month" exhibited the highest individual average feature importance (12%), followed by parcel area (6%) and mean temperature of the warmest quarter (5%). The first remotely sensed variable, night-light, ranked fifth in importance (4%), while band 4 of the L7 satellite ranked sixth (4%). Additionally, the input variables associated with the variance of L7 bands and NDVI demonstrated the lowest explanatory power, with a cumulative importance of 9%. No variable, whether remotely sensed or other, demonstrated an overwhelming feature importance when taken in isolation compared to the rest of the variables. For instance, the difference between the variable with the highest value and the first remotely sensed variable is only 8 percentage points. Additionally, when taken in combination, remotely sensed data and bioclimatic data had similar contributions. The total feature importance of remotely sensed data, including L7 bands, NDVI, and night-light data, amounted to approximately 43%.

Regional application
We used the best model (XGB) to identify all SBP parcels and their respective areas in the entire Alentejo region, as shown in Figure 3, which also displays all the municipalities of Alentejo. A total of 32,480 parcels were classified as SBP areas, representing approximately 5% of the total number of parcels in the Alentejo region (708,400 parcels). The total area of identified SBP in 2013 is approximately 1,320 km2. The estimated SBP areas are mostly located in the central and northern areas of the Alentejo region, with less than 10% situated in the four southern and coastal municipalities of Alentejo. The total estimated area of SBP represents approximately 4% of the total area of the Alentejo region, which is around 31,550 km2. About 25% of the parcels classified as SBP were classified as "permanent pastures" in the Portuguese land parcel identification system (7,325 parcels). However, about 38% of the SBP parcels were classified as "annual crops" (12,301 parcels). Among the other land use classes of the Portuguese land parcel identification system, the class with the highest percentage was "olive grove" with only 9%. A similar distribution was found in the known SBP parcels from PCF, where 30% of known SBP parcels are in "permanent pastures", and 31% are in "annual crops".
Despite the existence of two notable regions with a high concentration of SBP, namely the municipality of Vendas Novas (on the left) and Viana do Alentejo (in the centre), estimated SBP areas were sparsely distributed throughout the Alentejo region (Figure 3). On average, the classified SBP parcels had five neighbours, but more than 40% of the classified SBP parcels were isolated (13,662 parcels) (Figure 4). Only 21% of the classified SBP parcels had more than 50% of their neighbours also with SBP.
The classified SBP parcels therefore exhibit differences in size and adjacency. Figure 5 displays the classified SBP parcels at the plot level, with an L7 image used to train the models as the base map (only RGB bands were plotted). In this area, a high adjacency of the classified parcels can be observed, but there is also a significant portion of isolated classified parcels. Furthermore, the classified parcels range greatly in size (in this figure, from about 10 km2 to about 1,000 km2), and they also vary significantly in shape. The majority of the parcels do not have a well-defined shape, but some are circular parcels that can be readily identified, and which probably correspond to pastures irrigated with a centre pivot. In Figure 5, SBP parcels tend to have a darker green colour than the surrounding pasture areas (also in green). In this figure, other land uses that were correctly not identified as SBP, such as rocks and landforms (in brown and dark grey) and human construction (in white and light grey), can also be seen.
At the municipality level, there is a high agreement in the spatial distribution of known SBP from PCF and the estimated SBP area (Figure 6). For instance, the Évora municipality has the highest percentage of known and estimated SBP area (represented in the class ">11" and the largest municipality in the centre). Moreover, there are five municipalities for which known and classified SBP areas constitute more than 35% of the total SBP area in the Alentejo region (the five municipalities located in the centre of the Alentejo region in Figure 6). There are also some discrepancies between known and estimated SBP areas, particularly in municipalities without known installed SBP (shown in grey on the left panel of Figure 6). There are four municipalities without known installed SBP, but the model still estimates SBP area in those municipalities. However, the total estimated SBP area in those municipalities is less than 2% of the total estimated area. It should also be noted that the training data only includes areas installed under the PCF project. It is unlikely, but still possible, that some additional parcels were sown in this period, or were sown earlier and still maintained good agronomic conditions. Overestimated SBP areas in a municipality may therefore not be errors; the algorithm may instead be finding pastures that were installed at other times and not properly tracked but that still function as SBP.

Discussion
In this study, we present a model for accurately identifying SBP parcels in the Alentejo region of Portugal using remote sensing data and machine learning methods. The proposed model incorporates remote sensing data from L7 and auxiliary variables to achieve precise identification of SBP parcels. The model demonstrated high accuracy across all performance indicators, with an overall accuracy of 94% on the independent test set. Among the two machine learning models employed, XGB consistently outperformed ANN in all performance indicators, including overall accuracy.
There have been few studies focused on identifying specific types of land cover in grassland/agroforestry systems, such as the SBP system, with most studies focused on broader land cover classes (D'Andrimont et al., 2021; Phiri et al., 2020) or on discerning more abstract land use types where errors are diluted. Venter and Sydenham (2021) recently used Sentinel-2 satellite imagery combined with climatic variables to produce a land cover map of Europe with a very high spatial resolution (10 m) compared to other available land cover maps such as CORINE, which has a spatial resolution of 250 m. However, Venter and Sydenham (2021) only considered eight unique land cover classes, with SBP areas potentially classified into three of those classes (grassland, shrubland, and woodland) depending on the woody/shrub vegetation cover. Furthermore, the overall accuracy of the map in southern European regions, such as Southern Portugal, ranged from 50% to 70%. Previous studies in Portugal have mainly focused on crops and broad agroforestry systems (Allen et al., 2018; Navarro et al., 2021). These findings suggest that models trained with broader land cover classifications and for broader regions, such as continental scale, may not be suitable for identifying SBP areas.
It is worth noting that the accuracy obtained in the present study is aligned with the accuracy observed in the literature for studies with similar aims for other cover types. For example, Phiri et al. (2020) reviewed the use of Sentinel-2 satellite imagery for land cover/use classification, and the overall accuracy for object-based classification in the papers reviewed ranged between 61% and 98%. In Portugal, Allen et al. (2018) produced a model for land cover classification covering three municipalities in Alentejo, which obtained an accuracy of 63% for the identification of the agroforestry land cover class (where most SBP is located).
Examining the individual contributions of input variables, it is worth noting that certain bioclimatic variables, when taken in isolation, exhibit higher feature importance than any remotely sensed variable. However, none of the variables in isolation holds significant explanatory power. The variable with the highest feature importance, for example, only accounts for 12%. In contrast, Venter and Sydenham (2021) found a feature with 60% importance. In our results, such a substantial difference in feature importance is not observed, leading to the conclusion that no individual variable can explain the obtained results. Both remote sensing variables and other auxiliary variables are necessary to achieve the obtained results. When the effects are assessed together, the explanatory power of both sets of variables was similar. In fact, the identification of SBP at the parcel level was made possible by using remote sensing data from L7 and night-light. Primarily, remote sensing data introduced spatial and temporal variations in the analysis that would not have been possible through the use of bioclimatic data alone. Remote sources have superior spatial resolution compared to other variables. Landsat-7 data offers a spatial resolution of 30 m, while bioclimatic variables have a resolution of 1 km, except for elevation, which has a resolution of 25 m. Without remote sensing data, the predicted area of sown biodiverse pasture would also remain constant over the time period of the project as bioclimatic sources are time independent.
The most significant limitation of the present work was the use of L7 because it affects the data and models used. The only spatially explicit data available on SBP areas was for the period between 2009 and 2013. L7 was the only satellite mission available in 2013 that still collects data (United States Geological Survey (USGS), 2021). If more recent spatially explicit data on SBP areas were available, we could have used Landsat-8 or Sentinel-2 data, which have a shorter revisit time (L7 has a revisit time of ~16 days, while Sentinel-2 has a revisit time of ~5 days), more individual bands, and higher spatial resolution, with Sentinel-2 having a resolution of 10 m. Additionally, we could not use a multi-temporal image stacking approach, such as using a monthly composite image instead of a single composite image for the period between January and June. The scan line corrector failure of L7 prevented us from using such an approach. Due to the effect of the L7 failure and the presence of clouds, only 20% of Alentejo and 10% of SBP area had data available for each month between January and June of 2013. Therefore, we had to discard the option of using a multi-temporal approach. Furthermore, our algorithm classified parcels in binary form (as SBP or non-SBP). It will be interesting in the future to test if a pixel-based classification system, using for example convolutional neural networks, would improve the predictive capabilities of this approach.
We used remotely sensed data for the period between January and June 2013 because this period covers the season of peak productivity and flowering in the SBP system. The selection of this period makes it easier to identify the differences between SBP and other grassland systems and permanent crops that exhibit different temporal patterns, such as lower NDVI throughout the entire period. If months beyond June had been considered, the data obtained would have been of lower quality in terms of highlighting the unique pattern of the SBP system. For instance, between July and mid-September, all grassland systems, including the SBP system, are dry in Alentejo. During the autumn months, the sown biodiverse pastures and other natural or semi-natural pastures start to grow, but there is no significant difference between the sown biodiverse pastures and the others. Therefore, data from the period between July and December would not aid in the identification of the SBP system, as it would be similar to that of other grasslands. Since the majority of the installed SBP are rainfed, the temporal dynamics of the pastures are broadly similar. However, in the approximately 10% of parcels with irrigated production, the temporal dynamics differ from those of the rainfed parcels, which can also influence the obtained results.
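The NDVI contrast discussed above can be illustrated with a minimal sketch. It assumes the standard NDVI definition from red and near-infrared reflectance (bands 3 and 4 of Landsat-7 ETM+) and a simple mean over the usable observations between January and June. The compositing rule, the function names and the sample values are illustrative assumptions, not the procedure used in this study.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple

# One observation is a (red, nir) reflectance pair; None marks a date lost to
# clouds or to scan-line gaps.
Observation = Optional[Tuple[float, float]]


def ndvi(red: float, nir: float) -> Optional[float]:
    """Standard NDVI: (NIR - red) / (NIR + red); None when the sum is zero."""
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def seasonal_mean_ndvi(observations: Sequence[Observation]) -> Optional[float]:
    """Mean NDVI over usable observations (an assumed compositing rule)."""
    values = []
    for obs in observations:
        if obs is None:
            continue  # skip dates without a clear-sky observation
        value = ndvi(*obs)
        if value is not None:
            values.append(value)
    return mean(values) if values else None


# Hypothetical January-June series for one parcel (illustrative values only).
parcel_series = [(0.08, 0.40), None, (0.06, 0.42), (0.07, 0.45), None, (0.12, 0.30)]
print(seasonal_mean_ndvi(parcel_series))
```

A parcel whose seasonal value stays low over the whole period would, under these assumptions, resemble the permanent crops and other grasslands mentioned above rather than SBP.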
In the present study, we conducted a binary classification task to distinguish between SBP and non-SBP parcels. However, a multiclass classification approach could also be used, for example to distinguish between rainfed and irrigated SBP parcels or between different types of biodiverse pastures. Nevertheless, increasing the number of land cover/use classes typically leads to a decrease in the overall accuracy of the models, as demonstrated by Van Thinh et al. (2019). The authors found a negative correlation between overall accuracy and the number of classes, indicating a tradeoff between accuracy and the level of detail in the land cover map. Furthermore, the authors concluded that land cover mappers should expect a decrease in overall accuracy of approximately 0.77% for every additional land cover class. That study also observed that the number of classes has a greater impact on overall accuracy than the spatial resolution of the remote sensing data source.
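As a rough worked example of this tradeoff, the sketch below applies the 0.77% figure as a linear rule of thumb. It assumes that the decrease is expressed in percentage points and is constant per added class. The function name and the baseline accuracy are hypothetical and are not results of this study.

```python
def expected_overall_accuracy(
    baseline_accuracy: float,
    baseline_classes: int,
    target_classes: int,
    drop_per_class: float = 0.77,
) -> float:
    """Linear rule of thumb: subtract a fixed number of percentage points
    from the overall accuracy for every class added to the legend."""
    if target_classes < baseline_classes:
        raise ValueError("target_classes must not be smaller than baseline_classes")
    extra_classes = target_classes - baseline_classes
    return baseline_accuracy - drop_per_class * extra_classes


# Hypothetical case: a binary map at 90% overall accuracy extended to four
# classes (e.g. rainfed SBP, irrigated SBP, other pasture, other land cover).
print(expected_overall_accuracy(90.0, baseline_classes=2, target_classes=4))
```

Under these assumptions, moving from two to four classes would cost about 1.5 percentage points of overall accuracy.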
In this study, two machine learning (ML) models were evaluated: a gradient-boosted decision tree model (XGB) and an artificial neural network (ANN). Both models performed well in identifying the SBP system, but they do not capture spatial patterns as effectively as convolutional neural networks (CNNs), which are commonly used for image classification and segmentation. CNN models can also be used for landscape segmentation, such as in the task of SBP identification. However, a major limitation of CNN models is the need for a large amount of data to achieve high accuracy without overfitting, as noted by Hu et al. (2018). For example, the U-net CNN was initially developed for biomedical image segmentation (Ronneberger et al., 2015), but has also been used for land cover classification/segmentation (e.g. Giang et al. (2020)). U-net has 23 convolutional layers and approximately 1 million parameters to train, whereas the most complex ANN structure in the present study trained approximately 5 thousand parameters (about 0.5% of the U-net parameters). Despite the lower complexity of the models employed in this study, they achieved high accuracy.
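To make the comparison of model sizes concrete, the following sketch counts the trainable parameters of a fully connected network as the sum of weights and biases per layer. The layer sizes are hypothetical, chosen only to land near the 5 thousand parameters mentioned above; the actual ANN structure is not reproduced here.

```python
from typing import Sequence


def count_dense_parameters(layer_sizes: Sequence[int]) -> int:
    """Trainable parameters of a fully connected network.

    Each layer with n_in inputs and n_out units has n_in * n_out weights
    plus n_out biases.
    """
    if len(layer_sizes) < 2:
        raise ValueError("need at least an input and an output layer")
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += (n_in + 1) * n_out
    return total


# Hypothetical structure: 40 input variables, two hidden layers, one output.
ann_parameters = count_dense_parameters([40, 64, 32, 1])
unet_parameters = 1_000_000  # approximate figure quoted in the text
print(ann_parameters, f"{100 * ann_parameters / unet_parameters:.2f}% of U-net")
```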
The algorithm developed here is a first step towards a robust method for surveying where SBP are located nationwide. One practical application of such an algorithm is the annual monitoring of SBP from 2000 until the present day using available L7 data. The Portuguese government takes into account the carbon sequestration performed by SBP areas in the National Inventory Report (APA, 2018). However, the National Inventory Report only uses the SBP area installed until 2012, as the real area of SBP installed between 2013 and the present is unknown. With the models produced in this study, it is possible to estimate the current area of SBP in Portugal and update the contribution of SBP areas to the total greenhouse gas balance of the country. To achieve this, the proposed model should be combined with another model to estimate the soil organic carbon (SOC) stocks of SBP from installation until saturation is reached. Models and approaches already in the literature can be used for SOC stock estimation in SBP areas, such as the ANN model proposed by Morais et al. (2021) based on infrared spectral data. Another alternative is the use of process-based modelling, where biogeochemical processes that occur in nature are formulated according to mathematical-ecological theory (Coleman et al., 1997; Liu et al., 2011), and SOC turnover is estimated according to specific site conditions and management practices (Luo et al., 2015; T. G. Morais et al., 2019). The Rothamsted soil Carbon Model (RothC model), a SOC process-based model, has already been used in the SBP system in Alentejo to estimate variables associated with SBP (Morais, Ricardo, & Domingos, 2018). However, its applicability to estimate SOC accumulation due to the installation of SBP still requires validation, particularly in terms of the default SOC mineralization rates considered in the RothC model for SBP.

Conclusions
Here, we present the first model for identifying Portuguese SBP areas by combining remote sensing (RS) data, including L7 satellite data, with additional covariates such as climate, terrain, and night-light information. Using remote sensing data enables the identification of SBP at the parcel level. This dataset was used in two ML models, specifically XGB and ANN, which serve as valuable tools for monitoring this significant grassland system. Comparing the models, we found that XGB outperformed ANN in all considered performance indicators (e.g. XGB overall accuracy: 93%, ANN: 91%). Further, XGB was particularly better than ANN in the user's accuracy (the ratio of true positives to the sum of true positives and false positives), where XGB performed 52% better in relative terms (XGB: 82%, ANN: 54%). With the obtained model, we were able to produce a spatially explicit estimate of the total SBP area in the Alentejo region (about 1,300 km2, which is 5% of the Alentejo area). A practical application of the present work is the annual monitoring of SBP, which is now necessary for estimating the contribution of the land to carbon sequestration in soils.
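The performance indicators compared above can be sketched from a binary confusion matrix as follows. The function names and the parcel counts are hypothetical and serve only to show the definitions; the final line reproduces the relative difference between the reported user's accuracies of XGB (82%) and ANN (54%).

```python
from typing import Dict


def accuracy_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Overall, user's and producer's accuracy for the positive (SBP) class."""
    total = tp + fp + fn + tn
    if total == 0 or tp + fp == 0 or tp + fn == 0:
        raise ValueError("metrics are undefined for an empty class")
    return {
        "overall": (tp + tn) / total,  # share of parcels classified correctly
        "user": tp / (tp + fp),        # share of predicted SBP that is SBP
        "producer": tp / (tp + fn),    # share of true SBP that was detected
    }


def relative_gain(value: float, reference: float) -> float:
    """Relative improvement of value over reference."""
    return (value - reference) / reference


# Hypothetical parcel counts, for illustration only.
print(accuracy_metrics(tp=80, fp=20, fn=10, tn=890))

# Reported user's accuracies: XGB 82%, ANN 54%.
print(f"{100 * relative_gain(0.82, 0.54):.0f}% relative improvement")
```

The same pair of values corresponds to a difference of 28 percentage points, which is why the 52% figure is best read as a relative improvement.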

Figure 1. Clear sky observations (CSO) in Alentejo region (a) and the number of CSO in sown biodiverse pastures (SBP) areas (b).

Figure 3. Spatial representation of sown biodiverse pasture areas in Alentejo region (in red) in 2013 estimated with the algorithm developed here. The total estimated area of SBP is about 130 thousand hectares (about 5% of the total area of Alentejo region).

Figure 2. Performance metrics (overall accuracy, producer's and user's accuracy) of the best structure of XGBoost (XGB) and artificial neural networks (ANN) for identifying sown biodiverse pastures (SBP) in Alentejo, measured in the independent test set.

Figure 4. Adjacency of predicted sown biodiverse pasture (SBP) areas. The adjacency was calculated as the ratio between the number of neighbour parcels with SBP and the total number of neighbour parcels.
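The adjacency ratio defined in this caption can be sketched as below. The neighbour lists are assumed to be precomputed (for example from shared parcel boundaries, which is an assumption of this sketch), and the parcel identifiers and labels are hypothetical.

```python
from typing import Dict, List, Optional


def adjacency(
    parcel_id: str,
    neighbours: Dict[str, List[str]],
    is_sbp: Dict[str, bool],
) -> Optional[float]:
    """Share of a parcel's neighbours that are predicted as SBP.

    Returns None for parcels without neighbours, where the ratio is undefined.
    """
    neighbour_ids = neighbours.get(parcel_id, [])
    if not neighbour_ids:
        return None
    sbp_neighbours = sum(1 for n in neighbour_ids if is_sbp[n])
    return sbp_neighbours / len(neighbour_ids)


# Hypothetical parcels: neighbour lists and predicted labels.
neighbours = {"p1": ["p2", "p3", "p4"], "p2": ["p1"], "p5": []}
is_sbp = {"p1": True, "p2": True, "p3": False, "p4": True, "p5": False}

for parcel in neighbours:
    print(parcel, adjacency(parcel, neighbours, is_sbp))
```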

Figure 5. Example of the identified sown biodiverse pastures (SBP) areas at plot level. The base map in the figure is the composite Landsat 7 image (only RGB bands were plotted) used to train the models.

Figure 6. Comparison between known sown biodiverse pastures (left) and estimated sown biodiverse pastures (right), at municipality level. This comparison is performed at a percentage level, i.e. the value in each municipality corresponds to the area of SBP in that municipality divided by the total SBP area in Alentejo. Municipalities without known SBP installed are presented in grey (in the left plot).