Fractional crop-planting area projection by integrating geographic grid data and agricultural statistics based on random forest regression

ABSTRACT Accurate fractional crop-planting area (FCPA) mapping is a challenging task as it requires leveraging the advantages of geographic data in detailed spatial expression and agricultural statistics in the description of crop types and quantitative characteristics. We present a robust method to disaggregate the agricultural statistics within each county unit to 1-km scale grid by combining particle swarm optimization (PSO)-based feature selection with Random Forest (RF) regression, and an iterative area-gapallocation(IAGA) method is implemented to reconcile the discrepancies between the disaggregating results and statistics. The agriculture in Gansu Province, China, is characterized by complex heterogeneous smallholder farming landscapes. We tested this methodology in Gansu and explored the synergistic estimation of FCPA for six types of crops (i.e. wheat, maize, oil-bearing, vegetable,orchards, and other crops) in 2010. The results showed that the derived FCPA maps matched well with the statistics in terms of quantity, while also providing spatial details. The quantitative evaluation results indicated that the derived FCPA had good accuracy,with a higher R2 above 0.97, a lower RMSE below 1%, and an absolute error between 1.53-5.24%. The proposed methodology provides valuable insights for practical large-scale FCPA mapping at a high spatial resolution in a cost-effective manner.


Introduction
Crops account for 15 × 10 6 km 2 (i.e.approximately 40%) of the Earth's ice-free land surface,providing important products for human survival, including food, vegetables, edible oil, fiber, and medicine (Fritz et al. 2015;Molotoks et al. 2018;Ramankutty et al. 2008).With the rapid growth of the global population, one of the greatest challenges facing humanity in the twenty-first century is the ability to address the growing demand for more crop products while also mitigating the season (Hu et al. 2016;Lobell and Asner 2004;Xiong et al. 2017).The core of the single date-based method is to find the most appropriate image in the crucial phenological phase to optimize the spectral separability of different crop types.Although this method is simple and easy to use, it uses medium and high spatial-resolution optical remote sensing images, such as Landsat and Sentinel-2, because of the long revisit cycle and relatively small swath of such sensors and the impact of cloud and rain weather conditions.The crucial phenological phase images are often insufficient, and it is extremely difficult to identify two or more crops that have similar spectra.Generally, the crop-mapping method using multitemporal images selects one or more feature variables, such as the normalized difference vegetation index (NDVI), enhanced vegetation index (EVI), or leaf area index (LAI), and then determines the threshold value of the 'spectrum-time series' curve for each feature variable at a specific time to identify a crop.Compared with single-date remotely sensed data, multitemporal images can accurately depict the spectral dynamics of crops throughout the entire growing period and better combine the spectral and phenological signatures of crops.This approach is promising for capturing subtle but crucial phenological differences among various crop types (Foerster et al. 2012;Zhang et al. 2015).Crop-mapping methods applied to multitemporal images have been proven to perform better than single date-based methods and have been successfully implemented at the regional scale (Begue et al. 2018;Gumma et al. 2014;Luo et al. 2020;Pena-Barragan et al. 2011;Zhong, Gong, and Biging 2014) as well as at the global scale (Pittman et al. 2010;Salmon et al. 2015;Thenkabail and Wu 2012).Different crop types, however, may exhibit similar spectral characteristics at a given time point.Although these hard classification methods, which aggregate all crop types into a single category in a pixel, provide an efficient means to map crop type and extent, they may result in significant errors in estimated crop areas because of the spatial resolution limitations of the images.Notably, heterogeneous agricultural landscapes characterized by small-scale family farms and excessive land fragmentation often have the problem of mixed pixels (Clauss, Yan, and Kuenzer 2016;Lobell and Asner 2004).
To address this 'mixed pixels' problem caused by coarse resolution in heterogeneous agricultural regions on crop mapping, some studies have explored subpixel mapping of crops and have obtained the fractional abundance of each type of crop on a pixel.Existing methods can be group into two categories: multiple regression and linear spectral unmixing methods.Multiple regression methods generally use traditional statistical regression analysis or machine learning methods to develop quantitative relationships between the subpixel crop fraction and a set of input variables involving spectral signatures from different temporal periods.For example, Pan et al. (2012) developed a linear regression model between the MODIS EVI time series and actual crop acreage to estimate winter wheat fractions within 250 m of spatial resolution in two agricultural areas of China.Liu et al. (2018) used linear regression to construct the relationship between the coefficient of variation of the land surface water index derived from MODIS and paddy rice reference planting fraction calculated from high-resolution Gaofen2 images.Atzberger and Rembold (2013) used a neural network to develop nonlinear regression relationships between the AVHRR NDVI time series and fractional crop area within a 1-km spatial resolution in central Italy.In addition, a stepwise regression model was established using the MODIS EVI time series and Landsat-derived fractional cropland data to estimate the cropland fraction at a regional scale (Zhu, Zhang, and Huang 2019).The spectral unmixing method assumes that the reflectance of a pixel is a linear combination of the surface component spectra within a pixel and that the weight of each component is proportional to the area occupied by that component.For example, Lobell and Asner (2004) used a probabilistic linear unmixing method to estimate subpixel crop area fractions based on MODIS time series reflectance signatures.Subpixel crop mapping methods have created new possibilities for generating more accurate crop planting area data.Because of different field management (e.g.irrigation, fertilization, and sowing), planting modes (e.g.single or double cropping, continuous cropping, rotation, intercropping and interplanting), local weather conditions, and soil quality, two phenomenon often occur: (1) different crop types with the same spectrum and (2) variations in spectral signatures of the same crop type.Variations in spectral signatures across the growing period and growing calendar, as well as the spectral variability of crops, including intraclass and interclass variability, also may vary with locations (Pena- Barragan et al. 2011;Sakamoto, Wardlow, and Gitelson 2011).Consequently, the relationships between the subpixel crop fraction and spectral signature can vary across geographical spaces.This type of fractional abundance mapping can provide the proportion of specific crops in a pixel, but it cannot provide the specific location of crops.Super-resolution mapping (SRM), can be considered as a follow up processing technology of fractional abundance mapping, aims to explore both the proportion and location by reconstructing finer-scale images from coarser images and has achieved significant progress in land cover mapping.For example, Wang et al. (2020) constructed an SRM scheme that first utilized spectral unmixing to calculate the land cover proportions corresponding to the subpixels and then allocated the land cover labels to all subpixels by class allocation according to their land cover proportions.He et al. (2022) used deep learning-based SPM for super resolving 30 m Landsat images to generate meter-level high-resolution land cover maps for 28 metropolises in China.To our knowledge, however, the SRM techniques have not yet been applied to crop-mapping applications, which may represent a promising research direction for fine crop mapping.
Note that the subpixel crop mapping has led to the transformation of the crop-mapping mode from the initial single type of crop mapping to the planting structure monitoring composed of multiple crops.Traditional hard classification is considered to be single-crop-type mapping because it only classifies a pixel according to a certain type of crop.For example, single-category mapping of paddy rice (Clauss, Yan, and Kuenzer 2016;Dong et al. 2015;Frolking et al. 2002;Gumma et al. 2014;Xiao et al. 2005;Zhang et al. 2015), wheat (Zhong et al. 2019), maize (Zhang, Feng, and Yao 2014), and soybean (Song et al. 2017;Zhong, Gong, and Biging 2014) has been extensively explored.Although some studies have focused on multicrop mapping, such as soybean and corn (Zhong, Gong, and Biging 2014;2016), as well as mapping of four crops of rice, maize, soybeans, and wheat (Hu et al. 2016), or even more types of crops (Akbari et al. 2020;Bargiel 2017;Foerster et al. 2012;Wardlow and Egbert 2008), they have generated hard classification maps for each type of crop separately and have not considered the harmony of multiple crops.Thus, they are essentially single-crop-type mapping.Early subpixel abundance mapping has explored the estimation of the proportion of single crops, such as wheat (Atzberger and Rembold 2013;Lobell and Asner 2004;Pan et al. 2012), paddy rice (Liu et al. 2018), and maize (Sakamoto, Wardlow, and Gitelson 2011).Crop-planting structure monitoring should consider the relative contributions of all crops to a planted area (Wu and Li. 2012;Pena-Barragan et al. 2011).For example, cropping patterns (e.g.soy-maize, soy-cotton, soy-pasture, soy-fallow, fallow-cotton, and single crop) in the state of Mato Grosso, Brazil, were investigated by Chen et al. (2018).Yin et al. (2020) first generated probability maps for three crops of rice, corn, and soybean, and then the probability maps were composited into a final crop layer by considering the probability of each crop at every pixel.China's global crop-monitoring system (CropWatch) has used remote sensing data and field data to determine key crop production indicators, such as the planting proportion of more than 10 major crops (wheat, rice, and maize) in China (Wu et al. 2014).The National Agricultural Statistics Service (NASS) of the US Department of Agriculture (USDA) has produced a nationwide crop-specific land cover map of the United States each year since 2008, covering 155 classes of cultivated crops (Lark, Schelly, and Gibbs 2021).
Because of the influence of imaging conditions, image acquisition capability, mixed pixels, and scale conversion, crop types and areas generated based on hard classification and subpixel abundance mapping methods are often inconsistent with statistics.Therefore, it has been difficult to achieve ideal crop-mapping accuracy by simply relying on satellite-based datasets.The latest progress in crop mapping is the emergence of new 'data fusion' techniques to integrate satellite images with agricultural statistics to produce high-accuracy crop type and area maps (See et al. 2015).By assuming that the agricultural statistical data are the reference truth, some previous studies have used a satellite-derived land cover dataset to determine the spatial distribution of cropland, and a statistical downscaling model with several constraints was developed to spatially disaggregate the statistical crop-planting or crop-harvesting areas within each geopolitical unit scale to the grid (Frolking et al. 2002;Monfreda, Ramankutty, and Foley 2008;You and Wood. 2006).The crossinformation entropy method has been used to allocate crop statistical information to 5 arcmin grids and to produce spatial distribution maps of world crops ( You et al. 2014).Hu et al. (2021) mapped subpixel crop distributions by integrating MODIS time series and agricultural statistics using a random forest (RF) regression model with high-spatial-resolution reference crop distribution maps (i.e.SPOT and Landsat) as training samples.Combining satellite-based datasets with agricultural statistics has been a powerful way to map crop-planting and harvesting area distributions at regional or global scales.In particular, advanced machine learning technology has shown great potential for combining statistics with satellite observations and transferring knowledge to the pixel level.Because of the relatively coarse spatial resolution and low mapping accuracy, existing large-scale crop-planting and harvesting area maps produced by these methods often have shown poor regional suitability, especially under conditions of complex terrain and heterogeneous planting structure.In addition, existing regional-scale crop-mapping methods often require highresolution reference crop distribution maps or a large quantity of ground survey data as the response target.It often is difficult to obtain this information or requires experts to perform interpretations based on prior knowledge, which results in poor universality of the method.
Based on these achievements, we selected Gansu Province in northwest China, which is characterized by a heterogeneous small-scale agricultural system, as the research area.The objectives of this study were twofold: (1) by leveraging the power of satellite data in the detailed expression of crop spatial distribution and statistical data in the description of crop types and quantitative characteristics, we proposed a robust method combining particle swarm optimization (PSO)-based feature selection with RF regression to allocate crop-plating area statistics to the 1 km × 1 km grid scale without the need of high-resolution reference crop distribution maps or ground survey data support; and (2) we explored the planting structure monitoring of fractional abundance of multiple crop types in regions characterized by complex terrain and heterogeneous smallholder farming landscapes.This paper is organized as follows: Section 2 introduces the geographical information of the study area and describes the study dataset that was used to develop the subpixel fractional crop-planting area (FCPA) mapping model.Section 3 presents the details of methodology.Sections 4 and 5 present the results and a discussion, respectively.Finally, the conclusions are presented in Section 6.

Study area
Located in the inner regions of Northwestern China, Gansu Province (32°31 ′ -42°57 ′ N and 92°13 ′ -108°46 ′ E) (Figure 1) is an agricultural province whose primary production is grain and cash crops.The geographical environment and climatic conditions, however, are not advantageous for agricultural development.In terms of geographical environment, the region covers an area of 42.58 × 10 4 km 2 and has a complex terrain with an elevation ranging from 624 to 5602 m.The physiognomies of Gansu Province are complex and diverse and include six ecological districts: the gully region of the Loess Plateau, the Qinling Zhongshan canyon area, the grassland meadow area of Gannan Plateau, the hilly area of the Loess Plateau, the Hexi Corridor gobi district, and the desert district.In terms of climatic conditions, the various climate types range from humid (southwest) to semi-humid, semiarid, and arid.The mean annual temperature is 273.16-287.26K (An et al. 2020).Water resources are scarce and unevenly distributed; for example, the annual total precipitation ranges from 40 to 800 mm, with a drying gradient from southeast to northwest (Wang et al. 2017).Owing to the unique geographical and climatic conditions, the study area is characterized by the smallholder farming systems with tremendous heterogeneity.Although the study area is rich in crop variety resources, large regional differences in agricultural production, complex planting structures, heterogeneous and fragmented landscapes, diverse cropping patterns, and cultivation habits have made the mixed pixel effect and scale sensitivity prominent, posing a serious challenge for crop mapping.

Datasets and preprocessing
The data used in this study, including geographic and agricultural statistical data, are shown in Table 1.We used geographic data to provide spatial distribution details for subpixel crop mapping, and the county-level agricultural statistics served as the reference 'truth' within an administrative unit.

Geographic data
The geographic data used in this study included China's land use and land cover dataset (CNLULC), China's five-day NDVI composite product (MODND1F), digital elevation model (DEM) of China (Tang 2019), World Database of Protected Areas (WDPA) (UNEP-WCMC, 2022), and China land cover dataset (CLCD).The CNLULC is a multiperiod remote sensing monitoring dataset of land use and land cover change in China, which currently includes 10 periods of data from 1980, 1990, 1995, 2000, 2005, 2010, 2013, 2015, 2018, and 2020(Xu et al. 2018)).CNLULC adopted a hierarchical classification system, which include the following six classes of land cover: cropland, woodland, grassland, waterbody, unused land, and built-up land.We divided the six classes of land cover were into 25 land cover classes, as shown in Table 2.The spatial resolution of CNLULC was 100m, and the overall accuracy of the 25 classes of land use was above 91.2%(Liu et al. 2014).The MODND1F product was generated by taking the five-day maximum value from the daily 500 m MODIS/Terra NDVI built from daily surface reflectance data (Chen et al. 2004), the five-day temporal intervals fully captured the subtle biophysical differences in critical crop growth stages.The China land cover dataset (CLCD) is the first fine-resolution (i.e. 30 m) Landsat-derived annual land cover dataset of China for 1990-2021.It has better accuracy than other existing fineresolution land cover products, including ESACCI_LC, FROM_GLC, and GlobeLand30 (Yang and Huang 2021).In this study, taking the year 2010 as an example, we extracted the corresponding data for Gansu Province from these datasets, in which the NDVI data considered only the growing season of crops (i.e.April-September).Then, we projected all of the data in the study area to the Krasovsky ellipsoid coordinates and Albers projection coordinate system, and the NDVI data were resampled to a resolution of 1 km.

Agricultural statistics
We downloaded country-level agricultural statistics for the year 2010 from the Gansu Province Bureau of Statistics, China, which provides the sown area for each crop type in 87 counties in Gansu Province.In this study, we reclassified all crops into the following six classes: wheat, maize, oil-bearing, vegetables, orchards, and other crops.

Methodology
The proposed methodology for integrating geographic data and agricultural statistics for subpixel crop-type mapping consisted of four main parts (Figure .2): (1) feature extraction, (2) feature selection, (3) disaggregation of statistics data within each county unit to 1 km × 1 km, and (4) adjustment of the predicted crop planting area value to match the statistical data.

Feature extraction
We extracted relevant features from the geographic data as potential predictive variables for FCPA mapping.First, we calculated the proportion of each of the 25 land cover classes per 1-km pixel using 100 m of CNLULC data.In addition to providing NDVI features of 35 time phases, we further extracted 16 phenological features (i.e.beginning of season, end of season, length of season, base value, time of middle of season, seasonal amplitude, rate of increase at the beginning of the season, rate of decrease at the end of the season, large integrated value, small integrated value, value for the start of the season, value at the end of the season, maximum value, minimum value, mean value, and variance value) from the NDVI time series of the growing season (i.e.April-September) using the TIMESAT program (Jonsson and Eklundh 2004) and statistical calculations.We used digital elevation model (DEM) data to extract three topographic features: elevation, slope, and aspect.Thus, we extracted 79 feature variables as potential predictive variables.The WDPA data were used to mask terrestrial and inland water protected areas within the study area.We used the CLCD to calculate the proportion of cropland within each 1-km pixel as an independent validation dataset.
For each county-level administrative unit, we calculated the proportion of each of the six crop class planting areas by dividing the statistical sown areas of each category of crops by the total land area after masking the protected area for the administrative unit, which were used as response variables for FCPA mapping.

Feature selection
Despite the apparent functionality of providing more information, a large number of studies have proven that blindly increasing the number of input features will does not increase the modeling accuracy but instead increases the complexity and reduces the efficiency of the model, leading to that is called the 'curse of dimensionality.'.Moreover, multiple feature variables formed by a single feature extracted from multitemporal remote sensing images may be highly correlated or redundant.Irrelevant and redundant features may even reduce the performance.Feature selection aims to select an optimal subset of input features (even though the number of features is small, it contains the most effective features) to achieve similar or even better modeling performance than using all features while reducing computational costs and data redundancy (Gilbertson and van Niekerk 2017;Xu, Du, and Younan 2017).
Because of the two conflicting objectives in feature selection-that is, maximizing the modeling performance and minimizing the number of features-it is necessary to generate a Pareto front of nondominated solutions (feature subsets) for feature selection that makes a compromise between these two objectives.Most existing feature selection algorithms, however, treat the task as a single objective problem.Xue et al. (2012) proposed a multi-objective PSO feature selection scheme and revealed that the PSO-based multi-objective feature selection not only effectively remove redundant features but also kept or yielded better accuracy and reducd computational complexity.Therefore, we used this method to select the informative features subsets in this study.
The PSO algorithm simulates the behavior of bird flocking and is an efficient global search technique for feature selection (Kennedy and Eberhart 1995;Xue, Zhang, and BrowneSchool 2014).In this study, we adopted binary PSO-based feature selection to automatically evolve a feature subset from the original feature set as prediction variables to improve the accuracy of crop planting area mapping.Note that because the target response variable is at the county level, we had to aggregate 79 feature variables of 1 km × 1 km spatial resolution to the county level to match the response variable.We performed regional statistics and average operations to calculate the mean values of each feature variable in counties to obtain the original feature set of 79 feature variables at the county level.A particle in the population was a candidate solution, which was formed by a binary vector with a length equal to the number of the original features; its cell value was Boolean, where '1' means that the corresponding feature is selected and '0' otherwise.Suppose that in an N-dimensional (79 in this paper) target search space, there are m particles to form a population, where the position vector of the i-th particle is and p in [ {0, 1}.Each particle also has a velocity vector , which represents the direction and speed of particle motion.
1. Initialization.The population of particles was first initialized randomly in a multidimensional search space.2. The fitness of the particles was calculated.This study used RF regression to train particle samples in the population and estimate the accuracy of each particle.3. Iteration.Each particle searched for the optimal solution separately in the search space, recorded it as the current local optimal solution pbest i (t) at iteration t, and further determined the global optimal solution gbest(t) in all pbest i (t).The update of particle velocity and position can be explained using Equations ( 1) and (2): where ) is the sigmoid transformation function; c 1 and c 2 are acceleration constants (set as c 1 = c 2 = 1.496 in this study); r 1 , r 2 , and r 3 are random values between 0 and 1; w is the inertia weight with an initial value of 1 and a damping ratio of 0.99; and the maximum number of iterations was 1,000.

Disaggregating the statistics data to 1 km × 1 km
In this study, we selected the RF regression algorithm to establish the relationship between the prediction and response variables to disaggregate the statistical data within each county unit to 1 km × 1 km.RF is a fast, simple, dynamic, and durable classification and regression model very few limitations, and it has been proven to generate good predictions on similar tasks (Hu et al. 2021;Li, Hou, and Huang 2021;Nicolas et al. 2016).The agriculture statistics are considered to be the reference 'truth' for the crop-planting area, and we used the geographic data to spatially locate the specific crop within an administrative unit and derived the total area of the crop-planting area in an administrative unit from the statistics data.
RF is an integrated learning algorithm that uses a bagging procedure to integrate multiple decision trees (Breiman 2001).The basic idea of RF regression is to average a collection of noisy but approximately unbiased tree models, and hence reduce the variance.We achieved variance reduction in RF by randomly selecting input variables during tree growth.The disaggregation process of crop statistical data based on RF regression is as follows: 1. RF-based subpixel crop mapping model at the county level.
For crop class k, assuming that the above PSO feature selection process has selected M k (k = 1, 2, . . ., 6) features, we used these features as prediction variables, and used the crop area proportion as the response variables to generate a training sample set.The RF-based regression model was then constructed for each crop class, which included the following two main steps: Step 1.For b = 1 to B: (a) Perform a bootstrap sample Z * of size M (the number of county-level administrative units, 87 in this study) from the training sample set Z. (b) Grow a decision tree T b using the bootstrapped sample Z * by recursively repeating the processes of randomly selecting m variables from M k features, picking the best split point among m, and splitting the mode into two daughter nodes for each leaf node of the tree.In addition, to clarify the effectiveness of feature selection, we constructed a corresponding RF subpixel crop-mapping model that did not consider feature selection.
Step 2. Output the ensemble of decision trees {T b } B 1 , and take the average value of the prediction results of B decision trees as the prediction value of the RF-based model.

RF-based subpixel crop-mapping model at the 1 km × 1 km scale
Assuming that the previous quantitative relationship was invariable on a multiscale, we directly applied the constructed six RF models at the county scale to the 1 km ×1 km scale.In particular, we directly used the selected corresponding feature variables of each crop class at a 1 km × 1 km spatial resolution as the inputs of the established RF models, and the output of the RF models was the estimated subpixel crop planting area at the 1 km × 1 km grid scale.

Adjusting the predicted crop planting area value to match the agricultural statistics
The total area of the subpixel crop-planting area at 1 km × 1 km grid scale estimated by the RF models was inevitably inconsistent with the statistical data, and the errors were affected by two main factors: potential assumptions and six RF models are independent of each other.The potential assumption of this study was that the relationship between statistical proportion of crop-planting area and the 79 feature variables was identical at the county scale and at the 1 km × 1 km grid scale.Nevertheless, we found clear differences in the distribution characteristics of the input predictive variables at the two scales, and these differences inevitably led to errors when the models established at the county scale were directly applied to the grid scale.During the process of developing the subpixel crop-mapping model, the supervised RF method accommodated only a single response variable composed of the six classes.Although we developed six separate RF models for wheat, maize, oil-bearing, vegetables, orchards, and other crops, the six independent models may have caused the sum of the areas of the six class crops on a pixel to exceed 100%.Therefore, we needed to adjust our spatially explicit predictions from the RF models to match the agriculture statistics at the county scale.In this study, we adopt the iterative area gap allocation (IAGA) method.The specific adjustment process is shown in Figure 3, which includes the following three steps: Step 1. Aggregate the subpixel planting area distribution data for each class of crops predicted on a 1 km × 1 km scale to the county scale.Assume that the aggregated planting area is Y ij for crop class i in county j and A ij is the sown area of crop class i in county j from the agricultural statistics.
Step 2. For each crop class i in county j, calculate the area gap G ij between the predicted and statistical area values, as follows: If the area gap between the two was smaller than 5% of the statistical area value, no adjustment was made; otherwise, the correction factor was cf ij , as follows: (4) We adjusted the subpixel crop area values of crop class i in county j to the original predicted value multiplied by cf ij .
Step 3. The sum of the subpixel planting areas of the six crop classes on a pixel (x, y) is calculated as follows: where Y i xy is the subpixel planting area of crop class i in pixel (x, y).If the area SA xy is not greater than 100%, no adjustment is made; otherwise, the subpixel planting area of each crop class in this pixel is divided by SA xy .Then, we return to Step 1 and repeat until no adjustment is required.

Performance evaluation
We used three common performance indicators-the coefficient of determination (R 2 ), root mean square error (RMSE), absolute bias (AB), and structural similarity (SSIM) index-to evaluate the performance of the proposed FCPA mapping method.

Feature selection
The selected optimal subset of input features for each of the six crops is shown in Figure 4, which reveals that the optimal subset varied among the different crops, mainly because crop distribution was influenced by the interaction of factors, including climate conditions, soil conditions, topography, human factors, and the biological characteristics of the crops.The optimal subset of land cover for wheat, maize, oil-bearing, vegetable, orchards, and other crops contained 46, 44, 44, 43, 45, and  51 features, respectively.The main land cover features selected in this study were paddy land, grassland, streams and rivers, other built-up land, and other unused land (Figure 4a).Of note, the land cover class of beaches and shores is redundant features and was not selected into the optimal subset for all six crops as the proportion of beaches and shores in the study area is 0. These results were consistent with reality.As is well known, most crops are planted in these land cover areas and Gansu Province contains all landforms except the ocean.The key phenological features selected were the length of the growing season, the middle of the growing season, and the large integrated value, which were also the most direct responses to the biological characteristics of crops.The selected temporal NDVI features were concentrated primarily in the growth stages of germination, small-term, medium-term, and maturity, which, to some extent, represented the growth status of crops.In terms of topographic features, the impact of slope seems to have been slightly greater, possibly because of the negative correlation between slope and crop distribution.

Crops distribution
The distribution of six types of crops based on statistical data and the corresponding disaggregating results of FCPA at 1-km resolution based on RF regression models with or without PSO feature selection over the study area are shown in Figure 5.It is evident the planting area of wheat, corn, and other crops in the research area was relatively large, whereas the area of oil-bearing, vegetables, and orchards was relatively small.The spatial patterns of downscaled crop distributions using RF-based models were correlated with agricultural statistics.However, we still observed mismatching in the high-concentration crop areas and some of the low crop coverage areas between the statistical data and downscaling results, and the most notable differences were found in vegetables (Figure 5 d1-d3) and orchards (Figure 5 e1-e3), which had relatively small planting areas.The statistical data and RF-based downscaling results showed that all six crops were distributed mainly in the Hexi Corridor and the southeastern region of the study area.The Hexi Corridor is mainly irrigated agriculture, whereas southeastern Gansu has relatively more rainwater resources and milder temperatures, providing better natural conditions for crop growth.Statistical data provided only quantitative characteristics of crop planting areas in counties, and subpixel crop distribution maps derived by RF-based downscaling models provided specific planting locations and spatial details for various crops.Moreover, the subpixel crop-mapping results of the RF-based models with and without feature selection were very similar, and from a visual perspective, we found only a weak difference between the two.
As mentioned in Section 3.4, because of the potential assumptions when developing the RF-based models and the limitation of the independence of the six crop-mapping models, the downscaling results were inevitably inconsistent with the statistical values.The agricultural statistical distribution and adjusted subpixel planting areas of the six crops are shown in Figure 6.Evidently, the adjusted spatial distribution of the six crops had a better match with agricultural statistics, and the scope of crop aggregation areas was more consistent with the actual situation.The corrected subpixel crop planting area had a clearer spatial granularity and more prominent spatial contours, which better reflected the details of the spatial distribution of each type of crop.Thus, we considered that the RF-based disaggregating models that combined postprocessing correction performed a reasonable job of reproducing agricultural statistical data and obtaining detailed information on the spatial distribution of crop planting in the study area.

Comparison with agricultural statistics
The regional comparison of agricultural statistics and RF-based downscaling results aggregated at the administrative unit level is shown in the scatter plot in Figure 7.This verified the conclusion given in Section 4.2 that the RF-based downscaling results were well correlated with the statistical  data, regardless of whether or not we considered feature optimization.However, we also noted a certain degree of mismatch, which could be explained as a phenomenon of significant overestimation and underestimation in the estimated crop-planting area aggregated at the county level.The corrected RF-based models significantly improved the matching degree and alleviated the problems of overestimation and underestimation.Moreover, the scatter between the estimated and statistical values of the RF-based models containing PSO feature selection was slightly closer to the 1:1 line.The quantitative performance evaluation results for R 2 , RMSE, and AB in Table 3 further confirmed this description.Table 3 clearly shows that the corrected RF models had significantly improved accuracy compared with the uncorrected ones, with higher R 2 above 0.97, and much lower RMSE and AB.The RF-FS-C model with the best performance had RMSE values below 1% and AB values of 2.83%, 1.53%, 2.93%, 4.61%, 5.24%, and 3.27% for six crops of wheat, maize, oil-bearing, vegetable, orchards, and other crops, respectively.Although the performance of the uncorrected RF models was somewhat poor (particularly for oil-bearing and vegetable), the accuracy was still within an acceptable range in most cases, such as for both wheat and maize with R 2 > 0.73.Because of the complex terrain and heterogeneous smallholder farming landscapes in the study area, the RF regression model inevitably encountered problems of low-value overestimation and high-value underestimation.When the predictive factors were taken as the mean value for each variable at the 1-km scale in the county, the scale effect exacerbated the problem of low-value overestimation in the RF-based models.The RMSE and AB of wheat, maize and other crops appeared to be larger, mainly because the total planting area of these crops was much higher than that of the other three crops.Our results suggested the advantages of integrating satellite-based datasets and agricultural statistics to map subpixel crop-type distributions and to provide a consistent estimation of crop acreage.

Differences in six ecological districts
To explore the spatial distribution characteristics of each type of crop in the study area and the error distribution of various RF-based FCPA mapping models, we calculated the statistical planting area of each type of crop in six ecological districts and the estimated planting area aggregated to the ecological district level (Figure 8).It was evident that all six types of crops were distributed in the hilly areas of the Loess Plateau and in the gully region of the Loess Plateau, followed by the Qinling Zhongshan canyon area and Hexi Corridor gobi district, and the least distributed in the desert and grassland meadow areas of the Gannan Plateau.The estimation of various models in the gully region of the Loess Plateau was more accurate, possibly because of the relatively flat terrain and the lower heterogeneity in this district.In addition, the adjusted RF-based model estimates the planting area of various crops with excellent consistency across the six ecological districts, except for the types of other crops that are slightly underestimated in the hilly areas of the Loess  Plateau and gully regions of the Loess Plateau.The uncorrected RF crop estimation model of oilbearing, vegetables, and orchards had the greatest overestimation, especially in the hilly area of the Loess Plateau and Qinling Zhongshan canyon area.For wheat, corn, and other crops with larger planting areas, the estimation of wheat was relatively accurate in various ecological regions, and the error of corn was overestimated mainly in Hexi Corridor gobi district and Qinling Zhongshan canyon area, whereas the other crops were underestimated mainly in the hilly areas of the Loess Plateau and overestimated in the Hexi Corridor gobi district.In short, more crops are distributed in ecological areas that are naturally suitable for crop growth, which is in agreement with the actual situation.The comparison between the statistical planting area and RF-based downscaled estimates within the ecological district fully confirms the rationality of the RF-based regression model and the necessity of feature selection and postprocessing adjustments.

Comparison with CLCD's cropland
To further analyze the reliability and rationality of the crop distribution trend estimated by the RF model, we also compared the proportion of cropland derived from 30-m LCLD data with the estimated total crop planting area on the 1-km pixel scale (Figure 9).The total crop-planting area on a 1-km pixel refers to the sum of six types of subpixel crop-planting areas.We found that the correlation between the proportion of cropland and the total crop-planting area estimated by the four types of RF-based models (RF, RF-C, RF-FS, and RF-FS-C) was 0.75, 0.69, 0.76, and 0.69, with SSIM index values of 0.434, 0.436, 0.433, and 0.436, respectively.Notably, although the postprocessing adjustments significantly improved the matching between the estimated crop-planting area and statistical data, they may have had a negative impact on the spatial distribution pattern of crops and also may have had a side effect on the spatial distribution pattern of crops.These results revealed that the estimated subpixel crop area not only had good consistency with the CLCD cropland distribution but also retained a reasonable spatial distribution pattern.Because CLCD is an independent data source, it is by far one of the most accurate land use and land cover datasets.This analysis indirectly confirmed the rationality and reliability of our proposed subpixel crop-mapping method.

The impact of model scale invariance hypothesis
A prerequisite assumption of this study is that when constructing the RF model, it is assumed that the causal relationship between the FCPA and various prediction factors remains unchanged at the county-level and 1-km grid scales.A similar assumption has been widely used in various downscaling studies (Lloyd et al. 2017;Wilby and Dawson 2013).With respect to the actual situation, however, the distribution characteristics of the prediction factors on the two scales were inevitably different.When the average value of all grids in a county was used as the predictive factor value corresponding to the county-level scale, it resulted in a significant reduction in the physical distribution threshold for each factor.Because our study area only had 87 counties, this led to insufficient sample representativeness.It is well known that the performance of the RF model depends on the representativeness of the training datasets.Thus, using the model trained on a coarse scale to predict the distribution of subpixel crop-plating areas at the fine scale introduced a certain degree of uncertainty, which may have led to a large estimation error.

Contribution analysis of feature selection and postprocessing adjustment
Based on the previous description, we determined that the RF model including feature selection combined with postprocessing correction, had good accuracy and reliability for estimating the subpixel crop-planting area.To understand the contribution of feature selection and postprocessing adjustment to the subpixel crop area retrievals, we used a pure RF-based model as the cornerstone, and then calculated the differences between the subpixel crop area retrieved by the RF-based model that included only adjustment (i.e.RF-C), feature selection (i.e.RF-FS), and both and the cornerstone model (i.e.RF-FS-C), as shown in Figure 10.We concluded that the contribution of feature selection to the estimation accuracy of the six crops was 12.28%, 19.09%, 13.05%, 6.75%, 4.69%, and 13.26%, whereas the postprocessing adjustment contributed 87.72%, 80.91%, 86.95%, 93.25%, 95.31%, and 86.74%, respectively.Thus, the effects of feature selection and postprocessing adjustment were not fixed but rather varied among the different crop types, reflecting variations in the crop-planting structures.Feature selection alleviated the problem of low-value overestimation in pure RF models to a certain extent, whereas adjustment effectively corrected for both low-and high-value underestimation.

Conclusion
We proposed a new methodology for FCPA mapping based on collaborative medium-resolution geographic grid data and county-level administrative unit agricultural statistical data.First, we executed binary PSO-based feature selection to automatically select the optimal combination of independent variables and to provide valuable insight for understanding variable importance.Second, we constructed an RF regression model to allocate agricultural statistical data at the county level to a grid scale of 1 km × 1 km.Finally, we proposed an IAGA postprocessing correction to reconcile the inconsistency between the statistical data and estimation results.This study considered Gansu Province, one of the regions with the richest crop varieties in China, as the research area and verified the reliability and stability of the method using six main crops (i.e.wheat, maize, oil-bearing, vegetable, orchards, and other crops) as the targets under the circumstances of landscape fragmentation, which have had negative impacts on crop mapping accuracy.These results showed that the subpixel crop distribution maps based on RF regression not only had good consistency with agricultural statistical data in terms of quantity, but also preserved the spatial detail distribution characteristics of geographic data.Compared with the statistical data, the estimated FCPA achieved a high accuracy of R 2 above 0.97, RMSE below 1% for all types of crops, and AB values of 2.83%, 1.53%, 2.93%, 4.61%, 5.24%, and 3.27% for six crops.Compared with the subpixel crop area estimation model based solely on RF, the contribution of feature selection to accuracy improvement was 12.28%, 19.09%, 13.05%, 6.75%, 4.69%, and 13.26%, respectively, and the IAGA postprocessing adjustment contributed 87.72%, 80.91%, 86.95%, 93.25%, 95.31%, and 86.74% to accuracy improvement for six crops, respectively.Moreover, the consistency between the cropland areas obtained in this study and the independent CLCD was 0.70.These promising results suggest that harvesting the complementary characteristics of geographic data and agricultural statistics should be considered when exploring their relationship with subpixel crop acreage.The general approach described in this study can be applied to other geographical regions around the globe with similar availability of data and also can provide new support for the development of a crop map in large regions.Additionally, this approach enriches and develops technical methods for the fusion of geographic and nongeographic data, which can provide new references for the collaborative fusion of multisource data for FCPA mapping.Blending the advantage of two different sources of information may inadvertently introduce new errors because of inconsistencies, especially when agricultural statistical data are not 'accurate and perfect,' and these errors directly affect the accuracy of the final estimation results.Additionally, because of the inherent limitations of RF regression methods, our proposed method is not truly a synergistic estimation of multiple crop areas, and future work is needed to explore the benefit of incorporating new methods of 'multiple input and multiple output.'In short, the results of this study demonstrate the benefits of integrating machine learning techniques, remote sensing, and agricultural statistics to perform more reliable crop-planting area mapping.The study provides a novel solution that downscales county-level agricultural statistics to 1-km spatial resolution FCPA maps.

Figure 1 .
Figure 1.Overview of the study area: (a) geographical location; (b) six ecological districts; (c) elevation; (d) land cover and land use.

Figure 2 .
Figure 2. Workflow of integration of geographic data and agricultural statistics to generate FCPA estimates.

Figure 3 .
Figure 3. Workflow of adjusting the predicted crop-planting areas by RF models to match the agricultural statistics.

Figure 5 .
Figure 5.Comparison of agricultural statistics with disaggregating estimates based on RF-based regression models.The first to sixth rows of subplots from top to bottom represent six types of crops (i.e.wheat, maize, oil-bearing, vegetable, orchards, and other crops) respectively.The first to third columns from left to right are (a1-f1) agricultural statistics, and the predicted FCPA by RF models (a2∼f2) without feature selection and (a3∼f3) with PSO feature selection.

Figure 6 .
Figure 6.Comparison of agricultural statistics with disaggregating estimates based on adjusted RF regression models.Except for adjusting the prediction of RF-based models, the symbols are identical to those in Figure 5.

Figure 7 .
Figure 7.Comparison of agricultural statistics with RF-based downscaled estimates aggregated to the administrative unit level.The first to sixth rows of subplots from top to bottom represented six crops of wheat, maize, oil-bearing, vegetable, orchards, and other crops, respectively.The first to fourth columns from left to right represent downscaled results from 'RF,' 'RF-FS,' 'RF-C,' and 'RF-FS-C,' respectively.A red line portrays the 1:1 line for the estimated census relationship and each scatter represents a county.

Figure 8 .
Figure 8.Comparison of agricultural statistics of six crops with RF-based downscaled estimates aggregated to the ecological district scale.Each subplot represents an ecological area: (a) hilly area of the Loess Plateau; (b) desert district; (c) Hexi Corridor gobi district; (d) Qinling Zhongshan canyon area; (e) gully region of the Loess Plateau; and (f) grassland meadow area of Gannan Plateau.

Figure 9 .
Figure 9.Comparison of cropland distribution from CLCD and estimated total crop-planting area on a 1-km pixel scale: (a) the proportion of cropland calculated using 30-m LCLD data and estimated total crop-planting area by (b) RF, (c) RF-C, (d) RF-FS, and (e) RF-FS-C, respectively.

Figure 10 .
Figure10.The effect of feature selection and postprocessing adjustment.The first to sixth rows of subplots from top to bottom represent six crops of wheat, maize, oil-bearing, vegetable, orchards, and other crops, respectively.The first to third columns from left to right represent the effects of (a1∼a6) adjustment, (b1∼b6) feature selection, and (c1∼c6) a combination of the two, respectively.

Table 1 .
Geographic data and agricultural statistics for fractional crop-planting area mapping.

Table 3 .
Accuracy of the RF-based FCPA mapping models on the county scale.In the table, 'RF' refers to the RF model that does not consider feature selection, 'RF-FS' refers to the RF model that combines PSO feature selection, and RF-C and RF-FS-C, respectively, refer to the postprocessing correction of 'RF' and 'RF-FS.'