Reflectance-based Model for Soybean Mapping in United States at Common Land Unit Scale with Landsat 8

ABSTRACT The objective of this study is to validate the feasibility of a reflectance-based model for soybean crop area classification in advance of the county scale statistics from the United States Department of Agriculture (USDA). This classification method is named Reflectance-based North American Model (RNAM). It operates through the analysis of the main physically driven characteristics of farm fields and their specific radiometric profile obtained from Operational Land Imager (OLI) onboard Landsat 8. The state area of Illinois/US was selected because it is the largest soybean producer and accounted for nearly 35 percent of the total soybeans production in US. Farm fields within a set of 32 counties were analyzed for six crop years between 2013 to 2018. Results obtained from RNAM were compared to official estimates of USDA at county level. Coefficients R2 ranged between 0.92 and 0.96, indicating good agreement of the estimates. Results from RNAM were also validated with the geospatial reference map Cropland Data Layer (CDL) of soybeans from USDA. The overall map accuracy found was 93.86% with Kappa Index of Agreement of 0.795. Thus, RNAM was considered able to provide timely thematic soybean maps, in late September, in advance of the county scale statistics from USDA.


Introduction
Improvements of statistical data in agriculture play a key role in the precise and timely estimates for crop management, trade and pricing policies (Gusso, Ducati, & Bortolotto, 2017;Tilman, Balzer, Hill, & Belford, 2011;Zhong, Yu, Li, Hu, & Gong, 2016). Crop field sizes are also needed for a better understanding of the degree of agricultural capital investment, mechanization and labor intensity (Rudel et al., 2009;Yan and Roy, 2016).
In the United States, the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) created the Cropland Data Layer (CDL), a crop-specific georeferenced output layer, in a way to provide acreage estimates to the Agricultural Statistics Board for major commodities. The CDL is a high-quality spatial reference (King et al., 2017;Liu et al., 2018;Zhong et al., 2016) and is released to the public in January following the end of the typical US growing season. However, prior to the public release, the CDL is considered confidential and market sensitive during the growing season and cannot be released until after the official county estimates are published (USDA, 2018a).
Currently, the NASA Landsat Data Collection (NLDC) provides historical Landsat data products with medium spatial resolution of 30 m, with a geometric quality of the images that allows the imagery composition for time series analysis and guarantee improved pixel geolocation. Historically, Landsat data have been widely used for agricultural studies, mainly due to its long-term record provided with almost 40 years of data archive and freely available for research. However, relative few researches have been done on automated methods for extracting crop field areas and boundaries (Yan and Roy, 2016;Gusso & Ducati, 2012).
Landsat data series has an adequate spatial scale for detecting changes over many crop areas that allows long-term studies of regional and global land cover change (Irons, Dwyer, & Barsi, 2012). The launch of the Landsat Data Continuity Mission (LDCM) on 11 February 2013 will also provide continued agriculture and crop management practice researches by providing reliable timely spatial information. The Operational Land Imager (OLI) onboard Landsat 8 employs the new technologies which enable the data acquisition with much better signal-to-noise performance and higher radiometric resolution (Barsi, Lee, Kvaran, Markham, & Pedelty, 2014;Roy et al., 2014).
In Brazil, important know-how of large-scale agricultural monitoring was gained from partially successful projects as GeoSafras (soybean, maize and rice fields mapping) and CANASat (sugarcane mapping). The main difficulties are usually associated with the management of a large staff team for timely generation of the desired products while handling of a large volume of data for manual digitization of field boundaries (Yan and Roy, 2016;Gusso & Ducati, 2012).
The literature reviewing has shown that automatic processing models of moderate spatial resolution (~30 m) are able to generate timely classification results. In addition, medium-spatial resolution models can generate good mapping accuracy. This work takes a step further combining an automated classification method to obtain timely classification results with good mapping accuracy. The classification method developed in this study is named reflectancebased North American model (RNAM), and fundamentals are based mostly on previous studies described by the reflectance-based crop detection algorithm (RCDA) developed in Gusso & Ducati (2012). It operates through the analysis of the main physically driven characteristics of crop fields and their specific radiometric profile obtained from OLI onboard Landsat 8.
The objective in this study is the development of an improved specific reflectance-based model for an automated pre-peak growing season classification of soybean areas able to generate results in advance to the peak of growing season and to crop harvest statistics from USDA data using medium-resolution satellite imagery.

Study area
In 2017, the average contribution of Illinois State accounted for about 4.79% of the total national soybean crop areas in the United States, and it is the largest soybean producer in the United States with 16.65 million tons (USDA, 2018b).
In 2009, the three top producing states were Illinois, Iowa and Minnesota and accounted for 36% of the total US production (USDA, 2010). The second one, Iowa, is more than 242,000 ha lesser than Illinois. State-level production has shown significant fluctuations especially due to crop rotation with corn. Soybean areas spread over the state. The studied counties are aggregated inside an intensive soybean production area. The 32 counties analyzed are fully covered within a mosaic of two Landsat 8 scenes (path/row 022-032 and 022-033) according to Figure 1.
The covered counties represent nearly 38.5% of the total area of all producing counties of soybean in Illinois. The usual planting dates for soybean go from late May 2 to June 24 being most active between May 8 and June 10 (15% and 85% of the crop is planted), based on agricultural zoning for the Illinois State (USDA, 2010). Depending on the planting date inside this time window, maximum vegetation coverage is observed from early July to early September.

Data set
A complex mathematical combination of data set was needed in a way by the model outputs to generate realistic results according to the specific physically driven characteristics of soybean crop fields in the United States.
The data sources used for the algorithm development were the following: (i) 82 Landsat 8 images of Surface Reflectance Science Product -Collection 1 (path/row 023-032 and 023-033), obtained from Earth Explorer (https://earthexplorer.usgs.gov/), in the time window covering pre-planting to maximum vegetation coverage from late May to late September; (ii) annual soybean agricultural statistics from official NASS data (here after NASS), in state and county scale, were used to compare and evaluate the obtained results from the present soybean area estimation procedure for the 32 municipalities; (iii) we also used data from NASS-CDL mapping program (here after CDL) of crop season 2015 from USDA (2015). The CDL created by the NASS is as raster-formatted, georeferenced, land cover map data product of crop-specific land cover elaborated from multi-temporal satellite data images and is not released until the beginning of the subsequent calendar year, in January, for market-sensitivity reasons (Boryan, Yang, Mueller, & Craig, 2011). In 2012, CDL had a 30-m resolution, and categories were based on unsupervised classification approach with extensive agricultural ground truth using information from the Landsat 5 TM, Landsat 7 ETM+, Disaster Monitoring Constellation (DMC) DEIMOS-1 and UK2 sensors (Boryan et al., 2011;Roy et al., 2014).

Modeling management practices in the United Statesmain challenges
The previous versions of the model (RCDA) did not include the green band and were designed for very specific conditions found in Brazil where very different management practices of soybean cultivation and ecoclimatic conditions prevail. In this improved version RNAM, some adjustments and refinements are required for mapping soybeans under the specific physically driven characteristics found in the United States. Following crop evolution monitoring and soybean areas in the United States, the main challenges are listed below: (1) Land use. Corn and soybean areas, defined as Common Land Unit (CLU), are spatially next to each other and typically cover few acres each one. This is because the typical agricultural landscape in the North Central region of US configures a mosaic of rectangular crop areas of corn and soybean fields with contiguous boundaries (Song et al., 2017). Under these circumstances, CLU is defined as the smallest unit of land that has a permanent, contiguous boundary, a common land cover and land management, a common owner and a common producer in agricultural land associated with USDA farm programs (FSA, 2017). The CLU was created by the USDA Farm Service Agency (FSA) as a georeferenced polygon vector map of cropland coverage. In 22 May 2008, the FSA is no longer allowed to make the geospatial data, including access to the CLU records, available to the public (FSA, 2017) for market-sensitivity reasons. In a remotely sensed point of view, crop fields are discrete entities and need an appropriate spatial resolution to be unambiguously resolved (Chang, Hansen, Pittman, Carroll, & DiMiceli, 2007). Under these circumstances, global data coverage with 250 m as Terra/MODIS products offers a coarser spatial resolution and is not able to detect such small and intricate land-use management. This is because Terra/MODIS pixel sizes of around 6.25 ha (~15.5 acres) are mostly the size comparable to CLU polygons; (2) Planting date. Summer crops of corn and soybeans are also temporally close to each other, so they are planted almost at the same time.
Although corn is usually planted a few days before soybeans, the physiological maximum development stages (plant growing rate) of corn is a little slower than soybean in a way that, considering radiometric characteristics, they mask each other at the maximum vegetation coverage period.

Selected bands of Landsat 8
The soybean areas were directly visually inspected using OLI onboard Landsat 8 by using false-color composition of bands (RGB-564) as usually for mapping vegetation coverage according to good practice recommendations described in Olofsson et al. (2014). During the soybean vegetation development, a rapid increase of the medium-infrared reflectance values is observed, reaching its maximum values after a relative short period. After that, the maximum vegetation coverage is observed, and it is expected to occur in the time window between early July and early September according to planting dates (USDA, 2010), for the East North Central region.
Initially, values of the surface reflectance for soybean during the period of maximum vegetation coverage were tested. During maximum vegetation coverage stage, soybean crop cultivation presents a particular spectral behavior, when compared to other types of regional land-use coverage (Lee, Yao, Huang, Thomson, & Bruce, 2014). At that stage, vegetation coverage is expected to show some response in the green band (0.533-0.590 µm) low reflectance values at the red band (0.636-0.673 µm), high reflectance values of near-infrared band (0.851-0.879 µm) and a very steady reflectance variation at medium-infrared region (1.566-1.651 µm) (Gitelson, Viña, Ciganda, Rundquist, & Arkebauer, 2005;Jensen, 2007). In Brazil, Mercante, Lamparelli, Uribe-Opazo, and Rocha (2009), using data from Landsat 5 TM, found that spectral characteristic from bands 3, 4 and 5 of soybean crop vegetation typically presents calibrated reflectance of about 5%, 50% and 21%, respectively. On the development of the previous version of the model (RCDA) Gusso and Ducati (2012) found reflectance values of 7%, 39% and 15%.
In this study, in a prior visual inspection over soybean crop fields, some fluctuation was revealed through different crop years, especially during the pre-peak growing season (June). This is because those reflectance values, usually available on literature, are averaged from standardized conditions and represent soybean crop fields at full-pixel coverage over soybean areas. Or even represent cultivation areas under other agricultural management practices as the first version of the present studied model. Additionally, at SWIR-1 is closely associated with vegetation moisture (NASA, 2011;USGS, 2003), and so, its behavior through time needs to be more deeply investigated. Table 1 shows the Landsat 8 images selected in this study. However, it is also known to have the property of penetrating thin clouds due to wavelength size (USGS, 2010), which tends to be very useful in a mapping study and land-use change (Gao, Anderson, Daughtry, & Johnson, 2018). So, using an accurate mathematical rule, it is possible to establish algorithms into the model that are able to detect and classify soybean crop area characteristics that remain through time and different vegetation development conditions. Under these conditions, the bands 3, 4, 5 and 6 are especially required for detecting soybean crops during the maximum vegetation coverage. So, considering reduced unit of land sizes that has a permanent, contiguous boundary (CLU), it becomes clear that the main challenge is to identify the reflectance values of OLI in a way to include pure soybean pixels at the normal conditions and also mixed pixels located at the border of soybean fields (Gusso & Ducati, 2012), under the main physically driven conditions found in farm fields of US.
The narrower NIR band, higher signal-to-noise performance and better radiometric resolution indicate that the new OLI sensor has the potential to be less affected by atmospheric conditions and to be more sensitive to surface reflectance variability (Ke, Imb, Lee, Gong, & Ryu, 2015). Actually, it was observed from the test sites that soybean vegetation reflectance even under different development conditions, stands around 3% in the green band (band 3), lower than 4% in the red (band 4), greater than 38% in the NIR (band 5) and greater than 0.17 in the SWIR-1 (band 6). In terms of reflectance, these are crucial input parameters for soybean characterization. However, the threshold value for the green band (band 3) was increased till 0.06 (6%) making it more permissive.

Results and discussion
Area estimates with RNAM Soybean areas provided by RNAM were estimated at county level and compared to official estimates provided by NASS using regression analysis. No early estimate could be provided before 2013 crop season as this was the first crop year in which the Landsat 8 data became available. County level statistics from official NASS are not released shortly after harvest but rather they are released few months after the end of crop season. Figure 2 presents scatterplot diagrams and regression analysis for the 32 municipality estimates from the official NASS procedure and from RNAM for the crop years 2013-2018. Coefficients of determination (R 2 ) were between 0.919 (2017) and 0.950 (2015), with an overall average agreement of 0.945 between the estimates, according to Figure 3. The averaged slope value around 0.935 for all tested crops also indicates good agreement.
In Figure 3, the comparing totals from NASS and RNAM in counties estimates level are shown. In 2017, the total area mapped in 32 counties is 1.701 million ha with RNAM and 1.746 million ha from NASS, as shown in Table 2. The group of points nearby 100,000 ha (at NASS axis) represents the crop area of three greater counties inside the studied area (McLean, Livingston and Champaign).

Validation of the RNAM
In a way to understand the deviation from NASS estimation and RNAM data, the performance of the RNAM was also evaluated using a thematic CDL map of soybeans.
When comparing the obtained total soybean crop area from CDL NASS in 2015 (1,526,366 ha) with obtained from RNAM (1,375,993 ha), we have found a difference of 10.9% between them (Table 2). However, when comparing the total soybean crop area from CDL NASS with the official NASS county data in 2015 (1,617,814 ha), we also found a difference of −5.9% between the two official estimates. Especially when considering two points as follows: (a) the spatial similarity between measures using satellite data (RNAM and CDL); and this result also points out the possibility that official NASS county data (of soybeans) are slightly overestimated in 2015 crop season. It is important to note that although remotely sensed procedures as pixel counting are usually biased when  compared to the official data (Gallego, 2004), this difference between two official estimates distributed among 32 counties represents around 91.0 thousand ha. This total is quite comparable to the total soybean areas of the top three producing counties among the 32 studied ones: McClean, Livingston and Champaign with 121.4, 112.9 and 101.5 thousand ha, respectively.
In the spatial accuracy analysis, the confusion matrix and the overall map accuracy were provided. Table 3 shows the confusion matrix resulting from the comparison between model estimate RNAM and mapping provided by CDL. The overall map accuracy between them was 93.86% with Kappa Index of Agreement of 0.795 indicating a high spatial consistency between those maps, as also shown in Figure 4. Song et al. (2017) have found overall accuracy of 86%. King et al. (2017) have found 95%. Yan and Roy (2016) have found 92.7%. In the classification map assessment, usually Kappa Index of Agreement with values greater than 0.5 are considered good results (Ismail & Jusoff, 2008).
Typical values for soybeans are found around 0.03 (3% reflectance). This value was changed to a maximum of 0.06 (6% reflectance). Positive effects of this change are twofold: (1) allowed to detect emerging soybeans at the fields and (2) preidentification of low-productivity soybean profiles due to environmental/crop management stress as abnormal weather fluctuations. However, it is important to note that most of errors are still associated to date related features. Especially the remaining those ones associated to very late/late planting dates. Furthermore, CDL mapping is also dependent of typical spectral profile analysis of soybeans in the fields. So, at a first glance, this result suggests that either CDL or RNAM mapping accuracy has been affected by planting date-related uncommon spectral features of soybean vegetation profile as a consequence of planting dates nearby the time window borders (very early or very late ones). Although delayed planting dates occur, this practice tends to be mitigated in the next years by producers due to the risk of snow occurrence during crop harvesting.
Finally, it is important to note that RNAM has been evaluated against another geospatial layer at a pixel-bypixel basis, which indicates far more reliable results than using ground collected data for the evaluation of mapping accuracy (sparsely distributed samples). A direct visual inspection between the two maps and also the confusion matrix analysis of CDL and RNAM suggests that the planting date-related features in 2015 are increasing misclassification around 4%.
We have also observed that besides residual cloud contamination, a direct visual inspection on the omission errors shows that several pixels were mainly related to very low-yield potential profile (especially those far delayed ones). This could also lead to some pixels to be neglected in the pixel-by-pixel processing chain of the RNAM. In this way, even when they are inside soybean CLU areas, very low-yield potential plants (due to environmental/crop management stress as abnormal weather fluctuations) do not perform a minimum radiometric response in a way to be tagged as ordinary soybean cultivation. The influence of such agrometeorological source of errors in RNAM needs further investigation.
Because of the sensitivity of RNAM using medium-resolution satellite data to the main physically driven conditions in the United States as described previously (land use, planting date, crop area extension and coverage and crop rotation), the applicability to other data sets of coarse spatial resolution may be limited. The main modeling difficulty in previous crop seasons (till around 2011), may be associated with date-related features. Next, the main difficult using coarse spatial resolution will probably be associated with low-yield potential profile of vegetation due to climate and weather fluctuations. Ultimately, these ones are due to very small agriculture unit areas which are constantly changing through crop seasons (soybeans, corn and other cultures).

Conclusions
We have found that under normal conditions of Landsat 8 imagery quality and crop management practices in US crop fields, a precise soybean estimation can be released in mid-late September.
The RNAM development presented an objective and consistent methodology for classification of soybeans and provided crop areas with estimates that are similar to the official statistics available to public.
Because the model processing is designed for physically driven condition inputs (reflectance), RNAM represents a considerable gain in spatial information compared to most of the available satellite data procedures.
Finally, the three main advantages provided by RNAM are as follows: (1) it is a pre-peak growing season methodology and thus offers results prior to the peak of growing season and several weeks prior to crop harvest period; (2) pre-peak growing season analysis also helps to mitigate detection problems associated with date-related features from very early/early planting dates; (3) it is timesaving and requires no training data, no preprocessing parameterization and/or objective human interaction. RNAM requires only as input medium-resolution satellite as Landsat 8 data imagery. Thus, RNAM is considered able to provide timely thematic soybean maps in advance of the county scale statistics of harvest from USDA providing reliable and adequate spatialized information.