County-level corn yield prediction using supervised machine learning

ABSTRACT The main objectives of this study are (1) to compare several machine learning models to predict county-level corn yield in the study area and (2) to compare the feasibility of machine learning models for in-season yield prediction. We acquired remotely sensed vegetation indices data from moderate resolution imaging spectroradiometer using the Google Earth Engine (GEE). Vegetation indices for a span of 15 years (2006–2020) were processed and downloaded using GEE for the months corresponding to crop growth (April–October). We compared nine machine learning models to predict county-level corn yield. Furthermore, we analyzed the in-season yield prediction performance using the top three machine learning models. The results show that partial least square regression (PLSR) outperformed other machine learning models for corn yield prediction by achieving the highest training and testing performance. The study area’s top three models for county-level corn yield prediction were PLSR, support vector regression (SVR) and ridge regression. For in-season yield prediction, the SVR model performed comparatively well by achieving testing R2 = 0.875. For in-season corn yield prediction, SVR outperformed other models. The results show that machine learning models can predict both in-season yield (best model R2 = 0.875) and end-of-season yield (best model R2 = 0.861) with satisfactory performance. The results indicate that remote sensing data and machine learning models can be used to predict crop yield before the harvest with decent performance. This can provide useful insights in terms of food security and early decision making related to climate change impacts on food security.


Introduction
Corn is one of the important crops globally which is used for several purposes such as food, fodder, and producing ethanol (Ranum et al., 2014).United States (US) is one of the largest corn producing country, which accounts for approximately 36% of corn production globally (Green et al., 2018).The increasing population and the changing climate pose a threat to global food production.The United Nations, Sustainable Development Goals aim to eliminate food insecurity by 2030.Therefore, accurate crop yield estimation is important.County-level yield estimation can be very important for several stakeholders and provide key decision-making information.This is since key policy decisions, such as the import/export of grains, are related to these estimations, which need to be done accurately and in a timely manner (Krasnova et al., 2021;Ma et al., 2021).In-season crop yield prediction can also help in ascertaining problems in key areas and early intervention to address those issues (Shahhosseini et al., 2021).
Crop yield and its estimation are affected by several factors, which can be grouped into two main categories of factors: biotic and abiotic factors.The biotic includes the crop genotype, diseases, weeds, pests, etc.On the other hand, abiotic factors generally include the environmental factors such as soil quality, nutrient availability, and climatic factors.The interaction between biotic and abiotic factors in agricultural systems can significantly affect crop yield (Wairegi et al., 2010).For crop yield prediction, parameters such as crop type, growing and harvesting dates, soil moisture, soil temperature, precipitation, land surface temperature, crop phenology, and several other parameters are typically used.An increase in the number of factors that affect crop yield can increase the uncertainty in prediction models.Moreover, collecting data for so many parameters is costly and time-consuming.
With advancements in remote sensing and computing systems, the potential of remotely sensed data for estimation of crop yield had been explored in the last few decades.Remote sensing data collected with high spatial and temporal resolution can monitor crop growth and assess related stresses, which can affect yield and can help estimate crop yield.Earlier studies have shown that crop yield can be predicted using remote sensing data at various scales (Feng et al., 2021).Remote sensing-derived vegetation indices (VIs) can better characterize vegetation conditions than raw reflectance values, as they highlight the vegetation part and downgrade the background information.VIs such as the Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Soil-Adjusted Vegetation Index, Green Chlorophyll Index, Normalized Difference Water Index, etc. have been used for crop yield prediction and forecasting (Son et al., 2013;Wall et al., 2008).The usage of VIs is not limited to crop monitoring only and can characterize several other phenomena, such as vegetation change, phenology, and other phenomena (Ahmad et al., 2021;Khan et al., 2020).Other than VIs, several other datasets, such as climate, environmental, and soil data, have also been used for crop yield prediction in previous studies (Khaki & Wang, 2019).Climate data such as temperature, precipitation, and vapor pressure deficit are very important for crop growth monitoring and thus yield prediction.On the other hand, soil variables such as bulk density, soil pH, available water storage and soil organic carbon have also been used for yield prediction.These variables affect the crop growth conditions and changes in any of the above factors can improve or reduce grain yield which is why they can be helpful for yield prediction.
Remotely sensed data for crop yield estimation is usually utilized in two types of models, i.e. processbased crop simulations (Guan et al., 2017;Jones et al., 2017;Muller et al., 2017;Peng et al., 2018) or empirical statistical models (Bocca & Rodrigues, 2016;Gornott & Wechsung, 2016;Guan et al., 2017;Kern et al., 2018).Crop simulation models the crop growth at various crop phenological stages which limits its applicability only to farm or field levels (Sun et al., 2019), whereas empirical statistical models can be applied at regional or national levels as they require fewer input features as compared to process-based models.Over the last decade, artificial intelligence has been applied in agriculture including crop yield estimation owing to its capacity to statistically characterize relationships between crops yield and its determining factors.The biophysical factors which are indicators of crop yield change with time and often depict non-linear relationships with crop yield (Dash et al., 2018;Whetton et al., 2017;Wieder et al., 2018).Machine learning (ML) techniques can better model the dynamic factors affecting crop yield within the growing season (Elavarasan et al., 2018).ML algorithms for predicting crop yields proved to be robust and accurate as compared to traditional statistical crop yield estimation models (Cai et al., 2018;Johnson et al., 2016;Pantazi et al., 2016Pantazi et al., , 2016)).
ML algorithms include supervised and unsupervised algorithms.Supervised ML models involve labelled data (Caruana & Niculescu-Mizil, 2006).ML algorithms like Decision Tree (DT), Random Forest (RF), Bayesian Network, Artificial Neural Network (ANN) and Support Vector Machines (SVM) are supervised ML models, whereas unsupervised ML algorithms such as Markov chain model, K-means clustering, expectation maximization algorithm, density-based spatial clustering of applications with noise and Apriori algorithm process unlabeled data (Hahne et al., 2008).DT method which partitions the data into subsets based on certain features for decision making is simple to understand and identify important features but tends to overfit the data (Xu et al., 2005).RF being an ensemble method uses multiple DTs for prediction, handles large number of input features and is less prone to overfitting as final prediction is based on average of all predictions.ANNs consist of multiple layers of interconnected nodes like in human neurons to identify complex relationships between large number of variables but can be prone to overfitting.The support vector regression (SVR) model finds a hyperplane that best fits the data points while minimizing the error between the predicted values and the actual target values (Asif et al., 2023;Li et al., 2023;Liu et al., 2023).SVR model is very effective in high dimensional spaces and can handle non-linear decision boundaries.However, it can be sensitive to kernel function and can be computationally expensive for large datasets.
Importantly, supervised ML algorithms rely on provided input data and corresponding output values to best model the relationship between independent features and target feature.Among various supervised ML algorithms, neural network, linear regression algorithms, RF and SVM are widely used in crop yield estimation studies (van Klompenburg et al., 2020).ANNs were used to predict wheat and rice yields using within seasondata of various crop yield limiting factors (Baral et al., 2011;Russ et al., 2008).Johnson et al. (2016) used multiple linear regression (MLR) and ANN to predict barley, canola, and spring wheat crop yields from satellite derived VIs, i.e.NDVI and EVI (Tariq & Qin, 2023;Tariq et al., 2023).Polynomial regression ML model along with satellite derived NDVI and leaf area index (LAI) derived from field spectrometer performed better than logistic regression to predict maize yield (Kunapuli et al., 2015).Similarly, counter-propagation ANNs (CP-ANNs), XY-fused networks (XY-Fs) and supervised Kohonen networks (SKNs) multilayer soil data and NDVI were used to predict wheat crop yield.Reportedly, SKN performed better with overall accuracy of 81.65% as compared to CP-ANNs and XY-Fs with 78.3% and 80.92% accuracy respectively (Pantazi et al., 2016).RF as compared to MLRs reported to be versatile method for global and regional level wheat, maize and potato crop yield prediction based on its high precision and accuracy (Jeong et al., 2016).RF was used to predict sugarcane yield with decent accuracy using simulated biomass indices, observed climate and seasonal climate prediction indices (Everingham et al., 2016).Spiking neural networks (SNNs) were able to predict winter wheat crop yield six weeks before harvest with an average accuracy of 95.64% using moderate resolution imaging spectroradiometer (MODIS) 250-m resolution timeseries and historical crop yield data (Bose et al., 2016).Furthermore, MODIS NDVI timeseries and an ensemble model of ANNs were used to predict sugarcane yield in Brazil with relative root mean square error (RRMSE) of 8% and with the coefficient of determination (R 2 ) of 0.61 (Fernandes et al., 2017).Khanal et al. (2018) integrated high resolution remotely sensed data, crop monitor yield dataset, one-meter digital elevation model and soil properties data to predict soil properties and corn yield using ANN, SVM and RF (Tariq et al., 2022(Tariq et al., , 2023;;Wahla et al., 2022).RF was reported to outperform other methods with soil and VIs as input as compared to topographic variables (Tariq & Mumtaz, 2022;Tariq et al., 2023).
Studies mentioned above have used ML models to predict crop yield at several scales.Moreover, several features were used in these studies to predict crop yield.As mentioned earlier, collecting data for several features is not always feasible.Furthermore, crop yield prediction is sensitive to the scale of yield prediction.Since several ML models have different working principles, we hypothesize that the performance of different ML models will be diverse with some models showing superior performance over others.Moreover, since endof-season yield prediction utilizes more data than inseason yield prediction, their prediction accuracy will be higher than in-season yield prediction (Sun et al., 2019).Therefore, we proposed to compare several ML algorithms for predicting county-level corn yield.We train ML models using NDVI and EVI data obtained from MODIS.Moreover, we compare the performance of in-season and end-of-season yield prediction ML models.Specifically, we aim to address the following research questions: (1) What is the predictive performance of several machine learning models to predict countylevel corn yield? (2) How well can machine learning models predict county-level corn yield using in-season and end-of-season data?
The remainder of this article is structured as follows: Section 2 details the study area, data, and methods used in this study.Section 3 presents the results, while Section 4 discusses the results and their implications.Conclusions are presented in Section 5.

Study area
This study was carried out in the state of South Dakota, US.South Dakota is one of the US Midwestern states which is responsible for a considerable amount of corn production in the US (Olson et al., 2007).South Dakota is situated between approximately 42.5° N to 45.95° N latitude and 96.43° W to 104.06° W longitude bordered by Nebraska to the south, North Dakota to the north, Minnesota to the east, and Wyoming and Montana to the west.The study area has a diverse climate, spanning from a humid continental climate in the east to a semiarid climate in the western areas.Characterized by four different seasons, the climatic conditions extend from extremely cold and arid winters to warm and semi-humid summers (Todey et al., 2009).In summer, the average temperature of South Dakota ranges from 20.5°C to 32°C while in winter the average temperature ranges from −12°C to −3.8°C.South Dakota has a total of 66 counties.The eastern part of South Dakota is responsible for major crops production.
The main crops of South Dakota are corn, soybean, alfalfa, and wheat.Figure 1 illustrates the study area map.

Corn yield data
County-level corn yield data were collected from the United States Department of Agriculture (USDA) National Agricultural Statistical Service.The yield data is reported annually by USDA which represents the harvested yield after each season.In South Dakota, corn is usually harvested from October to early November.The yield data was downloaded through the NASS web service Quick Stats (https://www.nass.usda.gov/Quick_Stats/,accessed 18 December 2021).Yield data is reported in bu/ac; however, we converted it to MT/ha for this study.County-level yield records of the study area were collected and mapped to its geographic data using the county-level shapefiles obtained from the US Census Topologically Integrated Geographic Encoding and Referencing (TIGER) project (Marx, 1986).

Vegetation indices (VIs)
Several VIs are available to monitor vegetation health; however, NDVI and EVI are the most widely used VIs for crop health monitoring and yield prediction.For this study, we used the NDVI and EVI derived from MODIS data (Didan, 2015).VIs instead of raw reflectance provides unique capabilities since they highlight key vegetation conditions such as vegetation health, diseases and stresses which aids in better interpretation.This is because certain electromagnetic spectrum bands are sensitive to vegetation conditions such as vigor and stress.NDVI value ranges from −1 to 1. High positive values of NDVI show healthy green vegetation, while lower values represent relatively unhealthy vegetation.Values less than 0 generally show an area with no vegetation.EVI is a modified form of NDVI mostly used for areas with higher biomass.EVI also minimizes the soil and atmospheric effects due to the coefficients of aerosol resistance terms.To extract the VIs values on the county level, we used MODIS product (MOD13Q1).MOD13Q1 is a processed remote sensing product derived from MODIS surface reflectance and has undergone several levels of processing to remove known sources of errors and noises.We extracted the time series of NDVI and EVI each during the study area's corn growing season (April-October) for 15 years (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018)(2019)(2020).Since the temporal resolution of MOD13Q1 is 16 days, there were 12 time series observations for each county each year.

Croplands data layer (CDL)
CDL is a specialized land cover dataset developed by USDA with a primary focus on delineating various crop types (Craig, 2010).The spatial resolution of CDL is 30 m provided annually by USDA.
In this research, we used the CDL mask to exclude non-corn pixels from the VIs dataset.Since county-level yield the aggregation of VIs to the county level, it is very important to eliminate the pixels which belongs to other crops or other land cover types.The CDL mask was applied to MODIS VIs in Google Earth Engine (GEE) (Gorelick et al., 2017).

Data preprocessing
The following steps were applied in data preprocessing: (1) Shapefiles for county and state-level were obtained from the US census TIGER project.Since, TIGER shapefiles are available in GEE, they can be directly imported using their production details available in GEE.Using the study area details, the shapefile were filtered to include counties in the study area only https:// developers.google.com/earth-engine/datasets/catalog/TIGER_2018_Counties (accessed December 20, 2021).
(2) The boundaries of the study area were used to get MODIS VIs during the study period (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018).Specifically, MOD13Q1 product which is available on GEE was used.The boundary layer of study area used to retrieve the VIs for the study area.(3) CDL data was used to filter the VIs and only corn pixels were kept.Pixels of other crops and other land cover types were eliminated.To filter the data using CDL, a binary raster of CDL was generated.The binary mask used to eliminate non-corn pixels.(4) The resulting data was aggregated to countylevel using median operation and downloaded from GEE. Median aggregation operation is available in GEE and is widely used to aggregate data.
(5) The downloaded data was again matched to county-level records and yield data was integrated with the VIs.County-level yield records were obtained from USDA.The records were matched based on GEOID which is a unique identifier consisting of county and state codes.( 6) Data was cleaned and counties where no yield was reported were removed from the study.In this step, several cleaning operations were applied on the data.Counties where no yield records were provided by USDA were eliminated.Counties where CDL have no corn pixels were also eliminated from this study.Figure 2 illustrates the step-by-step workflow of this study.

Methods
The following ML methods were used to predict county-level corn yield.

Multiple linear regression (MLR)
MLR is generally used to estimate the value of one (dependent variables) using multiple (at least two) independent variables (Eberly, 2007).This is the very basis of regression when the relationship between several independent variables and one dependent variable is determined.For example, how different values of rainfall, temperature and vapor pressure deficit affect the crop yield can be determined by establishing the relationship between those variables and crop yield.In MLR, a linear line is fitted between the dependent variable and each independent variable which is represented in form of mathematical equation.The mathematical formula of MLR is shown in Equation 1: where y represents the dependent variable (crop yield in this case), β 0 is y-intercept (the value dependent variable when all the independent variables are zero), β i represents the regression coefficient of independent variable x i while e is the model error which the model variations in estimated y.There are several assumptions associated with MLR which are as follows: (1) MLR assumes that the relationship between the dependent variable and independent variable is linear, (2) all the independent variables are normally distributed, (3) there is no multicollinearity in the data i.e. independent variables are not correlated with each other, (4) the mean of residuals (different between actual and observed values) is zero, and (5) residuals are independent of each other.However, these assumptions are not always true which reduces the model performance.

Polynomial regression (PR)
In linear regression, as discussed above, the model assumes the relationship between independent and dependent variable to be linear which is not always true and may make the model inaccurate.Sometimes the straight linear line is not able to capture the relationship and we need complex models.Polynomial regression is another type of regression where instead of completely linear equation, we fit the data using a curvilinear equation and establish the relationship between dependent and independent variables (Ostertagová, 2012).The term "polynomial regression" derives its name from its fundamental approach of expressing the relationship between dependent and independent variables through the utilization of a polynomial equation of the nth degree.The mathematical formula of a polynomial equation is shown in Equation 2:

Ridge regression (RR)
RR is a modified form of linear regression which adds a penalty term to OLS regression to avoid overfitting in the function (McDonald, 2009).Generally, the penalty term is proportional to the square of magnitude of coefficients.This regularization helps improve the model by avoiding overfitting and improving models' generalization.RR model is very helpful in studies where there is possibility of multicollinearity in the independent variables.RR uses l2 regularization which add the penalty term to the squared values of model's coefficients.Equation 3 illustrates the mathematical equation for RR, where λ is the regularization term and the term P p i¼1 β i represents the sum of the squares of the coefficients, penalizing large coefficients:

Lasso regression (LR)
LR is another regression technique which uses the same method to apply regularization like RR; however, it uses the l1 regularization instead of l2 (Ranstam & Cook, 2018).The key difference between l1 and l2 regularization is the l1 regularization add the penalty term to the absolute values of model's coefficients while the l2 regularization adds the penalty term to the squared values of model's coefficients.Like RR, LR is also used in problems where there is potential of multicollinearity in the independent variables.The mathematical equation of LR is depicted in Equation 4. As shown in all the equations, the only difference between MLR and other two regression models, i.e.RR and LR, is that they have additional penalty terms which works as a regularizer.

Partial least square regression (PLSR)
PLSR is a technique used to analyze and model the relationship between several independent variables and a dependent variable.PLSR is very useful when many highly correlated predictors are used to model their relationship with the dependent variables.It is particularly useful when the number of samples is relatively small.It transforms a large set of correlated predictors to a small number of uncorrelated predictors.The resultant predictors account for explaining the maximum amount of variance in both the input and response variables.Subsequently, the uncorrelated variables are used to construct a linear regression model between the dependent and independent variables.Equation 5 illustrates the mathematical working of PLSR model.The only difference in PLSR as compared to the previous models is the presence of terms (t 1 ; t 2;:: t n Þ: In PLSR, the term t n represents the scores of the nth PLSR component.These scores are derived from a linear combination of the original predictor variables (x) and capture the maximum covariance between x and y. (5)

Decision tree regression (DTR)
DTR uses decision trees to model the relationship between dependent and independent variables (Xu et al., 2005).DTR models recursively divide the data based on the features which provides the most information again about the predicted or estimated variable.This process is executed recursively until some conditions are met.The DTR model estimates the final value by averaging the predicted values from all the individual trees for the same sample (Asadollah et al., 2021).Equation 6illustrates the equation of DTR.y is the predicted value at the leaf node while n is the number of samples:

Support vector regression (SVR)
SVR is another supervised learning method which is used to resolve regression problems (Smola & Schölkopf, 2004).SVR is the counterpart of famous ML learning model -SVMs which is widely used for classification related problems.SVR constructs a hyperplane in high-dimensional space so that it represents the data very optimally.SVR uses a kernel function which initially converts the data to a very high dimensional space (Ramedani et al., 2014).The model then uses the high-dimensional data to find a hyperplane which is very optimal.The optimal hyperplane is then used to make predictions on new data samples.The most important parameter in SVR model is selection of kernel function since the choice of kernel function determines the transformation of data into high-dimensional space (Üstün et al., 2007).
The mathematical formula of SVR model is shown in Equation 7, where y is the predicted output, x is a set of features, w is the weight vector which determines the optimal hyperplane and b is the bias term.Equation 8shows the constraint which the SVR model aim to satisfy while finding the optimized hyperplane, where is the allowed margin of error.

Random forest regression (RFR)
RF is an ensemble learning method being used for both classification and regression tasks (Breiman, 2001).RFR is an ensemble learning model used for regression problems based on the decision tree algorithm (Huang & Liu, 2022).Multiple decision trees are combined to form a RF model.In RF, each decision tree is built using a random subset of variables, and each tree makes its own prediction.In classification problems, the majority vote determines the predicted class, while in regression problems, the final prediction is the average of the predictions from all the trees (Belgiu & Drăguţ, 2016).One of the major benefits of the RF model is its ability to handle nonlinear data very effectively.This makes it a popular choice for solving complex regression problems.Equation 9shows the mathematical formula of RFR model, where y is the regression results, N trees is the number of decision trees in the RF model while f i x ð Þ is the result of i-th decision tree for input features x.The RF equation shows that final output is the average of all the trees present in the RF model:

Gradient boosting machines (GBM)
GBM is a ML model used for regression and classification problems (Friedman, 2001).The GBM model also belongs to the family of ensemble learning models which use multiple models to achieve better prediction performance (Sagi & Rokach, 2018).The main idea behind the GBM model is to combine several weak models to create a strong model that can improve prediction performance.The model works in such a way that each tree corrects the errors of the previous tree sequentially, thus minimizing the loss function as the model is trained.Initially, GBM uses a decision tree for training and then uses the errors generated by the first tree to train a subsequent tree, which improves the predictions and reduces the error (Spedicato et al., 2018).The mathematical equation of GBM for regression is shown in Equation 10.All the trees are trained based on the errors generated by preceding trees, and final predictions are made based on the collective output of all the trees.GBM is widely used due to its ability to handle complex datasets and produce highly accurate predictions.The term f 1 ; f 2; :::f n represents the prediction from nt h tree:

Experimental setup
In this study, our aim was to predict county-level corn and compare nine different ML models for countylevel corn yield prediction.We used corn yield and VIs data from 2006 to 2020.Furthermore, we compared the performance of in-season and end-of-season data to assess the accuracy of ML models at early stage.End-of-season yield prediction generally produces good results; however, it is important to predict yield early and before the harvest.In this context, the whole season was divided into approximately two parts based on the input data.Since the input data contained 12 timesteps from April to October with 16 days temporal resolution, for in-season yield prediction, we used the first six timesteps.The data was split into 70% and 30% for training and testing, respectively.For a reliable comparison of ML models, the training and testing data was saved independently to avoid a random split for each model.Hyperparameter tuning of models was performed using 10-fold cross validation which is effective for county scale crop yield prediction (Shahhosseini et al., 2021).To assess the performance of ML models, we used coefficient of determination (R 2 ), root mean squared error (RMSE), and relative root mean squared error (RRMSE %).
Figure 4 shows the temporal distribution of yield in the study area.Corn yield in the study area is consistently increasing except for 2012 where extreme droughts hit the Midwestern US affecting the corn yield.

Performance of ML models
Nine ML models were tested to predict county-level corn yield in South Dakota, USA.Tables 1, 2 and 3 illustrated the training and testing R 2 , RMSE, RRMSE (%) for all the ML models used in this study.The results of all ML models were considerably high.PLSR model outperformed the rest of the ML models with R 2 = 0.902 and 0.861 for training and testing, respectively.The second-best model for corn yield prediction was SVR model with R 2 = 0.950 and 0.856 for training and testing, respectively.The third-best model for county-level yield prediction was RR with R 2 = 0.867 and 0.854, respectively.Other evaluation metrics followed the same pattern.The top three models for county-level corn yield prediction are presented as bold and underlined in Tables 1, 2 and 3 three leading models do not manifest consistent proclivities towards overestimation or underestimation.Year-wise scatterplots of best model (PLSR) is illustrated in Figure 8.

In-season and end-of-season yield prediction
One of the important aspects of crop yield prediction is the time of yield prediction.For end-of-season yield prediction, we used the full data, i.e. 12 timeseries of both NDVI and EVI.To compare in-season and endof-season prediction, we used the three best models from Table 1 to predict county-level corn yield prediction in the study area.The three best models were selected based on model performance such as R 2 and RMSE values.Results of in-season and end-of-season yield prediction are presented in the subsequent sections.Table 4 presents the results of in-season corn yield prediction using the top three ML models.
Compared to the end-of-season yield prediction (Table 1), the performance of three ML models dropped since they are utilizing a smaller number of timeseries features; however, the drop in SVR model performance is comparatively less.For end-of-season yield prediction, the RR achieved R 2 = 0.854.The same model achieved R 2 = 0.688 for in-season yield prediction.The drop in model testing R 2 was 0.166.
Similarly, the PLSR model achieved R 2 = 0.861 and 0.692, respectively, for end-of-season and in-season yield prediction with a drop in model testing R 2 = 0.169.Finally, the SVR model achieved R 2 = 0.856 and 0.771 for end-of-season and in-season yield prediction, respectively, with a drop in model testing R 2 = 0.085.This shows that RR and PLSR experienced a nearly similar drop in model performance with a small number of features; however, the performance of SVR model was less affected by reducing the number of features (6 timeseries features instead of 12).Results obtained from these models are very important in terms of yield prediction using long-term data for food security and agricultural decision-making.

Discussion
Traditionally county-level yield is estimated through surveys which is time-consuming and costly.
Secondly, the traditional county-level crop yield  estimates can only be done once the crop is harvested.
After harvesting, it is not possible to address policy related decisions.Since ML models can predict county-level corn yield with considerable accuracy, the models can be used on large-scale to predict yield before the harvest (Sun et al., 2019).
In terms of performance, three models performed very well as compared to the rest of the models.PLSR, SVR and RR outperformed the rest of models by achieving high training and testing performance.Overall, only two models, i.e.MLR and DTR, did not perform too well as compared to the rest of the models due to some inherent limitations.MLR assumes the relationship between the dependent and independent variables as linear which is not always true.Furthermore, MLR model also assumes the errors to be normally distributed.Linear models are also affected by the presence of outliers as they assume the data does not have any outliers.When any of the above assumptions are not true in the data, it affects the model performance.The low performance of DTR model can be attributed to its nature as DTR model is generally prone to overfitting when a smaller number of data samples are available.These findings are important in the context of agriculture since mostly a very small number of samples are available.Our results agree with other studies which have used the PLSR model in different domains (Ali et al., 2023;Cheng & Sun, 2017).The performance of PLSR, SVR, and RR owes to the nature of these models which have a few advantages over others.Firstly, they can handle multicollinearity effectively.Multicollinearity is the presence of high correlation among the features.MLR is very sensitive to multicollinearity, thus affecting its performance.On the other hand, PLSR can overcome this issue by creating  new composite variables which reduces multicollinearity in the features.Another advantage of PLSR model is its ability to handle small sample size effectively and high dimensional data since this study involves 12 timeseries of two predictors (24 features).
On the other hand, SVR model has the ability to model non-linear relationships between the input and output data.This makes SVR superior over linear models such as MLR.On the other hand, the SVR model can capture non-linear relationships between the input and output data, giving it an advantage over linear models like MLR.Furthermore, SVR can yield better results in situations with limited data.This advantage stems from SVR's inherent capabilities, such as robustness to outliers and model simplicity, which aid in better generalization.These qualities help   prevent overfitting and contribute to superior performance over other models, even when data is limited.
Although RR is a variant of linear regression, it adds a regularization term to ordinary least squares (OLS) which helps improve the model performance and reduces overfitting.
Our results also prove that VIs derived from satellite data can predict corn yield with high accuracy.VIs as compared to raw reflectance has the ability to highlight the vegetation part (Khan et al., 2022).This is since healthy vegetation reflects most of the incident light in the near-infrared region of electromagnetic spectrum and absorbs most of the light in red region.NDVI have been used very extensively for crop monitoring and estimating crop parameters such as LAI, chlorophyll content, nitrogen content and crop yield.Similarly, EVI is an enhanced VI which can model the high biomass regions very effectively.Since VIs focus more on the vegetation part, they can explain the variability in yield.Another aspect of MODISderived VIs is their global availability.Mostly, the products derived from MODIS data are accessible on a global scale and can offer valuable insights in regions where collecting data on other variables is challenging.
In-season yield prediction is very important for early planning and decision-making.Figure 8 shows the scatterplots and model performance in terms of R 2 , RMSE and RRMSE (%).The in-season yield prediction models also produced considerable results.The training and testing R 2 of SVR model for in-seasoncorn yield prediction was 0.87 and 0.77, respectively (Figure 9).This means the corn yield can be predicted with 77% accuracy in the mid-season using VIs derived from satellite data.Results produced by the models using in-season data show that the first half of growing season is very important for county-level corn yield prediction.Although the second half of the growing season provides valuable additional information, in terms of yield prediction they are not as much important as the first half.This is in line with previous research where county-level soybean yield was predicted using data from start of growing season to end-of-season with multiple timesteps.In the growing season, each timestep adds unique information about crop conditions to assist in predicting yield; however, the information obtained during initial stages proved to be more useful for yield prediction.In terms of model performance for in-season yield prediction, the overall performance of all models reduced due to a small number of features; however, the SVR model was less affected by reducing the number of features.In the presence of a small number of features, the superior performance of SVR model indicates that it is more robust and can be used for yield prediction in future studies when a small number of features are available.
The availability of data on large scale and high temporal resolution provides several opportunities to explore crop yields and other related parameters (Jin et al., 2020).Furthermore, such studies can be further expanded by incorporating additional parameters and using a wide range of ML models on different scales to see how it affects the crop yield prediction of different regions, crop species and time.Year-wise scatterplots of in-season corn yield prediction using the SVR model are presented in Figure 10.
Finally, our study also has some limitations.For yield prediction on the county level, yield and other variables are aggregated over a large area.This can result in loss of information due to the variations and landcover in each county.To address this, yield modelling on different scales can be done to see if the county-level yield prediction results are similar to other scales.There are several factors which affect crop yield prediction differently and which may not be characterized by RS-derived VIs.These issues can be addressed in future by conducting very detailed studies at different scales, crops, and different regions.

Conclusions
In this study, we compared the performance of several ML models in estimating corn yield at the county-level in the study area.Our results indicate that the ML models used in this study were effective to predict corn yield at the county level.Data and models used in this study can be used to predict the yield of different crops.Furthermore, the utility of the data and models used in this study is not limited to yield only, as they can be useful to predict other parameters such disease severity, LAI, and nitrogen content.Supervised ML models trained with MODIS-derived VIs demonstrated robust performance, even with relatively small sample sizes.In terms of model selection, the top three models were PLSR, SVR, and RR, which achieved testing R 2 values of 0.861, 0.856, and 0.854, respectively.The performance of other ML models was also noteworthy.We also tested the prediction of yield using in-season data.The top three models selected from the end-of-season predictions were used to predict in-season yield, and the drop in performance was considerably less.SVR model outperformed other ML models for inseason yield prediction.This suggests that ML models and MODIS-derived VIs can predict corn yield with considerable accuracy before harvest.Further studies are needed to check the robustness of the ML models for diverse climatic conditions and at different scales.

Figure 1 .
Figure 1.Study area map.The corn pixels for year 2020 are shown as yellow.

Figure 2 .
Figure 2. Overall workflow of the study.

Figure 3 .
Figure 3. County-level corn yield distribution in the study area.

Figure 4 .
Figure 4. Temporal distribution of corn yield in the study area.

Figure 5 .
Figure 5.The scatterplots of observed and predicted yield using PLSR model (a) training, (b) testing.

Figure 6 .
Figure 6.The scatterplots of observed and predicted yield using SVR model (a) training, (b) testing.

Figure 7 .
Figure 7.The scatterplots of observed and predicted yield using RR model (a) training, (b) testing.

Figure 8 .
Figure 8.The scatterplots of in-season observed and predicted corn yield using PLSR model (year wise).

Figure 9 .
Figure 9.The scatterplots of observed and predicted corn yield of SVR model using in-season data.

Table 1 .
Training and testing R 2 of ML models.

Table 2 .
Training and testing RMSE of ML models.

Table 4 .
Training and testing R 2 of ML models for in-season corn yield prediction.