Integration of maximum crop response with machine learning regression model to timely estimate crop yield

ABSTRACT Timely and reliable estimation of regional crop yield is a vital component of food security assessment, especially in developing regions. The traditional crop forecasting methods need ample time and labor to collect and process field data to release official yield reports. Satellite remote sensing data is considered a cost-effective and accurate way of predicting crop yield at pixel-level. In this study, maximum Enhanced Vegetation Index (EVI) during the crop-growing season was integrated with Machine Learning Regression (MLR) models to estimate wheat and rice yields in Pakistan’s Punjab province. Five MLR models were compared using a fivefold cross-validation method for their predictive accuracy. The study results revealed that the regression model based on the Gaussian process outperformed over other models. The best performing model attained coefficient of determination (R 2), Root Mean Square Error (RMSE, t/ha), and Mean Absolute Error (MAE, t/ha) of 0.75, 0.281, and 0.236 for wheat; 0.68, 0.112, and 0.091 for rice, respectively. The proposed method made it feasible to predict wheat and rice 6–8 weeks before the harvest. The early prediction of crop yield and its spatial distribution in the region can help formulate efficient agricultural policies for sustainable social, environmental, and economic progress.


Introduction
The current world population of 7.6 billion is expected to increase by up to 9.8 billion by 2050, with most population growth in developing countries of Asia and Africa (United Nations 2019). Future dietary requirements of these developing nations will require a regular increase in agriculture production by maintaining a stable agroecosystem. A changing climate is also a profound threat to the food security of developing regions. Agriculture policies play a vital role in achieving productivity growth and raising an agrarian society's overall economic status (Simoncini et al. 2019). An effective agriculture policy relies on timely and accurate crop yield information to better manage supply and demand to ensure food security in the region (Maya Gopal and Bhargavi 2019). Additionally, a thorough picture of crop yield status helps control market swings that can be extremely disruptive in regions with an agriculture-based economy (Giannakis and Bruggeman 2015).
Crop yield monitoring in developing countries is mainly based on two types of sampling surveys. The early crop yield prediction is based on subjective surveys, like taking opinions from growers and the field officers' visual judgment. The later crop yield estimation is done using objective surveys such as a whole plot harvest or crop cut measurement from sample fields (Craig and Atkinson 2013). The sampling sites for a region are selected through a systematic random sampling scheme, and the small data sets often do not reflect the complete status of the seasonal croplands as samples are only collected from accessible fields. This labor-intensive crop monitoring system can be an adequate provision for overall agriculture management, but the crop yield estimation through this mechanism is time-consuming with large associated uncertainties. Moreover, final estimates of crop yields at a national scale are finalized after months of a crop harvest, making it challenging to take timely decisions of import and export to ensure food security and economic growth.
Satellite remote sensing is a cost-effective tool for realtime monitoring and crop status assessment (Fritz et al. 2019). Earth-orbiting satellites capture essential information about vegetation conditions over large areas with frequent revisits. Medium to coarse spatial resolution data from different satellites (e.g. Sentinel, SPOT, Landsat, and MODIS) are freely available for analyzing vegetation conditions at the regional and global scales (Jianxi Huang et al. 2019). Many recent studies have used remote sensing derived biophysical parameters in Crop Growth Models (CGM) to estimate crop yields. These biophysical parameters include leaf area index, soil moisture, the fraction of absorbed photosynthetically active radiation, evapotranspiration, and above-ground biomass. The use of these data in CGM has helped the researchers to evaluate different crop management strategies for improving the crop yield at the regional and global scales (Jin et al. 2018). However, the need for local calibration makes it difficult to apply the CGM in developing countries where data scarcity often presents a prohibitive obstacle. The lack of spatial data to capture the heterogeneity of land surface also makes the CGM application misleading at the local scale.
Statistical modeling seems to be a more reasonable approach for data scarce regions. A regression model can be built between reported crop yield and remote sensing derived Vegetation Index (VI). Although regression models are considered regionspecific and time-dependent, they can effectively fulfill spatiotemporal yield gaps by mapping regional crop yield at a fine scale within the study period (Lobell 2013). Previous studies like Jingfeng Huang et al. (2013) used Normalized Difference Vegetation Index (NDVI) derived from NOAA AVHRR to develop a stepwise regression model to estimate crop yield in major rice grown provinces of China. Petersen (2018) developed a multivariate regression model using indices derived from MODIS to predict the real-time yield of corn, soybean, and sorghum for the continent of Africa. The proposed methods in these studies do not need to map crop cover areas and utilize the monthly anomalies of indices to measures relative vegetation health.
More recently, Liu et al. (2019) compared the results of yield estimation on aggregated croplands and masks of specific crops in Canada. A better correlation was found between MODIS two-band Enhanced Vegetation Index (EVI2) at the peak growth stage of the crop and its national yield using crop-specific masks. A multilinear regression model was developed to estimate the crop yields of winter wheat, corn, and soybean. Machine Learning(ML) methods in solving complex non-linear problems have also been utilized in crop yield prediction using remote sensing data. Johnson et al. (2016) employed these methods to predict yield canola, barley, and spring wheat using NDVI and EVI data. Two non-linear models (Bayesian Neural Networks and model-based recursive partitioning) were used to estimate crop yield in Canadian Prairies. In addition, recent studies are utilizing Machine Learning Regression (MLR) approaches with remote sensing data to tackle multiple cropland issues in a geospatial modeling environment (Prins and Van Niekerk 2020;Ujoh, Igbawua, and Ogidi Paul 2019;Mfuka, Byamukama, and Zhang 2020).
The advances in machine learning algorithms and increasing availability of multitemporal remote sensing data have largely helped to deal with data nonlinearity and multi-dimensionality, which often create difficulties in applying regression models. Taking the advantages, this research aims (1) to develop a method using remote sensing data and machine learning regression models to estimate crop yield of wheat and rice in a large crop-production region; (2) to compare the performances of different regression models to identify the most suitable one for the yield prediction of each crop; and (3) to analyze the spatial variation of crop yield across the study region and to evaluate the proposed prediction model for preharvest yield forecasting. This study will help managers formulate efficient agricultural policies to improve crop production in low-performing regions to benefit society.

Study area
The study area for this research is the Punjab province of Pakistan, which has the largest share (73%) of the country's total cropped area ( Figure 1). Wheat and rice, along with cotton and sugarcane, are the cash crops of this region. Punjab contributes 76% of wheat and 56% of rice in total national production (Dempewolf et al. 2014;Rehman et al. 2017). Wheat is the main crop of the Rabi season (November to April of the next year). Rice is the main crop in the Kharif season (May to October). The period of study was from November 2002 to October 2018. All districts that had wheat and rice cultivated area greater than 50 kHa were used to estimate the crop yield. The change (%) of wheat and rice cultivation area based on reported statistics in selected districts since 2003 is shown in Figure 1. No more than a 3% increase and a 1% decrease were observed for each selected district. A total of 16 years of historic EVI and reported crop statistics of both crops were used for this research.

Reported crop statistics
Under the directorate of agriculture, the Crop Reporting Service (CRS) is responsible for producing district level crop statistics for the Government of Punjab. The CRS Punjab has 1038 crop reporters that collect field data from 1240 sample villages. These sample villages account for 5% of Punjab's total villages and are selected for 5 years through stratified random sampling. The CRS generates three survey reports for each major crop. The first survey is a visual inspection of field reporters in sample villages to estimate the crop cultivated area after completing each crop's sowing period. The second inspection is done in the middle of crop season to prepare a list of all fields of a particular crop in the sample village, known as a frame of concerned crop. It also helps forecast crop yield based on grower's opinion, availability of input products, weather conditions, and field officers' expert judgment.
The third survey is taken during the harvest season by measuring crop yield in three experimental plots of 20 × 15 ft. These plots are selected randomly within each sample village using the crop frame. The sample data is interpreted and aggregated at the district by the statistical experts of CRS. The final crop production estimates are generated using crop areas generated through a complete census at the district level. This complete enumeration of crop area in each district is carried out by field officers of the Department of Revenue, Pakistan. CRS estimates the final crop production by multiplying district level measured yield values with the final area values. For this study, district-level crop yield data of wheat and rice was collected from CRS for the study period.

Remote sensing data
Two satellites Terra and Aqua carries a sensor Moderate Resolution Imaging Spectroradiometer (MODIS) that record the reflection from the earth in 36 different spectral bands. Two MODIS products MOD13Q1 and MYD13Q1 from Terra and Aqua, respectively, contain EVI data that was used in this study. These products have a spatial resolution of 250 m, and their composite temporal resolution is 8 days. A total of 46 images complete one calendar year. Based on the study area's crop calendar, images from Nov 2003 to Oct 2018 were used for this analysis. Three tiles (h24v05, h23v05, and h24v06) covers the entire study area.

Overall methodology
The flow diagram of the procedure adopted for this research is shown in Figure 2. An eight-day temporal MODIS-EVI at 250 m spatial resolution was first stacked by seasons for rice and wheat crops. A six-month EVI stack from November to April next year was prepared for the wheat crop. For the rice crop, a six-month EVI stack from May to October was prepared. From EVI stacks of rice and wheat, the maximum seasonal EVI (EVI max ) was extracted for each crop season. The Day of the Year (DOY) for EVI max was used to evaluate the model's preharvest prediction ability. As no significant areal change in rice and wheat croplands in the selected districts, an existing crop cover data set was used to extract the EVI max values. To match the reported crop yield data, the district average EVI max was calculated for input to MLR models. The best performing MLR model for each crop was then selected to simulate the crop yield using the EVI max value.

Linear Regression (LR)
The linear regression models have been used in many previous yield estimation studies to indicate a linear or exponential relationship to single or multiple variables (Hou et al. 2019;Gaso, Berger, and Ciganda 2019). The slope and intercept are the two parameters that describe the linear relationship between the dependent and independent variables. To accommodate multiple dependent variables in LR modeling, a stepwise regression analysis is often used to eliminate non-significant variables in the linear expression. The other machine learning methods are more efficient in handling predictor variables' nonlinearity than simple LR models (Chlingaryan, Sukkarieh, and Whelan 2018).

Support Vector Regression (SVR)
In ML methods, a Support Vector Machine (SVM) is a training-based discriminative model that offers various distinctive edges in handling complex multidimensional data using hyperplane (Schölkopf and Smola 2001). Apart from being used as a classification method, SVM can also be used as a regression model with a few minor differences. The support vector regression model maintains all the main elements that distinguish the algorithm, i.e. maximal margin. Different kernel functions such as linear and Radial Basis Function (RBF) enable the SVR to operate in higher dimensions. For linear SVR, the function used to predict values f x ð Þ depending on the support vectors can be mathematically expresses as: Where x i is the number of training dataset in N observations, a i and a i � are non-negative real numbers known as Lagrange multipliers, and b is the intercept.

Decision Tree Regression (DTR)
A decision tree regression model is a non-parametric supervised ML method that develops a hierarchical tree structure by learning from training data. The DTR model divides the data into smaller subsets using decision nodes and leaf nodes. A decision node has further sub-classes, while the leaf node has an associated decision value (M. Xu et al. 2005). A deep tree structure leads to complex decision rules and is prone to overfitting. Alternately, to mitigate the problem of overfitting, multiple coarse decision tree structures are combined in Ensemble Learning Regression (ELR) for prediction. There are two commonly used ensemble methods. One method is to group decision trees is in parallel order called Bagging, and the other is in sequential order called Boosting. Bagging is used to reduce the variance in estimation by creating multiple random samples from the original dataset to train multiple models individually. The final predictions are determined by combining the predictions from all individual models. Boosting is an iterative method that alters the weight of an observation based on the last prediction. The main goal of boosting is to achieve higher accuracy (Zhou 2012). The mathematical expression of DTR is similar to LR model and nodes are defined using standard deviation and variance formulas.

Gaussian Process Regression (GPR)
The Gaussian process regression is another nonparametric kernel-based probabilistic model that makes a significant ML application impression. A well-trained GPR model not only predicts the response variable but also quantifies the uncertainty associated with it. The random fluctuations and the coefficients are estimated from the training data. A GPR is a function space model and supposes that the covariance between any two random variables is a multivariate Gaussian. A multivariate Gaussian is defined by its mean and covariance (kernel) function. The kernel function defines the spatial or temporal similarities between two random variables. There are multiple kernel functions (e.g. rational quadratic and marten 5/2) available to choose from that capture the smoothness of the response variables. As GPR model is probabilistic model, an instance of response y can be modeled as: Þ is the density of the sample, σ 2 is the noise in the density, f x i ð Þ is a latent variable developed for each observation x i , h is explicit basis function, and β is a coefficient vector. More details on different parameters of GPR model can be found in Rasmussen (2003). The present study has compared LR, SVR, DTR, ELR (both bagged and boosted), and GPR models to predict crop yield in selected districts. The Bayesian optimization with 50 iterations was used to fine-tune hyperparameters in all MLR models.

Model evaluation
All regression models used fivefold cross-validation to indicate model performance in yield estimation. In the fivefold cross-validation, for each MLR model, the whole data is randomly divided into training and testing datasets for five simulations. The average value of evaluating coefficients from five simulations depict each model's efficacy. The comparison of all regression models led to the selection of the best predictive model for each crop. The best selected, trained model was used to simulate pixel-level crop yield information for the entire study period. The coefficient of determination (R 2 ), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE) were used to evaluate each model. The formulas for R 2 , RMSE, and MAE for one simulation are given below: Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P n i¼1 RMSE ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P n i¼1 x i À y i ð Þ 2 n s (4) Where x i and y i are the individual reported and estimated crop yield during the whole study period (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018); � x and � y are the average values of reported and estimated yields in that period, respectively, and n is the total sample size for each crop. The higher value of R 2 and lower value of RMSE and MAE indicated a better performing model.

Evaluation of MLR models
The comparison of five MLR models in predicting crop yield of rice and wheat using EVI max is shown in  Figure 3 for both wheat and rice.

Spatial variation of crop yield
A district-level trained GPR model was further used to simulate the pixel-level yield information from 2003 to 2018. Figure 4 shows the spatial variation of average wheat yield at the pixel-level and district-level simulated using the GPR model during the study period. The results indicated that for the past 16 years an average annual yield of 2.60 (t/ha) with a standard deviation of ±0.48 (t/ha) was produced in the study area. The maximum wheat yield was observed in the Gujranwala district with an average of 3.19 ± 0.23 (t/ ha). On average, the arid districts of Punjab province (Chakwal, Rawalpindi, and Attock) have the lowest yield of wheat (1.52 ± 0.27 t/ha) throughout the study period. The maximum yield variation of 0.41 (t/ha) was observed in the Rajanpur district, with an average yield of 2.53 (t/ha) during the study period. In southern Punjab, Khanewal, and Lodhran have the  Figure 5. The selected districts had an average yield of 1.86 ± 0.09 (t/ha) in the past 16 years. The maximum rice yield was estimated in the Gujranwala district (with an average of 2.06 ± 0.11 (t/ha)), while the Gujarat district had the least seasonal yield (with an average of 1.76 ± 0.07 (t/ha)). In Punjab province, the varieties of cultivated rice are divided into three main categories: a) basmati rice, b) long grain/non-basmati rice, and c) coarse/medium rice. The basmati and long grain/non-basmati have a similar yield percentage but the coarse/medium rice has a higher yield. The selected rice districts of this study are mostly cultivated under basmati or long grain/non-basmati variety. The coarser rice is mostly sown in the southern part of Punjab and Sindh province of Pakistan and is not considered in this study.

Timely yield estimation
The rice and wheat crops reach to the EVI max value in the middle of their growth cycle during the greening stage. The day of the year to reach EVI max was studied for each crop during the study period to assess the effectiveness of the proposed method on timely yield prediction. The median value of DOY of EVI max and its spatial variation during the study period for both crops is shown in Figure 6. More than 75% of the area with wheat crop shows EVI max in the two weeks of mid-February. The harvesting period of the wheat crop is from mid-April to mid-May. This implies  that the wheat yield can be predicted using the EVI max with GPR model 7-8 weeks before the harvest.
Similarly, for the rice region, more than 70% of the cultivated area shows EVI max in the last two weeks of August. The rice harvesting season is from mid-October to late-November. The prediction of rice yield can be achieved 6-7 weeks before the start of the harvest. The spatial variation of EVI max can also provide a general trend of crop harvest. In general, the wheat is harvested in the southern area earlier than in the northern area. The spatial trend of EVI max can be analyzed according to various climatic conditions,  cropping sequence, and crop varieties cultivated throughout the region. However, this association analysis is beyond the scope of this study.

Discussion
The crop yield estimation is of paramount importance in recent times, especially for developing countries where food production is under high pressure. Satellite remote sensing data provides regular updates about crop conditions throughout its growth cycle. Many past studies have analyzed crop yield relation with remote sensing derived parameters at different crop growth stages. Most studies pointed out that the VIs and biophysical parameters around the crop's peak season are positively correlated with final yields for various crops (Dempewolf et al. 2014;Bolton and Friedl 2013;Sakamoto, Gitelson, and Arkebauer 2013;Ren et al. 2008). However, developing regression based on a single day VI during peak growth duration for a vast area is problematic as crop sowing time changes from one field to another. Climatic conditions throughout the region also played a crucial role in deciding the start of the crop season. To compensate this, these studies used the spatiotemporally aggregated information of crop phenology to estimate crop yield. Most past studies have also overlooked the use of cropland masks for aggregating remote sensing VI values to regress against reported crop yield that can be misleading in crop yield estimation and its spatial variation.
In this study, a crop mask was used to select VI values of a specific area. The use of the season's maximum EVI compensated for the variation in the crop's sowing timing and simplified the use of suitable crop phenology information for regression purposes. The seasonal EVI max was then used to train various MLR models, and their prediction ability was compared to select the one with the best performance. This EVI max based approach, however, cannot cater to the impact of extreme weather events after the crop's greenness stage as these events can greatly affect the crop yield during the critical stage of ripening before the harvest. Abbas and Mayo (2020) studied the effect of rainfall and temperature on rice production in Punjab, Pakistan, and reported a negative impact of rainfall during the rice ripening stage and its production. Crop lodging is a major yield-reducing factor that is connected to extreme weather events. The crop lodging during the ripening stage can seriously damage the production of cereal crops (Niu et al. 2016).
Nevertheless, this study has proven that the use of EVI max with the GPR model can provide a reasonable estimate of crop productivity 7 weeks before the harvest. The DOY analysis based on MODIS has identified the spatial patterns of EVI max occurrence throughout the region. Further studies using fine resolution remote sensing data based on identified DOY can improve the spatial resolution and quality of yield prediction. The timely information on the productivity of primary cereal crops helps to ensure regional food security and regulates market prices in regions with an agro-based economy. The spatial mapping of yield levels can also help policymakers improve the crop productivity of underperforming areas. Furthermore, the pixel-level information of crop productivity levels can help to derive actual spatial diversity of crop management practices in the region. These management practices include the actual levels of irrigation and farm chemicals (fertilizers and pesticides) being applied in the region. Steduto et al. (2012) have expensively reported the direct nexus between crop water used and its yield for multiple crops including wheat and rice. Similarly, studies (Xu et al. 2019;Havlin and Heiniger 2020) have reported positive impact of soil fertility on crop yield. Therefore, by extrapolating the results of study, one can indirectly estimate the actual levels of crop management practices in the region that is a valuable information for a datascarce developing regions like Pakistan. The information of DOY for EVI max can be further explored to derive indirectly the crop calendar events (like crop sowing and harvest time) at regional scale. Such localscale information of diverse crop management practices are a valuable input to robustly assess the impacts of changing climate on regional crop productivity.

Conclusions
This paper reports a study to develop an efficient crop yield forecasting model with a case study in a diversified agriculture region of Punjab, Pakistan. The crop-specific EVI max from MODIS was used to train and test the predictive ability of five machine learning regression models. The GPR model was identified as the best prediction model for the yields of rice and wheat crops. The analysis of EVI max revealed that the proposed method is capable of estimating wheat yield 7-8 weeks and rice yield 6-7 weeks before the harvest. Further studies using fine spatial resolution remote sensing datasets and incorporating the spatial information of EVI max occurrence derived from this study will enhance yield results. The findings and method developed in this study would help to forecast regional crop productivity and produce timely crop yield predictions well before the harvest, which can help better management of crop market, food security, and rural development.

Data availability statement
The input data used in this research can be accessed freely from online sources. Remote sensing data used in this study can be downloaded via various web portals of NASA, e.g. https://earthdata.nasa.gov/. The district-level reported crop yield data are available at the Crop Reporting Service (CRS), Punjab's website (http://crs.agripunjab.gov.pk/reports). All the derived data are included in the manuscript. The machine learning regression models were trained and tested using MATLAB's Regression Learner app. The MATLAB script to estimate pixel-level crop yield and DOY for EVI max can be made available upon request to corresponding author.

Funding
The research is supported by the Natural Science

Notes on contributors
Ali Ismaeel has a PhD in geography and master's degree in remote sensing and GIS with bachelor's degree of agricultural engineering. His research broadly covers ecological modeling using remote sensing and GIS approaches with prime focus on landuse/landcover mapping, vegetation trend analysis, hydrological modeling and cropland ecosystem modeling.
Qiming Zhou is a Professor of Geography, Associate Dean of Faculty of Social Sciences (Research) and Founding Director of Centre for Geo-Computation Studies at Hong Kong Baptist University. His research interests cover a broad area of geo-spatial information science, particularly in geo-computation and remote sensing applications. He has been actively engaged in research such as digital terrain analysis, climate change and its impacts on regional and global ecosystems, landuse and land cover change detection, and GIS and remote sensing applications to urban, environment and natural resource management.