Forecasting reference evapotranspiration using data mining and limited climatic data

Accurate forecasting of water evaporation and transpiration (reference evapotranspiration, ET0) is imperative in the planning and management of water resources. The Penman-Monteith FAO56 (PM-56) equation is recommended for estimating ET0 across the world. However, it requires several climatic variables, and its use is restricted by the unavailability of these input variables in many locations. In the current study, the potential of the k-Nearest Neighbor algorithm (KNN), a data mining method, for estimating ET0 was investigated using limited climatic data in a semi-arid environment in China. In addition, a KNN-based ET0 forecast model was tested against the PM-56 equation. The accuracy of the models was evaluated using three commonly used criteria: root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (r). The results obtained with the KNN-based ET0 forecast model using normalization, distance weighting and K = 3 were better than those obtained without any such processing. The predictions were consistent with the PM-56 results, confirming the ability of these techniques to provide useful tools for ET0 modeling in semi-arid environments. A comparison of overall performance showed that the KNN-based ET0 forecast model, which requires only maximum air temperature, minimum air temperature and relative humidity as input variables, had the best accuracy.


Introduction
Evapotranspiration (ET) is the process of water loss from soil and crop surfaces to the atmosphere by evaporation and transpiration. It is one of the most important components of the water cycle and a key factor in agriculture, irrigation scheduling and water resources management. ET can be measured directly using micrometeorological techniques based on energy balance and water vapor mass flux transfer methodologies; however, such measurements are costly. A more economical alternative is to estimate ET with mathematical models that use measured meteorological parameters as independent variables. Evapotranspiration also affects the surface ecology and environment of the land. Accurate and rapid prediction of potential evapotranspiration helps to analyze environmental change and is essential for crop irrigation, irrigation water dispatch, water resources management in river basins, ecological environment assessment, water balance research at different scales, and hydrological and ecosystem modeling.
Several mathematical models for ET estimation have been proposed, including those of Thornthwaite (1948), Blaney (1952), Turc (1961), Jensen and Haise (1963), Priestley and Taylor (1972), Makkink (1957), Hargreaves and Samani (1985), and Allen et al. (1998). Among them, the FAO-56 Penman-Monteith equation (PM-56) has been adopted as the reference equation for estimating the reference evapotranspiration (ET0) and for calibrating other equations. However, it requires measurements of several climatic variables, such as air temperature, solar radiation, sunshine duration, relative humidity and wind speed. Unfortunately, in many regions only a limited number of sites have meteorological stations that record all of these variables (Bandoc and Golumbeanu 2010; Panaitescu et al. 2014).
To overcome this problem, many researchers have developed empirical models with fewer weather data requirements as alternatives; for instance, the Hargreaves model requires only temperature data. The choice of a method depends on the accuracy of the equation under the given conditions and on the availability of the required data. Most empirical methods do not perform consistently across climatic conditions. According to Shih (1984), an ideal method for estimating reference evapotranspiration should require as few input variables as possible without compromising the accuracy of the estimate.
Other approaches that have attracted researchers' attention in the past decades are artificial neural networks (ANNs), which have been applied in various fields of hydrological engineering, including classification, forecasting and modeling problems. Owing to their strongly nonlinear functional characteristics, ANNs have proved advantageous in river flow extrapolation (Cigizoglu 2003), rainfall-runoff modeling (Firat 2008), sediment forecasting (Wang et al. 2008), and reference ET modeling (Kumar et al. 2002). Researchers have obtained outstanding results by using different ANN algorithms to model the reference evapotranspiration as a function of climatic data (Trajkovic et al. 2003; Keskin and Terzi 2006; Parasuraman et al. 2007; Doğan 2009; Sathishkumar and Cho 2020). Sudheer et al. (2003) and Zanetti et al. (2007) simplified the ANN inputs in their reference evapotranspiration estimation to air temperature, extraterrestrial solar radiation and daylight hours. Recently, Khoob (2008) and Landeras et al. (2008) successfully estimated the reference evapotranspiration using a similar data set without daylight hours. A review of the study sites of the above-mentioned studies shows that none was conducted under the climatic conditions of the Sudano-Sahelian zone.
With the rise of machine learning and artificial intelligence, scholars began to explore how to combine intelligent algorithms with traditional estimation methods to estimate ET0 more accurately and effectively. Kisi (2015) used temperature, solar radiation, relative humidity and wind speed as inputs to compare the accuracy of LSSVM, MARS and M5Tree in estimating monthly ET0; LSSVM had the smallest relative error in the test period. In 2017, Adnan M. first used principal component analysis (PCA) to select, from seven meteorological variables, the five most relevant to ET0, namely maximum, minimum and average temperature, precipitation and wind speed, reduced their dimensionality, and then combined them with an ANN algorithm to predict ET0. This method saves computation time and cost while maintaining prediction accuracy. Based on five meteorological variables, Mattar (2018) set up eight input combinations and established eight ET0 estimation models using the GEP algorithm. The estimates were very close to those of FAO-56 PM, and GEP was more accurate than the Hargreaves and Samani, Irmak, and Turc equations in estimating ET0. However, these machine learning algorithms have high time complexity and still require a relatively large amount of meteorological data as input.
The objective of this study is to investigate the potential of the k-Nearest Neighbor algorithm (KNN) for estimating ET0 using limited climatic data in a semi-arid environment in China.
More specifically, the study aims to demonstrate the adequacy of a data mining approach for forecasting daily ET0 from limited weather information. The data mining algorithm used is the k-Nearest Neighbor algorithm (KNN). The reference evapotranspiration was first estimated with the Penman-Monteith equation (PM-56) using meteorological data from 1951 to 2000 from the 24 weather stations in Ningxia. The correlation between ET0 and the meteorological elements was then analyzed to identify the elements most closely related to ET0. These elements were converted into a vector space, and the KNN algorithm was used to predict ET0 from the meteorological data for 2001 to 2012. The next section describes the study area and the methods applied, including the data, methodological structure and statistical indexes. The third section examines the applicability of the models to evapotranspiration estimation and presents the results. The last section provides conclusions.

Study area and climate dataset
The Ningxia irrigation area is one of China's four major ancient irrigation districts, with more than 2000 years of irrigation history. Known as the "Frontier of Jiangnan", it is the main grain-, cotton- and oil-producing area of Ningxia and one of the 12 commodity grain bases in China. The irrigation area is located in the Ningxia Hui Autonomous Region in northwest China (east longitude 104°17′ to 107°39′, latitude 35°14′ to 39°23′), which extends about 465 km from north to south, is 45-250 km wide, and covers some 60,000 km². With the Yellow River passing through the region, Ningxia enjoys convenient irrigation, and there are numerous rivers, lakes and channels in the region, as shown in Figure 1.
Ningxia has little precipitation, strong evaporation and dry air. The multi-year average annual rainfall is 289 mm; it increases from north to south, ranging from 180 to 800 mm. The annual water evaporation capacity is 1250 mm, 4.3 times the precipitation. Its trend is opposite to that of precipitation: it declines from north to south, ranging from 1600 to 800 mm. These two opposing trends determine the north-south difference in the drought index, which varies from 1 to 9 from south to north and is 3 to 9 in most areas, so the region is classified as arid and semi-arid.
The weather data were collected from the Meteorological Administration of China (www.cma.

RETRACTED

Penman-Monteith method
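The FAO-56 Penman-Monteith equation is given by Allen et al. (1998) as follows:
$$ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T_{mean} + 273}\,U_2\,(e_s - e_a)}{\Delta + \gamma\,(1 + 0.34\,U_2)} \qquad (1)$$
where $ET_0$ is the reference crop evapotranspiration (mm/day), $R_n$ is the net radiation at the crop surface (MJ m$^{-2}$ day$^{-1}$), $G$ is the soil heat flux density (MJ m$^{-2}$ day$^{-1}$), $\gamma$ is the psychrometric constant (kPa/°C), $e_s$ is the saturation vapor pressure (kPa), $e_a$ is the actual vapor pressure (kPa), $\Delta$ is the slope of the saturation vapor pressure-temperature curve (kPa/°C), $T_{mean}$ is the daily mean air temperature (°C), and $U_2$ is the mean daily wind speed at 2 m (m/s). The computation of all data required for calculating ET0 followed the method and procedure given in Chapter 3 of FAO-56.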
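As a minimal sketch, the PM-56 calculation can be written as a short Python function. The function and argument names are illustrative assumptions rather than the code used in this study, and the input values in the usage lines are made-up sample values, not Ningxia station data. Besides the variables defined above, the FAO-56 equation also takes the net radiation and the soil heat flux density (both in MJ m^-2 day^-1).

```python
def et0_pm56(delta, r_n, g, gamma, t_mean, u2, e_s, e_a):
    """Daily reference evapotranspiration (mm/day) from the FAO-56 PM equation.

    delta  : slope of the saturation vapor pressure curve (kPa/degC)
    r_n, g : net radiation and soil heat flux density (MJ m^-2 day^-1)
    gamma  : psychrometric constant (kPa/degC)
    t_mean : daily mean air temperature (degC)
    u2     : mean daily wind speed at 2 m (m/s)
    e_s, e_a : saturation and actual vapor pressure (kPa)
    """
    radiation_term = 0.408 * delta * (r_n - g)
    aerodynamic_term = gamma * (900.0 / (t_mean + 273.0)) * u2 * (e_s - e_a)
    return (radiation_term + aerodynamic_term) / (delta + gamma * (1.0 + 0.34 * u2))


# Made-up sample inputs (assumption: for illustration only)
et0 = et0_pm56(delta=0.246, r_n=14.33, g=0.14, gamma=0.0674,
               t_mean=30.2, u2=2.0, e_s=4.42, e_a=2.85)
print(f"ET0 = {et0:.2f} mm/day")
```

Keeping the radiation and aerodynamic terms separate makes it easier to see which inputs dominate the estimate at a given station.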

Correlation analysis
This study analyzed the correlation between the meteorological elements and ET0 using the Pearson correlation coefficient, which is defined as follows:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \qquad (2)$$
where $x$ and $y$ represent the two variables being compared (here, a meteorological element and ET0), $\bar{x}$ and $\bar{y}$ are their mean values, and $r$ is the correlation coefficient, which lies in the range [−1, 1]. A value of $r > 0$ indicates a positive correlation and $r < 0$ a negative correlation, while $|r|$ indicates the degree of correlation between the variables. Specifically, $r = 1$ is called perfect positive correlation, $r = -1$ is called perfect negative correlation, and $r = 0$ means that the variables are not linearly correlated. Typically, $|r|$ greater than 0.8 indicates that the two variables have a strong linear correlation.
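The Pearson coefficient defined above can be computed directly from two equal-length series. The following Python sketch is illustrative only: the function name and the sample values are assumptions, not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample series (assumption: for illustration only)
t_max = [10.0, 15.0, 20.0, 25.0, 30.0]      # maximum temperature, degC
humidity = [60.0, 50.0, 45.0, 35.0, 30.0]   # relative humidity, %
et0 = [1.0, 2.1, 2.9, 4.2, 5.0]             # reference evapotranspiration, mm/day

print(pearson_r(t_max, et0))      # positive: ET0 rises with temperature
print(pearson_r(humidity, et0))   # negative: ET0 falls as humidity rises
```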

K-nearest neighbor algorithm (KNN)
The KNN technique is used mainly for classifying data into multiple categories, but it can also be applied successfully to forecasting numerical attributes that depend non-linearly on the inputs. It therefore supports both multi-category classification and the modeling of non-linear data relationships for forecasting. The KNN algorithm does not require the samples to follow a specific distribution, although the predictions tend to be better if the sample set is normally distributed. The data set used in this article follows a normal distribution.
Although it is a basic algorithm within instance-based learning, KNN has some disadvantages that need to be taken into account, depending on the nature of the problem. Firstly, the algorithm is computationally slow when the dataset contains a large number of instances. Secondly, the algorithm only supplies an estimate of the objective-attribute value, without offering other information about the instance being evaluated or the dataset as a whole.
Figure 1. Ningxia is located in the upper and middle reaches of the Yellow River in northwestern China, with a total area of 66,400 km². The terrain is long and narrow from north to south, with a distance of 456 km between north and south and about 250 km between east and west. Its north is flat and is a large irrigation area nourished by the Yellow River, the middle is an arid zone, and the south is hilly and mountainous.
The KNN class implements the algorithm by finding the k nearest neighbors of a given instance in the dataset. The first problem the implementation faces is transposing these instances into a normalized form that can be represented as points in a coordinate space. These coordinates must be resolved before the geometrical formulas that make up the steps of the algorithm can be applied. For this purpose, the KNN class first defines two helper methods. The first method computes the arithmetic average of the numerical projection of an attribute's values, a projection identified by the Numeric Value property. The average is given by the following expression:
$$\bar{A} = \frac{1}{|S|} \sum_{v \in A} v \qquad (3)$$
where $v$ iterates through all the values of attribute $A$ and $S$ is the dataset.
The second method computes the standard deviation of an attribute using the following formula:
$$\sigma_A = \sqrt{\frac{1}{|S|} \sum_{v \in A} (v - \bar{A})^2} \qquad (4)$$
The parameters in Equation (4) have the same definitions as in Equation (3). These values are needed to scale the dataset so that the arithmetic average of the scaled values is zero and their standard deviation is 1. This scaling is performed by replacing the numeric value of each attribute with a new value obtained as follows:
$$v' = \frac{v - \bar{A}}{\sigma_A} \qquad (5)$$
where $v$ represents the value of attribute $A$, which is replaced with its new value $v'$. The implementation then needs a method that generates a matrix representation of the initial dataset. This matrix is represented through Dictionary classes, which makes it possible to access a value by its attribute and the instance in which it is found. The KNN algorithm relies on evaluating distances between two points in the dataset plane, hence the need for a method that computes this distance. Given that a matrix containing the scaled values of the dataset already exists and its values can be accessed by their initial source, the Calculate Distance method is defined with two parameters of type Instance and is responsible for performing all the required projections for the instances; an invocation of this method can be read as "evaluate the distance between instances x and y". The method uses the Euclidean distance; the distance between two instances in the dataset, identified by the (scaled) values of their attributes, is defined as:
$$d(a, b) = \sqrt{\sum_{A} (a'_A - b'_A)^2} \qquad (6)$$
where $a$ and $b$ are instances within the dataset, $A$ iterates through the attributes in the dataset, $x_A$ represents the value of attribute $A$ for instance $x$, and $x'_A$ is the scaled value of that attribute.
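The scaling and distance steps described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, the population form of the standard deviation (dividing by the number of instances) is an assumption, and the sample rows are made up.

```python
import math


def mean(values):
    """Arithmetic average of one attribute."""
    return sum(values) / len(values)


def std(values):
    """Population standard deviation of one attribute (assumed form)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def standardize(rows):
    """Scale every attribute (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 to avoid dividing by zero.
    stds = [std(col) or 1.0 for col in columns]
    scaled = [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]
    return scaled, means, stds


def euclidean_distance(a, b):
    """Euclidean distance between two instances given by their scaled attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up rows: (max temperature degC, min temperature degC, relative humidity %)
rows = [(30.0, 15.0, 20.0), (10.0, -5.0, 45.0), (22.0, 8.0, 30.0)]
scaled, means, stds = standardize(rows)
print(euclidean_distance(scaled[0], scaled[1]))
```

Returning the means and standard deviations alongside the scaled rows lets the same scaling be applied later to new instances that were not part of the sample set.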

Parameters of error analysis
The performance of the models was evaluated using the following statistical parameters: root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²). R², RMSE and MAE are commonly used indicators for evaluating the prediction performance of an algorithm. Generally, a model with a high R² also has small RMSE and MAE values. Readers interested in the formulations of the above-mentioned statistical parameters are referred to Hosseinzadeh Talaee et al. (2011).
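For orientation, the three indicators can be computed as in the Python sketch below. The definitions are the standard textbook ones and are an assumption here, since the paper defers its exact formulations to the cited reference; in particular, R² is taken as 1 minus the ratio of the residual to the total sum of squares. The function names and sample values are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination, assumed as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up values: PM-56 reference ET0 versus a model's predictions (mm/day)
pm56 = [2.0, 3.5, 5.0, 4.0]
model = [2.2, 3.1, 4.6, 4.1]
print(rmse(pm56, model), mae(pm56, model), r_squared(pm56, model))
```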

KNN-based ET0 forecast model
The basic principle of the model is as follows: when a new data item to be predicted is received, it is compared with a set of sample data items to find the K sample items closest to it, and the mean of their values is taken as the final prediction, as shown in Figure 2. The forecast model executes in three steps: first, features are extracted from the meteorological data using correlation analysis; these features are then converted into a vector space, as shown in Table 1; finally, the KNN algorithm is used to predict ET0.
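The prediction step described above (find the K closest samples, then average their values) can be sketched in a few lines of Python. The function name, the feature layout and the sample values are assumptions made for illustration, and the features are taken to be already scaled.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a value as the plain mean of the k nearest training targets."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Rank the training samples by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Made-up, already scaled feature vectors: (max temp, min temp, relative humidity)
train_x = [(1.2, 1.0, -0.9), (0.8, 0.9, -0.5), (-1.1, -1.2, 1.0), (0.1, 0.0, 0.2)]
train_y = [5.6, 5.0, 1.2, 3.1]   # ET0 from PM-56, mm/day
print(knn_predict(train_x, train_y, query=(1.0, 1.0, -0.7), k=3))
```

Because KNN stores the samples and defers all work to prediction time, the cost of each forecast grows with the size of the sample set.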

Correlation between ET0 and meteorological elements
Pearson correlation analysis was carried out between the meteorological elements and the daily ET0 calculated by the FAO-56 Penman-Monteith equation. The results are shown in Table 2.
In descending order of their correlation with average daily ET0, the meteorological elements are maximum temperature, relative humidity, average temperature, average wind speed, minimum temperature, and sunshine hours. ET0 is negatively correlated with relative humidity and positively correlated with the remaining elements. Sunshine hours were significantly correlated at the 0.05 level, and all other elements at the 0.01 level. These data suggest that temperature and relative humidity are the main factors affecting ET0 in the Ningxia region: ET0 increases as temperature rises and as relative humidity falls. Therefore, in this study, maximum temperature, minimum temperature and relative humidity are used to predict ET0.

Results of KNN algorithm forecasting
First, the average daily ET0 was calculated with the PM-56 formula for 1951 to 2000. Then, based on limited meteorological data (maximum temperature, minimum temperature and relative humidity) for 2001-2012, the KNN algorithm (K = 1) was used to forecast the annual average daily ET0. The predicted results were compared with the PM-56 results, as shown in Figure 3. Clearly, the predictions are not satisfactory (RMSE = 1.37, MAE = −0.58, R² = 0.6453). The predicted values are much lower than those calculated with the PM-56 formula, which indicates underestimation. We found that the features making up the vector space have different units and different ranges; for example, temperature ranges from −20°C to 35°C, whereas relative humidity lies between 10% and 50%. This is clearly not conducive to precise prediction, so we normalized the features.

Results of KNN algorithm with normalization
After normalization of the features, the new forecast results are shown in Figure 4. The results are better than the previous ones, but ET0 is still underestimated (RMSE = 1.014, MAE = −0.365, R² = 0.7836). The prediction algorithm therefore needs further improvement.

Results of KNN algorithm with different k value
In the KNN algorithm, K is the number of nearest neighbors selected in the forecasting process. How do different values of K influence the predicted results? We tried several cases, namely K = 1, K = 3 and K = 5. The results are shown in Figures 5-7 and Table 3.
We found that the value of K has a large effect on the predictions. Values of K that are too small or too large both reduce prediction accuracy, because the error of the estimate increases in either case. Therefore, an appropriate value of K, which is a hyperparameter, must be chosen first. Here the most appropriate value is K = 3, for which the prediction accuracy is acceptable.
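One common way to choose K, sketched below in Python, is to compute the error on held-out data for each candidate value and keep the value with the smallest error. This is an illustrative procedure under assumed names and made-up data; the paper does not state how the comparison in Table 3 was implemented.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Plain-mean KNN prediction from the k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k


def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def evaluate_k(train_x, train_y, val_x, val_y, candidates=(1, 3, 5)):
    """Return the candidate k with the lowest validation RMSE and all scores."""
    scores = {}
    for k in candidates:
        predictions = [knn_predict(train_x, train_y, q, k) for q in val_x]
        scores[k] = rmse(val_y, predictions)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Made-up one-feature data (assumption: for illustration only)
train_x = [(float(i),) for i in range(10)]
train_y = [2.0 * i for i in range(10)]
val_x = [(2.5,), (6.5,)]
val_y = [5.0, 13.0]
best_k, scores = evaluate_k(train_x, train_y, val_x, val_y, candidates=(1, 2, 3))
print(best_k, scores)
```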

Results of weighted KNN algorithm
Compared with the PM-56 results, the predictions of the KNN algorithm with K = 3 are still not very good, so we assigned different weights to different neighbors.
Up to this point every neighbor had the same weight, because the prediction was the plain average of the neighbors. A Gaussian function was chosen as the weighting function, because it gives a high weight to near neighbors and a relatively low weight to distant neighbors, while the weight never becomes zero. The predictions of the weighted KNN algorithm are shown in Figure 8. The curve of the predicted results is consistent with the curve of the PM-56 results (RMSE = 0.5778, MAE = 0.091 and R² = 0.9209).
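A Gaussian-weighted KNN prediction of the kind described above can be sketched in Python as follows. The weight form exp(−d²/(2σ²)) and the bandwidth parameter `sigma` are assumptions: the paper names the Gaussian function but does not report its parameters. Function names and sample data are likewise illustrative.

```python
import math


def gaussian_weight(distance, sigma=1.0):
    """Weight that decays smoothly with distance (assumed Gaussian form)."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def weighted_knn_predict(train_x, train_y, query, k=3, sigma=1.0):
    """Predict a value as the Gaussian-weighted mean of the k nearest targets."""
    ranked = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    nearest = ranked[:k]
    weights = [gaussian_weight(d, sigma) for d, _ in nearest]
    total = sum(weights)
    if total == 0.0:
        # Numerically, all weights can underflow to zero when every neighbor is
        # very far away relative to sigma; fall back to the plain mean.
        return sum(y for _, y in nearest) / len(nearest)
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / total


# Made-up, already scaled feature vectors and PM-56 ET0 values (mm/day)
train_x = [(1.2, 1.0, -0.9), (0.8, 0.9, -0.5), (-1.1, -1.2, 1.0), (0.1, 0.0, 0.2)]
train_y = [5.6, 5.0, 1.2, 3.1]
print(weighted_knn_predict(train_x, train_y, query=(1.0, 1.0, -0.7), k=3))
```

A smaller `sigma` concentrates the weight on the closest neighbors, while a very large `sigma` makes the result approach the unweighted average.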
Because the Penman-Monteith method requires many meteorological observations to estimate ET0, it is difficult to apply in developing countries or regions where observation equipment is not available. With the rise of machine learning and artificial intelligence, scholars have begun to explore how to combine intelligent algorithms with traditional estimation methods to estimate ET0 more accurately and effectively.
Compared with other methods for estimating potential evapotranspiration, the method proposed in this paper uses less meteorological data, which provides a way to estimate evapotranspiration in areas where meteorological data are incomplete. In addition, the method is simple, and its accuracy meets the needs of applications in hydrology and water resources.

Conclusions
In this study, an attempt was made to determine the best method for estimating ET0 when the full weather data required by the PM-56 method are unavailable in Ningxia, a semi-arid environment in China. We designed an ET0 prediction model based on the KNN algorithm. During testing, we found that the prediction accuracy of the original KNN algorithm was poor, so we introduced several improvements: normalization, different K values, and different weights. The resulting KNN-based ET0 forecast model showed good agreement with the ET0 obtained by the PM-56 method, and the Gaussian function was the best weighting function for the KNN models. In future work, further weighting functions will be tried to improve the prediction accuracy of the KNN-based ET0 forecast model.
The k-Nearest Neighbor algorithm (KNN) is a nonparametric, lazy, supervised classification algorithm in machine learning. Its principle is simple and easy to implement, and its time complexity is only O(n). It has several advantages: (1) the algorithm is simple, easy to understand, and easy to implement, without the need to estimate parameters; (2) the training time is zero; (3) compared with algorithms such as Naive Bayes, it makes no assumptions about the data, its accuracy is high, and it is not sensitive to outliers. Its disadvantages include: (1) it cannot give rules as decision trees do; (2) it is a lazy learning method, so prediction is slower than with algorithms such as logistic regression; (3) when the samples are unbalanced, the prediction accuracy for rare categories is low. Considering these advantages and disadvantages, the algorithm needs to be treated with caution in practical applications, and appropriate measures should be taken to ensure its stable and reliable operation.
The findings of this case study provide basic guidance to irrigation engineers and agriculturists as to which models will give a better estimate of ET0, in light of data availability, for irrigation scheduling and water resources management. The KNN-based ET0 forecast model developed here can be embedded as a module for estimating ET0 in hydrological modeling studies in the study area and in areas with similar hydrometeorological characteristics.
gov.cn). The data comprise mean, maximum and minimum air temperatures, relative humidity, wind speed, atmospheric pressure and sunshine hours for the period 1951-2012. The data were divided into two parts. The first part (the first 50 years, 1951-2000) was used to calculate ET0 using the Penman-Monteith equation (PM-56) and to train the KNN. The second part (2001-2012) was used for validation as the testing period. It should be noted that 10-day mean values of the weather data were used for the analysis.
The training time of KNN is zero because it has no explicit training phase. Unlike other supervised algorithms, which use the training set to train a model (that is, fit a function) and then use the model to classify the validation set or test set, KNN just saves the samples and processes them when it receives the test data.
Further research is needed to test the model used here in other climates in order to evaluate the effects of climate type.
Construction Project of Ningxia, China (No. NXYLXK2017A03); the Natural Science Foundation of Ningxia, China (No. 2019AAC03049); the China Scholarship Council (No. 201708645016); and the Priority Research and Development Projects for Ningxia (No. 2020BEG03021).

Table 1. Vector space model for the KNN-based ET0 forecast model.

Table 2. Correlation between climate elements and ET0.

Table 3. Statistical analysis of errors for different k values.