Prediction of the sodium absorption ratio using data-driven models: a case study in Iran

ABSTRACT In this investigation, two data-driven models, Gaussian process (GP) and support vector machine (SVM), were used to predict the sodium absorption ratio (SAR) in three sub-watersheds (Khorramabad, Biranshahr, and Alashtar) in Iran. Their performance was also compared with that of an artificial neural network (ANN). Total dissolved solids, electrical conductivity, pH, CO3, HCO3, chlorine (Cl), SO4, calcium (Ca), magnesium (Mg), sodium (Na), and potassium (K) were used as input variables and SAR as the output. For SVM and GP regression, two kernel functions (the radial basis function [RBF] kernel and the Pearson VII function kernel [PUK]) were used. The results suggest that the ANN model (correlation coefficient [CC], root mean square error [RMSE], Nash–Sutcliffe coefficient of efficiency [NSC], and mean absolute relative error [MARE] = 0.9966, 0.0286, 0.9906, and 0.0194, respectively) is more precise than the GP model (CC, RMSE, NSC, and MARE = 0.9570, 0.2982, 0.8288, and 0.3705) and the SVM model (CC, RMSE, NSC, and MARE = 0.9948, 0.0365, 0.9847, and 0.063). Of GP and SVM, SVM with the PUK kernel is more accurate for estimating the SAR of the watershed. Thus, ANN is a technique that could be used for predicting the SAR in the given study area.


Introduction
Freshwater is an essential and finite resource for agriculture, human life, and industry. The sustainable development of any country or region depends on an adequate quantity and quality of freshwater. Water quality control and assessment is a major issue in water resource planning and management (Bartram & Ballance, 1996). It is also an important part of the design and operation of irrigation systems, and it determines the cropping pattern of a region (Suarez, 1981). Over the last few decades, urbanization and industrialization have accelerated, and the quality and quantity of freshwater have declined as a result. A large amount of domestic and industrial waste is dumped into freshwater resources (Palmer, 2001). This generates a large amount of wastewater, which ultimately enters the irrigation system and affects the crop productivity of the region.
The sodium absorption ratio (SAR) of irrigation water is one of the most common indicators of its suitability for irrigation (Sposito & Mattigod, 1977). The SAR is calculated from the measured concentrations of sodium, calcium, and magnesium in the water used for irrigation (Weiner, 2012). High sodium levels decrease the infiltration rate of the soil, reduce soil stability, increase sodium accumulation in leaf tissue (Micke, 1996), and have toxic effects on crops. Normal soil may become saline when sodium-affected water is used for irrigation (Saha et al., 2017).
The higher the sodium concentration, the lower the hydraulic conductivity and infiltration rate of the soil. These can decrease to a level at which only a minimal amount of water reaches the crop, which ultimately reduces the crop yield (Bouwer & Idelovitch, 1987). Excess SAR also damages the leaves of crops such as stone fruits, avocado, and almond (Bouwer & Idelovitch, 1987), and it reduces the permeability of the soil. Soil particles disperse because exchangeable sodium in the soil replaces the magnesium and calcium adsorbed on the soil clay (Asadollahfardi, Khodadadi, & Gharayloo, 2010). Ayers and Tanji (1981) studied the SAR and grouped its hazardous effects into three categories: (1) SAR below 3, no sodium problems; (2) SAR between 3 and 9, lesser sodium problems; and (3) SAR above 9, greater sodium problems.
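The three categories attributed to Ayers and Tanji (1981) above can be written as a small lookup. This is only an illustrative sketch: the function name, the category labels, and the treatment of the boundary values 3 and 9 (assigned here to the middle category) are assumptions, since the text does not state how the boundaries are handled.

```python
def sar_hazard_category(sar):
    """Map a SAR value to one of the three hazard categories described in the text.

    Boundary handling (3 and 9 fall in the middle category) is an assumption.
    """
    if sar < 0:
        raise ValueError("SAR cannot be negative")
    if sar < 3:
        return "no sodium problems"
    if sar <= 9:
        return "lesser sodium problems"
    return "greater sodium problems"


# Hypothetical sample values, chosen only to exercise each branch.
for value in (2.44, 5.0, 12.0):
    print(value, sar_hazard_category(value))
```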

GP regression
GP regression rests on the assumption that nearby observations share information, and it is a way of specifying a prior directly over function space. A GP is a generalization of the Gaussian distribution: whereas a Gaussian distribution is specified by a mean vector and a covariance matrix, a GP is specified by a mean function and a covariance function. Because prior knowledge of the data and of the functional dependence is built in, separate validation for generalization is not essential. GP regression models can also provide the predictive distribution corresponding to the test input data (Rasmussen & Williams, 2006).
A GP is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. Let $u$ and $v$ denote the input and output domains, respectively, from which $x$ pairs $(g_i, h_i)$ are drawn independently and identically distributed. For regression, it is assumed that $h \subseteq \Re$; a GP on $u$ is then defined by a mean function $v_0: u \to \Re$ and a covariance function $\mu: u \times u \to \Re$. Readers are referred to Kuss (2006) for the details of GP.
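To make the role of the mean and covariance functions concrete, the following minimal sketch computes the posterior mean of a GP at new inputs. It assumes a zero mean function, an RBF covariance function, and additive Gaussian noise; these choices, and all function and parameter names, are illustrative and are not taken from the study, which used Weka.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF covariance between two equal-length input vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def gp_predict_mean(train_x, train_y, test_x, gamma=1.0, noise=1e-6):
    """Posterior mean of a zero-mean GP: k_star^T (K + noise * I)^-1 y."""
    n = len(train_x)
    cov = [[rbf_kernel(train_x[i], train_x[j], gamma) + (noise if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    alpha = solve_linear_system(cov, train_y)
    return [sum(rbf_kernel(x_star, train_x[i], gamma) * alpha[i] for i in range(n))
            for x_star in test_x]
```

With a very small noise term the posterior mean passes almost exactly through the training targets, and far from the training inputs it falls back to the assumed zero prior mean, which illustrates the statement that nearby observations share information.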

SVM
This method was first proposed by Vapnik (1998) and is based on statistical learning theory, or VC theory. The present form of the SVM was developed by Vapnik and coworkers at the AT&T laboratory (Boser, Guyon, & Vapnik, 1992). The main principle of SVM is the optimal separation of classes. For separable classes, SVM selects, from the infinite number of possible linear classifiers, the one with the lowest generalization error, or sets an upper limit on the error through structural risk minimization. The selected hyperplane is the one with the maximum margin between the two classes, where the margin is the sum of the distances from the hyperplane to the nearest points of the two classes. Readers are referred to Smola (1996) for the details of SVM. Cortes and Vapnik (1995) introduced the idea of the kernel function for nonlinear support vector regression. The main idea of SVM is to minimize the error by identifying the hyperplane that maximizes the margin, while tolerating part of the error. The main advantage of the SVM is that it avoids the difficulty of using linear functions in a high-dimensional feature space, and the optimization problem is transformed into a dual convex quadratic program. The flowchart of the SVM is shown in Figure 1.
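The idea that part of the error is tolerated is commonly expressed through an ε-insensitive loss combined with a penalty parameter C. The sketch below assumes that standard formulation with a plain linear model for readability; the function names and the use of a linear rather than a kernel model are illustrative assumptions, not the configuration used in the study.

```python
def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Zero inside the tolerance band, linear in the excess error outside it."""
    return max(0.0, abs(observed - predicted) - epsilon)


def linear_predict(weights, bias, features):
    """Linear model output: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def svr_objective(weights, bias, samples, targets, c, epsilon):
    """Primal objective: 0.5 * ||w||^2 + C * sum of epsilon-insensitive losses."""
    flatness = 0.5 * sum(w * w for w in weights)
    total_loss = sum(
        epsilon_insensitive_loss(target, linear_predict(weights, bias, x), epsilon)
        for x, target in zip(samples, targets)
    )
    return flatness + c * total_loss
```

A larger C penalizes errors outside the tolerance band more heavily, while a larger ε widens the band within which errors are ignored.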

ANN
ANN is one of the most widely used data-driven techniques and is inspired by the functioning of the nervous system and the architecture of the brain (Haykin, 2009). An ANN has one input layer, one or more hidden layers, and one output layer. Each layer consists of a number of nodes, and weighted connections link the nodes of adjacent layers. The input layer has as many nodes as there are input parameters; it distributes the data presented to the network and does no processing. It is followed by one or more hidden layers, which process the data. The output layer is the final processing unit. When an input value is presented to the input layer, it passes through the interconnections between nodes, where the values are multiplied by the corresponding weights and summed to obtain the net output $z_j$ to the unit:
$$z_j = \sum_{i} W_{ij}\, y_i$$
where $W_{ij}$ is the weight of the interconnection from unit $i$ to unit $j$, $y_i$ is the input value at the input layer, and $z_j$ is the value to which the activation function is applied to produce the output of unit $j$. A detailed discussion of neural networks is provided by Haykin (1999). In the present analysis, a three-layer feed-forward multilayer perceptron ANN based on the back-propagation algorithm is used.
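The weighted-sum-then-activation step described above can be sketched as a forward pass through a three-layer perceptron. The sigmoid hidden activation, the linear output unit, the bias terms, and all names below are assumptions made for illustration; the text does not specify them, and training by back propagation is not shown.

```python
import math


def sigmoid(t):
    """Logistic activation (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-t))


def layer_forward(inputs, weights, biases, activation):
    """weights[i][j] is the weight W_ij from unit i to unit j."""
    outputs = []
    for j in range(len(biases)):
        # Net value z_j: weighted sum of the incoming values plus an assumed bias.
        z_j = sum(weights[i][j] * inputs[i] for i in range(len(inputs))) + biases[j]
        outputs.append(activation(z_j))
    return outputs


def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> one sigmoid hidden layer -> linear output layer."""
    hidden = layer_forward(inputs, w_hidden, b_hidden, sigmoid)
    return layer_forward(hidden, w_out, b_out, lambda t: t)
```

For a SAR model of the kind described in this paper, `inputs` would hold the eleven water quality parameters and the output layer would have a single unit.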

SAR calculation
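The SAR was calculated for the three watersheds using the following formula (U.S. Salinity Laboratory Staff, 1954):
$$\mathrm{SAR} = \frac{\mathrm{Na}^{+}}{\sqrt{\left(\mathrm{Ca}^{2+} + \mathrm{Mg}^{2+}\right)/2}}$$
where Na$^{+}$, Mg$^{2+}$, and Ca$^{2+}$ are in milliequivalents per liter (mEq/l).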
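As a worked illustration, the sketch below computes the SAR with the widely used U.S. Salinity Laboratory form, SAR = Na / sqrt((Ca + Mg) / 2), with all concentrations in mEq/l. The function name and the sample concentrations are hypothetical and are not taken from the study's dataset.

```python
import math


def sodium_absorption_ratio(na, ca, mg):
    """SAR from Na+, Ca2+, and Mg2+ concentrations, all in mEq/l."""
    if ca + mg <= 0:
        raise ValueError("Ca + Mg must be positive")
    return na / math.sqrt((ca + mg) / 2.0)


# Hypothetical sample: Na = 2, Ca = 3, Mg = 5 mEq/l gives sqrt((3 + 5) / 2) = 2.
print(sodium_absorption_ratio(2.0, 3.0, 5.0))
```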

Study area and dataset
The study area (the Khorramabad, Piranshahr, and Alashtar sub-watersheds) covers part of Lorestan province, located in the center of Iran (Figure 2). The sub-watersheds are part of the Karkheh River basin (Persian Gulf drainage basin). The study area is located between 33°11′47″ N and 34°03′27″ N, and between 48°03′10″ E and 48°59′07″ E. The sub-watersheds are defined according to the position of the sampling sites and cover an area of 3576 km² as shown in Figure 2. The area is part of the semiarid region and has a mean annual rainfall of about 400 mm, with the highest precipitation occurring in January and February. Elevation varies from 1158 to 3646 m a.s.l. The upper part of the study area consists of mountains built mainly of Cretaceous and Miocene limestone, and the lower part is an alluvial plain. The dataset consists of 775 water quality observations (training and testing) for the three watersheds (Alashtar, Piranshahr, and Khorramabad) in Iran from May 1971 to May 2017. The parameters studied in this investigation are total dissolved solids, electrical conductivity, pH, CO3, HCO3, chlorine (Cl), SO4, calcium (Ca), magnesium (Mg), sodium (Na), potassium (K), and SAR. SAR was used as the output variable and all the remaining parameters as input variables. Table 1 shows the characteristics of the parameters in the training and testing stages. The SAR values lie between 0.029 and 2.44. Because these values are below 3, the water does not pose any serious problem to the crops or the agricultural system (Ayers & Tanji, 1981).

Detail of kernel functions for SVM and GP
The design of SVM- and GP-based regression approaches involves the choice of a kernel function (Pal & Deswal, 2011; Rasmussen & Williams, 2006). Several kernel functions are available for GP and SVM. In this study, two kernel functions were used with both techniques:
(1) Radial basis function (RBF) kernel: $K(a, b) = e^{-\gamma |a - b|^{2}}$
(2) Pearson VII function kernel (PUK): $K(a, b) = 1 \Big/ \left[1 + \left(2\sqrt{|a - b|^{2}}\,\sqrt{2^{(1/\omega)} - 1}\,/\,\sigma\right)^{2}\right]^{\omega}$ (Üstün, Melssen, & Buydens, 2006)
where γ, σ, and ω are kernel parameters. It is well known that the estimation performance of GP and SVM depends on a good setting of the meta-parameters: Gaussian noise, C, γ, σ, and ω. The selection of these parameters controls the complexity of the prediction (regression) model. In this study, a manual (trial-and-error) procedure was used to select the user-defined parameters (i.e., C, γ, σ, ω, and Gaussian noise). Suitable values were selected so as to minimize the RMSE and maximize the correlation coefficient (CC). The same kernel-specific parameters were used for GP regression and for SVM.
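The two kernels can be sketched in a few lines. The RBF form follows the expression in the text; the PUK form is assumed to be the commonly cited Pearson VII expression of Üstün, Melssen, and Buydens (2006), 1 / [1 + (2 |a − b| sqrt(2^(1/ω) − 1) / σ)²]^ω. Function names are illustrative.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance |a - b|^2 between two input vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * |a - b|^2)."""
    return math.exp(-gamma * squared_distance(a, b))


def puk_kernel(a, b, sigma, omega):
    """Pearson VII function kernel (assumed standard form, see lead-in)."""
    dist = math.sqrt(squared_distance(a, b))
    ratio = 2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + ratio ** 2) ** omega
```

Both kernels equal 1 for identical inputs and decay as the inputs move apart; in the PUK kernel, σ controls the width of the peak and ω the shape of its tails.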

Statistical performance evaluation criteria
CC, RMSE, Nash-Sutcliffe coefficient of efficiency (NSC), and mean absolute relative error (MARE) values were calculated to investigate the performance of GP, SVM, and ANN techniques.

CC
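The CC is computed as
$$\mathrm{CC} = \frac{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^{2}}}$$
where $O_i$ and $P_i$ are the observed and predicted SAR values, $\bar{O}$ and $\bar{P}$ are their means, and $n$ is the number of observations.

RMSE
The RMSE is computed as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^{2}}$$

NSC
The NSC (Nash & Sutcliffe, 1970) is computed as
$$\mathrm{NSC} = 1 - \frac{\sum_{i=1}^{n}\left(O_i - P_i\right)^{2}}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^{2}}$$

MARE
The MARE is computed as
$$\mathrm{MARE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|O_i - P_i\right|}{O_i}$$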
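The four criteria can be computed as in the sketch below, which assumes their standard definitions (Pearson correlation for CC, and the usual forms of RMSE, Nash–Sutcliffe efficiency, and mean absolute relative error). The function names and the sample values are hypothetical and are not results from the study.

```python
import math


def correlation_coefficient(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / (math.sqrt(var_o) * math.sqrt(var_p))


def rmse(observed, predicted):
    """Root mean square error."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe coefficient of efficiency (1 is a perfect fit)."""
    mean_o = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - residual / spread


def mare(observed, predicted):
    """Mean absolute relative error; observed values must be non-zero."""
    relative = sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted))
    return relative / len(observed)


# Hypothetical values for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
print(correlation_coefficient(obs, pred), rmse(obs, pred),
      nash_sutcliffe(obs, pred), mare(obs, pred))
```

Higher CC and NSC and lower RMSE and MARE indicate a better fit, which is how the models are ranked in the results sections.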

Results of GP
The user-defined parameters for the GP are the Gaussian noise, γ, σ, and ω; they were found by trial and error using Weka 3.9. Two kernel functions, RBF and PUK, were used in this investigation. The values of the user-defined parameters for GP are summarized in Table 2. The Gaussian noise was kept the same for both kernels. Table 3 reports the performance of the GP model in terms of the performance evaluation parameters (CC, RMSE, MARE, and NSC). Table 3 indicates that, with the testing dataset, the values for the PUK kernel (CC, RMSE, NSC, and MARE = 0.9570, 0.2982, 0.8288, and 0.3705, respectively) are better than those for the RBF kernel (CC, RMSE, NSC, and MARE = 0.8931, 0.1870, 0.5988, and 0.2385, respectively). Figure 3 shows the performance of the GP model. The outcomes in Table 3 and Figure 3 show that the PUK kernel gave better results than the RBF kernel. Hence, GP with the PUK kernel was selected for further comparison.

Result of SVM
The user-defined parameters for the SVM are C, γ, σ, and ω; they were also found by trial and error using Weka 3.9. As with GP, two kernel functions (RBF and PUK) were used. Table 4 gives the values of the user-defined parameters for the SVM model. Table 5 summarizes the performance of the SVM model using two performance evaluation parameters, i.e., CC and RMSE. It is clear from Table 5 that SVM with the PUK kernel gave better results than SVM with the RBF kernel. With the testing dataset, the values of CC, RMSE, NSC, and MARE were 0.9948, 0.0365, 0.9847, and 0.063 for PUK and 0.9509, 0.0992, 0.8873, and 0.1328 for RBF, respectively. Figure 4 shows the performance of the SVM model. The outcomes in Table 5 and Figure 4 show that the results of SVM with the PUK kernel were superior to those with the RBF kernel. Hence, SVM with the PUK kernel was selected for further comparison.
Table 2. Details of the user-defined functions for GP.

Comparison of results
In this section, the best-selected kernel functions of the SVM and GP models are compared with the ANN model. Figure 5 shows the performance of the ANN model and of the best-selected kernels of the GP and SVM models. It is clear from Figure 5 that the output of the ANN model follows the same trend as the actual values of the SAR. Hence, the ANN model is the most accurate for predicting the SAR in the given study area. Further, the best-fitted model (ANN) was compared with a recent study (Seilsepour & Rashidi, 2008).

Kernel function    User-defined functions
RBF kernel         C = 3, γ = 0.8
PUK kernel         C = 3, ω = 0.8, σ = 6.8

Conclusions
The SAR is an indicator used to assess the suitability of irrigation water for crops and the possible hazards to the soil. In this investigation, two data-driven models, GP and SVM (with two kernels each, i.e., RBF and PUK), were used to predict the SAR, and the results were compared with those of a third widely used data-driven method (ANN). The results indicate that the ANN model predicts the SAR more accurately than GP and SVM for the given study area. Of SVM and GP, the SVM model gave more precise results. Similarly, of the two kernel functions, the PUK kernel gave better results than the RBF kernel for both techniques. The Seilsepour and Rashidi model also failed to predict the SAR. Thus, the ANN model was the most suitable model for predicting the SAR in the given study area.

Disclosure statement
No potential conflict of interest was reported by the author.