Estimating longitudinal dispersion coefficient in natural streams using empirical models and machine learning algorithms

The longitudinal dispersion coefficient (LDC) plays an important role in modeling the transport of pollutants and sediment in natural rivers. As a result of transportation processes, the concentration of pollutants changes along the river. Various studies have been conducted to provide simple equa-tionsforestimatingLDC.Inthisstudy,machinelearningmethods,namelysupportvectorregression, Gaussianprocessregression,M5modeltree(M5P)andrandomforest,andmultiplelinearregressionwereexaminedinpredictingtheLDCinnaturalstreams.Datasetsfrom60riversaroundtheworld withdifferenthydraulicandgeometricfeaturesweregatheredtodevelopmodelsforLDCestima-tion.Statisticalcriteria,includingcorrelationcoefficient(CC),rootmeansquarederror(RMSE)and meanabsoluteerror(MAE),wereusedtoscrutinizethemodels.TheLDCvaluesestimatedbythesemodelswerecomparedwiththecorrespondingresultsofcommonempiricalmodels.TheTaylor chartwasusedtoevaluatethemodelsandtheresultsshowedthatamongthemachinelearningmodels,M5Phadsuperiorperformance,withCCof0.823,RMSEof454.9andMAEof380.9.Themodel ofSahayandDutta,withCCof0.795,RMSEof460.7andMAEof306.1,gavemorepreciseresultsthantheotherempiricalmodels.ThemainadvantageofM5Pmodelsistheirabilitytoprovidepractical formulae.Inconclusion,theresultsprovedthatthedevelopedM5Pmodelwithsimpleformulationswassuperiortoothermachinelearningmodelsandempiricalmodels;therefore,itcanbeusedasa propertoolforestimatingtheLDCinrivers.


Introduction
In the past few decades, variations in water resource systems have attracted attention all around the world. From the viewpoint of community health, the most significant and vital areas are cities where rivers provide drinking water and factories are located adjacent to these streams (Pourabadei & Kashefipour, 2007;Tayfour & Singh, 2005). Therefore, the transportation of pollutants in natural streams is very important in water resource management. The direct estimation of the longitudinal dispersion coefficient (LDC) by experimental methods requires costly and time-consuming studies. Since different parameters cause complexity in the process of mixing, estimation of the LDC becomes a challenging task. This necessitates accurate knowledge and data, i.e. a wide range of variables such as the geometry of channel and CONTACT Shahaboddin Shamshirband shahaboddin.shamshirband@tdtu.edu.vn the variation in the velocity in the cross-section, on the transmission and mixing of contaminants in the river and the ability of the river flow to transport these materials (Chau, 2000;Pourabadei & Kashefipour, 2007). The dispersion issue applies to mixing in natural rivers as well as in open channels (Zeng & Huai, 2014). When pollutants are discharged into natural streams, they move with the flow and mixing occurs in three stages (Jirka, 2004). In the first step, the pollutant is rapidly mixed in the vertical direction. Lateral mixing takes place in the second stage and the pollutant is distributed sporadically. In the last step, the pollutant is dispersed longitudinally, as a result of lateral variation in the longitudinal velocity. For water quality analysis, a one-dimensional model is used, which includes the last stage, and its severity can be determined by the LDC, which is a key factor in modeling and estimating the distribution of sediment and pollutants in water (Kashefipour & Falconer, 2002).
The LDC values in rivers may be estimated by several empirical equations, which are valid only within a specific range of flow conditions and geometry. When real data on this process are available for the river, i.e. data sets such as the mean (U) and shearing velocity (U * ), channel width (W), depth of water (H) and channel slope (S), the LDC can be determined readily (Julínek & Říha, 2017;Wang, Huai, & Wang, 2017). Several methods have evolved to estimate the LDC value. Julínek and Říha (2017) used fluorescein dye as a tracer in an open channel to determining the LDC value. Their results were compared with values gained by the earlier empirical formula and showed good agreement with the aforementioned studies (e.g., Jirka (2004)). An artificial neural network (ANN) model was established by Sahay (2011) for predicting the LDC in rivers. The results of the ANN demonstrated that it had higher performance than other methods. Noori, Deng, Kiaghadi, and Kachoosangi (2015) used three methods, namely ANN, adaptive neuro-fuzzy inference system (ANFIS) and support vector machine (SVM), for LDC estimation in natural streams. A high degree of doubt was found in the models, while the LDC estimated by the SVM method had a smaller error compared with the ANN and ANFIS models. Azamathulla and Ghani (2011) used a genetic programming (GP) method to estimate the LDC in streams and found that GP provided more accurate predictions than the empirical models. For prediction of LDC, an ANFIS method was used by Riahi-Madvar, Ayyoubzadeh, Khadangi, and Ebadzadeh (2009), who implemented several statistical methods for scrutinizing the model. The results showed that ANFIS is superior to the empirical models in estimating LDC. Machine learning algorithms have been used in several fields of environmental and water resource engineering (Alizadeh, Jafari Nodoushan, Kalarestaghi, & Chau, 2017;Chen & Chau, 2016;Dehghani et al., 2019;Esmaeilzadeh, Sattari, & Samadianfard, 2017;Houichi, Dechemi, Heddam, & Achour, 2012;Mosavi, Ozturk, & Chau, 2018;Olyaie, Banejad, Chau, & Melesse, 2015;Qasem et al., 2019aQasem et al., , 2019bSamadianfard et al., 2019b;Samadianfard, Delirhasannia, Torabi Azad, Samadianfard, & Jeihouni, 2016;Samadianfard, Ghorbani, & Mohammadi, 2018;Zhu et al., 2019).
The importance of the LDC in the transport of pollutants along rivers and its dependence on hydrodynamic and geometric parameters has motivated many researchers to estimate this coefficient (Bencala & Walters, 1983;Etemad-Shahidi & Taghipour, 2012;Fischer, 1979;Kashefipour & Falconer, 2002;McQuivey & Keefer, 1974;Rutherford, 1994;Seo & Cheong, 1998;Wang et al., 2017). To provide a satisfactory estimation of LDC, different empirical formulae have been presented. The accurate estimation of LDC enables accurate modeling of pollutant concentrations along rivers and streams (Kashefipour & Falconer, 2002). Derivation of empirical formulae for the LDC is based on the -Buckingham theory (Seo & Cheong, 1998). This classic procedure is used for most complex hydraulic problems when the theory is incomplete to allow accurate and/or analytical study.
In the current work, attempts have been made to predict the LDC using the non-dimensional parameters obtained by the -Buckingham theory and machine learning algorithms. To achieve this aim, data sets from 60 rivers around the world with different hydraulic and geometric features were gathered and separated as training (67%) and testing data (33%) to develop the models. In this regard, the main contribution of the current research was utilizing the machine learning and data-driven algorithms to improve the precision of LDC estimation. Thus, the applicability of Gaussian process regression (GPR), support vector regression (SVR), M5 model tree (M5P), random forest (RF) and multiple linear regression (MLR) was examined and their results were compared with the outputs of common empirical models. To the best of our knowledge, the application of GPR and RF has not been reported in the literature for estimating the LDC in natural streams.

Theory of dispersion
The water quality in rivers is affected by pollutants and their distribution along the riverine flows. Nonuniformity in the geometry of natural streams, along with the effects of shear stresses and flow turbulence, result in a complex flow field (Wang et al., 2017). After the completion of cross-sectional mixing, the following onedimensional unsteady advection-dispersion equation is extensively used to predict the water quality in rivers: where C is the cross-sectional averaged concentration, U is the cross-sectional averaged velocity, K is the LDC, S is the source term, and x and t are the mean flow direction and time, respectively. According to Equation (1), the main transport processes are advection and dispersion. In the mixing process, the pollutant is diffused owing to the velocity differences over the cross-section. By transporting the pollutant downstream, turbulent diffusion causes complete mixing of the pollutant and then the concentration of pollutants along the river depends mainly on the LDC. Hence, the LDC is an essential parameter in predicting the solute concentration in the flow direction (Fischer, 1979). Based on the Taylor (1954) study, shear velocity and turbulence have the main effects on the mixing intensity, and the combination of longitudinal advection and longitudinal mixing can result in the LDC. The effects of hydrodynamic and geometric parameters on the LDC indicate its variability in different streams and rivers.

Experimental data
The data used in the current research were measured from over 50 rivers in the USA and the UK, gathered from different studies (Fischer, 1968;Graf, 1995;McQuivey & Keefer, 1974;Nordin & Sabol, 1974;Rutherford, 1994). Thus, 147 sets of data, the statistical characteristics of which are presented in Table 1, were used in the current study. In Table 1, W, H, U, U * , W/H, U/U * , K and K/(HU * ) denote the channel width, depth of water, mean velocity, shear velocity, ratio of channel width to depth of water, ratio of mean velocity to shear velocity; LDC and non-dimensional LDC, respectively. The frequency distributions of all these parameters are illustrated in Figure 1.

Empirical models
Six empirical models used for estimation of the LDC in rivers are presented in Table 2. These empirical models may only be applicable within a range of specific flow and geometry, and may not provide appropriate results for other ranges.

M5 model tree (M5P)
The M5P algorithm, first introduced by Quinlan (1992), is an extended version of the M5 algorithm. Model trees can consider a set of data with a large number of features and sizes, and can work with a high degree of efficiency. The M5P algorithm contains four stages. The first stage is building a tree by dividing the input space into numerous subspaces. The variation in intra-space from root to node is lessened by using some attributes and division criteria. To measure the subspace variability, values of the standard deviation for each node are utilized. By using the standard deviation reduction (SDR) factor, the tree is built. This method remarkably reduces the expected errors in the node using the following equation (Behnood, Behnood, Gharehveran, & Alyamac, 2017): where sd denotes the standard deviation, T is a set of examples which reach the node, and T i is the outcomes of the node division pursuant to the attributes (Wang & Witten, 1997).
In the second step, the linear regression model is advanced in each of the subspaces for each node. Then, the pruning method is used to overcome the problem of overtraining, which happens when the correspondent SDR value of the linear model becomes lower than the predetermined error. The adjacent linear model can show severe disturbances in the results of pruning. This can mostly happen for models that are constructed from a small amount of training data, but can be balanced by smoothing in the last step. In the smoothing procedure, to create the last model of the leaf, all models from the leaf to the root are combined.

Support vector regression (SVR)
SVR evolved from SVMs, which were created by Vapnik (1995Vapnik ( , 1998, and has been used in many hydrological applications (Choubin et al., 2019). This approach is a data-based method and it deals with the predicted problems and the structural risk minimization principle (Pai & Hong, 2007). Achieving a regression model with suitable predictive performance is the main goal of SVR. Variables in the SVR model comprise ( in which x i is the input parameter, y i is the output parameter and the total number of data is represented by N. The SVR is expressed as (Kalteh, 2013):  where w is a weight vector, b is a threshold value, and ϕ(x) is a nonlinear mapping variable. Input patterns are designed in a large space; therefore, in the mapped space the model can be linearly regressed. In the SVR model, the optimal amounts of w and b are computed by the following formula: where C is the penalty parameter and n is the sample size.

Gaussian process regression (GPR)
GPR is defined as a set of random variables in which each variable has a common Gaussian distribution. To represent the relationship between inputs (x) and outputs (y), the f function should be defined and modeled for each possible entry. To achieve a model between x and y, a GPR model is constructed as the regression function and the L noise term (ε ∼ N(0, σ 2 n )) is used in this function: where σ n is the standard deviation of the noise. This can be completely determined by a mean m(x) and a covariance k(x, x ): where m(x) = 0 is assumed to facilitate the computation and there are different choices for k. The covariance function, which is known as the kernel function, is a linear separator and is used to obtain the connection between the input and output of the model. If points are moved to higher spaces, their internal multiplication (k) will be changed too. Selecting a suitable kernel function based on assumptions such as smoothness and possible patterns in the data is highly significant. The kernel functions used in this study are the polynomial kernel, the normalized poly kernel, the radial basis function or Gaussian kernel (RBF) and the Pearson universal kernel. In this section, the GPR modeling method has been introduced briefly; a more detailed explanation is given in Rasmussen and Williams (2006).

Random forest (RF)
RF is a series of complex relationships that are able to consider the interaction between predictors and responses without any relationships between them by including decision trees (Breiman, 2001). Each of the component trees forms an RF using available data. For each tree, a subset of predictions is chosen with the same chance. By combining and averaging the single predictions of all compounding trees, the predictive output is achieved. The RF algorithm consists of two random levels in each tree. The first step is bagging and the second is selection of the features randomly; studies have indicated that the performance of this model is superior to other models (Archer & Kimes, 2008;Woznicki, Baynes, Panlasigui, Mehaffey, & Neale, 2019). RF, without any assumptions about independent or dependent variables, explains both linear and nonlinear relationships.

Performance criteria
To statistically examine the performance of the models created in the current study, the correlation coefficient (CC), root mean squared error (RMSE) and mean absolute error (MAE) were used. The mathematical representations are cited as follows (Kargar, Sadegh Safari, Mohammadi, & Samadianfard, 2019;Samadianfard et al., 2019a): where O i and P i are the measured and estimated value of the dispersion coefficient,Ō is the mean of measured O, and N represents the number of data. In addition, Taylor diagrams (Taylor, 2001) were used to check the precision of the implemented models and empirical models for LDC estimation in natural rivers. It is notable that in these diagrams, measured and some corresponding statistical parameters are presented simultaneously. Moreover, different points on a polar plot are used in Taylor diagrams to investigate the differences between measured and estimated values. The CC and normalized standard deviation are specified by the azimuth angle and radial distances from the base point, respectively (Taylor, 2001).

Results and discussion
The capabilities of machine learning and data-driven algorithms, such as GPR, SVR, M5P and RF, in estimating LDC values in different streams were compared with the potential of common empirical models. For this purpose, the hydraulic parameters of several streams in different geographic locations, including channel width, depth of water, mean velocity, shear velocity and LDC, were gathered. In the current study, the whole data set including the hydraulic and geometric properties of 147 streams was randomly partitioned into training (67%) and testing data (33%), and the early stopping method (Graf, Zhu, & Sivakumar, 2019;Piotrowski & Napiorkowski, 2013)    are lower only than GPR-3, RF and MLR. This implies that SVR-3 and M5P are able to attain more accurate performance than the empirical models. From the statistical metrics presented in Tables 3 and 4, it can be concluded that the accuracy of M5P far exceeds the mentioned models and empirical models. In terms of the practical value of the M5P model, the resulting explicit formulations can provide a powerful and easy tool for accurate estimation of LDC values. Scatterplots of the observed values of LDC and the corresponding values estimated by the studied methods and empirical models are shown in Figure 2. It is clear from Figure 2 that the estimates of M5P are less scattered across the bisection line. So, the estimates of M5P are much closer to the bisection line than those of the data-driven algorithms and empirical models. Figure 3 presents the measured and estimated values of LDC in the testing phase and Figure 4 illustrates the boxplots of the models. In accordance with the concluding remarks on the scatterplots and boxplots, it is obvious that the estimates of M5P are in better agreement with the measured LDC values. Additional evaluation of measured and estimated     LDC values by GPR-3, SVR-3, M5P, RF and S-D was carried out and Figure 5 presents the obtained results. Figure 5 is an instrumental tool in understanding the different potential of the studied models. In the Taylor diagram, an accurate model is marked by the reference point with a correlation coefficient of 1 and the same amplitude of variation as the observations. According to the lower distance from the measured point (the green point in Figure 5) to the correspondent point of the M5P (the cyan point), the estimates of M5P were more accurate than the results of the other models. In other words, the lower distance of the M5P correspondent point to the observed point clearly indicates the high potential and capacities of M5P for providing accurate predictions of the LDC. The tree structure resulting from the M5P model, as the most accurate model, is displayed in Figure 6. This illustrated tree model is based on the characteristics of used streams, and two linear equations are applied for LDC estimation, while the other data-driven models studied do not have such accuracy and capability. Being able to obtain explicit formulae for the LDC in   M5P is another advantage of this model. Explicit formulae make it possible to evaluate the relative importance of the effective dimensionless parameters on the LDC. The formulae indicate that the parameter U/U * has a great effect on LDC estimation, and the effect of this dimensionless parameter will be more significant in wide rivers.
Comparison of the obtained results with the ANN results of Tayfour and Singh (2005) showed that although the accuracy of M5P (with RMSE of 454.9), as the best studied model, is lower than the ANN developed by Tayfour and Singh (2005) (with RMSE of 193.0), the practical mathematical formulations of M5P mean that is highly applicable in LDC estimation.

Sensitivity analysis
To investigate the influence of input parameters on LDC estimation, CC, RMSE and MAE evaluation criteria were used for different input variables. For this purpose, the M5P was selected for sensitivity analysis, as this was the most accurate model in LDC estimation (Table 5). Each model confirmed the extent to which the eliminated variable would affect the model accuracy. It is clear from Table 5 that the precision of the M5P model decreased if either the W/H or U/U * input parameter was removed from the modeling. Furthermore, it can be seen that W/H has the most significant effect in increasing the prediction accuracy. In other words, eliminating W/H caused a sharp increase in the RMSE value.

Conclusion
In this study, various machine learning algorithms, including GPR, SVR, M5P and RF, were used to estimate LDC values in natural streams and rivers. LDC values can be estimated using flow variables and geometric characteristics of the channel. So, in the current research, W H and U U * were considered as input parameters and K U * H as an output parameter. The performances of the models were evaluated based on error measures of CC, RMSE, MAE and the Taylor diagram. The results indicated that although GPR-3 and SVR-3 using the Pearson VII function-based kernel and M5P showed satisfactory performance, M5P provided the most accurate estimations of LDC. Furthermore, among six common empirical models that were implemented in the current study, the S-D model gave the best result. In conclusion, the developed M5P model outperformed others in terms of accuracy and it is recommended for LDC estimation. In addition, the importance of the dimensionless parameter U/U* on LDC estimation, especially for wide rivers, was reported based on the findings of the current research. Finally, other hybrid machine learning algorithms could be implemented to test their capabilities in order to investigate possible improvements in the accuracy of LDC estimations.