Estimation of natural streams longitudinal dispersion coefficient using hybrid evolutionary machine learning model

Among several indicators of river engineering sustainability, the longitudinal dispersion coefficient (Kx) is the main parameter defining the transport of pollutants in natural streams. Accurate estimation of Kx has been challenging for hydrologists due to the high stochasticity and non-linearity of this hydraulic-environmental parameter. This study presents a new hybrid machine learning (ML) model integrating Gaussian Process Regression (GPR) with an evolutionary feature selection (FS) approach (i.e. the Covariance Matrix Adaptation Evolution Strategy (CMAES)) to estimate Kx in natural streams. The dataset consists of geometric and hydraulic river system parameters from 29 streams in the United States. The modeling results showed that the proposed model outperformed other models in the literature, producing more stable and accurate estimations. The FS approach evidenced the significance of the cross-sectional average flow velocity (U), channel width (B), and channel sinuosity (σ) for estimating the dispersion coefficient. In quantitative terms, the integrated GPR model with the feature selection approach attained the minimum root mean square error (RMSE = 48.67) and the maximum coefficient of determination (R2 = 0.95). The proposed hybrid evolutionary ML model emerges as a robust, flexible, and reliable computer-aided alternative for predicting the longitudinal dispersion coefficient in natural streams.

ARTICLE HISTORY: Received 17 May 2021; Accepted 16 August 2021


Research background and K x empirical formulation
Various types of contaminants, from both point and non-point sources, are discharged into water bodies across the world, and these contaminants adversely affect water quality (Noori et al., 2007; Tenebe et al., 2016). Hence, providing an accurate pollution dispersion analysis in rivers is a challenging task for the authorities responsible for monitoring and managing water resources. Even though pollution can disseminate in all directions in rivers (Desiante et al., 2021; Park & Hwang, 2019; Tayfur & Singh, 2005), longitudinal dispersion is usually dominant at a considerable distance downstream of the point where the pollution is discharged into the water body (Riahi-Madvar et al., 2009; Steele, 1991). As per several environmental and hydrological investigations, the most critical parameter when studying the longitudinal transport of pollution in rivers is the longitudinal dispersion coefficient (K x ) (Baek, 2019; Carter & Okubo, 1972; Dehghani et al., 2020; Etemad-Shahidi & Taghipour, 2012; Farzadkhoo et al., 2018, 2019; Hamidifar et al., 2015; Kashefipour & Falconer, 2002; Wang et al., 2017). Accurate estimation of K x is essential for numerous practical applications, such as environmental engineering, river engineering, intake design, and the assessment of dangerous contaminant discharges into water bodies (M. J. Alizadeh, Ahmadyar, et al., 2017). Several models exist that rely on the advection-dispersion equation shown in Equation (1); however, the modeling results are highly affected by the nature of the K x (Camacho Suarez et al., 2019).

CONTACT: Zaher Mundher Yaseen (zaheryaseen88@gmail.com; yaseen@alayen.edu.iq)
The one-dimensional advection-dispersion equation is written as

∂C/∂t + u ∂C/∂x = K x ∂²C/∂x²  (1)

where K x is the longitudinal dispersion coefficient, t is the time, x and u are the longitudinal coordinate and the mean cross-sectional velocity, respectively, and C is the cross-sectionally averaged concentration (Noori et al., 2011). K x can be determined experimentally, empirically, or theoretically (Camacho Suarez et al., 2019; Data, 2019; Deng et al., 2001, 2002; Disley et al., 2015; Milišić et al., 2019; Perucca et al., 2009; Seo & Baek, 2004; Seo & Cheong, 1998; Shin et al., 2019; Shucksmith et al., 2011; Swamee et al., 2000; Wang & Huai, 2016; Zeng & Huai, 2014). The direct experimental determination of the dispersion coefficient demands cost- and time-intensive tracer studies and can only be performed with rectangular flume data (Etemad-Shahidi & Taghipour, 2012). Under certain conditions, K x in pipes can be formulated using the advection-diffusion equation subject to specific flow criteria (laminar or turbulent). For natural streams, however, the determination of K x lacks a complete theoretical understanding owing to the stochasticity of the phenomena involved. In addition, the transverse profiles of both flow depth and flow velocity are generally unknown (Deng et al., 2001; Paulson, 1970). Earlier predictive equations differ in their findings and carry certain levels of uncertainty, and thus much research has focused on the development of more accurate empirical formulations (EFs) for K x estimation (Altunkaynak, 2016; Nezaratian et al., 2018; Wang et al., 2017). Several studies over the past decade have been devoted to discovering the relationship between K x and hydraulic-geometric characteristics (Noori et al., 2021; Sahin, 2014; Shen et al., 2010; Tutmez & Yuceer, 2013); however, the presented EFs still show limitations in attaining high predictive performance.
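The behavior described by Equation (1) — downstream advection at velocity u combined with longitudinal spreading at rate K x — can be illustrated with a minimal explicit finite-difference sketch (the discretization, grid, and parameter values below are illustrative choices, not from the study):

```python
import numpy as np

def step_ade(C, u, Kx, dx, dt):
    """One explicit step of dC/dt + u*dC/dx = Kx*d2C/dx2, with upwind
    advection and central diffusion on a periodic grid (assumes u >= 0
    and a stable dt: u*dt/dx <= 1 and Kx*dt/dx**2 <= 0.5)."""
    adv = -u * (C - np.roll(C, 1)) / dx                      # upwind difference
    diff = Kx * (np.roll(C, -1) - 2 * C + np.roll(C, 1)) / dx**2
    return C + dt * (adv + diff)

# A pollutant pulse advects downstream and spreads out.
x = np.linspace(0, 100, 200, endpoint=False)
C = np.exp(-((x - 20) ** 2) / 4.0)
for _ in range(100):
    C = step_ade(C, u=0.5, Kx=1.0, dx=x[1] - x[0], dt=0.1)
print(abs(C.sum() - np.exp(-((x - 20) ** 2) / 4.0).sum()) < 1e-6)  # mass conserved
```

Both difference operators are in conservative form on the periodic grid, so total mass is preserved while the peak concentration decays, which is the qualitative behavior the dispersion coefficient controls.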

K x estimation using artificial intelligence models: literature review
Over the past decades, the implementation of computer-aided models has reported promising results in modeling hydraulic engineering problems (Sharafati et al., 2021). The importance of artificial intelligence (AI) machine learning models in environmental and hydrological applications has increased over the years (Yaseen et al., 2019). One of the earliest studies conducted on K x modeling in rivers and natural streams was based on an artificial neural network (ANN) model developed by Tayfur and Singh (2005). The evaluation of the model proved its capability in predicting the K x compared to the earlier suggested EFs. In another study, Tayfur (2006) established fuzzy set, ANN, and regression-based models for K x estimation in natural streams. The estimation results showed that the developed models outperformed the existing EFs, as they satisfactorily estimated the measured data with minimum errors. A fuzzy model was developed for K x estimation in natural channels by Fuat Toprak and Savci (2007). The study compared the performance of the developed fuzzy model with the measured data and the existing equations. The results indicated that the fuzzy model outperformed the other models in terms of reliability. Three ANN models based on different learning algorithms (i.e. the radial basis function neural network (RBFNN), the feedforward backpropagation neural network (FBNN), and the generalized regression neural network) were tested for K x estimation (Toprak & Cigizoglu, 2008). The research findings showed that the developed models were more accurate than the other existing EFs. The capacities of support vector regression (SVR) and adaptive neuro-fuzzy inference system (ANFIS) models were inspected for K x estimation in natural streams by Noori et al. (2009). The study found that the SVR model performed better than the ANFIS model, achieving better threshold statistics.
The FBNN model with a 2D convergent flow tracer transport model was developed for improving the evaluation of transverse and longitudinal dispersivities from a convergent flow tracer test (Shieh et al., 2010). The developed model required less computational time and offered more accurate values of the transport parameter. The study also confirmed the effectiveness of the developed model for achieving fast and accurate transverse and longitudinal dispersivities evaluation for a field convergent flow tracer test. A genetic programming (GP) model was developed for modeling K x (Tu et al., 2015) and the study corroborated the capacity of the GP model in comparison with EFs. The use of the M5 model tree for K x estimation was conducted by Etemad-Shahidi and Taghipour (2012). The model was compared to the other existing equations in terms of performance based on error measures. The developed model performed better than the existing formulas and could be a valuable tool for K x estimation. Two gene expression programming (GEP) models were examined for K x estimation based on 150 published data sets of hydraulic and geometric parameters in natural streams (Sattar & Gharabaghi, 2015). The analysis showed that the proposed relations were accurate, simple, and effective in K x estimation in natural streams.
Among the recent explorations of advanced versions of AI models, hybrid versions, where two or more models are integrated to improve the learning process, have received remarkable attention (Ghaemi et al., 2021; Tao, Al-Bedyry, et al., 2021; Tao, Habib, et al., 2021). The capacity of the particle swarm optimization (PSO) algorithm was explored to improve the performance of the neuro-fuzzy-based group method of data handling (NF-GMDH) model for K x estimation in rivers (Najafzadeh & Tafarojnoruz, 2016). The NF-GMDH-PSO model was compared with several AI models and traditional EFs in terms of accuracy. The proposed hybrid model evidenced its potential against the standalone AI models and the EFs. The reliability of ANN, ANFIS, and SVR models was studied for K x estimation in natural rivers (Noori et al., 2016). The study performed forward selection and the gamma test to sort the input parameters in the order of their effect and relevance on K x estimation. The study's outcome revealed less uncertainty in the SVR model compared to the ANN and ANFIS models for K x estimation in natural rivers. In addition, the performance of the ANFIS model was found to be better than that of the ANN model. A multivariate adaptive regression splines (MARS) model was developed for K x estimation in rivers by Haghiabi (2016). The performance of the MARS model was compared against the ANN model and some EFs, and the modeling outcome showed that the MARS model performed better than the ANN model and the EFs in terms of accuracy. In another study, several training algorithms, including the genetic algorithm (GA), imperialist competitive algorithm (ICA), bee algorithm (BA), cuckoo search (CS), and Levenberg-Marquardt (LM), were used to develop the learning process of the ANN model for K x estimation in rivers (M. . The evaluation of the predictive models showed that they could be successfully improved based on the integration of swarm and evolutionary algorithms.
However, the performance of the CS, ICA, and BA algorithms was better than that of the GA and LM for training the ANN model. The use of the Bayesian network (BN) for K x estimation in natural rivers was presented by M. J. Alizadeh, Shahheydari, et al. (2017). First, the study applied clustering as a pre-processing step to group the data into clusters of similar characteristics. The study showed that the developed model was a suitable tool for pollutant transport prediction in natural rivers.
The use of an evolutionary polynomial regression (EPR) model for accurate estimation of K x in rivers was presented by Balf et al. (2018). The modeling was based on the flow depth, channel width, average velocity, and shear velocity as input parameters. The K x estimations of the EPR model were compared with those of other conventional estimation formulas. Based on the obtained results, the introduced EPR model was found suitable for incorporation into one-dimensional water quality models for better solute concentration estimation in natural rivers. An ANFIS model combined with principal component analysis (PCA) was developed for K x estimation by Parsaie et al. (2018). The evaluation showed better accuracy of the ANFIS model compared to the experimental formulas.
A study was conducted based on the Pareto-Optimal-Multigene Genetic Programming (POMGGP) equation for K x estimation (Riahi-Madvar et al., 2019). The study analyzed 503 data sets of channel geometry and flow conditions in natural streams in order to develop a hybrid model. The developed hybrid model combines the Subset Selection of Maximum Dissimilarity Method (SSMD) with Multigene Genetic Programming (MGP) and Pareto-front optimization. The combined models aimed to establish a set of selected dimensionless equations of K x and the best equation with the widest applicability. As per the authors' findings, the proposed equation provided an accurate estimation of K x compared to the other published equations.
A hybrid predictive model based on an ANN optimized with granular computing (GRC-ANN) was established for K x estimation in natural rivers (Ghiasi et al., 2019). The performance of the GRC-ANN model was compared with those of ANFIS/ANN models and other EFs in terms of accuracy over different ranges of K x values. The results confirmed the potential of the GRC-ANN model. An integrative AI model combining the group method of data handling and the extreme learning machine (GMDH-ELM) was developed for K x estimation in water pipelines (Saberi-Movahed et al., 2020). The GMDH-ELM model achieved a good level of precision in the training and testing phases. The results further showed that the proposed GMDH-ELM performed better than other AI and conventional predictive models.
Recently, several AI models, including support vector regression (SVR), Gaussian process regression (GPR), random forest (RF), the M5 model tree (M5P), and multiple linear regression (MLR), were investigated for K x estimation in natural streams (Kargar et al., 2020). The study found that the M5P model, with its simple formulations, exhibited better performance than the other AI models and EFs, and was recommended as a suitable tool for K x estimation in rivers. K x was also modeled using an integrated AI model based on the Subset Selection of Maximum Dissimilarity (SSMD) method and the Whale Optimization Algorithm (WOA) as a simple optimization approach. The study presented a high-accuracy formula for K x estimation, which was proven to be superior to the existing K x estimation formulas. The WOA was also found to improve the predictive performance of equations in other related fields by establishing the optimum coefficient values. A hybrid model was developed by combining ANFIS with the firefly algorithm (FFA) for K x estimation. The FFA was used to derive the optimum parameters of the ANFIS model. The analysis of the proposed ANFIS-FFA model showed significant improvement compared to the standalone ANFIS model, suggesting that parameter optimization by nature-inspired optimization algorithms contributed significantly toward improving the generality of the ANFIS estimations. In the same manner, other scholars developed a new hybrid AI model based on the integration of the Harris Hawks Optimization (HHO) algorithm with ANFIS and least squares support vector machine (LSSVM) models for K x estimation (Azar et al., 2021). The advancement of AI models indicates a motivation to explore the feasibility of emerging deep learning models for hydraulic engineering simulation. Hence, a convolutional deep learning model was constructed for K x estimation (Ghiasi et al., 2021).

Research gap and current study motivation
According to the reported literature review, researchers have devoted remarkable effort to investigating the feasibility of AI models and to improving their design, understanding, and evaluation for K x modeling in natural rivers. However, exploring new versions of AI models and their advanced variants remains an ongoing research motivation to better understand the underlying pollutant transport in natural streams. Although several AI models have been explored for modeling the K x , several methodological issues remain unsolved, such as: (i) AI models are only applicable when there are enough training samples, and this is hard to obtain in real-world scenarios; (ii) the incorporation of FS approaches characterized by non-linearity could substantially improve the feasibility of the ML models; (iii) learning the intricate structure of the input parameters of AI models is critical; a deeper network comes with a significant increase in computational cost and overfitting owing to the associated increase in the number of parameters to train, and sometimes due to inappropriate training samples; (iv) there is still the issue of appropriately balancing network depth and computational efficiency, while accommodating nonlinear FS remains an essential step.
FS, also known as parameter selection or attribute selection, is an independent process generally combined with the learning procedure of ML models, which aims to reduce the number of attributes to improve the performance of a predictor (Creaco et al., 2016). The central premise of FS is to remove the irrelevant or redundant features from the dataset so that the quality of the final estimations does not deteriorate (Meng & Li, 2017). Depending on their interaction with the ML model, FS approaches can be classified into filter models, wrapper models, and embedded models. Filter models work independently of the predictor, using information from the dataset through a metric constructed upon statistical tests, which can be a correlation coefficient, a distance function, or an importance measure. Wrapper models use an estimation model and select a subset of features based on the predictor's performance. Although wrapper methods generally produce better performance, they are more computationally intensive and time-consuming than filter methods (Tan et al., 2020). In embedded methods, the selection of relevant features is integrated into the learning of the estimation model (Cai et al., 2018); the selected features are those that best contribute to the accuracy of the estimation model (Mirzaei et al., 2017).

Research objectives
This study proposes an ML model that integrates FS with the building of a robust and reliable predictive model to estimate the longitudinal dispersion coefficient. Despite the massive number of applications involving statistical and ML models reported over the past two decades, the current research proposes, for the first time, a Gaussian Process Regression hybridized with an evolutionary algorithm to estimate the dispersion. The motivation of the study is to provide an accurate model that automatically selects the most relevant input parameters while releasing reliable and accurate estimations. For validation purposes, the SVR model, which has produced good estimations in several related studies, was also developed. The evolutionary search is performed by the Covariance Matrix Adaptation Evolution Strategy (CMAES), a simple, well-established, and competitive evolutionary algorithm (Hansen, 2006; Li et al., 2020).
To show the scope of the model, the FS step was first removed, and the resulting performance on the dispersion estimation was compared with the standalone GPR and SVR models. The purpose here is to show that, even without the selection of features, the proposed model can achieve competitive results against estimation models described in the literature (Toprak et al., 2004). The FS approach was then included again in the model, and extensive computational experiments showed that the proposed approach was capable of selecting the most relevant features while training the ML models, improving the performance and producing accurate and reliable outcomes.

Longitudinal dispersion coefficient dataset
The current research was established using a natural streams dataset collected from published research (Tayfur & Singh, 2005). The dataset consists of 71 observations of geometric and hydraulic parameters and longitudinal dispersion coefficients obtained from 29 rivers in the United States.
The samples in the dataset are scaled according to

v i = v / v max

where v is the original value of the considered parameter, v i is the standardized value, and v max is the maximum value over the samples of that parameter. From the scaled dataset, 51 samples (72%) were used to compose the training phase, while the remaining 20 samples (28%) were used for the testing phase. Tables 1 and 2 present the description of the parameters and the statistical information for the training and test sets, respectively. The relative shear velocity can be associated with the roughness and hydrodynamic characteristics of the river bed (Etemad-Shahidi & Taghipour, 2012). The channel shape parameter, given by β = ln(B/H), reflects the vertical irregularities of the river bed (Deng et al., 2001). The channel sinuosity is defined as the ratio of the channel length to the valley length (Sahay, 2013).
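The scaling step can be sketched as follows (the function name and sample values are illustrative, not from the study):

```python
import numpy as np

def scale_by_max(values):
    """Scale each sample of a parameter by the parameter's maximum
    value, so positive data is mapped into (0, 1]."""
    values = np.asarray(values, dtype=float)
    return values / values.max()

# Example: scaling a hypothetical flow-velocity column (m/s).
u = np.array([0.5, 1.2, 2.4])
print(scale_by_max(u))
```

Applied column by column, this maps every parameter of the dataset to a common scale before training.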

Feature selection encoding
The focus of this work is to develop an ML model to estimate the dispersion coefficient of natural streams. The model comprises a set of chained steps that involve processing the data, selecting the features, building the ML predictor, and finally releasing the final model to estimate the outputs. In addition, the proposed ML model aims to reduce human efforts in the simulation process when building a reliable and accurate methodological technique (Liu et al., 2017). Figure 1 depicts the hybrid predictive ML model proposed in this research integrating data processing, feature selection, and model building.
A simple procedure is used in the FS step. A feature vector composed of 8 entries,

θ FS = [θ 1 , θ 2 , ..., θ 8 ],

represents the logical variables for each parameter shown in Tables 1 and 2. An entry equal to 1 represents a selected feature and 0 a removed feature. For example, the feature vector θ FS = [10100011] means that the set {B, Q, β, σ} is used in the estimation model to produce the K x estimations. Two ML models were used within the integrated model: Gaussian Process Regression (GPR) and Support Vector Regression (SVR).
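The binary feature-mask encoding can be sketched as follows (the feature ordering shown is an assumption for illustration; the actual ordering follows Tables 1 and 2 of the paper):

```python
import numpy as np

# Hypothetical ordering of the 8 candidate features, for illustration only.
FEATURES = ["U", "H", "B", "Q", "U/u*", "beta", "sigma", "S"]

def select_features(X, theta_fs):
    """Keep only the columns of X whose mask entry is 1."""
    mask = np.asarray(theta_fs, dtype=bool)
    return X[:, mask]

X = np.arange(24).reshape(3, 8)          # 3 samples, 8 candidate features
theta_fs = [1, 0, 1, 0, 0, 0, 1, 1]      # a hypothetical feature vector
X_sel = select_features(X, theta_fs)
print(X_sel.shape)  # (3, 4)
```

The evolutionary search then only has to propose binary vectors of length 8; the mask is applied to the training matrix before the ML model is fitted.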
The parameterization of the hybrid model comprises the adjustment of two vectors, θ FS and θ MB . These two vectors are concatenated into a parametric vector

θ = [θ FS , θ MB ]

where θ FS is the feature vector and θ MB is the vector of the internal parameters of the ML model.

Gaussian processes regression
Given a dataset of observed samples {x i , i = 1, 2, ...}, a Gaussian Process (GP) is a non-parametric approach that finds a distribution over the possible functions f(x) that are consistent with the samples. In Gaussian process regression (GPR), it is assumed that any point x is assigned the output ŷ of a random variable f(x), written as

ŷ = f(x) + ε

where ε is an independent and identically distributed noise contribution. The predictive distribution of the output ŷ, given its input x, is

ŷ = k(x)ᵀ (K + αI)⁻¹ y

where y = [y 1 , y 2 , ..., y N ]ᵀ is the vector of training data, k(·,·) is a positive definite kernel function, K is the covariance matrix built from the kernel evaluated at the training points, and α is a regularization parameter. The covariance function models the dependence between the function values at different input points x i and x j (Schulz et al., 2018). This work adopts the Matérn kernel

k(x i , x j ) = k 0 (2^(1−ν) / Γ(ν)) (√(2ν) ||x i − x j || / l)^ν K ν (√(2ν) ||x i − x j || / l)

where ν and l are positive parameters, Γ(·) is the Gamma function, and K ν (·) is the modified Bessel function (Kumar, 2020). Building the GPR model involves the proper setting of the kernel parameters k 0 , ν, and l and the regularization coefficient α. The model building vector for the GPR model is therefore θ MB = [k 0 , ν, l, α].

Figure 1. The proposed hybrid framework. Steps of the hybrid approach: the training data is processed, and then the evolutionary algorithm CMAES guides the FS and model building tasks using cross-validation and delivers the final predictor. After constructing the final predictor, the test data is used to assess its performance with a set of proper metrics.

Support vector regression -SVR
The Support Vector Regression (SVR) builds a linear regression model of the form

f(x) = wᵀ φ(x) + b

where φ(·) is a nonlinear transformation of the input, w is the vector of weights, b is a constant threshold, and N is the number of samples. The nonlinear transformation used in this paper is induced by the radial basis kernel function

k(x i , x j ) = exp(−γ ||x i − x j ||²)  (4)

where γ is the bandwidth parameter.
In SVR, the optimal w and b are computed by minimizing the functional (Kargar et al., 2020)

min (1/2) ||w||² + C Σ_{i=1}^{N} L ε (y i , f(x i ))

where y i is the output associated with x i , L ε is the ε-insensitive loss function (Gunn, 1998), and ε is an SVR parameter. The internal parameters of the SVR model to be adjusted are C, ε, and γ, resulting in the model building parameter vector θ MB = [C, ε, γ]. The kernel function in Equation (4) assists in constructing a nonlinear decision hypersurface in the SVR input space, while the bandwidth γ acts as the inverse of the variance of the kernel function. The regularization parameter C determines the trade-off between minimizing the training error and minimizing the model's complexity. The tube size of the ε-insensitive loss function controls the accuracy demanded of the training samples (Chen, 2007).
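The ε-insensitive loss at the heart of the SVR functional can be sketched directly (function name and toy values are illustrative):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """L_eps(y, f) = max(0, |y - f| - eps): errors inside the eps-tube
    around the regression function incur no penalty."""
    return np.maximum(0.0,
                      np.abs(np.asarray(y_true) - np.asarray(y_pred)) - eps)

# The first prediction falls inside the tube, the others 0.2 outside it.
print(eps_insensitive_loss([1.0, 1.0, 1.0], [1.05, 1.3, 0.7], eps=0.1))
```

Widening ε therefore makes the model tolerate larger residuals before they contribute to the objective, which is why ε is among the parameters the evolutionary search must tune.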

Covariance matrix adaptation evolution strategies
As proposed in the framework shown in Figure 1, an evolution strategy assists the FS and the adjustment of the internal parameters of the ML model. The encoding of a solution implementing Gaussian Process Regression is represented by the 12-dimensional vector

θ GPR = [θ FS , k 0 , ν, l, α]

while the 11-dimensional parametric vector

θ SVR = [θ FS , C, ε, γ]

encodes a solution implementing a Support Vector Regression model.
The hybrid model implements the Covariance Matrix Adaptation Evolution Strategy (CMAES), a variant of the evolution strategy algorithm (Hansen & Ostermeier, 2001). CMAES performs the optimization by iteratively sampling a population of λ candidate solutions from a multidimensional normal distribution based on a vector of means m, a covariance matrix C, and a step size σ (Beyer & Schwefel, 2002). Figure 2 shows the pseudocode of the CMAES algorithm.
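The sampling loop CMAES builds on can be illustrated with a stripped-down (μ, λ) evolution strategy; unlike full CMAES, this sketch keeps the covariance fixed and only decays the step size (all names and settings are illustrative; production use would rely on a full CMAES implementation such as the `cma` package):

```python
import numpy as np

def simple_es(objective, x0, sigma=0.5, lam=20, mu=5, n_gen=100, seed=0):
    """Minimal (mu, lambda) evolution strategy: sample lam candidates
    from N(m, sigma^2 I), keep the best mu, recombine their mean.
    Full CMAES additionally adapts the covariance matrix C and the
    step size sigma; here sigma follows a crude geometric decay."""
    rng = np.random.default_rng(seed)
    m = np.asarray(x0, dtype=float)
    for _ in range(n_gen):
        pop = m + sigma * rng.standard_normal((lam, m.size))
        fitness = np.array([objective(x) for x in pop])
        elite = pop[np.argsort(fitness)[:mu]]   # minimization
        m = elite.mean(axis=0)
        sigma *= 0.97
    return m

# Minimize the sphere function; the mean should approach the origin.
best = simple_es(lambda x: float((x ** 2).sum()), x0=[3.0, -2.0])
print(best)
```

In the hybrid model, the objective evaluated for each candidate is the cross-validated RMD of the ML model encoded by that candidate, rather than a toy function.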
CMAES starts the search by creating a population of λ candidate solutions. Each solution is then evaluated using the objective function given by the Root Mean Discrepancy over all N observations,

RMD = √( (1/N) Σ_{i=1}^{N} DR i ² )  (5)

where DR i is the Discrepancy Ratio for the ith sample, and K xp i and K xm i are the estimated and observed dispersions. The Discrepancy Ratio,

DR i = log(K xp i / K xm i )  (6)

is used as the error measure for each sample. When K xp = K xm , DR equals zero and the estimation is exact. Otherwise, there is either an over-estimation when DR > 0, which means K xp > K xm , or an under-estimation, which leads to DR < 0 and consequently K xp < K xm .
The 5-fold cross-validation approach is employed to assess the fitness of the individual. This technique randomly splits the dataset into five subsets. The ML model using the parameters θ MB is trained in four of them, and the RMD from Equation (5) is calculated in the remaining subset. This procedure is repeated five times, and the objective function associated with the candidate solution is the RMD averaged over the five folds.
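The 5-fold cross-validated objective can be sketched as follows (the function names and the toy mean-predicting model are illustrative):

```python
import numpy as np

def cv_objective(train_fn, error_fn, X, y, n_folds=5, seed=0):
    """Average validation error over n_folds random splits: train on
    n_folds - 1 folds and evaluate the error function (e.g. the RMD)
    on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_fn(X[trn], y[trn])
        errors.append(error_fn(y[val], model(X[val])))
    return float(np.mean(errors))

# Toy check: a "model" that always predicts the training mean scores a
# zero mean-squared error on a constant target.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.ones(20)
score = cv_objective(lambda X, y: (lambda Xv: np.full(len(Xv), y.mean())),
                     lambda yt, yp: float(np.mean((yt - yp) ** 2)), X, y)
print(score)  # 0.0
```

In the hybrid model, `train_fn` fits the GPR or SVR encoded by a CMAES candidate and `error_fn` is the RMD of Equation (5).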
Then, CMAES evolves the population over N G generations. The goal is to find the best parameter vector θ* such that the RMD from Equation (5) is reduced to the lowest possible value. At each iteration k, the algorithmic parameters m k , C k , and σ k are modified based on the currently evaluated population according to the algorithm shown in Figure 2. A detailed explanation of the CMAES algorithm can be found in Li et al. (2018).

Models assessment
After the simultaneous optimization of the feature and model building vectors, the ML model is ready to be used as a K x predictor in the test set. Three metrics were used to assess the model performance: Root Mean Square Error (RMSE), Coefficient of Determination (R 2 ) and Accuracy (Yaseen, 2021).
The RMSE is calculated according to

RMSE = √( (1/N) Σ_{i=1}^{N} (K xm i − K xp i )² )

where N is the number of samples, and K xm i and K xp i are the measured and estimated longitudinal dispersions for the ith sample, respectively. The Coefficient of Determination is expressed as

R² = 1 − Σ_{i=1}^{N} (K xm i − K xp i )² / Σ_{i=1}^{N} (K xm i − K̄ xm )²

where K̄ xm is the mean of the observed longitudinal dispersions. The Accuracy is the number of DR values between −0.3 and 0.3 relative to the total number of samples (Tayfur & Singh, 2005):

Accuracy = (1/N) Σ_{i=1}^{N} I(−0.3 ≤ DR i ≤ 0.3)

where DR i = log(K xp i / K xm i ) is the Discrepancy Ratio for the ith sample, given by Equation (6), and I(·) is an indicator function that returns 1 if the condition is true and 0 otherwise.
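The three metrics can be implemented directly (assuming, as is common for the discrepancy ratio, that the logarithm is base 10; names and toy values are ours):

```python
import numpy as np

def rmse(k_m, k_p):
    return float(np.sqrt(np.mean((np.asarray(k_m) - np.asarray(k_p)) ** 2)))

def r2(k_m, k_p):
    k_m, k_p = np.asarray(k_m), np.asarray(k_p)
    ss_res = np.sum((k_m - k_p) ** 2)
    ss_tot = np.sum((k_m - k_m.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def accuracy(k_m, k_p, band=0.3):
    """Fraction of samples whose discrepancy ratio DR = log10(Kxp/Kxm)
    lies inside [-band, band]."""
    dr = np.log10(np.asarray(k_p) / np.asarray(k_m))
    return float(np.mean(np.abs(dr) <= band))

k_m = np.array([10.0, 100.0, 50.0, 20.0])
k_p = np.array([12.0, 90.0, 55.0, 200.0])  # the last estimate is a 10x miss
print(accuracy(k_m, k_p))  # 0.75
```

Only the last toy estimate has |DR| > 0.3, so three of the four samples count as accurate.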

Computational procedure description
A robust procedure was proposed to report the results: (i) perform 100 independent runs, establishing a unique random seed for each run; (ii) for each run, use the hybrid models built on the training set to estimate the dispersion coefficients in the test set; (iii) calculate the RMSE and Accuracy metrics to assess the performance of each model. The selected features in the first step of the experiments are summarized in Table 3. The computational experiments were divided into three parts: (1) Modeling the K x without feature selection. The experiments are performed using the proposed ML model by considering all features (Case 0) and also by reproducing the models presented in Tayfur and Singh (2005) (Cases 1-7), shown in Table 3. The vector θ FS is kept constant for these experiments, depending on the K x model. Only the ML parametric vector θ MB is adjusted to fine-tune the ML model in this experiment.
(2) Modeling the K x using evolutionary FS. The proposed hybrid model shown in Figure 1 is used to simultaneously search for the most suitable feature set and the model parameters. The FS procedure allows exploring combinations of parameters that are not presented in Table 3 and that can potentially produce better predictive performance when those input parameters are supplied to the GPR and SVR models.
(3) Modeling the K x with the most frequently established features. Building the ML models using the most frequent features found in the previous experiment exploits prior knowledge of the features and model parameters to propose an accurate model.

In Table 3, Case 0 employs the original features to model the dispersion coefficient of natural streams. Cases 1-7 were proposed in Tayfur and Singh (2005), where a neural network was used to estimate K x . Case 1 considers three parameters involving flow and geometric characteristics: the flow velocity U, the flow depth H, and the channel width B. Case 2 employs only the flow discharge Q as the input parameter. As the velocity plays an essential role in the pollutant spill, Case 3 considers only the flow velocity U. Case 4 considers the flow velocity U and the channel shape parameter β as input parameters, while Case 5 considers the channel sinuosity σ in the feature vector θ FS along with the channel shape parameter β and the flow velocity U. Case 6 considers only the relative shear velocity U/u * . Finally, Case 7 considers the channel shape parameter β and the channel sinuosity σ along with the relative shear velocity U/u * as input features.

Uncertainty analysis
Uncertainties in the dispersion values can occur due to inadequate estimation procedures, loss of tracer, or measurements made in the advective zone (Moreno-Rodenas et al., 2019; Sattar & Gharabaghi, 2015). Other factors, such as software and hardware errors, can also interfere with the measurement of the hydraulic characteristics of a river (Rutherford, 1994). Consequently, the dispersion coefficients and hydraulic characteristics may present some uncertainties due to the reasons described above. This subsection presents a quantitative assessment of the uncertainties of the developed models in predicting the longitudinal dispersion. The uncertainty analysis is applied to the testing phase dataset, comparing the predictive models' capacity.
The uncertainty analysis is based on the 95% confidence band around each estimated value K xp j . The confidence band can be approximated by the range

K xp j · 10^(ē − 1.96 S e ) ≤ K x ≤ K xp j · 10^(ē + 1.96 S e )

where K xm j is the measured dispersion, K xp j is the estimated dispersion for the jth sample, and ē and S e are the mean and standard deviation of the logarithmic errors, respectively, calculated according to

ē = (1/N) Σ_{j=1}^{N} e j ,  S e = √( (1/(N − 1)) Σ_{j=1}^{N} (e j − ē)² )

where e j = log10(K xp j ) − log10(K xm j ), j = 1, ..., N, and N is the number of test samples.
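The band computation can be sketched as follows (under the reading above that the band scales K xp j by 10^(ē ± 1.96 S e ); names and toy values are illustrative):

```python
import numpy as np

def confidence_band(k_m, k_p, z=1.96):
    """95% confidence band around each estimate, built from the mean
    and sample standard deviation of the log10 errors e_j."""
    e = np.log10(np.asarray(k_p)) - np.log10(np.asarray(k_m))
    e_bar, s_e = e.mean(), e.std(ddof=1)
    lower = np.asarray(k_p) * 10 ** (e_bar - z * s_e)
    upper = np.asarray(k_p) * 10 ** (e_bar + z * s_e)
    return lower, upper

k_m = np.array([10.0, 40.0, 90.0])   # measured dispersions (toy)
k_p = np.array([12.0, 35.0, 100.0])  # estimated dispersions (toy)
lo, hi = confidence_band(k_m, k_p)
print(np.all(lo < hi))  # True
```

Working in log space keeps the band multiplicative, which matches the wide dynamic range of observed dispersion coefficients.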

Monte Carlo simulation of the estimated dispersion values
To assess the effect of the stochastic nature of the stream geometric and hydraulic parameters and their impact on the estimated dispersion values, a Monte Carlo Simulation (MCS) method was implemented to find the approximate distribution of the K x outcomes of the hybrid models. MCS is a robust and easy-to-implement numerical technique to calculate the output uncertainty of a model and to assess the distributions of the outputs that would result from the joint distribution of input uncertainties (Chaudhary & Hantush, 2017;Vose, 1996).
In the experiments conducted in this research, the variations of the input parameters were modeled using the uniform distribution with the lower and upper bounds given by the minimum and maximum values of each parameter, as shown in Tables 1 and 2. For each MCS run, the deterministic outcome is calculated for each model developed in this study. A total of 250,000 outcomes were calculated for K x . The Mean Absolute Deviation (MAD) around the median of the output distribution is written as MAD = (1/n) Σ i |K xp i − median(K xp )|, while the uncertainty of the model output can be given by Uncertainty (%) = 100 × MAD / median(K xp ), where K xp i is the estimated dispersion for the ith sample. A detailed description of the MCS method can be found in (Sattar & Gharabaghi, 2015).
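The MCS procedure described above can be sketched as follows; this is a simplified reading of the text (uniform sampling within the parameter bounds, MAD about the median, and a relative uncertainty in percent), with all names illustrative:

```python
import random
import statistics

def monte_carlo_uncertainty(model, bounds, n_runs=10000, seed=42):
    """Monte Carlo sketch: sample each input uniformly within its
    [min, max] bounds, collect the deterministic model outputs, and
    report (median, MAD about the median, uncertainty in percent)."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_runs):
        sample = [rng.uniform(lo, hi) for lo, hi in bounds]
        outputs.append(model(sample))
    med = statistics.median(outputs)
    mad = statistics.mean(abs(y - med) for y in outputs)  # MAD about the median
    return med, mad, 100.0 * mad / med
```

In the paper's setting, `model` would be a trained GPR or SVR predictor and `bounds` the per-parameter minima/maxima from Tables 1 and 2; here any deterministic function of the sampled inputs can be plugged in.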

Computational experiments and discussion
This section covers the modeling results of the proposed ML model for estimating the longitudinal dispersion coefficient. The computational process, the FS scenarios, and the validation against the literature (Tayfur & Singh, 2005) (i.e. the ANN 'reference model') are reported in this section. The experiments were conducted on a cluster of computers with the following specifications: Intel Xeon E5620 (2 cores of 2.4 GHz, and 12MB cache memory), RAM of 16 GB, and Linux operating system (kernel 2.6.32).
During the optimization steps of CMAES, the algorithm samples a general multivariate normally distributed random vector, which has a complexity of O(d 2 ), where d is the number of dimensions of the candidate solutions. Updating the covariance matrix has a complexity of O((μ + 1)d 2 ), where μ is a CMAES parameter (Ros & Hansen, 2008). As a result, the steps of the CMAES algorithm have a computational complexity of O(d 2 ). Concerning the ML models, the training procedure for GPR requires solving a linear system, while SVR training involves solving a quadratic programming problem (Chen, 2007). The GPR and SVR algorithms have O(N 3 ) time complexity and O(N 2 ) memory complexity, where N is the size of the training set (Belyaev et al., 2014).

Figure 3 compares the results of the K x modeling without the FS approach. In this figure, the results of the GPR and SVR models were averaged over 100 independent runs. According to Figure 3, it can be observed that for Case 0 the average value of RMSE is considerably lower than that produced by the reference model for both the GPR and SVR models. The averaged accuracy obtained with the GPR model was competitive with that produced by the reference model; however, there was a decrease in the accuracy of the SVR model. Notably, for the case where the width-to-depth ratios are lower than 50, the RMSE (B/H < 50) values were remarkably better than the reference model, representing a reduction of 64.9% for GPR and 67.8% for SVR. This finding shows that the standalone ML models can provide accurate estimations for narrow streams. On the other hand, they did not produce better results than the reference model when the extreme values were not considered, as can be observed for the RMSE computed over dispersion values below 100 m 2 /s, RMSE (K x < 100).
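The O(N 3 ) cost of GPR training stems from solving the kernel linear system. The following pure-Python sketch illustrates this step, using a squared-exponential kernel as a stand-in for the paper's kernel; the names k0, l, and alpha loosely mirror the GPR parameters discussed later, but all signatures here are illustrative, not the authors' implementation:

```python
import math

def rbf(x, z, k0=1.0, l=1.0):
    # Squared-exponential kernel: k0 scales the magnitude, l is the length scale.
    return k0 * math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * l * l))

def solve(A, b):
    # Naive Gaussian elimination with partial pivoting -- the O(N^3) step.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gpr_fit_predict(X, y, X_new, alpha=1e-6, k0=1.0, l=1.0):
    # Training: solve (K + alpha*I) w = y; alpha models observation noise
    # added to the diagonal of the kernel matrix.
    K = [[rbf(xi, xj, k0, l) + (alpha if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    w = solve(K, y)
    # Prediction (posterior mean): k(x*, X) . w
    return [sum(rbf(xs, xi, k0, l) * wi for xi, wi in zip(X, w)) for xs in X_new]
```

With a small alpha, the predictor nearly interpolates the training data; larger alpha smooths the fit, which matches the role attributed to α in the parameter analysis below.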

Modeling K x without feature selection
The modeling results for Case 1 were interesting since the features used as input parameters in this model are a subset of the collected ones. The parameters used in this model, U, H and B, are those that exhibited the highest correlations with the dispersion coefficient, as can be seen in Figure 5, which shows the strength of the linear relationships for U, H and B. It is important to notice that, although informative, the linear correlation may not represent the complex and nonlinear nature of K x (Noori et al., 2016).
Considering the results reported in Figure 3, the SVR model produced a smaller RMSE than both the GPR model and the reference model, although the GPR model produced better accuracy. The GPR model poorly estimates the dispersion coefficient for narrow channels, as observed in its higher RMSE (B/H < 50) values compared to the SVR model. Nevertheless, considering the RMSE (B/H < 50), both models produced better estimations than the reference model.
For Cases 2, 3, and 6, the standalone models performed inadequately. It is interesting to notice that Case 2 involves the parameters of Case 1 implicitly, since the flow discharge is the product of flow depth, velocity, and channel width, Q = HUB. In addition, a linear relationship between the flow discharge and the dispersion can be discarded due to the small correlation shown in Figure 4. The models for Cases 2, 3, and 6 were built upon a single parameter, which is insufficient to represent the nonlinear relationship between that feature and the dispersion coefficient. This finding corroborates the results described in Tayfur and Singh (2005), namely that the flow discharge (Q), the flow velocity (U), and the relative shear velocity (U/u * ) alone are not sufficient to estimate the nonlinear behavior of the dispersion coefficient. Case 4 was built upon Case 3 by including the channel shape parameter (β), whereas in Case 5 the channel sinuosity was added to Case 4. The standalone (i.e. GPR and SVR) models performed similarly to the ANN 'reference model' developed by Tayfur and Singh (2005); however, they were not able to improve the results when compared to the previous cases. Case 7 considered the relative shear velocity (U/u * ) as in Case 6, together with the channel shape parameter (β) and the channel sinuosity (σ ), to estimate the dispersion coefficient. The statistical metrics shown in Figure 3 demonstrated that the inclusion of the channel shape parameter and the channel sinuosity did not improve the K x estimations.
The results depicted in Figure 3, obtained after several independent runs, allow the following conclusions to be drawn about the test cases presented in Table 3. From Cases 0-7, the GPR and SVR models produced the overall best results for Case 0 and Case 1, as can be seen in Figure 3. Given that the performance of the GPR and SVR models was highest for Cases 0 and 1, the comparative analysis henceforth focuses on these two cases. The RMSE for both models was lower than that of the ANN model developed in Tayfur and Singh (2005). In Case 0, where the eight parameters were considered, the GPR and SVR models showed an average reduction of 58.9% and 58.8% in the RMSE metric, respectively. Similar behavior was observed for the narrow channels where B/H < 50: the reduction was 35.1% and 32.2% for the GPR and SVR models, respectively. In addition, the average Accuracy for GPR was only 3.24% smaller than that of the reference model.
In Case 1, the averaged RMSE values obtained using the GPR and SVR models were smaller than the RMSE attained by the reference model. However, the RMSE produced by the GPR model was higher than that provided by the SVR model. Similar behavior is observed for the RMSE of narrow channels (B/H < 50). For Case 1, the GPR model produced an average Accuracy of 70.09% against 75% for the reference model, a percentage difference of 6.55% on average.
For Cases 0 and 1, Figure 6 depicts the scatterplots of the measured and estimated dispersion coefficients for the best models over the test set (according to RMSE). Besides the RMSE, the accuracy and the coefficient of determination associated with each model are displayed as well. Although a few estimated K x present a visible error, the estimations of the GPR and SVR models revealed a good agreement with the observations, as verified by the R 2 magnitudes: above 0.90 for GPR in Cases 0 and 1, and, for SVR, R 2 = 0.93 for Case 0 and R 2 = 0.89 for Case 1.
In the computational experiments, a total of 100 independent runs were performed. Consequently, it is interesting to analyze the distribution of the internal parameters over all executions. As Cases 0 and 1 produced better results than the other ones, the discussion is focused on these cases. Figure 7 shows the distribution of the internal parameters for Cases 0 to 7. The GPR model parameters are k 0 , ν, l, and α. The first three parameters control the shape of the kernel shown in Equation (3) used to compute the estimations, while α, which is added to the diagonal of the kernel matrix during fitting, rules the level of noise in the observations. The k 0 parameter controls the magnitude of the GPR model approximation, and the final solutions showed similar distributions in Cases 0 and 1. Similar behavior was noticed for the parameter α, which is associated with noise in the observations. The dispersion is a measure associated with a natural process, and thus noise becomes an essential factor in the estimates. The parameter ν is a critical parameter in the kernel function, and it controls the GPR smoothness (Stein, 1999): the smaller the ν value, the less smooth the approximated solution. The distribution of ν in Figure 7 shows that the GPR estimation functions produced for Case 1 are smoother than those for Case 0. In addition, the length scales l for Case 1 are higher than those for Case 0 as well. These combined circumstances potentially led to the higher RMSE values for Case 1, as can be visualized in Figure 3. Figure 8 displays the internal parameter distributions of the SVR model, including C, γ and ε. C is the penalization parameter, ε specifies the tolerance associated with the training loss function, and γ is the kernel coefficient in Equation (4). By analyzing the boxplots, it can be observed that the parameters γ and ε behaved with similar distributions in both cases.
On the other hand, the values of the C parameter for Case 1 were higher than those found for Case 0. In the support vector machine model, C plays an important role as a regularization parameter, whose strength is inversely proportional to C. As a result, the SVR estimations for Case 0 were smoother than those produced for Case 1. The smaller RMSE values for Case 1 shown in the barplots in Figure 8 (averaged over 100 runs) support this interpretation.
When compared to the reference model, it can be noticed that the GPR and SVR models presented a lower performance in predicting the dispersion coefficient when the extreme values (K x > 100) are discarded for Cases 0 and 1. One possible explanation for the degraded RMSE for small K x is that those models do not suitably explore the relationships among the features for small dispersion values. For Case 0, the interaction among the eight features might not be beneficial for estimating small K x values. In contrast, for Case 1, the features U, B, and H, which hold a linear relationship with the dispersion coefficient, might not be enough to represent the non-linearities of the dispersion coefficient. Considering that the evolution strategy appropriately determines the parameters θ MB of the models, an alternative is to choose a set of input parameters that balances the predictive resources with the extreme values. Likewise, this set of parameters should produce estimates with low RMSE values at high precision.

Modeling K x with evolutionary feature selection
Considering the experiments in the previous subsection, it is not clear whether other combinations of parameters can lead to better predictive performance. Despite the explanation provided by the authors in their study (Tayfur & Singh, 2005), it can be argued that the parameters associated with the cases shown in Table 3 were chosen, to a large extent, based on their correlation with the longitudinal dispersion coefficient. Figures 4 and 5 show that the parameters with the highest correlations are those that comprise the sets in Table 3. For example, the U parameter appears in 5 out of 8 cases. Besides, the parameters U, B, and H are part of Cases 1 and 2, which produced the best averaged results.
The hybrid model developed in the current research can search for the most suitable predictor parameters, exploring combinations that were not previously considered. Due to the nonlinear nature of the GPR and SVR predictors, a careful choice of parameters can improve the performance of the ML models. Moreover, parameters with slight correlations with K x might positively impact the final estimations.
The subsequent experiments analyze the impact of the evolutionary FS on the K x estimation performance metrics. Table 4 summarizes the results obtained for the GPR and SVR hybrid models with the FS shown in Figure 1. The first column shows the cases, and the second column displays the ML model. The RMSE is shown in the third column, while the Accuracy appears in the fourth column. The fifth column shows the RMSE when extreme values are discarded, and the last column displays the RMSE for narrow streams. Analyzing the results reported in Table 4, it is possible to observe that the FS step has contributed considerably to improving the metrics of the ML models. In addition, Table 4 presents the results for Cases 0 and 1 to allow a comparison with the performance of the models when the FS was not applied. Overall, the RMSE values decreased for both the GPR and SVR models. As observed for the models without FS, the GPR-FS and SVR-FS models produced a smaller RMSE than the reference model. On average, the GPR-FS model achieved an improvement of 16.5% when compared to Case 0 and 50.2% with respect to Case 1, whereas the improvements achieved by the SVR-FS model over Cases 0 and 1 were 21.0% and 21.6%, respectively. Regarding the Accuracy, the SVR-FS model achieved a slight improvement over the SVR model. In contrast, the GPR-FS model achieved a consistent improvement compared to the standalone model without FS; in particular, its averaged Accuracy is the highest among all models. When the extreme dispersion coefficient values (K x > 100) were excluded, there was no improvement in the RMSE for either model. However, the evolutionary hybrid model implementing the GPR and SVR models produced an acceptable predictive performance, as per the evaluation metrics, for the narrow streams. In particular, the average RMSE values achieved by the SVR-FS model can be highlighted, as seen in the last column of Table 4.
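The evolutionary FS step can be sketched as a search over binary feature masks. The sketch below substitutes a simple (1+λ) evolution strategy for CMAES (which operates on continuous vectors and would require a decoding step) and takes a caller-supplied fitness, so all names and settings here are illustrative assumptions rather than the authors' implementation:

```python
import random

def evolve_feature_mask(fitness, n_features, generations=50, lam=8, seed=1):
    """Evolutionary feature-selection sketch: a (1+lambda) ES over binary
    masks stands in for CMAES here. `fitness` returns an error to minimize
    (e.g. validation RMSE of a GPR/SVR trained on the masked features)."""
    rng = random.Random(seed)
    best = [rng.random() < 0.5 for _ in range(n_features)]
    if not any(best):          # ensure at least one feature is selected
        best[0] = True
    best_err = fitness(best)
    for _ in range(generations):
        for _ in range(lam):
            # Bit-flip mutation: each bit flips with probability 1/n_features.
            child = [b != (rng.random() < 1.0 / n_features) for b in best]
            if not any(child): # skip empty feature sets
                continue
            err = fitness(child)
            if err < best_err: # elitist replacement
                best, best_err = child, err
    return best, best_err
```

In the paper's pipeline the fitness would retrain the ML model on the masked inputs and return its validation error, so the feature mask and the model's internal parameters are tuned jointly.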
Figure 9 depicts the scatter plots for the best GPR-FS and SVR-FS models, chosen according to the smallest RMSE. Considering the GPR-FS model, it can be observed that the RMSE decreased while the Accuracy increased when comparing the scatter plots of Figures 6 and 9.
The analysis of the foregoing results supports the conclusion that incorporating the FS procedure in the ML development process leads to a considerable improvement in predicting the dispersion coefficient. The proper selection of input parameters is beneficial for the model, as it increases the accuracy while decreasing the averaged errors.
The results in Table 4 help assess the models' performance improvement; however, they do not show which parameters led to these gains. Figure 10 shows the distribution of the selected features after the training process of the evolutionary hybrid model. From this figure, one can observe that the GPR-FS and SVR-FS models chose a total of 13 sets of features at the end of the evolutionary search procedure. The parameter selection process is heuristic, and there is no guarantee of choosing the same set of parameters in all runs. The barplot on the left side shows the results of the GPR-FS model: the set of parameters B, U, σ , Q was selected 24 times; B, U, σ was selected 18 times; and B, U, σ , β 13 times. These three sets were selected 55 out of 100 times, and they share the parameters B, U, and σ . A similar analysis for the SVR-FS model in the bar graph of Figure 10 shows that B, U, σ , β was chosen 33 times, B, U, σ 20 times, and B, U, σ 12 times. These three sets were selected 74 times throughout the 100 independent executions, approximately two-thirds of the total runs. Figure 11 shows that U, σ , B, Q were the most frequent parameters for the GPR-FS model, in that order. For the SVR-FS model, the most frequent features were U, σ , B, β, in this order as well. Moreover, for both models, the U and σ parameters were selected in all runs, while B appears 98 times for the GPR-FS model and 96 times for the SVR-FS model. Among the most frequent parameters, U and B have a strong correlation with the dispersion coefficient, whereas the river sinuosity σ has the lowest correlation, as can be seen in Figure 4. The FS has shown that the sinuosity plays an essential role in the context of the hybrid model, suggesting that σ holds a solid nonlinear relationship with K x that was not captured by linear correlation approaches.
Some literature studies discussed that sinuosity is related to river meandering (Deng et al., 2002) and to other exogenous factors, such as vegetation (Camporeale & Ridolfi, 2010;Savickis et al., 2016) and topography (Timar, 2003;van Dijk et al., 2013), that also affect the dispersion coefficient. As shown in the results presented in this study, the incorporation of sinuosity is beneficial for the performance of the adopted ML models.

Modeling K x using the most frequent features found by evolutionary selection
The computational experiments carried out in this subsection assist in determining the importance of the parameters indirectly, taking into account their impact on the hybrid ML models. The results shown in Table 4 and Figure 10 suggest that some parameters constitute the core that improves the learning processes of the applied ML models. A detailed analysis of the frequency with which the parameters appear in the selected sets is presented in Figure 11. Through this figure, it is possible to verify that the σ parameter appears in all selected sets; the same occurs with the U parameter. Based on this analysis, the essential parameters can be determined as U, B, and σ , which can be interpreted as the best selection of the proposed evolutionary FS approach. These three parameters are used to propose Case 8, shown in Table 5. Table 6 presents the prediction accuracy improvement for all metrics; in particular, the improvement for the GPR model can be observed in the Accuracy and the RMSE for dispersion coefficients below 100 m 2 /s. Figure 12 displays the scatter plots for the best models of Case 8 according to the RMSE. The scatter plots show that the models yielded acceptable accuracy. This behavior is expected because the careful selection of the significant parameters represents the indirect introduction of specific knowledge about the features. Besides, this specific knowledge could only be obtained after analyzing several simulations carried out with the evolutionary selection model. The most frequent parameters capture the essential behavior of the dispersion coefficient, allowing the effective modeling of the K x by the ML models. However, when this knowledge is not available before carrying out the estimations, the FS step in the integrated model is indispensable to construct an accurate low-complexity model. Table 7 summarizes the results of the uncertainty estimation.
The first column shows the models' acronyms or the reference from which they were collected. The second column displays the case, according to Table 3. The third column shows the models' mean estimation errors, while the last two columns display the width of the uncertainty band and the 95% estimation interval error. According to the results presented in this table, all models overestimated the K x values, as revealed by the positive values of the mean estimation errors, except for SVR in Case 7. Regarding the GPR model's results, the widths of the uncertainty bands for the hybrid models vary considerably, ranging between 0.25 and 0.68. However, for Cases 0, 1, 6, 7, and 8, the band widths are smaller than those presented by the reference models. The GPR-FS model equipped with automatic FS presented a comparable uncertainty band of 0.29. The uncertainty bands for the SVR model presented similar results: Cases 0, 1, 5, 6, 8, and the SVR-FS model exhibited smaller uncertainty bands compared to the reference models. Moreover, the GPR and SVR models were competitive for the lowest 95% confidence estimation error interval. The results presented in Table 7 showed that the GPR models (Cases 0, 1, 6, 7, 8 and FS) and SVR models (Cases 0, 1, 5, 6, 8 and FS) produced better performance with respect to the model estimation uncertainty band and the 95% estimation interval.

Uncertainty analysis
Uncertainty analysis results for the K x using the developed models are presented in Table 8. The uncertainty analysis shows that the MAD values for both the GPR and SVR models were smaller than those of the reference models that appear in the last six rows of Table 8. The uncertainties for the GPR and SVR models were similar to those calculated for the reference models. Figure 13 shows the scatterplot of Uncertainty (%) and RMSE for all models. Each pair (x, y) represents a particular model, where the GPR, GPR-FS, SVR, and SVR-FS models are depicted in different colors. Figure 13 helps to simultaneously compare the performance on the estimation of dispersion and the uncertainty of the models. From this figure, it can be observed that GPR (Case 2) produced a small uncertainty; however, this model shows a higher RMSE. It should be noticed that this model uses one parameter (Q), as shown in Table 3, which may have contributed to the reduced accuracy. Conversely, SVR (Case 0) presents a small RMSE but with a higher uncertainty.

Modeling verification against the literature formulations and ML models
In order to demonstrate the capacity of the proposed predictive models, it is essential to validate them against established models from the literature. Table 9 presents the validation of the models proposed in the current research (i.e. GPR, SVR, and GPR-FS) against the existing empirical formulations (EFs) and several ML models from the literature, considering the coefficient of determination as the comparative metric. The EFs demonstrated a low estimation accuracy. On the other hand, the ML models exhibited better predictive performance in comparison with the EFs, although with some restricted accuracy. Based on the reported literature modeling results, the estimation capability attained by the developed GPR, SVR, and GPR-FS models revealed better accuracy, with a maximum R 2 = 0.95. It is worth mentioning that, although the predictive models were developed based on a limited number of observations (71 samples), the current research evidenced the capability of the developed models in mimicking the actual relationship between the geometric and hydraulic river system parameters and the K x .

Model strengths, limitations and further research
The obtained results and the reported analysis showed that the evolutionary model proposed in this paper contributes to the development of an automated computer-aided model. The model demonstrated the ability to automatically identify the most relevant features in the dataset and thus initiate a reliable learning process for the predictive model. As a result, the model can gradually learn to focus on critical data attributes during the model building approach. The size of the dataset imposes a limitation on the approach proposed in this paper. The dataset consists of hydromorphological parameters collected in the United States; applying the evolutionary hybrid model to rivers on other continents, even with similar characteristics, must be done with care. In this context, it is critical to add more samples to the dataset. However, data acquisition involves laborious and time-consuming experimental data collection, which represents a trade-off for the modeling process.
For future research, the possibility of developing alternative approaches by combining two or more ML models in an ensemble fashion is to be investigated (Elder, 2018;Kotu & Deshpande, 2019), with the aim of exploiting the estimation capabilities of different ML models. Nevertheless, the resulting model can become quite complex in this context, with a potentially large number of parameters. Additional research involves implementing evolutionary multiobjective search algorithms (Jimenez et al., 2017) to reduce the number of model parameters and search for the most helpful input parameters while maximizing the estimation performance.

Conclusions
The dispersion of pollutants in rivers involves complex phenomena, requiring simplifications and assumptions that hamper accurate estimations. In this context, ML models arise as an alternative for dispersion coefficient modeling. However, the performance of these models depends on a proper input parameter selection. In this research, a new hybrid ML model was proposed that embeds a simple approach for selecting the associated parameters for modeling the dispersion coefficient of natural streams. An evolution strategy assisted the task of adjusting the selected parameters. Based on the reported literature, there has been no previous attempt to integrate the FS approach with the estimation model developed to determine the dispersion coefficient; in general, the FS task was performed in a separate step and did not consider the estimation model. The performance of the models was measured based on the RMSE and Accuracy. The research findings are summarized as follows:
* The proposed predictive model automatically searches for the essential parameters and the internal parameters of the ML models.
* The modeling results showed that, with the incorporation of the FS approach, an efficient methodology was established for the dispersion coefficient simulation.
* Based on the proposed evolutionary FS approach analysis, the essential parameters for modeling the K x were identified as U, B, and σ .
* An effective model for K x determination was constructed from a limited number of observations (71 samples) of geometric and hydraulic river system parameters.
* The uncertainty analysis indicated that the GPR and SVR models presented competitive results for the lowest 95% confidence estimation error interval.
* Overall, the research confirmed the feasibility of the hybrid ML model (i.e. GPR-FS) for modeling the longitudinal dispersion coefficient in natural streams.

Data and source availability
The datasets generated and/or analysed during the current study are available in the GitHub repository, https://github.com/LGoliatt/hybrid-evolutionary-ml-longitudinal-dispersion.

Disclosure statement
No potential conflict of interest was reported by the author(s).