Modelling of suspended sediment load by Bayesian optimized machine learning methods with seasonal adjustment

Suspended sediment load (SSL) is essential to river and dam engineering. Due to the complexity and stochastic nature of sedimentation, SSL prediction is a challenging task and conventional methods often fail to generate accurate results. Aiming to provide an improved estimation, this paper contributes a new forecasting framework that integrates seasonal adjustment (SA) and Bayesian optimization (BOP) into a machine learning (ML) model (denoted as the BMS). The SA is used for de-seasonalisation and trend extraction; the BOP optimizes the ML architecture. The BMS is evaluated using the daily SSL records from the Yangtze River. Its performance is appraised by the statistical criteria of Nash-Sutcliffe efficiency (NSE), correlation coefficient (CC), root mean squared error (RMSE) and mean absolute error (MAE). With the de-noising and hyper-parameter tuning modules, the BMS effectively improves the accuracy of the standard ML models. The most significant improvement occurs in the Boosting model, with gains in NSE and CC of 4.3% and 1.6%, and reductions in RMSE and MAE of 24.9% and 24.2%. The BMS retains its advantage even under flood conditions, where it reduces the errors of the constituent models by up to 47.9% for RMSE and 48.3% for MAE.


Introduction
Sediment management in rivers has implications concerning reservoir operation, water-quality control, river geological and geographical settings, channel navigability, operation of hydraulic structures, river esthetics and fish habitats (Idrees et al., 2021; Kişi, 2010). For example, sediment aggradation elevates the channel bed with excess sand and gravel. It also gives rise to lateral shrinkage of the channel, which could lead to flooding due to the loss of discharge capacity (Kisi, 2005). Reservoir sediment deposition curtails the storage volume and may obstruct bottom outlets. Above all, suspended sediment with high turbidity in river flows is both a physical pollutant and a chemical one if it carries chemicals such as phosphorus and heavy metals (Doğan et al., 2007). Consequently, a good understanding of its transport characteristics is of significance to dam and river engineering.
River sediment is divided into suspended sediment load (SSL) and bed load. The former is often more complex and more relevant to engineering applications. Conventionally, the SSL is estimated by either direct measurements or sediment transport equations (Bayram et al., 2012). The former is accurate but often costly. The latter has limited accuracy, as it employs simplified partial differential equations and assumptions to develop a relationship between the SSL and flow parameters (e.g. flow velocity and discharge) (Aytek & Kişi, 2008). The sediment rating curve (SRC) is another useful tool for SSL estimation. Empirical correlations are usually established using experimental data, which is costly and time consuming (Sharafati et al., 2020). They generate satisfactory results under the conditions from which they are derived and may lead to unreliable predictions elsewhere. In addition, they cannot provide an estimate of the prediction uncertainties associated with the parameters (Shamaei & Kaedi, 2016). Numerical methods are another alternative, which rely on computational power and require complicated model setups (Sharafati et al., 2020). The limitations of the above-mentioned models motivate the constant need to develop reliable methods for SSL predictions.
With the advance in artificial intelligence (AI) techniques, machine learning (ML) models have exhibited great potential in studying hydrological issues, e.g. rainfall, runoff, evaporation, and streamflow (Deng et al., 2022; Ghorbani et al., 2018; Singh et al., 2022; Tao et al., 2021; Yaseen et al., 2017; Zhao et al., 2021). ML methods are effective in the reproduction of hydrological phenomena, requiring less computational time owing to their robust learning capability and adaptability (Sharafati et al., 2020). Their application to suspended sediment issues has also received growing interest. Given below is a brief review of the related studies of SSL and suspended sediment concentration (SSC, SSL = SSC × discharge), divided into three categories: artificial neural networks (ANNs), fuzzy logic-based models and decomposition-based ML methods.
Using a feedforward neural network (FFNN), Jain (2001) makes one of the first attempts to establish an integrated relationship between water level, flow discharge and SSC. In comparison with conventional curve-fitting approaches, the developed model yields more accurate results, capable of coping with the hysteresis effects springing from unsteady flows. Using the past sediment records at an up- or downstream station, Cigizoglu (2004) establishes a multi-layer perceptron (MLP) model to estimate daily SSL. The MLP captures the complex non-linear behaviors of the sediment series relatively better than the conventional approaches. With rainfall and flow data as inputs, Alp and Cigizoglu (2007) evaluate the effectiveness of the FFNN and the radial basis function neural network (RBFNN) in studying the SSL. They conclude that both models provide satisfactory predictions, with insignificant differences. Ramezani et al. (2015) propose a social-based algorithm (SBA) to optimise the ANN connection weights. The resulting model exhibits outstanding performance in convergence, flexibility and accuracy. Zounemat-Kermani et al. (2016) apply the ANN and support vector machine (SVM) to model the SSC, with the results compared with those from the multiple linear regression (MLR) and the SRC methods. It shows that the ML models reduce the errors, on average, by up to 23%.
Lohani et al. (2007) employ a fuzzy logic technique to derive a stage-discharge-SSC relationship. Comparisons with the ANN and SRC demonstrate the superiority of the technique, which best captures the inherent nonlinearity of the system. Upon performing comparative studies of the neuro-fuzzy (NF), the ANN, the MLR and the SRC model, Rajaee et al. (2009) conclude that the NF best estimates the cumulative SSL, capable of reproducing the hysteresis effects. Kisi and Zounemat-Kermani (2016) develop an adaptive NF embedded fuzzy c-means clustering (ANFIS-FCM) model to forecast the SSC, which improves accuracy and requires less computational time in calibration compared with the classical ANFIS model. For estimations of the SSL and SSC, Özger and Kabataş (2015), Nivesh and Kumar (2018) and Kisi and Yaseen (2019) propose similar fuzzy logic methods.
SSL time series feature stochastic temporal variations; direct modelling might be insufficient to capture their nonlinear behaviours in both the time and frequency domains. For better representation, data pre-processing techniques of e.g. sampling, transformation, de-noising and normalization are recommended to reformulate and reshape the original signals (Li et al., 2022). Wavelet transformation (WT) is, in this context, the most used data decomposition method. For instance, Li et al. (2013) integrate a discrete wavelet into an ANN (WANN) to model the SSC of hyper-concentrated river flows. Through error auto-correlation and input-error correlation analysis, the WANN performs more robustly than the ANN and SRC models. Rajaee et al. (2011) use a similar technique to decompose the time series data into several sub-signals at various resolution levels. If combined with an ANN, it generates a more accurate SSL prediction. Shiri and Kişi (2012) develop several hybrid models by coupling a wavelet with the gene expression programming (GEP), NF and ANN, respectively. It shows that the wavelet transformation considerably improves the accuracy of the individual ML models and the wavelet-GEP generates the best results. Nourani and Andalib (2015) construct wavelet-based SVM and ANN models to examine daily and monthly SSL, which outperform the standard models. Other related techniques include the seasonal adjustment (SA) method (Zhang et al., 2013), the empirical mode decomposition (EMD) (Yang & Chen, 2019), the variational mode decomposition (VMD) (Abdoos, 2016), etc.
The review highlights the effectiveness of the standard ML models and the importance of data decomposition. However, some issues are yet to be addressed to attain improved predictions.
• Decomposition is necessary for enhanced accuracy but needs to be properly applied. For instance, as shown by Quilty and Adamowski (2018), many wavelet-based models are incorrectly implemented and cannot be used for real-world forecasting. In hindcast experiments, owing to the inclusion of hypothetical future information, the WT boosts the performance of the individual models. In forecast tests, due to the border effects, the WT efficiency deteriorates, which is undesirable for real-life applications (Hadi et al., 2020; Quilty & Adamowski, 2018). In addition, extra procedures are necessary to remove the redundant sub-signals introduced by decomposition (e.g. WT and EMD) (Khan et al., 2020).
• As for the ML models, a globally optimal architecture is the key to accurate predictions. Models are often optimised by the trial-and-error method, which is time consuming and cannot guarantee an optimal structure. For example, Li et al. (2013) use this method to determine the number of hidden nodes in the WANN model. Zounemat-Kermani et al. (2016) utilize the same approach to tune the training parameters in the SVM.
As a result, the goal (also the contribution) of this study is to develop a novel model that extracts the informative features hidden in the time series and, at the same time, automatically establishes the best combination of hyperparameters. A de-noiser (the SA) and an optimiser (Bayesian optimisation, BOP) are coupled with a standard ML model, resulting in a Bayesian optimised ML method with the SA, denoted as the BMS. The incorporated ML models cover conventional ML, ensemble learning (EL) and deep learning (DL) algorithms. This novel BMS model is appraised using daily SSL records from a major hydrological station on the Yangtze River, China. The framework is intended to yield more accurate time series estimations of river suspended sediment and to provide a reference for hydrological modelling of e.g. runoff and environmental parameters such as river temperature and salinity.

Methods
To test the effectiveness of the BMS framework, three ML methods are employed. This section gives a brief description of the algorithms, followed by the BMS framework itself. The evaluation metrics are also introduced.

Support vector machine (SVM)
An SVM is a supervised learning model for classification and regression analysis. Its concept is to establish a decision surface by mapping the input vectors into a high-dimensional space, where linear regression is applied. For the regression, the SVM aims to find a linear hyperplane that fits the multidimensional input vectors to output values. Then, the trained SVM estimates future output values contained in a test set. The model is mathematically formulated by

f(x) = w \cdot \varphi(x) + b    (1)

where f = functional symbol, φ(x) = nonlinear mapping of input x into the feature space, w = weight and b = bias. The w value is determined by

w = \sum_{i=1}^{n} (a_i - a_i^*) \varphi(x_i)    (2)

where a_i, a_i^* = Lagrange multipliers and y_i (below) = predicted value. The parameters are computed by minimizing the regularized risk (R_min)

R_{min} = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} L_\varepsilon(t_i, y_i)    (3)

where C = cost factor, ε = radius of the tube that the regression function lies in, t_i = desired value in period i and L_ε(t_i, y_i) = ε-insensitive loss function, equal to |t_i − y_i| − ε if |t_i − y_i| ≥ ε and zero otherwise. The slack variables (ξ, ξ*) are introduced, as some data might lie outside the ε-tube. They represent the discrepancy between the actual value and its boundary value of the tube. Consequently, Eq. (3) is rewritten in its dual form as

l(a_i, a_i^*) = \sum_{i=1}^{n} t_i (a_i - a_i^*) - \varepsilon \sum_{i=1}^{n} (a_i + a_i^*) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (a_i - a_i^*)(a_j - a_j^*) K(x_i, x_j)    (4)

where l(a_i, a_i^*) = Lagrange function and K(x_i, x_j) = kernel function. The equation holds true under the conditions \sum_{i=1}^{n} (a_i - a_i^*) = 0 and 0 ≤ a_i, a_i^* ≤ C. Common kernel functions include the Gaussian and the linear kernel; the resulting models are referred to as the GSVM and LSVM, respectively.
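The paper does not give its implementation code; as a minimal illustrative sketch, the two kernel variants can be set up with fitrsvm from MATLAB's Statistics and Machine Learning Toolbox, where the predictor matrix X, the target Y and the toy data are placeholders:

```matlab
% Minimal SVM regression sketch (illustrative; X = n-by-p predictors, Y = n-by-1 targets).
% fitrsvm solves the epsilon-insensitive dual problem of Eq. (4).
rng(1);                                                   % reproducibility
X = rand(200, 3);  Y = sum(X, 2) + 0.05*randn(200, 1);    % toy stand-in for SSL inputs

gsvm = fitrsvm(X, Y, 'KernelFunction', 'gaussian', 'Standardize', true);  % GSVM
lsvm = fitrsvm(X, Y, 'KernelFunction', 'linear',   'Standardize', true);  % LSVM

yG = predict(gsvm, X);                                    % in-sample predictions
yL = predict(lsvm, X);
```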

Ensemble learning (EL)
The EL is a meta-learning algorithm that employs multiple learning models to obtain better predictive performance than the constituent ones alone. It aims, through generation, pruning and integration, to overcome the deficiency of weak learners. The Bagging (BA) and Boosting (BO) are two popular EL algorithms, with their schematics shown in Figure 1.
The BA first creates new training sets by making bootstrap replicates of the original learning sets and then establishes weak models on each subset. Finally, it generates an optimal model by aggregating all the single ones. The aggregation averages over the individual models when dealing with numerical values and takes a plurality vote when handling classification. The resulting model gives rise to substantial gains in accuracy if a base learner shows remarkable instabilities (Breiman, 1996). Consequently, the BA mitigates prediction uncertainties and overfitting risks. Similar to the BA, the BO creates, through random sampling, new training sets to replace the initial ones. However, a fundamental difference exists between their development procedures: the BO builds base models sequentially and restricts the error rate of each classifier below 0.5, while the BA requires an unstable inducer and constructs base models in parallel (Rokach, 2010).
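As an illustrative sketch (not the authors' code), both ensemble strategies are available through MATLAB's fitrensemble; 'Bag' grows trees in parallel on bootstrap replicates, while 'LSBoost' grows them sequentially with shrinkage. The tree settings and X, Y are placeholders:

```matlab
% Bagging vs. boosting of regression trees (illustrative sketch).
tree = templateTree('MaxNumSplits', 10, 'MinLeafSize', 5);    % weak learner

ba = fitrensemble(X, Y, 'Method', 'Bag',     'NumLearningCycles', 100, ...
                  'Learners', tree);                          % parallel, bootstrap replicates
bo = fitrensemble(X, Y, 'Method', 'LSBoost', 'NumLearningCycles', 100, ...
                  'Learners', tree, 'LearnRate', 0.1);        % sequential, shrinkage 0.1

yBA = predict(ba, X);  yBO = predict(bo, X);
```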

Deep learning (DL)
The DL belongs to a broader family of the ML methods based on artificial neural networks with feature learning. As an advanced DL structure, the long short-term memory (LSTM) has been applied to hydraulic and hydrological studies (Cho & Kim, 2022; Zhang et al., 2018). It is an enhanced version of the recurrent neural network (RNN) architecture. Due to gradient vanishing during backpropagation processes, the RNN becomes insufficient in learning long-term temporal dependency (Cho & Kim, 2022). To counteract this, an additional cell-state structure (also known as LSTM block) is added to the hidden layer. Shown in Figure 2, an LSTM network consists of memory cells and nonlinear gates (input, output and forget gate). The gates maintain the state and regulate the information flow in the LSTM block with a constant error carousel (CEC). The input and output gates function as signal filters to distinguish between time-dependent and time-independent information, thus preventing input and output weight conflicts. The CEC is a neuron that keeps the error constant and avoids gradient vanishing by running straight down the network without any activation functions (Zhang et al., 2018). The unique arrangement of the input and output gates installed before and after the CEC is beneficial for the network to selectively obtain the past information. The forget gate receives an error from the CEC and prompts the CEC to wipe out the error when required. The construction of the LSTM is summarized as follows: (1) to identify the information to be eliminated from the cell (forget gate), (2) to determine the new information to be maintained by addition to the cell state (input gate), and (3) to control the information of the cell state flowing into the new hidden state (output gate) (Nourani & Behfar, 2021).
Conventional LSTMs learn from only the previous inputs, leading to a loss of the information contained in the later part of the input series. A bi-directional LSTM (BiLSTM) is often recommended to deal with this drawback by superposing two LSTMs with opposite training sequences and separate hidden layers. Consequently, the outputs are generated based on the inputs both backward and forward in time. More details are reported by Yang et al. (2022). In this study, both the standard LSTM and BiLSTM are employed to test the proposed prediction framework.
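A minimal Deep Learning Toolbox sketch of the two architectures is given below; swapping lstmLayer for bilstmLayer is the only change needed for the bi-directional variant. The layer sizes and training settings are illustrative assumptions, not the study's configuration:

```matlab
% Sequence-to-one regression networks (illustrative sketch).
numHiddenUnits = 100;
lstmLayers = [sequenceInputLayer(1)
              lstmLayer(numHiddenUnits, 'OutputMode', 'last')      % standard LSTM
              fullyConnectedLayer(1)
              regressionLayer];
bilstmLayers = [sequenceInputLayer(1)
                bilstmLayer(numHiddenUnits, 'OutputMode', 'last')  % forward + backward pass
                fullyConnectedLayer(1)
                regressionLayer];
opts = trainingOptions('adam', 'MaxEpochs', 150, 'Verbose', false);
% net = trainNetwork(XTrain, YTrain, lstmLayers, opts);
% XTrain: N-by-1 cell array of input sequences; YTrain: N-by-1 responses.
```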

The BMS framework
The proposed BMS consists of three major components: a pre-processing module (SA), a standard ML model, and an optimiser (BOP), shown in Figure 3. The SA decomposes the original time series into seasonal and trend components, which is useful for exploring the trend and any remaining irregular signals. The choice of the SA is based on its effectiveness and easy implementation, with only two sub-signals after decomposition (Wang et al., 2014; Zhang et al., 2013). The BOP searches for the optimal ML architecture through the optimisation of the hyperparameters. The choice of the BOP rests on its fast computation and successful applications in similar problems (Alizadeh et al., 2021). By integrating the two modules into a standard ML model, the developed framework is intended to provide improved SSL predictions.
In the BMS, the first step is to perform de-seasonalisation. Since the SSL time series is non-stationary with seasonality, a direct prediction is often subject to large errors. Consequently, as a data pre-processing tool, the SA method is applied to de-noise the SSL and extract its informative characteristics for modelling. Successful SA applications in ML modelling are seen in Wang et al. (2014) and Zhang et al. (2013). The additive SA is expressed as

x_t = T(t) + I_s,  t = 1, 2, \ldots, T    (5)

where x_t = time series, T(t) = trend component and I_s = seasonal component; the series is rearranged as x_{ks}, with k indexing the seasonal cycle and s the season (k = 1, 2, \ldots, m; s = 1, 2, \ldots, l). The parameter I_s is defined as

I_s = \frac{1}{m} \sum_{k=1}^{m} x_{ks} - \frac{1}{T} \sum_{t=1}^{T} x_t    (6)

As a result, a new series without seasonality is expressed as

x'_{ks} = x_{ks} - I_s    (7)

The series x'_{11}, x'_{12}, \ldots, x'_{1l}, \ldots, x'_{m1}, x'_{m2}, \ldots, x'_{ml} is re-indexed back to x'_1, x'_2, \ldots, x'_T.
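In MATLAB, Eqs. (5)-(7) amount to a few lines. The sketch below assumes the record length is an exact multiple of the season length l (a simplification of the actual iterative procedure described later); x, l and m are placeholders:

```matlab
% De-seasonalisation by additive seasonal indices (illustrative sketch).
% x: column vector of length T = m*l; l = season length; m = number of cycles.
Xmat = reshape(x, l, m);              % column k holds cycle k; row s holds season s
Is   = mean(Xmat, 2) - mean(x);       % Eq. (6): seasonal index for each season s
xd   = x - repmat(Is, m, 1);          % Eq. (7): de-seasonalised series
% After modelling, the forecast is reassembled as the predicted trend plus Is.
```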
As the second step, the trend component is fed into an ML model for prediction. Three types of ML methods, i.e. conventional ML, EL and DL, are used, covering a broad category of the ML family. The reason for these choices is to test the effectiveness of the BMS under different conditions. In ML modelling, the selection of a proper model architecture is critical to prediction accuracy. Due to the complexity, manually searching for suitable hyperparameters is time consuming, and a global optimum is not guaranteed. Consequently, an optimiser, i.e. the BOP, is integrated with the ML models to determine their optimal structures.
In the third step, the ML model is optimised by the BOP. Given a black-box function f(x), its global maximum needs to be found:

x^* = \arg\max_{x \in A} f(x)    (8)

where x* = optimal set of parameters and A = candidate space. With n pairs of inputs and outputs (x_i, y_i), i = 1, 2, \ldots, n, the target is to select the x_{n+1} that leads to a maximized y_{n+1}. The BOP is a solution that uses Bayes' conditional probability rule. It considers the results from previous iterations and chooses values for the next iteration. The primary principles of the BOP are stated below.
• To establish a probabilistic surrogate model based on the Gaussian process (GP). The GP is an extension of a multivariate Gaussian distribution to an infinitely dimensional stochastic process, where any finite linear combination of dimensions has a joint Gaussian distribution. The GP is therefore employed to obtain the prior distribution based on the accumulated observations.
• To apply an acquisition function to determine the next observation location, where the observation property is expected to be the best. The purpose of using an acquisition function is to avoid attaining local optima; this is achieved via a trade-off between exploration and exploitation.
In the end, the SSL prediction is fulfilled by combining the I_s with the estimated T(t) from the Bayesian optimised ML model.
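The following toy sketch shows this workflow with MATLAB's bayesopt on a one-dimensional black-box function; the GP surrogate and the acquisition function are handled internally ('expected-improvement-plus' realizes the exploration-exploitation trade-off). The objective and the 30-iteration budget are illustrative:

```matlab
% Bayesian optimisation of a black-box function (illustrative sketch).
% Note: bayesopt minimises; a maximisation problem like Eq. (8) is handled by negating f.
f    = @(p) (6*p.x - 2).^2 .* sin(12*p.x - 4);        % toy objective (Forrester function)
vars = optimizableVariable('x', [0, 1]);
res  = bayesopt(f, vars, 'MaxObjectiveEvaluations', 30, ...
                'AcquisitionFunctionName', 'expected-improvement-plus', ...
                'Verbose', 0, 'PlotFcn', []);
xBest = bestPoint(res);                               % observed optimum after 30 iterations
```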

Evaluation metrics
Since ML models deal with big data, statistical metrics are often suggested to evaluate their performance (Cho & Kim, 2022; Nourani & Behfar, 2021; Zhang et al., 2018). Common statistical criteria include the Nash-Sutcliffe efficiency (NSE), correlation coefficient (CC), root mean squared error (RMSE) and mean absolute error (MAE). The former two indicate the goodness of fit and the latter two quantify the estimation error. Combined, they provide an all-round assessment of the model skills. The NSE is expressed by

NSE = 1 - \frac{\sum_{i=1}^{N} (O_i - S_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}    (9)

where O_i = i-th observation, S_i = i-th simulation, \bar{O} = mean of observations, \bar{S} = mean of simulations and N = number of data sets. The NSE ranges between −∞ and 1 (perfect fit). The CC index is computed by

CC = \frac{\sum_{i=1}^{N} (O_i - \bar{O})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{N} (O_i - \bar{O})^2 \sum_{i=1}^{N} (S_i - \bar{S})^2}}    (10)

It is a proxy of the linear correlation between two sets of data, ranging from −1 to 1 (perfect fit). The RMSE index measures the discrepancy between the modelled and observed values, given by

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (O_i - S_i)^2}    (11)

It ranges from 0 to +∞, with RMSE = 0 indicating a perfect match between the observation and estimation and a large value signifying model inadequacy. The MAE index presents the mean of all individual errors, calculated by

MAE = \frac{1}{N} \sum_{i=1}^{N} |O_i - S_i|    (12)

Similar to the RMSE, the MAE varies from 0 to +∞, with MAE = 0 marking a perfect model. Both RMSE and MAE are error metrics; the closer to zero, the better the model performance.
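The four criteria translate directly into MATLAB; a compact sketch (placed in its own file or at the end of a script) could read:

```matlab
function [nse, cc, rmse, mae] = skillMetrics(O, S)
% Statistical criteria of Eqs. (9)-(12); O = observations, S = simulations (N-by-1).
nse  = 1 - sum((O - S).^2) / sum((O - mean(O)).^2);   % Nash-Sutcliffe efficiency
R    = corrcoef(O, S);                                % 2-by-2 correlation matrix
cc   = R(1, 2);                                       % correlation coefficient
rmse = sqrt(mean((O - S).^2));                        % root mean squared error
mae  = mean(abs(O - S));                              % mean absolute error
end
```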

Study area and data source
The developed models are appraised using the daily SSL records from a hydrological station on the Yangtze River. This section gives a brief introduction to the study site and data records.

Study site
The chosen study site is the Datong hydrological station on the Yangtze River. Figure 4 shows the river catchment and the geographical location of the site. With a 1.81 × 10⁶ km² catchment area, the whole river basin accounts for approximately one-fifth of the country's land area. Flowing eastwards from the Qinghai-Tibet Plateau into the East China Sea, the river runs through several major cities including Chongqing, Nanjing and Shanghai. It is ∼6300 km long, the third longest river in the world, ranking fourth in sediment flux and fifth in flow discharge. Given the important role of the river, accurate predictions of its sediment transport have profound engineering implications for flood control, power generation, riverain development, etc. Field measurements show that the overwhelming part of the sediment is suspended load. The 1960-1988 data at the Yichang hydrological station, located some 40 km downstream of the well-known Three Gorges dam, show that although the sediment yield shifts greatly from year to year, ranging between 361 and 754 Mt/year, the suspended load averages 523 Mt annually and the bed load only 7 Mt.
The Datong station (117.37°E, 30.46°N) is on the lower reaches of the river (the most seaward station). It is located ∼1245 km downstream of the Three Gorges dam and ∼624 km upstream of the river mouth in Shanghai. Its catchment area is ∼1.71 × 10⁶ km², and the river course at Datong is relatively straight. At normal flow discharges, the river width is ∼1200 m and the cross-channel average water depth is ∼15 m. The in-situ measured maximum flood discharge amounts to 81,000 m³/s, which occurred in July 1998. The construction of the dam has an impact on both sediment sorting and flux: the reservoir impoundment traps the coarse material, and the fine-grained sediment is released downstream. A study of the SSL in the river also provides a better understanding of the influence of the dam on the hydro-environment.

Data source
With diverse preparatory projects, the construction of the Three Gorges dam started in January 1995, with progressive impoundment from June 2003; the construction continued until August 2009. Between January and September 2010, the reservoir operated at a relatively high level close to the full pool. In October 2010, the reservoir attained its full retention level. The records included in this study refer to the suspended sediment flux (kg/s) and cover a period of six years, from 1 January 2010 to 31 December 2015. The data are collected from the Yangtze Water Resources Commission (www.cjw.gov.cn).
A total of 2191 datasets are available, 70% of which are used for model calibration (training) and 30% for validation (testing), a rule of thumb as suggested by Roushangar and Ghasempour (2019). There is no well-established criterion for the division; the 7:3 ratio is commonly used in time series modelling. Table 1 presents the statistics of the field SSL data divided between the training and testing phases, where SD = standard deviation; the unit of all variables (except N) is kg/s.

Results and discussion
With the SSL time series, the proposed models are set up and evaluated. All the models are coded in MATLAB (Version R2021a) on a PC (model Dell Precision 7872 Tower; processor Intel(R) Xeon(R) Gold 5215 CPU @ 2.5 GHz). Their implementation and predictive performance are detailed in the sections that follow.

De-seasonalisation
To best estimate the T(t) of a series, one first estimates and removes I_s. Conversely, to best estimate I_s, one first estimates and removes T(t). Therefore, the SA is usually performed as an iterative process. First, an initial estimate of T(t) is obtained using a moving average; to prevent observation loss, the first and last smoothed values are repeated. Second, the original series is de-trended by removing this estimate (cf. Eq. (7)). Seasonal indices are created and stored, which accounts for the fact that each period may not appear the same number of times within the span of the observed series. Third, a seasonal filter is applied to the de-trended series to obtain an estimate of I_s. Fourth, the data are de-seasonalised by subtracting the estimated I_s from the original time series. The last three steps are repeated to achieve a final de-seasonalised series. T(t) and I_s are shown in Figure 5. The seasonal estimate is centred, fluctuating around zero with a constant amplitude across the series. This is because the additive SA assumes that the seasonal level is constant over the range of the data. The de-seasonalised series consists of the long-term trend and irregular components. I_s exhibits a constant pattern and does not need any prediction, while T(t) features the stochastic trend and needs to be modelled.
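One pass of this procedure can be sketched in MATLAB as follows; movmean stands in for the moving-average trend filter, and the endpoint handling ('shrink') and the per-season-mean seasonal filter are simplifications of the full scheme described above:

```matlab
% One iteration of the seasonal-adjustment loop (illustrative sketch; x of length m*l).
Ttrend = movmean(x, l, 'Endpoints', 'shrink');   % step 1: initial trend via moving average
d      = x - Ttrend;                             % step 2: de-trend the series
Dmat   = reshape(d, l, m);
Is     = mean(Dmat, 2);                          % step 3: seasonal filter (per-season mean)
Is     = Is - mean(Is);                          % centre the indices around zero
xd     = x - repmat(Is, m, 1);                   % step 4: de-seasonalise
% Steps 2-4 are then repeated on the updated estimates until they stabilise.
```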

Input selection
A selection of proper model inputs is vital to data-driven modelling. There are usually two types of inputs: (a) external contributing factors of the target variables (Li et al., 2021; Roushangar & Ghasempour, 2019) and (b) time-delayed values of the targets (Mehr, 2018). Although both approaches generate accurate results, the former is limited by the availability of the exogenous variables. Requiring only the historical information of the targets, the latter is therefore adopted in this study. Define S(t) as the SSL at time step t, and S(t−Δt) as the lagged SSL (Δt = time lag). The determination of the S(t−Δt) that have a significant impact on S(t) becomes the key step in input selection. The partial auto-correlation (PAC) is often used for this purpose (Hadi & Tombul, 2018; Li et al., 2022). The PAC computes the relationship of a stationary time series with its own lagged values, which helps identify the lag extent in an autoregressive model. Despite its successful applications, some argue that this method is incapable of capturing the nonlinear features of a hydrological system (Nayak et al., 2004; Senthil Kumar et al., 2005). For this reason, an additional method, the average mutual information (AMI), is employed to determine the inputs. The AMI is a nonlinear generalization of the auto-correlation function, which measures the amount of information that the variables provide about one another (Farvardin, 2003). Applications of this method for input selection are reported by Vlachos and Kugiumtzis (2009) and Wallot and Mønster (2018). To obtain the optimal inputs, both the PAC and the AMI method are examined. Figure 6 presents the correlation between S(t) and S(t−Δt) computed by the PAC and AMI approaches. For the former, the largest value appears at Δt = 1 d, and the most significant lagged variables are S(t−1), S(t−2), S(t−3) and S(t−9). For the latter, the dependency of S(t) on S(t−Δt) drops as Δt gets large. The top three influential lagged data are chosen as the potential informative predictors. Taking into account the results from both methods, all the possible combinations of different S(t−Δt) are tested to find the effective inputs, shown in Table 2.
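A sketch of the two screening tools is given below: parcorr (Econometrics Toolbox) for the PAC, and a simple histogram estimator for the AMI. The maximum lag and bin count are assumptions for illustration, not values reported in the paper:

```matlab
% Lag screening by partial auto-correlation and average mutual information (sketch).
[pacf, lags] = parcorr(S, 'NumLags', 20);        % PAC of the SSL series S up to lag 20

maxLag = 20;  nbins = 32;  ami = zeros(maxLag, 1);
for tau = 1:maxLag
    x1  = S(1:end-tau);  x2 = S(1+tau:end);      % series vs. its tau-day lagged copy
    P12 = histcounts2(x1, x2, nbins, 'Normalization', 'probability');
    P1  = sum(P12, 2);  P2 = sum(P12, 1);        % marginal distributions
    PP  = P1 * P2;  nz = P12 > 0;                % outer product of the marginals
    ami(tau) = sum(P12(nz) .* log(P12(nz) ./ PP(nz)));
end
```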

Benchmark study
The standard extreme learning machine (ELM) is adopted to predict the SSL, a choice based on its relatively high efficiency in hydrological modelling (Ebtehaj et al., 2016; Hazarika et al., 2020). The purpose of employing the ELM is twofold: (a) to identify the most effective input for data-driven models and (b) to create a benchmark for comparison. The number of hidden neurons and the transfer function are two critical parameters for the ELM. Without established guidelines, they are usually determined by trial and error. It is found that a neuron number between 4 and 11 is sufficient to achieve satisfactory predictions. With the rbf, sigmoid, tanh and linear transfer functions examined, the rbf yields the best results. Table 3 shows the statistical performance of the ELM with different inputs. Among the single-variable inputs (M1-M4), M1 leads to the highest accuracy, while M4 generates the lowest. Consistent with the PAC and AMI analysis, the significance of S(t−Δt) reduces as Δt increases. It is also noted that adding more input variables does not necessarily produce better predictions (e.g. as with M1 and M7), which is presumably caused by noise signals. The models with one-day lagged input generally result in satisfactory performance. Input M6 offers the most accurate results, with NSE = 0.948 and 0.937 in the training and testing phases, respectively. Consequently, it serves as the optimal input for data-driven modelling.
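Since the ELM trains its output weights in closed form, a compact sketch conveys the idea; the rbf centres drawn from the training data, the kernel width and the variable names (Xtr, Ytr, Xte) are illustrative assumptions:

```matlab
% Minimal extreme learning machine with rbf hidden layer (illustrative sketch).
n     = 8;                                       % hidden neurons (4-11 found sufficient)
C     = Xtr(randperm(size(Xtr, 1), n), :);       % random rbf centres from training inputs
sigma = 1;                                       % kernel width (assumed)
Htr   = exp(-pdist2(Xtr, C).^2 / (2*sigma^2));   % hidden-layer outputs, N-by-n
beta  = pinv(Htr) * Ytr;                         % output weights by least squares
Yhat  = exp(-pdist2(Xte, C).^2 / (2*sigma^2)) * beta;   % predictions on the test set
```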

Performance of the BMS framework
With the identified optimal input, the proposed BMS framework is applied to forecast the daily SSL. Three regressors of two variants each are tested, generating six sub-models. For example, if the GSVM is employed, the resulting model is referred to as the BGSVMS. For demonstration, only the important parameters in each standard model are selected for optimisation using the BOP algorithm. As the BOP is a stochastic method, the outputs from different runs might differ. Therefore, each model is run three times and the run with the lowest error is retained. For each run, the BOP operates 30 iterations. For comparison, the corresponding stand-alone models are also presented.

SVM and BSVMS models
An SVM model contains three essential parameters: box constraint (BC), kernel scale (KS) and epsilon. Referring to the parameter C in Eq. (3), the BC keeps the allowable values of the Lagrange multipliers a_i in a 'box', a bounded region. This parameter controls the penalty imposed on misclassified training points and thus the complexity of the prediction function (Karatzoglou et al., 2006). A high C value leads to a complex prediction function with overfitting risks, while a low value causes the opposite effect. The suggested BC values for the Gaussian and linear kernels are IQR(Y)/1.349 and 1, respectively, with IQR denoting the interquartile range (MATLAB, 2021). The KS is a scaling parameter for the input data, which are usually scaled with respect to a feature before being applied to the kernel function. If the range of some features is large, their inner product can dominate the kernel calculation. The suggested KS value is 1 (MATLAB, 2021). The epsilon defines a margin of tolerance within which no penalty is given to errors. The larger it gets, the larger the errors allowed in the solution.
The suggested epsilon is IQR(Y)/13.49, an estimate of a tenth of the standard deviation using the IQR of the response variable Y. If IQR(Y) = 0, then the default epsilon value is 0.1. To improve the performance of the SVMs, the BOP method searches for the optimal values of the above-mentioned parameters. Figure 7 plots the objective values at different iterations. If there are multiple points with the same objective value, the one that appears first is considered the optimum, as less computational power (i.e. fewer iterations) is needed. The variations of the objective for both models exhibit a similar pattern, an initial drop followed by a stable state after several iterations. The observed best point for the BGSVMS and BLSVMS appears at the 25th and 16th iteration, respectively, with their respective function values being 3.22 × 10⁻² and 3.59 × 10⁻². For the BLSVMS, the optima are BC = 985.61, KS = 1.71 and epsilon = 0.168. Table 4 summarizes the statistical indices of both the optimised and the standard models. The GSVM and LSVM yield satisfactory results, with NSE = 0.933 and 0.950 in the testing phase. The proposed framework successfully enhances the performance of the standard GSVM and LSVM. For the GSVM in the testing phase, the NSE improves by 2.5% and the CC by 1.2%; the RMSE lowers by 19.2% and the MAE by 8.9%. For the LSVM, the goodness-of-fit increases by 0.7% (NSE) and 0.3% (CC); the error reduces by 7.7% (RMSE) and 10.3% (MAE). Meanwhile, the BGSVMS and BLSVMS demonstrate a comparable level of accuracy, both of which outperform the benchmark ELM.
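In MATLAB, a search over BC, KS and epsilon of this kind can be expressed through fitrsvm's built-in Bayesian optimisation; the sketch below is one plausible setup, not the study's exact configuration (Xtr and Ytr are the training predictors and targets):

```matlab
% BOP-tuned Gaussian SVM, cf. the BGSVMS (illustrative sketch).
rng default                                       % reproducible optimisation run
mdl = fitrsvm(Xtr, Ytr, 'KernelFunction', 'gaussian', ...
    'OptimizeHyperparameters', {'BoxConstraint', 'KernelScale', 'Epsilon'}, ...
    'HyperparameterOptimizationOptions', struct('Optimizer', 'bayesopt', ...
        'MaxObjectiveEvaluations', 30, 'ShowPlots', false, 'Verbose', 0));
```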
In the form of scatter plots, Figure 8 compares the observed with the predicted SSL by the SVM-based models. The standard models give satisfactory results, with R² = 0.952 for the GSVM and R² = 0.958 for the LSVM. The BGSVMS and BLSVMS demonstrate even higher accuracy than their counterparts, with R² = 0.969 and 0.968, respectively. Their predictions collapse closely onto the perfect line, although the forecast skill slightly deteriorates at the peak values. Figure 9 compares the measured SSL time series with the BGSVMS and BLSVMS predicted ones. Both models satisfactorily capture the SSL fluctuations, featuring good agreement with the measurements.

EL and BELS models
The EL model performance depends on both the base learners and the ensemble strategy; an optimal structure of each leads to higher accuracy. This study examines the regression tree as weak learner and the Boosting (BO) and Bagging (BA) as ensemble methods. Integrating them into the suggested framework leads to the BBOS and BBAS models, respectively. The hyperparameters to be tuned include the number of ensemble learning cycles (LC), the maximal number of decision splits (DS), the minimal number of leaf nodes (LN) and the learning rate (LR). The LC is the number of times the ensemble algorithm trains the base learners. The DS refers to the depth of the tree; the deeper, the more complex. Each leaf node is a class label (decision taken after computing all attributes), and the LN is the minimal number of labels. To achieve high accuracy, it is required to train an ensemble using shrinkage, i.e. a learning rate below 1 (only applicable to the BO). The suggested values are LC = 100, DS = 10, LN = 5 and LR = 0.1.
Table 4 compares the statistics of the BBOS and BBAS with their counterparts BO and BA. The BO yields high accuracy in the training phase and relatively low efficacy in the testing, indicating poor generalisation capability. However, if it is integrated into the BMS framework, this drawback is avoided; the BBOS generates similar accuracy at both stages. For the testing, the BBOS shows superior performance over the BO, with gains in the NSE and CC of 4.3% and 1.6%, and reductions in the RMSE and MAE of 24.9% and 24.2%. Similarly, the BMS framework enhances the BA performance: its NSE and CC improve by 0.6% and 0.2%, and its RMSE and MAE drop by 4.7% and 2.8%. Figure 11 presents the scatter plots of the measured against the predicted SSL by the ensemble-based models. The BO and BA results show a dispersion greater than their enhanced variants. The BMS upgrades the standard models in accuracy: the R² increases from 0.969 to 0.970 for the BO, and from 0.963 to 0.972 for the BA. Nevertheless, slight underestimations of the SSL peaks occur in all the models. Figure 12 compares the observed time series with the estimated ones by the BBOS and BBAS. In a satisfactory manner, both models reproduce the SSL variations and generate accurate results.
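A corresponding sketch for the boosted ensemble exposes the four hyperparameters to bayesopt; the search ranges and the 5-fold cross-validated loss are assumptions for illustration:

```matlab
% BOP-tuned boosted trees, cf. the BBOS (illustrative sketch).
vars = [optimizableVariable('LC', [10, 500], 'Type', 'integer')       % learning cycles
        optimizableVariable('DS', [1, 50],   'Type', 'integer')       % max decision splits
        optimizableVariable('LN', [1, 20],   'Type', 'integer')       % min leaf size
        optimizableVariable('LR', [1e-2, 1], 'Transform', 'log')];    % learning rate
loss = @(p) kfoldLoss(fitrensemble(Xtr, Ytr, 'Method', 'LSBoost', ...
    'NumLearningCycles', p.LC, 'LearnRate', p.LR, ...
    'Learners', templateTree('MaxNumSplits', p.DS, 'MinLeafSize', p.LN), ...
    'KFold', 5));                                 % 5-fold cross-validated MSE
res  = bayesopt(loss, vars, 'MaxObjectiveEvaluations', 30, 'Verbose', 0, 'PlotFcn', []);
```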

DL and BDLS models
In an LSTM, the number of layers (NL) and hidden units (HU) are two influential parameters of its architecture. The former defines the network complexity, and the latter determines how much information is learned by the layer. Using more HU might yield more accurate results but is more likely to lead to overfitting. The initial learning rate (ILR) refers to the learning speed for training: if the ILR value is too low, the training might take a long time; if it is too high, the training might reach a suboptimal result or diverge. Adding a regularization term (also known as weight decay, WD) for the weights to the loss function is one way to reduce overfitting (Murphy, 2012). As an example, these four parameters are optimised by the BOP method. The conventional and bi-directional LSTM (BiLSTM) are used to examine the effectiveness of the BMS framework.
Table 4 depicts the statistical performance of the standard and optimised DL models. The simple LSTM and BiLSTM exhibit insignificant differences, indicating that the bi-directional structure has limited effects on the SSL prediction. However, their combinations with the BMS generate more accurate results. For instance, in comparison with the LSTM in the testing phase, the BLSTMS leads to a 1.5% and 0.3% increase in the NSE and CC and a 12.1% and 16.8% decline in the RMSE and MAE. Likewise, the BBiLSTMS yields more accurate predictions than the BiLSTM, with enhancements in the NSE and CC of 1.6% and 0.3%, and reductions in the RMSE and MAE of 13.7% and 9.8%. Figure 14 illustrates the scatter plots of the measured against the forecasted SSL by the DL-based models. The modified models produce better results than the conventional ones. Their predictions distribute closely around the ideal line, with slight underestimation at large values. Overall, the BMS is an effective framework with significantly enhanced model performance. The R² is boosted for both models, which accurately simulate the highly oscillating SSL.
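The four DL knobs map onto the network depth and width plus two trainingOptions fields. The sketch below assembles a network from a BOP trial point p with fields NL, HU, ILR and WD; the epoch count and the single stacked layer shown are illustrative assumptions:

```matlab
% LSTM assembled from a BOP trial point p (fields NL, HU, ILR, WD); illustrative sketch.
layers = [sequenceInputLayer(1)
          lstmLayer(p.HU)                        % NL controls how many such layers are stacked
          lstmLayer(p.HU, 'OutputMode', 'last')  % final layer emits one output step
          fullyConnectedLayer(1)
          regressionLayer];
opts   = trainingOptions('adam', 'InitialLearnRate', p.ILR, ...   % ILR
                         'L2Regularization', p.WD, ...            % weight decay (WD)
                         'MaxEpochs', 150, 'Verbose', false);
% net = trainNetwork(XTrain, YTrain, layers, opts);
```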

Model comparison
A performance comparison of all the models facilitates the choice of the most accurate model for practical applications. Figure 16 presents the prediction errors in the testing phase. It is obvious that the hybrid models outperform their counterparts. The BGSVMS and BLSVMS are the most accurate ones, yielding the lowest errors. Among the integrated models, the BELS algorithms generate the highest error levels, followed by the BDLS ones. Meanwhile, it shows that the DL is not necessarily more accurate than the conventional ML methods, as indicated by Zhu et al. (2020).
To further examine the prediction errors of the BMS models, Figure 17 displays the histograms of the relative errors. Figure 18 presents the violin plot of the measured and predicted SSL. The green dot shows the median; the magenta bar in the centre represents the IQR; the thin black line displays the rest of the distribution. The wide section of the violin below 5000 kg/s indicates the high probability of SSL within this range. All the modified models give an accurate estimation of the SSL. Their predictions agree well with the observation, in terms of the median, IQR, etc. The BLSVMS and BLSTMS present better performance than the others.
The previous analysis demonstrates the superiority of the hybrid models over the conventional ones. However, at high flow conditions, all the models show some deficiency. This is primarily due to the highly fluctuating streamflow during the wet seasons. The SSL varies with the river discharge, rendering it a challenging task to model. To examine the robustness of the models, this section focuses on the wet seasons and further assesses the model performance. A rainy month with large floods is randomly selected for the evaluation. Figure 19 presents the comparison of the observed and modelled SSL during the flood event. The results confirm the improved performance of the modified models. Compared with the standard models, the hybrid ones yield a better approximation with respect to the measurements under the flood conditions. Although certain underestimation exists in all of them, the SSL peaks are better captured. Figure 20 shows the prediction errors (RMSE and MAE) during the flood event. All the proposed models gain by comparison. The BMS significantly cuts down the errors, with reductions of up to 47.9% for the RMSE and 48.3% for the MAE. The GSVM sees the most significant improvement and the BO the least.

Discussions
In a fluvial system, SSL forecasts play an essential role in understanding its flow and sediment transport behaviours. As the SRC method depends merely on the correlation between the discharge and the sediment flux, it often fails to produce accurate results, particularly for flows with high non-linearity and non-stationarity (Li et al., 2022). Previous studies have demonstrated the high performance of ML techniques in tackling these problems (Bayram et al., 2012; Nourani & Andalib, 2015; Rajaee et al., 2011; Shiri & Kişi, 2012). To enhance the modelling accuracy, continual efforts are devoted to optimizing the ML models, with this study as part of them.
The developed BMS is a decomposition-based ML model coupled with an optimiser. Thanks to the extraction of informative components, data pre-processing has proved to be an effective method for time series modelling (Abdoos, 2016; Li et al., 2013; Rajaee et al., 2011). Compared with the commonly used WT, the SA approach in this study does not introduce extra sub-series that might be redundant. Meanwhile, it avoids the inclusion of future data and the selection of decomposition levels and filters. Despite its use in the prediction of wind speed by e.g. Wang et al. (2014) and Zhang et al. (2013), the SA application in SSL modelling is barely reported; this study demonstrates its effectiveness in SSL forecasts. The BOP algorithm is found to be advantageous in the optimisation of the architecture of the ML models, which conforms to the studies by Wu et al. (2019) and Kouziokas (2020). The combination of the pre-processing and optimisation techniques boosts the performance of the conventional ML, EL and DL methods. One practical application of the developed models is to achieve high-accuracy SSL prediction in a complex environment: if standard models fail to produce satisfactory results, the BMS provides an option for enhanced estimation. Meanwhile, it is a promising method for forecasting similar hydrological events. As a complex system, multiple factors influence the sediment dynamics in a natural river, including vegetation, grain size, soil conservation measures, geographical properties, etc. (Idrees et al., 2021). This study adopts the historical SSL to estimate the future trend. Consideration of more contributing parameters would further enhance the model performance, an issue that is yet to be explored. Another limitation of the BMS model is the increased computing cost. The BOP optimizes the structures of the standard models by searching for the best combination of hyperparameters, which leads to greater computational complexity and requires more CPU time. The average execution time of the hybrid models is ∼3.2 times that of the individual ones. An examination of alternative optimisation techniques, e.g. the artificial bee colony and the grey wolf optimiser, can be made in future studies.

Conclusions
With the purpose of improving forecasts of suspended load in river flows, this study establishes a new framework (BMS) by integrating a de-noiser (SA) and an optimiser (BOP) into an ML model. The SA aims to extract the global trends in the time series, free from the seasonality effect, while the BOP optimises the structure of the ML model and guarantees its optimal performance. The BMS is evaluated with three types of ML models: conventional ML (GSVM and LSVM), EL (BA and BO) and DL (LSTM and BiLSTM).
The model examinations employ the daily SSL records from a hydrological station on the lower Yangtze River, with the performance statistically evaluated. The integrated models are compared with each individual one, and the former gain by comparison. The BMS is effective in enhancing the accuracy of the standard ML models. At the testing stage, for the SVMs, the NSE and CC improve by up to 2.5% and 1.2%; the RMSE and MAE decline by up to 19.2% and 10.3%. In the EL models, a more significant augmentation in performance is achieved for the BO, with an increase in the NSE and CC by 4.3% and 1.6%, and a reduction in the RMSE and MAE by 24.9% and 24.2%. The BMS gives a moderate boost to the DL methods: their goodness-of-fit indices improve by up to 1.6% for the NSE and 0.3% for the CC, and their error statistics drop by up to 13.7% for the RMSE and 16.8% for the MAE.
A comparison among the hybrid models shows that the BSVMS is the most accurate, followed by the BDLS and the BELS. However, all models exhibit slight deficiency at large sediment fluxes. Under flood conditions, underestimation exists in both the standard and combined models; the latter are still more reliable than the former. The BMS is beneficial to the prediction accuracy, considerably reducing the errors by up to 47.9% for the RMSE and 48.3% for the MAE. The most noteworthy improvement occurs in the GSVM and the least in the BO. This paper demonstrates that the BMS is a reliable option for accurate predictions of suspended sediment in rivers, thanks to the de-seasonalisation and hyper-parameter tuning modules that improve modelling robustness and adaptability. The suggested data pre-processing and parameter optimisation techniques also provide a reference for time series modelling. The framework's performance in similar hydrological problems, e.g. dissolved gas and drought, remains to be explored. In addition, the influence of external contributing factors (e.g. vegetation and discharge) on the SSL can be examined, and further optimisation is required for fast calculation.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This study is funded by the Swedish Hydropower Centre (SVC). It is part of the research project entitled Quality and trust of numerical modelling of water-air flows for safe spillway discharge, with James Yang and Anders Ansell as project leaders.

ORCID
James Yang http://orcid.org/0000-0002-4242-3824