Stochastic simulation of daily runoff in the middle reaches of the Yangtze River based on SVM-Copula model

ABSTRACT Hydrological stochastic simulation is an important method for hydrologic analysis, but it needs a large amount of measured data to determine the distribution functions of the variables, which impedes random simulation in small-sample cases. The characteristics of the daily runoff series in the middle reaches of the Yangtze River have changed considerably since the Three Gorges Dam was built in 2003, and the hydrologic data measured since 2003 are insufficient for stochastic simulation with conventional methods. To overcome this difficulty, a simulation model coupling the support vector machine (SVM) with the Copula function is established in this paper. The measured data of Hankou Hydrological Station from 2003 to 2015 are used as an example to verify the applicability of the model. The results show that the model maintains the statistical characteristics of the original daily runoff series well and could offer a new approach to random simulation when data are insufficient.


Introduction
Hydrological processes are complicated by the influence of meteorology, terrain and human activities. Stochastic simulation, which studies the characteristics of random hydrological processes, has become an important means of understanding and managing complex hydrology, and a reasonable model is the key to effective stochastic simulation. Runoff is an important hydrological element, and the runoff sequences generated by stochastic models are used to obtain required hydrological quantities such as flood discharge, firm capacity and reservoir design flood level. Runoff simulation therefore plays an important role in engineering design, hydrology and water resources management. With the continuous development of stochastic hydrology, many types of stochastic models have been formed, such as autoregressive models, disaggregation models, shot noise models, wavelet models and stochastic models based on the Copula function. Stochastic models based on Copula functions can effectively describe the relationship between variables and construct multivariate joint distributions, and they have been widely used in runoff stochastic simulation since Sklar first proposed the Copula theory in 1959 (Sklar, 1973). In recent years, many studies have addressed the theory and application of Copula simulation models in hydrology. Pereira, Álvaro, Erhardt, and Czado (2016) developed a statistical model that considers the spatial and temporal dependencies of runoff for runoff scenario simulation; an applied study showed that the model could reproduce the temporal and spatial dependence structure. Chen and Guo (2015) proposed a Copula-based method for generating multi-site monthly and daily runoff data, which realizes multi-site stochastic simulation by establishing a trivariate Copula.

CONTACT Jie Zhang 85304438@qq.com
Liu, Zhou, Chen, and Guan (2015) proposed a multivariate conditional model for runoff prediction and the refinement of spatial prediction estimates; the model consisted of high-dimensional vine Copulas, conditional bivariate Copula simulations and a quantile-Copula function. Kong, Huang, Fan and Li established a maximum entropy-Gumbel-Hougaard Copula model for daily runoff simulation, and an example study showed the model to be effective. Hao and Singh (2012) proposed an entropy-Copula model for single-site monthly runoff simulation in which the joint distribution of adjacent monthly runoff was constructed with the Copula method. Lee and Salas (2014) explored the applicability of the Copula method for stochastic runoff simulation and investigated the pros and cons of different Copula time series models. Jeong and Lee (2015) developed a Copula-based stochastic simulation model for intermittent monthly flow in arid regions and applied it to intermittent runoff of the Colorado River for water management. Mei, Yin, Hong, Li, and Li (2016) compared the double model simulation method with the typical solution set decomposition method for synthesizing annual runoff and found that both methods preserve the sample mean well. Yang, Si, Fan, Kong, and Lin (2019) analysed the correlation between rainfall and runoff in the Xiangxi River with the Archimedean Copula method and obtained the joint distribution of monthly rainfall and runoff.
In these studies, determining the marginal distribution of each variable and the joint distribution between variables are the two keys to constructing an appropriate stochastic model. The joint distribution function can be determined by selecting an appropriate Copula function, whereas the marginal distribution function is usually determined by probabilistic methods, which require a sufficiently large historical sample. The above models are therefore usually built on measured river runoff records spanning at least several decades.
Since the Three Gorges Reservoir was impounded in 2003, the runoff at the hydrological stations in the middle and lower reaches of the Yangtze River has changed significantly under the influence of reservoir dispatching and rainfall. The within-year daily runoff sequence has changed especially strongly, with a decrease during the flood season and an increase during the dry season. Random simulation of the runoff sequence after impoundment has become a new focus for the dispatching and flood control of the Three Gorges Reservoir. The hydrological data accumulated at these stations since impoundment cover less than 20 years, far short of the sample size needed to determine a variable's marginal distribution by the frequency statistical method. Finding a new and effective method for identifying the marginal distribution has therefore become the key to establishing a runoff simulation model. The support vector machine (SVM) is a machine learning method based on statistical learning theory that was developed in the mid-1990s. It seeks to improve the generalization ability of the learning machine by minimizing the structural risk, which jointly controls the empirical risk and the confidence range. Hence it performs well on small-sample and nonlinear problems. Zhou, Jun, Zhang, and Jun (2017) proposed a model combining the wavelet transform (WT) with the bacterial colony chemotaxis algorithm and the support vector machine (WT-BCC-SVM) and applied it to the diagnosis of pesticide residues to show that it works. The SVM classification method has been widely used in mechanical engineering, medicine, artificial intelligence and other fields.
Diyana, Yee, Tadaaki, Hiroshi, and Fuminori (2017) presented a two-stage multi-colour tongue classification based on the SVM method for precise tongue colour diagnosis, and an application showed satisfactory diagnostic accuracy. Chen, Wang, Yang, and Zhou (2006) proposed an approach based on rough set theory and the SVM method for facial expression recognition that considers geometric features only and achieves a high recognition ratio. Jangid, Dhir, Rani, and Singh (2011) used an SVM classifier with an RBF kernel to classify handwritten numerals. Jing, Cao, Hu, and Gao (2015) proposed a reliability life distribution pattern recognition method based on the multi-classification fuzzy support vector machine (M-FSVM); simulation experiments showed that the model overcomes shortcomings of traditional support vector machines and recognizes the commonly used life distribution patterns intelligently. Extensive practice has shown that the SVM classification method can effectively identify different types of images and texts. Combining the SVM pattern recognition method with the Copula function for constructing joint distributions, this paper establishes an SVM-Copula coupled stochastic model, which provides a new approach to the random simulation of daily runoff sequences in small-sample cases.

Identification of the marginal distributions of daily runoff based on the SVM theory
The purpose of SVM classification is to find, in a computationally efficient way, the optimal classification hyperplane in a high-dimensional feature space. SVM classification has been widely used in text classification, image recognition, handwriting recognition, and biometric classification and recognition (Cristianini & Shawe, 2001).

The algorithm of SVM
The goal of classification is to find the optimal hyperplane. The target hyperplane is required not only to separate the sample points correctly but also to maximize the classification interval (margin), which corresponds to the minimum structural risk (Cherkassky, 1995). When the samples are not linearly separable, a nonlinear mapping transforms them from the low-dimensional input space into a high-dimensional feature space in which they become linearly separable.

SVM bivariate classification algorithm
When two sets are separable, they can be separated by a hyperplane lying midway between them. The basic idea of the SVM algorithm is to find this supporting hyperplane. The bivariate (two-class) classification algorithm is as follows (Sakthivel, Sugumaran, & Nair, 2010).
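Set the training sample set as

$$\{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{+1, -1\} \tag{1}$$

where n is the number of training samples, d is the dimension of each training sample vector, and y is the category label (if y = +1, the sample belongs to one class; if y = −1, the sample belongs to the other class).
In linearly inseparable cases, a transform function $\varphi: x \to \varphi(x)$ maps the training samples from the input space to the high-dimensional feature space, with the kernel function $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$. Designing an SVM bivariate classifier means finding the optimal hyperplane that maximizes the classification interval when the sample set is linearly separable in the high-dimensional feature space. The classification model can be formulated as the following constrained optimization problem:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\,(w \cdot \varphi(x_i) + b) \ge 1, \quad i = 1, 2, \ldots, n \tag{2}$$

A slack variable $\xi_i \ge 0$ can be introduced to handle samples that are still linearly inseparable in the high-dimensional feature space. The constrained optimization problem in Equation (2) then becomes

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\,(w \cdot \varphi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, n \tag{3}$$

where C is the penalty factor. To ease the solution, the above problem can be transformed into its dual problem:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \tag{4}$$

where $\alpha_i$ is the corresponding Lagrange factor. After the optimal solutions of $\alpha_i$ and b have been obtained by solving the above constrained optimization problem, the classification function is

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i^{*} y_i K(x_i, x) + b^{*}\right) \tag{5}$$

where $\alpha_i^{*}$ and $b^{*}$ are the optimal solutions of $\alpha_i$ and b, respectively.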
SVM classification algorithms differ according to the kernel function K(·) selected. The commonly used kernel functions are the linear, polynomial, radial basis and sigmoid kernel functions.
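As an illustration of the four kernel families and of the standard SVM decision rule f(x) = sgn(Σ α_i y_i K(x_i, x) + b), the following minimal Python sketch may help. The kernel hyperparameters (`gamma`, `coef0`, `degree`) and the hand-picked support vectors and multipliers in the usage note are assumptions made for the example; they are not values from this study, and no training (solution of the dual problem) is performed here.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """K(x, z) = (gamma * x . z + coef0) ** degree (hyperparameters are illustrative)."""
    return (gamma * linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    """K(x, z) = tanh(gamma * x . z + coef0)"""
    return math.tanh(gamma * linear_kernel(x, z) + coef0)

def decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate sgn(sum_i alpha_i * y_i * K(x_i, x) + b) for already-known alphas and b."""
    score = b + sum(a * y * kernel(sv, x)
                    for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score >= 0 else -1
```

For instance, with two hypothetical support vectors `(0, 0)` (label −1) and `(2, 2)` (label +1), equal multipliers and b = 0, `decision((2.0, 2.0), ...)` returns +1 and `decision((0.0, 0.0), ...)` returns −1.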

SVM multivariate classification algorithm
When more than two classes must be distinguished, a suitable multi-class classifier has to be constructed. At present, multi-class classifiers are mainly built by combining several two-class classifiers, using either the one-to-one method or the one-to-all method. In practical applications, the one-to-one method is more commonly used. It constructs a classifier between every pair of classes and completes the classification by voting.

Identification of the daily runoff marginal distribution based on SVM-classifier
Practice shows that the distribution patterns commonly used in hydrological analysis are Normal distribution, Lognormal distribution, Weibull distribution, Exponential distribution, Pearson distribution and Gamma distribution.
The characteristics of a distribution can be measured and described in three respects. The first is central tendency, which reflects the degree to which the data cluster around a central value, such as the median, quantiles, average, harmonic mean or geometric mean. The second is dispersion, which reflects how far the data spread away from the central value, such as the heterogeneous ratio, interquartile range, standard deviation and variance, standard deviation coefficient or discrete coefficient. The third is shape, reflected by the skewness and kurtosis. In view of the main features of hydrological data, seven characteristic parameters were selected: the upper quartile, lower quartile, interquartile range, standard deviation, standard deviation coefficient, skewness coefficient and kurtosis coefficient.
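The seven characteristic parameters can be computed from a sample as in the sketch below. The paper does not state which estimator conventions were used, so several choices here are assumptions: quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`, the standard deviation is the population form, and skewness and kurtosis are the plain moment ratios (kurtosis not reduced by 3). The function name `distribution_features` is hypothetical.

```python
import statistics

def distribution_features(sample):
    """Return the seven shape features used to describe a distribution pattern."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sigma = statistics.pstdev(sample)  # population standard deviation (assumption)
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    m3 = sum((x - mean) ** 3 for x in sample) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in sample) / n  # fourth central moment
    return {
        "Q0.25": q1,
        "Q0.75": q3,
        "Q": q3 - q1,               # interquartile range
        "sigma": sigma,
        "v": sigma / mean,          # standard deviation (discrete) coefficient
        "Skew": m3 / sigma ** 3,
        "Kur": m4 / sigma ** 4,     # non-excess kurtosis (assumption)
    }
```

Applied to the toy sample `[1, 2, 3, 4, 5]`, this gives quartiles 2 and 4, an interquartile range of 2 and zero skewness, as expected for a symmetric sample.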
With different distribution parameters, the above distribution types were subdivided into 573 distribution patterns, which are shown in Table 1.
The set of distribution feature parameter values corresponding to each distribution pattern constitutes a training sample. The training sample values were calculated and are shown in Table 2.
An SVM two-class classifier was constructed for each pair of the six distribution types, giving 15 classifiers in total, and the distribution pattern of each test sample was identified by voting. The distribution type with the most votes is taken as the marginal distribution of the test sample, after which its marginal distribution function can be determined.
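The one-to-one voting scheme over the six distribution types can be sketched as follows. The pairwise classifiers are passed in as plain callables; the `dummy_classifiers` defined here are stand-ins with a fixed answer, not trained SVMs, and exist only to show how the 15 votes are collected and counted.

```python
from collections import Counter
from itertools import combinations

PATTERNS = ["Normal", "Lognormal", "Weibull", "Exponential", "Pearson", "Gamma"]

def one_vs_one_vote(features, pairwise_classifiers):
    """Collect one vote per pair of patterns and return (winner, vote counts).

    pairwise_classifiers maps each pair (a, b) to a callable that takes the
    feature vector and returns either a or b.
    """
    votes = Counter()
    for pair in combinations(PATTERNS, 2):
        votes[pairwise_classifiers[pair](features)] += 1
    winner, _ = votes.most_common(1)[0]
    return winner, votes

def make_dummy(pair):
    # Stand-in for a trained two-class SVM: prefers "Lognormal" when it is
    # one of the two candidates, otherwise the first pattern of the pair.
    return lambda features: "Lognormal" if "Lognormal" in pair else pair[0]

dummy_classifiers = {pair: make_dummy(pair) for pair in combinations(PATTERNS, 2)}
```

With six patterns there are C(6, 2) = 15 pairs, so each test sample receives 15 votes and no pattern can receive more than 5.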

Copula functions
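The concept of the Copula was first introduced by Sklar in 1959 to answer M. Fréchet's question about the relationship between multi-dimensional distribution functions and their low-dimensional margins. A Copula function is a multi-dimensional joint distribution function whose marginal distributions are uniform on [0, 1]. It joins the marginal distributions of multiple random variables to construct a joint distribution (Nelsen, 1999):

$$F(x_1, x_2, \ldots, x_n) = C\big(F_1(x_1), F_2(x_2), \ldots, F_n(x_n)\big) \tag{6}$$

where $X_i$ is a random variable and $F_i(X_i)$ is the marginal distribution function of $X_i$.
For the two-dimensional case:

$$F(x, y) = C_\theta\big(F_X(x), F_Y(y)\big) = C_\theta(u, v) \tag{7}$$

where θ is a parameter of the Copula function, and $u = F_X(x)$ and $v = F_Y(y)$ are the marginal distributions of the random variables X and Y, respectively. Copula functions can be roughly divided into elliptical Copulas, Archimedean Copulas and quadratic Copulas. Among them, Archimedean Copula functions are widely used in hydrology. The expressions and parameter ranges of the common two-dimensional Archimedean Copulas are as follows.
Gumbel-Hougaard (GH) Copula function:

$$C(u, v) = \exp\left\{-\left[(-\ln u)^{\theta} + (-\ln v)^{\theta}\right]^{1/\theta}\right\}, \quad \theta \ge 1 \tag{8}$$

Clayton Copula function:

$$C(u, v) = \left(u^{-\theta} + v^{-\theta} - 1\right)^{-1/\theta}, \quad \theta > 0 \tag{9}$$

Ali-Mikhail-Haq (AMH) Copula function:

$$C(u, v) = \frac{uv}{1 - \theta(1 - u)(1 - v)}, \quad \theta \in [-1, 1) \tag{10}$$

Frank Copula function:

$$C(u, v) = -\frac{1}{\theta}\ln\left[1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right], \quad \theta \ne 0 \tag{11}$$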
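The four Archimedean Copulas named above have well-known closed forms, which the following sketch implements directly for arguments u, v in (0, 1]. The function names are chosen for the example. A quick sanity check is the boundary property C(u, 1) = u, which every Copula must satisfy.

```python
import math

def gumbel_hougaard(u, v, theta):
    """GH Copula, theta >= 1."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def clayton(u, v, theta):
    """Clayton Copula, theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def amh(u, v, theta):
    """Ali-Mikhail-Haq Copula, -1 <= theta < 1."""
    return u * v / (1.0 - theta * (1.0 - u) * (1.0 - v))

def frank(u, v, theta):
    """Frank Copula, theta != 0 (expm1/log1p keep precision for small theta)."""
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta
```

Setting θ = 0 in the AMH form recovers the independence Copula C(u, v) = uv, which is a convenient second check.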

Construction of SVM-Copula stochastic model
Table note: N(·, ·) denotes the Normal distribution, L(·, ·) the Lognormal distribution, E(·, ·) the Exponential distribution, W(·, ·) the Weibull distribution, P(·, ·) the Pearson distribution and G(·, ·) the Gamma distribution. Q0.75 is the upper quartile, Q0.25 the lower quartile and Q the interquartile range, Q = Q0.75 − Q0.25; σ is the standard deviation; $v = \sigma / \bar{x}$ is the standard deviation coefficient, also called the discrete coefficient; Skew is the skewness and Kur the kurtosis.

The daily runoff sequence is a non-stationary hydrological time series: $x_t$ and $x_{t-1}$ have different marginal distributions, $F_t(x)$ and $F_{t-1}(x)$. Their joint distribution function can be expressed as

$$F(x_t, x_{t-1}) = C_\theta\big(F_t(x_t), F_{t-1}(x_{t-1})\big) = C_\theta(z_t, z_{t-1}) \tag{12}$$

where $z_t = F_t(x_t)$ and $z_{t-1} = F_{t-1}(x_{t-1})$ are the marginal distributions of $x_t$ and $x_{t-1}$. The conditional distribution function of $x_t$ with respect to $x_{t-1}$ is

$$F(x_t \mid x_{t-1}) = C_\theta(z_t \mid z_{t-1}) = \frac{\partial C_\theta(z_t, z_{t-1})}{\partial z_{t-1}} \tag{13}$$

Once the conditional distribution function of $x_t$ with respect to $x_{t-1}$ has been obtained, a random simulation of the daily runoff time series can be performed by the inverse function method, setting the conditional distribution equal to a generated random number and solving for $z_{t,i}$:

$$\varepsilon_{t,i} = C_{\theta_t}(z_{t,i} \mid z_{t-1,i}) \tag{14}$$

where t = 1, 2, ..., m, and m is the number of sections; i = 1, 2, ..., n, and n is the sequence length; $\varepsilon_{t,i}$ is a random number uniformly distributed on (0, 1). For the first section of the i-th year, the preceding section is the m-th section of the (i−1)-th year, $z_{m,i-1}$, and $\theta_1$ represents the relationship between the first section and the m-th section of the previous year. The marginal distribution function F(x) of the random variable can be determined by the SVM classifier. Combining Equation (13) and Equation (14) gives the general form of the SVM-Copula-based stochastic model of the daily runoff sequence:

$$\begin{cases} z_{t,i} = C_{\theta_t}^{-1}(\varepsilon_{t,i} \mid z_{t-1,i}) \\ x_{t,i} = F_t^{-1}(z_{t,i}) \end{cases} \tag{15}$$

where t = 1, 2, ..., m, i = 1, 2, ..., n.
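To make the inverse function step concrete, the sketch below draws z_t given z_{t−1} for an arbitrary bivariate Copula by (i) approximating the conditional distribution ∂C(u, v)/∂v with a central difference and (ii) solving "conditional distribution = ε" for u by bisection. This is a generic numerical illustration under stated assumptions, not the procedure used in the paper; the Clayton Copula is used only as a test case because its conditional inverse is known in closed form, and the step size and tolerances are arbitrary choices.

```python
def clayton(u, v, theta):
    """Clayton Copula, used here only as an example argument."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def conditional_cdf(copula, u, v, theta, h=1e-6):
    """Approximate C(u | v) = dC(u, v)/dv by a central difference in v."""
    v_lo = max(v - h, 1e-12)
    v_hi = min(v + h, 1.0 - 1e-12)
    return (copula(u, v_hi, theta) - copula(u, v_lo, theta)) / (v_hi - v_lo)

def sample_conditional(copula, eps, v, theta, tol=1e-10):
    """Solve C(u | v) = eps for u by bisection (C(u | v) increases with u)."""
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if conditional_cdf(copula, mid, v, theta) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `sample_conditional(clayton, eps, z_prev, theta)` with a uniform random `eps` plays the role of the first formula of the model; applying the inverse marginal distribution to the result gives the simulated runoff value.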

Parameter estimation
The parameters of the SVM-Copula stochastic model of the daily runoff series fall into two categories: the parameters of the marginal distribution functions and the parameters of the Copula functions. The former can generally be obtained by the moment method, the linear moment method or the maximum likelihood method. The latter are usually determined by the correlation index method, in which the parameter θ is calculated from its relationship with the Kendall rank correlation coefficient τ. The relationships between θ and τ are as follows.
GH Copula function:

$$\tau = 1 - \frac{1}{\theta} \tag{16}$$

Clayton Copula function:

$$\tau = \frac{\theta}{\theta + 2} \tag{17}$$

AMH Copula function:

$$\tau = 1 - \frac{2}{3\theta} - \frac{2}{3}\left(1 - \frac{1}{\theta}\right)^{2}\ln(1 - \theta) \tag{18}$$

Frank Copula function:

$$\tau = 1 + \frac{4}{\theta}\left[D_1(\theta) - 1\right], \quad D_1(\theta) = \frac{1}{\theta}\int_0^{\theta}\frac{t}{e^{t} - 1}\,dt \tag{19}$$
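The θ-τ relationships for the GH and Clayton Copulas invert in closed form (θ = 1/(1 − τ) and θ = 2τ/(1 − τ)), whereas the Frank relationship involves the first-order Debye function and has to be inverted numerically. The sketch below does this with a midpoint-rule integral and bisection. It assumes positive dependence (0 < τ < 1); the search bounds and step counts are arbitrary choices for the example, and the AMH case is omitted.

```python
import math

def theta_gumbel(tau):
    """Invert tau = 1 - 1/theta."""
    return 1.0 / (1.0 - tau)

def theta_clayton(tau):
    """Invert tau = theta / (theta + 2)."""
    return 2.0 * tau / (1.0 - tau)

def debye1(theta, steps=2000):
    """D1(theta) = (1/theta) * integral_0^theta t / (e^t - 1) dt (midpoint rule)."""
    h = theta / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += t / math.expm1(t)
    return total * h / theta

def tau_frank(theta):
    """Kendall's tau of the Frank Copula for theta > 0."""
    return 1.0 + 4.0 * (debye1(theta) - 1.0) / theta

def theta_frank(tau, lo=1e-6, hi=700.0, tol=1e-8):
    """Solve tau_frank(theta) = tau by bisection, assuming 0 < tau < 1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tau_frank(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The upper bound of 700 keeps `math.expm1` within floating-point range; a τ value too close to 1 to be reached by that bound would simply return a θ near 700.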

Model algorithm
The algorithm of the SVM-Copula model consists of two parts: the marginal distributions, which are determined by the SVM classifier, and the conditional distribution functions, which are calculated from the Copula functions. On this basis, the daily runoff simulation sequence can be obtained by the inverse function method. The simulation steps of the SVM-Copula daily runoff stochastic model are:
Step 1. Determine the marginal distribution. Identify the marginal distribution of each day's runoff with the SVM classifier, and determine the marginal distribution function $F_t(x)$ and its parameters $\mu_t$, $\sigma_t$.
Step 2. Compute the joint distribution. Choose an appropriate Copula function by analysing the measured daily runoff sequence, determine the parameter $\theta_t$, solve Equation (13) to get the conditional distribution function of $x_t$ with respect to $x_{t-1}$, and substitute the result into Equation (14) to determine $z_{t,i}$ and hence $x_{t,i}$.
Step 3. Generate the initial value. Generate a random number $\varepsilon_{1,1}$ uniformly distributed on (0, 1), substitute it into the first formula of Equation (15) to get $z_{1,1}$, and then solve the second formula of Equation (15) to get the simulated value of the first section of the first year, $x_{1,1}$.
Step 4. Simulate the random runoff sequence of the first year. Substitute $z_{1,1}$ and a random number $\varepsilon_{2,1}$ into Equation (15) to get $z_{2,1}$ and the simulated value $x_{2,1}$ of the second section of the first year. Continue in the same way to get the remaining section values $z_{3,1}, z_{4,1}, \ldots, z_{m,1}$ and the simulated values $x_{3,1}, x_{4,1}, \ldots, x_{m,1}$.
Step 5. Simulate the random runoff sequence of n years. Substitute $z_{m,1}$ and a random value $\varepsilon_{1,2}$ into Equation (15) to get $z_{1,2}$ and $x_{1,2}$, then repeat Step 4 to get the simulated values of the m sections of the second year. Continue in the same way to obtain the simulated daily runoff sequence of n years.

Table note: the quartile entries (including Q0.75) are daily runoff data that have been rescaled to the interval (0, 1). σ is the standard deviation; v is the standard deviation coefficient, also called the discrete coefficient; Skew is the skewness; Kur is the kurtosis.
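The five steps above amount to a nested loop over years and sections in which the last section of one year feeds the first section of the next. A minimal sketch of that control flow is given below. The callables `inv_conditional` and `inv_marginal` are placeholders for the inverse conditional Copula and the inverse marginal distribution, and taking z_{1,1} equal to the first uniform random number is an assumption made here, since the very first value has no predecessor.

```python
import random

def simulate_runoff(n_years, m, inv_conditional, inv_marginal, theta, rng=None):
    """Generate n_years sequences of m section values.

    inv_conditional(eps, z_prev, theta_t) -> z_t   (first formula of the model)
    inv_marginal(t, z) -> x_t                      (second formula of the model)
    theta[t] links section t (0-based) to the section before it.
    """
    rng = rng or random.Random()
    z_prev = None
    series = []
    for _ in range(n_years):
        year = []
        for t in range(m):
            eps = rng.random()
            # Assumption: the very first value is drawn unconditionally.
            z = eps if z_prev is None else inv_conditional(eps, z_prev, theta[t])
            z = min(max(z, 1e-12), 1.0 - 1e-12)  # keep z strictly inside (0, 1)
            year.append(inv_marginal(t, z))
            z_prev = z  # the last section of this year carries into the next
        series.append(year)
    return series
```

Passing `inv_conditional=lambda eps, z_prev, th: eps` reproduces independent days, which is a useful baseline when checking that a chosen Copula actually introduces day-to-day dependence.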

Application study
The measured runoff data of the Hankou Hydrological Station from 2003 to 2015 were used to test the feasibility and effectiveness of the SVM-Copula stochastic model.

Model establishment
Based on the measured runoff data of Hankou Station from 2003 to 2015, the characteristic parameters were extracted from the daily runoff time series; the characteristic parameters of each day constitute the test sample set. Some of the test samples are shown in Table 3. Distribution pattern recognition was performed with the one-to-one SVM multi-classifier constructed above, and the voting result is shown in Table 4.
According to Table 4, the Lognormal distribution receives the most votes, so the daily runoff is considered to follow the Lognormal distribution L(μ, σ), whose probability density function (PDF) is

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right], \quad x > 0 \tag{20}$$

Its cumulative distribution function (CDF) is

$$F(x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right) \tag{21}$$

where Φ is the standard normal CDF and the parameters μ and σ are the mean and standard deviation of ln(x), respectively. The parameters of the marginal distributions differ from day to day. Some of the Lognormal distribution parameters are shown in Table 5.
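The Lognormal PDF, CDF and quantile function, together with a simple estimate of μ and σ from the logarithms of a sample, can be written with the Python standard library as sketched below. Using the sample (n − 1) standard deviation of ln(x) is an assumption for the example; the paper lists the moment, linear moment and maximum likelihood methods without saying which was applied here.

```python
import math
from statistics import NormalDist, fmean, stdev

def fit_lognormal(sample):
    """Estimate mu and sigma as the mean and standard deviation of ln(x)."""
    logs = [math.log(x) for x in sample]
    return fmean(logs), stdev(logs)

def lognormal_pdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def lognormal_cdf(x, mu, sigma):
    return NormalDist().cdf((math.log(x) - mu) / sigma)

def lognormal_quantile(p, mu, sigma):
    """Inverse CDF: the runoff value whose non-exceedance probability is p."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))
```

In the simulation, `lognormal_cdf` maps a runoff value to z in (0, 1) and `lognormal_quantile` maps a simulated z back to a runoff value, with a separate (μ, σ) pair for each day of the year.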
Previous work indicates that the Frank Copula function simulates runoff sequences better than other Copula functions (Jeong & Lee, 2015). In addition, some of the Kendall rank correlation coefficients τ of the daily runoff data are shown in Table 6; the rank correlation coefficients between the runoff of adjacent days are large. Therefore, the Frank Copula function, whose form is given in Equation (11), is selected as the coupling function of the model, and the relationship between θ and τ is given in Equation (19). From this relationship between the Frank Copula parameter θ and the Kendall rank correlation coefficient τ, the Copula parameters $\theta_t$ of the model can be obtained. Some of the parameters are shown in Table 7.
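According to Equation (13), the conditional distribution function of $x_t$ is

$$F(x_t \mid x_{t-1}) = \frac{e^{-\theta_t z_{t-1}}\left(e^{-\theta_t z_t} - 1\right)}{\left(e^{-\theta_t} - 1\right) + \left(e^{-\theta_t z_t} - 1\right)\left(e^{-\theta_t z_{t-1}} - 1\right)} \tag{22}$$

where $z_t = F_t(x_t)$ and $z_{t-1} = F_{t-1}(x_{t-1})$. Combining Equation (22), Equation (11) and Equation (21), we can establish a stochastic model of the daily runoff based on the SVM-Copula function:

$$\begin{cases} z_{t,i} = -\dfrac{1}{\theta_t}\ln\left[1 + \dfrac{\varepsilon_{t,i}\left(e^{-\theta_t} - 1\right)}{\varepsilon_{t,i} + (1 - \varepsilon_{t,i})\,e^{-\theta_t F_{t-1}(x_{t-1,i})}}\right] \\[2ex] x_{t,i} = \exp\left[\mu_t + \sigma_t\,\Phi^{-1}(z_{t,i})\right] \end{cases} \tag{23}$$

where t = 1, 2, ..., 365 and $\Phi^{-1}$ is the inverse of the standard normal CDF.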
In the above formula, $\varepsilon_{t,i}$ is a random number uniformly distributed on (0, 1). Note that when t = 1, $x_{t-1}$ and $F_{t-1}(x_{t-1,i})$ denote the daily runoff of the 365th day and its cumulative distribution function, respectively.
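For the Frank Copula the conditional distribution can be inverted in closed form, so one simulation step needs no root finding. The sketch below implements the conditional distribution ∂C(u, v)/∂v of the Frank Copula, its algebraic inverse, and a single step from yesterday's runoff to today's under Lognormal marginals. The function names and the numerical values in the usage note are illustrative assumptions, not parameters from Tables 5 to 7. For very large θ the expressions may lose floating-point precision.

```python
import math
from statistics import NormalDist

def frank_conditional(u, v, theta):
    """C(u | v) = dC(u, v)/dv for the Frank Copula."""
    a = math.expm1(-theta * u)
    b = math.expm1(-theta * v)
    g = math.expm1(-theta)
    return a * math.exp(-theta * v) / (g + a * b)

def frank_inverse_conditional(eps, v, theta):
    """Solve C(u | v) = eps for u in closed form."""
    e = math.exp(-theta * v)
    a = eps * math.expm1(-theta) / (e * (1.0 - eps) + eps)
    return -math.log1p(a) / theta

def next_runoff(x_prev, eps, theta_t, mu_prev, sigma_prev, mu_t, sigma_t):
    """One simulation step: yesterday's runoff and a uniform eps give today's runoff."""
    z_prev = NormalDist(mu_prev, sigma_prev).cdf(math.log(x_prev))  # Lognormal CDF
    z_t = frank_inverse_conditional(eps, z_prev, theta_t)
    return math.exp(NormalDist(mu_t, sigma_t).inv_cdf(z_t))        # Lognormal quantile
```

For example, `next_runoff(20000.0, 0.5, 8.0, 9.9, 0.3, 9.9, 0.3)` advances a hypothetical flow of 20000 by one day using made-up parameters; as θ approaches 0 the inverse reduces to u ≈ ε, that is, independent days.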

Model applicability test
To test the applicability of the model, the long-sequence method was used. Random simulations of daily runoff were run with the established model, yielding 10000 simulated sequences. The mean values, mean square errors, coefficients of variation and Kendall rank correlation coefficients of the simulated and measured values were calculated to verify whether the simulated data maintain the characteristics of the measured data. Partial results are shown in Table 8 and Table 9.
As the tables show, the errors between the measured and simulated mean values are relatively small, which means that the simulated data maintain the magnitude of the measured data well. The small differences in mean square error and the small relative errors in the coefficient of variation between the simulated and measured data indicate that the model identifies the marginal distribution pattern of daily runoff accurately and maintains the dispersion of the measured data. The error in the rank correlation coefficient between adjacent days is also quite small, which indicates that the simulated daily runoff sequence maintains the autocorrelation characteristics of the measured runoff sequence.
Overall, the comparison shows that the values simulated by the SVM-Copula daily runoff stochastic model maintain the statistical characteristics of the measured daily runoff data, so the model can be used for the random simulation of daily runoff.
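The comparison statistics used in this kind of check can be computed as in the sketch below: Kendall's rank correlation between the runoff of two adjacent days across years, and the relative error between a simulated and a measured statistic. The tie-free "tau-a" form of Kendall's coefficient is assumed here, and the function names are chosen for the example; the paper does not state which variant was used.

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a): concordant minus discordant pairs, normalized."""
    n = len(xs)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return 2.0 * s / (n * (n - 1))

def relative_error(simulated, measured):
    """Absolute relative error of a simulated statistic against the measured one."""
    return abs(simulated - measured) / abs(measured)
```

Here `xs` would hold the day t−1 runoff of each year and `ys` the day t runoff of the same years; the coefficient is computed once on the measured record and once on the simulated sequences and the two are compared with `relative_error`.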

Conclusion
A correct marginal distribution function of daily runoff and an appropriate correlation structure between the runoff of adjacent days are the keys to the random simulation of daily runoff series. In this paper, an SVM multi-classifier was constructed to identify the marginal distribution pattern of each day's runoff automatically, and its distribution function was determined. The correlation structure between the runoff of adjacent days was established with the Frank Copula function, and the SVM-Copula stochastic model of daily runoff was then constructed. A case study showed that the values simulated by the SVM-Copula stochastic model maintain the statistical characteristics of the measured daily runoff series and that the model is suitable for the stochastic simulation of daily runoff.