Solution for a time-series AR model based on robust TLS estimation

Abstract We discuss an algorithm for the autoregression (AR) model as a typical time-series model. By analyzing the structure of the AR model, we highlight the shortcomings of traditional algorithms for model parameter estimation and propose an approach to overcome the shortcomings of traditional solutions for the AR model. The errors-in-variables (EIV) model is used to solve an existing problem. There is an obvious difference between the AR model and a traditional model. For the AR model, one observation datum appears repeatedly; hence, the residual of one observation datum is not unique. Furthermore, in theory, the optimal estimation of model parameters cannot be obtained by current solutions. Based on an analysis of the AR model, we focus on how to obtain the optimal estimation when the observation data of the AR model are contaminated by outliers. The median function is used to establish a modified solution for the AR model based on the institute of geodesy & geophysics, Chinese academy of sciences (IGG) weight function by comparison with current algorithms for the robust estimation of the EIV model. We propose an iterative algorithm based on median function. Finally, we apply two numerical instances to compare the proposed algorithm with traditional algorithms and draw conclusions from results of the instances.


Introduction
The time-series model is widely used in the relative fields of prediction, such as solar irradiance prediction, ground subsidence prediction and weather prediction (Zhangzhen and Tianhe 2012;Hirata and Aihara 2017;do Nascimento Camelo et al. 2018;Tuncel and Baydogan 2018). The autoregression (AR) model is a typical timeseries model, and the theory based on AR model is developed. Time dimensions are taken into account when the variables described by AR model are related to time (Pfeifer and Deutsch 1980;Pfeifer and Ve Bodily 1990); when stochastic character of AR model has an influence on estimation model parameters, this factor is also investigated (Di Giacinto 2006); besides, AR model with Kalman filter is proposed for non-linear estimators (Kurt and Batu Tunay 2015). As a mathematical fitting model, the precision of the AR model's parameter estimation determines the efficiency of a prediction (Wan and Xiao 2016;Datteo et al. 2018). Currently, traditional algorithms for parameter estimation for the AR model involve transforming the AR model into the Gauss-Markov model (G-M model) and computing the model parameters based on least-squares (LS) estimation. The computation process is expressed as follows: where v is the error vector that exists in the observation vector (l), which is composed of observation data; v follows a normal distribution. B is the coefficient matrix that expresses the relationship between the observation vector (l) and model parameters vector (x). In a stochastic model, r 2 0 is the variance component of the unit weight, and its estimation is written asr 2 0 . P is the weight matrix. Based on LS(v T Pv = min), the model parameters (x) can be estimated, and the estimation is written aŝ x. D(x) denotes the covariance matrix, whereas r denotes redundancy.
A traditional solution for model parameter estimation only takes into account the observation error in the observation vector by implementing least constraints on it; however, as highlighted by Adcock (1877), the shortcomings of a traditional solution are that it ignores the error that exists in the coefficient matrix, and the elements of the coefficient matrix are also composed of observation data. To take into account errors that exist in the observation vector and coefficient matrix simultaneously, the G-M model was developed, and a more scientific model called the errors-in-variables model (EIV model; Adcock 1877) was proposed and written as follows: where E B is the error matrix that exists in the coefficient matrix (B), b = vec(E B )(vec(Á) represents the vector that stacks one column of E B underneath the previous one, and P B denotes the weight matrix of the coefficient matrix. To take into account both observation errors in the observation vector and coefficient matrix, total least squares (TLS) is proposed to extend the traditional criterion of LS: Researchers have discussed optimal solutions for the EIV model (Golub and Van Loan 1980;Schaffrin and Wieser 2008;Mahboub 2014;Ai-Sharadqah and Ho 2018), and it is extensively used in relative fields of geoscience (Schaffrin and Uzun 2011;Mahboub 2012;Wang et al. 2016;Leyang et al. 2017).
In this article, we introduce the EIV model into time-series prediction, in addition to discussing a solution for AR model parameter estimation based on TLS. Particularly, the optimal algorithm for robust estimation is discussed when the observation data in the AR model are contaminated by outliers. This remainder of this article is organized as follows: in Section 2, the AR model is shown in the form of the EIV model, and the difference between the AR model and traditional EIV model is analyzed. In Section 3, traditional solutions for AR model parameter estimation are presented first. Then, a modified solution for AR model parameter estimation while outliers exist in the observation data is proposed, where the proposed solution is based on robust estimation and the median function. In Section 4, the feasibility and efficiency of the proposed solution are proved using two numerical instances.

EIV model and its solution
Suppose that the quantity of model parameters is m, such that ða 1 ; a 2 ; :::; a m Þ, and the quantity of observation data is m þ n, such that ðh 1 ; h 2 ; :::; h mþn Þ. Then, the AR model can be expressed as follows: The AR model can be rewritten using the G-M model as follows: where ðv mþ1 ; v mþ2 ; :::; v mþn Þ is an error vector that exists in the observation vector (l), which is denoted as v in the equation. If we transpose the observation vector from the left-hand side to the right-hand side of Equation (4), then the standard expression of the G-M model can be obtained as v = Bxl, and the model parameters can be obtained using the method presented in the previous section. Note that elements of the coefficient matrix also consist of observation data; hence, there are observation errors in the coefficient matrix. If we define E B as the error matrix that exists in the coefficient matrix, then Equation (4) can be rewritten using the EIV model as v = (B þ E B )x -l. To obtain the optimal estimation of the EIV model parameters, the criterion of TLS is defined as Equation (2). Many solutions have been established for the EIV model, as cited previously. Typical solutions that can take into account the stochastic character of the model were proposed (Schaffrin and Wieser 2008;Mahboub 2014;Fang et al. 2017). In this article, we choose the iterative solution proposed by Schaffrin and Wieser (2008) as an example: wherex,v andÊ B are estimations of the relative parameters; Q, Q 0 and Q b are cofactor matrices; and P ¼ Q À 1 and P B ¼ Q À 1 0 Q À 1 b , where represents the symbol of the Kronecker-Zehfuss product.^is the estimation of Lagrange multipliers k,d ¼k T Q bk . The detailed iteration can be found in the seminal report of Schaffrin and Wieser (2008).
Additionally, by Equation (4), an obvious characteristic of the AR model can be determined such that the same element appears in the model repeatedly: the quantity of one element in vector ( h 1 ; h 2 ; Á Á Á ; h m ) that appears in the AR model is (1, 2, … , m), respectively, whereas ( h mþ1 ; h mþ2 ; Á Á Á ; h mþn ) is (n, … ,1), respectively. Thus, although the AR model belongs to the catalogue of EIV models, solutions for the traditional EIV model may not be suitable for the AR model because, in a mathematical model, one observation datum should obtain a unique residual or the same residual; however, current solutions cannot ensure that the observation datum meets this requirement, which results from the condition that one observation datum appears in the model repeatedly. Furthermore, this fact has an influence on the optimal algorithm, particularly for robust estimation, which is the main problem addressed in the following discussion.

Analysis of traditional robust estimation for the AR model
According to the previous analysis, to take into account the observation error in the observation vector and coefficient matrix, the EIV model is considered to be a rational solution for the parameter estimation of the AR model. However, the factor that one observation datum obtains different residuals because of its repeated appearance cannot be overcome by current solutions. Such a phenomenon has an influence on, for example, parameter estimation and precision statistics, particularly for robust estimation. In fact, the observation data are contaminated by an observation error unavoidably. When the error belongs to the stochastic error, criteria such as LS and maximum likelihood estimation are used to obtain the optimal estimation of model parameters. In the intervening time, when the observation data are contaminated by outliers, robust estimation is used to overcome the influence that outliers have on parameter estimation. As is well known, to overcome the influence of outliers that exist in the observation data, the method of robust estimation uses a weight function, such as the Huber weight function (Huber 1964) and IGG weight function (Yang 1999), to modify the weight of the observation data. Considering the IGG weight function, for example, where P i is the initial weight value of the observation data; jv i j is the absolute value of the residual for the observation data; a and b denote the threshold value; and their values are often a ¼ 1:5 r and b ¼ 2:5 r, wherer is a variance component.
Robust estimation uses the weight function to modify the weight value of the observation data. Generally, poorly accurate observation data that are assessed by the residual obtain a smaller weight value, and its influence on parameter estimation can be modified properly. In this process, observation data precision is assessed by the residual. The IGG weight function has three segments: I. If the residual is smaller than or equal to threshold value a, then the weight of observation data remains invariant. II. If the residual is greater than threshold value a but smaller than or equal to threshold value b, then the weight of the observation data should be modified and becomes smaller according to the formula. III. If the residual is greater than threshold value b, then zero is assigned to the weight of the observation data.
For the EIV model, the residuals of the observation vector and coefficient matrix are obtained by current solutions; however, as shown in the previous analysis, solutions for the AR model based on TLS make one observation datum obtain different residuals, which could lead to failure in the utilization weight function for robust estimation.

Iterative algorithm for the AR model
According to the analysis in the previous section, the AR model is different from the traditional EIV model. The obvious characteristic is that one observation datum appears in the model repeatedly. This phenomenon causes the observation datum to obtain different residuals. As cited in Equation (4), one element in vector ( h 1 ; h 2 ; Á Á Á ; h m ) appears in the equation (1, 2, … , m) times, respectively, whereas (h mþ1 , h mþ2 , … , h mþn ) is (n, … ,1), respectively. Therefore, if we define D i (i ¼ 1, 2, … , m þ n), which represents the observation error in the observation data, then the observation error that exists in the observation vector (h 1 , … , h m , h mþ1 , … , h mþn ) can be expressed as (D 1 ; :::; D m ; D mþ1 ; :::; D mþn ), and the elements of error vector D i are as follows: To obtain a unique residual for the observation data, we use the median function (med[]): The unique observation error of the observation data can be obtained using the median function. Additionally, the median function is robustly efficient; in seminal reports (Rousseeuw and Wagner 1994), conclusions have indicated that when the contamination rate of outliers that exist in the observation data is not more than 50%, solutions based on the median function can obtain a stable estimation for model parameters. Furthermore, we use the median function to obtain the variance component, which is a vital parameter for the weight function and robust estimation: whereas the variance component is typically computed as follows: where P mþn is the weight matrix of the observation data. Note that the estimation of the variance component is biased by the current solutions (Shen et al. 2011). The efficiency of above two methods for computing the variance component will be compared by instance in the next section. An iterative algorithm for robust TLS estimation is established for the AR model using the median function: Step 1. Calculate the initial values of the AR model parameters using algorithms based on the TLS criteria: Step 2. Calculate the residuals of the observation data in the coefficient matrix and observation vector, respectively:v Step 3. Calculate the unique residual of each observation datum based on Equations (7) and (8), which was proposed previously. The unique residuals of all observation data are listed as (v 1 , v 2 , … , v mþn ).
Step 4. Calculate the variance component (r) using Equation (9) and redefine the weight based on the weight function. The IGG weight function is used in this article: Note that the threshold value and residual estimation (v i ) of the weight function should be calculated based onr and (v 1 , v 2 , … , v mþn ), which were proposed previously.
Step 5. Calculate the model parameters and iteration from Step 2 to Step 4 until the estimation converges.
This algorithm is called robust M-TLS in this article.

Numerical examples and discussion
In this section, two numerical examples are used to verify the proposed robust TLS estimation for the AR model. The aims of these examples are as follows: (I) The first instance uses real observation data, and we compare the differences of solutions for the AR model between LS and TLS. The model for which both the coefficient matrix and observation vector have observation errors is computed using LS and TLS, respectively. The results based on the two solutions are compared with each other.
(II) The second example uses simulated data, and we insert simulated outliers into the EIV model. The traditional algorithm for robust LS is used to calculate the model parameters. Meanwhile, both the robust TLS algorithm and robust M-TLS algorithm are also used to calculate the model parameters. The results are compared with each other to verify the efficiency of the different algorithms.
In the instances, three parameters are used for the AR model, which can be written as follows: Observation data of first instance are listed in Table 1. Before proceeding to the experiment, stationary of observation data has been analyzed, autocorrelation function of the observation data is calculated and it is presented in Figure 1.
In the second example, values of (a 1 , a 2 , a 3 ) are simulated as (0.6, À0.4, À0.2), and the observation data calculated using the simulated parameters are listed in Table  3. The previous 25 periods' observation data are used to calculate the model parameters, and other observation data are used to check the precision of the parameters. To achieve the aforementioned aims, a random error that follows N(0, 1) is inserted into the previous 25 periods' observation data, which is shown in Figure 2.   The observation data of  which are used to calculate model parameters are inserted into the random error, and their values are listed in Table 4. Additionally, outliers are inserted into the observation data of No. 4,No. 10,No. 18,and No. 25, which are denoted in red in Table 4.
Four observation data are inserted into outliers. Note that, for the AR model, if LS is used to calculate the model parameters, then the contamination rate of outliers is 4/22 (approximately 18.2%). Additionally, if TLS is used to calculate the model parameters, then the contamination rate of the outliers is 13/88 (approximately 14.8%). It appears that, for the AR model, when the same observation data are contaminated by outliers, the contamination rate based on TLS is smaller than that for LS.
When the observation data are contaminated by outliers, then we use three types of robust algorithms, that is robust LS, robust TLS and robust M-TLS proposed in this article to compute the model parameters. In the calculation, the observation data are supposed to be independent, and their initial weight is defined as unit one. The results of the three algorithms are listed in Table 5.
We use the estimation of model parameters to compute the residual of the observation vector. For comparison, only the residual of the observation vector that has observation elements composed of No. 4-No. 25 is calculated. The residual is listed in Figure 3.
From the results of the three algorithms, robust LS achieved poor accuracy. In fact, it was difficult for the parameter estimation to converge. In the iteration process,  Observation data in red denote that the data are contaminated by outliers.
because an irrational threshold value of the weight function was obtained and its value was larger than the residual, the algorithm failed in terms of robustness. The traditional robust TLS was more accurate than LS; however, in the process of modifying the stochastic model, each observation datum in the model had a different residual; in this instance, all the observation data No. 4-No. 22 had four residuals. This factor led one observation datum to have different weights. Therefore, although the traditional robust TLS algorithm for the AR model was more precise, its solution was unscientific. Meanwhile, the robust M-TLS proposed in this article overcame the shortcomings of the traditional algorithm, and it was more efficient and precise.

Conclusions
A modified solution for parameter estimation for the AR model was proposed in this article. To overcome the shortcomings of traditional solutions based on LS that cannot take into account the observation error in the coefficient matrix, EIV estimation for the AR model was introduced, and by analyzing the current algorithm of robust TLS, the current algorithm based on robust TLS also had unavoidable shortcomings for the AR model. It could not obtain a unique residual for one observation datum; hence, the estimation of the threshold value of the weight function was not scientific. The median function was used to determine a unique residual, and a more rational method for calculating the threshold value was proposed. An iterative algorithm was established using the median function and IGG weight function for the AR model based on EIV estimation. By comparing the proposed algorithm with current algorithms using two examples, the feasibility and efficiency of the proposed algorithm were proved. However, the proposed algorithm also had the shortcoming of increasing the amount of calculation, which needs to be further investigated.  Figure 3. Residual of the observation vector using the three algorithms.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by the National Natural Science Foundation of China (41601501, 41804029), Natural Science Fund for Colleges and Universities of Jiangsu Province (16KJD420001) and Huaian Key Laboratory of Geographic Information Technology and Applications (HAP201405).