Nonlinear predictive model selection and model averaging using information criteria

ABSTRACT This paper is concerned with model selection and model averaging problems in system identification and data-driven modelling for nonlinear systems. Given a set of data, the objective of model selection is to evaluate a series of candidate models and determine which one best represents the data. Three commonly used criteria, namely the Akaike information criterion, the Bayesian information criterion and an adjustable prediction error sum of squares (APRESS), are investigated, and their performance in model selection and model averaging is evaluated via a number of case studies using both simulation and real data. The results show that APRESS produces better models in terms of both generalization performance and model complexity.


Introduction
Model selection plays a fundamental role in choosing the best model from a series of candidate models in data-driven modelling and system identification problems. In general, system identification and data-driven modelling consist of several important steps, including data collection, data processing, selection of representation functions, model structure selection, model validation and model refinement (Preacher & Merkle, 2012; Solares, Wei, & Billings, 2017; Söderström & Stoica, 1989).
Among various model selection methods, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are the two most popular measures. Since AIC was first proposed in 1974 (Akaike, 1974), many variations of it have been developed for model selection. For example, the second-order Akaike information criterion (AICc) was developed for small-sample-size data modelling problems in 1989 (Brockwell & Davis, 1991; Hurvich & Tsai, 1989); the AIC was shown to approximately estimate the Kullback-Leibler information of models in 1998 (Akaike, 1998); and the delta AIC and the Akaike weights were introduced to measure how much better the best model is when compared with the other models. In the model selection process, the AIC, delta AIC and AIC weights are calculated for each candidate model. Usually, the 'best' model is chosen to be the model with the smallest AIC; the delta AIC is the difference between the AIC of each model and the smallest AIC, that of the 'best' model (Symonds & Moussalli, 2011); the AIC weight ranges from 0 to 1 and is analogous to the probability that a candidate model is the best choice (Buckland, Burnham, & Augustin, 1997). Drawing on these ideas, several model averaging approaches have also been developed, for example, the natural averaging method (Buckland et al., 1997) and the full model averaging method (Lukacs, Burnham, & Anderson, 2010). Over the past few decades, AIC and its variations have been used to solve a wide range of model selection problems, including those in ecology (Johnson & Omland, 2004) and phylogenetics (Posada & Buckley, 2004), among others. Another commonly used model selection criterion is BIC, which was proposed by Schwarz in 1978 (Schwarz, 1978). It is also referred to as the Schwarz information criterion, or the Schwarz BIC.
Similar to AIC, BIC is calculated for each candidate model and the model with the smallest BIC is chosen as the best model (Kass & Raftery, 1995). The only difference between AIC and BIC is that BIC places a larger penalty on the increment of model terms. In recent years, BIC has also been increasingly used as a model selection criterion (Cobos et al., 2014; Hooten & Hobbs, 2015; Vrieze, 2012; Watanabe, 2013). Based on an investigation of the vast literature on applications and comparative studies of the two criteria (e.g. see Aho, Derryberry, & Peterson, 2014; Burnham & Anderson, 2004; Burnham, Anderson, & Huyvaert, 2011; Chaurasia & Harel, 2013; Claeskens & Hjort, 2008; Johnson & Omland, 2004; Kuha, 2004; Medel & Salgado, 2013; Posada & Buckley, 2004; Vrieze, 2012), it can be noted that both AIC and BIC have their own advantages and limitations. It cannot be guaranteed that one is better than the other regardless of the application scenario, because the data, the model type and other aspects of the modelling problem can be decisive in determining which of the criteria is more suitable.
Both AIC and BIC have been widely applied to model selection problems. However, there is still considerable room for improvement. For example, there is little evidence that the two criteria also work well for complex nonlinear system identification problems. Although AIC and BIC can usually produce good model selection results under the assumption that the 'true' model is among the candidate models, they may fail to select the best model when the system is very complex and none of the candidate models can sufficiently represent the data. These situations often occur when the model structure or other prior information is unknown. To address the model selection problem in nonlinear system identification, the cross-validation (CV) based criterion (Stone, 1974) and two of its variations, leave-one-out (LOO), also called the predicted residual sum of squares (PRESS) (Allen, 1974; Chen, Hong, Harris, & Sharkey, 2004; Hong, Sharkey, & Warwick, 2003), and generalized cross-validation (GCV) (Golub, Heath, & Wahba, 1979), were developed. More recently, a modified GCV criterion, also known as the adjustable prediction error sum of squares (APRESS), was proposed for nonlinear system identification.
Based on the above considerations, it is essential to investigate AIC, BIC and APRESS to determine which works best for model selection in nonlinear system identification and data-driven modelling problems. In this study, case studies using simulation and real data were carried out, and the three criteria were used to select the best model from a set of candidate models. The prediction performances of the models selected by the three criteria were evaluated and compared to find out which method gives the better model selection results. In addition, a model averaging approach is developed based on the full model averaging method to improve model robustness.
The paper is organized as follows. The nonlinear autoregressive moving average with exogenous input (NARMAX) model and orthogonal forward regression (OFR) algorithm are briefly reviewed in Section 2. Section 3 introduces the model selection and averaging methods using AIC, BIC and APRESS. In Section 4, case studies are given to illustrate the performances of these methods. The paper is concluded in Section 5.

NARMAX model and OFR algorithm
In this study, the candidate models take the NARMAX model structure, which can be described as

y(k) = F[y(k−1), ..., y(k−n_y), u(k−1), ..., u(k−n_u), e(k−1), ..., e(k−n_e)] + e(k),

where y(k) and u(k) are the system output and input signals; e(k) is a noise component with zero mean and finite variance (the noise can be assumed to be white Gaussian in many applications); n_y, n_u and n_e are the maximum lags for the system output, input and noise; and F[·] is some nonlinear function. A polynomial NARX model can be written in the following linear-in-the-parameters form:

y(k) = Σ_{m=1}^{M} θ_m ϕ_m(k) + e(k),

where ϕ_m(k) = ϕ_m(ϑ(k)) are the model terms generated from the regressor vector ϑ(k) = [y(k−1), ..., y(k−n_y), u(k−1), ..., u(k−n_u)]^T, θ_m are the unknown parameters, and M is the number of candidate model terms.
The OFR algorithm is briefly introduced as follows. Let y = [y(1), ..., y(N)]^T be the vector of measured outputs at N time instants and ϕ_m = [ϕ_m(1), ..., ϕ_m(N)]^T be the vector formed by the m-th model term (m = 1, 2, ..., M). Let D = {δ_j : 1 ≤ j ≤ M} be the model term dictionary; the objective of the OFR algorithm is to find an n-term subset D_n = {δ_{l_1}, ..., δ_{l_n}} so that y can be explained by a linear combination of the selected terms. For the full dictionary D, the ERR index of each candidate model term can be calculated as

ERR(i) = (y^T δ_i)² / [(y^T y)(δ_i^T δ_i)], i = 1, 2, ..., M.

The first model term is then identified as the one maximizing the ERR index, l_1 = arg max_i {ERR(i)}; the first significant model term of the subset is ϕ_{l_1}, and the first associated orthogonal variable is q_1 = δ_{l_1}. Let r_0 = y and set

r_1 = r_0 − [(r_0^T q_1)/(q_1^T q_1)] q_1.

After removal of ϕ_{l_1} from D, the dictionary is reduced to a sub-dictionary D_{M−1}, whose elements are orthogonalized with respect to q_1. The remaining model terms can then be identified step by step using the ERR index of the orthogonalized sub-dictionaries D_{M−s+1}: at step s, each remaining term δ_i is orthogonalized as

q_i^(s) = δ_i − Σ_{j=1}^{s−1} [(δ_i^T q_j)/(q_j^T q_j)] q_j,

and its ERR index is computed with q_i^(s) in place of δ_i. The s-th significant model term of the subset is selected as ϕ_{l_s}, with associated orthogonal variable q_s = q_{l_s}^(s), and the residual is updated as

r_s = r_{s−1} − [(r_{s−1}^T q_s)/(q_s^T q_s)] q_s.

Recursively, the significant model terms of the subset {δ_{l_1}, ..., δ_{l_n}} are identified step by step. Summing the residual update for s from 1 to n yields

‖y‖² = Σ_{s=1}^{n} (r_{s−1}^T q_s)²/(q_s^T q_s) + ‖r_n‖².

The quantity ‖r_n‖² is called the residual sum of squares, or sum squared error. The mean square error (MSE) of the model can be calculated as ‖r_n‖²/N, which can be used to form model selection criteria such as AIC, BIC and APRESS.
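The selection loop described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes plain ERR-based selection with classical Gram-Schmidt orthogonalization, and the names `ofr_err`, `Phi` and `n_terms` are illustrative.

```python
import numpy as np

def ofr_err(y, Phi, n_terms):
    """Sketch of OFR term selection with the ERR index.

    y: (N,) output vector; Phi: (N, M) dictionary matrix whose columns are
    the candidate terms delta_i; n_terms: number of terms to select.
    """
    y = np.asarray(y, dtype=float)
    r = y.copy()                       # residual r_0 = y
    sigma = float(y @ y)               # y^T y, used in every ERR value
    Q = []                             # selected orthogonal variables q_s
    selected, err = [], []
    remaining = list(range(Phi.shape[1]))
    for _ in range(n_terms):
        best_i, best_err, best_q = None, -1.0, None
        for i in remaining:
            q = Phi[:, i].astype(float).copy()
            for qj in Q:               # Gram-Schmidt against selected terms
                q -= (Phi[:, i] @ qj) / (qj @ qj) * qj
            qq = q @ q
            if qq < 1e-12:             # term (almost) in the selected span
                continue
            e = (y @ q) ** 2 / (sigma * qq)   # ERR of candidate i at this step
            if e > best_err:
                best_i, best_err, best_q = i, e, q
        selected.append(best_i)
        err.append(best_err)
        Q.append(best_q)
        r -= (r @ best_q) / (best_q @ best_q) * best_q   # residual update
        remaining.remove(best_i)
    return selected, err, r            # r @ r is the residual sum of squares
```

By the energy decomposition above, the returned ERR values sum to the fraction of ‖y‖² explained by the selected terms, so 1 − sum(err) is the normalized residual sum of squares.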

Model selection and model averaging methods for nonlinear modelling
This section introduces model selection and averaging approaches based on AIC, BIC and APRESS.

Model selection with AIC, BIC and APRESS
AIC and BIC can be calculated as (Akaike, 1974; Schwarz, 1978):

AIC = 2k − 2 ln(L),
BIC = k ln(N) − 2 ln(L),

where k is the number of fitted parameters in the model, L is the maximized likelihood of the model and N is the sample size. As mentioned earlier, for least squares based regression analysis, AIC and BIC can be calculated directly from the MSE as (Hurvich & Tsai, 1989):

AIC(k) = N ln(MSE(k)) + 2k,
BIC(k) = N ln(MSE(k)) + k ln(N),

where MSE(k) is the MSE of the candidate model with k terms (Wei, Billings, & Balikhin, 2007). The APRESS can easily be calculated at each term selection step of the OFR algorithm. It is defined as

APRESS(k) = p(k) × MSE(k),

where p(k) is a penalty function defined in terms of the cost function C(k, α) = k × α, with α being a tuning parameter.
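The MSE-based forms of the three criteria can be sketched as below. This is an illustration, not a reference implementation; in particular, the GCV-style penalty p(k) = 1/(1 − C(k, α)/N)² used for APRESS is an assumption of this sketch and should be checked against the cited source.

```python
import numpy as np

def aic(mse, k, N):
    # MSE-based AIC for least squares regression: N ln(MSE) + 2k
    return N * np.log(mse) + 2 * k

def bic(mse, k, N):
    # MSE-based BIC: N ln(MSE) + k ln(N); a heavier penalty than AIC once N > e^2
    return N * np.log(mse) + k * np.log(N)

def apress(mse, k, N, alpha=1.0):
    # APRESS sketch: penalty p(k) = 1/(1 - C(k, alpha)/N)^2 with C(k, alpha) = alpha * k
    return mse / (1.0 - alpha * k / N) ** 2
```

For each criterion, the candidate model minimizing the value over k is the one selected.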
It can be noted that each of the three criteria contains two components: the first measures the prediction error, which indicates how well the model fits the data; the second is the cost function, which is used to penalize the model as more model terms (also called parameters in statistics) are added. There is therefore a trade-off between goodness of fit and model complexity. In general, the value of the criterion decreases as the first few model terms are included in the model, because of the reduction in prediction error. Once a sufficient number of model terms are included, the penalty component becomes significant, leading to an increasing value. The model with the minimum value is thus treated as an optimal choice, offering both good prediction performance and a parsimonious representation of the system. For example, APRESS is easy to implement within the OFR algorithm for nonlinear dynamic modelling; it has a tuning parameter, so a figure is needed to determine the optimal turning point; and it has been applied to nonlinear model selection in many applications. From an investigation of the literature, a summary of the reported advantages and limitations of AIC, BIC and APRESS is given in Table 1 (Aho et al., 2014; Hooten & Hobbs, 2015; Johnson & Omland, 2004; Medel & Salgado, 2013; Posada & Buckley, 2004; Vrieze, 2012; Wei, Billings, & Balikhin, 2006).

Model averaging with AIC, BIC and APRESS
Model averaging is a widely applied method to deal with model uncertainty and to reduce or eliminate the risk of relying on a single model. Model averaging approaches such as AIC- and BIC-based averaging methods have been used in many applications (Asatryan & Feld, 2015; Cade, 2015; Kontis et al., 2017; Moral-Benito, 2015). The model averaging approach with AIC involves the computation of the delta AIC and the Akaike weights. The delta AIC can be calculated as (Symonds & Moussalli, 2011):

Δ_i = AIC_i − AIC_min, i = 1, 2, ..., M,

where AIC_i is the AIC value for the i-th candidate model and AIC_min is the minimum AIC over all M candidate models. The Akaike weight indicates the probability that an individual candidate model is the best model. The Akaike weight ω_i for the i-th candidate model is computed as (Buckland et al., 1997):

ω_i = exp(−Δ_i/2) / Σ_{j=1}^{M} exp(−Δ_j/2), i = 1, 2, ..., M.

The averaged parameter estimate of 'full model averaging' is then calculated as

θ̄ = Σ_{i=1}^{M} ω_i θ̂_i,   (19)

where θ̂_i is the parameter estimate produced by the i-th candidate model. To produce averaged models based on BIC and APRESS, a simple approach is to replace AIC with BIC or APRESS and calculate the corresponding weights for the model parameters of all candidate models. The averaged parameters can then be computed using formula (19). This method is simple to implement. More importantly, it makes it easy to determine which of the three criteria gives the best averaged model. The advantage of the averaged model is that it is, in general, more robust than the single 'best' model determined by a model selection criterion. This is because a single model only contains a limited number of model terms suggested by the criterion; if the criterion fails to detect the correct number of model terms, the terms of the single model may be insufficient to represent the system well.
In contrast, the averaged model uses the information of all the candidate models, and each candidate model contributes according to its weight under the model selection criterion. Therefore, when the single model selected by the criterion is not the best, the performance of the averaged model is usually better than that of the single model. However, it should also be noted that a model with more terms is not necessarily better than a model with fewer terms, because some terms may be redundant and may deteriorate the model prediction performance. Therefore, it is not always true that the averaged model is better than a single model; rather, the averaged model is often more robust in cases where there is large uncertainty in the data collection, the model structure, the model parameters, and so on.
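The weighting-and-averaging procedure can be sketched as follows. This is a minimal illustration with hypothetical function names; applying the exp(−Δ/2) transform to BIC and APRESS values, as implied by replacing AIC in the weight formula, is an assumption of this sketch.

```python
import numpy as np

def criterion_weights(scores):
    """Turn criterion values (smaller is better) into normalized model weights
    via the Akaike-weight formula w_i = exp(-delta_i/2) / sum_j exp(-delta_j/2)."""
    scores = np.asarray(scores, dtype=float)
    delta = scores - scores.min()      # delta_i = C_i - C_min
    w = np.exp(-0.5 * delta)
    return w / w.sum()

def average_parameters(weights, thetas):
    """Full model averaging: weighted sum of candidate parameter vectors.
    Each row of thetas holds one candidate model's parameters, zero-padded to
    a common length so all vectors are comparable."""
    return np.asarray(weights, dtype=float) @ np.asarray(thetas, dtype=float)
```

Note that the weights sum to one and the best (smallest-criterion) model receives the largest weight, so the averaged parameters are dominated by the best few candidates.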

Case studies
In this section, case studies are carried out to evaluate the performances of the proposed model selection and model averaging methods.

A simulation example
Consider a nonlinear system described by model (20), where the input u(t) was assumed to be uniformly distributed on [−1, 1], and the noise ξ(t) is white noise with zero mean and finite variance. The signal-to-noise ratio (SNR) of the data is about 10 dB. A total of 500 input-output data points were generated; the first 250 points were used for model estimation and selection, and the remaining 250 points were used for performance testing. A regression vector ϑ(t) = [y(t−1), y(t−2), u(t−1), u(t−2)]^T can be defined with maximum time lags n_y = n_u = 2. The initial full model was chosen to be of polynomial form with nonlinear degree l = 2, so the full dictionary contains a total of 15 model terms: the constant term, the four lagged variables, and their ten second-order products. Note that the true model term √|y(t − 1)| in (20) is not included in any of the specified candidate model sets. Therefore, all candidate models can only provide an approximation of the true system behaviour, which is accurate to some degree but can never perfectly reconstruct the true system model structure. This is the case in most real-world data-driven modelling tasks, where the true system model structure is unknown. The OFR algorithm was used to select model terms from the dictionary and estimate the model, and AIC, BIC and APRESS were used to evaluate all the candidate models. The first 15 model terms, ranked by the ERR index, are shown in Table 2. It can be seen that the most important terms are selected in the first few steps, including the true system model term u(t − 1) × y(t − 2). The k-th candidate model is defined as the model containing the first k selected terms; for example, the second candidate model contains the two terms u(t − 1) × y(t − 2) and u(t − 2) × u(t − 2), and so on.
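The construction of such a 15-term degree-2 dictionary (constant, four lagged variables, and their ten second-order products) can be sketched as below; `narx_dictionary` and its term labels are illustrative names, not from the paper.

```python
from itertools import combinations_with_replacement

import numpy as np

def narx_dictionary(y, u, ny=2, nu=2, degree=2):
    """Build a polynomial NARX term dictionary: all monomials up to the given
    degree in the lagged variables y(t-1..ny) and u(t-1..nu), plus a constant."""
    lag = max(ny, nu)
    t = np.arange(lag, len(y))         # usable time indices
    regs = [y[t - i] for i in range(1, ny + 1)] + \
           [u[t - i] for i in range(1, nu + 1)]
    names = [f"y(t-{i})" for i in range(1, ny + 1)] + \
            [f"u(t-{i})" for i in range(1, nu + 1)]
    terms, labels = [np.ones(len(t))], ["1"]
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(regs)), d):
            terms.append(np.prod([regs[i] for i in combo], axis=0))
            labels.append(" * ".join(names[i] for i in combo))
    return np.column_stack(terms), labels
```

With ny = nu = 2 and degree 2 this yields 1 + 4 + 10 = 15 columns, matching the dictionary size used in this example.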
The AIC, BIC and APRESS values of all 15 candidate models were calculated and are shown in Figure 1, and some statistical evaluations of the models suggested by AIC, BIC and APRESS are given in Table 3. The performances of all the candidate models are shown in Figure 2. Compared with AIC and BIC, APRESS suggests a choice of three model terms, which is much smaller than the numbers suggested by AIC and BIC. Moreover, the model suggested by APRESS, although it has fewer model terms, possesses slightly better predictive capability. Because the prediction performance can be affected by the uncertainty introduced by the noise, it is normal that any of the models can achieve slightly better statistics of correlation, prediction efficiency and error, as long as it includes the main components of the true model. However, achieving a parsimonious representation of complex nonlinear systems is crucially important in many application situations, because a model with fewer variables can greatly reduce the work of data collection and aid the process of understanding the system. In general, all three model selection criteria are capable of model selection for this example. This is possibly because, although the model term √|y(t − 1)| is not in the candidate term set, it can be approximated by polynomial terms in y(t − 1).
The averaged parameters were calculated from the 15 candidate models using formula (19). Note that all three averaged models were computed from the same 15 candidate models; the only difference is that the averaged parameters were computed using different weights, based on AIC, BIC and APRESS, respectively. A comparison of the performances of the three averaged models is also shown in Table 1. It can be observed that the performances of the averaged models are slightly better than those of the associated single models, but this is achieved at the price of increased model complexity. As mentioned earlier, the true model term √|y(t − 1)| in (20) is not included in the specified candidate model terms; as a consequence, all the 'best' single models suggested by the three criteria simply achieve the best balance, or trade-off, between the model representation performance on the test data and the model complexity. For real applications, there is always a risk in trusting only a single model to make important decisions or carry out important analyses. The model averaging process, however, is extremely useful for improving robustness, especially when the true model structure is not included in the specified candidate model set or the model selection method fails to choose the best model.

A real-world application: Dst index forecast
The magnetosphere can be considered as a complex system. In order to understand the magnetosphere system, the Dst index is often used to measure magnetic disturbances (Wei et al., 2006, 2007; Wei, Billings, & Balikhin, 2004). In this study, the process generating Dst is treated as an unknown nonlinear system, where the system inputs are solar wind variables and the system output is the Dst index. A description of the inputs and output is given in Table 4. All the variables were sampled every hour. It should be noted that VBs is a multiplied input which has been suggested for inclusion in the model inputs (Gonzalez et al., 1994). The Dst data used in this example were sampled in 1998, giving a total of 1460 input-output data points. The first half of the data was used for model estimation and the second half for validation. As in the previously discussed simulation example, the OFR algorithm was used to select model terms and estimate the model parameters, and AIC, BIC and APRESS were used for model selection. The time lag of the inputs was chosen to be 4 and the nonlinear degree to be 2, and the model is input-only (a Volterra-type model), meaning that no autoregressive terms were included in the model. In total, 40 candidate models were estimated to predict the Dst index 1 hour ahead. The AIC, BIC and APRESS values of all the candidate models are shown in Figure 3. The numbers of model terms suggested by AIC, BIC and APRESS are 38, 8 and 7, respectively. The evaluation of the prediction performances of the three models is shown in Table 5, and the performances of all 40 estimated models are shown in Figure 4. It is clear that AIC fails to select the 'best' candidate model: the model with 38 terms performs poorly in forecasting the Dst index 1 hour ahead. In contrast, the models chosen by BIC and APRESS are quite similar and achieve very similar performances.
Comparing the performances of the two selected models with those produced by all the candidate models, it can be seen that BIC and APRESS selected nearly the 'best' model. Additionally, the model suggested by APRESS involves a relatively smaller number of model terms. Clearly, for this real-data example, both BIC and APRESS are capable of the model selection task. If a parsimonious representation is required, the APRESS statistic is superior to the other two model selection criteria.
The averaged parameters were calculated for the candidate models based on the AIC, BIC and APRESS weights. The results for the three averaged models are shown in Table 3, and a comparison of the predicted and observed Dst index is shown in Figure 5. It can be seen that the performances of the averaged models are similar to those of the associated single models. Following the discussion above, it can be concluded that the model averaging approach is consistent with the model selection results. The performance of the averaged model is mainly determined by the 'best' single model chosen by AIC, BIC or APRESS, while the other candidate models make smaller contributions to the averaged model according to their weights.

A real-world application: estimation of energy performance of residential building
The energy performance of a residential building is related to many aspects, for example, relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, and glazing area distribution (Tsanas & Xifara, 2012). In this example, models are built to represent the relationship between the heating load and these factors. Descriptions of these variables (factors) are given in Table 6 (Tsanas & Xifara, 2012). There are 768 input-output data points; the first and second halves of the data are used for training and testing, respectively. The nonlinear degree is set to 3. Similar to the process described in Sections 4.1 and 4.2, AIC, BIC and APRESS are used to evaluate a total of 20 candidate models and to select the model that best describes the system. The plots of AIC, BIC and APRESS for the candidate models are shown in Figure 6. Both AIC and BIC suggest the model with 12 model terms. For the APRESS statistic, by setting the adjustable parameter α to 0, 1, . . . , 10, three apparent turning points are observed at horizons 3, 14 and 17. From Figure 7, the model with three terms provides the better performance. It can be noted that another advantage of APRESS is that it uses an adjustable parameter α in the cost function, so that the optimal model length can be determined from the turning points rather than from the smallest value alone. In this example, if α is set to any single value less than 6, it would be difficult to find the optimal point. The adjustable parameter thus makes APRESS more sensitive to the optimal solution.
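The α-sweep used above to locate turning points can be sketched as follows. This is an illustration only; it assumes the GCV-style APRESS form p(k) × MSE(k) with p(k) = 1/(1 − αk/N)², and `apress_curves` is a hypothetical helper name.

```python
import numpy as np

def apress_curves(mses, N, alphas=range(0, 11)):
    """APRESS values for candidate models of size k = 1..len(mses), swept over
    several values of the adjustable parameter alpha. Turning points that
    persist across the alpha sweep indicate a good model size."""
    mses = np.asarray(mses, dtype=float)
    k = np.arange(1, len(mses) + 1)
    return {a: mses / (1.0 - a * k / N) ** 2 for a in alphas}
```

With α = 0 the curve is just the MSE and keeps decreasing, whereas larger α values push the minimum towards smaller model sizes; a model size that remains a minimum (or knee) across several α values is a robust choice.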
It can be seen from Table 7 that the averaged model provided by APRESS outperforms those provided by AIC and BIC. This is not surprising, as the single model selected by APRESS is much better than the models selected by AIC and BIC. Again, it can be concluded that APRESS is superior to AIC and BIC for model selection and model averaging when quantifying the energy performance of residential buildings.

Conclusion
Investigations have been carried out on model selection and model averaging with three information criteria, namely AIC, BIC and APRESS. Three case studies on system identification and data-driven modelling, using both simulation and real datasets, are presented, and the associated comparative analysis shows that APRESS is superior to AIC and BIC, with several advantages. First, the model produced by APRESS achieves a parsimonious representation with comparable or better prediction performance. Second, APRESS is simple to compute and to incorporate into the implementation of the OFR algorithm. Third, APRESS is more sensitive to the optimal solution in real-data modelling problems. With these benefits, APRESS is recommended for model selection in nonlinear system identification and data-driven modelling, especially for real-data modelling problems where the true system model structure is unknown. Moreover, a model averaging approach has been introduced and evaluated via the three case studies. The associated results indicate that the averaged model can improve model robustness, and it is therefore recommended to use the model selection and averaging methods together for real-data modelling problems of nonlinear systems. The reason why APRESS outperforms AIC and BIC in the three case studies is not theoretically justified in the present work; our future work will include a theoretical analysis of the performance of these methods.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported in part by EU Horizon 2020 Research and Innovation Programme Action Framework under grant agreement 637302, the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/I011056/1 and EPSRC Platform Grant EP/H00453X/1.