Neural network-based parametric system identification: a review

Parametric system identification, which is the process of uncovering the inherent dynamics of a system based on the model built with the observed inputs and outputs data, has been intensively studied in the past few decades. Recent years have seen a surge in the use of neural networks (NNs) in system identification, owing to their high approximation capability, less reliance on prior knowledge, and the growth of computational power. However, there is a lack of review on neural network modelling in the paradigm of parametric system identification, particularly in the time domain. This article discussed the connection in principle between conventional parametric models and three types of NNs including Feedforward Neural Networks, Recurrent Neural Networks and Encoder-Decoder. Then it reviewed the advantages and limitations of related research in addressing two major challenges of parametric system identification, including the model interpretability and modelling with nonstationary realisations. Finally, new challenges and future trends in neural network-based parametric system identification are presented in this article.


Introduction
Complex Systems feature a large number of measurable components interacting simultaneously and nonlinearly with each other on multiple levels (Zhao et al., 2017). In real-world scenarios, elusiveness widely exists in Complex Systems, such as climate systems (Gu et al., 2019;Zhao et al., 2016), neuro systems (He & Yang, 2021;Zhao et al., 2013), Cyber-Physical systems (Kaiser et al., 2018;Zhu et al., 2021), etc. System identification, which is the process of uncovering the inherent dynamics of the system directly from the observed inputs and outputs data, can be used to control, analyse, or design a complex system (Billings, 2013).
System identification can generally be divided into two types: parametric approaches and nonparametric approaches. The nonparametric system identification investigates the system's specific properties by analysing the observed data directly without a model. In contrast, the parametric system identification unveils the system's inherent dynamics by building a universal approximation model based on observed data. The analysis or prediction of the actual system is based on the model instead of the measurement data. One prominent advantage of such CONTACT Yifan Zhao yifan.zhao@cranfield.ac.uk School of Aerospace, Transport and Manufacturing, Cranfield University, Cranfield MK43 0AL, UK an approach is that it is still able to represent the system well when the noise is nonlinearly and causally correlated to the system inputs and outputs. A mathematic model of noise can be built and accommodated in the general model creating an unbiased estimation of the mean of the system output distribution (Billings, 2013). There are two significant challenges in parametric system identification.
Interpretability. Since parametric system identification is a model-based analysis procedure, it is necessary to build an interpretable model that can reveal the physical properties of the underlying system. Interpretability or transparency is often referred to as the dependency and causality between multiple inputs and outputs, as well as the system spectrum properties. The interpretable model should provide the system structure information for analysis, such as the system order (maximum time delay), significant system inputs, dominating frequency response properties, etc. However, this is challenging for the system identification approaches acting as a black box where no a priori knowledge is available about the actual system.
Identification with nonstationary realisations. When a system is chaotic, unstable, or time-varying, the observed data will be nonstationary, making modelling even more challenging. Conventionally, nonstationary time series can be regarded as homogeneous or inhomogeneous according to whether they can be stationarised by temporal pattern decomposition. The temporal pattern of homogeneously nonstationary time series is regarded as the intricate combination of trend, seasonal, and stochastic patterns (Box et al., 2015). If the time series is inhomogeneous nonstationary, the actual underlying system may be chaotic, time-varying and suffering from random disturbance (Ardalani-Farsa & Zolfaghari, 2010;Billings, 2013;. The training data space will be impossible to cover all situations, and a fixed model will always suffer from a covariate shift. So, it is challenging to build a robust and adaptive model capturing all genuine dynamic characteristics of the actual system when the measurement data is both homogeneous and inhomogeneous nonstationary. To tackle these challenges in the time domain, the regressor library-based modelling like the derivation of Nonlinear AutoRegressive Moving Average with eXogenous inputs (NARMAX) model with the forward regression with Orthogonal Least Square (FROLS) algorithm (Billings, 2013) and the derivation of the state-space model with Sparse Identification of Nonlinear Dynamics (SINDy) algorithm (Brunton et al., 2016) have been developed in the past few decades. For the transparency problem, the dependency between variables is revealed by selecting significant system inputs. The significant inputs correlated to the system outputs or states can be selected with Principal Component Analysis (PCA) based Multiple Forward Regression with Orthogonal Least Square (MFROLS) algorithm (Billings, 2013) or Forward Orthogonal Search (FOS) algorithm by maximising the overall dependency (MOD) . Then the regressor library containing all lagged significant features with a certain maximum time delay can be built. The form of the library component can be polynomial with a certain degree of nonlinearity, sinusoidal function, wavelet function and radial basis function (RBF) (Brunton et al., 2016;Hong et al., 2008). The causality between inputs and outputs can be revealed through Error Reduction Ratio -causality test (Zhao et al., 2017) or the NARX-based Granger causality test . As for the spectrum analysis for model transparency, a polynomial and wavelet NARMAX model can be transferred into generalised frequency response functions (GFRF) by omitting noise terms (Li & Billings, 2005).
For the time-invariant system with homogeneous nonstationary realisations, when observed data exhibits a trend and seasonal pattern, the Autoregressive Integrated Moving Average (ARIMA) model can be used, which employs a certain order and step difference operation to make time-series stationary. The ARIMA model can be extended to the seasonal multiplicative model when the seasonal patterns are interdependent. When heteroskedasticity appears in the residual analysis, the Generalised Auto-Regressive Conditional Heteroscedastic (GARCH) model is commonly adopted to mitigate the problem (Box et al., 2015). The ARIMA and GARCH models can be generalised to a Nonlinear AutoRegressive Integrated Moving Average with eXogenous Input (NARI-MAX) model and NARMAX-based GARCH model, respectively, when the mean function is nonlinear (Bodhanwala, 2014;Neshat et al., 2018). When the actual system is time-varying or suffering from disturbance generating inhomogeneous nonstationary realisations, the sliding window approach can be applied to select a common model structure representing the system structure, and the model parameters can be estimated with adaptive parameter estimation techniques such as Kalman filter (KF), reclusive least square (RLS), least mean square (LMS), Sequential Monte Carlo (SMC) methods like particle filters (Schön et al., 2015) or the wavelet modelling (He & Yang, 2021;Li et al., 2016). It should be noted that selecting the appropriate maximum time delay and degree of nonlinearity to minimise the number of candidate model terms in the library is critical for speeding up the calculation when applying traditional regressor library-based modelling methods. However, it usually relies on a priori knowledge.
In recent years, with the growth of computation power, neural network-based methods have become more popular in parametric system identification and unknown function approximation because of their prominent universal approximation ability (Aggarwal, 2018;Hornik et al., 1989;Jiao et al., 2018;Ljung et al., 2020;Shen et al., 2020Shen et al., , 2021. Furthermore, the building procedure of the neural network model does not rely on a priori knowledge about the actual system because its structure-related factors are regarded as hyperparameters that can be tuned in training and validation (Goodfellow et al., 2016). However, very limited reviews have been reported on system identification with neural networks to summarise their advantages and limitations and discuss the new challenges and future trends addressed by this article. As far as we are concerned, this is the first time that state-of-the-art neural networks have been reviewed in the context of parametric system identification. An additional novelty of this paper is the proposal of a new angle for the performance evaluation of neural networks, considering two criteria of interest in parametric system identification, interpretability and nonstationary. Another contribution of this study is the systematic establishment of a link between neural network-based methods and non-neural network-based methods in the context of parametric system identification.

The connection between neural networks and parametric system identification
Neural networks for time-series modelling can be generally classified into three types based on their structure and training methods: feedforward neural networks (FNN), recurrent neural networks (RNN), and encoder-decoder models. Recently, the connection between neural networks and parametric models used for system identification has been studied in (Ljung et al., 2020), where the FNNs and RNNs can be regarded as a Nonlinear Autoregressive with Exogenous inputs (NARX) and nonlinear state-space model, respectively.
FNN is the neural network topology with no previous states or outputs feedback loop. According to the structure of hidden units, FNN can be generally divided into three types: the multilayer perceptron (MLP) (Acuna et al., 2012;Wang & Song, 2014), the radial basis function network (RBFN) (Ahmed et al., 2010;Moody & Darken, 1988;Pérez-Sánchez et al., 2018) and the temporal convolution neural network (CNN) (Bai et al., 2018;Cao et al., 2021;Fan et al., 2021;Hao et al., 2020;Oord et al., 2016;Tang et al., 2022). FNN can be generally described with the NARX model family, where the system is assumed to have white noise. For simplicity, the single-input and singleoutput (SISO) model is demonstrated in this paper. The multi-input and multiple-output (MIMO) case, where the variables are all endogenous, is referred to as the nonlinear vector autoregressive (NVAR) model and is usually used for iterative prediction (Gauthier et al., 2021). The SISO NARX model can be described as follows: where F[·] can be any nonlinear function and n y , n u are the maximum time delay of the system output y and exogenous input u respectively. The FNN representing the NARX model is known as the series-parallel mode NARX network (Jumilla-Corral et al., 2021;Wang & Song, 2014). In Figure 1, an example of MLP representing the NARX model is illustrated. The FNN uses the direct method for prediction. The network output can be either a certain time step ahead value y(k + n) or values over a certain time span τ ahead {y(k + n), y(k + n + 1), . . . , y(k + n + τ )}. As long as the FNN has promising approximation ability, n steps can be long-term (Wei & Billings, 2006). However, when the future input data is not available, only the reclusive method can be applied, which implies that the previous prediction result of the FNN must be used as input to produce a new prediction. Since the FNN is trained with the direct method, as long as the error between system output and model output is not zero, the error will be accumulated as the prediction goes on, ending up with a potentially poor result (Wang & Song, 2014). To overcome this problem, recurrent neural networks (RNNs) have been developed.
In the RNN, past information is transferred through the previous states or outputs, and each time step shares the same weight, which is similar to the nonlinear state space model. A general nonlinear state-space model in the discrete domain is described as follows (Ljung et al., 2020): where x, y and u are the system state, output, and input respectively. The nonlinear functions f and h can be parametrised by θ in various ways. In RNNs, the f and h become activation functions and θ become the weights and biases. An example of vanilla RNN including the previous output feedback (Jordan) type and previous state feedback (Elman) type is shown below: The Jordan RNN can be represented by Equations (4) and (6). The Elman RNN can be represented by Equations (5) and (6). In Eqs. (4) -(6), the hypermeter σ h , σ y are the activation functions of the hidden layer and output layer, respectively. The parameter W hx , W hh b h ,W y , b y are the weights and biases of the hidden layer and output layer, respectively. The values of h t−1 and y t−1 are the previous state and previous output (Elman, 1990). Modern RNNs such as the longshort term memory (LSTM) and gated recurrent unit (GRU), which were developed based on the Elman RNN by controlling the information flow with gates, can also be regarded as examples of the nonlinear state-space model (Ljung et al., 2020). RNNs address the reclusive prediction error by training via the backpropagation through time (BPTT) within a selected time window τ (Werbos, 1990). The BPTT takes the accumulated error into account by updating parameters based on both current prediction errors and future prediction errors within the time window τ in training. Since the accumulated loss is only optimised within τ , the value of τ should be selected the same as or larger than the maximum time delay of the input time series that affects the output. Therefore, the time window is closely related to the system order.
One limitation of the typical recurrent layer is that the time window of its inputs and outputs is required to be fixed during the training. To tackle this problem, the encoder-decoder model, which is also referred as the sequence to sequence (seq2seq) model, is designed for modelling when the memory window of input and output sequence is not fixed (Liu et al., 2021). The typical seq2seq model using RNN as the encoder and decoder is shown in Figure 2. Two separate RNNs are adopted as the encoder and decoder, respectively where the input time window is m and the output time window is n. The encoder takes the input sequence and only outputs a fixed-length vector to the decoder, either using the state of the final time step or a temporal embedding vector if the attention mechanism is adopted to tackle the memory loss of the decoder (Bahdanau et al., 2014;Vaswani et al., 2017). The decoder takes the fixed-length vector for initial prediction when a start token is received. The start token can be either one hot vector or a part of time series before the value is predicted (Vaswani et al., 2017;Zhou et al., 2021). It should be noted that both FNN and RNN can be used for the encoder and decoder (Lim & Zohren, 2021). Furthermore, the decoder part can be either autoregressive or non-autoregressive (Vaswani et al., 2017;Zhou et al., 2021). Therefore, the type of parametric model corresponding to an encoderdecoder model is dependent on its specific structure.

Neural network models for system identification
In the paradigm of parametric system identification, the model is built based on observed system inputs and outputs time series. Then the correlation, causality and spectrum analysis, and system behaviour forecast are implemented based on the identified model. When the neural network acts as the model for parametric system identification, the neural network will be trained with the historical system inputs and outputs. Then the analysis and prediction are implemented based on the trained neural network.
The model interpretability and modelling with nonstationary realisations are two challenges of parametric system identification. The interpretability of a model often refers to the model's transparency to correlation, causality, and spectrum analysis. The system characteristics, including the significant system inputs, system order and delay information, should be revealed through the parametric correlation and causality analysis. For spectrum analysis, the model built in the time domain should be capable of transformation to the frequency domain. However, the original structures of deep neural networks are regarded as black boxes and hard to interpret (Cui & Athey, 2022;Rudin, 2019). Furthermore, for nonstationary time series, conventional neural networks are incapable of tracking data dynamics and adapting to underlying changes (Ditzler et al., 2015). Recently, some research has been conducted to enhance the interpretability of neural networks and their performance in modelling nonstationary time series. Table 1 summarises the research progress in addressing two major challenges of parametric system identification based on different types of neural networks for dynamic system modelling.

System characteristic identification
In parametric system identification, the system characteristics, including significant system inputs, system order, delay information, and system spectrum information, are identified based on the interpretation of the parametric model. In Table 2, some milestone works of system characteristic identification using different neural networks are reviewed in terms of their key operations and limitations. The detail of these milestone works and their follow-up works are discussed below.

The selection of significant system inputs
The first type of method to select significant system inputs is based on the nonlinear Granger causality (GC) test, where a measure is defined to evaluate whether a certain input helps to predict the output. Montalto et al. (2015) proposed an NN-GC approach where the nonlinear GC is derived by calculating the difference between prediction errors of MLP with and without the past information of certain input variables using the non-uniform embedding (NUE) strategy. NUE is the framework to select the most informative lagged input one after another. However, the implementation is complicated with high computational costs. In (Wang et al., 2018), it has been tested that the NUE strategy version of the 20-order NN-GC model costs almost twice as much CPU time as the NN-GC without NUE.
Based on NN-GC, in (Wang et al., 2018), the RNN-GC framework was developed where both linear and nonlinear Granger causalities are derived by  RBFN  • Preselecting the centres of kernels with K-means.  (Tutunji, 2016) Approximating activation function with the first two terms of the Taylor series to avoid nonlinearity.
Transfer function • System spectrum information • Only applicable with single hidden layer MLP. • Only a linear transfer function can be derived.
calculating the prediction error of LSTM with and without a certain input variable. In comparison to NN-GC (Montalto et al., 2015), which uses MLP as the prediction model, RNN is less prone to overfitting due to its weight sharing over time, which makes the parameter dimension irrelevant to the length of the input time series. This makes RNN-GC less sensitive to the model order, resulting in a higher accuracy of causality detection. Furthermore, since the parameter dimension is irrelevant to the length of the input time series, one fixed RNN structure is able to fit time series with different maximum time delays, which makes it more flexible than MLP (Wang et al., 2018). A similar framework has been implemented in (Rosoł et al., 2022), except that the Wilcoxon signed-rank test is adopted to assess the significance of causality.
In (Li et al., 2020), causality is expanded to include the effect of future values on present values, which is detected using a bidirectional LSTM. Recently, Liu et al. (2021) proposed the seq2seq-LSTM Granger Causality (SLGC) framework, where the LSTM-based encoder-decoder model is adopted for prediction. The seq2seq model can be adopted when the length of the input and output sequence is not fixed and equal in training. The nonlinear Granger causality is calculated by comparing the fraction of variance explained by the forecasting model with and without a certain input variable, measured by the R-squared score. However, these methods are relatively more complicated than the following fully end-to-end method. The sparse regression method is another way to select significant system inputs where the nonlinear GC between inputs and outputs is derived in a fully end-to-end manner. In (Tank et al., 2021), the component-wise LSTM (cLSTM) was developed to disentangle the contribution of different inputs to the outputs. The group least absolute shrinkage and selection operator (LASSO) penalty is applied on all weights in the first cLSTM layer. Both gate weights and cell state weights show the same sparsity pattern indicating the significant inputs. However, only the causality can be detected in cLSTM, while the maximum time delay and specific time delay cannot be derived by the penalty-based method on RNN. Because the weight is allocated for each time step in MLP while the weight of RNN shares over time, there is no way to inspect the contribution of a variable at a certain time step according to the weight of RNN.
Similarly, in (Khanna & Tan, 2019), the causality was derived by applying group LASSO penalty on the economy statistical recurrent unit (eSRU) which is a simplified version of SRU with fewer parameters to make it less susceptible to overfitting. In testing, with the simulation dataset that includes Lorenz-96 and vector autoregression (VAR), the eSRU with LASSO penalty demonstrates almost perfectly recover of the true causality, achieving an Area Under the Receiver Operating Characteristic curve (AUROC) of nearly 93%. This AUROC is more than 10% higher than that of cLSTM. Similarly, for the blood oxygenation level dependent (BOLD) imaging dataset, the eSRU outperforms cLSTM significantly, achieving an additional AUROC of nearly 14%. However, in the DREAM-3 In Silico Network Challenge dataset, the eSRU only exhibits a slight improvement over cLSTM. One limitation of all research mentioned above is that the hidden confounder problem has not been considered.

The identification of system order and specific delay terms
The identification of system order (maximum time delay) and the specific input-output time delay is usually more challenging than the selection of significant system inputs because the delay information should be revealed in the causality analysis. In (Tank et al., 2021), the component-wise MLP (cMLP) is developed to disentangle the contribution of different inputs to the outputs. The nonlinear Granger causality, maximum time delay and specific time delay can be derived by applying the group LASSO, hierarchical LASSO, and group sparse group LASSO penalty on the first hidden layer respectively. However, the hidden confounder problem has not been considered.
In , the multiscale RBFN was developed to capture the system dynamics. Similar to the derivation of the polynomial NARMAX model with ERR-OLS, the FROLS algorithm is applied to select significant RBF bases and calculate their weights (Billings, 2013;Billings et al., 2007). Since the delay information is contained in RBF bases, the system order and specific input-output delay information can be revealed. However, pre-processing such as clustering is required to determine the centre and width of each node heuristically, which makes it relatively more complicated than the fully end-to-end method. Nauta et al. (2019) developed the Temporal Causal Discovery Framework (TCDF) to obtain causality while predicting the target time series. In the TCDF, the Attention-based Dilated Depthwise Separable Temporal Convolutional Networks (AD-DSTCNs) was developed to separate the contribution of inputs to different outputs when representing the NVAR model. In AD-DSTCNs, the nonlinear correlation between inputs and outputs is derived by the hard spatial attention which is implemented by setting the threshold on the attention score since the correlation and causality are binary decision problems. Then all the input variables correlated to the output are regarded as potential causes of the output and validated by the permutation importance validation method to find true causes if all confounders are measured. Finally, the maximum time delay (system order) can be discovered by following the path with the highest kernel weights from the output layer back to the input layer thanks to the depthwise separable structure and weight-sharing property of convolution layers. However, the specific time delay cannot be derived since only the path with the highest kernel weights can be tracked.

Spectrum analysis
Similar to the conventional parametric system identification approach (Li & Billings, 2005), the spectrum analysis for the neural networks trained in the time domain is based on model transformation. However, up to now, only single-hidden-layer MLP and RNN can be transferred to a frequency domain model for spectrum analysis. In (Fung et al., 1997), the GFRF of single-hidden-layer MLP and Jordan RNN is derived through truncated Volterra series expansion of the activation function. However, the applications on modern RNNs, such as LSTM and GRU have not been studied. In (Tutunji, 2016), the linear transfer function of the single-hidden-layer MLP whose input includes both system's historical input and output is derived by approximating the activation function with the first two terms of the Taylor series expansion to avoid nonlinearity. However, the nonlinear GFRF is not derived, which ends up with a steady-state error when the MLP represents nonlinear systems.

System modelling with nonstationary realisations
There are generally two types of training methods for neural networks, either online or offline, for tracking the data dynamics and adapting to the underlying system changes from nonstationary realisations. The online training method is adopted to adjust the neural networks when the system is chaotic or time-varying generating inhomogeneous nonstationary time series. The offline method is more often employed to train neural networks to learn the trend and seasonality of homogeneous time series. Some milestone works of modelling using neural networks with nonstationary realisations are reviewed and summarised in Table 3 with respect to their training types and key operations.

Homogeneous nonstationary
To address homogeneous nonstationary time series, Wu et al. (2021) proposed the Autoformer to learn the intricated temporal pattern for long-term predictions. In the Autoformer, the auto-correlation mechanism was developed to detect period-based dependency and aggregate corresponding subseries as temporal embeddings. Then, the series decomposition block is designed to decompose seasonality and trend progressively. The encoder only focuses on seasonal part modelling and the decoder predicts the trend and seasonal component with accumulation structure and stacked Auto-Correlation mechanism, respectively. This approach outperformed the selfattention-based model like Informer (Zhou et al., 2021) and Reformer (Kitaev et al., 2020) since the autocorrelation mechanism learns the sub-series dependency while the self-attention mechanism only learns the point-wise dependency (Wu et al., 2021). It also outperformed the neural networks that decompose trend and seasonality in pre-processing, such as the CNN-based DeepGLO (Sen et al., 2019) and MLPbased N-BEATS (Oreshkin et al., 2020). This is because the pre-processing is constrained to historical information while Autoformer progressively decomposes seasonality and trend series throughout the whole forecasting process (Wu et al., 2021). However, due to its complicated and fixed structure, it is non-adaptive to track the time-varying underlying process dynamics in inhomogeneous non-stationary data.
In (Chng et al., 1996), the gradient RBF (GRBF) network was developed, where the input of RBFN becomes a certain order difference operation of the historical data. The order of the difference operation is a hyperparameter that can be tuned to achieve lower loss. The purpose of the difference operation is to mitigate the trend and seasonal effect. When a certain input is close to the centre of a hidden unit, the hidden unit is activated, performing as a localised one-step-ahead prediction of the output. Hence, the GRBF senses a certain order of localised derivative of time series, which captures the trend and seasonal pattern. However, the significant inputs of the actual system cannot be selected in the same manner as  since the input of GRBF is a certain order difference of data instead of the data from the original measurement space. Although the fixed structure makes it non-adaptive to underlying changes in inhomogeneous nonstationary realisations, the shallow structure of RBF allows it to be adjusted to be adaptable .   Online • Combine the GRBF (Chng et al., 1996) and fast tuneable RBF (Chen et al., 2016)

Inhomogeneous nonstationary
For the inhomogeneous nonstationary time series, the actual system may be time-varying or suffering from different disturbances. Some traditional methods like RLS can be used to adaptively estimate the weight of RBFN online (Chen, 1995). For simplicity, a large number of training data can be randomly selected as kernel centres, and then RLS can be applied to estimate the weight, which is known as the online sequential extreme learning machine (OS-ELM) (Wang & Han, 2014). However, adaptively learning the weight of the fixed structure of RBFN is not expressive enough when the significant inputs and causal relationship are timevarying . The fast-tuneable RBF was developed to adaptively learn the network structure and parameters. The node will be replaced when the squared relative error (SRE) is larger than a user-set threshold. The node with the least contribution measured by weighted node-output variance (WNV) will be abandoned. One simple way to set new node centre and width is using the current input data and the new maximum distance between centres, respectively. Then the weight can be updated with gradient descent or least squares. If no node is replaced, parameters will just be updated by RLS (Chen et al., 2016). However, the fast-tuneable RBF doesn't take the homogeneous nonstationary condition into account. Inspired by the conventional sliding-window model structure selection approach, Du et al. (2021) proposed an adaptive RNNs (AdaRNN) framework to identify different phases and fitted one robust model capturing the common knowledge shared among different periods. In AdaRNN, the target time series are split into K most dissimilar segments by reclusively maximising the distance between distributions of the randomly selected period with the greedy algorithm in the Temporal Distribution Characterisation (TDC) module. Then each segment is regarded as a homogeneous interval. Finally, an RNN-based encoderdecoder model is adopted to robustly fit all selected phases by matching distributions between RNN states from different phases to extract common knowledge in the encoder with the Temporal Distribution Matching (TDM) module. However, the end-to-end optimisation of both TDC and TDM remains a problem. Furthermore, the adaptive parameter learning methods have not been considered, which makes it hard to track some rapid changes in the time series generated by a time-varying system.
Recently,  combined the GRBF and the fast tuneable RBF creating the fast adaptive GRBF networks which is capable of identifying the system with data that is both homogeneous and inhomogeneous nonstationary. However, it still suffers from the same problem as the GRBF that the significant inputs of the actual system cannot be selected.

Research gaps and future direction
Many processes in climate systems and neuro systems are inherently nonstationary, which makes the system characteristics identification challenging (He & Yang, 2021;Zhao et al., 2016). In conventional parametric system identification approaches, a robust and adaptive parsimonious model, such as the time-varying NARMAX (TV-NARMAX), can be derived with nonstationary realisations. The system input-output delay information can be revealed in the model structure and the spectrum information can be revealed by transferring the TV-NARMAX model in the time domain to a time-varying GFRF (TV-GFRF) (He et al., , 2015Li et al., 2017) in the frequency domain. The significant inputs can be derived by the windowed ERR-based causality test (Zhao et al., 2012(Zhao et al., , 2017. Although a lot of studies have been carried out to either improve the interpretability of neural networks in the stationary scenario or their prediction performance with nonstationary realisations, the system characteristics identification with nonstationary realisations based on neural networks is overlooked, which potentially attracts future research. Although some state-ofthe-art (SOTA) neural networks, such as Autoformer and AdaRNN, have been developed, there has been very limited research on long-term forecasting with both homogeneous and inhomogeneous realisations. A potential future research direction could be merging the TDC and TDM mechanisms from AdaRNN into the Autoformer to enhance its adaptability. In conventional parametric system identification, the residual analysis is critical for model validation during a homogenous interval. Whether the residual is white noise indicates if all useful information in both input and output time series has been fully extracted by the model. If the residual is autocorrelated, the model is biased. Furthermore, if the residual is correlated with the explanatory variables, a condition known as endogeneity, the model is also biased (Box et al., 2015). Noise modelling should usually be considered to create unbiased estimation (Billings, 2013). On the other hand, if the residual has zero mean but its variance is always changing, a condition known as heteroscedasticity, the model is unbiased but the confidence intervals and hypotheses tests cannot be relied on. Then some models should also be built to estimate the change of variance (Bodhanwala, 2014;Box et al., 2015). Most of the validation and testing of neural networks in previous research is only based on prediction errors, while the residual analysis has not been considered. The question of whether the produced neural network is unbiased and has valid confidence intervals is often neglected, but it is important in parametric system identification. Although the residual analysis is considered in (Ljung et al., 2020), noise modelling and variance estimation with neural networks remain a problem. Therefore, it is still difficult to build an unbiased neural network model when the noise is coloured and correlated with system inputs and outputs. Future research is required to address this bottleneck to further promote neural networks in system identification.

Conclusions
This paper reviewed recent studies on parametric system identification based on three types of neural networks, including feedforward neural networks, recurrent neural networks and encoder-decoder, which have better approximation ability and rely less on prior knowledge than conventional approaches. It has been discovered that the FNNs and RNNs can be regarded as NARX and nonlinear state-space models, respectively, while the parametric model corresponding to an encoder-decoder depends on its specific structure. In system characteristic identification based on the interpretation of neural networks, it is observed that more research has been conducted on selecting significant inputs and extracting input-output delay information, whereas the spectrum analysis with neural networks has been relatively less studied and presents an area for future research. On the other hand, system modelling with both homogeneous and inhomogeneous realisations have been less studied, leaving a gap for future research. Furthermore, long-term forecasting with both homogeneous and inhomogeneous realisations could be potentially handled by leveraging the advantages of multiple SOTA neural networks, such as Autoformer and AdaRNN. Finally, it was identified that system characteristics identification with nonstationary realisations and modelling with coloured noise based on neural networks remain understudied and waiting to be explored.

Data sharing statement
Here is no data available for this article.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Aoxiang Dong graduated from Newcastle University with Distinction in Automation and Control in 2019. While studying at Newcastle University, he focused on designing digital control systems with advanced control theory and stochastic state estimation. He is currently pursuing a PhD in nonlinear system identification for data-driven control and decision-making.
Andrew Starr is a Professor of Maintenance Systems, Head of TES Institute and Director of Education in the School of Aerospace, Transportation and Manufacturing at Cranfield University. His works in novel sensing, e-maintenance systems, and decision-making strategies has been recognised as internationally excellent.
Yifan Zhao was born in Zhejiang, China. He received a Ph.D. degree in Automatic Control and System Engineering from the University of Sheffield, Sheffield, UK, in 2007. He is currently the Professor of Data Science at Cranfield University, Cranfield, UK. His research interests include computer vision, signal processing, non-destructive testing, active thermography, and nonlinear system identification.