Input-to-state stability for system identification with continuous-time Runge–Kutta neural networks

Runge–Kutta neural networks (RKNN) bridge the gap between continuous-time advantages and the discrete-time nature of any digital controller. An RKNN defines a neural network in a continuous-time setting and explicitly discretises it by a variable sample time Runge–Kutta method. As a result, a-priori model knowledge, such as the well-known continuous-time state-space model, can easily be incorporated directly into the neural network weights. Additionally, RKNN offer high long-term prediction accuracy and a reduced parameter precision requirement. In this contribution, we enhance RKNN with global asymptotic stability (GAS) and input-to-state stability (ISS) criteria. Based on a Lyapunov function, we develop constraints on the neural network weights for an arbitrary sample time. The constraints are independent of any measurements, have only a few hyperparameters, can be combined with any standard constrained optimisation algorithm and guarantee stability already during training. We apply the algorithm to two public real-world nonlinear identification tasks, an electro-mechanical positioning system and a cascaded water tank benchmark. Both benchmarks are solved with the same hyperparameters, and the presented method is competitive with nonlinear identification methods beyond neural networks in the literature. In prediction configuration, all other black-box nonlinear identification approaches are outperformed regarding root mean squared error (RMSE) by an order of magnitude on the cascaded tank benchmark.


Introduction
Nonlinear system identification is a core discipline in control engineering (e.g. Nelles, 2001; Sjoberg et al., 1995). For more than 30 years, neural networks have been developed and widely applied for the identification of nonlinear dynamic systems (e.g. Ahmed et al., 2010; Chen et al., 1990; Deflorian et al., 2010; Haykin, 1990; Narendra & Parthasarathy, 1990; Weigand et al., 2020; Williams & Zipser, 1989). Many contributions focus on discrete-time neural networks. As pointed out by Y. J. Wang and Lin (1998), there are several problems in using discrete-time neural networks: larger approximation errors are induced by the assumed first-order discretisation, the long-term prediction accuracy is often unsatisfactory, and the neural network can only predict the system behaviour at fixed time intervals. A variable sample time 1 in training and application leads to complete model failure. This is mainly because feed-forward networks and recurrent neural networks tend to learn the system states instead of the change rates of the system states. Thereby, the discrete sample time is inseparably connected to the neural network weights.
Another approach is continuous-time neural networks. This type of network does not suffer from the aforementioned disadvantages. However, continuous-time neural networks do not consider the discrete-time nature of measurements or of any digital controller. They often require a quasi-continuous sample time. This results in a large amount of training data and a restriction of applications. In industrial real-time applications, a high sampling rate can be prohibitive, and thus the discrete-time nature of the measurements has to be taken into account. Runge-Kutta neural networks (RKNN) bridge this gap by constructing a neural network in the continuous-time domain and integrating the Runge-Kutta (RK) approximation method explicitly. Thus a precise estimate of the change rates of the system states is possible (Y. J. Wang & Lin, 1998). Therefore, RKNN do not suffer from the aforementioned disadvantages of discrete-time or continuous-time neural networks and are superior in generalisation and long-term prediction accuracy, as theoretically proven by Y. J. Wang and Lin (1998) and shown in simulations by Efe and Kaynak (2000). More recently, Forgione and Piga (2021a) and Mavkov et al. (2020) achieved very good results with continuous-time neural networks for system identification.

CONTACT Jonas Weigand, jonas.weigand@mv.uni-kl.de. † Equal contribution.
In general, model stability is a key consideration for nonlinear system identification, especially with data-based neural networks. Accordingly, the stability properties of discrete-time neural networks are widely studied (e.g. Barabanov & Prokhorov, 2002; Hu & Wang, 2002b; L. Wang & Xu, 2006; Yu, 2004). The continuous-time model stability properties are broadly analysed too; see, e.g. Hu and Wang (2002a) for stability and Sanchez and Perez (1999) for the more general concept of input-to-state stability (ISS). ISS extends the stability criterion of global asymptotic stability (GAS) in the sense of Lyapunov to non-autonomous systems, i.e. systems with an input not equal to zero. Sontag (2008) shows that some GAS systems can diverge even for some inputs that converge to zero.
However, due to the explicit transfer to the discrete-time domain, neither the discrete-time nor the continuous-time stability criteria are applicable to RKNN. Only a sample time-dependent criterion can guarantee ISS for RKNN. Therefore, we derive sample time-dependent ISS constraints on the neural network weights in this contribution. The ISS constraints ensure that all predicted states remain bounded. This always holds true, including but not limited to all inputs, to long-term predictions and to feedback loops containing the neural network in a model-based controller with potentially noisy state measurements. Note that the presented algorithm enforces model stability even if the underlying process is unstable. In this case, the algorithm generates model errors, identifying a stable model for an unstable process.
The contributions of this paper can be summarised as follows:
• A discussion of bridging the gap between continuous-time models and the discrete-time nature of digital controllers with RKNN.
• Sample time-dependent ISS and GAS stability guarantees for RKNN.
• A comparison of RKNN with methods beyond neural networks for system identification on two public, real-world benchmarks. The benchmarks are an electro-mechanical positioning system and a cascaded water tank system. To the best of our knowledge and in prediction configuration, the presented approach outperforms all other methods applied on the same benchmarks.
The remainder of this paper is organised as follows. In Section 2, we illustrate the contributions regarding stability with two simple examples. In Section 3, the RKNN is defined, and in Section 4 the properties of continuous-time neural networks are discussed. In Section 5, some preliminaries regarding stability are given, and in Section 6 the stability constraints for the RKNN are presented. We apply the algorithm to two real-world identification tasks in Section 7 and make concluding remarks in Section 8.

Illustrative examples
The core contribution regarding stability can be illustrated using two simple examples, accounting for the continuous-time stability and the discretisation method, as both aspects contribute to ISS. First, consider a learned continuous-time neural network which is clearly GAS but whose states tend to infinity in the presence of bounded inputs. This example underlines the importance of ISS instead of GAS.
Consider a continuous-time neural network with one state x(t) ∈ R, one input u(t) ∈ R, continuous time t ∈ R, hyperbolic tangent activation function tanh(·), no bias weights and two neurons in the hidden layer,

ẋ(t) = −tanh(x(t)) + 10 tanh(u(t)).   (1)

The network (1) is clearly GAS, as the autonomous system reduces to ẋ(t) = −tanh(x(t)) and converges to zero. But for any given input u(t) ≥ 2 (and including any upper bound, e.g. u(t) ≤ 10^6), we get

ẋ(t) > −tanh(x(t)) + 9 > 8

and the system trajectory tends to infinity although the input is bounded. The following second example shows that even an ISS neural network can become unstable if the neural network constraints do not account for the sample time h ∈ R. Consider a standard state-space neural network with eigenfrequency ω_0 = 10 1/s, damping D = 0.5 and input gain K = 1. The network consists of two states, one input, no bias terms and no hidden layers,

ẋ(t) = [0, 1; −ω_0^2, −2Dω_0] x(t) + [0; Kω_0^2] u(t),   (2)

and is GAS and ISS in the continuous-time domain (as it is input-affine and the real parts of all eigenvalues are negative).
Assume (2) is time discretised by a fourth-order Runge-Kutta method with a sample frequency much greater than the eigenfrequency. As exemplarily shown in Figure 1 for a step response, the discrete-time trajectory of (2) tends to infinity for h = 0.3 s although it is stable for h = 0.1 s. This trade-off between neural network weights and sample time is non-trivial in general.
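This trade-off can be reproduced numerically. The sketch below (in Python, with the state-space matrices of (2) written out for ω_0 = 10 1/s, D = 0.5 and K = 1; the explicit companion form is our assumption) applies the classical 4-stage RK method to the step response at both sample times:

```python
import numpy as np

# Example system (2): omega_0 = 10 1/s, D = 0.5, K = 1 in companion form
# (the explicit state-space matrices are an assumption for illustration).
w0, D, K = 10.0, 0.5, 1.0
A = np.array([[0.0, 1.0], [-w0**2, -2.0 * D * w0]])
B = np.array([0.0, K * w0**2])

def f(x, u):
    # Continuous-time dynamics x_dot = A x + B u.
    return A @ x + B * u

def rk4_step(x, u, h):
    # Classical 4-stage explicit Runge-Kutta step, input held constant.
    g1 = f(x, u)
    g2 = f(x + h / 2 * g1, u)
    g3 = f(x + h / 2 * g2, u)
    g4 = f(x + h * g3, u)
    return x + h / 6 * (g1 + 2 * g2 + 2 * g3 + g4)

def step_response(h, n_steps):
    # Simulate a unit step input starting from the origin.
    x = np.zeros(2)
    for _ in range(n_steps):
        x = rk4_step(x, 1.0, h)
    return x

x_stable = step_response(0.1, 60)    # 6 s of simulated time: converges
x_unstable = step_response(0.3, 20)  # 6 s of simulated time: diverges
print(x_stable, np.abs(x_unstable).max())
```

For h = 0.1 s the scaled eigenvalues hλ of the RK4 map lie inside its stability region, while for h = 0.3 s they leave it, so the same weights yield a stable or an unstable discrete-time model depending only on h.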
To the best of our knowledge, the presented stability constraints are the only ones that can enforce ISS in the given examples.

Runge-Kutta neural networks
The RKNN consists of a continuous-time neural network embedded in an explicit S-stage RK method. The neural network is defined by the differential equations

ẋ(t) = f_NN(x(t), ũ(t)) = W^l_x x(t) + W^l_u ũ(t) + W^o_a σ(W^h_x x(t) + W^h_u ũ(t)),   (3a)
ŷ(t) = h_NN(x(t), ũ(t)).   (3b)

Here, the true and calculated outputs are y ∈ R^{N_Y} and ŷ ∈ R^{N_Y} respectively, and the true and calculated states are x ∈ R^{N_X} and x̂ ∈ R^{N_X} respectively. The nomenclature of the weight matrices W is as follows: superscripts l, o and h denote the linear matrix, the output matrix and the hidden layer matrix of the state network. Superscript H corresponds to the respective layers of the output network h_NN(x, ũ). Subscripts x, u and a refer to state weights, augmented input weights and activation function weights respectively. Note that the augmented input is bounded, i.e. there is an upper bound M ∈ R so that ‖ũ(t)‖_∞ < M holds for all times t.

Figure 1. Step response of a stable and an unstable time discretisation of (2). Circles and squares represent each discrete-time step. Green: step input. Black: stable discretisation. Blue: unstable discretisation.
The augmented input ũ is composed of the input u ∈ l^{N_U}_∞ and a constant for taking bias terms into account,

ũ(t) = [u(t)^T, 1]^T.

Several neural networks can be applied, as long as the activation function σ(·) of the state network is Lipschitz continuous, is defined element-wise and satisfies σ(0) = 0, so that the ratio σ(v_i)/v_i is bounded for any argument v_i ∈ R_{≠0}. There are no requirements for the activation function of the output network σ^H(·). In this work, we present and compare two types of activation functions, the hyperbolic tangent (TANH) and rectified linear units (ReLU). In the case of TANH, the activation function is σ(v_i) = tanh(v_i); in the case of ReLU, it is σ(v_i) = max(0, v_i). The continuous-time neural network (3a) is embedded in an S-stage explicit RK method with sample time h such that the discrete-time system

x̂_{n+1} = x̂_n + h Σ_{i=1}^{S} b_{B,i} g_{n,i}   (4a)

is generated. We define the discrete time index n ∈ N so that x_n = x̂(t = nh). The RK scheme provides the rule to compute (4a) from (3a) for a given sample time h. The general S-stage RK method is defined by the discrete-time system

g_{n,i} = f_NN( x̂_n + h Σ_{j=1}^{S} a_{B,ij} g_{n,j}, ũ(t_n + c_{B,i} h) ),  i = 1, …, S,   (5a)

(see Butcher, 2016). The coefficients of the RK method are usually collected in vectors and a matrix and are represented by the Butcher array. The RK scheme (5a) is completely determined by the parameter matrix and vectors A_B, b_B and c_B. Note that for an explicit RK method, the matrix A_B is strictly lower triangular (see Butcher, 2016). Although the notation (5a) allows a varying ũ within the sample interval, we assume it to be constant during one arbitrarily short sample interval. In the following, we focus on the state network. The solution of the output network is straightforward and does not affect any stability considerations, since it does not contain any derivatives or any feedback,

ŷ_n = h_NN(x̂_n, ũ_n).

The classical 4-stage RK method as a neural network graph is depicted in Figure 4, with the state network f_NN(x, ũ) and the output network h_NN(x, ũ) as presented in Figures 2 and 3 respectively.
Equivalent to Figure 4, the classical 4-stage RK method is given by the equations

g_{n,1} = f_NN(x̂_n, ũ_n),
g_{n,2} = f_NN(x̂_n + (h/2) g_{n,1}, ũ_n),
g_{n,3} = f_NN(x̂_n + (h/2) g_{n,2}, ũ_n),
g_{n,4} = f_NN(x̂_n + h g_{n,3}, ũ_n),
x̂_{n+1} = x̂_n + (h/6)(g_{n,1} + 2 g_{n,2} + 2 g_{n,3} + g_{n,4}),
ŷ_n = h_NN(x̂_n, ũ_n).
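A minimal sketch of one RKNN forward step, assuming a state network of the form W^l_x x + W^l_u ũ + W^o_a σ(W^h_x x + W^h_u ũ) (assembled from the nomenclature above; the random weight values are placeholders, not trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
NX, NU, NN = 2, 1, 4  # states, inputs, hidden neurons (as in the benchmarks)

# Illustrative random weights; in the paper these are trained under the
# ISS constraints, so the values below are placeholders.
Wl_x = 0.1 * rng.normal(size=(NX, NX))
Wl_u = 0.1 * rng.normal(size=(NX, NU + 1))   # augmented input: [u, 1]
Wo_a = 0.1 * rng.normal(size=(NX, NN))
Wh_x = 0.1 * rng.normal(size=(NN, NX))
Wh_u = 0.1 * rng.normal(size=(NN, NU + 1))

def f_nn(x, u_aug):
    # Continuous-time state network (3a): linear part plus one hidden tanh layer.
    return Wl_x @ x + Wl_u @ u_aug + Wo_a @ np.tanh(Wh_x @ x + Wh_u @ u_aug)

def rknn_step(x, u, h):
    # Classical 4-stage explicit RK step; the augmented input [u, 1]
    # carries the bias terms.
    u_aug = np.append(u, 1.0)
    g1 = f_nn(x, u_aug)
    g2 = f_nn(x + h / 2 * g1, u_aug)
    g3 = f_nn(x + h / 2 * g2, u_aug)
    g4 = f_nn(x + h * g3, u_aug)
    return x + h / 6 * (g1 + 2 * g2 + 2 * g3 + g4)

x1 = rknn_step(np.zeros(NX), np.array([0.5]), h=0.01)
print(x1)
```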

Properties of continuous-time neural networks
Depending on the application, the presented model is capable of two configurations. In a (one-step-ahead) prediction configuration (e.g. Y. J. Wang & Lin, 1998), the state x is completely measured for all time steps. Each predicted output and state, y_{n+1} and x̂_{n+1}, is estimated based on the previously measured state and input, i.e. x_n and ũ_n. The estimated state x̂_n is not fed back into the state network. As a consequence, the one-step prediction mode leads to a fast training and enables parallel computation. However, training in one-step prediction reduces accuracy for long-term predictions, if the model does not even become unstable (without the presented constraints). If long-term estimation is required in the application, it cannot be neglected in the training. Therefore, in a (multi-step-ahead) simulation configuration (e.g. Deflorian, 2010), only the initial state measurement x_0 is required and all succeeding states are recursively taken from previous calculations. So, each successive simulated state x̂_{n+1} in the training depends on the previous state estimate x̂_n and on the input ũ_n. The feedback of estimated states leads to a potential accumulation of state errors and is therefore more demanding. The difference regarding estimation performance between simulation and prediction configuration is 2 orders of magnitude for both presented benchmarks (see Section 7).
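The data flow of the two configurations can be sketched with a toy linear one-step model standing in for the trained RKNN step (an assumption for illustration):

```python
import numpy as np

def step(x, u):
    # Toy stable one-step model; a stand-in for the RKNN update (assumption).
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    return A @ x + np.array([0.0, 0.1]) * u

N = 50
u = np.ones(N)

# Synthetic "measurements", generated here by the same toy model.
x_meas = np.zeros((N, 2))
for n in range(N - 1):
    x_meas[n + 1] = step(x_meas[n], u[n])

# Prediction configuration: each estimate restarts from the measured state x_n.
x_pred = np.array([step(x_meas[n], u[n]) for n in range(N - 1)])

# Simulation configuration: only x_0 is taken from measurements; all later
# states are fed back recursively from the model's own estimates.
x_sim = np.zeros((N, 2))
for n in range(N - 1):
    x_sim[n + 1] = step(x_sim[n], u[n])
```

With a perfect model both configurations coincide, as here; with model errors, the recursive feedback of the simulation configuration accumulates them, which is why training in simulation configuration is harder but matches long-term use.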
However, defining neural networks in continuous time and discretising by RK methods leads to several advantages:

(1) Continuous-time neural networks have a great analogy to the well-known state-space model, making it easy to incorporate a-priori knowledge. For example, the linear state-space model ẋ(t) = A_S x(t) + B_S u(t) is recovered by setting W^l_x = A_S, choosing W^l_u according to B_S and setting all remaining weights to zero.

(2) Discretising by explicit RK methods enables a simple integration of variable sample times. The sample time in training can differ from the sample time in the application by adjusting h. This enables a reduction of the data set in training and thereby of the computational effort. In this case, the presented stability constraints have to apply the maximum sample time. For neural networks in a discrete-time setting, a varying sample time is very difficult to handle, because the sample time is directly intertwined with the weights (e.g. Hu & Wang, 2002b). However, discrete-time networks are a special case of the presented algorithm: if a simple Euler discretisation and a sample time of h = 1 are selected, we obtain the standard discrete-time recurrent network x̂_{n+1} = x̂_n + f_NN(x̂_n, ũ_n). Covering this as a special case, the presented stability constraints in our contribution enhance the work of Hu and Wang (2002b) and enable full continuous-time advantages in a discrete-time formulation.
(3) Defining a model in continuous time reduces the parameter precision necessary for a good approximation. For instance, consider the PT2 transfer function G(s) = 100/(s^2 + 14s + 100). Selecting a sample time of 1 ms, the MATLAB command c2d calculates the corresponding z-transfer function G(z). For the same model, the number of significant parameter digits increases from 3 to 12. In theory, considering an Euler discretisation, a small sample time always leads to an increase of significant parameter digits. The more precision is required in the model parameters, the more complicated the optimisation problem usually becomes. In discrete-time models, even small parameter deviations accumulate over time to notable errors. Furthermore, discrete-time model stability tends to be more sensitive to parameter errors than continuous-time model stability. Assume a small parameter error in the aforementioned PT2 example, G̃(s) = 100/(s^2 + 14.7s + 100), and the same error in the discrete-time example G̃(z). In both cases, we applied a 5 % parameter error in the second term of the denominator. While G̃(s) remains stable, G̃(z) does not.
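The shrinking stability margin can be illustrated numerically. The sketch below uses the PT2 denominator from the example in companion form and, for simplicity, a forward-Euler discretisation instead of c2d (an assumption; the effect is the same in kind):

```python
import numpy as np

# PT2 example: G(s) = 100 / (s^2 + 14 s + 100), companion-form state matrix.
A = np.array([[0.0, 1.0], [-100.0, -14.0]])
h = 1e-3  # 1 ms sample time, as in the text

lam_c = np.linalg.eigvals(A)          # continuous-time poles, about -7 +/- 7.14j
A_d = np.eye(2) + h * A               # forward-Euler discretisation
lam_d = np.linalg.eigvals(A_d)        # discrete-time poles 1 + h * lambda

# Stability margins: distance to the imaginary axis vs. to the unit circle.
margin_c = -lam_c.real.max()          # about 7
margin_d = 1.0 - np.abs(lam_d).max()  # about 0.007 -- tiny for small h
print(margin_c, margin_d)
```

The continuous-time poles keep a distance of about 7 from the stability boundary, while the discrete-time poles sit within about 0.007 of the unit circle, so far smaller relative parameter errors suffice to destabilise the discrete-time model.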
The sample time is an important factor regarding the application of the RKNN. On the one hand, a large sample time leads to faster computation, as fewer points have to be estimated. On the other hand, a small sample time increases the stability margins. Naturally, the sample time should be small enough to satisfy the Nyquist–Shannon sampling theorem.
The stability aspect is reflected in the subsequent ISS Theorem 6.1 and GAS Theorem 6.2: both theorems clearly indicate when the sample time is so large that the RKNN is unstable. 2 This insight is not specific to neural networks: many linear dynamic systems become unstable if the sample time is too large (e.g. Figure 1). To ensure stability, it is worth considering a reduction of the sample time. We present two general methods, which are applicable in all physical domains and in both configurations (prediction and simulation) and do not require any hardware change.
(1) It is possible to include a zero-order-hold element in the measurements for several time steps in training and validation. The sample time can be reduced by holding the same input ũ_k and state x_k for several time steps, evaluating the RKNN on each reduced time step. It is possible to include various interpolation and extrapolation schemes, depending on the application. Unfortunately, reducing the sample time by interpolation increases the computational effort.
(2) Another possibility to reduce the RKNN sample time is time-domain normalisation. To the best of our knowledge, this idea has not been presented before. As depicted in Figure 5, a measurement (here of the Cascaded Tank benchmark) is obtained with a sample time of 20 s (black line). The dynamics of the real system are transferred to a faster system (red line), the RKNN identifies the normalised system, and the estimation of the RKNN is transferred back to the original time grid. For the implementation, the sample time h in the neural network graph (Figure 4) can be changed to the normalised sample time h_norm, while the time step index n still refers to the original time grid in the measurements. As a result, t_{n+1} = t_n + h_data holds on the measurement grid with sample time h_data, while the network internally advances by h_norm. The sample time can be changed accordingly from h to h_norm in (5a) and in the subsequent stability constraints (Theorems 6.1 and 6.2). Integration of state-space prior knowledge into the neural network weights is still possible. For instance, consider an autonomous, linear, discrete-time system with two measurements x_n and x_{n+1}. The discrete-time state matrix is denoted as A_D,S and the matrix exponential function is written as e^(·), so that x_{n+1} = A_D,S x_n = e^{h_data A_S} x_n. As a consequence, h_data A_S = h_norm A_norm,S must hold for a given data set, and we can set the neural weights to W^l_x = A_norm,S = (h_data/h_norm) A_S.
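The zero-order-hold refinement of method (1) can be sketched as follows; the one-step model `step_fn` stands in for the RKNN update, and the values h_data = 4 s and m = 10 (mirroring the Cascaded Tank setting in Section 7) are assumptions for illustration:

```python
import numpy as np

def refined_rollout(step_fn, x0, u_seq, h_data, m):
    # Reduce the effective sample time from h_data to h_data / m by holding
    # the input constant (zero-order hold) over m sub-steps per measurement.
    h = h_data / m
    x = np.asarray(x0, dtype=float)
    traj = [x]
    for u in u_seq:
        for _ in range(m):
            x = step_fn(x, u, h)
        traj.append(x)
    return np.array(traj)

def step_fn(x, u, h):
    # Toy explicit-Euler step of x_dot = -x + u; a stand-in for the RKNN update.
    return x + h * (-x + u)

# 4 s measurement grid refined to 0.4 s internally (illustrative values).
traj = refined_rollout(step_fn, [0.0], np.ones(20), h_data=4.0, m=10)
print(traj.shape, traj[-1])
```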

Preliminaries for stability
Consider a continuous-time system and a discrete-time system

ẋ(t) = f_C(x(t), u(t)),   (6)
x_{n+1} = f_D(x_n, u_n).   (7)
The main result of this work, an ISS criterion for continuous-time neural networks, builds upon previous works. To keep this contribution focused, we give only a brief reference to these remarkable works. The concept of ISS was introduced in Sontag (2008). It extends Lyapunov stability to non-autonomous systems, i.e. u(t) ≠ 0. Besides continuous systems, a discrete-time ISS criterion for (7) was developed in Jiang and Wang (2001). Additionally, it is known that if a system is ISS and the point of equilibrium is the origin, it is also 0-GAS (Khalil, 2002). For the sake of simplicity, we omit the time dependency of state and input variables, i.e. x = x(t) and u = u(t).
We assume that the continuous-time system (6) satisfies

⟨x, f_C(x, u)⟩_{P_1} ≤ −β ⟨x, x⟩ + γ ⟨u, u⟩   (8a)

with a positive definite matrix P_1 > 0, β > 0 and γ ≥ 0. The scalar product of two vectors v_1, v_2 is defined as ⟨v_1, v_2⟩_V = v_1^T V v_2. We apply the identity matrix in the scalar product if not mentioned otherwise.

Definition 5.1 ((Weakly B-ISS RK method (Deflorian & Rungger, 2014))): An RK method is called weakly B-input-to-state stable if, for each (6) satisfying (8a), the numerical approximation (5a) is ISS for all sufficiently small h.

The property is called weakly B-ISS because Definition 5.1 depends on (6) and on the sample time h. In Butcher's original terminology, B-stability is a property of the scheme (5a) which is independent of (6) and h. In Deflorian and Rungger (2014) and in the extended German version (Deflorian, 2011), we can find the following result (Theorem 5.2) with its sample time condition (9).
The coefficients k, l, m depend on the explicit RK method. Various (k, l, m)-algebraically stable RK methods are discussed in Cooper (1984); for example, the classical 4-stage RK method with the Butcher array given above is (k, l, m)-algebraically stable (see Cooper, 1984). To ensure B-ISS, we have two possibilities: (1) fix the sample interval h and choose the model weights so that (9) holds, or (2) choose the model weights so that ⟨x, f_C(x, u)⟩_{P_1} < 0 and then search for the maximum sample interval h so that (9) holds.
We concentrate on the first possibility and view the sample time as a fixed constant within one data set.

Existence and uniqueness of equilibrium
To discuss the GAS and the ISS of the RKNN (4a), the existence and uniqueness of the trajectories are prerequisites. A sufficient condition for this is that the Lipschitz condition (Khalil, 2002)

‖f_NN(x_1, ũ_1) − f_NN(x_2, ũ_2)‖ ≤ L_x ‖x_1 − x_2‖ + L_u ‖ũ_1 − ũ_2‖   (10)

is satisfied for any two states x_1, x_2 ∈ R^{N_X} and for any two augmented inputs ũ_1, ũ_2 ∈ l^{N_U+1}_∞, with Lipschitz constants L_x, L_u ≥ 0.
Since the activation function σ(·) is Lipschitz continuous and f_NN is a composition of linear maps and σ(·), we can estimate a Lipschitz constant of f_NN with respect to the state from the weight matrices. Analogously, we can show that the Lipschitz condition with respect to the augmented input is satisfied. Thus existence and uniqueness of the trajectories of the RKNN (4a) are ensured. It is a prerequisite that the activation function σ(·) is Lipschitz continuous so that (10) holds. Among other activation functions, this is the case for the TANH and ReLU functions.
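As a quick numerical check (illustrative only), both TANH and ReLU are Lipschitz continuous with constant 1, and the element-wise ratio σ(v_i)/v_i stays in [0, 1] — the property that the Lipschitz-matrix reformulation in the next section relies on:

```python
import numpy as np

tanh = np.tanh
relu = lambda v: np.maximum(v, 0.0)

# Even number of points, so v never hits exactly zero.
v = np.linspace(-5.0, 5.0, 1000)
ratios = {"tanh": tanh(v) / v, "relu": relu(v) / v}

# sigma(v) = L(v) v with a diagonal matrix of the element-wise ratios.
L = np.diag(ratios["tanh"][:4])
reconstructed = L @ v[:4]
print(ratios["tanh"].min(), ratios["tanh"].max())
```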

Constraints on the neural network weights
To derive the ISS constraints, we make sure that the point of equilibrium of the RKNN (4a) is the origin (Sontag, 2008, Section 2.8). The point of equilibrium is affected by the bias terms and could be transformed to the origin by means of a complicated nonlinear coordinate transform. As a more elegant solution, we connect the bias terms of the RKNN (4a) to the augmented input ũ. Defining the bias terms as an additional input dimension ensures that f_NN(0, 0) = 0 holds while retaining full nonlinear modelling capabilities.
To simplify the derivation of the ISS criterion, we reformulate the state network with a nonlinear Lipschitz matrix rather than nonlinear activation functions. Therefore, we use the condition that the activation function σ(·) is Lipschitz continuous and that σ(0) = 0 holds. This requirement is obligatory for the stability constraints. As a result, the element-wise ratio σ(v_i)/v_i is bounded by the Lipschitz constant, the activation function can be replaced by a diagonal Lipschitz matrix L(x, ũ) with the entries L_ii = σ(v_i)/v_i, and we get

ẋ(t) = W^l_x x(t) + W^l_u ũ(t) + W^o_a L(x, ũ) (W^h_x x(t) + W^h_u ũ(t)).   (12)

Note that (12) is only used to derive the stabilisation criterion. The prediction or simulation properties of the RKNN are not changed in any way. The feed-forward calculation and the training of the RKNN utilise (3a) in all cases. For the sake of simplicity, we omit the dependency of the Lipschitz matrix on the state and on the augmented input in the following, i.e. L = L(x, ũ). Using Theorem 5.2, we have to choose the weights such that (13) holds. Using this condition (13), we get the main result of the contribution.
Theorem 6.1 ((ISS of RKNN)): The RKNN (4a) using a (k, l, m)-algebraically stable RK method is ISS if there exist two positive definite matrices P_1, P_3 and weighting variables α_i, d_i such that (14a) holds. The ith row vector of the matrix W^h_x is R^Wh_i and the ith column vector of the matrix W^o_a is C^Wo_i. The matrices T_1 and T_{2,i} are defined along with condition (14a).

Proof: See Appendix C.
If we are only interested in GAS of the RKNN and not in ISS we can see from the proof (Appendix A, (A1) and (A2)) that all terms associated with P 3 vanish.

Theorem 6.2 ((GAS of RKNN)):
The RKNN (4a) using a (k, l, m)-algebraically stable RK method is GAS if there exists one positive definite matrix P_1 and weighting variables α_i, d_i such that (15a) holds.

Proof: See Appendix C.

Remark:
For a particular choice of α_i, d_i > 0, P_3 = 0 and P_1 a diagonal matrix, the constraints (14a) and (15a) reduce to the results presented in Hu and Wang (2002a). Therefore, Theorems 6.1 and 6.2 generalise the GAS result for continuous-time neural networks presented in Hu and Wang (2002a), as different network structures, various sample times, more options in the stability criterion and ISS can be ensured.

Experimental results
We demonstrate the algorithm on two real-world applications, a nonlinear Electro-Mechanical Positioning System (EMPS) (Janot et al., 2019) and a Cascaded Tank System (Schoukens et al., 2016). Both applications are publicly available benchmarks for nonlinear system identification.

Implementation for both benchmarks
We apply exactly the same neural network architecture, stability criterion, optimisation algorithm setup and evaluation metrics on both benchmarks, as explained in the following. The RKNN is employed as explained in Section 3 and depicted in Figure 4. We test independently the two activation functions, hyperbolic tangent (TANH) and rectified linear units (ReLU). In all cases, we set N_U = 1, N_Y = 1, N_X = 2 and N_N = 4; the output is taken directly from the state, so the output network requires no trainable weights. This sums up to 32 neural network weights. Only 32 parameters are sufficient, and consequently the proposed algorithm is computationally efficient. For illustration purposes, the complete neural network weights for one benchmark case are given in Appendix B.
The ISS stability criterion (14a) is implemented via nonlinear constraints in all cases. Using nonlinear constraints reduces the implementation effort significantly, even if an implementation as linear matrix inequalities (LMI) might be computationally more efficient. Note that only the constraints apply the Lipschitz relaxation (12), while the RKNN remains unchanged with ReLU or TANH activation functions (3a). Negative definiteness of the matrices is checked using Sylvester's criterion, as explained by Giorgio (2017). We fix P_3 = 0.0001 I and α_i = 1/N_N. Positive definiteness of the matrix P_1 is guaranteed by its implementation as P_1 = DD^T with a lower triangular matrix D with strictly positive diagonal entries. The non-zero elements of the matrix D are subject to optimisation. All presented results are ISS stable.
The proposed algorithm is implemented in MATLAB using the optimisation algorithm fmincon. We apply the interior-point method based on Byrd et al. (1999). To improve generalisation capabilities, we introduce a regularisation term on the weights in the cost function. Other than that, the cost function consists only of the estimation error on the training data, the mean squared error (MSE), for a set of N ∈ N measurements. The MSE in the prediction case depends on ŷ_RKNN,n = ŷ_RKNN(x_{n−1}, ũ_n) instead of ŷ_RKNN,n = ŷ_RKNN(x_0, ũ_n, ũ_{n−1}, …, ũ_0) for the simulation case. The root mean squared error (RMSE), the relative model error, the model fit and the R^2 index are applied as evaluation metrics in this manuscript, with the mean estimated output ŷ_RKNN,mean = (1/N) Σ_{n=1}^N ŷ_RKNN,n. The complete evaluation metrics are given in Appendix A. Without loss of generality, we scale the input and output of the neural network to approximately y ∈ [−1, 1] and ũ ∈ [−1, 1]. Therefore, we scale the measured output by 4 and the input by 1/150 for the EMPS benchmark. The Cascaded Tank benchmark is scaled by 1/10 for both input and output.
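The evaluation metrics can be sketched as follows (the paper's exact definitions are in Appendix A; the formulas below are the standard textbook forms and therefore an assumption):

```python
import numpy as np

def metrics(y, y_hat):
    # RMSE, relative model error, model fit and R^2 index; standard
    # definitions assumed (the paper's exact formulas are in Appendix A).
    e = y - y_hat
    rmse = np.sqrt(np.mean(e ** 2))
    rel_err = np.linalg.norm(e) / np.linalg.norm(y)
    fit = 100.0 * (1.0 - np.linalg.norm(e) / np.linalg.norm(y - y.mean()))
    r2 = 1.0 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)
    return rmse, rel_err, fit, r2

y = np.array([0.0, 1.0, 2.0, 3.0])
print(metrics(y, y))  # perfect model: zero errors, 100 % fit, R^2 = 1
```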
We do not post-process the measured data at all; we apply the raw, i.e. unfiltered, data. The measurements consist of two sets, one for training the neural network and one for validation. A full state measurement x_n is not available in the data sets. Therefore, the state is set equal to the output measurement and its derivative, x_n = [y_n, ẏ_n]^T. The derivative ẏ_n is calculated with first-order finite differences, ẏ_n ≈ (y_{n+1} − y_n)/h. The initial state for the simulation is obtained analogously, x_0 = [y_0, ẏ_0]^T.
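The state construction from output measurements can be sketched as below (dropping the final sample, for which no forward difference exists, is an implementation choice we assume):

```python
import numpy as np

def states_from_output(y, h):
    # Build x_n = [y_n, ydot_n]^T with first-order forward differences,
    # ydot_n ~ (y_{n+1} - y_n) / h; the last sample is dropped since no
    # forward difference exists for it (an implementation choice).
    y = np.asarray(y, dtype=float)
    ydot = (y[1:] - y[:-1]) / h
    return np.column_stack([y[:-1], ydot])

x = states_from_output([0.0, 0.1, 0.3, 0.6], h=0.1)
print(x)
```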
We evaluate two configurations, (one-step-ahead) prediction and (multi-step-ahead) simulation. Prediction employs the last state measurement as input and predicts the next one. In simulation, the estimation horizon is equal to the complete measurement time. The RKNN output is computed based only on the initial state and the input over time. This leads to a free-running simulation, without any measurement feedback, for 25,000 time steps in the EMPS benchmark. Full black-box identification is applied in all cases, so no model knowledge is incorporated in the RKNN at all. The succeeding Cascaded Tank model (16) is only stated to explain the benchmark.

Electro-mechanical positioning system
The EMPS is a standard component for robots or machine tools and is pictured in Figure 6. It is particularly difficult to identify, since it suffers from nonlinear, asymmetric friction. The input is the impulse force and the output is the position of the car.
The measurements and results for both configurations and for both activation functions are presented in Figure 7. The figures on the left correspond to the training data and the figures on the right correspond to the test data.

Figure 7. Results on the EMPS benchmark, for training and test data, both activation functions and both configurations. All results are ISS stable. The training case is displayed on the left and the test scenario on the right. The y-axis description of each row is identical and the x-axis description in all cases is the time in seconds. Black lines correspond to measurements, red lines to RKNN with TANH and blue lines to RKNN with ReLU. The first two rows refer to the simulation configuration and the third and fourth rows to the prediction configuration. The fifth and sixth rows display the force input signal. The figure shows that the RKNN can satisfactorily simulate (first row) and predict (third row) the benchmark. As almost no difference between the measurements y_n and the model output ŷ_RKNN,n can be observed without magnification, the second and fourth rows represent the output error e_n = y_n − ŷ_RKNN,n.

The first two rows represent the output in the simulation configuration: black corresponds to the measured position, red to the simulated RKNN output using TANH, and blue to the simulated RKNN output utilising ReLU activation functions. As the deviation between measurement and estimation is not visible without zoom, the second row corresponds to the position error signal, e_n = y_n − ŷ_RKNN,n. The third and fourth rows present the results for the EMPS in prediction configuration and are displayed in the same form as the simulation case in the first two rows. The fifth row depicts the complete force input signal and the sixth row presents an exemplary zoom on the force input for a given time interval.
As expected, the prediction configuration outperforms the simulation configuration by 2 orders of magnitude. For both training and test data, the RKNN achieves a similar performance, demonstrating sufficient generalisation capabilities. In prediction configuration, the ReLU activation function outperforms the RKNN with TANH activation. The results for the EMPS are computed on the original time grid, without interpolation or time-domain normalisation.
Only a few other black-box identification algorithms have been published for this benchmark so far. Forgione and Piga (2021b) develop a neural network called dynoNet with linear dynamic operators as elementary building blocks; instead of activation functions, a transfer function is applied in each neuron. 3 Janot et al. (2019) identify a best linear fit with the CAPTAIN and the CONTSID toolboxes. 4 Most importantly, Forgione and Piga (2021a) identified continuous-time neural networks for the EMPS and the Cascaded Tank benchmark in simulation configuration. Fortunately, Forgione's paper was published while this work was still in the review process, as it provides a direct comparison to RKNN. Forgione discusses various integration methods including RK and enhances the system identification with a soft-constrained integration method (SCI) and a truncated simulation error minimisation (TSEM). Regarding the EMPS benchmark, Forgione incorporated some physical knowledge into the neural network structure. 5 A comparison of the results is presented in Table 1. Table 1 reveals that the continuous-time methods (TSEM, SCI and RKNN) improve the RMSE compared to the best linear fit by a factor of 20. In prediction configuration, we are pleased to report the first results on the micrometre scale, although no black-box comparison has been published yet. Hopefully, more black-box methods will consider this benchmark in the future. Note, however, that excellent results on this application can be achieved using hand-crafted grey-box approaches (Chaves Ferreira Pinto & Hultmann Ayala, 2020; Janot et al., 2019, 2017).

Cascaded tanks
Schoukens et al. (2016) wrote that 'the cascaded tanks system is a fluid level control system consisting of two tanks with free outlets fed by a pump. The input signal controls a water pump that pumps the water from a reservoir into the upper water tank. The water of the upper water tank flows through a small opening into the lower water tank, and finally through a small opening from the lower water tank back into the reservoir'. The three tank setup is depicted in Figure 8.

The system exhibits weakly nonlinear behaviour during normal operation and, depending on the input, hard saturation effects when the tank overflows. As presented by Schoukens et al. (2016), the continuous-time dynamics of the tanks without the overflow effect are given by
$$\dot{x}_1(t) = -k_1 \sqrt{x_1(t)} + k_4 u(t) + w_1(t),$$
$$\dot{x}_2(t) = k_2 \sqrt{x_1(t)} - k_3 \sqrt{x_2(t)} + w_2(t),$$
$$y(t) = x_2(t) + e(t),$$
with scalar damping coefficients k_1, k_2, k_3, k_4, noise w_1(t), w_2(t), e(t) and states, input and output as defined before.
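As a sketch of these dynamics, the noise-free two-state model can be simulated with an explicit Runge-Kutta scheme. The damping coefficients below are illustrative placeholders, not identified values, and the helper names are our own:

```python
import numpy as np

def tank_dynamics(x, u, k1=0.5, k2=0.4, k3=0.2, k4=1.0):
    """Noise-free cascaded-tank dynamics; coefficients are illustrative."""
    x1, x2 = np.maximum(x, 0.0)  # water levels cannot become negative
    dx1 = -k1 * np.sqrt(x1) + k4 * u           # upper tank: pump in, outlet out
    dx2 = k2 * np.sqrt(x1) - k3 * np.sqrt(x2)  # lower tank: inflow from above
    return np.array([dx1, dx2])

def rk4_step(f, x, u, h):
    """One explicit fourth-order Runge-Kutta step with zero-order-hold input."""
    s1 = f(x, u)
    s2 = f(x + 0.5 * h * s1, u)
    s3 = f(x + 0.5 * h * s2, u)
    s4 = f(x + h * s3, u)
    return x + (h / 6.0) * (s1 + 2 * s2 + 2 * s3 + s4)

# Simulate with a constant pump input u = 1; the output is y = x2.
h = 0.4  # sample time in seconds
x = np.array([0.0, 0.0])
for _ in range(1000):
    x = rk4_step(tank_dynamics, x, u=1.0, h=h)
print(x)  # converges to the steady state x1 = 4, x2 = 16 for these coefficients
```

The steady state follows from setting both derivatives to zero: k_1 √x_1 = k_4 u and k_2 √x_1 = k_3 √x_2.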
As the benchmark exhibits a large sample time, we test both time-domain interpolation (RKNN with INT, Figure 9) and time-domain normalisation (RKNN with TDN, Figure 10). Regarding the interpolation, a zero-order-hold element for several time steps is included in training and validation. As a consequence, the sample time can be reduced from 4000 ms to 400 ms by holding the same input ũ_n and state x_n for 10 time steps. For better comparability, we chose zero-order hold instead of advanced interpolation schemes to avoid augmenting the data, as some methods in the literature struggle with an identification using only 1024 data points. Consequently, the simulation horizon increases to 10240 time steps. Regarding the time-domain normalisation, the normalised sample time is set to h_norm = 40 ms and 1024 data points are employed. The results are evaluated on the original time grid with N = 1024 and h_data = 4000 ms in both cases.

Figure 9. Results on the Cascaded Tank benchmark, using interpolation, for training data (left) and test data (right), both activation functions and both configurations. All results are ISS stable. The y-axis description of each row is identical and the x-axis in all cases is the time in seconds; the y-axis scaling is the same as in Figure 10. Black lines correspond to measurements, red lines to RKNN with TANH and blue lines to RKNN with ReLU. The first two rows refer to the simulation configuration, the third and fourth rows to the prediction configuration, and the fifth and sixth rows display the input signal. The figure shows that the RKNN can satisfactorily simulate (first row) and predict (third row) the benchmark. To display the difference between the measurements y_n and the estimation ŷ_RKNN,n in more detail, the second and fourth rows represent the output error e_n = y_n − ŷ_RKNN,n.

Figure 10. Results on the Cascaded Tank benchmark, using time-domain normalisation, for training data (left) and test data (right), both activation functions and both configurations. All results are ISS stable. The axis descriptions, y-axis scaling (identical to Figure 9), colour coding and row layout are the same as in Figure 9.
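The zero-order-hold refinement described above can be sketched as follows; `zoh_refine` is a hypothetical helper name, and the numbers merely mirror the 4000 ms to 400 ms refinement by a factor of 10:

```python
import numpy as np

def zoh_refine(u, h_data, factor):
    """Refine a coarsely sampled signal by zero-order hold.

    Each of the original samples is repeated `factor` times, so the
    effective sample time drops from h_data to h_data / factor while
    the signal stays piecewise constant (no new data is invented).
    """
    u_fine = np.repeat(u, factor)
    h_fine = h_data / factor
    return u_fine, h_fine

u = np.array([1.0, 3.0, 2.0])           # three coarse samples, h = 4.0 s
u_fine, h_fine = zoh_refine(u, 4.0, 10)
print(len(u_fine), h_fine)              # 30 samples at 0.4 s
```

Applied to the benchmark's 1024 input samples, this yields the simulation horizon of 10240 time steps mentioned above.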
In the literature, many nonlinear methods have addressed the Cascaded Tank benchmark. We start with the approaches that had delivered the best results in prediction configuration at the time of this publication. Belz et al. (2017) apply local model networks with external dynamics to the benchmark problem, represented by three different approaches: NARX, NFIR and NOBF components. Aljamaan et al. (2021) develop a Hammerstein Box-Jenkins model for the identification benchmark. Worden et al. (2018) present a record of participants of the nonlinear system identification workshop considering this benchmark among others; white-, grey- and black-box approaches are discussed, and evolutionary algorithms are applied for system identification in both prediction and simulation configuration. 6 Hostettler et al. (2018) propose a Gaussian-process-based nonlinear, time-varying drift model for stochastic differential equations. Results are compared in Table 2 and the complete evaluation metrics are given in Appendix A. To the best of our knowledge, the RKNN with time-domain normalisation and provable ISS stability outperforms all published methods by a factor of 10.
Figures 9 and 10, as well as Table 2, present an unexpectedly good performance in prediction configuration. From physical insight, as both tanks are independent, the two states should be independent, too. Therefore, we expected the use of the output derivative as a second state to be problematic. The RKNN in prediction configuration handled this challenge successfully.
Many nonlinear methods have addressed the Cascaded Tank benchmark in the more demanding simulation configuration. We cannot refer to all approaches, so we discuss the ones with the best results regarding RMSE in simulation configuration. Svensson and Schön (2017) wrote that it is 'surprisingly hard to perform better than linear models in this problem, perhaps because of the close-to-linear dynamics in most regimes, in combination with the non-smooth overflow events'. The overflows occur only in a few time steps but completely change the dynamical behaviour of the system. Svensson develops a flexible, nonlinear state-space model with a Gaussian-process-inspired prior. As discussed in Section 7.2, Forgione and Piga (2021a) develop the SCI and TSEM methods; closely related is the work of Mavkov et al. (2020), which applies continuous-time neural networks (INN) with the Euler integration scheme. The best linear approximation is given in Relan et al. (2017). Furthermore, Relan et al. (2017) design an unstructured flexible nonlinear state-space model with different initialisation schemes (nonlinear SS with NLSS2 initialisation). Birpoutsoukis et al. (2018) develop a truncated Volterra series model for the benchmark and extend its regularisation method. All methods are compared in Table 3. All in all, the RKNN with ISS provides a competitive RMSE in simulation and prediction configuration on this benchmark. In particular, the direct comparison with the methods using continuous-time neural networks (TSEM, SCI and INN) shows a similar performance. Table 3 demonstrates that continuous-time neural networks (TSEM, SCI, INN and RKNN) are competitive with approaches beyond neural networks. TSEM provides the best RMSE on this benchmark in simulation. It should be pointed out that none of the discussed works considered ISS stability. Therefore, we conclude that the ISS stability constraints are not too conservative.
Note that many more nonlinear methods have been applied to this benchmark; we only referred to those leading the field. To the best of our knowledge, the RKNN with ISS provides the best published results for both benchmarks in prediction configuration, improving the accuracy by a factor of 10. Furthermore, TSEM, a similar approach also using continuous-time neural networks, leads the Cascaded Tank benchmark in simulation configuration.

Conclusion
In this contribution, input-to-state stability (ISS) of continuous-time neural networks discretised by explicit Runge-Kutta methods (RKNN) is investigated. The ISS property for RKNN extends the stability results from control theory to neural networks. Especially for industrial applications, it is important to prove that all predicted states remain bounded in every potential use case and under all possible circumstances. The drawback is that constraints always narrow the solution space. Nevertheless, RKNN keeps full nonlinear black-box identification capabilities. Furthermore, the presented stability guarantee already holds during training. The constraints rely only on the weights of the neural network and do not depend on any measurement or any handcrafted physical insight into the process.
Future research is required to combine the ISS-stable neural network with control, i.e. to include the RKNN in a model-based reinforcement learning (RL) algorithm or in a model predictive controller (MPC). First steps towards combining control with RKNN have been taken by Weigand et al. (2020) for model-based RL and by Uçak (2020) for MPC. Another promising direction is a further analysis of time-domain normalisation (TDN), i.e. developing auto-tuning mechanisms for TDN.

Acknowledgments
We thank Vassilios Yfantis, Patrick Bachmann, Achim Wagner and the anonymous reviewers for providing substantial comments on the manuscript.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This research was supported by the project RACKET under grant 01IW20009 by the German Federal Ministry of Education and Research.

Notes
1. We refer to a variable sample time between data sets, i.e. between training and application, and not within one data set.
2. Both theorems guarantee stability even for a large sample time. In this case, the constraints force the network to reduce its dynamics until the simulation of the RKNN remains close to the origin for all time steps.
3. Direct communication with the authors revealed that the RMSE might have a unit error. We encourage reading the paper.
4. The paper only states the relative error e_rel,sim = 74%. We estimated the corresponding RMSE in Table 1 as $e_{\text{RMSE}} = (e_{\text{rel}} \, \|y_{\text{meas}}\|_2)/(\sqrt{N} \cdot 100)$.
5. The results for the EMPS benchmark originate from direct contact with the authors, who report R² = 0.9967 and e_RMSE = 4.7253 mm for the TSEM method, and R² = 0.9935 and e_RMSE = 6.646 mm for the SCI method. Results for the Cascaded Tank benchmark are given in the paper.
6. The paper only states the best normalised mean squared errors, e_NMSE,pred = 0.0565 V and e_NMSE,sim = 4.6174 V. We estimated the corresponding RMSE in Tables 2 and 3.
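The RMSE conversion in note 4 can be sketched as follows; the function name and sample data are illustrative, and the formula follows our reading of the note:

```python
import numpy as np

def rmse_from_relative(e_rel_percent, y_meas):
    """RMSE implied by a relative error given in percent:
    e_RMSE = e_rel * ||y_meas||_2 / (sqrt(N) * 100)."""
    n = len(y_meas)
    return e_rel_percent * np.linalg.norm(y_meas) / (np.sqrt(n) * 100.0)

y = np.array([1.0, -2.0, 2.0])  # hypothetical measurement record
print(rmse_from_relative(50.0, y))  # half of the RMS value of y
```

Since ‖y_meas‖_2/√N is the RMS value of the measurement record, a relative error of e_rel percent maps to that fraction of the signal's RMS value.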

Appendix C. Proof of Theorem
The idea of the proof is to find an estimation for (14), such that the linear matrix inequalities (LMI) depend only on the network weights and fixed parameters. First, we estimate the terms X and U. Then, we eliminate L. For the sake of simplicity, we set $X = W^l_x + W^o L W^h_x$ and $U = W^l_u + W^o L W^h_u$ to define $f_{NN}(x, \tilde{u}) = Xx + U\tilde{u}$. We calculate the left-hand side of (14) as
$$\langle x, f_{NN}(x,\tilde{u})\rangle_{P_1} + hm\,\langle f_{NN}(x,\tilde{u}), f_{NN}(x,\tilde{u})\rangle_{P_1} = x^T P_1 (Xx + U\tilde{u}) + hm\,(Xx + U\tilde{u})^T P_1 (Xx + U\tilde{u})$$
$$= x^T (P_1 X + hm\,X^T P_1 X)\,x + x^T (P_1 U + 2hm\,X^T P_1 U)\,\tilde{u} + \tilde{u}^T (hm\,U^T P_1 U)\,\tilde{u}. \quad (\text{A}1)$$
To estimate the term $x^T (P_1 U + 2hm\,X^T P_1 U)\,\tilde{u}$, we use the general relation $2 z_1^T z_2 \le z_1^T P_2 z_1 + z_2^T P_2^{-1} z_2$, which is valid for any vectors $z_1, z_2 \in \mathbb{R}^{N_X}$ and any positive definite matrix $P_2 > 0$, $P_2 \in \mathbb{R}^{N_X \times N_X}$ (Horn & Johnson, 2012: Theorems 7.7.6 and 7.7.7).
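The two algebraic facts used here, the quadratic expansion (A1) and the bound $2 z_1^T z_2 \le z_1^T P_2 z_1 + z_2^T P_2^{-1} z_2$, can be sanity-checked numerically. The dimensions and random matrices below are purely illustrative, not the paper's actual weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u, hm = 4, 2, 0.1

def spd(n):
    """Random symmetric positive definite matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

P1, P2 = spd(n_x), spd(n_x)
X = rng.standard_normal((n_x, n_x))
U = rng.standard_normal((n_x, n_u))
x = rng.standard_normal(n_x)
u = rng.standard_normal(n_u)

# Quadratic expansion (A1): both sides agree because P1 is symmetric.
f = X @ x + U @ u
lhs = x @ P1 @ f + hm * (f @ P1 @ f)
rhs = (x @ (P1 @ X + hm * X.T @ P1 @ X) @ x
       + x @ (P1 @ U + 2 * hm * X.T @ P1 @ U) @ u
       + hm * (u @ U.T @ P1 @ U @ u))
assert np.isclose(lhs, rhs)

# Bounding relation 2 z1' z2 <= z1' P2 z1 + z2' inv(P2) z2, which
# follows from || P2^{1/2} z1 - P2^{-1/2} z2 ||^2 >= 0.
P2_inv = np.linalg.inv(P2)
for _ in range(1000):
    z1 = rng.standard_normal(n_x)
    z2 = rng.standard_normal(n_x)
    assert 2 * z1 @ z2 <= z1 @ P2 @ z1 + z2 @ P2_inv @ z2 + 1e-9
print("both identities verified on random samples")
```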
Equation (A4) always holds, as the norm is by definition greater than or equal to zero. To ensure that the terms in (A3) associated with the state x are strictly less than zero, we define $Z_1 = \frac{5}{4} P_2 + P_1 X + hm\,X^T P_1 X$.
We resubstitute X and calculate (A5). Next, we eliminate L in two steps. We know that $L \le 1$ holds and estimate the positive definite term accordingly in (A7). Using (A7) in (A6), we get (A8). Substitution of (A8) in (A5) leads to an inequality with $T_1 = P_3 + P_1 W^l_x + hm\,(W^{l\,T}_x P_1 W^l_x + Q)$. Then, to estimate the other terms independently of L, we follow the approach in Hu and Wang (2002a). Introducing the weighting variables $\lambda_i$ with $\sum_{i=1}^{N_N} \lambda_i = 1$ and $0 \le \lambda_i \le 1$ for all $i \in \{1, \ldots, N_N\}$, we can reformulate $T_1$. Incorporating (A9) into the sum leads to an expression with $T_{2,i} = (hm\,W^{l\,T}_x + I)\, P_1 C^{W_o}_i R^{W_h}_i + hm\,(C^{W_o}_i R^{W_h}_i)^T P_1 W^l_x$. By the definition of the weighting variables $\lambda_i$ and the fact that $0 \le l_i \le 1$ holds, we ensure that $1 - \sum_{i=1}^{N_N} l_i \lambda_i \ge 0$ holds, and the proof is completed.