Two-step reinforcement learning for model-free redesign of nonlinear optimal regulator

In many practical control applications, the performance level of a closed-loop system degrades over time due to changes in the plant characteristics. Thus, there is a strong need for redesigning a controller without going through the system modelling process, which is often difficult for closed-loop systems. Reinforcement learning (RL) is one of the promising approaches that enable model-free redesign of optimal controllers for nonlinear dynamical systems based only on measurements of the closed-loop system. However, the learning process of RL usually requires a considerable number of trial-and-error experiments using a poorly controlled system, which may accumulate wear on the plant. To overcome this limitation, we propose a model-free two-step design approach that improves the transient learning performance of RL in an optimal regulator redesign problem for unknown nonlinear systems. Specifically, we first design a linear control law that attains some degree of control performance in a model-free manner, and then train the nonlinear optimal control law with online RL while using the designed linear control law in parallel. We introduce an offline RL algorithm for the design of the linear control law and theoretically guarantee its convergence to the LQR controller under mild assumptions. Numerical simulations show that the proposed approach improves the transient learning performance and the efficiency of hyperparameter tuning of RL.


INTRODUCTION
In many practical control applications, a reliable mathematical model of the controlled plant is not available since system modeling is often difficult. A common example of such situations arises when the plant is already in use in a closed-loop system for daily operations and is hard to suspend despite a degraded performance level due to changes in the plant characteristics. This leads to a strong need for redesigning or tuning the controller based only on the measurement of the closed-loop system in a model-free manner. Until now, model-free controller design methods have been developed using various approaches for linear and nonlinear systems. Examples include data-driven control (DDC) [1], iterative learning control (ILC), model-free adaptive control (MFAC) [2], iterative feedback tuning (IFT) [3], virtual reference feedback tuning (VRFT) [4], and fictitious reference iterative tuning (FRIT) [5]. Among them, reinforcement learning (RL) [6][7][8] has been actively studied in recent years as a versatile approach to tackle optimal control problems for a large class of nonlinear systems. An attractive feature of RL-based controller design is that the optimal control law is explored in a fully autonomous fashion through a trial-and-error process [9][10][11][12][13][14][15].
One of the major issues of the RL-based design is that the system usually needs to undergo many trial-and-error experiments using the poorly designed control law during the learning process, which may accumulate wear on the plant and potentially shorten the lifetime of the system. Thus, it is desirable to develop a learning method that (i) maintains the performance level of the closed-loop system to some extent during the learning process, and (ii) reduces the number of necessary trials. Several studies have tackled these problems by utilizing LQR controllers designed based on the mathematical model of the controlled object. For example, [16] proposed to switch between a local LQR controller and an RL controller depending on whether the current state of the system is inside the estimated controllable set, while [17] proposed to use the RL and LQR controllers at the same time in parallel to improve the transient learning performance. In addition, [18] used RL to learn a nonlinear controller operated with the LQR controller, in such a way that the derivative of the nonlinear controller with respect to the state becomes zero at the origin to ensure local stability. Furthermore, methods that combine RL and model predictive control (MPC) were proposed in [19,20] to tackle the safety, stability, and trial amount (sample efficiency) issues in the exploration phase of RL.
However, these methods require some form of the plant model, and thus, their application to the closed-loop controller redesign problem is limited. This motivates us to further develop a model-free approach for assisting the learning process of RL.
In this paper, we propose a completely model-free approach to design an optimal control law, especially a nonlinear optimal regulator, for nonlinear systems while improving the transient learning performance and efficiency of RL. Specifically, we consider a situation where a closed-loop system with a stabilizing linear controller has an unsatisfactory performance level and requires redesign of the control law without knowing the mathematical model of the plant. The proposed approach consists of two steps. In the first step, we measure the input-output response of the existing closed-loop system and design a quasi-optimal linear quadratic regulator (LQR) in an offline and model-free manner, which achieves a certain degree of performance to assist the learning process of RL. Then, in the second step, we use an online RL method to design a nonlinear control law that is connected, in parallel, to the pre-designed linear control law. As a result of this two-step design approach, the designed control law achieves a performance that cannot be realized by the linear control law alone. The proposed approach has two main advantages for its practical use: (i) it can be applied even if the plant model is unavailable, and (ii) it can reduce wear on the actual plant during the learning process of RL because the performance level can be improved by the quasi-optimal LQR controller, and the efficiency in hyperparameter tuning is improved.
The organization of this paper is as follows. In Section 2, we address the problem formulation. In Section 3, we describe the proposed approach. In Section 4, we introduce the algorithm for Step 1 of the proposed approach. Then, in Section 5, the effectiveness of the proposed approach is verified by numerical simulations of an inverted pendulum with input saturation as the plant. Section 6 illustrates the advantage of the proposed approach that hyperparameters can be efficiently tuned. Finally, in Section 7, we give concluding remarks of this paper.

PROBLEM FORMULATION
In this section, we describe the problem formulation. We consider a discrete-time nonlinear system described by

x_{k+1} = f(x_k, u_k),    (1)

where x_k ∈ R^n and u_k ∈ R^m are the state and input at time k, and f : R^n × R^m → R^n is a nonlinear function. The equilibrium point of interest is at the origin, and f is smooth at that point. The state x_k can be measured directly.
We define A ∈ R^{n×n} and B ∈ R^{n×m} as the Jacobian matrices of f with respect to x and u evaluated at the origin,

A = ∂f/∂x (0, 0),  B = ∂f/∂u (0, 0).    (2)

Using A and B, the linear approximation of the system dynamics near the origin is given by

x_{k+1} = A x_k + B u_k.    (3)

In what follows, we consider the case where the nonlinear plant (1) is already in operation with a locally stabilizing linear control law K_init, but the performance of the closed-loop system has room for improvement, and the plant model is unknown. Then, our goal is to redesign the control law (policy) for better performance, but without explicitly identifying the nonlinear function f and its associated Jacobian matrices A and B.
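Although the design in this paper is model-free and never forms A and B explicitly, the definition above can be illustrated numerically. The sketch below estimates the Jacobians of a hypothetical map f by central differences; the map, dimensions, and step size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def jacobians_at_origin(f, n, m, eps=1e-6):
    """Estimate A = df/dx(0,0) and B = df/du(0,0) by central differences."""
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        A[:, i] = (f(e, np.zeros(m)) - f(-e, np.zeros(m))) / (2 * eps)
    for i in range(m):
        e = np.zeros(m); e[i] = eps
        B[:, i] = (f(np.zeros(n), e) - f(np.zeros(n), -e)) / (2 * eps)
    return A, B

# Hypothetical nonlinear map x_{k+1} = f(x_k, u_k), for illustration only.
f = lambda x, u: np.array([x[0] + 0.01 * x[1],
                           x[1] + 0.01 * (np.sin(x[0]) + u[0])])
A, B = jacobians_at_origin(f, n=2, m=1)
```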
This situation can be more formally stated as follows.
Assumption 2.1. A linear control law K_init ∈ R^{m×n} such that A + BK_init is Schur stable is given.
With this assumption, we consider the following nonlinear optimal regulator design problem.
Problem 2.2. Consider the nonlinear system (1). Suppose Assumption 2.1 holds, and the plant model of (1) is unknown. Let

J = Σ_{k=0}^{∞} (x_k^⊤ Q x_k + u_k^⊤ R u_k),    (4)

where Q ∈ R^{n×n} and R ∈ R^{m×m} are given positive semi-definite and positive definite symmetric matrices, respectively. Design a control law that minimizes the cost function (4).
Note that the solution of Problem 2.2 (i.e., the optimal regulator) is nonlinear, because the system dynamics are nonlinear.

PROPOSED APPROACH: MODEL-FREE TWO-STEP DESIGN OF CONTROL LAW
We propose a model-free two-step approach to design the optimal control law. The structure of the control law we use in the proposed approach is shown in Fig. 1, where K_AC is a linear auxiliary control (AC) law that assists the learning process of a nonlinear control law µ : R^n × W → R^m with an adjustable parameter W ∈ W (W: a set of parameters). The proposed approach is to design the parallel control laws K_AC and µ by the following two-step procedure.

Two-step design procedure

Step 1: Design the auxiliary control law K_AC ∈ R^{m×n} for the linear quadratic regulator (LQR) problem using an offline RL method.
Step 2: Design the nonlinear control law µ (adjust the parameter W ∈ W) by an online RL method, where we regard the closed-loop system consisting of the plant and the auxiliary control law K AC as a single environment (plant for online RL shown as a gray box in Fig. 1).

The linear auxiliary control law K AC designed in Step 1 makes the cost lower than the initial control law K init while it is only quasi-optimal due to the nonlinearity of the plant.This new linear control law contributes to reducing wear on the plant in the learning process of the online RL in Step 2. This is because (i) the auxiliary control law improves the transient learning performance of the online RL by attaining a certain level of performance at the early stage of learning, and moreover, (ii) it facilitates the hyperparameter tuning of online RL by avoiding the application of large inputs from RL.It should be noted that the damage to the plant by Step 1 can be considered to be minimal since the linear control law designed in Step 1 is obtained by an offline algorithm using a single set of input-output data collected with the pre-existing control law K init .
Then, in Step 2, the nonlinear control law µ is designed by the trial-and-error process assisted by the auxiliary control law K_AC. Specifically, the control input u_k is generated by

u_k = u_k^AC + u_k^RL,    (5)

where u_k^AC = K_AC x_k, and u_k^RL is the control input obtained by an online RL method based on µ. The nonlinear control law µ makes it possible to further lower the value of the cost function and attains a performance level that linear controllers cannot achieve. Although this improvement requires an additional cost for the learning of the control law µ, the effect of the improvement is significant in the long run since the designed control law is repeatedly used in the actual operation of the plant.
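As a minimal illustration of this parallel structure, the sketch below combines an auxiliary linear law with a learned nonlinear term. The gain value and the linear-in-parameters form of µ are assumptions made for the example; the actual µ is the Actor-Critic policy described in Section 5.

```python
import numpy as np

def parallel_control(x, K_AC, mu, W):
    """u_k = u_k^AC + u_k^RL: auxiliary linear law plus learned nonlinear law."""
    u_ac = K_AC @ x        # auxiliary control input
    u_rl = mu(x, W)        # input from the RL policy with parameter W
    return u_ac + u_rl

# Hypothetical values for illustration only.
K_AC = np.array([[-2.77, -0.48]])
mu = lambda x, W: W @ x            # placeholder policy
W0 = np.zeros((1, 2))              # zero before learning starts
u = parallel_control(np.array([0.1, -0.2]), K_AC, mu, W0)
# With W = 0, the combined law reduces to the auxiliary controller alone.
```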
The underlying spirit of the two-step design approach is similar to residual learning [21,22], which is a widely used approach in machine learning, but the proposed approach is specifically tailored for redesigning an optimal regulator for dynamical systems. In the two-step design procedure, the offline and the online RL methods for the design of the linear and nonlinear control laws can be freely chosen by users. Nevertheless, in the next section, we show a specific instance of the design method of the linear auxiliary control law in Step 1. Then, in Section 5, we demonstrate how the linear auxiliary control law can be combined with an online RL method to train the nonlinear controller at a relatively small cost.

OFFLINE REINFORCEMENT LEARNING FOR DESIGNING LINEAR AUXILIARY CONTROL LAW
The algorithm for offline RL in Step 1 is given in Algorithm 1. This algorithm is derived based on Hewer's iterative method for the discrete-time algebraic Riccati equation [23].
The algorithm executes Hewer's method by using collected input-output data instead of the model of the plant, i.e., the matrices A and B in (3), and finds the LQR controller for the linearized system. First, as shown in the upper part of Fig. 1, we apply an exploration term ν_k to the closed-loop system locally stabilized by the control law K_init, and record the input u_k and output x_k. We denote the collected time series data by {u_k}_{k=k_s}^{k_s+l−1} and {x_k}_{k=k_s}^{k_s+l}, where the non-negative integer k_s is the arbitrarily determined start time of data collection, and k_s + l is the final time of the input-output data. Then, the auxiliary control law K_AC, which assists the learning process of the online RL, is obtained by the iteration of (I) and (II) in Algorithm 1, where h_j ∈ R^l and F_j ∈ R^{l×(n²+nm+m²)} are defined by (6) and (7).

Algorithm 1 Offline RL Algorithm for the Discrete-Time LQR Problem.
Data Collection. Apply u_k = K_init x_k + ν_k to the plant to collect data for k = k_s, k_s + 1, k_s + 2, . . . , k_s + l, where ν_k ∈ R^m is an exploration term.
Initialization. Set the iteration number j = 0 and the linear control law K_0 = K_init.
Policy Evaluation and Improvement. Based on the collected data, perform the following iterations for j = 0, 1, . . . :
(I) Obtain a least-squares solution of equation (12), where h_j is given by (6) and F_j by (7).
(II) Update the linear control law by equation (13).
Repetition and Termination. Repeat the policy evaluation and improvement with j ← j + 1 until (14) is satisfied for a small positive scalar ε.
The entries of h_j and F_j are given in (9)–(11). The symbols ⊗ in (9) to (11) and vec in (12) in Algorithm 1 denote the Kronecker product and the vec operator, respectively [24, §4].
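The vec/Kronecker identity underlying these definitions, vec(NML) = (L^⊤ ⊗ N) vec(M) with column-stacking vec [24, §4.3.1], can be checked numerically; the matrix sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.standard_normal((3, 3))
M = rng.standard_normal((3, 2))
L = rng.standard_normal((2, 4))

vec = lambda X: X.flatten(order="F")   # column-stacking vec operator

lhs = vec(N @ M @ L)
rhs = np.kron(L.T, N) @ vec(M)
assert np.allclose(lhs, rhs)           # the identity used in the derivation
```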
As shown in Theorem 4.1 below, the convergence of K_j in Algorithm 1 to the LQR solution is guaranteed when the column rank of F_j equals n² + nm + m² (i.e., F_j has full column rank) for all j = 0, 1, . . . , in which case the solution of the linear equation (12) is uniquely determined. This condition can be satisfied by choosing l such that l ≥ n² + nm + m² and adding an appropriate exploration term that sufficiently excites the system.
Theorem 4.1. Consider a linear system whose state equation is (3). Suppose Assumption 2.1 holds, and rank(F_j) = n² + m² + nm for all j = 0, 1, . . . . Then, K_j updated by Algorithm 1 converges to the LQR solution K⋆ of the discrete-time linear system (3) as j → ∞. Furthermore, the rate of convergence is quadratic in the neighborhood of the solution K⋆.
Proof. We begin with the case j = 0. Since A + BK_init is Schur stable (Assumption 2.1), there exists a positive definite symmetric matrix P_0 ∈ R^{n×n} satisfying the corresponding discrete-time Lyapunov equation, from which the Bellman equation (17) follows. Under the control law K_init, equation (18) holds, and it can be transformed so as to hold for an arbitrary u_k. Substituting this equation and the collected data u_{k_s} and x_{k_s} into the Bellman equation (17), we obtain (20); we partly used A x_{k_s} + B u_{k_s} = x_{k_s+1} to derive the right-hand side of this equation.
By applying the vec operator to (20), we obtain (21), where we used the relation vec(NML) = (L^⊤ ⊗ N) vec(M) derived from the properties of the vec operator [24, §4.3.1]. In addition, G_1^0 ∈ R^{n×n}, G_2^0 ∈ R^{m×n}, and G_3^0 ∈ R^{m×m} are the matrices appearing in (21). The same formula as (21) holds for the data collected at other time steps. By putting these equations together, we obtain a system of l linear equations with n² + nm + m² unknowns. When l and ν_k are appropriately chosen and the rank of F_0 is n² + nm + m², the exact and unique solution is obtained. On the other hand, a Lyapunov equation is obtained by substituting (18) into the Bellman equation (17). Furthermore, by substituting (23) and (24) into (13) in Algorithm 1, we obtain K_1. As shown in [23], A + BK_1 is Schur stable with this K_1. Therefore, the procedure described above can be continued recursively for j = 1, 2, . . . This also implies that updating K_j with (12) and (13) is equivalent to updating it with (29) and (30) below:

P_j = Q + K_j^⊤ R K_j + (A + BK_j)^⊤ P_j (A + BK_j),    (29)

K_{j+1} = −(R + B^⊤ P_j B)^{−1} B^⊤ P_j A,    (30)

which comprise Hewer's iterative method for the discrete-time algebraic Riccati equation [23]. Thus, the theoretical assurance regarding the updates by (29) and (30), given as Theorem 1 in [23], also ensures that K_j updated by Algorithm 1 with the initial value K_init converges to the LQR solution K⋆ as j → ∞.
In addition, the rate of convergence is quadratic in the neighborhood of the LQR solution K ⋆ due to Theorem 2 in [23].
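For intuition, the model-based Hewer iteration, which Algorithm 1 reproduces from data, can be sketched as follows. The system matrices here are illustrative (chosen stable so that K_0 = 0 is stabilizing), and the Lyapunov equation is solved via the vec/Kronecker form rather than a library routine.

```python
import numpy as np

def dlyap(Ac, M):
    """Solve P = Ac^T P Ac + M via the Kronecker/vec form."""
    n = Ac.shape[0]
    vecP = np.linalg.solve(np.eye(n * n) - np.kron(Ac.T, Ac.T),
                           M.flatten(order="F"))
    return vecP.reshape(n, n, order="F")

def hewer(A, B, Q, R, K0, iters=20):
    """Model-based Hewer iteration: policy evaluation + policy improvement."""
    K = K0
    for _ in range(iters):
        P = dlyap(A + B @ K, Q + K.T @ R @ K)                 # evaluate K
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)    # improve K
    return K, P

# Small illustrative system (A is Schur stable, so K0 = 0 is stabilizing).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, P = hewer(A, B, Q, R, np.zeros((1, 2)))

# P should satisfy the discrete-time algebraic Riccati equation.
res = Q + A.T @ P @ A - P \
    - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```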
From Theorem 4.1, an approximation of the optimal feedback gain near the equilibrium point of the nonlinear plant can be obtained by using Algorithm 1. We show a specific example of K_AC designed by Algorithm 1 in Section 5.
Remark 1. Algorithm 1 is the discrete-time counterpart of the offline RL method for the continuous-time LQR problem proposed in [25], and also a special case, with simpler formulae, of the offline RL method for the discrete-time H∞ control problem proposed in [26]. However, Algorithm 1 and the proof of Theorem 4.1 are derived specifically and directly for the offline reinforcement learning of the discrete-time LQR solution based on Hewer's method [23]. This leads to an algorithm suitable for Step 1 of the model-free two-step design approach in the problem setting of this study, and also reveals the rate of convergence of the algorithm, which is not shown in the previous works [25,26].
Remark 2. It is difficult to explicitly state what kind of exploration term (probing noise) should be applied to make the rank condition on F_j hold. In fact, the related works [25,26] did not provide such information for their problem settings either.
Intuitively, the exploration term should contain a sufficiently large number of frequency components to avoid the degeneration of the matrix F j .In the simulation in Section 6, this is empirically confirmed using an exploration term consisting of various sine functions.
Remark 3. RL methods for the LQR problem have previously been studied for both continuous-time and discrete-time systems. For continuous-time systems, Jiang and Jiang [25] and Bian and Jiang [27] proposed offline policy-iteration-based and value-iteration-based algorithms, respectively. Later, these RL algorithms were extended to design the gain for the output feedback case [28]. For discrete-time systems, online policy-iteration-based algorithms were proposed in [9,29–31]. Offline algorithms were proposed in [32,33], where the discrete-time LQR controller was obtained from multiple experimental data with different initial values. Compared to these related methods, the offline RL method for the discrete-time LQR problem given as Algorithm 1 in this section is particularly suited for the redesign problem for two reasons: (i) Algorithm 1 requires only a single set of experimental data for learning, minimizing the wear on the plant during the learning process, and (ii) the discrete-time cost function (4) used for the offline RL is compatible with that for the online RL used in Step 2, as opposed to continuous-time cost functions.

DEMONSTRATION EXAMPLE
In this section, we illustrate the validity of the proposed approach through numerical simulations using an inverted pendulum with input saturation as a plant.

Plant and overall flow of the simulation
We consider the control problem of the inverted pendulum shown in Fig. 2. This inverted pendulum is controlled by the torque input ū_k ∈ R from the motor attached to the fulcrum. The motor has input saturation characteristics, by which the control input u_k is saturated at a constant value of ±s. That is, the actual torque input ū_k applied to the fulcrum is

ū_k = u_k if |u_k| ≤ s, and ū_k = s · sign(u_k) otherwise.    (31)

The discrete-time model of the dynamics of the inverted pendulum is given by (32), where T_s is the sampling period, and x = [ψ, ξ]^⊤ is the state of the inverted pendulum consisting of the angle (rad) and angular velocity (rad/s) of the pendulum, denoted by ψ ∈ [−π, π] and ξ ∈ (−∞, ∞), respectively. The definitions and the values of the other variables in (32) are shown in Table 1.
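The saturation characteristic is a simple clipping operation; the limit s used below is a placeholder, since the actual value is listed in Table 1.

```python
import numpy as np

def saturate(u, s):
    """Actual torque applied to the fulcrum: u clipped to [-s, s]."""
    return float(np.clip(u, -s, s))

s = 1.0  # hypothetical saturation level; see Table 1 for the real value
```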
Suppose the inverted pendulum is stabilized by a control law K_init (Assumption 2.1), but the control performance is not satisfactory. In what follows, our goal is to redesign the control law by applying the two-step procedure proposed in Section 3. Specifically, we first regard the nonlinear system (32) as a plant and obtain the quasi-optimal auxiliary control law K_AC through Algorithm 1, assuming that the model of the plant is unknown (Step 1). Then, with K_AC, we design the nonlinear control law µ by applying a general online RL method (Step 2). In this example, we use an Actor-Critic method with eligibility traces, described in detail in Section 5.3 and Appendix A.
We define the finite-interval cumulative cost from time k = 0 to k = k_fin as

J_fin(k_fin) = Σ_{k=0}^{k_fin} (x_k^⊤ Q x_k + u_k^⊤ R u_k)    (33)

for the comparison of control performance. Note that if the state and input have sufficiently converged at k = k_fin, then x_k ≃ [0, 0]^⊤ and u_k ≃ 0 hold after the terminal time and J_fin(k_fin) ≃ J. In other words, the cost J_fin(k_fin) is a measure almost equivalent to the cost function J.
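A direct computation of J_fin(k_fin) from logged trajectories might look like the following; the weights and the short trajectory are made-up values for illustration only.

```python
import numpy as np

def finite_cost(xs, us, Q, R):
    """J_fin(k_fin) = sum over k of x_k^T Q x_k + u_k^T R u_k."""
    return sum(x @ Q @ x + u @ R @ u for x, u in zip(xs, us))

# Hypothetical weights and a three-step trajectory.
Q, R = np.diag([10.0, 1.0]), np.array([[1.0]])
xs = [np.array([0.4, 0.0]), np.array([0.2, -0.1]), np.array([0.0, 0.0])]
us = [np.array([-1.0]), np.array([-0.5]), np.array([0.0])]
J = finite_cost(xs, us, Q, R)
```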

Evaluation conditions
In Step 1, we run the iterations in Algorithm 1 with the parameters listed in Table 1. The initial state for data collection is set to x_0 = 0, and an exploration term ν_k consisting of 100 sinusoidal components with frequencies ω_i (i = 1, 2, . . . , 100) selected randomly from [−500, 500] is added to K_init x_k.
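A possible form of such an exploration signal is a sum of sinusoids with the stated random frequencies. The amplitude and sampling period below are assumptions, since the paper does not specify them here; the point is that the signal is rich in frequency content (cf. Remark 2).

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(-500.0, 500.0, size=100)  # random frequencies (rad/s)
Ts = 0.01    # hypothetical sampling period; see Table 1
amp = 0.01   # hypothetical per-component amplitude

def nu(k):
    """Exploration term at step k: a sum of 100 sinusoids."""
    return amp * np.sum(np.sin(omega * k * Ts))
```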

Evaluation results
The quasi-optimal linear control law K_AC is obtained as [−2.77, −0.48]. For the purpose of verification, we also calculate the LQR controller K⋆ by using the linearized model of the nonlinear plant (32) around the equilibrium point [ψ, ξ]^⊤ = 0. The result is K⋆ = [−2.77, −0.48]. As expected, K_j converges to K⋆ by Algorithm 1, which is also verified from Fig. 3, showing that the gap between K_j and K⋆ converges to 0. (Fig. 3 caption: Convergence of K_j updated by Algorithm 1. K_j converges to the linear optimal law K⋆. K_5 is selected as K_AC according to the termination condition (14) with ε given in Table 1.) Next, we compare the performance of the three control laws, K_AC, K⋆, and K_init, based on the average value of the cost J_fin(k_fin) obtained with 100 simulations. In these simulations, we select the initial state x_0 = [ψ_0, ξ_0]^⊤ randomly.
Table 2. Average of the cost J_fin(k_fin) for each linear control law. The average cost using K_AC is lower than that using K_init and matches the value obtained with the linear optimal controller K⋆.

It can be seen in Table 2 that the performance of the control law K_AC is better than that of the initially given control law K_init. We can also see that the performance of K_AC matches that of K⋆, which is a direct consequence of the convergence of K_AC to K⋆. These results show the validity of Theorem 4.1.

Step 2: design of nonlinear control law µ

Evaluation conditions
In Step 2, the nonlinear control law µ, or more precisely the control law parameter W, is adjusted by an online RL method. In this demonstration example, we use an Actor-Critic method with eligibility traces combined with the linear control law [6, §13.5], [17], whose pseudo code is shown as Algorithm 2 in Appendix A. To obtain a parameter value that can generate an appropriate input for an arbitrary state, we perform the following training procedure. A training experiment (simulation) consists of N_tri = 4000 trials, and each trial is performed for k_fin = 50 steps. We select the initial state x_0 = [ψ_0, ξ_0]^⊤ randomly according to ψ_0 ∈ [−0.4, 0.4] and ξ_0 ∈ [−1, 1] in each trial. If the angle of the pendulum exceeds 0.5 rad, i.e., |ψ_k| ≥ 0.5, the trial is terminated. In this case, W_k and θ_k are updated by Algorithm 2 but with the reward set to r_k = −1000 as a penalty and θ_{k−1}^⊤ φ(x_k) = 0 (line 8 of Algorithm 2). To make W converge, we set the variance Σ = σ² of the exploration term and the learning rate β for W in the j-th trial according to (36) and (37), so that the degree of exploration and the change of W decrease as the number of trials increases, where σ²_init denotes the initial variance and β_init denotes the initial learning rate.
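The exact decay schedules (36) and (37) are not reproduced here; a common choice consistent with the description (both quantities decreasing with the trial index j) is a 1/j decay, sketched below with hypothetical initial values.

```python
# Decaying schedules for the exploration variance and the learning rate.
# The 1/j form is an assumption; sigma2_init and beta_init come from Table 3.

def sigma2(j, sigma2_init):
    """Exploration variance at trial j (j >= 1)."""
    return sigma2_init / j

def beta(j, beta_init):
    """Learning rate for W at trial j (j >= 1)."""
    return beta_init / j
```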
The change of the parameter W and the cost J_fin(k_fin) is affected by the stochasticity of the exploration term and is also highly dependent on the sequence of initial states. Therefore, the trend of the transient control performance during learning needs to be evaluated in a statistical manner. For this purpose, we define a sequence of 4000 trials as one set of simulations and execute N_sim = 3500 sets of simulations to calculate the average of the cost J_fin(k_fin) at each trial over the 3500 sets. In addition, we consider the following two cases for comparison: (i) the case where we train µ by Algorithm 2 without any linear control law (hereafter denoted as RL alone) and (ii) the case where we use K_init instead of K_AC in Algorithm 2 (denoted as K_init + RL). The same procedure is performed for these two cases to compute the average of the transient learning performance. We set the i-th basis function so that its mean c_i is selected as [11, 11]^⊤. The details of the other simulation parameters are listed in Table 3. (Fig. 4 caption: An enlarged view with a different scale of the vertical axis is inserted to highlight the average cost at the end of the learning process. The proposed approach (K_AC+RL) shows significantly better transient learning performance than that of RL alone, especially in the early stage of learning.)

Evaluation results
The comparison of the average of the cost J_fin(k_fin) at each trial across 3500 sets of simulations is shown in Fig. 4, where the inset shows an enlarged view with a different range of the vertical axis. The cost for K_AC alone is calculated by averaging the cost of 3500 simulations with randomly selected initial states. It can be seen from Fig. 4 that the average cost obtained with RL alone is very high in the early stage of learning. This is because the inverted pendulum falls over in a short time when the controller has little experience. On the other hand, the proposed approach (K_AC+RL) shows significantly better transient learning performance. The average cost is lower than that of RL alone, especially in the early stage of learning. This indicates that the quasi-optimal linear control law K_AC in the proposed approach assists the closed-loop system to maintain a certain level of control performance even in inexperienced states. This interpretation agrees with the observation that the cost of using the online RL in parallel with the initial control law (K_init+RL) is higher than that of the proposed approach in the early stage of learning, since the control performance of the initial linear control law is worse than K_AC. It should be noted that the cost J_fin(k_fin) for collecting the data in Step 1 is 4.35, which is negligibly small compared with the cost of achieving the same control performance using RL alone or K_init + RL. Therefore, designing K_AC in Step 1 is quite beneficial compared to the cost of obtaining it online. Table 4 shows the average of the cost J_fin(k_fin) with the control laws designed by the proposed approach and four other comparison methods. For each method, the average is computed with 3500 different control laws × 100 simulations starting from the same initial states as Table 2.

Table 4. Averages of the cost J_fin(k_fin) with control laws obtained by each method after Step 2. The average cost J_fin(k_fin) for the proposed approach is lower than that of the other methods.

Method | Average of the cost J_fin(k_fin)
K_AC + RL (proposed approach) | 36.2
K_init + RL | 40.3
RL alone | 40.7
K_AC alone | 39.0
K_init alone | 133.4

The nonlinear control law obtained by the proposed approach (K_AC+RL) outperforms all the other control laws, i.e., K_init alone, K_AC alone, RL alone, and K_init + RL. In particular, the combined quasi-optimal linear and nonlinear control law (K_AC+RL) designed by the proposed approach improves the average cost J_fin(k_fin) by 7.2% compared with the quasi-optimal linear control law K_AC alone, showing the effectiveness of the nonlinear controller µ in attaining better performance of the control system. Although the proposed approach requires additional costs for learning the nonlinear control law µ compared to K_AC alone, this improvement of the average cost J_fin(k_fin) becomes dominant over the learning cost in the long run since the learning cost is incurred only once when the controller is trained.
In conclusion, the proposed approach was shown to be effective for improving transient learning performance and designing a control law with a smaller cost than RL alone under the same number of training trials.
To further interpret this result, we plot the control input generated by the control law obtained by the proposed approach in Fig. 5. The figure illustrates that the designed control law is almost linear near the origin because the nonlinear control law µ is almost zero and the quasi-optimal linear control law K_AC is dominant. On the other hand, µ complements K_AC as the norm of the state becomes large; it reduces the cost by suppressing the application of unnecessarily large control inputs that cause saturation (see Appendix B for additional visualization). This can also be observed in the representative examples of the time series of control input in Fig. 6(a). The corresponding state variables, i.e., the angle and the angular velocity of the pendulum, are shown in Fig. 6(c),(d), where the experiment is executed with the initial state x_0 = [0.4, 0]^⊤. These figures show that the control law obtained by the proposed approach attains almost the same state trajectories as the quasi-optimal linear control law K_AC but with a smaller control input by suppressing it in the saturated region. Quantitatively, these differences sum up to the improvement of the average of the cost J_fin(k_fin) shown in Table 4.
Remark 4. One of the major causes of wear on the plant is the operation of the unstable system during the learning process in Step 2. Thus, to further reduce the wear during the learning process, i.e., to improve the transient learning performance, it is important to guarantee the stability of the closed-loop system during the learning process. Such an extension would be possible by modifying the structure of the nonlinear control law µ as proposed in [18]. Specifically, we define the input to the plant u by (39), where h and μ are functions to be learnt by any RL-based algorithm of the user's choice in Step 2. Then, according to Theorem 1 in [18], the control law (39) makes the origin locally stable for any choice of the hyperparameters in μ under some assumptions (see [18] for details). This result can be regarded as an extension of the existing approach [18] in that the model-based LQR controller in [18] can now be obtained from a single set of experimental data, without knowing the plant model, using Algorithm 1 of this paper.

(Fig. 5 caption: Visualization of the control law resulting from K_AC + RL (proposed approach). The designed control law is almost linear near the origin because the quasi-optimal linear control law K_AC is dominant. On the other hand, the nonlinear control law µ complements K_AC as the norm of the state becomes large; it reduces the cost by suppressing the application of unnecessarily large control inputs that cause saturation.)

EFFICIENCY OF HYPERPARAMETER TUNING
In the learning process of the RL-based control law, control inputs are generated in a stochastic manner, and thus, learning is not always successful; in the case of the inverted pendulum, for example, the resulting control law may fail to stabilize the pendulum, or it may deteriorate the performance compared to the initially given linear control law K_init even if the pendulum is stabilized. The ratio of successful learning depends on the setting of the hyperparameters. In particular, the initial variance of the exploration term σ²_init and the initial learning rate β_init are two major factors that directly affect the result. In this section, we show, through numerical simulations, that the stability and the performance of the control law obtained with the proposed approach (K_AC+RL) are more robust against hyperparameter settings compared to the control laws obtained with the online RL method alone (RL alone) and the online RL method combined with the initially given linear control law (K_init+RL).

Evaluation conditions
We vary the initial variance of the exploration term σ 2 init in (36) and the initial learning rate β init in (37), and design N sim = 100 sets of control laws for each hyperparameter setting. The percentages of (i) successful learning and (ii) improvement in performance are then evaluated for the three methods: K AC +RL (proposed), RL alone, and K init +RL. More specifically, we compute the average of the cost J fin (k fin ) over 100 trials using each designed control law. The percentage of successful learning is then calculated as q/N sim × 100, where q is the number of runs satisfying the following two conditions: (i) the parameter W does not diverge to infinity by the end of the training process, and (ii) the average cost is less than the pre-defined penalty cost for destabilization, which is set to 1000. The percentage of improvement in performance is calculated as p/N sim × 100, where p is the number of runs in which the average cost J fin (k fin ) of the trained control law outperforms that of the quasi-optimal linear control law K AC alone.
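The two percentages described above can be computed as in the following sketch. The function name and the toy data are ours, but the success criteria (no divergence of W and average cost below the penalty of 1000) and the improvement criterion (average cost below that of K AC alone) follow the text.

```python
import numpy as np

def learning_statistics(avg_costs, diverged, penalty=1000.0, cost_KAC=None):
    """Percentages of (i) successful learning and (ii) performance improvement.

    avg_costs : average cost J_fin(k_fin) over the trials of each designed law
    diverged  : True where the parameter W diverged during training
    penalty   : pre-defined penalty cost for destabilization (1000 in the text)
    cost_KAC  : average cost achieved by the quasi-optimal linear law K_AC alone
    """
    avg_costs = np.asarray(avg_costs, dtype=float)
    diverged = np.asarray(diverged, dtype=bool)
    N_sim = len(avg_costs)
    success = (~diverged) & (avg_costs < penalty)   # q / N_sim * 100
    improved = avg_costs < cost_KAC                 # p / N_sim * 100
    return success.sum() / N_sim * 100.0, improved.sum() / N_sim * 100.0

# Toy data: 4 designs, one diverged, one stable but worse than K_AC.
s, i = learning_statistics([120.0, 95.0, 1500.0, 80.0],
                           [False, False, True, False], cost_KAC=100.0)
print(s, i)  # → 75.0 50.0
```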

Evaluation results
The percentages of (i) successful learning and (ii) improvement in performance are shown in Figs. 7(a) and 7(b), respectively. According to Fig. 7(a), the ratio of successful learning is almost the same among the three methods, implying that dedicated tuning of the hyperparameters is not necessary for designing a stabilizing control law. On the other hand, Fig. 7(b) shows that hyperparameter tuning is important for obtaining a better control law than the quasi-optimal linear one K AC . In this regard, the proposed approach is superior: the control law obtained with the proposed approach outperforms K AC for a wider range of hyperparameters than the other methods. This is because the control performance is already almost optimized near the origin by K AC designed in Step 1, and thus the performance can be improved even with a small initial variance σ 2 init and learning rate β init , which determine the degree of exploration and the magnitude of the parameter updates in the learning algorithm of Step 2. This feature allows us to avoid tedious tuning of the hyperparameters and reduces wear on the plant in the controller design process.

CONCLUSION
In this paper, we have proposed a completely model-free approach to redesigning the optimal regulator for nonlinear systems. Specifically, we have developed a model-free two-step design approach that improves the transient learning performance of RL and reduces the risk of wear on the plant during the learning process. To this end, we first developed an offline RL algorithm for designing a quasi-optimal linear control law. The quasi-optimal control law is then used to assist the control performance during the exploration phase of the online RL. Using an inverted pendulum with input saturation as an example, we have shown that the proposed approach improves the transient learning performance of online RL and robustly achieves improvement in the performance of the control system for a wide range of hyperparameters.

Figure 1 .
Figure 1. Structure of the proposed control law and the model-free two-step design approach. In Step 1, the linear auxiliary control law K AC is designed by offline RL, which contributes to reducing wear on the plant during the learning process of online RL. In Step 2, we design the nonlinear control law µ by online RL.

Figure 2 .
Figure 2. Plant used in the numerical simulation (inverted pendulum with input saturation). We design a nonlinear optimal regulator for this inverted pendulum with the proposed model-free two-step design approach based on RL.

Figure 3 .
Figure 3. Convergence of K j updated by Algorithm 1. K j converges to the linear optimal law K ⋆ . K 5 is selected as K AC according to the termination condition (14) with ǫ given in Table 1.

Figure 4 .
Figure 4. Averages of the cost J fin (k fin ) in each trial in Step 2. An enlarged view with a different scale on the vertical axis is inserted to highlight the average cost at the end of the learning process. The proposed approach (K AC +RL) shows significantly better transient learning performance than RL alone, especially in the early stage of learning.

Figure 5
Figure 5. Visualization of the control law resulting from K AC + RL (proposed approach). The designed control law is almost linear near the origin because the quasi-optimal linear control law K AC is dominant. On the other hand, the nonlinear control law µ complements K AC as the norm of the state becomes large; it reduces the cost by suppressing unnecessarily large control inputs that would cause saturation.

Figure 6
Figure 6. The control input generated by the control law obtained with the proposed approach. The nonlinear control law µ complements the linear control law K AC as the norm of the state becomes large; it reduces the cost by suppressing unnecessarily large control inputs that would cause saturation.

Figure 7 .
Figure 7. The percentages of (a) successful learning and (b) improvement in performance. In both figures, the results were obtained with (top) the proposed approach, (middle) the online RL method alone, and (bottom) the online RL method using the initially given linear control law in parallel. The percentage of successful learning is almost the same among the three methods. On the other hand, the percentage of improvement for the proposed approach tends to be higher than that of the other methods over a wider range of hyperparameter combinations.
Figure B1. Visualization of the (a) linear and (b) nonlinear parts of the control law resulting from K AC + RL (proposed approach). The surface shown in (a) is flat because K AC is a linear control law. On the other hand, the inputs in (b) (and Fig. 5) are generated by the nonlinear control law µ in such a way that u AC + u RL becomes closer to ±s (±0.5) when u AC is too large/too small. In other words, the inputs u RL are generated so that they are complementary to u AC in terms of reducing the cost.

Table 1 .
Simulation parameters of Step 1.

Table 3 .
Simulation parameters of Step 2.
Yutaka Hori received the B.S. degree in engineering and the M.S. and Ph.D. degrees in information science and technology from the University of Tokyo in 2008, 2010, and 2013, respectively. He held a postdoctoral appointment at the California Institute of Technology from 2013 to 2016. In 2016, he joined Keio University, where he is currently an associate professor. His research interests lie in feedback control theory and its applications to synthetic biomolecular systems. He is a recipient of the Takeda Best Paper Award from SICE in 2015 and the Best Paper Award at the Asian Control Conference in 2011, and was a finalist for the Best Student Paper Award at the IEEE Multi-Conference on Systems and Control in 2010. He has been serving as an associate editor on the Conference Editorial Board of the IEEE Control Systems Society. He is a member of IEEE, SICE, and ISCIE.