Output Feedback Controller for a Class of Unknown Nonlinear Discrete Time Systems Using Fuzzy Rules Emulated Networks and Reinforcement Learning

A model-free adaptive control for non-affine discrete time systems is developed by utilising output feedback and action-critic networks. A fuzzy rules emulated network (FREN) is employed as the action network and its multi-input version (MiFREN) is implemented as the critic network. Both networks are constructed using human knowledge in the form of IF-THEN rules about the controlled plant, and the learning laws are established by reinforcement learning without any off-line learning phase. The convergence of the tracking error and internal signals is demonstrated theoretically. A numerical simulation and an experimental system are given to validate the proposed scheme.


Introduction
Due to the complexity of today's controlled plants, it is often difficult or impossible to establish their mathematical models, especially for discrete time systems [1]. Model-free approaches have therefore been developed that utilise only the input-output data of the controlled plant [2,3]. On the other hand, the performance of such controllers depends on the quality and quantity of the data [4]. For some engineering applications, it is very difficult to access all state variables, thus output feedback remains the preferable scheme [5,6]. Furthermore, closed-loop analysis and stability approaches have been proposed [7,8,9] to guarantee controller performance. From an engineering point of view, stability of the closed loop is only a basic minimum requirement, even for artificial intelligence controllers [10]. Therefore, optimal controllers are more desirable for modern applications [11] and from a broader perspective [12].
To ensure closed-loop performance while optimising a predefined cost function, schemes based on adaptive dynamic programming have been utilised, but mathematical models are required for their iterative learning [13,14]. From the model-free perspective, reinforcement learning (RL) algorithms have been developed to solve optimal control problems [15,16] with an estimated solution of the Hamilton-Jacobi-Bellman equation [17,18]. To mimic the RL process, approaches based on action-critic networks have been derived with artificial neural networks (ANNs), considering the controlled plant as a black box [19,20]. Nevertheless, even when the mathematical model is unknown, the engineer still has basic knowledge of the controlled plant, such as 'IF higher output is required THEN more control effort should be supplied'. Thus, the controlled plant can be considered as a grey box.
To integrate human knowledge in IF-THEN format into the controller, fuzzy logic systems (FLSs) have been utilised in control applications [21], including optimal control problems [22]. By adding learning ability to FLSs, integrations of FLS and ANN have been developed, such as the fuzzy neural network (FNN) [23] and the fuzzy rules emulated network (FREN) [24,25]. Thereafter, approaches using FNN and FREN to solve the optimal control problem with RL have been proposed [26,27], where the controlled plants are considered as a class of affine systems. On the other hand, the problem of non-affine systems has been studied in Ref. [28] with the critic-action network approach, where state feedback is utilised to gain enough information to tune the ANNs.
In this work, an output feedback model-free controller is proposed for plants whose dynamics are non-affine in the control effort. The controller is designed with an action network called FRENa, whose IF-THEN rules are set according to the controlled plant. Thereafter, the long-term cost function is estimated by the multi-input version of FREN, called MiFRENc, whose IF-THEN rules are established from general considerations for minimising both the tracking error and the control energy. The learning laws are derived with the RL approach to tune all adjustable parameters of FRENa and MiFRENc, aiming to minimise the tracking error and the estimated cost function. Furthermore, a closed-loop analysis is provided by the Lyapunov method to demonstrate the convergence of the tracking error and internal signals.
This paper is organised as follows. Section 2 introduces a class of systems under our investigation and problem formulation. The proposed scheme is introduced in Section 3 including the network architectures with IF-THEN rules of FRENa and MiFRENc and their formulations. The learning laws and closed-loop analysis are derived in Section 4. Section 5 provides the results of the simulation and experimental system.

Controlled Plant as a Class of Nonlinear Discrete-Time Systems
In this work, the controlled plant is a class of non-affine discrete time systems considered as

y(k + 1) = f(y(k), …, y(k − n_y), u(k), …, u(k − n_u)) + d(k),  (1)

where y(k + 1) ∈ R is the plant's output with respect to the control effort u(k) ∈ R, f(·) is an unknown nonlinear function, n_u and n_y are unknown system orders and d(k) denotes a bounded disturbance such that |d(k)| ≤ d_M. For further analysis, the following assumptions are expressed on the unknown nonlinear function f(·) with respect to the control effort u(k).
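As a concrete illustration of the plant class (1), the sketch below implements a hypothetical non-affine system; the function f, the orders n_y = n_u = 2, all coefficients and the sin(·) nonlinearity are invented for illustration only and are not the paper's plant.

```python
import numpy as np

def plant_step(y_hist, u_hist, d=0.0):
    """One step of a hypothetical non-affine plant of the form (1).

    y_hist = [y(k), y(k-1)] and u_hist = [u(k), u(k-1)], i.e. assumed
    orders n_y = n_u = 2.  The control u(k) enters through sin(u(k)),
    so y(k+1) is non-affine in u(k); d plays the bounded disturbance d(k).
    """
    y_k, y_km1 = y_hist
    u_k, u_km1 = u_hist
    f = y_k / (1.0 + y_km1 ** 2) + u_k + 0.2 * np.sin(u_k) + 0.1 * u_km1
    return f + d
```

Note that ∂y(k + 1)/∂u(k) = 1 + 0.2 cos(u(k)) ∈ [0.8, 1.2] here, so this toy plant satisfies Assumption 2.1 below with g_m = 0.8 and g_M = 1.2.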

Assumption 2.1:
The derivative of y(k + 1) with respect to u(k) exists and is bounded such that

0 < g_m ≤ ∂y(k + 1)/∂u(k) ≤ g_M,  (2)

where g_m and g_M are positive constants.

Remark 2.2:
The condition in (2) indicates that the controlled plant (1) has a positive control direction. This assists the setting of IF-THEN rules relating the change of control effort u(k) to the change of output y(k + 1).
Referring to condition (2), it is clear that the change of output Δy(k + 1) with respect to the change of control effort Δu(k) can be rewritten as

g_m^d Δu(k) ≤ Δy(k + 1) ≤ g_M^d Δu(k),  (3)

where Δu(k) > 0 and g_m^d and g_M^d are constants according to g_m and g_M, respectively. This leads to the setting of IF-THEN rules such as 'IF Δu(k) is positive-large, THEN Δy(k + 1) should be positive-large' or 'IF Δu(k) is negative-small, THEN Δy(k + 1) should be negative-small'. By utilising such IF-THEN rules, the adaptive controller based on FRENs will be established in the next section.

RL Controller
The proposed controller is illustrated by the block diagram in Figure 1. In this work, the plant is a DC motor current control: only the armature current is measured as the output y(k + 1) (mA), while the control effort u(k) (V) is the voltage fed to the driver unit. According to this knowledge, the action network (FRENa) is first established to generate the control effort u(k) from the tracking error e(k) defined as e(k) = r(k) − y(k), where r(k) is the desired trajectory. Second, the critic network is designed using MiFRENc to produce the estimated long-term cost function L̂(k) for the controller FRENa. The details of the two networks and their IF-THEN rules are given as follows.

Controller or Action Network
To utilise the action network, IF-THEN rules relating the tracking error e(k) to the control effort u(k) are first established. Basic knowledge suggests that a positive-large e(k) means the output y(k) falls short by a positive-large amount; to compensate, the control effort u(k) should clearly be positive-large. In conclusion, we have: IF e(k) is positive-large, THEN u(k) should be positive-large. With seven linguistic levels, this leads to the design of the IF-THEN rules, where the linguistic notations N, P, L, M, S and Z denote negative, positive, large, medium, small and zero, respectively. Employing this set of IF-THEN rules, the network architecture of FRENa is illustrated in Figure 2. According to the architecture in Figure 2 and the function formulation of FREN in Ref. [24], the control effort u(k) is determined by

u(k) = β_a^T(k) φ_a(k),  (5)

where β_a(k) is the vector of adjustable weights and φ_a(k) is the vector of membership grades of e(k). Considering FRENa as a function estimator of the unknown control effort, there exists an ideal control effort u*(k) with ideal parameters β_a* such that

u*(k) = β_a*^T φ_a(k) + ε_a(k),  (8)

where ε_a(k) is a bounded residual error with |ε_a(k)| ≤ ε_aM. Using the dynamics (1) with the control laws (5) and (8), the tracking error e(k + 1) can be rearranged as (9). Recalling Assumption 2.1 and using the mean value theorem, the error dynamics (9) can be rewritten as (10). Employing the control laws (8) and (5) then yields (13), and we obtain (14). It is worth noting that the tracking error obtained in (14) is a function of the parameter error β̃_a(k) and an unknown but bounded term d_a(k) with |d_a(k)| ≤ d_aM. This relation will be used for the performance analysis afterwards.
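To make the FRENa computation concrete, the sketch below implements a minimal forward pass of the form u(k) = β_a^T φ_a(e(k)); the Gaussian membership shape, the centres for the seven linguistic levels (NL, NM, NS, Z, PS, PM, PL) and the width are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

# Illustrative centres for the seven linguistic levels NL, NM, NS, Z, PS, PM, PL.
CENTERS_A = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
WIDTH_A = 1.0

def phi_a(e):
    """Membership vector phi_a(k): Gaussian grade of e(k) for each rule."""
    return np.exp(-((e - CENTERS_A) ** 2) / (2.0 * WIDTH_A ** 2))

def fren_a(e, beta_a):
    """Control effort u(k) = beta_a^T phi_a(e(k))."""
    return float(beta_a @ phi_a(e))
```

With antisymmetric weights, e.g. beta_a = CENTERS_A, the network reproduces the sign behaviour of the rules above: zero error gives zero control effort and a positive error gives a positive one.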

Estimated Cost-Function or Critic Network
In this work, the long-term cost function L(k) is defined over an infinite horizon of the tracking error e(k) and the control effort u(k) with the discount factor γ_L as

L(k) = Σ_{i=0}^{∞} γ_L^i [p e²(k + i) + q u²(k + i)],  (15)

where p and q are positive constants and 0 < γ_L ≤ 1. The cost (15) is a function of two input arguments through the quadratic function f(x) = x² of e(k) and u(k). Thus, the adaptive network MiFRENc is utilised to estimate L(k), as shown in the block diagram in Figure 1. To design MiFRENc, IF-THEN rules are first established in Table 1, and the resulting network architecture of MiFRENc is illustrated in Figure 3. Utilising the network in Figure 3 and the results in Ref. [24], the estimated cost function L̂(k) is determined by

L̂(k) = β_c^T(k) φ_c(k),  (17)

where β_c(k) is the vector of adjustable weights and φ_c(k) is the vector of rule-firing strengths of e(k) and u(k). Using the universal approximation property of MiFREN [24], there exists an ideal parameter β_c* such that

L(k) = β_c*^T φ_c(k) + ε_c(k),

where ε_c(k) is a bounded residual error such that |ε_c(k)| ≤ ε_cM. Adding and subtracting β_c*^T φ_c(k) in (17) yields the estimation error relation used in the analysis. In order to improve the performance of FRENa and MiFRENc, the learning laws will be developed and explained in the next section.
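A minimal sketch of the critic's forward pass L̂(k) = β_c^T φ_c(k) follows; the three-level membership grid for each input (hence 3 × 3 rule nodes) and all centres are invented stand-ins for the rules of Table 1, used only to show the structure of a two-input MiFREN.

```python
import numpy as np

# Illustrative three-level grids (N, Z, P) for each critic input.
C_E = np.array([-1.0, 0.0, 1.0])
C_U = np.array([-1.0, 0.0, 1.0])
WIDTH_C = 1.0

def phi_c(e, u):
    """Rule-firing vector phi_c(k): products of the memberships of e and u."""
    mu_e = np.exp(-((e - C_E) ** 2) / (2.0 * WIDTH_C ** 2))
    mu_u = np.exp(-((u - C_U) ** 2) / (2.0 * WIDTH_C ** 2))
    return np.outer(mu_e, mu_u).ravel()  # 9 rule strengths

def mifren_c(e, u, beta_c):
    """Estimated long-term cost L_hat(k) = beta_c^T phi_c(k)."""
    return float(beta_c @ phi_c(e, u))
```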

Action Network Learning Law
Considering the tracking error dynamics in (14) and the estimated cost function L̂(k), the error function of the action network is given as (22). Thereafter, the cost function to be minimised is defined as (23). Applying gradient descent, the tuning law for β_a is derived as

β_a(k + 1) = β_a(k) − η_a ∂E_a(k)/∂β_a(k),  (24)

where η_a is the learning rate. Using the chain rule and (13) yields (25). Recalling (24) with (25) and using e_a(k) in (22) leads to (26). By eliminating d_a(k) via (14), the learning law (26) is rewritten as (27). The final learning law of FRENa given by (27) is a practical one because all terms required for its computation are available at time index k + 1.
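The gradient step (24) can be sketched as a plain LMS-style update. Because the exact composition of the action error e_a(k) in (22) is not reproduced here, it is passed in as a precomputed signal; the sign convention assumes the positive control direction of Assumption 2.1.

```python
import numpy as np

def update_beta_a(beta_a, phi_a_k, e_a, eta_a=0.01):
    """One gradient-descent step on E_a = e_a^2 / 2 with respect to beta_a.

    Since u(k) = beta_a^T phi_a(k) is linear in beta_a, the update
    direction is e_a * phi_a(k), scaled by the learning rate eta_a."""
    return beta_a + eta_a * e_a * phi_a_k
```

As a sanity check, iterating this update with e_a taken as a fixed regression error drives β_a^T φ_a toward the target, as in standard LMS adaptation.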

Critic Network Learning Law
In general, the error function of a critic network is based on the estimated cost function L̂(k). Therefore, in this work, the error function e_c(k) is given as (28), where δ is a positive constant. In order to tune β_c, the cost function E_c(k) is defined as (29). Applying gradient descent to (29) with respect to β_c(k), we have

β_c(k + 1) = β_c(k) − η_c ∂E_c(k)/∂β_c(k),  (30)

where η_c is the learning rate. Using the chain rule along E_c(k) in (29), e_c(k) in (28) and L̂(k) in (17) yields (31). Rewriting (30) with (31) leads to (32). Finally, we have a practical tuning law for MiFRENc.
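The critic step (30) admits the same LMS-style sketch: since L̂(k) = β_c^T φ_c(k) in (17), the gradient of L̂ with respect to β_c is simply φ_c(k). The exact critic error e_c(k) of (28) is not reproduced, so it is again passed in precomputed.

```python
import numpy as np

def update_beta_c(beta_c, phi_c_k, e_c, eta_c=0.5):
    """One gradient-descent step on E_c = e_c^2 / 2 with respect to beta_c.

    With L_hat(k) = beta_c^T phi_c(k), dL_hat/dbeta_c = phi_c(k), so the
    update moves beta_c against the error direction e_c * phi_c(k)."""
    return beta_c - eta_c * e_c * phi_c_k
```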

Closed-Loop Analysis
In the following theorem, the closed-loop performance of the output feedback controller is demonstrated, showing that the tracking error and internal signals are bounded.

Theorem 4.1:
For the non-affine discrete time system described in Section 2, the performance of the closed-loop system configured with FRENa and MiFRENc as in Section 3 is guaranteed, in terms of the bounded tracking error and internal signals, when the design parameters are selected to satisfy conditions (33)-(35), where ν_a and ν_c are the upper limits of ||φ_a(k)||² and ||φ_c(k)||², respectively. In particular, (34) requires 0 < η_a ≤ g_m/(ν_a² g_M²).
Proof: The proof is given in the Appendix.
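The learning-rate condition (34) can be checked numerically; the helper below reproduces the bounds computed in Section 5 from the stated g_m, g_M and ν_a values.

```python
def eta_a_bound(g_m, g_M, nu_a):
    """Upper bound on the action learning rate from condition (34):
    0 < eta_a <= g_m / (nu_a**2 * g_M**2)."""
    return g_m / (nu_a ** 2 * g_M ** 2)
```

For the simulation setting (g_m = 1, g_M = 6, ν_a = 1.5) this gives about 0.0123, and for the experiment (g_m = 5, g_M = 10) about 0.0222, matching the values used in Section 5.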
The validation of the proposed control scheme will be presented in the next section for both the computer simulation with a non-affine discrete time system and the hardware implementation with a DC motor current control plant.

Simulation System and Results
The controller developed in this work is first implemented on the nonlinear discrete time system given in (36). It is worth mentioning that the mathematical model (36) is used only to establish the simulation. In this test, the desired trajectory is given as (37) with k_M = 500 as the maximum time index, A_r = 1.0 and ω_r = 8. Following (33), δ is selected as 0.75 and ν_a = ν_c = 1.5. With this setting and (35), the upper bound on the learning rate of MiFRENc is determined; in this case, the learning rate is selected as η_c = 0.5. To select the learning rate of FRENa, let us choose g_m and g_M as 1 and 6, respectively. By using (34), the learning rate of FRENa must satisfy

0 < η_a ≤ g_m / (ν_a² g_M²) = 1 / (1.5² × 6²) = 0.0123.
Thus, the learning rate for FRENa is selected as η_a = 0.01. Figures 4 and 5 illustrate the settings of the membership functions for FRENa and MiFRENc, respectively. The initial adjustable parameters β(1) for FRENa and MiFRENc are given in Table 2. Figure 6 displays the tracking performance with plots of both y(k) and e(k), and Figure 7 presents the control effort u(k). The estimated cost function L̂(k) is illustrated in Figure 8. The phase-plane trajectory of u(k) and e(k) is depicted in Figure 9 to demonstrate the behaviour of the closed-loop system.
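The simulation flow can be reproduced in miniature. The sketch below closes the loop between a FREN-style action network and a toy plant for a constant reference; the plant, reference, membership settings and learning rate are illustrative stand-ins, not the system (36) or the trajectory (37).

```python
import numpy as np

CENTERS = np.linspace(-2.0, 2.0, 7)  # seven linguistic levels for e(k)

def phi(e, width=0.8):
    """Gaussian membership vector for the tracking error e(k)."""
    return np.exp(-((e - CENTERS) ** 2) / (2.0 * width ** 2))

def plant(y, u):
    # Toy stand-in for (36): first order, non-affine in u via sin(u).
    return 0.6 * y / (1.0 + y ** 2) + u + 0.2 * np.sin(u)

def simulate(k_max=500, eta_a=0.05, r=1.0):
    """Track a constant reference r; returns the |e(k)| history."""
    beta = np.zeros(7)        # action-network weights, zero initialised
    y, errs = 0.0, []
    for _ in range(k_max):
        e = r - y
        u = float(beta @ phi(e))          # u(k) = beta_a^T phi_a(e(k))
        y = plant(y, u)
        beta += eta_a * (r - y) * phi(e)  # tune with the next-step error
        errs.append(abs(e))
    return errs
```

Running simulate() shows the tracking error decaying from its initial value toward a small steady-state level, qualitatively mirroring the behaviour reported in Figures 6-8.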

Experimental System and Results
The experimental system is constructed as a DC motor current control. The output y(k + 1) is the armature current (mA) and the input u(k) is the control voltage applied to the driver circuit depicted in Figure 1. As in the simulation system, let us select δ = 0.75, ν_a = ν_c = 1.5, g_m = 5 and g_M = 10. Thus, the learning rate of FRENa must satisfy

0 < η_a ≤ g_m / (ν_a² g_M²) = 5 / (1.5² × 10²) = 0.0222.
In this case, we select η_a = 0.01. For MiFRENc, we use the same learning rate as in the simulation system, η_c = 0.5, because the network architecture is the same. The desired trajectory is given with k_M = 2000. Figures 10 and 11 represent the settings of the membership functions of FRENa and MiFRENc, respectively. All adjustable parameters β(1) for FRENa and MiFRENc are initialised as in Table 3. Figure 12 displays the motor current y(k) and the tracking error e(k) to demonstrate the performance of the closed-loop system. The maximum absolute tracking error is |e(k)|_max = 48.2936 (mA) and the average absolute tracking error at steady state (k = 1500-2000) is 0.4924 (mA). Figure 13 shows the control effort u(k). The estimated cost function L̂(k) is illustrated in Figure 14. The phase-plane trajectory of u(k) and e(k) is plotted in Figure 15; a large variation is detected because of the back-EMF. In order to evaluate the proposed scheme under the back-EMF condition, a pulse-train trajectory is implemented, with the response displayed in Figure 16. It is clear that the effect of the back-EMF is eliminated within the second pulse (B).

Conclusions
A model-free adaptive control for a class of non-affine discrete time systems has been developed by RL. The closed-loop system has been established by output feedback with two adaptive networks, FRENa and MiFRENc. The initial settings of FRENa and MiFRENc have been constructed from human knowledge of the controlled plant in the format of IF-THEN rules. The performance has been enhanced by the learning laws for both FRENa and MiFRENc, while the tracking error and internal signals have been guaranteed to converge over reasonable compact sets. The numerical simulation and experimental results have been presented to verify the theoretical conjecture.

Disclosure statement
No potential conflict of interest was reported by the author(s).