Policy iteration-based integral reinforcement learning for online adaptive trajectory tracking of mobile robot

This paper considers trajectory tracking control for a nonholonomic mobile robot using integral reinforcement learning (IRL) based on a value functional represented by the integral of a local cost. The tracking error dynamics between the robot and the reference trajectory take the form of a time-invariant input-affine continuous-time nonlinear system when the translational and angular velocities of the reference trajectory are constant. This paper applies IRL to the tracking error dynamics by approximating the value functional from data collected along the robot trajectory. The paper proposes a specific procedure to implement the IRL-based policy iteration online, including a batch least-squares minimization. The approximate value function updates the control policy, which compensates the translational and angular velocities that drive the robot. Numerical examples demonstrate the tracking performance of the IRL-based controller.


Introduction
Trajectory tracking control is a fundamental and essential task for mobile robots. Model predictive control [1] and nonlinear control [2,3] have been applied to this task. In these model-based approaches, the controller depends on the model, so the control performance is fixed at design time. This limits the control performance when the controller is implemented on other robots with different physical parameters or in a different environment. Adaptive dynamic programming (ADP) [4-7] needs neither a complete model of the plant nor information about the environment to update the parameters in the controller. Several ADP methods track trajectories for mobile robots [8-10]. For example, ADP is available for a discrete-time linearized tracking error model around the reference trajectory [8]. ADP-based adaptive updates of the control law are also available for continuous-time nonlinear tracking error models [9,10]. These methods require linearization around the system equilibrium point or information about the reference trajectory.
ADP deals with the Hamilton-Jacobi-Bellman (HJB) equation, with Bellman's principle of optimality in mind, using not a model of the system but data from the robot's trajectory [18]. ADP gives approximate solutions to the HJB equation by restricting the value function to a parametric form. In this context, a neural network is often useful for representing the parametric function [9]. The value function updates the control policy, i.e. the control law, iteratively with dynamic programming. One way to estimate the approximate value function and update the control policy is to use the actor-critic structure: the critic estimates the parameters while the actor improves the control policy [5,6,9,11]. In particular, the article [11] discusses the convergence and stability properties of the policy iteration of ADP for discrete-time nonlinear systems.
ADP has typically been discussed for discrete-time systems. On the other hand, a method called integral reinforcement learning (IRL) [7,12-14] allows us to consider learning control for continuous-time systems. Since physical systems follow differential equations, learning methods for continuous-time systems are attractive. There are two ways to obtain the optimal control law in IRL: value iteration and policy iteration. Policy iteration is suitable for online learning, while value iteration requires noise to be added when collecting the data. The control policy in the IRL-based policy iteration converges to the optimal one in both offline learning [15,16] and online learning [12], provided one obtains an approximate solution to the HJB equation. The IRL-based policy iteration thus gives an optimal solution to the HJB equation from limited real-time information on the control system, and online learning enhances the control performance in real time.
This paper considers a trajectory tracking control problem for a nonholonomic mobile robot. A time-invariant input-affine continuous-time nonlinear system represents the tracking error dynamics. We propose an IRL-based policy iteration algorithm that designs the control inputs for tracking the reference trajectory using data collected along the robot trajectory. Since an approximation of the value functional is necessary for implementing the IRL-based policy iteration, this paper studies the selection of the basis function vector in an actor-critic structure. This paper also employs a batch least-squares minimization to update the control policy. The proposed algorithm sequentially updates the value function from online data. Finally, numerical examples demonstrate the performance of the proposed policy iteration algorithm. A preliminary version of this paper is [17]. The present paper adds comments on the convergence of the IRL-based policy iteration, details of the online learning algorithm, and additional numerical examples.

Problem formulation
Let X × Y ⊂ R × R be the coordinate system on the plane. The continuous-time kinematic model of a nonholonomic unicycle mobile robot is written as
$$ \dot{x}(t) = v(t)\cos\theta(t), \quad \dot{y}(t) = v(t)\sin\theta(t), \quad \dot{\theta}(t) = \omega(t), \tag{1} $$
where t ≥ 0 is the time, (x, y) ∈ X × Y is the position of the mobile robot, and θ ∈ [0, 2π) is the orientation with respect to X × Y. In addition, v ∈ R and ω ∈ R are the translational velocity and the angular velocity, respectively, that drive the robot. (x(t), y(t), θ(t)) is the robot trajectory.
Let (x_rf(t), y_rf(t), θ_rf(t)) be the reference trajectory. We assume that this virtual point satisfies Equation (1). The tracking error between the trajectory of the robot and the reference one is defined by
$$ e = \begin{bmatrix} e_x \\ e_y \\ e_\theta \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{rf} - x \\ y_{rf} - y \\ \theta_{rf} - \theta \end{bmatrix}, \tag{2} $$
where e = [e_x, e_y, e_θ]^T represents the error between (x, y, θ) and (x_rf, y_rf, θ_rf) as shown in Figure 1. The time derivative of the tracking error (2) is given by
$$ \dot{e}_x = \omega e_y - v + v_{rf}\cos e_\theta, \quad \dot{e}_y = -\omega e_x + v_{rf}\sin e_\theta, \quad \dot{e}_\theta = \omega_{rf} - \omega, \tag{3} $$
where v_rf and ω_rf are the reference trajectory counterparts of the translational velocity and the angular velocity, respectively. We assume that v_rf and ω_rf are constant to make the error dynamics time invariant. A traditional tracking problem is to find the control inputs v(t) and ω(t) such that e(t) converges to 0 as the time t goes to infinity.
In this paper, we introduce auxiliary control inputs into the control inputs [8]. Consider the control inputs of the form
$$ v = v_{rf}\cos e_\theta + u_1, \quad \omega = \omega_{rf} + u_2, \tag{4} $$
where u_1 ∈ R and u_2 ∈ R are the auxiliary control inputs. Note that e = 0 implies u_1 = 0 and u_2 = 0; that is, Equation (4) can realize an admissible control policy [12], which is needed for IRL. Then, the error dynamics (3) and the control inputs (4) lead to time-invariant input-affine continuous-time error dynamics of the form
$$ \dot{e} = f(e) + g(e)u, \tag{5} $$
where
$$ f(e) = \begin{bmatrix} \omega_{rf} e_y \\ -\omega_{rf} e_x + v_{rf}\sin e_\theta \\ 0 \end{bmatrix}, \quad g(e) = \begin{bmatrix} -1 & e_y \\ 0 & -e_x \\ 0 & -1 \end{bmatrix}, \quad u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}. $$
The problem we address is the following: find the auxiliary control inputs u_1(t) and u_2(t) in Equation (4) from the data e(τ) (0 ≤ τ ≤ t) such that the error e(t) converges to 0 as the time t goes to infinity.
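For concreteness, the following minimal sketch simulates the error dynamics (5) under the sign convention written in Equations (4) and (5) above; that convention is one common choice and is an assumption here, and the constants and function names are illustrative.

```python
import numpy as np

v_rf, w_rf = np.pi / 10, np.pi / 10   # constant reference velocities

def f(e):
    """Drift term f(e) of the error dynamics (5)."""
    ex, ey, eth = e
    return np.array([w_rf * ey,
                     -w_rf * ex + v_rf * np.sin(eth),
                     0.0])

def g(e):
    """Input matrix g(e); its columns multiply the auxiliary inputs u1, u2."""
    ex, ey, eth = e
    return np.array([[-1.0,  ey],
                     [ 0.0, -ex],
                     [ 0.0, -1.0]])

def e_dot(e, u):
    """Right-hand side of the error dynamics (5)."""
    return f(e) + g(e) @ u
```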

Integral reinforcement learning
To solve the problem in the previous section, this section discusses IRL, which requires a value functional, a control policy, the Bellman equation, an update rule for the control policy, and convergence properties of the policy iteration.

Value functional
Consider the error dynamics (5). The cost that evaluates the error under a given control policy is called the value functional. The value functional is defined as
$$ V^{\mu}(e(t)) = \int_t^{\infty} r(e(\tau), u(\tau))\, d\tau, \quad u(\tau) = \mu(e(\tau)), \tag{6} $$
where u = μ(e) ∈ R^2 is the control input, μ : R^3 → R^2 represents an admissible control policy that satisfies μ(0) = 0, and r : R^3 × R^2 → R is the local cost. Note that e = 0 in Equation (5) results in u = 0, which implies μ(0) = 0. In this paper, the local cost is a quadratic form such that
$$ r(e, u) = e^{\top} Q e + u^{\top} R u, \tag{7} $$
where Q ∈ S^3 and R ∈ S^2 are positive semidefinite and positive definite matrices, respectively. Since μ(0) = 0 holds, we have V^μ(0) = 0.
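As a rough numerical check, the value functional (6) with the quadratic local cost (7) can be approximated by integrating the cost along a long simulated trajectory under a fixed admissible policy. This is a sketch only; e_dot is the error dynamics function from the previous snippet, and the step size and horizon are illustrative.

```python
import numpy as np

Q, R = np.eye(3), np.eye(2)   # weights of the quadratic local cost (7)

def local_cost(e, u):
    """r(e, u) = e'Qe + u'Ru from Equation (7)."""
    return e @ Q @ e + u @ R @ u

def value_estimate(e0, mu, e_dot, dt=1e-3, horizon=50.0):
    """Forward-Euler approximation of V^mu(e0) in Equation (6)."""
    e, V = np.asarray(e0, dtype=float), 0.0
    for _ in range(int(horizon / dt)):
        u = mu(e)
        V += local_cost(e, u) * dt    # accumulate the running cost
        e = e + e_dot(e, u) * dt      # one Euler step of the dynamics (5)
    return V
```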

IRL form
The purpose of IRL is to determine a control policy that minimizes the value functional from the data e(τ) (0 ≤ τ ≤ t). To do that, we rewrite Equation (6) as the following one [5]:
$$ V^{\mu}(e(t)) = \int_t^{t+T} r(e(\tau), \mu(e(\tau)))\, d\tau + V^{\mu}(e(t+T)), \tag{8} $$
where T > 0 is the sampling period. The continuous-time Bellman Equation (8) is called the IRL form.
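The first term on the right-hand side of (8), the integral of the local cost over one period, can be estimated from sampled data without any model. A minimal sketch, assuming the data are sampled on a uniform grid of step dt over [t, t+T]:

```python
import numpy as np

def integral_reinforcement(e_samples, u_samples, dt, Q, R):
    """Trapezoidal estimate of the integral of the local cost (7)
    over one sampling period [t, t+T] in the IRL form (8)."""
    r = np.array([e @ Q @ e + u @ R @ u
                  for e, u in zip(e_samples, u_samples)])
    return float(dt * (r[:-1] + r[1:]).sum() / 2)   # trapezoidal rule
```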

Optimal control policy
The optimal value functional, that is, the minimum of V^μ(e) in the IRL form (8), is written as
$$ V^{*}(e(t)) = \min_{u(t:t+T)} \left[ \int_t^{t+T} r(e(\tau), u(\tau))\, d\tau + V^{*}(e(t+T)) \right], \tag{9} $$
where u(t : t+T) = {u(τ) : t ≤ τ < t+T} is the control input over the interval. Under the assumption that an optimal control policy exists, the optimal control input can be written in the state-feedback form
$$ u^{*}(t) = \mu^{*}(e(t)). \tag{10} $$
Let us define the Hamiltonian by
$$ H(e, \mu(e), \nabla V) = r(e, \mu(e)) + (\nabla V)^{\top} \big( f(e) + g(e)\mu(e) \big). $$
By differentiating Equation (6) with respect to the time, the following HJB equation [6,18] holds:
$$ 0 = \min_{\mu} H(e, \mu(e), \nabla V^{*}). \tag{11} $$
Partial differentiation of the right-hand side of the Hamiltonian with respect to μ yields
$$ \frac{\partial H}{\partial \mu} = 2R\mu(e) + g(e)^{\top} \nabla V^{*}(e) = 0. \tag{12} $$
Since R is positive definite, there is a unique control policy that achieves the minimum of the Hamiltonian:
$$ \mu^{*}(e) = -\frac{1}{2} R^{-1} g(e)^{\top} \nabla V^{*}(e). \tag{13} $$
Equation (13) is the optimal control policy in (10) satisfying Equation (11). Substituting Equation (13) into the HJB Equation (11) turns (11) into the following nonlinear partial differential equation:
$$ 0 = e^{\top} Q e + (\nabla V^{*}(e))^{\top} f(e) - \frac{1}{4} (\nabla V^{*}(e))^{\top} g(e) R^{-1} g(e)^{\top} \nabla V^{*}(e). \tag{14} $$
In general, it is difficult to solve Equation (14) analytically to obtain ∇V*.

IRL-based policy iteration
An initial control policy μ 0 (·) that stabilizes the error dynamics (5) is needed to start policy iteration. The subsequent control policies are expected to be capable of stabilizing the error dynamics. This characteristic is suitable for online control policy update.
The policy iteration consists of two steps: the policy evaluation step and the policy improvement step. The iteration lasts until convergence while the data along the robot trajectory are collected online. In the policy evaluation step, which concerns the IRL form (8), we solve the following equation for V^{j+1} with the data:
$$ V^{j+1}(e(t)) = \int_t^{t+T} r(e(\tau), \mu_j(e(\tau)))\, d\tau + V^{j+1}(e(t+T)), \tag{15} $$
where j is the iteration number. In the policy improvement step, which concerns the optimal control policy (13), we update the control policy by
$$ \mu_{j+1}(e) = -\frac{1}{2} R^{-1} g(e)^{\top} \nabla V^{j+1}(e). \tag{16} $$
Note that the control policy (16) requires g(e) but not f(e) in the error dynamics (5). This implies that the IRL-based policy iteration does not need the references ω_rf and v_rf appearing in f(e) in (5). The implementation of the policy iteration will be discussed in Section 4.
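A minimal skeleton of the iteration (15)-(16) may look as follows; evaluate_policy and improve_policy are placeholders for the two steps (one least-squares realization of the evaluation step appears in Section 4), and all names and the stopping rule are illustrative.

```python
import numpy as np

def policy_iteration(evaluate_policy, improve_policy, W0, eps=1e-7, max_iter=100):
    """IRL-based policy iteration: alternate (15) and (16) until the
    value-function parameters stop changing."""
    W = np.asarray(W0, dtype=float)
    mu = improve_policy(W)                 # mu_0 must stabilize the dynamics (5)
    for _ in range(max_iter):
        W_next = evaluate_policy(mu)       # policy evaluation, Equation (15)
        mu = improve_policy(W_next)        # policy improvement, Equation (16)
        if np.linalg.norm(W_next - W) <= eps:
            break
        W = W_next
    return W, mu
```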

Convergence of IRL-based policy iteration
The IRL-based policy iteration with online policy update has the following convergence property.

Lemma 3.1 ([12]): Solving Equation (15) for V^{j+1} is equivalent to finding the solution to the following equation:
$$ 0 = r(e, \mu_j(e)) + (\nabla V^{j+1}(e))^{\top} \big( f(e) + g(e)\mu_j(e) \big). \tag{17} $$
Let us comment on this lemma. In the case of offline policy update, the policy iteration also has a convergence property based on Equations (16) and (17) [15,16]. Since Equations (15) and (17) are equivalent according to Lemma 3.1, the IRL-based policy iteration converges to the optimal control policy (10) with both offline and online policy updates.

Approximate solution
The IRL form (8) is difficult to solve for the value functional V^μ(e) and may not have an analytic solution. Thus, we utilize a neural network to approximate the solution to (8). This section then provides a detailed algorithm of the IRL-based policy iteration.

Approximation of value functional
We assume that the value functional (6) has the form of the following single hidden layer neural network:
$$ V^{j}(e) = W_j^{\top} \phi(e), \tag{18} $$
where W_j ∈ R^L is the learning weight vector and φ : R^3 → R^L is the basis function vector. The input layer of the neural network in Equation (18) takes the tracking error e(t) as the input. For a given basis function vector, the hidden layer computes the L signals φ(e(t)) from the input e(t). Then, the output layer yields the estimate of the value functional.
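A sketch of the basis function vector used later in Section 5.1 (L = 9) and its Jacobian, which the improved policy (16) needs through ∇V^{j+1}(e) = ∇φ(e)^T W_{j+1}:

```python
import numpy as np

def phi(e):
    """Basis vector of Section 5.1 (L = 9): quadratic and cubic monomials."""
    ex, ey, eth = e
    return np.array([ex**2, ex*ey, ex*eth, ey**2, ey*eth, eth**2,
                     ex**3, ey**3, eth**3])

def grad_phi(e):
    """Jacobian d(phi)/d(e) as an L x 3 matrix."""
    ex, ey, eth = e
    return np.array([[2*ex,    0.0,     0.0],
                     [ey,      ex,      0.0],
                     [eth,     0.0,     ex],
                     [0.0,     2*ey,    0.0],
                     [0.0,     eth,     ey],
                     [0.0,     0.0,     2*eth],
                     [3*ex**2, 0.0,     0.0],
                     [0.0,     3*ey**2, 0.0],
                     [0.0,     0.0,     3*eth**2]])
```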

Online implementation of IRL-based policy iteration
To estimate the learning weight vector W_j in Equation (18) from the collected data, we introduce the actor-critic structure shown in Figure 2, where the critic and the actor correspond to the policy evaluation step and the policy improvement step, respectively.
There is some freedom in how to implement the IRL-based policy iteration of Subsection 3.4 and the article [12]. A pseudo-code of our IRL-based policy iteration is shown in Algorithm 1. An initial weight vector W_0 must be chosen so that the initial control policy μ_0(·) stabilizes the error dynamics (5). The weight vector W_j (j ≥ 1) is estimated repeatedly while ||W_{j+1} − W_j|| > ε holds for a given threshold ε > 0. In searching for the weight vector, we solve a batch least-squares problem instead of the recursive least-squares problem in [12].

Policy Evaluation Step
In the policy evaluation step, we first collect the data of the error dynamics (5) such that e(τ) for t + ℓT ≤ τ ≤ t + (ℓ+1)T, ℓ = 0, 1, . . . , k_b − 1, where k_b is the batch size, which must satisfy k_b > L.
From the IRL form (15) and the approximate value function (18), one can find a learning weight vector W_{j+1} that satisfies the following relationship:
$$ W_{j+1}^{\top} \phi(e(t+\ell T)) = \int_{t+\ell T}^{t+(\ell+1)T} r(e(\tau), \mu_j(e(\tau)))\, d\tau + W_{j+1}^{\top} \phi(e(t+(\ell+1)T)), \tag{19} $$
for ℓ = 0, 1, . . . , k_b − 1. To determine W_{j+1}, we solve the batch least-squares problem
$$ \min_{W_{j+1}} \sum_{\ell=0}^{k_b - 1} \sigma_\ell^2, \tag{20} $$
where
$$ \sigma_\ell = \int_{t+\ell T}^{t+(\ell+1)T} r(e(\tau), \mu_j(e(\tau)))\, d\tau - W_{j+1}^{\top} \big( \phi(e(t+\ell T)) - \phi(e(t+(\ell+1)T)) \big) $$
is the temporal difference error. The solution to problem (20) is
$$ W_{j+1} = (\Delta^{\top} \Delta)^{-1} \Delta^{\top} \rho, $$
where Δ ∈ R^{k_b × L} stacks the rows (φ(e(t+ℓT)) − φ(e(t+(ℓ+1)T)))^T and ρ ∈ R^{k_b} stacks the integrals of the local cost. The policy evaluation corresponds to the critic and approximates the value functional by updating the weights.
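A minimal sketch of this batch least-squares step; Phi0, Phi1, and rho are assumed to hold the sampled basis values and integral costs described above, and lstsq is used instead of the explicit normal equations for numerical robustness.

```python
import numpy as np

def evaluate_policy_batch(Phi0, Phi1, rho):
    """Solve the batch least-squares problem (20).

    Phi0[l] = phi(e(t + l*T)), Phi1[l] = phi(e(t + (l+1)*T)), and
    rho[l] is the integral of the local cost over the l-th interval;
    k_b > L rows are required for a unique solution."""
    A = np.asarray(Phi0) - np.asarray(Phi1)          # k_b x L regressor matrix
    W, *_ = np.linalg.lstsq(A, np.asarray(rho), rcond=None)
    return W                                         # weight vector W_{j+1}
```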

Policy Improvement Step
In the policy improvement step, from Equations (16) and (18), we determine an improved control policy by
$$ \mu_{j+1}(e) = -\frac{1}{2} R^{-1} g(e)^{\top} \nabla\phi(e)^{\top} W_{j+1}, \tag{21} $$
where ∇φ(e) ∈ R^{L×3} is the Jacobian of the basis function vector. The weight vector W_{j+1}, which has been updated in the policy evaluation step, determines the subsequent control policy. The policy improvement corresponds to the actor.
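A sketch of Equation (21), reusing the grad_phi Jacobian from the basis snippet above; g and R are the input matrix and input weight of the problem at hand.

```python
import numpy as np

def improved_policy(e, W, g, grad_phi, R):
    """Policy improvement (21): mu_{j+1}(e) = -1/2 R^{-1} g(e)' grad V(e),
    with grad V(e) = grad_phi(e)' W from the approximation (18)."""
    grad_V = grad_phi(e).T @ W                       # gradient of the value, in R^3
    return -0.5 * np.linalg.solve(R, g(e).T @ grad_V)
```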

Numerical example
This section provides numerical experiments of the IRL-based policy iteration for the error dynamics (5) with Algorithm 1, which obtains the auxiliary control input u in Equation (4) with the control policy (21). Note that the vector g(e) is known while the vector f(e) is unknown in Equation (5). We consider two types of reference trajectories in Subsections 5.1, 5.2, and 5.3: circular and straight-line trajectories with different lengths of the basis function vector. These trajectories come from constant v_rf and ω_rf. Subsection 5.4 discusses the selection of the basis function vector. We also consider a set of time-dependent v_rf and ω_rf in Subsection 5.5 that generates a figure-of-eight reference trajectory.
In all examples, we give initial learning weight vector W 0 and attempt to improve the control performance by updating the vector.

Circular trajectory (L = 9)
Let us consider the reference trajectory generated by the reference translational and angular velocities (v, ω) = (v_rf, ω_rf) in Equation (1). Given (v_rf, ω_rf) = (π/10, π/10), the corresponding reference trajectory (x_rf(t), y_rf(t), θ_rf(t)) is a circle of radius v_rf/ω_rf = 1 traversed at the constant rate ω_rf. The basis function vector is given by
φ(e) = [e_x², e_x e_y, e_x e_θ, e_y², e_y e_θ, e_θ², e_x³, e_y³, e_θ³]^T,
that is, L = 9. The weighting matrices Q and R in the local cost (7) are I_3 and I_2, respectively, the sampling period T is 0.1 (s), the batch size of the least squares k_b is 22, and the iteration threshold ε is 10^−7. The initial conditions are the following: (x(0), y(0), θ(0)) = (0.5, 0, 0) and W_0 = [0.2, 0, 0, 0.2, 0, 0.2, 0, 0, 0]^T, that is, W_0^T φ(e) = 0.2 e^T e. Then, we conduct a numerical experiment with the time from 0 to 20 (s). Figure 3 shows the robot and reference trajectories of the mobile robot. The solid and broken lines represent the robot and reference trajectories, respectively, while the dotted line represents the robot trajectory under the initial control policy. The dotted line seems to converge to the reference after one rotation; however, the trajectory does not converge exactly. On the other hand, the solid line converges to the reference within a half rotation. This figure shows that the policy iteration improves the tracking performance. Figures 4 and 5 show the tracking error and the control inputs, respectively. These figures illustrate that the algorithm works effectively to update the control policy. Figure 6 shows the transition of the learning weight vector W ∈ R^9. We can see that the learning weight vector converges after 6 iterations, which implies that the control policy also converges. In Figure 5, since the initial error is far from the equilibrium point (e_x, e_y, e_θ) = (0, 0, 0), we can see that, after the control policy update, spikes occur in the control inputs v and ω around 2 (s).
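For reference, a sketch of one such circular reference trajectory; the paper's initial reference state is not shown here, so the zero initial state (x_rf, y_rf, θ_rf)(0) = (0, 0, 0) assumed below is illustrative.

```python
import numpy as np

v_rf = w_rf = np.pi / 10          # reference velocities from Section 5.1

def reference(t):
    """Circle of radius v_rf / w_rf = 1 satisfying the kinematics (1),
    assuming the reference starts at the origin with zero heading."""
    return (np.sin(w_rf * t),                 # x_rf(t)
            1.0 - np.cos(w_rf * t),           # y_rf(t)
            (w_rf * t) % (2 * np.pi))         # theta_rf(t) in [0, 2*pi)
```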

Circular trajectory (L = 21)
For the same circular trajectory, we use another basis function vector as follows:
φ(e) = [e_x², e_x e_y, e_x e_θ, e_y², e_y e_θ, e_θ², e_x³, e_x² e_y, e_x e_y², e_x² e_θ, e_x e_θ², e_y³, e_y² e_θ, e_y e_θ², e_θ³, e_x⁴, e_x² e_y², e_x² e_θ², e_y⁴, e_y² e_θ², e_θ⁴]^T,
that is, L = 21.
Then, we conduct a numerical experiment with the time from 0 to 30 (s). Figure 9 shows the robot and reference trajectories. Figures 10 and 11 show the tracking error and the transition of the learning weight vector W ∈ R^21, respectively. From Figures 9 and 10, the tracking errors converge to zero asymptotically. From Figure 11, we can see that the learning weight vector converges after 7 iterations. This example shows that the mobile robot tracks the reference trajectory even when the basis function vector is different.

Discussion on selection of basis function vector
Let us comment on the selection of the basis function vector φ(e). The basis function vector needs to be chosen so that the parametric function is capable of sufficiently approximating the value functional. However, since the true value functional is unknown and there is no constructive way to choose the basis function vector, we adopt a heuristic approach. In Subsection 5.1, the basis function vector consists of monomials of e. If we select the second-order monomials alone, i.e. L = 6, then the learning weight vector diverges. Similarly, in Subsection 5.3, the basis function vector with L = 9 does not attain the control performance of that with L = 21. The basis function vector should be selected according to the reference trajectory and the initial conditions of the robot.
The experience with monomial basis vectors in the article [12] is that second-order monomials are enough to obtain convergent trajectories for linear dynamics. In that case, the learning weights corresponding to the third- and higher-order monomials vanish. On the other hand, second-order monomials are not enough for nonlinear dynamics and make the learning diverge. These observations are consistent with those in the examples in this paper.

Figure-of-eight trajectory
Let us give a time-dependent reference. The reference velocities are taken consistently with the trajectory below, namely
$$ v_{rf}(t) = \sqrt{\dot{x}_{rf}(t)^2 + \dot{y}_{rf}(t)^2}, \quad \omega_{rf}(t) = \dot{\theta}_{rf}(t), \quad \theta_{rf}(t) = \operatorname{atan2}(\dot{y}_{rf}(t), \dot{x}_{rf}(t)), $$
for t ≥ 0. Then, we have the corresponding reference trajectory (x_rf(t), y_rf(t)) = (1.0 + sin(t/10), 1.0 + sin(t/5)) (t ≥ 0), where θ_rf(t) is a complicated function of t. The weighting matrices Q and R are diag(1, 10, 0.01) and I_2, respectively. φ(e) and W_0 are the same as in Subsection 5.1, i.e. L = 9. T is 0.03 (s), k_b is 30, ε is 10^−3, and (x(0), y(0), θ(0)) = (0.7, 0.8, π/4). A numerical experiment is conducted with the time from 0 to 63 (s). Figure 12 shows the robot and reference trajectories. The trajectory with the IRL-based policy iteration converges to the reference, while the trajectory with the fixed initial control policy μ_0 obviously retains a tracking error. Since v_rf and ω_rf are time-dependent, there is no guarantee that the convergent policy is optimal. The convergence of this example depends on the threshold ε. If we take ε = 10^−7 as in the other examples, the trajectory diverges. From this observation, the optimal control policy is sensitive to time-dependent v_rf and ω_rf.
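A sketch of the corresponding reference velocities, obtained by differentiating the figure-of-eight trajectory; the closed-form derivatives below follow from x_rf = 1 + sin(t/10) and y_rf = 1 + sin(t/5).

```python
import numpy as np

def figure_eight_reference(t):
    """Reference velocities consistent with the figure-of-eight trajectory
    (x_rf, y_rf) = (1 + sin(t/10), 1 + sin(t/5))."""
    xd, yd = 0.1 * np.cos(t / 10), 0.2 * np.cos(t / 5)        # x_rf', y_rf'
    xdd, ydd = -0.01 * np.sin(t / 10), -0.04 * np.sin(t / 5)  # x_rf'', y_rf''
    v_rf = np.hypot(xd, yd)                         # translational velocity
    theta_rf = np.arctan2(yd, xd)                   # heading of the reference
    w_rf = (xd * ydd - yd * xdd) / (xd**2 + yd**2)  # d(theta_rf)/dt
    return v_rf, w_rf, theta_rf
```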

Conclusion
This paper adopted the IRL-based policy iteration to solve the trajectory tracking problem for a nonholonomic unicycle mobile robot. The paper adopted reference trajectories whose translational and angular velocities are constant to make the error dynamics a time-invariant system. The obtained control policy determines the translational and angular velocities that drive the robot based on the data collected along the robot trajectory. To do that, we approximated the value functional by a linear combination of monomials with learned coefficients. The data collected online train the coefficients as the parameters of the value function. In particular, this paper solved a batch least-squares minimization to update the control policy online. Numerical experiments demonstrated that the proposed algorithm updates the control policy from data and improves the tracking performance. Time-varying settings of the error dynamics based on time-dependent translational and angular velocities are a future direction.