Adaptive impedance control of robot manipulators based on Q-learning and disturbance observer

ABSTRACT In this paper, an adaptive impedance control scheme combined with a disturbance observer (DOB) is developed for a general class of uncertain robot manipulators in discrete time. Impedance control is applied to realize interaction force control of robot manipulators in unknown, time-varying environments. The optimal reference trajectory is produced by the impedance controller, and the impedance parameters are obtained using a Q-learning technique driven by trajectory tracking errors. Position control with the DOB is implemented so that the manipulator tracks the virtual desired trajectory. The DOB compensates for an unknown compounded disturbance function, which collects all uncertain terms and external disturbances, by bounding both the tracking error inputs and the compounded disturbance inputs within a permitted control region. Appropriate DOB parameters are selected using the linear matrix inequality (LMI) method. Based on Lyapunov analysis and Schur complement theory, both the impedance control and the bounded DOB control guarantee semiglobal uniform boundedness of the closed-loop robot system. Simulation results verify the effectiveness of the proposed combination of adaptive impedance control and DOB.


Introduction
Applications of robot manipulators have been extended to many fields, such as domestic service, medical care and industrial production, and robot manipulators are expected to work by interacting with fragile objects, other machines and even humans (Peshkin et al., 2001; Lambercy et al., 2007). On the one hand, the interaction takes place in unknown, time-varying and complex environments, which makes the trajectory tracking problem of nonlinear multiple-input multiple-output (MIMO) robot manipulators more difficult. On the other hand, most robot manipulators in practical applications have unmodeled dynamics and uncertainties (Lewis, Dawson, & Abdallah, 2004; Lewis, Jagannathan, & Yesildirak, 1998; Yang, Yang, Chen, & Na, 2016).
The problem of interaction control between robot manipulators and their working environment has become increasingly important. Studies of interaction control mainly involve force control and impedance control (Hogan, 1985). Compared with force control, impedance control focuses on selecting appropriate impedance parameters, and it is preferred in interaction tasks because it does not rely on a direction decomposition. Many research results on impedance control have been applied to robot manipulators over the last two decades. Impedance control was first proposed in Hogan (1985) to impose an ideal dynamic behaviour on the interaction between robot manipulator and environment. In Johansson and Spong (1994) and Matinfar and Hashtrudi-Zaad (2005), impedance control is investigated and the impedance parameters are selected by an optimal control method, the linear quadratic regulator (LQR). The resulting controllers achieve satisfactory trajectory tracking and force regulation, but they require the environment dynamics to be completely known. In Jung and Hsia (2010), Hosseinzadeh, Aghabalaie, Talebi, and Shafie (2010) and Li, Sam Ge, and Yang (2012), the desired impedance parameters are chosen as constants; however, in many tasks the interaction environment is time-varying, uncertain and unstructured, and conventional impedance control methods cannot incorporate such environment properties. Preliminary work on estimating impedance parameters for a robot manipulator working in an unknown environment is reported in Diolaiti, Melchiorri, and Stramigioli (2005), where a desired impedance model is constructed by precisely estimating the stiffness and damping parameters of the interaction environment.
CONTACT Chenguang Yang cyang@theiet.org
Furthermore, time-varying force control for a robot manipulator is investigated in Xie, Sun, Liu, Cheng and Liu (2009), where a cosine-wave reference force is tracked. However, impedance control is only briefly mentioned in the simulation section of that work, and no theoretical analysis is provided. An automatic cell injection system is proposed in Xie, Sun, Liu, Tse and Cheng (2010), with a focus on time-varying force trajectory tracking. However, the method is studied only for a one-link manipulator.
In our previous work (Ge, Li, & Wang, 2014), the developed method was verified only for time-invariant environment dynamics, so it is inapplicable to a time-varying environment interacting with the end-effector of the robot manipulator. In Wang, Li, Ge, and Lee (2015), optimal critic learning is proposed for unknown and time-varying environments; however, the effect of robot manipulator uncertainties on trajectory tracking is not considered.
To compensate for uncertainties, many research works focus on disturbance observers (DOBs) for states and external disturbances. In Wen, Zhou, Liu, and Su (2011), a robust adaptive control with DOB is designed for a class of nonlinear systems with uncertainty. The adaptive parameters are selected by saturating the input states and compensating for external disturbances. In Xu, Lu, Zhou, and Yang (2004), a DOB control based on input saturation and compensation for external disturbances is designed, and state feedback is added to the DOB control. In Yang, Fukushima, and Qin (2012), an adaptive robust control method is proposed for robot manipulators, in which a decentralized controller is designed by introducing a DOB and an adaptive sliding-mode term to compensate for the uncertainties of the robot manipulators.
Most DOB methods aim to compensate for external disturbances, and they have been widely used in trajectory tracking control of robot manipulators. However, most of these studies are carried out in continuous time. In Zeinali and Notash (2010), the dynamic model of the robot manipulator is divided into two terms, the known-structure dynamics and the unknown-structure dynamics. A known-system controller term is designed for the known term, and feedback and adaptive control terms are proposed to handle the unmodeled dynamics. In Chen (2011), neural network (NN) control is proposed, and satisfactory control performance is achieved by combining a neural fuzzy network, an observer and a sliding-mode method. In these studies, the stability of the closed-loop robot control system is guaranteed and satisfactory trajectory tracking performance is obtained. Moreover, digital controllers are applied to robot manipulators more and more extensively, and the fast execution of digital implementations is increasingly important in practical industrial applications. Recent research on nonlinear uncertain robot manipulators therefore focuses on trajectory tracking control in discrete time.
In Li, Ma, Yang, and Fu (2015a), an adaptive controller is designed for a class of discrete-time robot manipulators with unknown fixed terms or time-varying uncertain payload terms. Satisfactory control performance is obtained by estimating the external payload terms. However, that work assumes that the uncertain terms of the robot manipulators are bounded in a fixed range, and the controller structure is complex, which limits its application in practice.
Based on the above discussion, we extend our previous work and propose an adaptive impedance control based on Q-learning and a disturbance observer for an unknown, time-varying environment and an uncertain time-varying robot system. The objective of this paper is to achieve optimal trajectory tracking performance while requiring little knowledge of the environment and the robot dynamics. As discussed above, impedance iterative learning, adaptive impedance control and DOB methods have been developed and applied, but very few control methods address both environments with unknown time-varying parameters and robot manipulators with nonlinear uncertainties. This motivates the development, in the rest of this paper, of a novel trajectory tracking control using optimal impedance control with DOB.
We highlight the contributions of this paper as follows:
• The uncertain time-varying damping-stiffness environment is described as a linear stiffness system with unknown dynamic parameters.
• The optimal virtual desired reference trajectory is derived subject to unknown environment dynamics in Cartesian space by applying impedance control with Q-learning, and online adaptation of the impedance parameters is achieved.
• The optimal position trajectory that tracks the virtual desired reference trajectory is obtained subject to the uncertain robot system in joint space, and the compounded effect of uncertainties and disturbances is compensated by a DOB with saturation.
Throughout this paper, the notations used are detailed in Table 1.

Table 1. Notation.
$(\cdot)^{-1}$: the inverse of an $n$-order invertible matrix
$0_n$: $n$-dimensional zero vector
$0_{a \times b}$: $a \times b$-dimensional zero matrix
$I_m$: $m$-dimensional identity matrix
$x$: $n$-dimensional position vector
$f_e$: $n$-dimensional impedance force vector
$x_d$: $n$-dimensional desired trajectory in Cartesian space
$x_r$: $n$-dimensional virtual desired reference trajectory
$q$: $n$-dimensional joint position
$q_r$: $n$-dimensional virtual desired reference joint position
$\tau$: $n$-dimensional vector of control input torque
$\tau_e$: $n$-dimensional external force torque

System structure
The overall system studied in this paper consists of a class of rigid robot manipulators and an unknown time-varying environment.
A novel trajectory tracking control method is proposed that integrates adaptive impedance control with a DOB acting on both the trajectory tracking errors and all uncertain terms, in order to achieve satisfactory interaction performance and satisfactory trajectory tracking performance.
The control framework is shown in Figure 1. It consists of two parts: an optimal impedance control and a bounded DOB control.
In the first part, optimal interaction performance between the environment and the end-effector of the robot system is achieved by constructing a proper impedance model, and an optimal reference trajectory is provided to the second part as the virtual desired reference. However, it is extremely difficult to identify the time-varying parameters of the working environment. For this reason, this paper adopts Q-learning to derive the desired optimal impedance function.
In the second part, joint position control of the robot manipulator is implemented to track the virtual desired trajectory produced by the impedance control in the first part. Furthermore, the DOB is designed to approximate and compensate for all uncertainties and external disturbances of the robot manipulator.

System model
In this paper, we consider a system in which a class of rigid robot manipulators is physically interacting with an unknown time-varying environment.

Environment model
The second part of the overall system, the environment described by (1), is modelled as a typical damping-stiffness environment; the interaction between the environment and the robot is illustrated in Figure 2.
In this model, the contact parameters relate the end-effector position $x$ to the interaction force $f_e$ at the contact point, and $C_e$ and $K_e$ are the unknown time-varying damping and stiffness matrices of the dynamics, respectively. Following the environment model proposed in Wang et al. (2015), and letting $k$ denote the time-step index, the unknown time-varying environment dynamics in discrete time are given as follows: where $x(k) \in \mathbb{R}^n$ is the position state vector of the end-effector, $f_e(k) \in \mathbb{R}^n$ is the interaction force imposed by the environment, and $A_e(k)$ and $B_e(k)$ are the parameter matrices of the environment, which are unknown time-varying functions of the damping matrix $C_e$ and the stiffness matrix $K_e$. This kind of damping-stiffness model covers a wide range of environments in contact with a robot system; for example, it may represent a class of viscoelastic objects handled by robots.
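As a rough illustration of how a discrete-time model with matrices $A_e$ and $B_e$ can arise from damping and stiffness, the sketch below applies a forward-Euler step to a continuous contact law of the assumed form $f_e = C_e \dot{x} + K_e x$, which yields $x(k+1) = A_e x(k) + B_e f_e(k)$. The Euler discretization, the state-space form, the numerical values and the function name `euler_environment` are all assumptions made for illustration; the paper does not state how $A_e(k)$ and $B_e(k)$ are obtained.

```python
import numpy as np

def euler_environment(C_e, K_e, T):
    """Forward-Euler discretization of the assumed contact law
    f_e = C_e * dx/dt + K_e * x, giving x(k+1) = A_e x(k) + B_e f_e(k)."""
    n = C_e.shape[0]
    C_inv = np.linalg.inv(C_e)           # the damping matrix must be invertible
    A_e = np.eye(n) - T * C_inv @ K_e    # A_e = I - T * C_e^{-1} K_e
    B_e = T * C_inv                      # B_e = T * C_e^{-1}
    return A_e, B_e

# Hypothetical damping and stiffness values, frozen at a single time step.
C_e = np.diag([10.0, 10.0])
K_e = np.diag([100.0, 50.0])
A_e, B_e = euler_environment(C_e, K_e, T=0.01)

# With zero applied force the end-effector position relaxes towards the origin.
x = np.array([0.1, -0.05])
for _ in range(500):
    x = A_e @ x + B_e @ np.zeros(2)
```

In a time-varying setting the two matrices would simply be recomputed at every step from the current $C_e$ and $K_e$.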

Assumption 2.1: The environment parameters $A_e(k)$ and $B_e(k)$ are assumed to be unknown time-varying matrices, and the pair $(A_e(k), B_e(k))$ is stabilizable.
Compared with the previous studies in Matinfar and Hashtrudi-Zaad (2005) and Ge et al. (2014), the interaction force control and position control problems considered in this paper under Assumption 2.1 are more practical and more complicated.
A class of robot manipulators is required to achieve satisfactory interaction performance with the damping-stiffness environment model (1).

Impedance control
The impedance control method is introduced to obtain optimal control performance for (1), using Q-learning to approximate the impedance parameters.
In this paper, we adopt the desired target impedance model to implement impedance control in Cartesian space as follows (Li et al., 2012): where the model is specified by a target impedance function, $x_d(k) \in \mathbb{R}^n$ is the desired trajectory, $x_r(k) \in \mathbb{R}^n$ is the virtual reference trajectory of the robot end-effector in Cartesian space, and $e_{rd}(k) = x_r(k) - x_d(k)$ is the corresponding desired tracking error.
The end-effector of the robot manipulator is described in Cartesian space, and the intermediate links of the kinematic chain can also be represented in this space. The joints of the robot manipulator, however, are described in joint space. We therefore need to map between Cartesian space and joint space using inverse and forward kinematics. In this way, the virtual desired joint angles and virtual reference angles can be obtained according to (2).
Let $T$ denote the sampling interval and let the robot joint angles be $q \in \mathbb{R}^n$ in continuous time, with sampled joint angles $q(k) = q(t_k)$ at time $t_k = kT$. The relationship between the position in Cartesian space and the joint angles in joint space can be obtained by where $q_r(k) \in \mathbb{R}^n$ is the vector of virtual desired joint angles in joint space, and $\phi(\cdot)$ and $\psi(\cdot)$ are the backward (inverse) and forward kinematics functions of the robot manipulator, respectively. The position control target is to make $\lim_{k \to \infty} x(k) = x_r(k)$.
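To make the roles of the forward map $\psi(\cdot)$ and the backward map $\phi(\cdot)$ concrete, the following sketch implements both for a planar 2-link arm. The planar geometry, the link lengths `L1` and `L2`, and the function names are assumptions for illustration only; the paper does not give closed forms for $\psi$ and $\phi$.

```python
import numpy as np

L1, L2 = 0.3, 0.3  # hypothetical link lengths in metres (not from the paper)

def forward_kinematics(q):
    """psi(q): joint angles -> end-effector position of a planar 2-link arm."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def inverse_kinematics(p, flip_elbow=False):
    """phi(p): end-effector position -> joint angles (one of two branches)."""
    c2 = (p @ p - L1**2 - L2**2) / (2.0 * L1 * L2)
    c2 = np.clip(c2, -1.0, 1.0)          # guard against rounding just outside [-1, 1]
    q2 = np.arccos(c2)                   # branch with q2 in [0, pi]
    if flip_elbow:
        q2 = -q2                         # mirrored elbow configuration
    q1 = np.arctan2(p[1], p[0]) - np.arctan2(L2 * np.sin(q2), L1 + L2 * np.cos(q2))
    return np.array([q1, q2])

# A Cartesian reference x_r is mapped to a joint reference q_r = phi(x_r).
q0 = np.array([np.pi / 3, 2 * np.pi / 3])
x_r = forward_kinematics(q0)
q_r = inverse_kinematics(x_r)
```

Because the inverse map has two branches, a tracking controller would normally pick the branch closest to the current joint configuration.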

Robot manipulator model
In this paper, the end-effector of the robot manipulator physically interacts with the environment modelled in (1), and trajectory tracking control is considered for rigid robot manipulators with $n$ degrees of freedom (DOF). The robot dynamic model is described in continuous time as follows: where $q \in \mathbb{R}^n$, $\dot{q} \in \mathbb{R}^n$ and $\ddot{q} \in \mathbb{R}^n$ are the joint position, velocity and acceleration; $M(q) \in \mathbb{R}^{n \times n}$, $C(q, \dot{q}) \in \mathbb{R}^{n \times n}$ and $G(q) \in \mathbb{R}^n$ are the symmetric positive-definite inertia matrix, the Coriolis-centrifugal torque matrix and the gravity torque vector; and $\tau \in \mathbb{R}^n$ and $\tau_e = J_\tau^T(q) f_e(k) \in \mathbb{R}^n$ are the control input torque vector and the external force vector mapped to the generalized coordinates, with $J_\tau(q)$ the Jacobian matrix. The dynamic model (4) of the robot manipulator has the following properties (Lewis et al., 2004):
Property 2.1: The inertia matrix $M(q)$ is uniformly bounded; that is, with constants $g_1 > 0$ and $g_2 > 0$, $M(q)$ satisfies the following inequality
Property 2.2: The Coriolis-centrifugal torque matrix $C(q, \dot{q})$ and the gravity vector $G(q)$ are bounded by
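The boundedness of the inertia matrix can be checked numerically for a concrete arm. The sketch below uses the standard textbook inertia matrix of a planar 2-link manipulator and scans the joint range for the smallest and largest eigenvalues, assuming Property 2.1 takes the usual form $g_1 I \le M(q) \le g_2 I$. The masses, inertias and link length are hypothetical; only the 0.1 m mass-centre distances appear in the paper.

```python
import numpy as np

# Hypothetical parameters, except the 0.1 m mass-centre distances.
m1, m2 = 1.0, 1.0        # link masses in kg (assumed)
l1 = 0.2                 # length of link 1 in m (assumed)
lc1, lc2 = 0.1, 0.1      # mass-centre distances in m
I1, I2 = 0.01, 0.01      # link inertias in kg m^2 (assumed)

def inertia_matrix(q):
    """Inertia matrix M(q) of a planar 2-link arm (depends on q[1] only)."""
    c2 = np.cos(q[1])
    m11 = m1 * lc1**2 + m2 * (l1**2 + lc2**2 + 2.0 * l1 * lc2 * c2) + I1 + I2
    m12 = m2 * (lc2**2 + l1 * lc2 * c2) + I2
    m22 = m2 * lc2**2 + I2
    return np.array([[m11, m12], [m12, m22]])

# Scan the second joint angle and collect the eigenvalues of M(q).
eigs = np.array([np.linalg.eigvalsh(inertia_matrix(np.array([0.0, q2])))
                 for q2 in np.linspace(-np.pi, np.pi, 181)])
g1, g2 = eigs.min(), eigs.max()   # numerical estimates of the bounds
```

A strictly positive `g1` confirms that this particular $M(q)$ stays positive definite over the scanned range.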

Impedance adaptation learning
As discussed in Section 2, an impedance control based on the Q-learning method is proposed to obtain the optimal virtual desired reference trajectory $x_r(k)$.

Q-function construction
In this section, the desired trajectory in Cartesian space is generated by an exogenous system, and the Q-learning method is introduced to derive the optimal control without relying on prior information about the environment and the robot system. The traditional optimal control problem can be regarded as the special case in which the desired trajectory of the robot is zero. Further related robot studies are needed to make the two problems identical.
In particular, the following assumption is considered:
Assumption 3.1: Assume that the desired trajectory $x_d(k)$ is generated by the following exogenous system: where $\sigma(k) \in \mathbb{R}^n$ is an observable auxiliary vector, $W_d \in \mathbb{R}^{n \times n}$ and $U_d \in \mathbb{R}^{n \times n}$ are known matrices, and the pair $(W_d, U_d)$ is observable.
It is noted that a wide class of desired trajectories $x_d(k)$ can be generated by the exogenous system. To ensure the parameter convergence of the linear time-varying environment model (1) (Zhang, Ge, Hang, & Chai, 2000), we design a control input to formulate the optimal control problem as follows: where $L(k) \in \mathbb{R}^n$ is the control gain vector, which minimizes the system cost function defined in quadratic form: where $S \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{n \times n}$ are the weights of the end-effector position tracking error and the interaction force, respectively, and satisfy $S = S^T \ge 0$ and $R = R^T \ge 0$.
The stabilizing feedback gain vector $L(k)$ can be calculated from the solution sequence of the discrete-time algebraic Riccati equation (DARE). According to the heuristic dynamic programming in Landelius (1997), the solution sequence $P(k+1)$ is derived as follows. After enough iterations, $P(k+1)$ converges to the solution of the DARE.
Introducing the auxiliary state $\sigma(k)$ in (7) into the environment model (1), an extended state vector $\eta(k) \in \mathbb{R}^{2n}$ is defined as follows: The augmented matrices of system (1) can be defined as follows: Then, the environment model (1) can be rewritten as: The cost function (9) can correspondingly be rewritten as It is noted that the cost function $\bar{J}(k)$ depends on the extended system state $\eta(k)$ and the impedance force $f_e(k)$. Similarly, the control input law (8) can be rewritten as where $\bar{L} = [\bar{L}_1^T(k), \bar{L}_2^T(k)]^T$, and $\bar{L}_1(k) \in \mathbb{R}^{n \times n}$ and $\bar{L}_2(k) \in \mathbb{R}^{n \times n}$ are the control gains for the system state $x(k)$ and the auxiliary state $\sigma(k)$, respectively.
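The idea of iterating towards the DARE solution can be sketched with the standard Riccati value-iteration recursion for a time-invariant pair $(A, B)$ and a feedback of the form $f = Lx$. The recursion below is the textbook form and is only assumed to match the paper's update; the matrices are hypothetical, and a strictly positive `R` is used so that the inverse exists from the first step.

```python
import numpy as np

def riccati_iteration(A, B, S, R, iters=500):
    """Value iteration P <- S + A'PA - A'PB (R + B'PB)^{-1} B'PA from P = 0.
    Returns the converged P and the gain L with feedback f = L x."""
    P = np.zeros_like(S)
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = S + A.T @ P @ A - A.T @ P @ B @ G
    L = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L

# Hypothetical time-invariant stand-ins for A_e and B_e.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
S = np.eye(2)             # weight on the position error
R = np.array([[0.2]])     # weight on the interaction force
P, L = riccati_iteration(A, B, S, R)

# Residual of the DARE at the returned P; it should be numerically zero.
residual = (S + A.T @ P @ A
            - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A) - P)
closed_loop_radius = np.max(np.abs(np.linalg.eigvals(A + B @ L)))
```

A spectral radius of `A + B @ L` below one indicates that the resulting gain is stabilizing.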
According to Remark 2.1, the matrix $A_e(k) + B_e(k)L(k)$ has all its eigenvalues in the open unit disc, which also applies to the working environment (14).
We can define a cost-to-go function $V(x(k))$ with a quadratic form: The cost function (17) is minimized by finding the appropriate $f_e(k)$ in (16).
Assume that the optimal impedance force $f_e^* = \arg\min_{f_e(k)} V(k)$ exists. The corresponding cost-to-go function $V^*(k)$ in (17) is then quadratic and can be described as follows: where $P(k)$ is the solution sequence of the DARE, which is derived as follows. The control gain in (16) can be calculated from the solution sequence of the DARE, such that we have Considering our previous results in Wang et al. (2015) and the cost-to-go function in (17), a Q-function with quadratic form is introduced as follows: where $H(k)$ is a parameter matrix, which is written as follows where It is easy to prove that the matrix $H$ describing the Q-function is positive semi-definite.
The goal of $f_e(k)$ is to determine the optimal control law: Note that the corresponding Q-function $Q^*(\eta(k), f_e(k)) = \min_{f_e(k)} Q(\eta(k), f_e(k))$ is also quadratic when $Q^*(\eta(k), f_e^*(k))$ exists: Employ the gradient-based optimization algorithm as below: Comparing (23) with (26), the optimal control policy is acquired as: Noting (21) and (22), if the parameter $H(k)$ can be obtained by an identification method, then the system dynamic parameters are no longer needed. In particular, (21) equals (25) when $f_e^*(k)$ exists, and the optimal performance is achieved.
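The statement that the gain follows from $H$ alone can be illustrated with the standard LQR Q-function. In the sketch below, $H$ is assembled from a known model only to verify the relation; the block structure of $H$, the sign convention $f = Lx$ and all matrices are assumptions in the usual textbook form, not the paper's exact expressions.

```python
import numpy as np

def solve_dare(A, B, S, R, iters=500):
    """Riccati value iteration, used here only to obtain a reference P."""
    P = np.zeros_like(S)
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = S + A.T @ P @ A - A.T @ P @ B @ G
    return P

def q_matrix(A, B, S, R, P):
    """H such that Q(x, f) = [x; f]' H [x; f] = x'Sx + f'Rf + (Ax+Bf)'P(Ax+Bf)."""
    H_xx = S + A.T @ P @ A
    H_xf = A.T @ P @ B
    H_ff = R + B.T @ P @ B
    return np.block([[H_xx, H_xf], [H_xf.T, H_ff]])

def greedy_gain(H, n):
    """Minimizing the quadratic Q over f gives f = L x with L = -H_ff^{-1} H_fx."""
    return -np.linalg.solve(H[n:, n:], H[n:, :n])

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # hypothetical model
B = np.array([[0.0], [0.1]])
S, R = np.eye(2), np.array([[0.2]])
P = solve_dare(A, B, S, R)
H = q_matrix(A, B, S, R, P)
L = greedy_gain(H, n=2)

# At the optimal input the Q-function collapses to the cost-to-go x'Px.
x = np.array([1.0, -2.0])
z = np.concatenate([x, L @ x])
q_opt, v_opt = z @ H @ z, x @ P @ x
```

Once $H$ is identified from data, `greedy_gain` needs no knowledge of `A` or `B`, which is the point made in the paragraph above.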
By introducing $f_e(k)$ into (21) and (25), and based on the above discussion, $Q(\eta(k), f_e(k))$ converges to $Q^*(\eta(k), f_e^*(k))$ under the optimal control input $f_e^*(k)$.
The optimal input $f_e^*(k)$ satisfies a time-varying temporal difference equation as follows: The unknown, time-varying environment (1) has damping-stiffness dynamics, which makes the use of traditional impedance control for such systems more complex. Considering the structure of the Q-function in (29) or (21), the optimal impedance control is therefore obtained by Q-learning in discrete time.

Impedance adaptation control with Q-learning
In this subsection, we employ a successive Q-learning method to solve (10) and obtain the sequence matrix $P(k)$ (Wang et al., 2015), and the impedance parameters are obtained by applying the Q-learning method. The algorithm is summarized as follows:
(a) Choose a stable control vector $u^0(k)$ when the iteration index $j = 0$.
(b) The evaluation solver of $Q(\eta(k), f_e(k))$ at the $(j+1)$th iteration is calculated as follows: where $H^{j+1}(k)$ is the approximation of $H(k)$ at the $(j+1)$th iteration and $D = [S\ 0;\ 0\ R]$.
(c) The solution sequence at the current iteration is obtained by solving the DARE (19).
(d) $L^j$ is obtained by solving (20).
(e) The control vector is updated by (31).
(f) Update $j \leftarrow j + 1$, and go back to (30).
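The loop of steps (a) to (f) resembles policy iteration on a quadratic Q-function. A minimal sketch of that idea is given below for a time-invariant stand-in environment: each iteration fits $H$ by batch least squares from sampled transitions and then improves the gain. The batch least-squares fit (instead of the recursive estimator used in the paper), the random sampling of states and forces, the function names and all numerical values are assumptions for illustration.

```python
import numpy as np

def quad_features(z):
    """Monomials z_i z_j (i <= j); off-diagonal terms are doubled so that
    quad_features(z) @ h equals z' H z for the symmetric H packed in h."""
    n = len(z)
    return np.array([(1.0 if i == j else 2.0) * z[i] * z[j]
                     for i in range(n) for j in range(i, n)])

def unpack(h, n):
    """Rebuild the symmetric n x n matrix from its packed upper triangle."""
    H = np.zeros((n, n))
    idx = 0
    for i in range(n):
        for j in range(i, n):
            H[i, j] = H[j, i] = h[idx]
            idx += 1
    return H

def q_learning_lqr(step, S, R, L0, n, m, iters=15, samples=60, seed=0):
    """Policy iteration on Q(x, f) = [x; f]' H [x; f] using only sampled data."""
    rng = np.random.default_rng(seed)
    L = L0.copy()
    for _ in range(iters):
        rows, targets = [], []
        for _ in range(samples):
            x = rng.standard_normal(n)            # sampled state
            f = rng.standard_normal(m)            # exploratory force
            x_next = step(x, f)                   # observed transition
            z = np.concatenate([x, f])
            z_next = np.concatenate([x_next, L @ x_next])
            # Temporal difference relation: Q(z) - Q(z_next) = stage cost.
            rows.append(quad_features(z) - quad_features(z_next))
            targets.append(x @ S @ x + f @ R @ f)
        h, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
        H = unpack(h, n + m)
        L = -np.linalg.solve(H[n:, n:], H[n:, :n])   # policy improvement
    return L, H

# Hypothetical environment; the learner only calls step(), never reads A or B.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
S, R = np.eye(2), np.array([[0.2]])
step = lambda x, f: A @ x + B @ f
L_learned, H_learned = q_learning_lqr(step, S, R, L0=np.zeros((1, 2)), n=2, m=1)
```

The initial gain `L0` must be stabilizing, which mirrors step (a); here a zero gain suffices because the assumed `A` is already stable.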
To obtain the approximate solution of (30) and achieve optimal control performance, the recursive time-varying least-squares method is applied.
The $(j+1)$th step of the Q-function is introduced as follows: where $h^{j+1}(k) = \mathrm{vec}(H^{j+1}(k))$, in which $\mathrm{vec}(\cdot)$ denotes the linear transformation that converts a matrix into a column vector, and $h^{j}(k+1) = \mathrm{vec}(H^{j}(k+1))$. Then, we can rewrite (32) in the following linear-in-parameters form: where $\theta(k)$ is the system parameter vector and $\varphi(k)$ is the regression vector.
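The reason a quadratic Q-function is linear in $\mathrm{vec}(H)$ is the identity $z^T H z = (z \otimes z)^T \mathrm{vec}(H)$. The short check below verifies it numerically; the helper names `vec` and `regressor` and the random test data are illustrative choices, and the paper's regression vector may be arranged differently.

```python
import numpy as np

def vec(H):
    """Stack the columns of H into a single vector."""
    return H.reshape(-1, order="F")

def regressor(z):
    """Kronecker product z (x) z, the regressor paired with vec(H)."""
    return np.kron(z, z)

rng = np.random.default_rng(0)
z = rng.standard_normal(3)
M = rng.standard_normal((3, 3))
H = M + M.T                          # any symmetric parameter matrix

q_direct = z @ H @ z                 # quadratic form evaluated directly
q_linear = regressor(z) @ vec(H)     # same value, linear in vec(H)
```

Since the second expression is linear in the unknown entries of $H$, standard least-squares estimators apply directly.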
The parameter estimate $\hat{\theta}(k)$ that minimizes (34) is given by where the estimation gain matrix is designed as follows: where $N(k)$ is the covariance matrix at the $k$th step. To prevent $N(k)$ from becoming too close to singular, we define two positive scalars and assume that $\lambda_{\min}(N(k))$ is bounded above by the second one. The covariance matrix is then designed as follows: where $\lambda(\cdot)$ denotes an eigenvalue of a matrix.
Based on the above discussion of the impedance control policy and Q-learning, and considering the exogenous system (7) of the desired trajectory $x_d(k)$ and the impedance control (27), we rewrite (27) as follows: Comparing the optimal impedance control (39) with the desired target impedance model (2), it can be seen that the considered damping-stiffness environment has been reduced to a stiffness environment. The proposed adaptive impedance control using Q-learning thus simplifies the structure of the interaction environment model: only the stiffness term remains, and it achieves optimal interaction performance between the damping-stiffness environment and the robot manipulator.
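A recursive least-squares estimator with covariance resetting can be sketched as follows. The update is the standard RLS form, and the reset rule (restore $N$ to `rho0 * I` once its smallest eigenvalue drops to `rho1` or below) is one common reading of the safeguard described above; the names `rho0` and `rho1`, their values and the test signal are assumptions rather than the paper's exact design.

```python
import numpy as np

def rls_step(theta, N, phi, y, rho0=1000.0, rho1=0.01):
    """One recursive least-squares update followed by a covariance reset check."""
    g = N @ phi / (1.0 + phi @ N @ phi)        # estimation gain
    theta = theta + g * (y - phi @ theta)      # correct by the prediction error
    N = N - np.outer(g, phi @ N)               # covariance update
    reset = bool(np.linalg.eigvalsh(N).min() <= rho1)
    if reset:
        N = rho0 * np.eye(len(theta))          # keep N away from singularity
    return theta, N, reset

# Identify a constant parameter vector from noise-free regression data.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -1.2, 2.0])
theta, N, resets = np.zeros(3), 1000.0 * np.eye(3), 0
for _ in range(300):
    phi = rng.standard_normal(3)               # regression vector
    y = phi @ theta_true                       # measured output
    theta, N, reset = rls_step(theta, N, phi, y)
    resets += int(reset)
```

Resetting keeps the estimator responsive, which matters when the true parameters drift as they do in a time-varying environment.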

Discrete-time trajectory tracking controller of robot manipulator
Considering the robot dynamic model (4) and the virtual desired reference trajectory $x_r(k)$ in Section 2, the joint reference $q_r(k)$ can be derived according to (2). Define $\bar{q} = [q^T, \dot{q}^T]^T \in \mathbb{R}^{2n}$; the dynamics corresponding to model (4) are written as (Li, Ma, Yang, & Fu, 2015b): Here $T$ is the sampling time and $q(k)$ is the sampled joint angle, as defined in (3), and $v(k) = \dot{q}(t_k)$ is the sampled joint velocity, $\tau(k) = \tau(t_k)$ is the control torque and $\tau_e(k) = \tau_e(t_k)$ is the external torque at the sampling instant $t_k = kT$. The equivalent robot dynamics can be derived as where $\hat{G}(k)$ is the gravity torque vector in discrete time, and the two matrices of dimensions $2n \times 2n$ and $2n \times n$ are the discrete-time counterparts of the corresponding continuous-time matrices in (40).
At each sampling time $t_k = kT$, these matrices are determined, and their estimates can be obtained via a numerical method.
Further, we define $u(k) = \tau(k) - \hat{G}(k) \in \mathbb{R}^n$ as the corresponding control input in the presence of uncertainties and disturbances, such that the standard dynamics corresponding to system model (41) are derived as follows: where $d(k) \in \mathbb{R}^{2n}$ is a bounded external disturbance vector. Assume that the matching conditions, for example the structural properties of the system, are satisfied; then all uncertain terms are guaranteed to lie in a range space. An unknown function vector $F(k) \in \mathbb{R}^{2n}$, which collects the uncertainties and external disturbances in (45), is defined as follows: Substituting (46) into (45) yields To track the reference trajectory $q_r(k)$, a new error vector is defined as $\xi_e(k) = \bar{q}(k) - \bar{q}_r(k) \in \mathbb{R}^{2n}$ with $\bar{q}_r(k) = [q_r(k), \dot{q}_r(k)]$. Therefore, an error dynamics model can be described as follows: where an auxiliary vector in $\mathbb{R}^{2n}$ is built from the references $\bar{q}_r(k)$ and $\bar{q}_r(k+1)$, and $F(k)$ can be formulated under the following assumption.
Assumption 4.1: The unknown complicated function $F(k)$ in (48) can be formulated as the following exogenous system: where $w(k) \in \mathbb{R}^{2n}$ is the observer parameter, and $W_f \in \mathbb{R}^{2n \times 2n}$ and $U_f \in \mathbb{R}^{2n \times 2n}$ are auxiliary matrices.
With respect to the error dynamics represented in (48), we make the following assumptions.

Assumption 4.2: The function $F(k)$ and its partial derivatives are continuous and locally uniformly bounded in the Euclidean norm as follows: with $F^* > 0$ a constant.
Assumption 4.3: Considering Property 2.2 and the bounded reference trajectory, we assume that the auxiliary reference-dependent vector in (48) is bounded as follows: with a positive constant as its bound.
To compensate for the effect of robot uncertainties, we introduce the saturation method to design the position controller. The saturation function is considered as follows:
Assumption 4.4: We assume $\mathrm{sat}(\varphi(k))$ is a saturated nonlinear function, and we define $\mathrm{sat}(\varphi(k))$ as Define an auxiliary control vector $\tau_u(k) = \mathrm{sat}(K_1 \xi_e(k) + K_2 \hat{F}(k))$, in which $K_1 \in \mathbb{R}^{n \times 2n}$ and $K_2 \in \mathbb{R}^{n \times 2n}$ are feedback gain matrices and $\hat{F}(k)$ is the estimate of the unknown complicated function $F(k)$. A bounded controller for system (48) is designed as follows: where the superscript $+$ denotes the pseudo-inverse of a matrix. To design the bounded, saturated disturbance observer, we apply the following lemmas and definition:
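A saturated auxiliary control of the form $\tau_u(k) = \mathrm{sat}(K_1 \xi_e(k) + K_2 \hat{F}(k))$ can be sketched with element-wise clipping. The symmetric clipping rule is assumed here as the usual definition of $\mathrm{sat}(\cdot)$, and the gains, limits and state values are hypothetical numbers for a 2-joint example.

```python
import numpy as np

def sat(v, v_max):
    """Element-wise saturation: each v_i is clipped to [-v_max_i, v_max_i]."""
    return np.clip(v, -v_max, v_max)

# Hypothetical n x 2n gains for n = 2 joints (state = positions, then velocities).
K1 = np.array([[-100.0, 0.0, -20.0, 0.0],
               [0.0, -100.0, 0.0, -20.0]])
K2 = np.array([[0.0, 0.0, -1.0, 0.0],
               [0.0, 0.0, 0.0, -1.0]])

xi_e = np.array([0.1, -0.02, 0.0, 0.0])   # tracking error vector
F_hat = np.zeros(4)                       # disturbance estimate, zero here
tau_max = np.array([5.0, 5.0])            # per-joint torque limits

tau_u = sat(K1 @ xi_e + K2 @ F_hat, tau_max)
```

In this example the first joint command is limited by saturation while the second passes through unchanged.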
We assume that $v_i = V_{i,1} \xi_e(k) + V_{i,2} \hat{F}(k)$ satisfies $|v_i| \le \tau_{ui\max}$, such that the control input $\tau_u(k)$ in (53) can be saturated at $\tau_{ui\max}$. We further define the following saturated control input $\tau_u(k)$ as follows. The estimate $\hat{F}(k)$ of $F(k)$ is obtained by designing the following observer: where $w(k)$ is defined in (49), $b(k) \in \mathbb{R}^{2n}$ is an auxiliary vector of the observer, and $K_3 \in \mathbb{R}^{2n \times 2n}$ is designed as a feedback gain matrix. Noting (49), (48) and (56), the estimation error of the uncertain terms, $\tilde{F}(k) = \hat{F}(k) - F(k)$, is derived as Substituting (53) into (48), the closed-loop system is formulated by where (58) and the uncertain error (57) are combined and formulated as: The stability of the controller and the control performance are established by the proof in the next section.
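The interplay between a saturated feedback law and a disturbance estimate can be illustrated on a one-joint double integrator with a constant matched disturbance. The observer below is a generic residual-based update, not the paper's observer (56) with its auxiliary vector $b(k)$; the plant matrices, gains `K1` and `K3`, the limit `u_max` and the disturbance value are all hypothetical.

```python
import numpy as np

T = 0.01
Phi = np.array([[1.0, T], [0.0, 1.0]])   # Euler-discretized double integrator
Gamma = np.array([[0.0], [T]])           # input matrix
F_true = np.array([0.0, 0.005])          # unknown constant matched disturbance

K1 = np.array([[-100.0, -20.0]])         # feedback gain on the tracking error
K3 = 0.2 * np.eye(2)                     # observer gain
u_max = 5.0                              # saturation level
Gamma_pinv = np.linalg.pinv(Gamma)       # pseudo-inverse used for compensation

xi = np.array([0.1, 0.0])                # initial tracking error
F_hat = np.zeros(2)                      # initial disturbance estimate
for _ in range(2000):
    # Saturated control: error feedback minus the estimated disturbance.
    u = np.clip(K1 @ xi - Gamma_pinv @ F_hat, -u_max, u_max)
    xi_next = Phi @ xi + Gamma @ u + F_true
    # The one-step residual of the nominal model exposes the disturbance.
    residual = xi_next - Phi @ xi - Gamma @ u
    F_hat = F_hat + K3 @ (residual - F_hat)
    xi = xi_next
```

With a constant disturbance the estimation error shrinks by the factor `1 - 0.2` at each step, so the compensation becomes exact and the tracking error decays to zero.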

Controller realization and stability analysis
To guarantee that the closed-loop control system (59) is asymptotically stable, the design parameter matrices $K_1$, $K_2$, $K_3$, $V_1$, $V_2$ of the bounded observer can be obtained by applying the Schur complement lemma and stability analysis. In this section, the feedback gain matrix $K_1$ and the observer gain matrix $K_2$ are derived by applying LMI theory, and the stability of the closed-loop system and its robust control performance under uncertainty, nonlinearity and time variation are established.
We define the Lyapunov function as follows: where the symmetric positive-definite matrix $\bar{P} \in \mathbb{R}^{4n \times 4n}$ guarantees that the closed-loop system is stable. Further, assume that the matrix $\bar{P}$ exists, and define it as follows, with $Q_1^{-1} = \bar{P}_1 \in \mathbb{R}^{2n \times 2n} > 0$ and $\bar{P}_2 \in \mathbb{R}^{2n \times 2n} > 0$. Then, we have $\Delta V(k) = V(k+1) - V(k)$, which can be further analyzed as follows: where $S_1$ is a matrix given by: It is obvious that $\Delta V < 0$ in (64) holds if $S_1 < 0$. Applying the Schur complement Lemma 5.1, a new matrix $S_2 < 0$ can be obtained from $S_1 < 0$, with $S_2 < 0 \Leftrightarrow S_1 < 0$, such that the matrix $S_2$ can be derived as where '$*$' in $S_2(i, j)$ denotes the transpose of $S_2(j, i)$, with $i$ and $j$ the row and column indices, respectively; the same convention is used in the following expressions.
Thus, $\Delta V < 0$ holds if and only if $S_2 < 0$, provided that a symmetric positive-definite matrix $\bar{P}$ exists. Moreover, while ensuring that the system is stable, the design parameters of the bounded observer are obtained through the following computation and analysis.
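The equivalence used in this step is the Schur complement condition: a symmetric block matrix $[S_{11}, S_{12}; S_{12}^T, S_{22}]$ is negative definite if and only if $S_{22} < 0$ and $S_{11} - S_{12} S_{22}^{-1} S_{12}^T < 0$. The sketch below checks this numerically on small test matrices; the function names and the test matrices are illustrative and unrelated to the paper's $S_1$ and $S_2$.

```python
import numpy as np

def is_negative_definite(M):
    """True when every eigenvalue of the symmetric matrix M is negative."""
    return bool(np.all(np.linalg.eigvalsh(M) < 0))

def schur_negative_definite(S11, S12, S22):
    """Test [[S11, S12], [S12', S22]] < 0 through the Schur complement:
    S22 < 0 and S11 - S12 S22^{-1} S12' < 0."""
    if not is_negative_definite(S22):
        return False
    return is_negative_definite(S11 - S12 @ np.linalg.solve(S22, S12.T))

# A negative definite test matrix: both tests should agree and return True.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
M_neg = -(G @ G.T + np.eye(4))
direct = is_negative_definite(M_neg)
via_schur = schur_negative_definite(M_neg[:2, :2], M_neg[:2, 2:], M_neg[2:, 2:])

# An indefinite test matrix: both tests should return False.
M_bad = np.block([[np.eye(2), np.zeros((2, 2))],
                  [np.zeros((2, 2)), -np.eye(2)]])
bad_direct = is_negative_definite(M_bad)
bad_schur = schur_negative_definite(M_bad[:2, :2], M_bad[:2, 2:], M_bad[2:, 2:])
```

This is the mechanism that turns a condition that is nonlinear in the unknown matrices into an LMI that a solver can handle.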
Substituting (63) and (59) into (67), we have Furthermore, we define two auxiliary block-diagonal matrices, $\mathrm{diag}\{Q_1, I_{2n}, I_{2n}, I_{2n}, I_{2n}\}$ and $\mathrm{diag}\{I_{2n}, I_{2n}, I_{2n}, I_{2n}, \bar{P}_2\}$. Applying congruence transformations to $S_3$, first with the first matrix and then with the second, a new matrix $S_4$ can be obtained as It is shown that $\Delta V(k) < 0$ if and only if $S_4 < 0$, which implies that $q(k) \to q_r(k)$ and $\tilde{w}(k) \to 0$ as $k \to \infty$. Thus, the following theorem is derived:
Theorem 5.1: Given auxiliary matrices $U_f$ and $W_f$, if there exist symmetric positive-definite matrices $\bar{P}_1 = Q_1^{-1} > 0$ and $\bar{P}_2 > 0$ and matrices $X_1$, $X_2$, $X_3$ such that $S_4 < 0$, then the closed-loop robot system (59) is asymptotically stable under the impedance control and the bounded observer, and the controller is robust to the uncertainties of the robot manipulator when the parameters are designed as follows:

Simulation studies
To verify the validity of the proposed control method, a 2-DOF rigid robot manipulator is considered, whose end-effector physically interacts with the damping-stiffness environment. The parameters of the 2-DOF rigid robot manipulator are given in Table 2.

Table 2. Parameters of the 2-DOF rigid robot manipulator.
$l_{c1}$: mass centre distance of link 1, 0.1 m
$l_{c2}$: mass centre distance of link 2, 0.1 m

The damping-stiffness environment model is described in (69): In joint space, the initial robot coordinates are given as $q_r(0) = q(0) = [\pi/3, 2\pi/3]^T$. It is noted that the initial position in Cartesian space is By applying LMI theory, the following parameters are obtained: The interaction force between the environment and the end-effector of the robot manipulator is regulated to act along the x-axis and the y-axis. The desired trajectory in Cartesian space is determined with $U_d = 1$ and $W_d = 1$.
To verify the effectiveness of the proposed combination of adaptive impedance control and DOB, the LQR method is applied to obtain the desired impedance control based on the DARE, with the environment parameters $A_e(k)$ and $B_e(k)$ assumed known in simulation. The LQR result is compared with the desired impedance obtained by the proposed Q-learning method, which does not rely on knowledge of the environment.
We design the saturated observer based on the system state $\xi_e(k)$ and the estimate $\hat{F}(k)$ of the unknown function. The following simulations are carried out with the sampling interval $T = 0.01$ s.
To show the effectiveness of the proposed method, the interaction performance and trajectory tracking control obtained with the design parameters $K_1$, $K_2$, $K_3$, $V_1$, $V_2$ above are shown in Figures 3-7.
In Cartesian space, the simulation results of the impedance control based on Q-learning are shown in Figures 3-5, with the weights given by $S = 1$ and $R = 0.2$. Figure 3 shows and compares the convergence of the control gain $\bar{L}$. Figure 4 shows that the interaction force $f_e$ obtained by the proposed impedance control method achieves high tracking performance. Figure 5 shows the cost-to-go obtained with the proposed method, which converges satisfactorily to zero.
In joint space, the simulation results of the impedance control based on DOB are shown in Figures 6 and 7. Figure 6 shows the actual joint position trajectories $q_1$ and $q_2$ compared with the desired virtual reference trajectories $q_{r,1}$ and $q_{r,2}$, and Figure 7 shows the position tracking errors of $q_1$ and $q_2$ with respect to $q_{r,1}$ and $q_{r,2}$.
The simulation results show initial errors because the adaptive adjustment takes time: the trajectories deviate slightly from the desired reference trajectories for less than 5 s. Control performance at the initial stage can be improved if some prior knowledge of the environment and the uncertain robot is available, so that the initial parameters can be properly selected.

Conclusion
In this paper, a new method is proposed to realize interaction force control of uncertain robot manipulators in unknown environments. Adaptive impedance control is introduced to obtain the optimal virtual reference trajectory, and the impedance parameters are adjusted by the Q-learning method in Cartesian space. Position control with a bounded DOB is used to obtain the optimal trajectory that tracks the virtual reference trajectory in joint space, and the effect of uncertainties and disturbances is compensated by bounding them in a permitted control region. The method combines Q-learning and DOB to realize impedance adaptation, so that optimal trajectory tracking performance is obtained in both Cartesian space and joint space, with the optimal impedance parameters of the system selected online without prior knowledge of either the environment dynamics or the robot dynamics. Simulation results verify the effectiveness of the proposed adaptive impedance control method.

Disclosure statement
No potential conflict of interest was reported by the authors.