Undiscounted Control Policy Generation for Continuous-Valued Optimal Control by Approximate Dynamic Programming

We present a numerical method for generating the state-feedback control policy associated with general undiscounted, constant-setpoint, infinite-horizon, nonlinear optimal control problems with continuous state variables. The method is based on approximate dynamic programming, and is closely related to approximate policy iteration. Existing methods typically terminate based on the convergence of the control policy, and either require a discounted problem formulation or require the cost function to lie in a specific subclass of functions. The presented method extends existing termination criteria by requiring both the control policy and the resulting system state to converge, allowing for use with undiscounted cost functions that are bounded and continuous. This paper defines the numerical method, derives the relevant underlying mathematical properties, and validates the numerical method with representative examples. A MATLAB implementation with the shown examples is freely available.

Index terms: Approximate dynamic programming, Control policy, Undiscounted infinite-horizon, Optimal control

I. Introduction
Practical methods for generating the optimal control policy (i.e. the state-feedback function) for general nonlinear optimal control problems are useful tools for control engineers. If the optimal control policy is known, a real-time optimal controller can be implemented on very computationally limited hardware, as the optimal control signal can be generated simply by interpolating the pre-computed optimal control based on the current system state. However, one practical difficulty lies in precomputing the optimal control policy, which can be very computationally expensive. Although several methods for solving this class of problems are well-studied, dynamic programming (DP) variants being one example, they all have associated limitations or drawbacks. Policy iteration is one extensively studied variant of DP (e.g. Bertsekas 2017, p. 246; Puterman 1994, p. 295; Puterman and Brumelle 1979) that has been used for over 40 years for finding the optimal control policy for discrete-valued, non-linear, infinite-horizon problems, i.e. where the state and control variables are taken from discrete sets.
Approximate dynamic programming (ADP) is another well-known extension of DP (see for instance Powell (2009) for a general introduction) that approximates the cost function using a prescribed set of basis functions. One group of ADP methods approximates the cost function by interpolating costs and optimal controls between discrete gridded points (e.g. Munos and Moore (2002); Santos and Vigo-Aguiar (1998)). This approach allows for extending DP to applications with continuous state variables.
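To make the interpolation step concrete, the following sketch shows bilinear (two-dimensional multilinear) interpolation of a control policy tabulated on a uniform grid. The grid, the toy linear policy, and all names are illustrative assumptions, not part of the methods cited above.

```python
# Sketch: bilinear interpolation of a gridded control policy, as used by
# interpolating ADP schemes for states between grid points.

def bilinear_policy(x, grid0, grid1, mu_table):
    """Interpolate the tabulated policy mu_table (indexed [i][j]) at the
    real-valued state x = (x0, x1). Grids must be uniformly spaced and
    x must lie inside the gridded region."""
    def locate(v, grid):
        d = grid[1] - grid[0]                       # uniform grid spacing
        i = min(int((v - grid[0]) / d), len(grid) - 2)
        return i, (v - grid[i]) / d                 # cell index, local coordinate
    i, s = locate(x[0], grid0)
    j, t = locate(x[1], grid1)
    return ((1 - s) * (1 - t) * mu_table[i][j]
            + s * (1 - t) * mu_table[i + 1][j]
            + (1 - s) * t * mu_table[i][j + 1]
            + s * t * mu_table[i + 1][j + 1])

# Toy policy u = -(x0 + 0.5*x1) tabulated on a coarse grid:
grid0 = [-1.0, 0.0, 1.0]
grid1 = [-1.0, 0.0, 1.0]
mu = [[-(a + 0.5 * b) for b in grid1] for a in grid0]
print(bilinear_policy((0.3, -0.4), grid0, grid1, mu))  # ~ -0.1; exact for a linear policy
```

Bilinear interpolation reproduces any policy that is linear in the state exactly; for nonlinear policies the error shrinks with the grid spacing, which is the trade-off the gridded ADP methods above exploit.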
Faced with the problem of finding the approximately optimal control policy for continuous-valued, non-linear, infinite-horizon problems, one might attempt to use traditional policy iteration in concert with ADP. However, this is problematic, as traditional policy iteration requires the sets of states and controls to be discrete (i.e. finite) in order to terminate, while the interpolation performed with ADP leads to a continuous (i.e. infinite) number of possible states and controls. This has led to the development of several methods that can be broadly classified as approximate policy iteration (API) methods, where the termination criterion of conventional policy iteration is altered in order to terminate in finite time and generate an approximately optimal solution.
There are several excellent papers that consider different variants of API. However, the vast majority of these are limited to the case where the cost function is discounted, i.e. where future costs are successively weighted less and less (Bertsekas, 2011; Santos & Rust, 2004; Scherrer, 2014; Stachurski, 2008). Though a discounted cost function may be relevant for some problems and allows for more easily determining a termination criterion, a sizeable portion of optimal control problems are better formulated as undiscounted problems (e.g. minimum fuel/energy/time problems, or yield maximisation for chemical plants and cultivation). Guo, Si, Liu, and Mei (2017) introduce one API method for the undiscounted case from a reinforcement learning perspective, but this method is limited both in that the cost function must be a sum of a positive definite function of the state and a quadratically weighted function of the controls, and in that the state and control cannot be arbitrarily constrained.
In this paper we will introduce a method similar to API schemes that approximates the solution to the infinite-horizon problem by instead solving a finite-horizon problem. More specifically, the method uses conventional interpolating ADP to approximate the undiscounted, infinite-horizon, non-linear, optimal control problem where the state is constrained to converge to a unique equilibrium. The primary contribution of this paper is a termination criterion that terminates at a suitable horizon without requiring the presence of a discount factor, while also allowing for (nearly) arbitrary costs, constraints, and problem dynamics, a combination that is novel to the best of the authors' knowledge. The method's sole tuning parameter controls the trade-off between memory consumption, computational time, and accuracy. This allows the method to be used without in-depth knowledge of its inner workings. Furthermore, as the method's output is the optimal control policy (i.e. the optimal control tabulated by the system state), subsequent on-line control can be implemented using a computationally fast interpolation operation.
The structure of this paper is as follows: in Section II we define the problem studied in this paper and the structure of the interpolating ADP method we subsequently base our presented method on. We will assume a working knowledge of ADP methods for optimal control. Sundstrom and Guzzella (2009) give a straightforward introduction, while Bertsekas (2017) and Puterman (1994) go into more detail. This is followed by Section III, where we derive relevant properties of the studied problem. Though these properties are mostly already known, by deriving them we can both highlight some important details and use a language and notation more familiar to control engineers than that of the existing API literature. In Section IV we present our method for generating an approximation of the optimal control policy, and highlight how existing API methods compare with our method. Finally, in Section V we use two representative examples to show the results generated by our method. For ease of reference, a list of the symbols and notation used in this paper is shown in Table I.

II. Problem formulation
Assume a dynamic system f_d : R^n × R^m → R^n whose associated state evolution is recursively given by

x_{k+1} = f_d(x_k, u_k)

for the system state x_k ∈ R^n and control input u_k ∈ R^m at samples k ∈ [0, 1, 2, . . .]. Define the infinite sequences x̄ = [x_0, x_1, . . .] and ū = [u_0, u_1, . . .], as well as their N-horizon counterparts x̄_N = [x_0, x_1, . . . , x_N] and ū_N = [u_0, u_1, . . . , u_{N−1}]. In particular, for both x̄ and x̄_N we respectively define x_0 as the initial condition.

A. The infinite-horizon problem
Define

J* = min_{x̄, ū} J(x̄, ū), (3a)
J(x̄, ū) = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} f_c(x_k, u_k), (3b)
(x̄*, ū*) = argmin_{x̄, ū} J(x̄, ū), (3c)
subject to (x̄, ū) ∈ S, (3d)
S = { (x̄, ū) : x_{k+1} = f_d(x_k, u_k), g(x_k, u_k) ≤ 0, ∀ k ≥ 0 }, (3e)
(x̄, ū) ∈ V_α = { (x̄, ū) : lim_{N→∞} (1/N) Σ_{k=0}^{N−1} f_a(x_k, u_k) = α }, (3f)

as the problem we study in this paper. Here, we denote by f_c : R^n × R^m → R the cost function, by g : R^n × R^m → R^l the inequality constraint(s), by the scalar parameter α ∈ R the average constraint, and by f_a : R^n × R^m → R the average constraint function. We define a feasible trajectory as any trajectory (x̄, ū) that satisfies (3d). The set S gives a convenient notation for demanding that the "textbook" problem dynamics and inequality constraints hold, while the set V_α denotes an additional average constraint. Crucially, as none of the functions in (3) are explicitly dependent on k, its solution satisfies the principle of optimality (Bertsekas 2017, p. 20; Bellman 1954). Bertsekas (2017, p. 15) shows that this in turn implies that the optimal control trajectory ū* can equivalently be formulated as the control policy

u*_k = µ*_k(x*_k),

where µ*_k : R^n → R^m are functions that are independent of the initial condition x_0. Note that while x̄ and ū (with various sub- and super-scripts) are sequences of vectors of scalars, µ̄ (with various sub- and super-scripts) are instead sequences of functions. We will refer to µ̄* as the optimal control policy.

Definition 1. Define F ⊆ R^n as the set of initial conditions with feasible solutions, i.e.

F = { x_0 ∈ R^n : ∃ ū such that (x̄, ū) ∈ S ∩ V_α }.
Assumption 1. For the remainder of this paper we assume:

A.1 f_c, f_d, g, and f_a are continuous and bounded.

A.2 The optimal solution (x̄*, ū*) associated with x_0 is unique.

A.3 The optimal control policy associated with (3) exists, and can be expressed as µ*_k = µ* for all k, i.e. it is not only independent of the initial condition x_0, but also independent of the sample index k. We will refer to this as a stationary control policy (Bertsekas & Shreve, 1979).

A.4 F is nonempty, lim_{k→∞}(x*_k) exists and is independent of x_0 for all x_0 ∈ F, and x*_k is asymptotically stable in the sense of Lyapunov for x_0 near lim_{k→∞}(x*_k).

Note that A.1 implies that J(x̄, ū) is finite for any feasible trajectory, and by A.4 we can furthermore view J* as the average (mean) cost.

Definition 2. Define x_eq = lim_{k→∞}(x*_k) as the problem's equilibrium point.
Note that A.2, A.3, and A.4 may be difficult to determine a priori for a given problem. We will briefly discuss the possible effects of them not holding in Section IV.

B. Interpolating ADP
The method we introduce in this paper uses a conventional interpolating ADP scheme, and we will here use the standard method of gridding x and u into finite Cartesian sets. We define d_x ∈ R^n and d_u ∈ R^m as the distances between neighbouring grid points for each dimension of the states and controls respectively. We also define X ⊂ R^n and U ⊂ R^m as the discrete sets of state and control grid points resolved by ADP, separated by d_x and d_u respectively and bounded by the region(s) where g(x, u) ≤ 0. We then use conventional multilinear interpolation to approximate the cost J and optimal control policy µ for the real-valued states that do not lie in the discrete set X. For example, assuming x ∈ R^2, u ∈ R^1, and g(x, u) = |x|_1 ≤ 1 ∧ |u| ≤ 1, choosing the very coarse (but illustrative) d_x = [2, 2]^T and d_u = 0.5 gives the set U = {−1, −0.5, 0, 0.5, 1} and a correspondingly sparse set X.

III. Infinite-horizon, average-constrained problem properties

In this section we introduce properties of the undiscounted, infinite-horizon, average-constrained problem that will later be utilised by the method we introduce in Section IV.
A. Solution convergence

Definition 3. For x ∈ R^n, u ∈ R^m, using the same functions as in (3), define

(x*_eq, u*_eq) = argmin_{x, u} f_c(x, u) (11a)
subject to
g(x, u) ≤ 0, (11b)
x = f_d(x, u), (11c)
f_a(x, u) = α, (11d)

as the optimal reachable equilibrium operating point (x*_eq, u*_eq). (Note that we have identical states on both the left- and right-hand sides of (11c), i.e. an equilibrium state.) We can view this as the unique stationary point of the system with lowest cost that we can reach from any initial condition in the feasible set F.

Theorem 4. Given A.1 and A.4,

x_eq = x*_eq,

i.e. the equilibrium we reach will be optimal in the sense of (11).
Proof: Denote the accumulated cost between samples a and b as J_{a→b} = Σ_{k=a}^{b} f_c(x_k, u_k). We can then formulate (3b) as

J(x̄, ū) = lim_{N→∞} (1/N) ( J_{0→i−1} + J_{i→N−1} ).

As N → ∞, we are guaranteed that lim_{N→∞} (1/N) J_{0→i−1} = 0 for any fixed i > 0 per our assumption that f_c is bounded. This implies that J* is only dependent on J_{i→N−1}. By A.4, we can make (x*_i, u*_i) arbitrarily close to (x_eq, u_eq) for sufficiently large i. ∎
For an alternate view of the same proof, see Bertsekas (2012, p. 298).
By Theorem 4 we can intuitively view the infinite-horizon problem's solution as ignoring any (finite) costs during the transient phase and driving the state to the reachable stationary point with lowest cost. This is a special case of the turnpike property (Trélat & Zuazua, 2015; Zaslavski, 2014), which states that the solution to problems with a sufficiently long (finite) horizon tends to display transient dynamic initial and terminal phases, with a middle stationary phase that is independent of the initial and terminal conditions. Of course, the infinite-horizon problem does not have a terminal phase, and we can thus view the solution to our problem (3) as consisting of an initial transient followed by stationary operation at the optimal reachable equilibrium point.
From a notation perspective, by Theorem 4 we do not need to make the distinction between x_eq and x*_eq. For consistency, we will use x*_eq from here on.

B. Average-constraint relaxation
Definition 5. For a fixed, bounded, scalar relaxation parameter λ ∈ R, define the relaxed cost as

J_R(x̄, ū) = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} ( f_c(x_k, u_k) + λ (f_a(x_k, u_k) − α) ).

Now we can introduce the relaxed problem as

J*_R = min_{x̄, ū} J_R(x̄, ū) (18a)
subject to (x̄, ū) ∈ S, (18b)
(x̄*_R, ū*_R) = argmin_{(x̄, ū) ∈ S} J_R(x̄, ū), (18c)

where we view J_R(x̄_R, ū_R) as the relaxed representation of J(x̄, ū), and J*_R and (x̄*_R, ū*_R) as the optimal relaxed cost and optimal relaxed trajectories respectively. Note that (x̄_R, ū_R), and therefore also (x̄*_R, ū*_R), are not formally constrained to lie in V_α.
For clarity, we will use the notation x̄_R and ū_R when referring to trajectories associated with the relaxed problem. We will for ease of notation assume that (x̄*_R, ū*_R) is unique (much as (3c)), though we can in principle use DP (and in turn the method to be presented) to solve problems with non-unique solutions.
Lemma 6. For a given α, assume for some λ we have (x̄*_R, ū*_R) ∈ V_α. Then, denoting ζ* = (x̄*, ū*) and ζ*_R = (x̄*_R, ū*_R), we have ζ* = ζ*_R.

Proof: The weak duality theorem (Andréasson et al., 2016) gives that the optimal relaxed cost bounds the optimal constrained cost from below. By our assumption ζ*_R ∈ V_α, and as ζ*_R ∈ S by construction, ζ*_R is feasible in (3). As ζ*_R ∈ S ∩ V_α, ζ*_R also minimises (3), allowing us to replace the inequality with strict equality. By A.2, ζ* and ζ*_R are unique, ensuring that ζ* = ζ*_R. Finally, as the minimising argument ζ*_R is independent of constant terms in the cost, the result holds regardless of the constant offset λα. ∎

Theorem 7. Given (3c) and its relaxed counterpart (18c), for every λ there exists a value of α such that (x̄*_R, ū*_R) = (x̄*, ū*).
In essence, for a given λ, Theorem 7 ensures us that (x̄*_R, ū*_R) = (x̄*, ū*) for some value of α. We can intuitively view λ as a tuning parameter, where different values of λ are associated with different solutions, each of which (trivially) has an associated average that we can compute by means of (24).
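This view of λ as a tuning parameter can be illustrated with a toy scalar equilibrium cost c(θ) = sin(θ)² + λθ, an assumed stand-in for f_c + λ f_a (the names and the brute-force minimiser below are ours, not part of the method):

```python
import math

def c(th, lam):
    # Toy relaxed equilibrium cost: f_c = sin(th)**2, f_a = th (assumed forms).
    return math.sin(th) ** 2 + lam * th

def argmin_c(lam, lo=-0.99, hi=0.99, n=20001):
    # Brute-force minimisation over a fine grid of the permissible range.
    pts = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(pts, key=lambda th: c(th, lam))

for lam in (0.0, -0.4, -0.8):
    print(lam, round(argmin_c(lam), 3))
# Successively more negative lambda shifts the minimiser, whose value plays
# the role of the achieved average, from 0 towards larger angles.
```

Sweeping λ and recording the achieved average of f_a along each solution traces out the attainable setpoints, which is exactly how the relaxation replaces the explicit constraint.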
Using the relaxed problem formulation allows us to avoid the explicit average constraint (3f), which is primarily of use in the sense that the problem becomes more numerically tractable. At its core, the method we will introduce in this paper approximates the solution to (3) by instead solving a finite-horizon problem of sufficient length. One naive method of satisfying the average constraint would then be to introduce an additional state variable that stores the accumulated average, i.e. z_N = Σ_{k=0}^{N−1} f_a(x_k, u_k). We could then add an equality constraint demanding z_N / N = α. However, this is computationally demanding (as we need to introduce an additional state variable, which DP schemes scale poorly with) and introduces a bias in the achieved average (as the average z_N / N is taken over both the initial transient and the stationary phase, we only achieve the desired average as N → ∞). Using the relaxed formulation thus avoids these issues entirely.

C. Convergence of finite-horizon problem
We will in this section introduce notation for the finite-horizon problem, which will then be used for constructing the method presented in this paper.
Definition 8. For a given finite horizon N, bounded λ, and initial condition x_0, define

J^{*N}_R = min_{x̄_N, ū_N} (1/N) Σ_{k=0}^{N−1} ( f_c(x_k, u_k) + λ f_a(x_k, u_k) ) (25)

subject to

x_{k+1} = f_d(x_k, u_k), g(x_k, u_k) ≤ 0, 0 ≤ k < N,

as the N-horizon relaxed problem with average cost J^{*N}_R and associated (finite-length) state and control trajectories (x̄^{*N}_R, ū^{*N}_R). Furthermore, define

µ̄^{*N}_R = [µ^{*N}_{R,0}, µ^{*N}_{R,1}, . . . , µ^{*N}_{R,N−1}],

where µ^{*N}_{R,k} : R^n → R^m is the k'th state-feedback control policy, as the N-horizon sequence of control policies associated with (25).

Definition 9. Define

x_{k,CL} = f_d(x_{k−1,CL}, µ(x_{k−1,CL})), x_{0,CL} = x_0,

as the (not necessarily optimal) k'th closed-loop state given by repeatedly applying a (sample-independent) control policy µ k times from an initial state x_0, e.g. x_{2,CL} = f_d(f_d(x_0, µ(x_0)), µ(f_d(x_0, µ(x_0)))).
Note that the method of generating x_{k,CL} is very similar to the forward-calculation stage of ADP, and differs only in that the control policy is kept constant.
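A minimal sketch of this closed-loop rollout, with hypothetical scalar dynamics and policy standing in for f_d and µ:

```python
def rollout(x0, mu, f_d, k):
    """Apply the fixed (sample-independent) policy mu k times from x0,
    returning the closed-loop state x_{k,CL}. Unlike the ADP forward
    pass, the same policy is used at every step."""
    x = x0
    for _ in range(k):
        x = f_d(x, mu(x))
    return x

# Toy example: scalar linear dynamics with a stabilising linear policy
# (both hypothetical, chosen only to make the rollout observable).
f_d = lambda x, u: 0.9 * x + u
mu = lambda x: -0.4 * x
print(rollout(10.0, mu, f_d, 50))  # decays towards the equilibrium x = 0
```

Here the closed-loop map is x ← 0.5 x, so repeated application contracts any initial condition towards zero, mirroring the state convergence the method will later test for.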
Definition 10. For a given control policy µ, define

F_k(µ) = { x_0 ∈ X : g(x_{i,CL}, µ(x_{i,CL})) ≤ 0 for all 0 ≤ i ≤ k }.

We can thus view F_k(µ) as the set of initial conditions in X that satisfy the problem constraints and dynamics (the latter trivially, as we use µ to apply a control and give the next state) after applying the control policy µ k times.
Definition 11. For k > 0, introduce the maximum control deviation

[∆_k µ]_i = max_{x ∈ F_{⌈k/2⌉}(µ^{*k}_{R,0})} | [ µ^{*k}_{R,0}(x) − µ^{*k}_{R,⌈k/2⌉}(x) ]_i |,

where the notation [a]_i refers to the i'th element of a vector a and ⌈·⌉ refers to the ceiling function. We can view ∆_k µ as indicating the convergence of µ^{*k}_{R,0} to µ*, evaluated at the gridded state points X whose associated state evolution remains feasible after ⌈k/2⌉ iterations.

Definition 12. Introduce the maximum state deviation

[∆_k x]_i = max_{x_0 ∈ F_{⌈k/2⌉}(µ^{*k}_{R,0})} | [ x_{⌈k/2⌉,CL} − x̃* ]_i |,
x̃* = (1 / |F_{⌈k/2⌉}(µ^{*k}_{R,0})|) Σ_{x_0 ∈ F_{⌈k/2⌉}(µ^{*k}_{R,0})} x_{⌈k/2⌉,CL}. (31)

Note that the notationally heavy second line of (31) is equivalent to the mean feasible state after ⌈k/2⌉ iterations. Similarly to Definition 11, we can thus view ∆_k x as indicating the convergence of x_{0,CL}, x_{1,CL}, . . . , x_{⌈k/2⌉,CL} to x̃*, evaluated at the points where x_{⌈k/2⌉,CL} remains feasible.
Trivially, using Definition 11 and Definition 12 gives:

Proposition 13. Given Assumption 1, for any ε_µ > 0 and ε_x > 0 there exists a finite k such that ∆_k µ ≤ ε_µ and ∆_k x ≤ ε_x.

Definition 14. Given a control policy tolerance ε_µ ∈ R^m and state convergence tolerance ε_x ∈ R^n, define

N_M = min N (32a)
subject to ∆_N µ ≤ ε_µ, (32b)
∆_N x ≤ ε_x, (32c)

as the minimum horizon. Proposition 13 ensures us that for any ε_µ and ε_x there exists an associated finite horizon N_M, which we view as the shortest finite-horizon approximation of the infinite-horizon problem.
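The two convergence tests can be sketched as follows for a scalar control and state; the feasible-set, policy, and tolerance arguments are hypothetical stand-ins for the quantities in Definitions 11 and 12:

```python
def max_control_deviation(X_feasible, mu_a, mu_b):
    """Largest policy change over the still-feasible grid points:
    a scalar-control stand-in for Delta_mu (Definition 11)."""
    return max(abs(mu_a(x) - mu_b(x)) for x in X_feasible)

def max_state_deviation(states_cl):
    """Largest deviation of the rolled-out states from their mean:
    a scalar-state stand-in for Delta_x (Definition 12)."""
    mean = sum(states_cl) / len(states_cl)
    return max(abs(s - mean) for s in states_cl)

def horizon_sufficient(d_mu, d_x, eps_mu, eps_x):
    # Both the policy test (32b) and the state test (32c) must pass.
    return d_mu <= eps_mu and d_x <= eps_x

# Toy check: two nearly identical policies and tightly clustered states.
X = [-1.0, 0.0, 1.0]
d_mu = max_control_deviation(X, lambda x: -0.5 * x, lambda x: -0.51 * x)
d_x = max_state_deviation([0.99, 1.0, 1.01])
print(horizon_sufficient(d_mu, d_x, eps_mu=0.02, eps_x=0.02))  # prints True
```

Because both deviations are exhaustive maxima over a finite grid, the tests are cheap to evaluate exactly, which is what makes them practical as termination criteria.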
IV. The UCPADP method

In this section we introduce the primary contribution of this paper: Undiscounted Control Policy generation by Approximate Dynamic Programming (UCPADP), a method that generates an approximation of µ*. At its core, in UCPADP we generate an approximation of the optimal control policy by iteratively testing successively larger horizons until the termination criteria (32) are satisfied. For computational efficiency reasons we will return to, UCPADP approximates the control policy as

µ* ≈ µ^{*N}_{R,0}, N_M ≤ N < 3N_M, (33)

i.e. the generated horizon will lie in a range between N_M and 3N_M. We can at this stage highlight one of the more significant differences between UCPADP and conventional API: the choice of termination conditions. Conventional API generates improved control policies analogous to µ^{*1}_{R,1}, µ^{*2}_{R,2}, µ^{*3}_{R,3}, . . . with associated costs J^{*1}_R, J^{*2}_R, J^{*3}_R, . . . , and eventually terminates when the difference between either successive policies or costs is below a given threshold, for instance as in Santos and Rust (2004); Stachurski (2008, CPI, PSDP). This is similar to the test performed in (32b), which requires the control policy to be near-stationary. However, in conventional API the termination tolerance (analogous to ε_µ) is sized based on the discount factor, and depending on the specific method chosen the tolerance is either undefined or tends towards zero when the discount factor tends towards one (i.e. becomes the undiscounted case we study here). Bertsekas (2011) and Scherrer (2014) review other methods that do not terminate based on the change in the control policy, but instead use some other termination criterion. However, these methods also assume a discounted problem formulation. Guo et al. (2017) is one example of a method that considers the undiscounted case; however, their method imposes fairly significant limits on the class of cost and constraint functions (as discussed previously).
The state convergence condition (32c) is to the best of our knowledge novel, and serves a crucial purpose in that it demands the horizon be long enough for all gridded feasible initial conditions to converge to a region near the equilibrium. Recall that by A.4 x*_k (the true optimal state trajectory) is stable in the sense of Lyapunov for initial conditions near the equilibrium, and in concert with Theorem 4 we are thus ensured that an initial condition near the equilibrium will also remain in its vicinity. As we apply test (32c) to all feasible elements in X, at least one initial condition x_0 ∈ X will therefore start and then remain in the nearby vicinity of the equilibrium. Ultimately, by combining (32b) and (32c) we are ensured that µ^{*N_M}_{R,0} is nearly constant during the interval needed for all feasible gridded points in X to reach the vicinity of the equilibrium.
In UCPADP, we determine µ^{*N_M}_{R,0} in a numerically efficient manner similar to API implemented with ADP. We do this using a nested scheme that repeatedly switches between backward-calculation phases (successively generating control policies with longer associated horizons) and forward-calculation phases (applying tests (32b) and (32c), and eventually terminating when both tests pass). A description of the phases in UCPADP follows; see Fig. 1 for an illustration. For now, assume ε_µ and ε_x are given (fixed) vectors.
First, we arbitrarily choose a small initial horizon N and perform N backward-calculation iterations, giving us (among other data) µ̄^{*N}_R. We can then perform test (32b) and, by performing ⌈N/2⌉ forward-calculation steps, test (32c). If both tests pass we terminate and return µ^{*N}_{R,0} as our approximation of µ*. Conversely, if either of these tests fails, by A.4 we are ensured that increasing the horizon sufficiently will give a control policy that satisfies the tests. In UCPADP we choose to proceed by increasing the horizon to 3N. Fortunately, in our DP scheme we can compute µ̄^{*3N}_R using only 2N additional backward-calculation iterations by resuming the backward-calculation from µ̄^{*N}_R. This is possible as each successive backward-calculation step is independent of the total horizon. After generating µ̄^{*3N}_R we can again test (32b) and (32c). Should both tests pass we can return µ^{*3N}_{R,0} as our approximation of µ*, and otherwise we recursively repeat this procedure of tripling the tested horizon until the tests pass (i.e. generating and testing horizons N, 3N, 9N, 27N, . . .). A pseudocode implementation of the UCPADP method is listed in Algorithm 1.
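The horizon-search logic just described can be condensed into the following skeleton; `backward_step` and `tests_pass` are hypothetical callables standing in for the ADP backward pass and the evaluation of tests (32b)/(32c), and `n_max` is a simple safeguard against non-termination:

```python
def ucpadp_horizon_search(backward_step, tests_pass, n_init, n_max):
    """Skeleton of the UCPADP horizon search. backward_step(k) performs
    k additional backward-calculation iterations (reusing earlier work),
    and tests_pass(n) evaluates the termination tests for horizon n.
    Horizons n_init, 3*n_init, 9*n_init, ... are tested in turn."""
    n = n_init
    backward_step(n)               # initial N backward iterations
    while not tests_pass(n):
        if 3 * n > n_max:
            raise RuntimeError("N_max exceeded without convergence")
        backward_step(2 * n)       # only 2N extra steps yield horizon 3N
        n *= 3
    return n

# Toy run: pretend the tests first pass once the horizon reaches 100.
steps = []
n_found = ucpadp_horizon_search(steps.append, lambda n: n >= 100, 5, 10_000)
print(n_found, steps)  # 135 [5, 10, 30, 90]
```

Note how the total backward work (5 + 10 + 30 + 90 = 135 steps) equals the final horizon: resuming the backward pass means no iteration is ever repeated, which is the efficiency argument made above.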
Up to this point we have assumed that the problem solution is unique (A.2), converges to a stationary control policy (A.3), and that all states converge to a unique equilibrium (A.4). Let us now briefly consider the case where we do not know if these assumptions hold beforehand. Beginning with A.2, recall that we can determine whether or not this assumption holds during the backward-calculation phase by checking if the minimum cost is unique, and in the case of a non-unique cost we can resolve this by simply returning one arbitrarily selected optimal solution. Let us now focus on the case where A.3 and A.4 are unverified. Applying the UCPADP method gives one of two possible outcomes: UCPADP either never terminates (i.e. (32b) and (32c) never pass), or it terminates after a finite number of back-calculation iterations. If UCPADP never terminates, then one possible cause is that A.3 and/or A.4 do not hold (i.e. the termination criteria (32b) and (32c) correctly detected a non-stationary control policy and/or detected that the system states do not converge to a single equilibrium). Alternatively, it is possible that the problem's discretisation and/or tolerances were poorly chosen. Regardless, should UCPADP never terminate it is clear that no valid solution could be generated. If UCPADP does terminate, we are assured that either: (i) A.3 and A.4 do hold and a near-optimal control policy is generated, or (ii) the problem is pathologically nonlinear and A.3 and/or A.4 do not hold (which went undetected by (32b) and (32c)), ultimately giving a control policy without any clear optimality guarantees. As the class of problems we can attempt to solve with UCPADP covers general non-linear systems, it is not surprising that there exist pathological problems that lead UCPADP (and ADP in general) to generate erroneous solutions.
Ultimately it is up to the user of UCPADP to determine whether or not the studied problem is of a class that satisfies the (arguably mild) assumptions A.3 and A.4.
In Algorithm 1, we extend the notion of termination used thus far by adding a parameter N_max that allows for configuring a maximum horizon, terminating UCPADP if N > N_max. This acts as a safeguard and guarantees that UCPADP terminates after a finite number of iterations. In the event that this limit triggers UCPADP to terminate, we can conclude that either the minimum horizon is larger than N_max, that A.3 and/or A.4 do not hold, or that the discretisation and/or tolerances were poorly chosen. Of course, should this happen then we cannot say anything about the stability (let alone the optimality) of the returned control policy.
Tests (32b) and (32c) are straightforward to compute exhaustively, as the initial conditions x_0 come from the discrete set X. Furthermore, in UCPADP we have chosen to triple the tested horizon between each test evaluation. This attempts to balance the time spent on backward-calculation iterations against the time spent on the horizon length sufficiency tests, though we may ultimately solve for problem horizons approaching 3N_M, as indicated by (33). Ultimately this choice is arbitrary, and it is possible for some problems to use another scheme for selecting a new length.
From a practical perspective, we have found that setting ε_µ ≈ 2d_u and ε_x ≈ 2d_x (twice the distance between points in U and X respectively) is a good design choice for well-behaved problems. Smaller values raise the risk of never terminating, e.g. due to residual state trajectory jitter caused by the approximation inherent to interpolation, while larger values give an unnecessarily coarse approximation of the true control policy µ*_{R,0}. Ultimately, this implies that UCPADP has to some degree only one tuning parameter: the ADP discretisation, which trades off accuracy with computational time and memory demands.

Algorithm 1. Pseudocode UCPADP algorithm. Here, DP_1-back and DP_1-fw are the one-step backward and forward ADP operations, X is the set of initial conditions tested in the ADP method, N_init is the initial problem horizon, and C_N is the cost-to-go after N iterations. Note that a reverse notation is used for the calculated control policy; µ_1 corresponds to the state-feedback control policy from the first back-calculation step (i.e. µ^{*N}_{N−1}) while µ_N corresponds to the last (i.e. µ^{*N}_0). We can view the index k as counting the number of back-calculation steps performed. Note the abuse of notation on line 14 that indicates the ∆_N x and ∆_N µ tests respectively.
As UCPADP is based on interpolating ADP (and in turn DP) it is subject to the inherent limitations of DP methods, in particular their poor scaling with problem dimensionality (colloquially referred to as the "curse of dimensionality") (Bellman, 1954; Bertsekas, 2017). This limits UCPADP to low- to moderate-dimensional problems. The examples shown in the following section (with two state variables and one control variable, giving a total of three independent variables) are easily solved using an ordinary desktop computer in on the order of one minute to one hour (depending on the demanded solution accuracy). In practice, we expect UCPADP to be viable for problems with up to 4-6 continuous variables, depending on the discretisation of the state and control variables, the nature of the problem, and the available computational power.
A general implementation of the UCPADP method in the MATLAB language, including the numerical examples in the following section, is available at https://gitlab.com/lerneaen_hydra/ucpadp.

V. Representative examples
We illustrate the UCPADP method, introduced in Section IV, by solving two simple problems. Though "toy" problems in some sense, recall that Assumption 1 allows for significantly more difficult (and practically relevant) problems. First we consider the classical minimum-time inverted pendulum problem, where we highlight the stopping criterion of UCPADP. Afterwards, we consider the problem of maintaining an average pendulum angle with minimum control power, illustrating the average-constraint properties shown in Subsection III-B.
We will consider the dynamical system given by a simple pendulum (Fig. 2) for both problems. For a pendulum with length l, point mass m, gravitational acceleration g, damping coefficient d, angle θ, and applied torque u, the dynamic equation for the system can be derived as

m l² θ̈ = u − m g l sin(θ) − d θ̇. (34)

In both the following examples we will assume a discrete-time control system with sample rate t_s, i.e. the control input u is piecewise constant over intervals of uniform time t_s. If the problem is reformulated as a set of coupled first-order ordinary differential equations with a state variable vector x = [θ, θ̇]^T, then we can express the state at the next sample as

x_{k+1} = f_p(x_k, u_k),

where f_p is given by solving (34) over a time t_s with initial condition x_k and constant control input u_k.
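A sketch of f_p using fixed-step RK4 integration with the control held constant over the sample interval; the normalised parameter values (unit inertia and gravity term, damping coefficient d) and the choice of integrator are our assumptions, not prescribed by the paper:

```python
import math

def pendulum_ode(x, u, d=1.0):
    """x = (theta, theta_dot); normalised pendulum dynamics
    theta_dd = -d*theta_dot - sin(theta) + u (parameter values assumed)."""
    th, thd = x
    return (thd, -d * thd - math.sin(th) + u)

def f_p(x, u, ts=0.2, steps=10):
    """One control sample: integrate the ODE over t_s with u held
    constant, using fixed-step RK4 (an implementation choice)."""
    h = ts / steps
    th, thd = x
    for _ in range(steps):
        k1 = pendulum_ode((th, thd), u)
        k2 = pendulum_ode((th + h / 2 * k1[0], thd + h / 2 * k1[1]), u)
        k3 = pendulum_ode((th + h / 2 * k2[0], thd + h / 2 * k2[1]), u)
        k4 = pendulum_ode((th + h * k3[0], thd + h * k3[1]), u)
        th += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        thd += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return (th, thd)

# Holding the equilibrium torque u = sin(theta_ref) keeps the pendulum still:
theta_ref = 0.5
x_next = f_p((theta_ref, 0.0), math.sin(theta_ref))  # stays at (0.5, 0.0)
```

Any consistent ODE solver would serve here; the only structural requirement is that u is held constant over each interval of length t_s, matching the zero-order-hold assumption above.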

A. The inverted pendulum
To illustrate the mechanics of UCPADP's termination criterion, consider the traditional minimum-time inverted pendulum problem (formulated here as an infinite-horizon problem): minimise the average cost J for a cost function f_c that is zero at the single upright zero-velocity state and one at every other state (37b), subject to the pendulum dynamics x_{k+1} = f_p(x_k, u_k) (37c), the control bound |u_k| ≤ 1 (37d), and box constraints on θ_k (37e) and θ̇_k (37f). All the following results are shown for a sample time of t_s = 0.2, pendulum parameters set to give the system dynamics equation θ̈ + sin(θ) = u, state variables discretised by a Cartesian grid with separation d_x = [0.05, 0.05]^T in the range allowed by (37e) and (37f), and the control variable discretised with even spacing d_u = 0.01 in the range allowed by (37d). Setting ε_x and ε_µ to the suggested value of twice the discretisation gives ε_x = [0.1, 0.1]^T and ε_µ = 0.02.
Note that the cost function (37b) equally penalises all pendulum configurations other than the single vertical zero-velocity state combination, and with an infinite horizon (and small enough d_x) gives a solution arbitrarily close to the traditional minimum-time formulation. The state bounds (37e) and (37f) have been chosen to give a reasonable range for the specific initial value we will study shortly.
For the above problem, UCPADP terminates after testing a horizon of N_M = 135, indicating that 45 < N_M ≤ 135. An illustration of termination criterion (32b) is shown in Fig. 3, where we can verify the condition is satisfied as all values are above N/2 = 68. Furthermore, ∆_k µ will by construction take values from U = {0, ±0.01, ±0.02, . . . , ±1}. For ε_µ = 0.02, (32b) will thus only be satisfied for the values −0.01, 0, 0.01. We can see this in Fig. 3, where ∆_135 µ = 0. Similarly, criterion (32c) is illustrated in Fig. 4, where we can verify that representative trajectories all converge to a region bounded by ε_x (shown by the yellow box). An illustration of the control policy ultimately generated by UCPADP is shown in Fig. 5. Solving this specific problem took approximately 10 minutes using a standard desktop PC. Fig. 6 shows a comparison of the solution generated by UCPADP and a reference solution, generated by formulating a problem with an explicit horizon of N = 10N_M = 1350 (i.e. one order of magnitude longer than the UCPADP horizon), for x_0 = [0, 0]^T. Here the reference solution is generated using a traditional ADP scheme, configured with the same sample time and state/control grid discretisation. Note that we intentionally compare the UCPADP solution to a traditional ADP solution (in contrast to, for instance, an analytical solution) as we wish to highlight the accuracy of the automatically sized horizon, rather than the accuracy of an interpolating ADP scheme.
The average cost over the time interval shown in Fig. 6 is 0.09664 for the UCPADP solution, while the cost associated with the reference solution is 0.09689, i.e. a deviation of 0.25%. (The fact that the UCPADP solution has a lower associated cost is likely due to the inherent approximation of interpolating ADP.) We can conclude (for this specific problem) that the cost associated with the UCPADP solution is virtually identical to a conventional ADP solution, indicating that the identified horizon N_M = 135 was sufficient.

B. The constant-angle pendulum
Let us now consider a problem that illustrates the properties of the average constraint introduced in Subsection III-B. Assume we wish to solve

minimise J = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} u_k² (38a)
subject to x_{k+1} = f_p(x_k, u_k), (38b)
lim_{N→∞} (1/N) Σ_{k=0}^{N−1} θ_k = θ_ref, (38c)

i.e. the problem of keeping the average pendulum angle at a setpoint θ_ref while minimising the quadratic control input u_k². By Theorem 7 we can avoid including the average constraint (38c) by augmenting the cost functional (38a) as

J_R = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} ( u_k² + λ θ_k ) (39)

for a constant scalar λ. Assuming the problem reaches an equilibrium with control u_eq and states θ = θ_ref, θ̇ = 0, by (34) we have u_eq = m g l sin(θ_ref). We can thus express the equilibrium cost as

c_eq(θ_ref) = ( m g l sin(θ_ref) )² + λ θ_ref, (40)

which is a function of one variable. Equation (40) has one unique stationary point (a minimum) in the permissible range |θ| < 1, and we can thus find the specific value λ that gives the lowest equilibrium cost at the desired setpoint by setting dc_eq/dθ_ref = 0 and solving for λ, giving

λ = −(m g l)² sin(2 θ_ref). (41)

We can now reformulate (38) as the equivalent problem

minimise J_R = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} ( u_k² + λ θ_k ) (42a)
subject to x_{k+1} = f_p(x_k, u_k), (42b)

together with the state and control bounds. As in the previous example, we discretise the state and control variables evenly in the permissible space, here with separation d_x = [0.02, 0.02]^T and d_u = 0.01 respectively. Solving (42) for pendulum parameters resulting in a system dynamics equation θ̈ + θ̇ + sin(θ) = u and θ_ref = 0.5 gives the results shown in Fig. 7 (again compared with a reference solution given by explicitly choosing a large horizon, one order of magnitude larger than the horizon given by UCPADP). For this problem, we find that UCPADP generates a solution with control cost (i.e. the average of u_k²) of 0.2321 over the horizon shown in Fig. 7, while the control cost associated with the reference solution is 0.2323 (i.e. a deviation of 0.09%), again showing that the accuracy of the UCPADP solution is virtually identical to that of a reference ADP solution.
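The stationarity condition that yields λ can be checked numerically; the sketch below assumes normalised parameters (m g l = 1), for which the equilibrium cost reduces to c_eq(θ) = sin(θ)² + λθ:

```python
import math

theta_ref = 0.5
# lambda = -(m*g*l)**2 * sin(2*theta_ref), here with m*g*l = 1 (assumed):
lam = -math.sin(2 * theta_ref)

def c_eq(th):
    # Equilibrium cost u_eq**2 + lam*th with u_eq = sin(th) (normalised pendulum).
    return math.sin(th) ** 2 + lam * th

# Central-difference check that theta_ref is the stationary point:
h = 1e-6
dc = (c_eq(theta_ref + h) - c_eq(theta_ref - h)) / (2 * h)
d2c = (c_eq(theta_ref + h) - 2 * c_eq(theta_ref) + c_eq(theta_ref - h)) / h ** 2
print(dc, d2c > 0)  # derivative ~0 and positive curvature: a minimum at theta_ref
```

Since d²c_eq/dθ² = 2 cos(2θ_ref) is positive for the setpoint considered here, the stationary point is indeed the minimum the reformulated problem settles at.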
For comparison, in Fig. 8 we also show the solution quality parameterised by different finite horizons. More specifically, we solve the finite-horizon counterpart of (42) (using the notation introduced in (25)) for varying finite horizons N (denoted the problem horizon), resulting in the associated control policies μ*_{R,0}. We then apply each control policy to the set of initial conditions feasible with a long horizon N = 1350 (denoted the trajectory horizon). The plot shows the augmented cost over the trajectory horizon, i.e. J*_R, parameterised by the problem horizon. We can identify that the average cost is higher for short problem horizons than for long problem horizons, and that the cost associated with problem horizons of 40 and above is constant, indicating that for this problem a problem horizon of 40 is sufficient. It may therefore seem like UCPADP is inefficient in its choice of problem horizon (135 samples). However, computing the average cost for any given problem horizon shown in Fig. 8 is time-consuming, with each individual problem horizon taking approximately the same time to compute as the entire UCPADP solution, as well as requiring problem-specific knowledge of the initial conditions and trajectory horizon to average over. The trade-off between spending time computing additional back-calculation steps and checking whether a given horizon is sufficiently large thus motivates a scheme like our proposed horizon-doubling method.

VI. Conclusions
In this paper we have introduced UCPADP, a numerical method inspired by API. UCPADP can be used to generate a near-optimal control policy for general undiscounted continuous-valued infinite-horizon nonlinear optimal control problems. The problem can also optionally be constrained to converge to a given equilibrium. The primary contribution of UCPADP is the introduction of a termination criterion that is amenable to the undiscounted case, while still allowing for general costs and constraints. We have evaluated the method by solving two simple, but representative, problems. For both examples we showed that the generated control policy was on par with the accuracy of a reference ADP solution (whose accuracy is determined by the chosen discretisation of the problem).
UCPADP has several properties that render it useful as one part of the process of constructing an on-line controller. Firstly, it shares a property with other API methods in that it does not require any a priori information about a suitable horizon, instead performing an indefinite number of iterations and terminating when a suitable problem horizon is found. Secondly, the tuning parameters are simple to grasp, as they trade off solution accuracy against computational time and memory demands. Finally, the output from UCPADP, as with other API methods, is a control policy (i.e. a state-feedback table). After this control policy is computed in an off-line phase, it can in turn be used to construct an on-line controller with very low computational demand, only requiring a simple interpolation operation to determine the control signal.
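The on-line phase described above reduces to a table lookup with interpolation. The sketch below shows the idea for a one-dimensional state; the grid and table values are illustrative placeholders, not output from the paper's implementation.

```python
import bisect

# ASSUMED off-line output: a state grid and the precomputed policy μ*(x).
x_grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
u_table = [0.8, 0.45, 0.0, -0.45, -0.8]

def policy(x):
    """On-line controller: piecewise-linear interpolation of the table."""
    x = max(x_grid[0], min(x_grid[-1], x))   # clamp to the table range
    j = bisect.bisect_left(x_grid, x)        # right bracket of x
    if j == 0:
        return u_table[0]
    i = j - 1
    w = (x - x_grid[i]) / (x_grid[j] - x_grid[i])
    return (1 - w) * u_table[i] + w * u_table[j]
```

At run time the controller evaluates policy(x) at the measured state, e.g. policy(0.25) interpolates between the entries at 0.0 and 0.5, so the per-step cost is one bisection and one weighted average, well within the reach of very limited hardware.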
Full source code of the implementation, as well as the specific problems studied, is available at https://gitlab.com/lerneaen_hydra/ucpadp.

Funding
This work was performed within the Combustion Engine Research Center at Chalmers (CERC) with financial support from the Swedish Energy Agency.