A Markov chain Monte Carlo algorithm for Bayesian policy search

ABSTRACT Policy search algorithms have facilitated the application of Reinforcement Learning (RL) to dynamic systems, such as robot control. Many policy search algorithms are based on the policy gradient and may therefore suffer from slow convergence or local optima. In this paper, we take a Bayesian approach to policy search under the RL paradigm, for the problem of controlling a discrete-time Markov decision process with continuous state and action spaces and with a multiplicative reward structure. To this end, we place a prior over the policy parameters and target the 'posterior' distribution in which the 'likelihood' is the expected reward. We propose a Markov chain Monte Carlo algorithm as a method of generating samples of the policy parameters from this posterior. The proposed algorithm is compared with several well-known policy gradient-based RL methods on a nonlinear model of the Cart-Pole benchmark, where it exhibits better performance in terms of time response and convergence rate.


Introduction
Reinforcement learning (RL) is a topic of broad interest among researchers in machine learning, control, and robotics; see Sutton and Barto (1998) for an introduction. In RL, an agent interacts sequentially with an environment to collect data samples and utilizes the generated data to find a policy that optimizes a long-term performance criterion. Among the existing RL methods, gradient-based algorithms have proved effective in dealing with high-dimensional state spaces. A drawback of these algorithms is that the policy search they perform is local; consequently, they may suffer from a poor convergence rate or become trapped in local optima. To cope with this problem and to facilitate convergence, we advocate a Bayesian inference approach.
This work concentrates on the control problem in Markov decision processes (MDPs) with continuous state and action spaces over a finite time horizon. We propose a novel RL policy search method using particle Markov chain Monte Carlo (P-MCMC), a recent and efficient family of MCMC methods for complex distributions developed in Andrieu, Doucet, and Holenstein (2010). Our method is best suited to risk-sensitive scenarios, where a multiplicative expected total reward, rather than the more common additive one, is employed to measure performance, since with a multiplicative reward one can fully utilize sequential Monte Carlo (SMC), also known as the particle filter (Doucet, De Freitas, & Gordon, 2001; Gordon, Salmond, & Smith, 1993), within the iterations of the P-MCMC. The simulation results of learning to model and control a physical system reveal significant improvements in the time response of the plant as well as a reduction in learning time, which is important in real applications. The proposed method does not require any gradient calculations and does not produce a point estimate of the policy; instead, it works with estimates of the reward function directly and provides samples of the policy parameters, which can then be used to explore the surface of the policy performance and identify favourable regions of the policy space. Therefore, it does not bear the risk of getting stuck in local optima or diverging due to poor choices of the learning rate in gradient ascent update rules. In that sense, our method may be advantageous, at least in terms of breadth of applicability, over methods that do require gradient calculation. The claims on robustness and convergence time are supported by the numerical experiments in Section 3.2, where the proposed method is compared with state-of-the-art gradient-based methods on the benchmark inverted-pendulum problem.

CONTACT Vahid Tavakol Aghaei tavakolaghaei@sabanciuniv.edu
The comparisons are performed for two different reward structures: an interval-based reward and a quadratic reward. The simulation results regarding convergence, which reveal the competitiveness of the suggested method with the gradient-based ones, are depicted in Figures 1 and 2.
The paper is organized as follows: In the rest of Section 1, we commence with a general description of MDPs, provide a short review of the gradient-based policy optimization strategies and proceed with Bayesian framework and risk-sensitivity ideas. In Section 2, the proposed method is presented. In order to demonstrate the efficiency of the proposed algorithm, numerical simulations are depicted in Section 3.2. Finally, concluding remarks and future work are expressed in Section 4.

Markov decision processes
RL problems can be specified with the notion of an MDP. A Markov decision process is defined by the tuple (S, A, g, η, r).
Here, S ⊆ R^{d_s}, d_s > 0, represents a d_s-dimensional continuous state space, and A ⊆ R^{d_a}, d_a > 0, stands for a continuous action space. We treat the state and action variables at time t as random variables S_t ∈ S and A_t ∈ A, whose realizations are denoted s_t and a_t. At time t > 0, a transition from the current state s_t to the next state s_{t+1} as a result of taking action a_t is governed by the transition density function g(s_{t+1} | s_t, a_t). We denote the probability density of the initial state by η(s_1). The reward function r : A × S × S → R assigns an immediate real-valued scalar reward, also known as a reinforcement signal, to the state transition from s_t to s_{t+1} under action a_t, written r(a_t, s_t, s_{t+1}).
Let the control policy be a stochastic parametrized policy with parameter θ ∈ Θ ⊆ R^{d_θ} for some d_θ > 0 and policy space Θ. The policy parameter θ corresponds to a probability density function h_θ(a_t | s_t) for the randomized action A_t at time t given the state S_t = s_t.
Define x_t = (s_t, a_t), so that {X_t}_{t≥1} is a Markov chain with transition density

f_θ(x_t | x_{t−1}) = g(s_t | s_{t−1}, a_{t−1}) h_θ(a_t | s_t).

Therefore, the joint probability density of a trajectory (also called a path, or rollout) x_{1:n} up to time n is

p_θ(x_{1:n}) = f(x_1) ∏_{t=2}^{n} f_θ(x_t | x_{t−1}),

where f(x_1) = η(s_1) h_θ(a_1 | s_1) is the initial distribution for X_1.
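To make the factorization of p_θ(x_{1:n}) concrete, the following is a minimal sketch of drawing a rollout, assuming for illustration a standard-normal η, a toy linear-Gaussian stand-in for the transition g, and a linear-Gaussian policy for h_θ; none of these particular choices is the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectory(theta, n, sigma=1.0, ds=2):
    """Draw one rollout x_{1:n} = ((s_1, a_1), ..., (s_n, a_n)) from p_theta.

    Illustrative stand-ins: eta is standard normal, h_theta is
    N(theta^T s, sigma^2), and g is a linear-Gaussian step.
    """
    s = rng.standard_normal(ds)              # s_1 ~ eta
    traj = []
    for _ in range(n):
        a = rng.normal(theta @ s, sigma)     # a_t ~ h_theta(. | s_t)
        traj.append((s.copy(), a))
        drive = np.zeros(ds)
        drive[0] = a                         # the action pushes the first state component
        s = 0.9 * s + drive + 0.1 * rng.standard_normal(ds)  # s_{t+1} ~ g(. | s_t, a_t)
    return traj

rollout = sample_trajectory(theta=np.zeros(2), n=5)
```

Each element of `rollout` is one x_t = (s_t, a_t), so the list is a draw from the joint density p_θ(x_{1:n}).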

Reward-based policy search
The objective of policy search in RL is to seek optimal or plausible policy parameters θ with respect to some expected performance of the trajectory X_{1:n}. Define R_n : X^n → R to be the discounted sum of the immediate rewards up to time n,

R_n(x_{1:n}) = ∑_{t=1}^{n} γ^{t−1} r(a_t, s_t, s_{t+1}),

where γ ∈ (0, 1] is a discount factor, and let U : R → R be a monotonically increasing and continuous utility function with inverse U^{−1}. In a finite-horizon RL setting, the performance of a certain policy, J_n(θ), based on U and R_n is given by

J_n(θ) = ∫ U(R_n(x_{1:n})) p_θ(x_{1:n}) dx_{1:n},     (4)

where p_θ(x_{1:n}) is the trajectory distribution. Various works formulate reinforcement learning as an inference problem for the policy parameter θ that is based on either maximizing (some function of) J_n(θ) with optimization techniques (Dayan & Hinton, 1997; Gullapalli, Franklin, & Benbrahim, 1994; Kappen, Gomez, & Opper, 2012; Kimura & Kobayashi, 1988; Maddison et al., 2017; Mitsunaga, Smith, Kanda, Ishiguro, & Hagita, 2005; Peters, 2005; Toussaint & Storkey, 2006), or exploring admissible regions of J_n(θ) via some Bayesian approach (Hoffman, Doucet, Freitas, & Jasra, 2008; Wingate, Goodman, Roy, Kaelbling, & Tenenbaum, 2011).
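The discounted return R_n and a plain Monte Carlo estimate of J_n(θ) can be sketched as follows; `sample_R` is a hypothetical callable that draws R_n for one rollout under p_θ.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_n(x_{1:n}) = sum_t gamma^(t-1) r_t for a sequence of instantaneous rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def performance(sample_R, U=lambda x: x, num_rollouts=100):
    """Monte Carlo estimate of J_n(theta) = E[U(R_n(X_{1:n}))], given a
    function sample_R() that draws R_n for one rollout under p_theta."""
    return np.mean([U(sample_R()) for _ in range(num_rollouts)])
```

With the risk-neutral U(x) = x this is the usual average return; other utilities U plug in unchanged.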

Gradient-based algorithms for policy optimization
There are different choices for the function U depending on the nature of the reinforcement learning problem in terms of its treatment of risk. A common choice is U(x) = x, which corresponds to the risk-neutral case, where the performance measure of a policy reduces to its expected total reward. This case has been studied most extensively in the literature. In particular, among the available RL methods, the policy gradient algorithms have drawn the most attention. Policy gradient algorithms can be implemented in high-dimensional state-action spaces, which makes them well suited to the robotics domain, where problems usually involve such spaces. The use of policy gradient algorithms was pioneered in the work of Gullapalli et al. (1994). These methods have been employed to deal with complex control and robotics problems, such as those addressed in Kimura and Kobayashi (1988), Peters (2005), and Mitsunaga et al. (2005).
Specifically, the goal of policy optimization in RL is to find policy parameters that maximize the expected value of some function of R_n,

θ̂ = argmax_{θ ∈ Θ} J_n(θ).
Although it is hardly ever possible to evaluate θ̂ directly with this choice of R_n, maximization of J_n(θ) can be attempted with policy gradient (PG) methods, which use the steepest ascent rule to update the parameters at iteration i as

θ^{(i+1)} = θ^{(i)} + β ∇J_n(θ^{(i)}),

where β is a learning rate for the gradient ascent and the vector

∇J_n(θ) = (∂J_n(θ)/∂θ_1, …, ∂J_n(θ)/∂θ_{d_θ})     (6)

is the gradient of the expected total reward with respect to θ = (θ_1, …, θ_{d_θ}). However, unless the state and action spaces are finite or the Markov chain {X_t}_{t≥1} admits linear and Gaussian transition laws, it is impossible or prohibitively difficult to evaluate the gradient ∇J_n(θ) in (6). In the sequel, we review three main policy gradient methods proposed to efficiently approximate ∇J_n(θ). These are state-of-the-art methods, and their performance will be compared with that of our algorithm in Section 3.

The REINFORCE algorithm
One of the earliest methods for estimating ∇J_n(θ) in (6) is the REINFORCE algorithm introduced by Williams (1992), which exploits the likelihood-ratio identity

∇J_n(θ) = ∫ R_n(x_{1:n}) ∇_θ log p_θ(x_{1:n}) p_θ(x_{1:n}) dx_{1:n}.     (8)

Due to the lack of exact information about the trajectory distribution p_θ(x_{1:n}), or nonlinearity in p_θ(x_{1:n}), the integration over this probability distribution may not be possible. The REINFORCE algorithm approximates ∇J_n(θ) by generating N ≥ 1 trajectories of length n from p_θ(x_{1:n}), and then performing the Monte Carlo estimate

∇J_n(θ) ≈ (1/N) ∑_{i=1}^{N} R_n(x^{(i)}_{1:n}) ∇_θ log p_θ(x^{(i)}_{1:n}).     (9)

In Williams (1992), it is shown that this estimator suffers from high variance. In order to reduce this variance, Williams (1992) proposes a baseline b ∈ R^{d_θ} that modifies (9), subtracting b component-wise from R_n(x^{(i)}_{1:n}) before multiplying by the score.     (10)
The baseline is calculated adaptively during the iterations from the sample trajectories x^{(i)}_{1:n} in a heuristic manner, so that the variance of the approximation is minimized.
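A minimal sketch of the estimator (9)–(10) for a linear-Gaussian policy a_t ~ N(θ^T s_t, σ²) follows; since the transition terms g do not depend on θ, the score of a trajectory reduces to the sum of per-step policy scores (a − θ^T s) s / σ². A scalar baseline b is used here for simplicity; the adaptive, component-wise baseline of the paper is not reproduced.

```python
import numpy as np

def gaussian_policy_score(traj, theta, sigma):
    """grad_theta log p_theta(x_{1:n}) for the policy a_t ~ N(theta^T s_t, sigma^2).

    Only the policy factors depend on theta, so the score is
    sum_t (a_t - theta^T s_t) s_t / sigma^2.
    """
    return sum((a - theta @ s) * s / sigma**2 for s, a in traj)

def reinforce_gradient(trajs, returns, theta, sigma, b=0.0):
    """Monte Carlo estimate (9) with scalar baseline b: the average over N
    trajectories of (R_n - b) times the trajectory score."""
    grads = [(R - b) * gaussian_policy_score(traj, theta, sigma)
             for traj, R in zip(trajs, returns)]
    return sum(grads) / len(grads)
```

Each element of `trajs` is a list of (state, action) pairs and `returns` holds the corresponding R_n values.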

The G(PO)MDP algorithm
It can be observed from (9) that the overall reward of a complete trajectory, R_n, is used to calculate ∇_θ J_n(θ). Due to the large variance of R_n(X_{1:n}), the gradient may not be estimated with high accuracy. In order to surmount this problem, a modified version of the REINFORCE algorithm, called the G(PO)MDP algorithm, was proposed by Baxter and Bartlett (2001), in which the immediate reward of each time step is paired only with the scores of the actions taken up to that step:

∇J_n(θ) ≈ (1/N) ∑_{i=1}^{N} ∑_{t=1}^{n} (γ^{t−1} r(x_t^{(i)}) − b_t) ∑_{τ=1}^{t} ∇_θ log h_θ(a_τ^{(i)} | s_τ^{(i)}),

where the baseline b_t is also determined adaptively, as in the REINFORCE algorithm, but separately for each t; see Baxter and Bartlett (2001) for the details.
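Continuing with the same linear-Gaussian policy assumed above, a sketch of the G(PO)MDP estimator (with a zero baseline for clarity, rather than the adaptive b_t of the paper) might look as follows.

```python
import numpy as np

def gpomdp_gradient(trajs, reward, theta, sigma, gamma=0.99):
    """G(PO)MDP estimate: each discounted instantaneous reward gamma^(t-1) r_t
    is paired only with the scores of the actions taken up to and including
    time t (zero baseline here for clarity)."""
    grad = np.zeros_like(theta)
    for traj in trajs:
        score = np.zeros_like(theta)          # running sum of policy scores
        for t, (s, a) in enumerate(traj):
            score = score + (a - theta @ s) * s / sigma**2
            grad = grad + gamma**t * reward(a, s) * score
    return grad / len(trajs)
```

Because a reward at time t is never multiplied by scores of later actions, the estimator has lower variance than (9) while keeping the same expectation.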

The eNAC algorithm
Natural gradient methods, introduced by Amari (1998) and Kakade (2002), have evolved into various policy gradient algorithms for RL, such as the Natural Actor-Critic (NAC) and episodic NAC (eNAC) (Peters, 2005). The idea behind these algorithms is to explore J_n(θ) more efficiently by employing the Fisher information matrix F(θ) of the trajectory distribution and performing the gradient search with the natural gradient F(θ)^{−1} ∇J_n(θ) instead of ∇J_n(θ). The eNAC algorithm estimates the natural gradient via a regression that avoids a direct approximation of F(θ). More details about the eNAC algorithm can be found in Peters (2005).

A Bayesian framework for RL
In this paper, we will deal with the particular choice

U(x) = exp(κx)     (13)

for some risk factor κ > 0, which corresponds to risk-sensitive RL (Howard & Matheson, 1972). For this choice of utility function, the performance measure J_n(θ) in (4) can be written explicitly as

J_n(θ) = ∫ exp(κ R_n(x_{1:n})) p_θ(x_{1:n}) dx_{1:n}.     (14)

The classical policy optimization problem in the finite-horizon risk-sensitive setting can then be defined as finding the policy θ ∈ Θ that maximizes U^{−1}(J_n(θ)), which is equivalent to maximizing J_n(θ), as noted in Marcus, Fernández-Gaucherand, Hernández-Hernández, Coraluppi, and Fard (1997). Several risk-sensitive RL algorithms can be found in the literature. Risk sensitivity was originally discussed in early works in economics, such as von Neumann and Morgenstern (1944). For RL, variants of the Q-learning approach (Watkins, 1989) and of policy iteration (Sutton & Barto, 1998) can be found. Other examples of policy optimization for risk-sensitive RL are Koenig and Simmons (1994), Mihatsch and Neuneier (2002), Neuneier and Mihatsch (1998), and Shen, Tobia, Sommer, and Obermayer (2014). However, these algorithms are designed for finite action spaces and therefore do not apply in general to our setting.
Lately, Maddison et al. (2017) consider the direct policy gradient approach for the choice in (13) that can be applied to general state and action spaces.
The main contribution of this paper is a Bayesian approach to RL implemented via MCMC. Specifically, our methodology is a particle MCMC algorithm that runs an SMC (particle filter) within its iterations. The novelty of the presented approach lies in formulating the policy search problem as Bayesian inference in which the expected multiplicative reward J_n(θ) is treated as a pseudo-likelihood function. The reason for taking J_n(θ) to be the expectation of a multiplicative reward function is the ability to employ unbiased, lower-variance estimators of J_n(θ), unlike methods that use a cumulative reward formulation and thus obtain estimates with high variance. We work with the choice in (13) for U, and hence Equation (14) is assumed as the performance measure of the policy parameter θ. Instead of trying to come up with a single optimal policy, we cast the problem into a Bayesian framework in which we treat J_n(θ) as if it were the likelihood function of θ. Combined with an uninformative prior μ(θ), this leads to a pseudo-posterior distribution for the policy parameter. We then target this quasi-posterior and draw samples from it via MCMC, where the target is constructed as

π_n(θ) ∝ μ(θ) J_n(θ).

This approach may be desirable particularly because we can explore the high expected reward regions of Θ without having to approximate any gradient information. Once we have samples from π_n(θ), we can deduce point estimates for θ if desired, such as the mean of the samples, the maximum a posteriori point, etc.
Note that there are also Bayesian methods for policy learning tailored to the specific choice U(x) = x. Similar to our setting, these methods assign a prior distribution with probability density μ(θ) to the policy parameter θ and use some suitable function of J_n(θ) as the 'likelihood' function, thereby targeting the resulting posterior distribution for θ, denoted π_n(θ). For example, Hoffman et al. (2008) ensure a non-negative J_n(θ), take it as the likelihood function, and then utilize a trans-dimensional EM-based MCMC algorithm to sample from the posterior distribution that is proportional to

μ(θ) J_n(θ).     (15)

The sampling procedure in their work is carried out using the reversible jump MCMC approach proposed in Green (1995), Green (2003), and Richardson and Green (1997). Drawing on the work of Hoffman et al. (2008), a hybrid MCMC algorithm was designed for (15) in Wilson, Fern, and Tadepalli (2010). Unlike Hoffman et al. (2008) and our method, the MCMC moves proposed in Wilson et al. (2010) target an approximation of the posterior distribution by making use of importance sampling (IS) approaches. Alternatively, Wingate et al. (2011) allow for negative rewards but treat exp(J_n(θ)) as a 'likelihood' and then apply an approximate MCMC algorithm to sample from the posterior distribution that is proportional to μ(θ) exp(J_n(θ)).
There are also hybrid approaches, such as the expectation-maximization (EM) algorithm in Toussaint and Storkey (2006), which maximizes the value function V(θ) with respect to the policy parameters θ: in the E-step, a fixed policy is assumed and posteriors over the state-action pairs are computed with forward-backward algorithms; the policy is then updated in the M-step.
In the Bayesian methods for U(x) = x that are mentioned above, since J n (θ ) is the expectation of an additive function of instantaneous rewards, estimating J n (θ ) and incorporating the estimate in the methods seem to be challenging due to high variance, difficulty in ensuring non-negative J n (θ ), or bias as shown by Maddison et al. (2017). In the next section, we will show that in the risk-sensitive scenario, where J n (θ ) is the expectation of a multiplicative quantity (see (14)), we can utilize efficient, lower variance, unbiased estimators of J n (θ ) in our proposed method.

MCMC for policy search
MCMC methods are a vastly popular family of methods used to obtain samples from complex distributions. Here, the distribution of our interest is π n (θ ), which is indeed hard to calculate expectations with respect to or generate exact samples from.
Given the current sample θ, a new value θ′ is proposed from a proposal density q(θ′ | θ). Then, we either accept the proposal and set θ^{(k)} = θ′ with probability α(θ, θ′) = min{1, ρ(θ, θ′)}, or we reject the proposal and set θ^{(k)} = θ, where the acceptance ratio is

ρ(θ, θ′) = [μ(θ′) J_n(θ′) q(θ | θ′)] / [μ(θ) J_n(θ) q(θ′ | θ)].

The MH algorithm requires the ability to calculate the acceptance ratio ρ(θ, θ′) for any given θ, θ′ ∈ Θ. For our case, this is not possible, since the calculation of ρ(θ, θ′) involves the calculation of J_n(θ) and J_n(θ′), which is not generally possible due to the complex structure of J_n(θ) in (14). Fortunately, we can still target π_n(θ) using an MCMC algorithm if unbiased and non-negative estimates of J_n(θ) can be obtained, i.e. random variables Ĵ_n(θ) ≥ 0 such that

E[Ĵ_n(θ)] = J_n(θ).

The MH algorithm that uses such Ĵ_n(θ) instead of J_n(θ) is called the pseudo-marginal MH algorithm (Andrieu & Roberts, 2009); see also Andrieu et al. (2010) for a broader family of such algorithms for hidden Markov models, namely particle MCMC algorithms. The implementation of the pseudo-marginal MH algorithm for our RL problem is given in Algorithm 1.
Algorithm 1: Pseudo-marginal MH for RL
Input: number of time steps n, number of iterations K, number of particles N, initial value and estimate of expected performance (θ^{(0)}, Ĵ_n^{(0)})
for k = 1, …, K do
    Propose θ′ ∼ q(· | θ^{(k−1)}).
    Obtain an unbiased estimate Ĵ′_n of J_n(θ′) by using Algorithm 2 with N particles.
    With probability min{1, [μ(θ′) Ĵ′_n q(θ^{(k−1)} | θ′)] / [μ(θ^{(k−1)}) Ĵ_n^{(k−1)} q(θ′ | θ^{(k−1)})]}, accept the proposal and set θ^{(k)} = θ′ and Ĵ_n^{(k)} = Ĵ′_n; otherwise reject the proposal and set θ^{(k)} = θ^{(k−1)} and Ĵ_n^{(k)} = Ĵ_n^{(k−1)}.
end
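A minimal sketch of the pseudo-marginal MH iteration follows, assuming a symmetric Gaussian random-walk proposal (so q cancels in the acceptance ratio); `J_hat` stands for any non-negative unbiased estimator of J_n(θ), and the toy sanity check below substitutes an exact J and a flat prior so the target is known.

```python
import numpy as np

rng = np.random.default_rng(1)

def pseudo_marginal_mh(J_hat, mu, theta0, n_iters, prop_scale=0.5):
    """Pseudo-marginal Metropolis-Hastings sketch.

    J_hat(theta) returns a non-negative unbiased estimate of J_n(theta);
    mu is the prior density. The Gaussian random-walk proposal is symmetric,
    so the acceptance ratio reduces to (mu*J_hat) at the proposal over the
    current value.
    """
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    J_cur = J_hat(theta)
    samples = []
    for _ in range(n_iters):
        prop = theta + prop_scale * rng.standard_normal(theta.shape)
        J_prop = J_hat(prop)
        ratio = (mu(prop) * J_prop) / max(mu(theta) * J_cur, 1e-300)
        if rng.uniform() < min(1.0, ratio):
            theta, J_cur = prop, J_prop   # accept: keep the new estimate too
        samples.append(theta.copy())      # on rejection, theta and J_hat are re-used
    return np.array(samples)

# toy check: with the exact "estimator" J(theta) = exp(-theta^2) and a flat
# prior, the chain targets a density proportional to exp(-theta^2)
chain = pseudo_marginal_mh(lambda th: float(np.exp(-np.sum(th**2))),
                           lambda th: 1.0, theta0=0.0, n_iters=2000)
```

A key point of the pseudo-marginal construction is that the accepted estimate Ĵ_n is stored and re-used until the next acceptance; replacing it with a fresh estimate on rejection would break exactness.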

SMC for estimating J n (θ)
Algorithm 1 shows that we can work with non-negative unbiased estimators of J_n(θ) instead of J_n(θ) itself, and still target π_n(θ) exactly. The question, then, is how to obtain unbiased estimators of J_n(θ) for any given θ. This is possible using an SMC algorithm, thanks to the fact that the reward function is multiplicative. For the sake of simplicity (and without loss of generality regarding the validity of our methodology), we will assume from now on that κ = 1 and that the instantaneous reward function depends only on x_t = (a_t, s_t) and not on s_{t+1}.^1 With this assumption, we can rewrite

J_n(θ) = ∫ p_θ(x_{1:n}) ∏_{t=1}^{n} exp{γ^{t−1} r(a_t, s_t)} dx_{1:n},

which suggests that we may cast the problem of estimating the expected overall reward in our RL setting into the framework of Feynman-Kac formulae (Del Moral, 2004), in which {X_t}_{t≥1} is the Markov chain and exp{γ^{t−1} r(a_t, s_t)} is the potential function at time t. In other words, we can think of {X_t}_{t≥1} as the latent Markov process of a hidden Markov model, and exp{γ^{t−1} r(a_t, s_t)} as the conditional observation density at time t for some fixed observation. It was shown in Del Moral (2004) that a non-negative and unbiased estimator of J_n(θ) can be obtained by running an SMC algorithm, also known as a particle filter, as in Algorithm 2.
Algorithm 2: SMC algorithm for an unbiased estimate of J_n(θ)
Input: policy θ, number of time steps n, discount factor γ, number of particles N
Output: unbiased estimate Ĵ_n of J_n(θ)
Start with Ĵ_0 = 1.
for t = 1, …, n do
    If t = 1, sample X_1^{(i)} ∼ f(x_1) for i = 1, …, N; otherwise sample X_t^{(i)} ∼ f_θ(· | X̃_{t−1}^{(i)}) for i = 1, …, N.
    Compute the weights W_t^{(i)} = exp{γ^{t−1} r(A_t^{(i)}, S_t^{(i)})}, i = 1, …, N.
    Update the estimate: Ĵ_t = Ĵ_{t−1} (1/N) ∑_{i=1}^{N} W_t^{(i)}.
    Resample: draw X̃_t^{(i)}, i = 1, …, N, from {X_t^{(j)}}_{j=1}^{N} with probabilities proportional to the weights.
end
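The propagate-weight-resample loop of Algorithm 2 can be sketched as follows; `step` and `reward` are hypothetical callables standing in for the model, with `step(None)` drawing the N initial particles.

```python
import numpy as np

rng = np.random.default_rng(2)

def smc_estimate(step, reward, n, gamma=0.99):
    """Unbiased SMC estimate of J_n(theta) = E[prod_t exp(gamma^(t-1) r(X_t))].

    `step(X)` propagates the particle array one step of the Markov chain
    p_theta (step(None) draws the initial particles); `reward(X)` returns
    the instantaneous reward r for each particle.
    """
    J, X = 1.0, None
    for t in range(1, n + 1):
        X = step(X)                                # propagate particles
        W = np.exp(gamma**(t - 1) * reward(X))     # potential weights
        J *= W.mean()                              # running unbiased product
        idx = rng.choice(len(W), size=len(W), p=W / W.sum())
        X = X[idx]                                 # multinomial resampling
    return J
```

With a constant reward r ≡ c, the estimate is exact: every weight equals exp(γ^{t−1} c), so Ĵ_n = ∏_t exp(γ^{t−1} c), which makes a convenient sanity check.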
To put our methodology in context, it is worth acknowledging the works that relate reinforcement learning and hidden Markov models (or more gen-erallyFeynman-Kac formulae) and thereby adopting existing inference tools developed for hidden Markov models to reinforcement learning already: Toussaint and Storkey (2006) developed an EM algorithm for finite and Gaussian reinforcement learning scenarios. Rawlik, Toussaint, and Vijayakumar (2010) proposed an approximate inference approach, again based on the EM algorithm from more challenging scenarios. Finally, a recent independent work by Maddison et al. (2017) points to this relation and proposes Algorithm 2 for an unbiased estimator of J n (θ ), as we do. However, while Maddison et al. (2017) ultimately uses this estimator to utilize a policy gradient algorithm for risk-sensitive policy learning, we employ it in an MCMC algorithm by exploring the posterior distribution for Bayesian learning. To the best of our knowledge, no previous work has targeted π(θ) by executing an SMC algorithm in a Bayesian framework for a multiplicative reward structure, without the need to calculate gradient information.
Algorithm 2 requires propagation, weighting, and resampling of N particles at each time step. When only one particle is used, i.e. N = 1, the algorithm reduces to sampling a single path X_{1:n} ∼ p_θ(x_{1:n}) and calculating exp{R_n(X_{1:n})} = ∏_{t=1}^{n} exp{γ^{t−1} r(X_t)} as an estimate of the expected reward. Despite its expected inferior statistical properties compared to N > 1, the N = 1 version of Algorithm 2 is worth noting, because it allows a straightforward real-time implementation of the algorithm, say, by a single robot.

Simulations
This section discusses the implementation of the proposed method for a nonlinear model of an inverted pendulum and demonstrates its usefulness in comparison with policy gradient algorithms for reinforcement learning.

The inverted pendulum model
The state space S of the inverted pendulum consists of four continuous components: the angle, angular velocity, position, and velocity of the cart-pole system, i.e. s_t = (φ_t, φ̇_t, p_t, ṗ_t). The action space A is one-dimensional and continuous and plays the role of the force applied to the cart. It is assumed that the previous state and the current action determine the current state via a deterministic function G : S × A → S, so that g(s_t | s_{t−1}, a_t) = δ_{G(s_{t−1}, a_t)}(s_t) in terms of densities. The function G is implicitly dictated by the dynamics of the inverted pendulum given in Dann, Neumann, and Peters (2014): when action a is applied to the pendulum at state s = (φ, φ̇, p, ṗ), the accelerations φ̈ and p̈ are determined by the cart-pole equations of motion, where g = 9.8 m/s² is the gravitational constant, b = 0.1 N s/m is the friction coefficient, l = 0.6 m is the length of the pole, M = 0.5 kg is the mass of the cart, and m = 0.5 kg is the mass of the pole. The control objective is to stabilize the inverted pendulum in the upright position as well as to bring the cart to the origin from a random initial state.
Given the state value S_t = s_t at time t, the action A_t is stochastic and drawn from the Gaussian distribution N(θ^T s_t, σ²) with mean θ^T s_t and variance σ², so that the conditional density of the action is

h_θ(a_t | s_t) = (2πσ²)^{−1/2} exp{−(a_t − θ^T s_t)² / (2σ²)}.

The objective of policy search is to learn the parameter vector θ, while the variance of the action is assumed to be constant.
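This linear-Gaussian policy is straightforward to implement; the following sketch pairs the sampler with its density h_θ(a | s).

```python
import numpy as np

rng = np.random.default_rng(3)

def policy_sample(theta, s, sigma=2.0):
    """Draw a_t ~ N(theta^T s_t, sigma^2): the force applied to the cart."""
    return rng.normal(theta @ s, sigma)

def policy_density(a, theta, s, sigma=2.0):
    """h_theta(a | s) for the linear-Gaussian policy."""
    m = theta @ s
    return np.exp(-(a - m) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
```

Setting σ = 0 at test time, as done in the experiments, reduces the policy to the deterministic linear state feedback a = θ^T s.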

Numerical experiments
The performance of the proposed method will be compared with three policy gradient RL algorithms, namely eNAC, REINFORCE, and G(PO)MDP. Two differently structured reward functions have been used for the proposed method and the PG algorithms: an interval-based reward, given in Equation (18), which grants a reward only while the state stays within prescribed intervals, as well as a quadratic reward function r(a, s) = −s^T Q s − c a², where Q is a positive definite matrix and c is a positive real number. Here, we take Q to be a diagonal matrix with diagonal [12, 1.25, 2, 0.25]^T and c = 0.01. These predefined values are beneficial in adjusting the relative contributions of the state and the action to the reward function.
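The quadratic reward with the Q and c used in the experiments can be written directly:

```python
import numpy as np

# weights used in the experiments: heavier penalty on the angle and position
Q = np.diag([12, 1.25, 2, 0.25])
c = 0.01

def quadratic_reward(a, s):
    """r(a, s) = -s^T Q s - c a^2 (always non-positive; 0 only at the origin)."""
    return -s @ Q @ s - c * a**2
```

The diagonal weights make the trade-off explicit: deviations in the angle φ are penalized roughly ten times more heavily than deviations in cart position, and the control effort a enters only weakly through c.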
The discount factor for the overall reward R_n is γ = 0.99. The simulation length for all of the experiments is n = 1000 time steps with a sampling time of 0.01 s. During the learning process, the variance of the action is taken as σ² = 4; during the testing process, however, this value is set to 0 to obtain a deterministic action policy.

Comparison with gradient-based methods based on total discounted reward
First, we compare our MCMC method with three gradient-based methods, namely REINFORCE, G(PO)MDP, and eNAC, as described in Section 1.3. Each of the gradient-based methods is run for 500 iterations, and H = 100 episodes are run at each iteration to calculate an approximation of the gradient. For the MCMC method in Algorithm 1, 50,000 iterations are run, with N = 1 particle per iteration. This choice of algorithm parameters ensures that the computational complexity is the same for all of the compared methods. Also, a Gaussian random-walk proposal is chosen, q(θ′ | θ) = N(θ′; θ, Σ_q), where Σ_q is a diagonal covariance matrix with (10, 1, 1, 1)^T on its diagonal. The prior distribution for θ is N(0, 10³ I_4), which reflects the lack of prior information about θ; here, I_4 is the 4 × 4 identity matrix. Later, we will discuss the importance of the choice of Σ_q for the performance of the algorithm by setting two different values for Σ_q and inspecting the resulting effects.

Interval-based reward.
Firstly, we performed comparisons with the interval-based reward function in Equation (18). The corresponding results are depicted in Figure 3, where the trace plots of the parameter updates versus computation time for the suggested MCMC algorithm and the gradient-based methods are given. While the gradient-based methods aim for a point estimate, the MCMC method targets the quasi-posterior π_n(θ) in a Bayesian framework and provides samples of θ that are (approximately) distributed according to π_n(θ). We can observe from the plots for the gradient methods that they converge to different maximal points of the expected reward function. The samples from the MCMC method, however, 'recognize' both of these maximal points by exploring (roughly) the high-probability regions of π_n(θ). This can be seen more clearly in the histograms in Figure 4 for the components of θ, which represent the component-wise marginal distributions under π_n(θ). Next, we measure the differences between the performance of the proposed method and the PG algorithms in terms of convergence. For this purpose, each simulation is run 5 times and the estimated expected total rewards are averaged; for the MCMC method, the logarithms of the estimated expected overall rewards, i.e. log Ĵ_n(θ), are plotted versus simulation time. This is demonstrated in Figure 1, where the shaded parts of the plots show the standard deviation. As can be observed from Figure 1, the expected average return for MCMC converges faster than for the other methods. Note that J_n(θ) is defined differently for the MCMC method (with utility function U(x) = exp(x)) and the gradient-based methods (with utility function U(x) = x), which explains the different values on the y-axes; the convergence-time comparison is nevertheless still justified.
To measure the performance of the time responses obtained with the learned parameters, we calculated a composite Integral Absolute Error (IAE) for φ_t, p_t, and a_t (the closer each of them is to 0, the better),

IAE = ∫ |e_t| dt,

where e_t stands for the error signal given by the difference between the desired values of φ_t, p_t, and a_t and their actual values in the feedback control loop. While the θ values selected for measuring the IAE of the gradient-based methods are simply the values obtained at the last iteration, for the MCMC method we use the average of the last quarter of the total 50,000 iterations of the algorithm, representing an estimate of the mean of the target distribution π_n(θ). Figure 5 illustrates both the obtained IAE and a sample time response for the compared methods, revealing that the proposed method outperforms G(PO)MDP and REINFORCE in terms of the IAE of the time-domain responses; it is, however, outperformed by eNAC when the interval-based reward function is used.
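For a discrete-time simulation, the IAE integral reduces to a sum over samples scaled by the sampling time; a minimal sketch (the composite score is then the sum of the IAEs of the three signals):

```python
import numpy as np

def iae(signal, reference=0.0, dt=0.01):
    """Integral Absolute Error, IAE = int |e_t| dt, approximated on a grid
    with step dt; e_t is the deviation of the signal from its reference."""
    return np.sum(np.abs(np.asarray(signal) - reference)) * dt
```

For the cart-pole experiments the references for φ_t, p_t, and a_t are all zero (upright pole, cart at the origin, no control effort), so `reference=0.0` matches the setup described above.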

Quadratic reward.
Here, we replaced the interval-based reward function with the quadratic one given in (19) and repeated the experiments. It is worth mentioning that quadratic rewards are widely used in learning the behaviour of dynamic systems. As before, the trace plots for MCMC and the gradient-based methods are presented in Figure 6, the histograms of the samples obtained from MCMC are given in Figure 7, the estimated expected returns are depicted in Figure 2, and both the IAE and a sample time response for the compared methods are shown in Figure 8. Note that for the quadratic reward case the REINFORCE algorithm did not manage to converge to an optimal solution, which is why it is excluded from the comparisons. The conclusions that can be drawn from those figures are almost the same as for the interval-based reward, except that, as is obvious from Figure 8, MCMC also performed better than eNAC and G(PO)MDP in terms of the IAE of the time responses. The control performance in terms of IAE is also summarized in tabular form. It is worth noting that the structure of the reward function influences the ultimate performance of the control system in terms of time response. This can be understood by comparing Figures 5 and 8: using the quadratic reward improves the overall time responses of the system by reducing the settling time and eliminating the overshoots (thereby yielding a lower IAE) observed in the time responses obtained with the interval-based reward.

Comparison within the MCMC method 3.2.2.1. Effects of having different number of particles.
Considering the real-time applications of the proposed method, all of the above experiments ran MCMC with the number of particles set to N_p = 1. In a simulation environment, we can use different numbers of particles to improve the performance of the algorithm. Using the quadratic reward structure, the performance of the proposed MCMC method is compared for N_p = 10 and N_p = 100 particles. The sampled values of the policy parameters, the performance in terms of the total expected reward, and the histograms of the learned policy parameters are shown in Figures 9-11, respectively. As can be observed from the figures, N_p = 100 produces smoother histogram estimates of the marginal distributions. However, the algorithm needs more time to converge for N_p = 100 due to the 10-fold increase in computation time per iteration, as is obvious from Figure 10.

Effects of the proposal covariance Σ_q and the action variance σ².
The samples generated by the MCMC algorithm need to mix well, in the sense that they quickly forget their previous values. In particular, we need to choose a suitable proposal covariance Σ_q to ensure good mixing. While a very small Σ_q results in samples that are highly correlated and mix very slowly, a very large Σ_q tends to make the chain stick to the old values of the generated samples by causing the new proposals to be rejected. Below, we demonstrate this phenomenon in the policy search setting. We take the quadratic reward structure as in the previous subsection and examine all four combinations of (σ, Σ_q) for
• σ = 2.5 and σ = 1.5, and
• Σ_q = diag[100, 10, 10, 10] and Σ_q = diag[1, 0.1, 0.1, 0.1].
The corresponding simulations are depicted in Figures 12 and 13. Comparing the trace plots in those figures with the one in Figure 6(c), we observe that choosing Σ_q to be neither very large nor very small, as in Figure 6(c), makes the method converge to its target distribution with well-mixed samples. Specifically, part (a) of Figures 12 and 13 indicates that large values of Σ_q cause rejection of the proposed policy parameters, so the generated samples tend to stick to their old values. In part (b) of the same figures, the generated samples mix slowly and are therefore highly correlated, due to the small Σ_q.
We also observe that, in spite of bad choices of Σ_q, the proposed MCMC algorithm is still able to tackle the policy parameter estimation problem and manages to converge; the corresponding convergence behaviour is presented in parts (c) and (d) of Figures 12 and 13.
Finally, our conclusions hold for both values of the action variance σ, which indicates a certain level of robustness for our method.

Conclusion and future work
In this paper, we considered the control problem of an MDP with continuous state and action spaces over a finite time horizon and presented a novel policy search method for a class of parameterized policies in RL under a Bayesian learning framework. The proposed learning approach is an MCMC algorithm that uses estimates of the expected multiplicative total reward obtained with an SMC algorithm. Using a multiplicative reward formulation, inspired by the notion of risk sensitivity, proved advantageous in comparison with its additive counterpart, as illustrated by our simulation results.
The proposed MCMC algorithm was applied to the stabilization of a nonlinear model of an inverted pendulum using both quadratic and interval-based rewards, and the simulation results were compared with some well-known gradient-based policy search (PG) approaches. Comparing the results achieved by the proposed MCMC algorithm with those of the PG algorithms, some observations stand out: (i) Whereas the PG methods attempt to find a point estimate of the policy parameters (and are prone to getting trapped in local optima), the proposed MCMC algorithm targets the quasi-posterior distribution in a Bayesian scheme, which yields admissible values of the policy parameter. Since the target distribution is proportional to the expected total reward, the search for good parameters is steered towards regions of high expected reward. (ii) The proposed method either outperforms the tested PG algorithms in terms of time-response specifications, or its performance is close enough to that of the PG methods and meets the desired control objectives. (iii) In terms of convergence time, the proposed MCMC algorithm was observed to be superior to the tested PG algorithms. (iv) The proposed MCMC algorithm can be advantageous over the gradient-based methods by circumventing the difficulties faced when using gradient information. Gradient-based tactics may suffer from poor convergence by getting stuck in local optima or diverging due to weak choices of the learning rate in the gradient ascent update rules; instead, we leverage estimates of the expected multiplicative reward, without the need to approximate any gradient.
In this work, we showed the merits of the proposed MCMC algorithm (Algorithm 1) via computer simulations. In a real-time system, for example with a robot, it may be hard to implement when multiple particles are involved at the same time. However, for such cases, Algorithm 1 can still be used effectively with one particle per iteration, as shown in the numerical section. Indeed, work is underway to use the N = 1 scheme in a real-world discrete-time implementation where the trajectories are generated by a robot.

Note
1. For the general case of dependency on s_{t+1} as well, one can extend the state X_t of the Markov chain to include S_{t+1} and redefine the transition laws of the modified Markov chain; the methodology developed in this paper can then be applied to the modified Markov chain.

Disclosure statement
No potential conflict of interest was reported by the authors.