Bayesian optimal design for ordinary differential equation models with application in biological science

Bayesian optimal design is considered for experiments where the response distribution depends on the solution to a system of non-linear ordinary differential equations. The motivation is an experiment to estimate parameters in the equations governing the transport of amino acids through cell membranes in human placentas. Decision-theoretic Bayesian design of experiments for such nonlinear models is conceptually very attractive, allowing the formal incorporation of prior knowledge to overcome the parameter dependence of frequentist design and being less reliant on asymptotic approximations. However, the necessary approximation and maximization of the, typically analytically intractable, expected utility results in a computationally challenging problem. These issues are further exacerbated if the solution to the differential equations is not available in closed-form. This paper proposes a new combination of a probabilistic solution to the equations embedded within a Monte Carlo approximation to the expected utility with cyclic descent of a smooth approximation to find the optimal design. A novel precomputation algorithm reduces the computational burden, making the search for an optimal design feasible for bigger problems. The methods are demonstrated by finding new designs for a number of common models derived from differential equations, and by providing optimal designs for the placenta experiment.


Modelling complex physical processes
Often the dynamics behind a complex physical process can be approximately described by a system of non-linear ordinary differential equations (ODEs), where the solution to these equations provides a model predicting how a quantity of interest will behave with respect to time.It is assumed that the system of ODEs depends on some unknown physical properties (parameters) of the process in question and, potentially, may also depend on some additional controllable (design) factors.
The value of the parameters may be of direct interest or we may be interested in predicting the behaviour of the process at certain values of the design variables and/or at a certain time.In either case, we need to estimate the unknown parameters.An experiment can be conducted where observations of the quantity of interest are collected at various different times and, possibly from multiple runs of the experiment with different values of the design factors.To estimate the parameters, a statistical model is assumed that links the parameters to the observations via a data-generating process based on the solution to the ODEs (see, for example, Ramsay et al. 2007).
It may be possible to conduct a designed experiment where the observation times and design variables are actively selected in advance of the experiment.Optimal design of experiments refers to this selection being made optimally to minimise a loss function which reflects the ultimate goal of the experiment, e.g.estimation of the unknown parameters or prediction of a future process.
We consider Bayesian optimal design of experiments for ODE models.Bayesian optimal design has very principled foundations but can be hard to implement in practice due to the computational complexities involved.Firstly, it involves the minimisation of the expected loss function which will typically be analytically intractable.Secondly, the dimensionality of the domain of the expected loss function can sometimes be large since every quantity that can be specified for the experiment corresponds to a dimension.
For ODE models, these problems are compounded by the fact that, typically, the solution to the system of ODEs is not analytically tractable.A possible approach is to use numerical methods to find an approximate solution to the systems of ODEs (see, e.g., Iserles, 2009).However, this has two disadvantages.First, the numerical solution can be computationally expensive.Bayesian inference for computationally expensive models has begun to receive considerable attention in the Statistics literature (e.g.Kennedy and O 'Hagan 2001;Rasmussen 2003;Bliznyuk et al. 2008;Fielding et al. 2011;Overstall and Woods 2013).In each case, to some degree, evaluation of the likelihood (which depends on the computationally expensive numerical solution) is replaced by an evaluation of an approximation.The second disadvantage, which is perhaps more serious, is that the numerical error, unavoidable with numerical methods, is typically not taken account of when performing subsequent evaluations with the numerical solution.This issue was explained by Chkrebtii et al. (2015) who proposed a fully probabilistic solution to the system of ODEs.We take advantage of this methodology in our treatment of optimal design for ODE models.Overstall and Woods (2015) proposed the approximate coordinate exchange (ACE) algorithm for Bayesian optimal design for non-ODE models.Very briefly, the ACE algorithm uses a cyclic descent algorithm (see, for example, Lange, 2013, pg 171) to minimise an approximation to the expected loss function.The approximation used is a Gaussian process (GP) emulator fitted to a Monte Carlo integration approximation of the expected loss.In this paper, we extend this algorithm to ODE models.We replace evaluation of the intractable solution to the ODEs by a value generated from the probabilistic solution as proposed by Chkrebtii et al. (2015).

Measuring human placentas
To aid exposition of the ideas and methodology introduced throughout this paper, consider the following application from biologists at the Southampton Centre for Biological Sciences (University of Southampton, UK).Interest lies in the transport of the amino acid serine within a human placenta.Specifically, we are concerned by how serine moves from the outside to the inside of a portion of placental cell membrane (called a vesicle).To investigate, an experiment is to be performed where initial amounts of radioactive and non-radioactive serine (in µl) are placed inside and outside the vesicle.The amount of radioactive serine inside the vesicle is then measured at a series of observation times.The serine transport process can be described by a system of ODEs which depend on the initial amounts of serine and four unknown physical parameters (see Table 1) which are of interest to the scientists.The solution to the system of ODEs provides theoretically predicted amounts of radioactive and non-radioactive serine inside the vesicle at a certain time.The practitioners have control over the initial amounts of non-radioactive serine inside and outside the vesicle for each experiment and the values of the observation times.Our task is to choose the initial amounts and observation times optimally with respect to the goal of estimating the physical parameters.

Organisation of the paper
The paper is organised as follows.In Section 2 we describe the background to the problem including statistical inference for ODE models, the premise of Bayesian optimal design and a brief description of the ACE algorithm.In Section 3 we describe the proposed methodology for optimal design for ODE models including a description of the probabilistic solution to ODEs of Chkrebtii et al. (2015) and how this can be embedded in the ACE algorithm.In Section 4, we apply this methodology to three illustrative examples where the goal is parameter estimation.Finally, in Section 5, the methodology is applied to the human placenta example where differing goals of parameter estimation and model selection are considered.

Statistical inference for ordinary differential equations
Let x ∈ X be a vector of k design variables, i.e. a treatment, and let θ ∈ Θ be a p × 1 vector of physical parameters.Consider the following system of S ODEs which define an initial value problem where u(t) is the gradient vector of u(t) with respect to time t, and u 0 ∈ R S denotes the initial conditions.In (1), f : R S × [T 0 , T 1 ] × X → R S is a suitably well-behaved function that, at the very least, we assume satisfies the Lipschitz condition (see, e.g.Iserles, 2009, pg 3).This means that (1) has a unique solution.Note that the solution actually depends on θ and x, i.e. u(t) = u(t; θ, x) but we only use the longer notation when we need to be clear that there may be more than one θ or x.
Now we believe that the physical process in question is governed by (1).For j = 1, . . ., M , we observe the process for treatment x j , initial conditions u 0j , and at times t j = t j1 , . . ., t jn j .For l = 1, . . ., n j , we observe the c × 1 vector of responses y jl ∈ Y jl , where Y jl denotes the c-dimensional sample space for the jlth observation.Let y j = y j1 , . . ., y jn j be the cn j × 1 vector of responses for the jth treatment and let y = (y 1 , . . ., y M ) ∈ Y be the n × 1 vector of responses for the complete experiment where n = c M j=1 n j and Y = M j=1 n j l=1 Y jl is the overall sample space.We assume that y are realisations according to y|ψ, d ∼ F (ψ; d) , where F is a known probability distribution, ψ ∈ Ψ is a P × 1 vector of model parameters (with parameter space Ψ), and d ∈ D is a q × 1 vector specifying the design (with design space D).The distribution in (2) essentially defines a statistical model.Note that we decompose ψ = (θ, γ), where γ ∈ Γ is a (P − p) × 1 vector of nuisance parameters.The design, d, is the set of controllable experimental conditions and can include the treatments x 1 , . . ., x M ; the initial conditions u 01 , . . ., u 0M ; and the observation times t j1 , . . ., t jn j , for i = j, . . ., M .In practice, some of these may be fixed by the protocol of the experiment.Alternatively, the initial conditions may be unknown, and included in the vector of parameters, ψ, either as physical or nuisance parameters.
The dependence of the distribution in (2) on θ and d is through the solution of the system of ODEs given by (1).The most obvious way to do this is to assume that where Ramsay et al. (2007) call this a distributed partial data problem.
As an example, consider the human placenta example introduced in Section 1.2.The system of S = 2 ODEs is given by u1 where and T 0 = 0 and T 1 = 600 seconds.The solution to this system is u(t) = (u 1 (t), u 2 (t)) which are the amounts of radioactive and non-radioactive serine, respectively, inside the vesicle at time t.The values of x = (x 1 , x 2 ) ∈ X = [0, 1000] 2 are the amounts of radioactive and non-radioactive serine outside the vesicle at time t = 0, and u 0 = (u 01 , u 02 ) ∈ [0, 1000] 2 are the corresponding amounts of serine inside the vesicle at time t = 0.In the experiment, we are able to control x 2 and u 02 (i.e.x 1 and u 01 are fixed by the experimental protocol), and the response, y jl , is the amount of radioactive serine inside the vesicle at time t jl for process conditions x 2j and u 02j .We assume a statistical model where E (y jl |θ, x j , t jl ) = u 1 (t jl ), so that in this case G(u, θ) = u 1 .The design is the collection of initial amounts of non-radioactive serine x 2j and u 02j , and the observation times t j1 , . . ., t jn j ; for j = 1, . . ., M .

Decision-theoretic approach to Bayesian optimal design
We now describe the decision-theoretic approach to Bayesian optimal design.We complete the statistical model given by F in (2) by specifying a prior distribution for ψ which does not depend on the design d.Once we observe y, the posterior distribution of ψ is given by where π(y|ψ, d) is the mass/density function of F and π(ψ) is the prior density function.Note that the right-hand-side of (4) also defines the joint distribution of ψ and y.The posterior distribution of the physical parameters is found by marginalising (4) with respect to the nuisance parameters.
Bayesian optimal design relies on the specification of an appropriate (for the goal of the experiment) loss function denoted by λ(ψ, y, d) which depends on the design and, potentially, the unobserved responses and unknown parameters.An optimal design, d * , is given by a value of d that minimises the expectation of λ(ψ, y, d) with respect to the joint distribution of ψ and y (given d), i.e.
For example, suppose we were interested in posterior point estimation of the elements of θ, then an appropriate loss function might be the squared error loss (SEL) given by which does not depend on the nuisance parameters.It can be shown that and so the optimal design minimises the expected trace of the posterior variance matrix of θ.
As mentioned in Section 1, we are faced with three problems when trying to minimise the expected loss function: • high dimensionality of the design space, D; • intractability of the integration required to evaluate L(d) and λ (ψ, y, d); • both evaluation of λ (ψ, y, d) and the joint distribution of y and ψ will typically depend on the intractable solution to the system of ODEs.
The approximate coordinate exchange (ACE) algorithm, proposed by Overstall and Woods (2015), is a solution to the first two problems and is described in the next section.

Approximate coordinate exchange algorithm
To overcome the problem of the high dimensionality of the design space, the coordinate exchange (CE; Meyer and Nachtsheim 1995) algorithm is used to minimise the expected loss function, L(d).This is the same as a cyclic descent algorithm where L(d) is minimised, sequentially, over each element (or coordinate) of the design space, where all other elements are held fixed.This process is then repeated until convergence.
However, instead of minimising the intractable L(d), at each iteration, the ACE algorithm minimises an approximation, L(d).Consider the Monte Carlo (MC) approximation to L(d): where {ψ i , y i } B i=1 is a sample generated from the joint distribution of ψ and y, given d.The LB (d) is a consistent and unbiased estimator of L(d).However, there are at least two reasons why LB (d) would be a poor choice for L(d).First, LB (d) is a stochastic approximation and so is a non-smooth function.Secondly, LB (d) is computationally expensive.This problem is aggravated by the fact that, in some cases, the loss function is, itself, an intractable function requiring MC approximation.
Instead, Overstall and Woods (2015) proposed constructing an approximation (or emulator) for L(d) based on a "small" number of evaluations of LB (d).One of the most common types of emulator is the Gaussian process (GP) emulator.The use of GP emulators in more general optimisation problems dates back to, at least, the expected improvement approach of Jones et al. (1998).The GP emulator provides a predictive distribution for L(d) and we set L(d) to be the predictive mean of this distribution.This approach of smoothing the MC approximations can be seen as an extension to the approach of Muller and Parmigiani (1996), for minimising the expected loss, to higher dimensionality design spaces, by using the CE algorithm.A similar approach was used by Gotwalt et al. (2009) who used a deterministic quadrature rule to evaluate the log determinant of the Fisher information matrix averaged over a prior distribution (a commonly-used classical objective function for optimal design) and then maximised the resulting approximation over the design space using CE.
The actual algorithm is provided in Appendix A. Note that a GP emulator, like all statistical models, can fit inadequately.Bastos and O'Hagan (2009) developed diagnostics to assess the adequacy of GP emulators.However, applying these methods automatically within the ACE algorithm is infeasible.Instead, once L(d) has been minimised, we, independently of the GP emulator, decide whether to accept the change to the current design before we move onto the next element/coordinate in the ACE algorithm.This accept step is accomplished using a Bayesian hypothesis test.For more details on this feature, on the overall ACE algorithm, and on the wider issue of Bayesian optimal design, see Overstall and Woods (2015).

Methodology
In this section we describe how the ACE algorithm can be extended to find Bayesian optimal designs for ODE models.
In Section 3.1, we briefly describing the method of Chkrebtii et al. (2015) for finding a probabilistic solution to a system of ODEs and then, in Section 3.2, describe how this solution can be embedded in the ACE algorithm.

Probabilistic solution to ODEs
Let R λ denote a square integrable kernel function with length scale parameter λ ∈ (0, ∞).Furthermore, let where α > 0. Let u(t) = (u 1 (t), . . ., u S (t)), then the central assumption of the method of Chkrebtii et al. ( 2015) is that u s (t) and its time derivative, us (t), have a joint GP prior, i.e.
2. For r = 1, . . ., N − 1 complete the following steps: (a) Compute the S × 1 vector giving the predictive mean of u(τ r+1 ) as where F r is the r × S matrix with qth row given by f q for q = 1, . . ., r. Compute the common predictive variance of u s (τ r+1 ) as (b) For s = 1, . . ., S, generate a prediction of the solution at τ r+1 , u s (τ r+1 ), from the predictive distribution, i.e.
Once step 2 is complete, then a probabilistic solution is given by for s = 1, . . ., S, where m s (•) is the sth element of the vector of predictive means given by and C(•, •) is the common predictive variance given by The methodology holds for general kernel functions.However, Chkrebtii et al. (2015) suggest two example kernel functions, the squared exponential kernel function given by and the uniform kernel function given by where I(A) is the indicator function for the event A. Simplistically, the best choice of kernel function depends on the assumed smoothness of the u(t).The squared exponential kernel function is infinitely differentiable and can be used for when we expect u(t) to be smooth.On the other hand, if u(t) is non-smooth then a better choice is the uniform kernel function which is not differentiable.For closed form expressions for the functions Q λ , S λ , W λ and V λ , under both the squared exponential and uniform kernel functions, see Chkrebtii et al. (2015).
Consider the system of ODEs, given by (3), that describe the transport of serine in a human placenta.Under the squared exponential kernel function, physical parameters θ = (100, 0.05, 100, 100), initial values u 0 = (0, 1000), treatment x = (7.5, 1000), and τ containing N = 501 evenly spaced time points, Figure 1 shows 1000 draws from the probabilistic solution of u 1 (t) and u 2 (t) plotted against t ∈ [T 0 , T 1 ] = [0, 600].Note how the uncertainty in the solution increases as time, t, increases away from t = T 0 where we know, in this example, the true value of u(t).

Extending the ACE algorithm to ODE models
At various points of the ACE algorithm we are required to evaluate a Monte Carlo approximation of the expected loss function, i.e. as given by ( 6).First, this requires a sample {y i , ψ i } B i=1 to be generated from the joint distribution of y and ψ.This is accomplished by generating a value ψ i from the prior of ψ and then generating a value y i from the distribution F(ψ i , d) (see equation ( 2)).This will require B evaluations of the intractable solution of the system of ODEs, for the jth treatment and time points: t j1 , . . ., t jn j , for j = 1, . . ., M .Secondly, we need to evaluate the loss function at each value in the sample {y i , ψ i } B i=1 .Unfortunately, the most commonly used loss functions are, themselves, intractable.For example, the squared error loss function given by ( 5) depends on the posterior mean, E (θ l |y), of each element of the vector of physical parameters, θ.Overstall and Woods (2015)  is an additional sample generated from the prior distribution of ψ where ψj = θj , γj and θjl is the lth element of θj for l = 1, . . ., p. Evaluation of E (θ l |y i ), in the loss function, for i = 1, . . ., B, is now replaced by evaluation of Ê (θ l |y i ) to give a nested Monte Carlo approximation to the expected loss function.A further B evaluations of the intractable solution, u(t), will be required for each treatment and for each time point.Therefore, in total, we need 2B evaluations of u(t) for each treatment and for each time point, one for each vector of physical parameters in the samples {ψ i } B i=1 and ψj B j=1 . Thus we are required to evaluate for i = 1, . . ., 2B, j = 1, . . ., M and l = 1, . . ., n j .
The basic idea is to replace evaluation of the unknown u(t) in the ACE algorithm by a value generated from the probabilistic solution outlined in Section 3.1.However, since B will be in the order of 1000s, it will be computationally infeasible to use the full probabilistic solution where both the physical parameters, θ, and auxiliary parameters, α and λ are unknown with prior distributions.However, if we are prepared to fix the values of the auxiliary parameters, then significant computational savings can be found thus making the method feasible.
In summary, before starting the ACE algorithm, an initial phase is completed as follows.
In the above algorithm note the dependence on the initial values u 0j , for the jth treatment, i.e. the initial values are known.In some situations, the initial values will be unknown and become part of the inference problem, i.e. they are given a prior distribution which we update to a posterior distribution in light of the experimental responses.If that is the case, then we can replace all occurrences of u 0j by a value, u 0ji , generated from their prior distribution, as we do with the unknown parameters.
In each example, designs are found under the three different loss functions as described below.For each loss function, let ψj , where ψi = θi , γi , is an additional sample generated from the prior distribution of ψ.
• Squared error loss (as described in Section 2.2).
• Self-information loss (SIL) given by Both of the integrals in (8), denoted as are analytically intractable.However we can approximate them using Monte Carlo integration as follows.
We then replace evaluation of I 1 and I 2 in the self-information loss function by Î1 and Î2 , respectively.
Similar to approximating the squared error loss in Section 3.2, both of the nested Monte Carlo schemes for approximating the absolute error and self-information loss require 2B evaluations of u(t) for each treatment and time point, which we now replace by a value generated from the probabilistic solution.For the probabilistic solution, the discrete grid of points denoted by τ will be a set of N equally-spaced points where the absolute difference between any two values is denoted by h.Unless stated otherwise, for the auxiliary parameters, λ = 4h.

Compartmental model
Compartmental models are used in pharmokinetics to understand how drugs behave inside a body.The open one-compartment model with first-order absorption is described by the following system of S = 2 ODEs for t ∈ [0, 24] hours where u 1 (t) and u 2 (t) are the amounts of drug outside and inside the body respectively, D is the known initial dose and θ = (θ 1 , θ 2 , θ 2 ) are unknown parameters.
The system in ( 9) is a homogenous linear system with constant coefficients meaning it can be solved analytically as For this example, we find and compare designs under the three loss functions from Section 4.1, and under both the exact and probabilistic solutions to the ODEs.
This type of compartmental model (or variants of it) is often used to demonstrate optimal experimental design methodology (see, for example, Atkinson et al. 1993;Gotwalt et al. 2009;Ryan et al. 2014;Overstall and Woods 2015).We follow the setup of Ryan et al. (2014) and Overstall and Woods (2015) where D = 400, n = 15 and log θ l ∼ N(µ i , 0.05), independently, for l = 1, 2, 3 with (µ 1 , µ 2 , µ 3 ) = (log 0.1, log 1, log 20).The amount of drug inside the body, y i is observed at observation time t i and we assume the following statistical model independently, where σ 2 = 0.1 and τ 2 = 0.01.Note that (10) implies that G(u, θ) = u 2 .The design only involves the n observation times: t 1 , . . ., t n .An added stipulation is that the observation times have to be at least 15 minutes apart.Such constraints are straightforward to incorporate into the ACE algorithm (see Overstall and Woods 2015).
Since we know that u(t) is smooth we employ the squared exponential kernel function.The discrete grid, τ , is of size N = 501.The auxiliary parameter α is fixed at N .
For each loss function, we compare the designs found under the exact and probabilistic solutions (using ACE) to a uniform designs of n = 15 equally-spaced time points in [0, 24] hours.The top row of Figure 2 shows boxplots of twenty evaluations of the Monte Carlo approximation to the expected loss for the uniform design and the optimal design found under each of the loss functions.There is negligible difference between the designs found under the exact and probabilistic solutions and these designs are clearly superior to the uniform designs.In the bottom row of Figure 2 are the observation time points associated with the designs under comparison.The optimal designs appear to favour having a set of observation times near the peak of u 2 (t) at t ≈ 2.5 hours and then a series of observation times at the end of the observation interval.Of the points near the peak, the optimal design under self-information loss has two distinct sets whereas the designs under squared and absolute error loss have just one set of points.The FitzHugh-Nagumo equations (FitzHugh 1961 andNagumo et al. 1962) aim to describe the behaviour of spike potential in the giant axon of squid neurons.They are given by the following system of S = 2 ODEs for t ∈ [0, 20]ms where u 1 (t) is the voltage across the axon membrane, u 2 (t) is the recovery variable giving a summary of outward current and θ = (θ 1 , θ 1 , θ 3 ).
The experiment involves measuring the voltage, y i , at time t i , for i = 1, . . ., n = 21.Following Ramsay et al. (2007), the following statistical model is assumed independently, where σ ∼ U[1/2, 1].Furthermore, we assume the following prior distributions for the unknown parameters: As noted by Ramsay et al. (2007), the solution to the FitzHugh-Nagumo equations can alternate between smooth evolution and sharp changes of direction.For this reason we employ the uniform kernel.The discrete grid is of size N = 200 and the auxiliary parameter α is fixed as N .
The design consists of the n observation times: t 1 , . . ., t n .Similar to Section 4.2, we stipulate that the observation times have to be at least 0.25ms apart.We find designs under each of the loss functions in Section 4.1.In each case, we compare the optimal design to a uniform design of n equally spaced points in [0, 20]ms.Figure 3 shows boxplots of twenty evaluations of the Monte Carlo approximation to the expected loss for the uniform design and the optimal design found under each of the loss functions.In each case, there is a clear improvement to be made on using a uniform design.Also shown in Figure 3 are the four designs under comparison.Both the squared and absolute error optimal designs have a significant number of frequent observations at the beginning of the experiment.
For example these designs have around a third of their observation times before 2.5ms.On the other hand, the self-information loss and uniform designs only make 3 observations before this time.A feature of all of the optimal designs is that they make no observations between about 2.5 and 6ms.The remaining observation times appear roughly evenly spaced.The background to this plot shows 100 draws from the probabilistic solution, u 1 (t), giving the voltage at time t, for 100 values drawn from the prior distribution of θ.It appears that the initial phase of high frequency observations is to learn about the steep increase in voltage for small t.The remaining set of evenly spaced observation times is for learning when the voltage has high noise due to parameter uncertainty.

JAK-STAT mechanism
The JAK-STAT mechanism describes a set of biochemical reactions of STAT-5 transcription factors in response to binding of the Erythropoietin hormone to cell surface receptors.The system of S = 4 ODEs, for t ∈ [0, 60], is where u 01 and ω are unknown, and u 4 (t) = 0 for t ∈ [−ω, 0].The examples thus far have been initial value problems (IVPs) whereas the system above is an example of a delay initial function problem (DIFP).After gene activation in the cell nucleus, the transcription factors revert to their initial state, returning to the cytoplasm for the next activation cycle.This stage is explained by the unknown time delay denote by ω.
The function G is given by An experiment has already been conducted (Swameye et al., 2003) which consisted of two series of n = 16 measurements of the first two elements (G 1 and G 2 ) of G at a set of distinct observation times: t 1 , . . ., t n .A further observation of G 3 is made at time t = 0 and of G 4 at time t * .The following statistical model is assumed independently, for i = 1, . . ., n, where C i = diag {σ 2 1i , σ 2 2i }.See Raue et al. (2009) and Chkrebtii et al. (2015) for analyses of these data.In these papers, σ 2 1i , σ 2 2i , σ 2 3 and σ 2 4 (for i = 1, . . ., n) are experimentally determined, i.e. they are known.
The design task considered here will be to optimally find a follow-up design based on information on the parameters from the existing data.We assume the same statistical model as above and find t 1 , . . ., t n and t * .In the terminology of Sections 2 and 3, the prior distribution for θ, ω and u 01 is the posterior distribution for the existing data as per the analysis of Chkrebtii et al. (2015).Instead of the variance parameters being fixed, we assume that σ 2 1i = σ 2 1 , σ 2 2i = σ 2 2 , for all i = 1, . . ., n and These prior distributions are consistent with the experimentally determined values from the original analysis.
Note that the system of ODEs depends on the forcing function κ(t).This is unknown, but has been measured at 16 time points.Following Chkrebtii et al. (2015), we assume these measurements are without error and use a GP to give a probabilistic approximation to κ(t) at any value of t ∈ [0, 60].
The nature of the DIFP does introduce an added complexity to our implementation of the probabilistic solution.In step 2(b) of the main phase of the algorithm in Section 3.2 we are required to compute In an IVP, this is dependent on u(τ r+1 ) which is generated from a normal distribution in step 2(a).However in the DIFP, to compute f r+1 , we also need u 4 (τ r+1 − ω i ), where ω i is a value generated from the prior (posterior from original analysis) distribution of ω.If τ r+1 − ω i ≤ 0, then u 4 (τ r+1 − ω i ) = 0 by the initial conditions of the system of ODEs.For τ r+1 − ω i > 0, in the probabilistic solution of Chkrebtii et al. (2015), the conditional distribution of u 4 (τ r+1 − ω i ) can be derived and a value generated.However, this will be computationally expensive to incorporate in the implementation of the probabilistic solution described in Section 3.2.Instead, if τ r+1 − ω i > 0, then we replace u 4 (τ r+1 − ω i ) by u 4 (τ r) where r = arg min r =1,...,r+1 i.e. from the series of u 4 (τ 1 ), . . ., u 4 (τ r+1 ) generated in step 2(a), thus far, we choose the value at time τ r that is "closest" to τ r+1 − ω i .
As noted by Chkrebtii et al. (2015), the time delay can cause discontinuities in the derivative meaning we emply the uniform kernel function.The discrete grid, τ , is of size N = 500, and the auxiliary parameters are λ = 0.085 and α = 8000 which are consistent with their posterior distribution from the original analysis.
We use the extended ACE algorithm to generate designs under the three loss functions from Section 4.1 and compare these designs to the original design used in the experiment of Swameye et al. (2003).As in the previous examples, the observation times need to be at least 1 second apart.Figure 4 shows boxplots of twenty evaluations of the Monte Carlo approximation to the expected loss for the original design and the optimal design found under each of the loss functions.In each case, there is a clear improvement to be made on repeating the experiment with the same set of observation times.Also shown in Figure 4 are the four designs under comparison.Clearly the optimal designs favour having a dense set of points early in the observation window and then a smaller set of times at the end of the window.This is especially true for the designs under the squared and absolute error loss where 75% of the observation times are less than 15 seconds, compared to about 60% for self-information loss and 50% for the original design.It appears that the early observation times can learn about the peak in G 1 and the sharp decrease in G 2 at and up to 10 seconds, respectively.For the single observation time, t * , of G 4 , the optimal designs clearly favour making a very early observation.Note that t * for each of the optimal designs is between 1 and 2 seconds.
5 Application: Transport of serine across human placenta Now consider the human placenta example introduced in Sections 1.2 and 2.1.By the protocol of the experiment, the initial amounts of radioactive serine inside (u 01 ) and outside (x 1 ) are fixed as 0 and 7.5, respectively.We consider M = 2, . . ., 7 placentas and let n j = n * = 8, for all j = 1, . . ., M .We also fix t jl = t l for l = 1, . . ., n j and j = 1, . . ., M .The following statistical model is assumed for the experimental responses y jl = u 1 (t l ; θ j , x j ) + jl , for j = 1, . . ., M and l = 1, . . ., n * , where x j = (x 1 , x 2j ), jl iid ∼ N 0, σ 2 , and θ j are the physical parameters for the jth placenta.For d = 1, . . ., p, we specify a multiplicative hierarchical prior structure for θ j as where c d iid ∼ U [0, 0.05], and θ = (θ 1 , . . ., θ p ) are the population physical parameters.For the latter, we assume the following independent prior distributions for each element where Tri[a, b] denotes the symmetric triangle distribution on the interval [a, b].We assume a 1 = a 2 = a 4 = 80, b 1 = b 2 = b 4 = 120, a 2 = 0.02 and b 2 = 0.08.This reflects actual prior knowledge on the value of these parameters from past experiments.For the response variance, we assume We expect the solution to be smooth so use the squared exponential kernel function.The discrete grid, τ , is of size N = 601 and the auxiliary parameter α = 10N .Specifying a design corresponds to specifying the M = 7 experimental conditions x 21 , . . ., x 2M , and initial values u 021 , . . ., u 02M , as well as the common n * = 8 observation times t 1 , . . ., t n * .Therefore, the design space has 22 dimensions.
For each value of M , we use the ACE algorithm to find designs under the three loss functions from Section 4.1.In addition, suppose a question of interest concerns whether the reaction rates are symmetric, i.e. is θ 3 = θ 4 ?To answer this question, we define two models: m 1 (where θ 3 = θ 4 ) and m 2 (where θ 3 = θ 4 ).An appropriate loss function, termed the Model selection loss (MSL), is where m ∈ M = {m 1 , m 2 } denotes the model and m = argmax m∈M π(m|y) is the model that maximises the posterior model probability.The posterior model probability of model m is given by , where π(m) is the prior model probability of model m and π(y|m j ) = π(y|θ j , γ)π(θ j , γ)dθ j dγ (13) for j = 1, 2, with θ j being the parameters under m j .Under j = 1, 2, the integration in ( 13) is intractable but can be approximated using a Monte Carlo approximation in a similar way to I 2  under the self-information loss.In each case, as in the previous examples, the observation times need to be at least 5 seconds apart.
Figure 5 shows boxplots of twenty evaluations of the Monte Carlo approximation to the expected loss for the optimal design found under each loss function plotted against M = 2, . . ., 7. For M = 7, also shown is a boxplot of twenty evaluations of the Monte Carlo approximation for a design (see Table 2) proposed by the biologists at the Southampton Centre for Biological Sciences.For M = 2, . . ., 6, the corresponding boxplots show evaluations of the Monte Carlo approximation for twenty designs formed by randomly choosing M rows from the proposed design.As expected, the expected loss decreases with the number of placentas, M .Also, the optimal designs are clearly superior to the proposed designs.However, it should be noted that, for each loss function, the optimal design is expected to gain more information from M = 2 placentas than the proposed design is expected to do so from M = 7 placentas.
We now look at the case of M = 7 placentas in more detail.The initial concentrations of nonradioactive serine inside (u 02 ) and outside (x 2 ) each of the M = 7 placentas for the optimal designs found under each of the four loss functions are shown in Table 2. Also shown are the conditions for the design proposed by the biologists.Figure 6 show the optimal (for each loss function) and proposed observation times.Each row shows 100 draws from u 1 (t) plotted against t for 100 draws from the prior distribution of θ and γ, for each placenta and initial concentrations shown in Table 2.
Under the self-information loss, Table 2 shows that the optimal design has replicates of the same initial concentrations of non-radioactive serine.Figure 6 shows that, compared to the other designs, these initial concentrations lead to a larger maximum value of u 1 (t).Note that the between-placenta variability of u 1 (t) in Figure 6 is due to the placenta-specific physical parameters, θ j .The optimal observation times under the self-information loss occur in two clusters, just before and after the peak in u 1 (t).
The designs under the squared and absolute error loss functions appear similar.In both cases they are superior to the proposed design.The initial concentrations in Table 2 lead to three distinct profiles of u 1 (t) (Placentas 1 and 2; 3 and 4; 5, 6 and 7).The profile for placentas 1 and 2 has a slow steady increase in u 1 (t) with respect to t. Placentas 3 and 4 have a steep initial increase and subsequent decrease in u 1 (t) with respect to t.Finally, placentas 5 to 7 has a steep initial increase in u 1 (t) with respect to t followed by a slow decrease.The optimal observation times are predominantly at the beginning of the observation window.
Figure 6: Plots summarising the results from the placenta example in Section 5. Shown are 100 draws from u 1 (t) plotted against t for 100 values drawn from the the prior distribution of θ, for each of the M = 7 placentas and experimental conditions given by Table 2.
The initial concentrations of the optimal design under the model selection loss result in two distinct profiles of u 1 (t).For placentas 2, 3 and 5 to 7, u 1 (t) has a slow steady increase in u 1 (t) with respect to t. Placentas 1 and 4 have a steep initial increase and subsequent decrease in u 1 (t) with respect to t. Opposite to the other loss functions, the optimal observation times are predominantly towards the end of the observation window.

Concluding Remarks
This paper introduces an extension of the ACE algorithm for Bayesian optimal design so that the challenging problem of experimental design for ODE models can now be attempted.
The method relies on the probabilistic solution to a system of ODEs as recently proposed by Chkrebtii et al. (2015) and is demonstrated on four examples where the goal of the experiment is estimating unknown physical parameters.
One key issue not addressed is model discrepancy (Kennedy and O 'Hagan, 2001).This is a systematic mis-match between the true physical process and the solution to the ODEs.Not taking account of this bias can lead to significant bias in the posterior estimates of the physical parameters (Brynjarsdottir and O 'Hagan, 2014).Future work will focus on Bayesian optimal design for physical models including model discrepancy.
A The ACE algorithm Convergence can be assessed informally using trace plots of the evaluations of either λ * or λC at step 2(e), dependent on whether the proposed design was accepted or not, respectively.
The ACE algorithm should be started from multiple different starting designs d 0 .Out of the resulting designs, the one with the lowest value of LB (d) is returned.
Chkrebtii et al. (2015) propose several MCMC algorithms for generating a sample from the posterior distribution of the model parameters ψ = (θ, γ) and the auxiliary parameters α and λ, given existing experimental responses y.This is accomplished by embedding the probabilistic solution into standard MCMC algorithms.

Figure 2 :
Figure2: Plots summarising the results from the compartmental model in Section 4.2.The top row show boxplots of 20 evaluations of the Monte Carlo approximation to the expected loss for the uniform design and the optimal designs (for the exact and probabilistic solution) found under each of the loss functions.The bottom plot shows the three designs found under each of the loss functions and the uniform design.In the background to the plot in the bottom row is 100 draws from the exact solution, u 2 (t), giving the amount of drug at time t, for 100 values drawn from the prior distribution of θ.

Figure 3 :
Figure3: Plots summarising the results from the FitzHugh-Nagumo equations in Section 4.3.The top row show boxplots of 20 evaluations of the Monte Carlo approximation to the expected loss for the uniform design and the optimal design found under each of the loss functions.The bottom plot shows the three designs found under each of the loss functions and the uniform design.In the background to the plot in the bottom row is 100 draws from the probabilistic solution, u 1 (t), giving the voltage at time t, for 100 values drawn from the prior distribution of θ.

Figure 4 :
Figure4: Plots summarising the results from the JAK-STAT example in Section 4.4.The top row show boxplots of 20 evaluations of the Monte Carlo approximation to the expected loss for the original design and the optimal design found under each of the loss functions.The bottom plot shows the three designs found under each of the loss functions and the original design.In the background to the plot in the bottom row is 100 draws from G 1 , G 2 and G 4 , at time t, for 100 values drawn from the prior distribution of θ, u 01 and ω.

Figure 5 :
Figure5: Plots summarising the results from the placenta example in Section 5. Shown are boxplots of 20 evaluations of the Monte Carlo approximation to the expected loss for the proposed design and the optimal design found under each of the loss functions and each value of M .

Table 1 :
Physical parameters, θ for the human placenta example described in Section 1.2.

Table 2 :
Initial concentrations (to nearest integer) of non-radioactive serine inside (u 02 = u 2 (0)) and outside (x 2 ) each of the M = 7 placentas for the optimal designs found under the four loss functions and the proposed design.