Bayesian Optimal Design for Ordinary Differential Equation Models With Application in Biological Science

Abstract: Bayesian optimal design is considered for experiments where the response distribution depends on the solution to a system of nonlinear ordinary differential equations. The motivation is an experiment to estimate parameters in the equations governing the transport of amino acids through cell membranes in human placentas. Decision-theoretic Bayesian design of experiments for such nonlinear models is conceptually very attractive, allowing the formal incorporation of prior knowledge to overcome the parameter dependence of frequentist design and being less reliant on asymptotic approximations. However, the necessary approximation and maximization of the typically analytically intractable expected utility results in a computationally challenging problem. These issues are further exacerbated if the solution to the differential equations is not available in closed form. This article proposes a new combination of a probabilistic solution to the equations embedded within a Monte Carlo approximation to the expected utility with cyclic descent of a smooth approximation to find the optimal design. A novel precomputation algorithm reduces the computational burden, making the search for an optimal design feasible for bigger problems. The methods are demonstrated by finding new designs for a number of common models derived from differential equations, and by providing optimal designs for the placenta experiment. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.


Introduction
The dynamics behind a complex physical process can often be described by a set of nonlinear ordinary differential equations, where the solution to these equations represents the evolution of system states with respect to time. It is common for the system of equations to depend on some unknown physical properties (parameters) of the process in question and, potentially, on some additional controllable variables. In this article, new methods are presented for designing experiments for the estimation of statistical models built on the solution to such a system of equations; that is, choosing the most informative combinations of time points and values of the controllable (design) variables at which observations of the physical process should be made. A decision-theoretic approach is adopted, and hence the quality of a design is measured via the expectation of a utility function chosen to encapsulate the aims of the experiment.
We assume equations with s system states u(t; x, θ) = [u_1(t; x, θ), ..., u_s(t; x, θ)]^T modeled as a function of time t and v design variables with values held in the treatment vector x ∈ X ⊂ R^v. The p-vector θ ∈ Θ ⊂ R^p holds the physical parameters requiring estimation. For notational simplicity, the dependence of the system states on x and θ is usually suppressed,

[CONTACT: Antony M. Overstall, A.M.Overstall@southampton.ac.uk, Southampton Statistical Sciences Research Institute, University of Southampton, Southampton SO17 1BJ, UK. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/JASA.]
with u(t) = u(t; x, θ), unless multiple treatments or parameter vectors are being considered. We mostly find designs for initial value problems, with u(t) defined via equations

u̇(t) = f(u(t), t, x; θ) for t ∈ T = [T_0, T_1], 0 ≤ T_0 < T_1, u(T_0) = u_0, (1)

where u̇(t) is the gradient vector of u(t) with respect to time t, u_0 = (u_01, ..., u_0s)^T ∈ R^s denotes the initial conditions and, for given θ, f : R^s × T × X → R^s is a continuous function satisfying a Lipschitz condition (see Iserles 2009, p. 3). This latter assumption ensures Equation (1) has a unique solution.
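The initial value problem (1) can be sketched numerically. The following is a minimal, self-contained illustration (the solver and the one-state decay example are ours, not from the paper) of stepping du/dt = f(u, t, x; θ) forward from u(T_0) = u_0 with a classical Runge-Kutta scheme:

```python
import math

def rk4_solve(f, u0, t0, t1, n_steps, x, theta):
    """Classical fourth-order Runge-Kutta solution of du/dt = f(u, t, x; theta)
    on [t0, t1] from u(t0) = u0; returns the list of (t, u) pairs."""
    h = (t1 - t0) / n_steps
    t, u = t0, list(u0)
    path = [(t, list(u))]
    for _ in range(n_steps):
        k1 = f(u, t, x, theta)
        k2 = f([ui + 0.5 * h * ki for ui, ki in zip(u, k1)], t + 0.5 * h, x, theta)
        k3 = f([ui + 0.5 * h * ki for ui, ki in zip(u, k2)], t + 0.5 * h, x, theta)
        k4 = f([ui + h * ki for ui, ki in zip(u, k3)], t + h, x, theta)
        u = [ui + (h / 6.0) * (a + 2 * b + 2 * c + d)
             for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]
        t += h
        path.append((t, list(u)))
    return path

# One-state example with known solution u(t) = exp(-theta_1 * t).
decay = lambda u, t, x, theta: [-theta[0] * u[0]]
path = rk4_solve(decay, [1.0], 0.0, 1.0, 100, None, [2.0])
t_end, u_end = path[-1]
```

For u̇ = -θu with u(0) = 1, the numerical endpoint agrees with the exact solution exp(-θt) to high accuracy; a deterministic solver hides its discretization error in this way, whereas the probabilistic approach of Section 2 quantifies it explicitly.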
Here, the four physical parameters correspond to the maximum uptake (θ_1), the proportion of the reaction occurring through active transport (θ_2) and two reaction rates (θ_3 and θ_4). The values of these parameters are of scientific interest. See Panitchob et al. (2015) and Widdows et al. (2015) for further details of the model and experiment. To model experimental data from a physical process governed by (1), we build a statistical model linking the physical parameters to noisy observations of the system states, or functions thereof, via an assumed data-generating process dependent on the solution to the equations (see, e.g., Ramsay et al. 2007). We also assume that an experiment can be conducted where these observations are collected at various different times and, possibly, from multiple runs of the experiment with different combinations of values of the design variables. Let n denote the number of runs in the experiment, with the jth run being made for treatment x_j = (x_1j, ..., x_vj)^T and initial conditions u_0j = (u_0j1, ..., u_0js)^T, with observations being made at time points t_j = (t_j1, ..., t_jn_j)^T (j = 1, ..., n). At each time point, observations y_jl ∈ Y_jl ⊂ R^c are taken on c ≤ s different responses. Let y_j^T = (y_j1^T, ..., y_jn_j^T) be the cn_j-vector of observations from the jth run, and y = (y_1^T, ..., y_n^T)^T be the vector of observations from the whole experiment. We describe the experimental data y using a statistical model, (3), with F a specified probability distribution, γ ∈ Γ ⊂ R^q a q-vector of nuisance parameters, and d ∈ D a vector specifying the design, chosen from the space of possible designs D. The dependence of (3) on the physical parameters θ and design d is through the solution to Equation (1). The most common form of this dependence, assumed in this article, is via the expected response, E(y_jl | θ, x_j, t_jl) = g(u(t_jl), θ), with g : R^s × Θ → Y_jl an assumed function.
However, the methodology developed here is also immediately applicable to other types of dependency.
Here, we find designs for experiments where one or more of the treatments x 1 , . . . , x n , observation times t j1 , . . . , t jn j , for j = 1, . . . , n, and initial conditions u 01 , . . . , u 0n are under the experimenters' control. In practice, some of these may be fixed by the protocol of the experiment. We also find designs where the initial conditions are unknown, and included in the vector of parameters.
In the human placenta experiment, the initial quantities of nonradioactive serine interior (u_02) and exterior (x_2) to the vesicle can be varied, with the initial quantities of radioactive serine (u_01 and x_1) fixed by the experimental protocol. The c = 1 observed response, y_jl, is the amount of interior radioactive serine at time t_jl (j = 1, ..., n; l = 1, ..., n_j). A statistical model is assumed where E(y_jl | θ, x_j, t_jl) = u_1(t_jl; x_j, θ). Hence, for this experiment, g(u, θ) = u_1. The design consists of n combinations of initial quantities of exterior and interior nonradioactive serine, x_2j and u_02j, along with corresponding observation times t_j1, ..., t_jn_j; that is, d = [(x_21, u_021)^T, ..., (x_2n, u_02n)^T, t_1^T, ..., t_n^T]^T.

Previous research on optimal design for models formed as the solution of ordinary differential equations has focused on frequentist methods for models with additive normally distributed errors, with a design selected that maximizes a function of the Fisher information matrix for θ (e.g., Atkinson and Bogacka 2002; Rodríguez-Díaz and Sánchez-León 2014). The inverse of the Fisher information matrix provides an asymptotic approximation to the variance-covariance matrix for maximum likelihood estimators of θ. As is usual for nonlinear models, the information matrix depends on the value of θ, which is uncertain prior to the experiment. The most common methodology to overcome this dependence is the adoption of pseudo-Bayesian techniques, where a design is found that maximizes the expectation of the function of the information matrix with respect to a prior distribution for θ. Numerical methods are used to obtain the derivatives of the expected response with respect to θ that are necessary to obtain the information matrix.
Most commonly, the "direct method" (Valko and Vajda 1984) is employed, with an additional set of differential equations being defined that then also require numerical solution. Many developments in this area have occurred in the chemical engineering literature, labeled "model-based design of experiments"; see Franceschini and Macchietto (2008) for a review.
In contrast to the above approaches, in this article, we present and apply the first methods for decision-theoretic Bayesian optimal design for models formed from ordinary differential equations. Although straightforward in principle, Bayesian optimal design faces a number of practical difficulties. Firstly, assessment of a given design requires evaluation of an expected utility depending on high-dimensional and typically intractable integrals. Secondly, maximization of the expected utility presents a high-dimensional and stochastic optimization problem. See Ryan et al. (2016) and Woods et al. (2017) for recent reviews.
To address the high-dimensional optimization problem, we extend and apply the approximate coordinate exchange (ACE) algorithm recently proposed by Overstall and Woods (2017). A brief description of the algorithm is provided in Section 4.1 and Appendix A.
The computational burden of optimal Bayesian design is exacerbated when the model evaluations (system states) are only available as the numerical solution to the differential equations. In addition to increasing the computational expense of evaluating the expected utility, numerical solutions introduce an additional source of uncertainty through the numerical errors that result from finite discretization of the time interval T. We evaluate the expected utility by embedding within a Monte Carlo approximation scheme an adaptation of the probabilistic solution to systems of differential equations proposed by Chkrebtii et al. (2016); see Section 2. In essence, this approach accounts for uncertainty due to discretization error by placing a joint Gaussian process (GP) prior on both the system states and time derivatives, and predicts future system states by conditioning on the derivatives. In Section 3, after introducing the foundations of Bayesian design, we propose innovative precomputation of variance and covariance quantities that substantially reduces the computational burden of incorporating the probabilistic solution into a Bayesian design strategy. Our approach makes it possible to search for multivariable designs which would otherwise be computationally infeasible.
We demonstrate the effectiveness for optimal design of the combination of Monte Carlo approximation, probabilistic numerics and cyclic descent for a variety of exemplar models in Section 4. The differing complexities of the problems addressed showcase the flexibility of the methodology. In Section 5, we apply the methodology to a realistic statistical model for the human placenta example, based on the solution to (2), and compare to designs proposed by the experimenters. We find designs for the goals of parameter estimation and model selection, where the aim is to determine if a simpler model with θ 3 = θ 4 (i.e., the two reaction rates equal) is an adequate description for the data.

Probabilistic Solutions to Ordinary Differential Equations
When working with numerical models implemented via computer code, it has become standard to build statistical approximations, or emulators, by performing a computer experiment to obtain model outputs at carefully selected input combinations. Most commonly, a GP prior is assumed to describe the output from the model, with the emulator formed from the updated posterior GP (conditioned on the model output from the computer experiment); see Sacks et al. (1989) and Santner, Williams, and Notz (2003). In contrast, central to the Chkrebtii et al. (2016) methodology is the adoption of a GP prior for the hth derivative function u̇_h(·), h = 1, ..., s, defined via mean and covariance functions ṁ_0h(·) and Ċ_0(·, ·), where we assume a common covariance function for each of the s derivatives. Such a prior implies that for any finite collection of times t = (t_1, ..., t_w)^T, the joint distribution of u̇_h(t) = [u̇_h(t_1), ..., u̇_h(t_w)]^T will be multivariate normal N(ṁ_0h(t), Ċ_0(t, t)), with w-vector ṁ_0h(t) having lth entry ṁ_0h(t_l) and, for vector t' = (t'_1, ..., t'_w)^T, w × w matrix Ċ_0(t, t') having lkth entry Ċ_0(t_l, t'_k). A joint GP prior for both u̇_h(·) and the solution function u_h(·) then follows directly, implying the joint distribution with w-vector m_0h(t) having lth entry m_0h(t_l) = ∫_0^{t_l} ṁ_0h(z) dz + u_0h, w × w matrix C_0(t, t') having lkth entry C_0(t_l, t'_k) = ∫_0^{t_l} ∫_0^{t'_k} Ċ_0(z, z') dz' dz, and w × w cross-covariance matrix C̄_0(t, t') having lkth entry C̄_0(t_l, t'_k) = ∫_0^{t'_k} Ċ_0(t_l, z) dz; see also Solak et al. (2003) and Holsclaw et al. (2013). Hence, the solution vector u_h(t) = [u_h(t_1), ..., u_h(t_w)]^T follows the multivariate normal distribution N(m_0h(t), C_0(t, t)). Note that definition of the covariance function of u_h(t) via integration ensures C_0(0, 0) = 0 and hence enforces the boundary condition u_h(0) = u_0h.
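As a small numerical check of this construction (with an assumed squared exponential form for Ċ_0; the paper's choices appear in Section 4), the implied solution covariance C_0(t, t') = ∫_0^t ∫_0^{t'} Ċ_0(z, z') dz' dz can be approximated by quadrature, confirming it is symmetric and vanishes when either argument is zero:

```python
import math

def dC0(t, tp, alpha=1.0, lam=0.5):
    # Assumed squared exponential prior covariance for the derivative process.
    return alpha * math.exp(-(t - tp) ** 2 / (2.0 * lam ** 2))

def C0(t, tp, n=200):
    # C0(t, t') = integral_0^t integral_0^{t'} dC0(z, z') dz' dz,
    # approximated by the trapezoidal rule on an n x n grid.
    if t == 0.0 or tp == 0.0:
        return 0.0  # integration from 0 enforces the boundary condition
    hz, hw = t / n, tp / n
    total = 0.0
    for i in range(n + 1):
        wi = 0.5 if i in (0, n) else 1.0
        for j in range(n + 1):
            wj = 0.5 if j in (0, n) else 1.0
            total += wi * wj * dC0(i * hz, j * hw)
    return total * hz * hw
```

The check that C_0(0, 0) = 0 mirrors the text: because the covariance of the solution is built by integrating the derivative covariance from time zero, the GP prior assigns zero variance to u_h(0), pinning the solution to the initial condition.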
Algorithm 1: Sequential updating and sampling for time points t = (t 1 , . . . , t w ) T of the joint Gaussian process for the derivative and solution for the s system states for initial values u 0 , treatment vector x, physical parameters θ and evaluation grid τ = (τ 1 , . . . , τ N ) T , with τ 1 = T 0 . (Adapted from Chkrebtii et al. 2016).
For a given x and θ, this prior distribution can be updated using derivative evaluations on a grid τ = (τ_1, ..., τ_N)^T of time points via Algorithm 1, by sequentially conditioning on f(u, τ_{r+1}, x; θ) calculated for solution state u_h sampled from the posterior distribution at point τ_r. The final marginal GP for u_h(t) has mean and covariance functions defined, for h = 1, ..., s, in terms of e_h, the hth unit vector, the N × s matrix of derivative evaluations F_N, and the updated N × N derivative covariance matrix B_N, as in Algorithm 1. Chkrebtii et al. (2016) allowed the covariance function Ċ_0(t, t') to depend on hyperparameters controlling the scale and length of the covariances. Given experimental data, a joint posterior distribution for the model parameters and hyperparameters can be sampled by embedding the probabilistic solution to the differential equations within a Markov chain Monte Carlo scheme. Chkrebtii et al. (2016) also suggested possible fixed values for the covariance hyperparameters. In Section 3.2, we demonstrate the computational savings that can be achieved for optimal design via precomputation of various posterior quantities when these parameters are fixed. Figure 1 presents 1000 draws from probabilistic solutions for the placenta example following Equations (2). Updated GPs for u_1(t) and u_2(t) were generated using Algorithm 1 assuming a squared exponential covariance function for Ċ_0(t, t') (see Rasmussen and Williams 2006, p. 83). An evaluation grid τ with N = 501 evenly spaced time points was used, and the solution sampled for time t ∈ [T_0, T_1] = [0, 600] sec with physical parameters θ = (200, 0.05, 100, 100)^T, initial values u_0 = (0, 1000)^T and treatment x = (7.5, 1000)^T. Note how the uncertainty in the solution increases as t increases away from t = T_0 = 0 where, in this example, we know the true value of u(t).
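The behavior seen in Figure 1 can be caricatured as follows. This sketch is not Algorithm 1 (which conditions a joint GP); it simply draws an ensemble of solution paths in which each step's derivative evaluation is Gaussian-perturbed, so that uncertainty accumulates as t moves away from T_0. The linear-growth example and all numerical values are illustrative assumptions:

```python
import random, statistics

def probabilistic_euler_draw(f, u0, grid, theta, sd_scale=0.05, rng=random):
    """One draw of a single-state solution path: an Euler step whose
    derivative evaluation is Gaussian-perturbed, standing in for sampling
    from a sequentially updated GP posterior."""
    u, path = u0, [u0]
    for r in range(len(grid) - 1):
        h = grid[r + 1] - grid[r]
        du = f(u, grid[r], theta)
        u = u + h * (du + rng.gauss(0.0, sd_scale * (abs(du) + 1e-9)))
        path.append(u)
    return path

rng = random.Random(1)
grid = [i * 600.0 / 500 for i in range(501)]       # N = 501 points on [0, 600]
growth = lambda u, t, th: th[0] * u                # illustrative linear growth
draws = [probabilistic_euler_draw(growth, 100.0, grid, [0.002], rng=rng)
         for _ in range(200)]
sd_early = statistics.pstdev(d[1] for d in draws)  # spread after one step
sd_late = statistics.pstdev(d[-1] for d in draws)  # spread at t = 600
```

The ensemble spread grows with t, matching the qualitative behavior of the probabilistic solution draws in Figure 1: the solution is known exactly at T_0 and increasingly uncertain thereafter.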

Decision-Theoretic Bayesian Optimal Design
Design of experiments fits naturally within a Bayesian framework, with the decision on which design d to employ made before the data are collected. Hence, it is natural to use available prior information to inform this choice. This information includes the form of statistical model (3), including any underpinning physical theory, for example, as encapsulated in equations such as (1). It also includes any prior information on the values of the parameters θ, γ, captured via a prior density π(θ, γ), which we assume is independent of the design. A decision-theoretic Bayesian optimal design, d*, maximizes the expectation of a specified utility function φ(θ, y, d) with respect to the unknowns prior to experimentation, where the joint distribution of the unknown physical parameters and responses, conditional on the design used for data collection, can be decomposed as the product of the prior density and the model density and hence, when regarded as a function of θ alone, is proportional to the posterior density. See the seminal review paper by Chaloner and Verdinelli (1995). The function φ(θ, y, d) quantifies the utility, relative to the aims of the experiment, of choosing design d when we obtain data y under physical parameters θ. Its choice should reflect the goals of the experiment. Here, we apply the following exemplar utility functions: the negative squared error loss (NSEL), φ(θ, y, d) = -‖θ - E(θ|y, d)‖_2^2, and the negative absolute error loss (NAEL), φ(θ, y, d) = -‖θ - E(θ|y, d)‖_1, with ‖·‖_p denoting the l_p-norm and E(θ|y, d) the posterior mean, where expectation is taken with respect to the marginal density π(θ|y, d) = ∫ π(θ, γ|y, d) dγ. It can be shown that, for NSEL, the expected utility simplifies to the negative expected trace of the posterior variance-covariance matrix of θ.

Figure 1. Plots showing 1000 draws from the probabilistic solution of u_1(t) and u_2(t) against t for the system of Equations (2) that describe the transport of serine in a human placenta.

Shannon information gain (SIG):

φ(θ, y, d) = log π(θ|y, d) - log π(θ), (4)

where π(θ) is the marginal prior density of θ. Maximizing the expectation of (4) is equivalent to maximizing the expected Kullback-Leibler divergence between the prior and posterior distributions (Chaloner and Verdinelli 1995).
For the human placenta example, we also employ two bespoke utility functions tailored to the problems of point estimation and model selection.
with 1_Δ the indicator function for the product set Δ = ∏_{i=1}^p [E(θ_i|y, d) - δ_i, E(θ_i|y, d) + δ_i]. That is, the utility is equal to 1 if, for all i = 1, ..., p, the ith element of the posterior mean vector E(θ|y, d) lies within δ_i of the corresponding element of θ.
For the final utility function considered we redefine the utility as a function of the chosen model m ∈ M.
where 1_m is the indicator function for the singleton set with element m, and m* ∈ arg max_{m∈M} π(m|y) is the model with maximum posterior probability. For this utility, the expected utility is given by an expectation over models and their parameters, with θ^(m) ∈ Θ^(m) and γ^(m) ∈ Γ^(m) the physical and nuisance parameters, respectively, for model m.
A barrier to the application of Bayesian design for most nonlinear models, including those considered in this article, is the analytic intractability of both the utility function (which typically depends on posterior quantities) and the expected utility. Numerical methods are therefore required, with a double-loop Monte Carlo approximation being commonly employed (Ryan 2003). Such an approach uses an "inner" Monte Carlo sample of size B̃ to approximate any necessary posterior quantities, and then an "outer" Monte Carlo sample of size B to approximate the expected utility with respect to the joint distribution of y and θ; see also Overstall and Woods (2017).
We use the approximation with {θ_b, y_b}_{b=1}^B a first (outer) sample from the joint distribution of the physical parameters and response, and φ̃(θ, y, d) a further Monte Carlo approximation to the utility function, computed from an inner sample drawn from the prior distribution (under model m, with density π(θ^(m), γ^(m)|m), for the model selection utility).
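The double-loop scheme can be sketched for a toy conjugate model (our example, not the paper's): θ ∼ N(0, 1), y | θ ∼ N(θd, 1), with the NSEL utility and the inner posterior mean estimated by importance sampling from the prior:

```python
import math, random

def nsel_double_loop(d, B=200, B_tilde=200, rng=random):
    """Double-loop Monte Carlo approximation of the expected NSEL utility for
    a toy model: theta ~ N(0, 1), y | theta ~ N(theta * d, 1). The inner loop
    estimates E(theta | y, d) by importance sampling from the prior."""
    total = 0.0
    for _ in range(B):
        theta = rng.gauss(0.0, 1.0)            # outer draw: parameter
        y = rng.gauss(theta * d, 1.0)          # outer draw: response
        inner = [rng.gauss(0.0, 1.0) for _ in range(B_tilde)]
        w = [math.exp(-0.5 * (y - t * d) ** 2) for t in inner]  # likelihood weights
        post_mean = sum(wi * ti for wi, ti in zip(w, inner)) / sum(w)
        total += -(theta - post_mean) ** 2     # NSEL utility draw
    return total / B

rng = random.Random(2)
u_small = nsel_double_loop(0.1, rng=rng)   # weakly informative design
u_large = nsel_double_loop(3.0, rng=rng)   # more informative design
```

Here larger |d| gives a more informative experiment (the true posterior variance is 1/(1 + d²)), so the approximated expected utility is larger (less negative) for d = 3 than for d = 0.1.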
The above Monte Carlo approximations φ̃ to the utility functions will introduce some bias into the approximation of the expected utility, as the utilities are nonlinear functions of posterior quantities. In general, this bias will be inversely proportional to the value of B̃, and hence can be made negligible for large inner samples.

Extensions to Ordinary Differential Equation Models
To apply the methodology outlined in the previous section to models built from systems of ordinary differential equations requires the incorporation of further steps to account for discretization errors in the numerical solution to the equations, and to mitigate the additional computational cost of multiple evaluations of the numerical solution. The approximations to the expected utilities require repeated sampling of y from distribution (3), and evaluation of the corresponding density function π(y|θ, γ, d). When the distribution of y depends on the solution vector, the approximations require at least B + B̃ evaluations of a numerical solution to u(t_jl; x_j, θ) for each j = 1, ..., n and l = 1, ..., n_j. In addition to the computational cost of these repeated evaluations, the necessary discretization of the time domain by a numerical solver introduces an additional source of uncertainty that should be accounted for in both the design of the experiment and the subsequent inference.
The probabilistic solution of Chkrebtii et al. (2016), outlined in Section 2, fits naturally within a Monte Carlo approximation of the expected utility; for each generated value of the physical parameters θ, a solution path for u(t) is generated from an updated GP. The uncertainty introduced by the discretization of time is quantified, and updated, via the joint GP prior for the time derivatives and solution. Algorithm 2 outlines the steps in generating an approximation to a general utility function φ using double-loop Monte Carlo. As given, Algorithms 1 and 2 depend on the initial values u_0j for the jth treatment; that is, the initial values are assumed known. In some situations, learning unknown initial values may be part of the inference problem; that is, prior distributions are assumed and updated to a posterior distribution in light of the experimental responses. This case can be incorporated into these algorithms by replacing all occurrences of u_0j with a value u_0jb generated from the prior distribution in Algorithm 2, analogously to how the physical parameters θ are handled.
Naive implementation of Algorithm 2 for approximating the expected utility presents a considerable computational challenge, with the matrix computations in steps 2(b) and 3 of Algorithm 1 being undertaken ñ(B + B̃) times, where ñ = ∑_{j=1}^n n_j. In particular, calculation of the matrix B_N requires inversion of an N × N matrix. This leads to an algorithm with computational complexity O(ñN³(B + B̃)).

Algorithm 2: Evaluation of the approximate expected utility Φ̂(d) when the distribution of the response depends on the solution to an ordinary differential equation. (Key step: sample u_s(t_jl; x_j, θ̃_b) using Algorithm 1.)
To reduce the computational cost of the algorithm, we can compromise on the choice of covariance function Ċ_0(t, t'). Rather than tune the covariance through the selection of different parameter values for each choice of x and θ, we can fix these parameters (e.g., following recommendations in Chkrebtii et al. 2016; see Section 4 for our choices). This allows precomputation of various covariance matrices and vectors; see Algorithm 3. Such precomputation alleviates the need to invert B_N when sampling u(t), reducing the computational complexity of the approximation to O(N³ + ñN²(B + B̃)).
In fact, this precomputation can be performed just once, prior to any optimization routine being called. Hence, for large experiments and Monte Carlo sample sizes, the computational cost of the precomputation is essentially fixed, and the complexity of the approximation within the optimization becomes O(ñN²(B + B̃)). This computational saving makes the optimization feasible for experiment sizes, evaluation grids and Monte Carlo sample sizes for which designs could not otherwise be found.
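The idea can be illustrated with a scalar caricature (Algorithm 3 works with full covariance matrices and vectors; here a single "gain" per grid step stands in, and all names are ours): with fixed covariance hyperparameters, the conditioning quantities depend only on the evaluation grid, not on θ or x, so they can be computed once and reused across all B + B̃ Monte Carlo draws:

```python
import math

def sq_exp(t, tp, alpha=1.0, lam=0.1):
    # Assumed squared exponential covariance for the derivative process.
    return alpha * math.exp(-(t - tp) ** 2 / (2.0 * lam ** 2))

def precompute_gains(grid):
    """Grid-dependent, (theta, x)-independent conditioning quantities: one
    scalar 'gain' per step, standing in for the matrices and vectors of
    Algorithm 3. Computed once, then reused for every Monte Carlo draw."""
    gains = []
    for r in range(len(grid) - 1):
        prior_var = sq_exp(grid[r], grid[r])   # Cdot_0(tau_r, tau_r)
        cross = sq_exp(grid[r], grid[r + 1])   # Cdot_0(tau_r, tau_{r+1})
        gains.append(cross / prior_var)
    return gains

grid = [i / 100.0 for i in range(101)]
gains = precompute_gains(grid)  # done once, outside the (B + B_tilde)-loop
```

Moving this work outside the Monte Carlo loop is exactly what shifts the per-draw cost from cubic to quadratic in N in the full matrix version.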

Preliminaries
In this section we demonstrate the Bayesian design methodology for three common examples of models formed from the solution of ordinary differential equations: 1. a compartmental model (Section 4.2); 2. a model formed from the FitzHugh-Nagumo equations (Section 4.3); 3. a model of the JAK-STAT mechanism (Section 4.4).
Algorithm 3: Precomputation of variances C_r, Ċ_{r+1}, B_r and covariances a_r for evaluation grid τ = (τ_1, ..., τ_N)^T.

For each, we use the methodology in Section 3.2 to approximate expected utilities for parameter estimation. Bayesian optimal (or near-optimal) designs are found by embedding these Monte Carlo approximations within the ACE algorithm (Overstall and Woods 2017). The ACE algorithm is a cyclic descent, or coordinate exchange, algorithm (see Meyer and Nachtsheim 1995; Lange 2013, p. 171) that performs a sequence of conditional maximizations for each element (coordinate) of d in turn, keeping all other elements fixed. Each of these one-dimensional maximizations is performed by constructing a GP smoother, or emulator, for the Monte Carlo approximation as a function of the coordinate. Use of an emulator alleviates both the computational burden and the lack of smoothness associated with the Monte Carlo approximations. This algorithm extends the optimal design via curve fitting methods originally presented by Müller and Parmigiani (1996) to high-dimensional design problems. The ACE algorithm is outlined in Appendix A and implemented in the acebayes R package (Overstall, Woods, and Adamou 2018a, 2018b), available on CRAN.
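One ACE conditional maximization can be sketched as follows, dependency-free. ACE fits a GP emulator to noisy Monte Carlo evaluations; here, as a simplifying assumption, a quadratic least-squares fit plays the role of the smoother, and the noisy objective is a synthetic curve with a known peak (all names are ours):

```python
import random

def noisy_utility(t, rng):
    # Stand-in for one Monte Carlo evaluation of the expected utility at a
    # single design coordinate t: smooth in t with peak at t = 0.6, plus noise.
    return -(t - 0.6) ** 2 + rng.gauss(0.0, 0.01)

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def coordinate_step(eval_fn, lo, hi, n_cand=20, rng=random):
    """One ACE-style conditional maximization of a single coordinate of d:
    evaluate the noisy objective on a candidate grid, smooth it (a quadratic
    least-squares fit stands in for ACE's GP emulator), return the maximizer."""
    cand = [lo + (hi - lo) * i / (n_cand - 1) for i in range(n_cand)]
    vals = [eval_fn(t, rng) for t in cand]
    n = float(n_cand)
    Sx = sum(cand); Sx2 = sum(t * t for t in cand)
    Sx3 = sum(t ** 3 for t in cand); Sx4 = sum(t ** 4 for t in cand)
    Sy = sum(vals); Sxy = sum(t * v for t, v in zip(cand, vals))
    Sx2y = sum(t * t * v for t, v in zip(cand, vals))
    A = [[n, Sx, Sx2], [Sx, Sx2, Sx3], [Sx2, Sx3, Sx4]]
    rhs = [Sy, Sxy, Sx2y]
    D = det3(A)
    coef = []
    for j in range(3):            # Cramer's rule for the normal equations
        M = [row[:] for row in A]
        for i in range(3):
            M[i][j] = rhs[i]
        coef.append(det3(M) / D)
    a, b, c = coef
    if c >= 0.0:                  # no interior maximum; take best observed point
        return max(zip(vals, cand))[1]
    return min(max(-b / (2.0 * c), lo), hi)

rng = random.Random(3)
t_star = coordinate_step(noisy_utility, 0.0, 1.0, rng=rng)
```

Smoothing before maximizing is the key design choice: maximizing the raw noisy evaluations directly would chase Monte Carlo noise, whereas the fitted curve recovers the underlying peak.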
To employ the probabilistic solution to the ordinary differential equations, a choice of covariance function is required for the GP prior on the derivative functions. The choice of covariance function should be determined by the assumed smoothness of the solutions u_h(t). Chkrebtii et al. (2016) suggested two covariance functions: the squared exponential covariance (8), which is infinitely differentiable and hence suitable for smooth solutions, and the piecewise linear uniform covariance (9), where α, λ > 0. This latter function is nondifferentiable and hence suited to nonsmooth solutions. We employ these two functions, with fixed values of α and λ to facilitate the precomputation outlined in Section 3.2. Throughout, we assume the probabilistic solution is calculated on a grid τ = (τ_1, ..., τ_N)^T of equally spaced points and, unless otherwise stated, set α = N and λ = 4(τ_N - τ_1)/N. The supplementary materials contain an R package called aceodes and a vignette. The vignette describes how aceodes can be used to reproduce the designs found in the remainder of this section and in Section 5.
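The two covariance choices can be sketched as follows; the exact parameterizations in the paper's Equations (8) and (9) may differ, so these forms (a standard squared exponential, and the triangular overlap of two width-2λ uniform kernels) are assumptions:

```python
import math

def sq_exp_cov(t, tp, alpha, lam):
    # Squared exponential: infinitely differentiable, for smooth solutions.
    return alpha * math.exp(-(t - tp) ** 2 / (2.0 * lam ** 2))

def uniform_cov(t, tp, alpha, lam):
    # Piecewise linear "uniform" covariance (overlap of two width-2*lam
    # uniform kernels): nondifferentiable at t = tp, for nonsmooth solutions.
    return alpha * max(0.0, 2.0 * lam - abs(t - tp))

# Fixed hyperparameters as recommended in this section, for an equally
# spaced grid of N points on [t0, t1]: alpha = N, lam = 4 * (t1 - t0) / N.
N, t0, t1 = 501, 0.0, 600.0
alpha, lam = float(N), 4.0 * (t1 - t0) / N
```

Fixing α and λ in this way is what enables the one-off precomputation of Section 3.2, at the cost of not adapting the covariance to each draw of θ and x.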

Compartmental Model
In pharmacokinetic studies, compartmental models are used to describe the distribution of a drug inside a living body. Such models have been routinely used to demonstrate optimal experimental design methodology (see, e.g., Atkinson et al. 1993; Ryan et al. 2014; Overstall and Woods 2017). To compare designs found using the probabilistic solution to designs found using an exact solution, we use a simple example where an analytical solution to the differential equations is available. An open one-compartment model with first-order absorption is considered, described by the following system of s = 2 ordinary differential equations for t ∈ [0, 24] hr:

u̇_1(t) = -θ_2 u_1(t), u̇_2(t) = (θ_2/θ_3) u_1(t) - θ_1 u_2(t),

with u_1(0) = D and u_2(0) = 0, where u_1(t) and u_2(t) are, respectively, the amounts of drug outside and inside the body, D is the known initial dose, and θ = (θ_1, θ_2, θ_3)^T are unknown parameters.
These equations define a homogeneous linear system with constant coefficients, resulting in the analytical solution

u_2(t) = [Dθ_2 / (θ_3(θ_2 - θ_1))] [exp(-θ_1 t) - exp(-θ_2 t)].

Following Ryan et al. (2014), we assume D = 400 and log θ_i ∼ N(μ_i, 0.05), independently, for i = 1, 2, 3, with (μ_1, μ_2, μ_3)^T = (log 0.1, log 1, log 20)^T. The amount of drug inside the body, y_l, is observed at observation time t_l, and is modeled by assuming y_l ∼ N(u_2(t_l), σ² + τ²u_2(t_l)²), independently, where σ² = 0.1 and τ² = 0.01. The choice of design here only involves selecting n = 15 observation times t_1, ..., t_n. We impose the practically realistic constraint that the observation times must be at least 15 min apart. Such a constraint is straightforward to incorporate into the ACE algorithm (see Overstall and Woods 2017). When applying the probabilistic solution, we assume the squared exponential covariance (8), as the functions u(t) are known to be smooth, and a discrete evaluation grid, τ, with N = 501.
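This example can be sketched under the stated assumptions (the analytic u_2(t) below is the standard one-compartment mean response used in, e.g., Ryan et al. 2014, assumed to match the paper's solution; the helper names are ours):

```python
import math, random

def u2(t, theta, D=400.0):
    """Analytic amount of drug in the body for the one-compartment model with
    first-order absorption (standard form; assumed to match the paper's)."""
    th1, th2, th3 = theta
    return D * th2 / (th3 * (th2 - th1)) * (math.exp(-th1 * t) - math.exp(-th2 * t))

def simulate_obs(times, theta, sigma2=0.1, tau2=0.01, rng=random):
    # y_l ~ N(u2(t_l), sigma^2 + tau^2 * u2(t_l)^2), independently.
    return [rng.gauss(u2(t, theta),
                      math.sqrt(sigma2 + tau2 * u2(t, theta) ** 2))
            for t in times]

theta = (0.1, 1.0, 20.0)  # prior means exp(mu_i) = (0.1, 1, 20)
# Locate the peak of the mean response on a 0.1 hr grid over (0, 24] hr.
peak_t = max(range(1, 241), key=lambda i: u2(i / 10.0, theta)) / 10.0
obs = simulate_obs([1.6 * l for l in range(15)], theta, rng=random.Random(4))
```

With θ at its prior means, the mean response peaks near t = ln(θ_2/θ_1)/(θ_2 - θ_1) ≈ 2.56 hr, consistent with the optimal designs below clustering observation times around t ≈ 2.5 hr.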
For each of the NSEL, NAEL, and SIG utility functions from Section 3.1, we compare designs found under the exact and probabilistic solutions using ACE to a uniform design with n = 15 equally spaced time points in [0, 24] hr. Figure 2 presents boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the uniform design and the optimal design found for each utility. There is negligible difference between the designs found under the exact and probabilistic solutions, and these designs are clearly superior to the uniform design. Figure 2 also gives the observation time points from each design being compared. The optimal designs appear to favor observation times near the peak of u 2 (t), at t ≈ 2.5 hr, and then a series of observation times toward the end of the time interval. The optimal design under SIG has two distinct sets of points just before and after the maximum of u 2 (t), whereas the designs under NSEL and NAEL have just one set of points, generally occurring just after the peak response.

FitzHugh-Nagumo Equations
The FitzHugh-Nagumo equations (FitzHugh 1961; Nagumo, Arimoto, and Yoshizawa 1962) describe the behavior of spike potential in the giant axon of squid neurons:

u̇_1(t) = θ_3 [u_1(t) - u_1(t)³/3 + u_2(t)], u̇_2(t) = -(1/θ_3) [u_1(t) - θ_1 + θ_2 u_2(t)],

where u_1(t) is the voltage across the axon membrane, u_2(t) is the recovery variable giving a summary of outward current, θ = (θ_1, θ_2, θ_3)^T, and t ∈ [0, 20] msec. These equations cannot be solved analytically.
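The equations must therefore be solved numerically; a sketch using a standard fourth-order Runge-Kutta step is below. The parameterization and the initial values (-1, 1) follow the common form in Ramsay et al. (2007) and are assumptions about the paper's exact setup:

```python
def fhn(u, t, theta):
    # FitzHugh-Nagumo right-hand side: v is voltage, r the recovery variable.
    v, r = u
    th1, th2, th3 = theta
    dv = th3 * (v - v ** 3 / 3.0 + r)
    dr = -(v - th1 + th2 * r) / th3
    return [dv, dr]

def rk4(f, u0, t0, t1, n, theta):
    # Classical fourth-order Runge-Kutta integration on a fixed grid.
    h = (t1 - t0) / n
    u, t = list(u0), t0
    out = [list(u)]
    for _ in range(n):
        k1 = f(u, t, theta)
        k2 = f([ui + 0.5 * h * k for ui, k in zip(u, k1)], t + 0.5 * h, theta)
        k3 = f([ui + 0.5 * h * k for ui, k in zip(u, k2)], t + 0.5 * h, theta)
        k4 = f([ui + h * k for ui, k in zip(u, k3)], t + h, theta)
        u = [ui + h / 6.0 * (a + 2 * b + 2 * c + d)
             for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]
        t += h
        out.append(list(u))
    return out

path = rk4(fhn, [-1.0, 1.0], 0.0, 20.0, 2000, (0.2, 0.2, 3.0))
v_traj = [u[0] for u in path]
v_min, v_max = min(v_traj), max(v_traj)
```

With (θ_1, θ_2, θ_3) = (0.2, 0.2, 3), the voltage trajectory alternates between smooth plateaus and sharp transitions over [0, 20] msec, which is why the nondifferentiable uniform covariance (9) is the appropriate choice here.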
As noted by Ramsay et al. (2007), the solution to the FitzHugh-Nagumo equations can alternate between smooth evolution and sharp changes of direction. Hence, we employ uniform covariance (9) for the probabilistic solution. The evaluation grid has size N = 200.
The design consists of the n = 21 observation times t_1, ..., t_n. Similarly to Section 4.2, we stipulate that the observation times must be at least 0.25 msec apart, and find designs under the NSEL, NAEL, and SIG utility functions. We compare these optimal designs to a uniform design with n equally spaced points in [0, 20] msec. Figure 3 presents boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the uniform design and the optimal designs found via ACE under each utility function. In each case, there is a clear improvement to be made over using the uniform design. Also shown in Figure 3 are the four designs under comparison, along with realizations drawn from the solution u_1(t). Both the NSEL and NAEL optimal designs have a substantial number of observations near the beginning of the experiment.

Figure 3. Results from the FitzHugh-Nagumo equations in Section 4.3. Top row: boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the uniform design and the optimal designs found under three different utility functions. Bottom plot: design points from each of the optimal designs and the uniform design, along with 100 draws from the probabilistic solution, u_1(t), giving the voltage at time t, for values drawn from the prior distribution of θ.

Both
these designs have around one-third of their observation times before 2.5 msec; the SIG and uniform designs only make three observations before this time. A feature of all of the optimal designs is that they make no observations between about 2.5 and 6 msec, where the voltage is expected to decrease rapidly. The remaining observation times are close to being evenly spaced. The initial phase of high-frequency observations provides information about the steep increase in voltage for small t. The remaining observation times aid efficient parameter estimation, occurring within an interval in which different parameter values can produce very different model solutions.

JAK-STAT Mechanism

Chkrebtii et al. (2016), and authors referenced therein, considered Bayesian inference for the JAK-STAT mechanism. A system of s = 4 equations describes changes in the biochemical reaction states of STAT-5 transcription factors that occur in response to binding of the Erythropoietin hormone to cell surface receptors (Pellegrini and Dusanter-Fourt 1997), with u_01 ≥ 0 unknown and κ(t) an unknown forcing function. The transcription states return to the initial state after gene activation in the cell nucleus, modeled via an unknown time delay ω ≥ 0. This system is an example of a delay initial function problem. Swameye et al. (2003) conducted an experiment that made measurements on a nonlinear transformation of the states, given by functions g_1, ..., g_4. The experiment made n = 16 (noisy) observations on g_1 and g_2 at times t_1, ..., t_16, and one observation on each of g_3 and g_4 at t = 0 and t = t̃, respectively. The design (choice of time points) used in the experiment reported by Swameye et al. (2003) is given in Figure 4. A statistical model is assumed, independently for l = 1, ..., n, with A_l = diag(σ²_1l, σ²_2l). We design a follow-up experiment using information from this previous study, and choose values of t_1, ..., t_n and t̃ to maximize different expected utilities assuming, for simplicity, that a single observation of y_3 will also be made at t = 0 (as in the original experiment). We use the posterior distributions from Chkrebtii et al. (2016) as priors for θ, ω and u_01. These authors assumed the variance parameters were fixed. Instead, we assume σ²_1l = σ²_1 and σ²_2l = σ²_2 for all l = 1, ..., n, with σ_1, σ_2 ∼ Uniform[0, 0.1], σ_3 ∼ Uniform[0, 20], and σ_4 ∼ Uniform[0, 0.1]. These prior distributions are consistent with the experimentally determined values used for previous analyses (see Raue et al. 2009). The forcing function κ(t) is assumed unknown but has been measured at 16 time points. We follow Chkrebtii et al. (2016) and assume these measurements are made without error, interpolating with a GP to allow a probabilistic prediction of κ(t) for any t ∈ [0, 60].
The nature of the delay initial function problem introduces an added complexity to our implementation of the probabilistic solution. At the end of step 2 of Algorithm 1, we compute the delayed state u_4(τ_{r+1} − ω_b), where ω_b is a value generated from the prior distribution of ω. If τ_{r+1} − ω_b ≤ 0, then u_4(τ_{r+1} − ω_b) = 0, as specified by the initial conditions of the system of equations. For τ_{r+1} − ω_b > 0, the delayed state is generated from the conditional distribution of the probabilistic solution at that time point.

Figure 4. Results from the JAK-STAT example. Top row: boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the original design and the optimal designs found under three different utility functions. Bottom row: design points from each of the optimal designs and the original design at which noisy observations of g_1 (left), g_2 (center), and g_4 (right) are made, along with 100 draws from g_1, g_2, and g_4 at time t, for values drawn from the prior distributions of θ, u_01, and ω.
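The branching on the sign of the delayed time can be sketched as follows; the linear interpolation of a stored trajectory is our stand-in for a draw from the conditional distribution of the probabilistic solution, and the trajectory itself is illustrative.

```python
import numpy as np

def delayed_state(tau_next, omega_b, grid, u4_draw):
    """Evaluate u4 at the delayed time tau_next - omega_b.

    For a nonpositive delayed time, the initial condition u4 = 0 applies;
    otherwise we read off the sampled trajectory (linear interpolation on
    the evaluation grid stands in for a draw from the conditional
    distribution of the probabilistic solution).
    """
    t_delay = tau_next - omega_b
    if t_delay <= 0.0:
        return 0.0
    return float(np.interp(t_delay, grid, u4_draw))

grid = np.linspace(0.0, 60.0, 500)      # evaluation grid of size N = 500
u4_draw = np.sin(grid / 10.0) ** 2      # stand-in sampled trajectory
```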
We employ the uniform covariance (9) because the time delay can cause discontinuities in the derivative of the solution, as noted by Chkrebtii et al. (2016). The evaluation grid, τ, has size N = 500, and the auxiliary parameters are set to λ = 0.085 and α = 8000, consistent with the posterior distribution from the original analysis.
We use the methodology from Section 3.2 and the ACE algorithm to find designs that maximize each of the NSEL, NAEL, and SIG expected utilities. We compare these designs to the original design used by Swameye et al. (2003). As in the previous examples, we introduce the constraint that the observation times must be at least 1 sec apart, a requirement also satisfied by the original experiment. Figure 4 presents boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the original design and the optimal designs found under each utility function. Once again, in each case, the optimal designs are considerably more efficient. Also shown in Figure 4 are the four designs under comparison. The optimal designs favor a dense set of points early in the observation window, followed by a smaller set of times near the end of the experiment. This is especially true for the designs under NSEL and NAEL, where 75% of the observation times occur before t = 15 sec, compared to about 60% for the SIG design and 50% for the original design. Early observation times provide information about the peak in g_1 and the sharp decrease in g_2 at about 10 sec. For the single observation time, t*, on g_4, the optimal designs clearly favor making a very early observation: t* for each of the optimal designs is between 1 and 2 sec.
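The repeated Monte Carlo evaluations summarized in the boxplots can be illustrated generically: draw parameters and data from the prior predictive distribution, evaluate the utility, and average. The linear toy model, noise level, uniform prior, and importance-sampling posterior mean below are all assumptions for illustration; for the models in this article the utilities are of course evaluated via the probabilistic ODE solution.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, design):
    """Toy observation model: y = theta * t + noise at the design times."""
    return theta * design + rng.normal(0.0, 0.5, size=design.shape)

def posterior_mean(y, design, theta_grid):
    """Importance-sampling approximation to E(theta | y) under a
    Uniform[0, 2] prior, using a fixed set of prior draws."""
    resid = y[None, :] - theta_grid[:, None] * design[None, :]
    loglik = -0.5 * (resid ** 2 / 0.25).sum(axis=1)
    w = np.exp(loglik - loglik.max())
    return (w * theta_grid).sum() / w.sum()

def expected_nsel_utility(design, B=200):
    """Monte Carlo approximation to the NSEL expected utility,
    i.e., minus the squared error of the posterior mean."""
    theta_grid = rng.uniform(0.0, 2.0, size=1000)
    total = 0.0
    for _ in range(B):
        theta = rng.uniform(0.0, 2.0)
        y = simulate(theta, design)
        total -= (posterior_mean(y, design, theta_grid) - theta) ** 2
    return total / B

uniform_design = np.linspace(0.0, 20.0, 16)
```

Repeating such an evaluation (here, 20 times per design) quantifies the Monte Carlo error visible in the boxplots.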

Application: Transport of Serine Across Human Placenta
We now use the methodology in Section 3 to redesign the experiment for the human placenta study introduced in Section 1. The experimental protocol specifies fixed initial amounts of radioactive serine interior (u_01) and exterior (x_1) to the placenta (0 and 7.5 μl, respectively). The original design proposed by the experimenters used n = 7 placentas (runs) with differing amounts of nonradioactive serine interior (u_02) and exterior (x_2) to the placenta; see Table 1. Noisy observations on the amount of interior radioactive serine (u_1) were made at eight times, common to each of the seven placentas. The experimenters expected greater variability in the concentration of interior radioactive serine near the start of the experiment, before convergence to an equilibrium, and therefore chose a design containing a large number of early time points. We broadly follow this protocol, but find optimal designs using n = 2, . . . , 7 placentas, each with n_t = 8 observations taken at common times, t_1, . . . , t_8, chosen from across the interval [0, 600]. A hierarchical statistical model is assumed for the observed responses, y_jl = u_1(t_l; x_j, θ_j) + ε_jl, for j = 1, . . . , n and l = 1, . . . , n_t, where x_j = (x_1, x_2j)^T, the ε_jl are independent and identically normally distributed with constant variance σ², and θ_j holds the p = 4 subject-specific parameters for the jth placenta, with elements assumed to follow independent uniform prior distributions. The goal of the experiment is estimation of the population physical parameters θ = (θ_1, . . . , θ_p)^T.
We expect the solution to the system of equations (2) to be smooth, and so use the squared exponential covariance (8) for the probabilistic solution. The evaluation grid, τ, has size N = 601, and we set the auxiliary correlation parameter to α = 10N.
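The covariance specification can be sketched directly. The grid size and α = 10N follow the text; the length-scale below is an illustrative placeholder, not a value from the article.

```python
import numpy as np

def squared_exponential_cov(grid, length_scale, variance=1.0):
    """Dense squared exponential covariance matrix over an evaluation grid,
    encoding the smoothness expected of the solution u1(t)."""
    d = grid[:, None] - grid[None, :]
    return variance * np.exp(-0.5 * d ** 2 / length_scale ** 2)

tau = np.linspace(0.0, 600.0, 601)   # evaluation grid of size N = 601
alpha = 10 * len(tau)                # auxiliary correlation parameter, alpha = 10N
C = squared_exponential_cov(tau, length_scale=20.0)  # illustrative length-scale
```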
Specifying a design corresponds to specifying the n experimental conditions x 21 , . . . , x 2n , initial values u 021 , . . . , u 02n , and the common n t = 8 observation times t 1 , . . . , t n t . Hence for n = 2, . . . , 7, the design space has between 12 and 22 dimensions. As for the examples in Section 4, we impose a constraint on the observation times and specify that they must be at least 5 sec apart.
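One simple way to honor such a minimum-separation constraint inside an unconstrained coordinate search is to map candidate times to sorted, adequately spaced times. This reparameterization is our illustration only; it is not necessarily how the constraint is handled within ACE.

```python
import numpy as np

def enforce_min_spacing(times, gap, t_min, t_max):
    """Sort candidate observation times and push each successive time
    forward so that consecutive times are at least `gap` apart, then
    clip to the design interval [t_min, t_max]."""
    t = np.sort(np.asarray(times, dtype=float))
    for i in range(1, len(t)):
        t[i] = max(t[i], t[i - 1] + gap)
    return np.clip(t, t_min, t_max)

# Candidate times for the placenta example: at least 5 sec apart on [0, 600]
times = enforce_min_spacing([3.0, 1.0, 2.0, 400.0], gap=5.0, t_min=0.0, t_max=600.0)
```

Note that clipping at the upper end could in principle reintroduce a violation; a production implementation would handle that boundary case explicitly.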
We find designs for the NSEL, NAEL, 0-1 estimation, and 0-1 model selection utility functions defined in Section 3.1. For the 0-1 estimation utility, we set δ = (5, 5, 0.01, 5)^T; for the utility φ(θ, y, d) to equal 1, the posterior mean for θ must lie in the box ∏_{i=1}^{4} [θ_i − δ_i, θ_i + δ_i], which contains 0.5% of the volume of the prior support. For the model selection utility, we suppose interest lies in determining whether the reaction rates are equal, that is, does θ_3 = θ_4? To answer this question, we define two models: m_1 (where θ_3 = θ_4) and m_2 (where θ_3 ≠ θ_4).

Figure 5 presents boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the optimal design found under each utility function for n = 2, . . . , 7. We also present boxplots of the performance of the original design with n = 7. Unsurprisingly, the expected utility increases with n, and the optimal designs are clearly superior to the original design. For each utility function, the optimal design with n = 2 outperforms the original design with n = 7 placentas, with substantial differences in expected utility. Table 1 gives the treatments for each design found for n = 7. Figure 6 shows the observation times for the optimal designs under the NSEL, NAEL, and 0-1 estimation utilities, along with realizations from the solution u_1(t), for each run of each design. The designs under the NSEL and NAEL utilities have similar treatments and observation times. The initial concentrations in Table 1 lead to three distinct profiles of u_1(t) (labeled placentas 1 and 2; 3 and 4; and 5, 6, and 7; note, though, that the placentas are exchangeable). The profile for the design under the 0-1 model selection utility shows a decrease in u_1(t) with respect to t. Unlike the other optimal designs, its observation times are predominantly toward the end of the observation window. The u_1(t) profiles are similar under both models, with the most substantial differences occurring in the inter-profile variability toward the middle of the time interval. This region is where the majority of observation times are located.

Figure 5. Results from the placenta example in Section 5. Boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the original design and the optimal designs found under four different utility functions for n = 2, . . . , 7.

Table 1. Treatments from the optimal and original designs with n = 7 runs for the placenta example in Section 5: initial concentrations (to nearest integer) of interior (u_02 = u_2(0)) and exterior (x_2) nonradioactive serine for each run (placenta).

Figure 6. Results from the designs found under the SIG, NSEL, and NAEL utilities with n = 7 placentas in Section 5. Displayed are 100 draws from the solution u_1(t) plotted against t for values drawn from the prior distribution of θ, for each of the n = 7 placentas and treatments given in Table 1.
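The 0-1 estimation utility can be sketched directly: it is the indicator that the posterior mean falls in the box around the true θ. The prior ranges below are illustrative assumptions, chosen only so that the 0.5% volume calculation is reproduced.

```python
import numpy as np

def zero_one_estimation_utility(posterior_mean, theta_true, delta):
    """0-1 estimation utility: 1 if the posterior mean lies within
    theta_true +/- delta in every coordinate, else 0."""
    inside = np.all(np.abs(posterior_mean - theta_true) <= delta)
    return float(inside)

delta = np.array([5.0, 5.0, 0.01, 5.0])

# Volume of the target box relative to a hypothetical box-shaped prior
# support (these ranges are illustrative, chosen so the fraction is 0.5%).
prior_ranges = np.array([20.0, 20.0, 0.5, 20.0])
fraction = np.prod(2.0 * delta) / np.prod(prior_ranges)   # 0.005
```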
The original design proposed by the experimenters had an unequal spacing of observation times across the entire interval [0, 600]. There are more observations taken near the start of the interval, and the time points are not dissimilar to those in the optimal designs under NSEL, NAEL, and 0-1 estimation. However, the original design has treatments that are very different from any of the optimal designs, with an almost factorial structure and some treatments with high values of x 2 (exterior initial concentration of nonradioactive serine). None of the optimal designs include treatments with high x 2 , demonstrating how it is often difficult to predict by intuition the treatments in a Bayesian optimal design for a complicated nonlinear model. In addition, the designs for point estimation (under NSEL, NAEL, and 0-1 estimation utilities) are quite different to the design for model selection.

Concluding Remarks
This article introduces and demonstrates the first practical methodology for Bayesian optimal design of experiments for statistically nonlinear models formed from the solution to intractable ordinary differential equations. The work is motivated by a challenging design problem from the biological sciences, which we address through a combination of probabilistic solutions to the equations, simulation-based approximation of expected utilities, and optimization via smoothing and cyclic descent. Our novel adjustments to the Chkrebtii et al. (2016) probabilistic algorithm are key to providing a computationally efficient solution to the optimal design problem. Through demonstration on a number of examples, including the motivating experiment on serine transport across placental membranes, we show the efficiency gains that can be made by use of optimal designs over obvious, and previously proposed, alternatives. We also show how it is often not possible to "second guess" via intuition the solutions to optimal design problems for nonlinear models.

Figure 7. Results from the design under the 0-1 model selection loss with n = 7 placentas in Section 5. Displayed are 100 draws from the solution u_1(t) under model m_1 (θ_3 = θ_4) and model m_2 (θ_3 ≠ θ_4), plotted against t for values drawn from the prior distribution of θ, for each of the n = 7 placentas and treatments given in Table 1.
We have adopted the nested integration and optimization methods from Overstall and Woods (2017) to find optimal designs for the differential equation models in this article (namely the ACE algorithm). The Markov chain simulation schemes of Müller (1999) and Müller, Sanso, and De Iorio (2004), among other authors, would be an interesting alternative approach. Extension and application of such methods to the problems in the current article is an area for future research.
One key issue not addressed is model discrepancy (see, e.g., Kennedy and O'Hagan 2001; Plumlee 2017), that is, the systematic mismatch between the true physical process and the solution to the ordinary differential equations. Not taking account of this error can lead to significant bias in posterior estimates of the physical parameters (Brynjarsdottir and O'Hagan 2014). Future work will focus on Bayesian optimal design for physical models subject to model discrepancy.
Some limited insight into the impact of model mismatch can be gained from a simple extension to the compartmental model in Section 4.2. For the purpose of finding designs, we assume a misspecified model, (11), obtained by simplifying (10): θ_1 and θ_2 are set equal to their prior means in the fraction that multiplies the exponential term, while the exponential itself still depends on the unknown θ_1 and θ_2. We assume the same prior distributions for all parameters as in Section 4.2.
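As an illustration of this simplification, consider a stand-in compartmental mean function of the common form θ_3 {θ_1/(θ_1 − θ_2)} {exp(−θ_2 t) − exp(−θ_1 t)}; model (10) in the article need not take exactly this form, and the parameter values below are placeholders. The misspecified version fixes θ_1 and θ_2 at their prior means inside the fraction only.

```python
import numpy as np

def eta_full(t, theta):
    """Stand-in full model: the fraction and the exponentials both
    depend on the unknown parameters."""
    th1, th2, th3 = theta
    frac = th3 * th1 / (th1 - th2)
    return frac * (np.exp(-th2 * t) - np.exp(-th1 * t))

def eta_misspecified(t, theta, prior_mean):
    """Misspecified model: theta1 and theta2 are fixed at their prior
    means in the fraction, but the exponentials still depend on the
    unknown values."""
    th1, th2, th3 = theta
    m1, m2 = prior_mean
    frac = th3 * m1 / (m1 - m2)
    return frac * (np.exp(-th2 * t) - np.exp(-th1 * t))

t = np.linspace(0.0, 24.0, 25)
theta = (1.0, 0.2, 10.0)
prior_mean = (1.0, 0.2)   # at the prior means, the two models coincide
```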
To assess the impact of model mismatch, we find optimal designs under the SIG, NSEL, and NAEL utilities assuming the misspecified model (11). We then assess these designs under the correct, more complex, model (10) and compare them to designs found under the correct model by evaluating the approximate expected utility under the correct model; see Figure 8. Differences in approximate expected utility between the designs found under the correct and misspecified models are comfortably within Monte Carlo error for SIG. However, assuming a misspecified model under the NSEL utility results in a loss of expected utility of around 6%; the differences are somewhat smaller for NAEL but still larger than Monte Carlo error. Clearly, the reduction in expected utility from assuming a misspecified model will depend on the models under consideration, the difference between the models, and the choice of prior distributions, in addition to the choice of utility function. This is an important area for future research.

Figure 8. Results from the misspecified model example in Section 6. Boxplots of 20 evaluations of the Monte Carlo approximation to the expected utility for the optimal designs found under the correct model (10) and the misspecified model (11) under the SIG, NSEL, and NAEL utility functions. In each case, the correct model is assumed for evaluating the expected utility.

Supplementary Materials
The supplementary materials provide code for reproducing the results.