Efficient Global Optimization for Black-Box Simulation Via Sequential Intrinsic Kriging

Efficient Global Optimization (EGO) is a popular method that searches sequentially for the global optimum of a simulated system. EGO treats the simulation model as a black-box, and balances local and global searches. In deterministic simulation, EGO uses ordinary Kriging (OK), which is a special case of universal Kriging (UK). In our EGO variant we use intrinsic Kriging (IK), which eliminates the need to estimate the parameters that quantify the trend in UK. In random simulation, EGO uses stochastic Kriging (SK), but we use stochastic IK (SIK). Moreover, in random simulation, EGO needs to select the number of replications per simulated input combination, accounting for the heteroscedastic variances of the simulation outputs. A popular selection method uses optimal computing budget allocation (OCBA), which allocates the available total number of replications over simulated combinations. We derive a new allocation algorithm. We perform several numerical experiments with deterministic simulations and random simulations. These experiments suggest that (1) in deterministic simulations, EGO with IK outperforms classic EGO; (2) in random simulations, EGO with SIK and our allocation rule does not differ significantly from EGO with SK combined with the OCBA allocation rule.


Introduction
Optimization methods for black-box simulations (either deterministic or random) have many applications. Black-box simulation means that the input/output (I/O) function is an implicit mathematical function defined by the simulation model (computer code). In many engineering applications of computational fluid dynamics, the computation of the output (response) for a single input combination is usually time-consuming or computationally expensive. In most operations research (OR) applications, however, a single simulation run is computationally inexpensive, but there are extremely many input combinations; e.g., a single-server queueing model may have one input (namely, the traffic rate) that is continuous, so we can distinguish infinitely many input values, but in finite time we can simulate only a fraction of these values. In all these situations, it is common to use metamodels, which are also called emulators or surrogates. A popular method for the optimization of deterministic simulation is efficient global optimization (EGO), which uses Kriging as the metamodel; see the seminal article Jones et al. (1998). EGO has been adapted for random simulation with either homoscedastic noise variances (see Huang et al. (2006)) or heteroscedastic noise variances (see Picheny et al. (2013) and Quan et al. (2013)). EGO is the topic of much recent research; see Sun et al. (2014), Salemi et al. (2014), and the many references in Kleijnen (2015).
Our first contribution in this paper is the use of intrinsic Kriging (IK) as a metamodel for the optimization of deterministic simulation and random simulation. The basic idea of IK is to remove the trend from the I/O data by linear filtering of the data. Consequently, IK does not require the second-order stationarity condition, and it may provide a more accurate fit than Kriging; see Mehdad and Kleijnen (2015).
More specifically, for deterministic simulation we develop EGO with IK as the metamodel. For random simulation we develop stochastic IK (SIK) combined with the two-stage sequential algorithm developed by Quan et al. (2013). The latter algorithm accounts for heteroscedastic noise variances and balances two sources of noise; namely, spatial uncertainty due to the metamodel and random variability caused by the simulation. The latter noise is independent from one replication to another; i.e., we assume that the streams of pseudorandom numbers do not overlap. Moreover, we assume that different input combinations do not use common (pseudo)random numbers (CRN), so these input combinations give independent outputs.
Our second contribution concerns Quan et al. (2013)'s two-stage algorithm. We replace the optimal computing budget allocation (OCBA) in the allocation stage of the algorithm by an allocation rule that we build on IK. Our rule allocates the additional replications over the sampled points such that it minimizes the integrated mean squared prediction error (IMSPE).
In our numerical experiments we use test functions of different dimensionality, to study the differences between (1) EGO variants in deterministic simulation; (2) two-stage algorithm variants in random simulation. Our major conclusion will be that in most experiments (1) for deterministic simulations, EGO with IK outperforms EGO as formulated in Jones et al. (1998); (2) for random simulations, the difference between Quan et al. (2013)'s algorithm and our variant is not significant.
We organize the rest of this paper as follows. Section 2 summarizes classic Kriging. Section 3 explains IK. Section 4 summarizes classic EGO, the two-stage algorithm, and our variant. Section 5 presents numerical experiments. Section 6 summarizes our conclusions.

Kriging
In this section we summarize universal Kriging (UK), following Cressie (1991, pp. 151-182). UK assumes

Y(x) = f(x)'β + M(x),  (1)

where Y(x) is a random process at the point (or input combination) x, f(x) is a vector of p + 1 known regression functions or trend, β is a vector of p + 1 parameters, and M(x) is a stationary GP with zero mean and covariance function Σ_M. This Σ_M must be specified such that it makes M(x) in (1) a stationary GP; i.e., Σ_M is a function of the distance between the points x_i and x_{i'} with i, i' = 0, 1, ..., m, where the subscript 0 denotes a new point and m denotes the number of old points. Separable anisotropic covariance functions use the distances along the d axes, h_{i;i';g} = |x_{i;g} - x_{i';g}| (g = 1, ..., d). The most popular choice for such a function is the so-called Gaussian covariance function:

cov(M(x_i), M(x_{i'})) = τ² ∏_{g=1}^{d} exp(-θ_g h²_{i;i';g}).  (2)

Let Y = (Y(x_1), ..., Y(x_m))' denote the vector with the m values of the metamodel in (1) at the m old points. Kriging predicts Y at a (new or old) point x_0 linearly from the old I/O data (X, Y), where X = (x_1, ..., x_m) is the d × m matrix with the m points x_i = (x_{g;i}) (i = 1, ..., m; g = 1, ..., d):

Ŷ(x_0) = λ'Y with λ'F = f(x_0)',  (3)

where F is the m × (p + 1) matrix with rows f(x_i)', f(x_0) = (f_0(x_0), ..., f_p(x_0))', and the condition for λ guarantees that Ŷ(x_0) is an unbiased predictor. The optimal linear unbiased predictor minimizes the mean squared prediction error (MSPE), defined as MSPE(Ŷ(x_0)) = E[(Ŷ(x_0) - Y(x_0))²]. Cressie (1991, pp. 151-157) shows how to use Lagrangian multipliers to solve this constrained minimization problem, which gives the optimal weights and the predictor

Ŷ(x_0) = f(x_0)'β̂ + σ_M(x_0)'Σ_M^{-1}(Y - Fβ̂) with β̂ = (F'Σ_M^{-1}F)^{-1}F'Σ_M^{-1}Y,  (4)

with σ_M(x_0) denoting the m-dimensional vector with covariances between the outputs of the one new point and the m old points, and Σ_M denoting the m × m matrix with the covariances between the old outputs, so its element (i, i') is cov(M(x_i), M(x_{i'})). The corresponding Kriging variance is

σ²(x_0) = τ² - σ_M(x_0)'Σ_M^{-1}σ_M(x_0) + (f(x_0) - F'Σ_M^{-1}σ_M(x_0))'(F'Σ_M^{-1}F)^{-1}(f(x_0) - F'Σ_M^{-1}σ_M(x_0)).  (5)

This Kriging is an exact interpolator; i.e., for the old points, (4) gives a predictor that equals the observed output. Important for EGO is the property that for the old points the Kriging variance (5) reduces to zero.
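To make (4) and (5) concrete, here is a minimal numerical sketch of the ordinary-Kriging special case (constant trend, so p = 0), with hand-picked values for τ² and θ rather than the estimates used in practice:

```python
import numpy as np

def gauss_cov(A, B, tau2=1.0, theta=1.0):
    """Gaussian covariance between point sets A (m x d) and B (n x d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return tau2 * np.exp(-theta * d2)

def ok_predict(X, y, x0, tau2=1.0, theta=1.0):
    """Ordinary-Kriging predictor (4) and variance (5) at a new point x0."""
    m = len(y)
    Si = np.linalg.inv(gauss_cov(X, X, tau2, theta))      # Sigma_M^{-1}
    s0 = gauss_cov(X, x0[None, :], tau2, theta).ravel()   # sigma_M(x0)
    one = np.ones(m)                                      # F for p = 0
    beta = (one @ Si @ y) / (one @ Si @ one)              # GLS trend estimate
    pred = beta + s0 @ Si @ (y - beta * one)              # eq. (4)
    u = 1.0 - one @ Si @ s0                               # trend-estimation term
    var = tau2 - s0 @ Si @ s0 + u * u / (one @ Si @ one)  # eq. (5)
    return pred, max(var, 0.0)

X = np.array([[0.0], [0.5], [1.0]])   # three old points, d = 1
y = np.array([1.0, 0.2, 0.9])         # their simulated outputs
pred, var = ok_predict(X, y, np.array([0.5]))
# at an old point the predictor interpolates and the variance is zero
```

At old points this sketch reproduces the exact-interpolation property stated above; at new points the positive variance is what drives the EI search of Section 4.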
The Kriging metamodel defined in (1) can be extended to incorporate the so-called internal noise in random simulation; see Opsomer et al. (1999), Ginsbourger (2009), Ankenman et al. (2010), and Yin et al. (2011). The resulting stochastic Kriging (SK) metamodel at replication r of the random simulation output at x is

Z_r(x) = f(x)'β + M(x) + ε_r(x)  (6)

where ε_1(x), ε_2(x), ... denotes the internal noise at point x. We assume that ε_r(x) has a Gaussian distribution with mean zero and variance V(x) and that it is independent of M(x). Because the external noise M(x) and the internal noise ε(x) in (6) are independent, the SK predictor and its MSPE can be derived analogously to the derivation for UK in (4) and (5), except that Σ = Σ_M + Σ_ε replaces Σ_M, where Σ_ε is a diagonal matrix (no CRN) with the variances of the internal noise V(x_i)/n_i on the main diagonal, and Σ_M still denotes the covariance matrix of Kriging without internal noise. We also replace Y in (4) and (5) by the vector of sample means Z̄ = (Z̄(x_1), ..., Z̄(x_m))'.
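The change relative to the deterministic predictor is small and can be sketched as follows (again a constant trend and hand-picked Gaussian-covariance parameters; the only new ingredient is the diagonal internal-noise matrix):

```python
import numpy as np

def sk_predict(X, zbar, v_over_n, x0, tau2=1.0, theta=1.0):
    """Stochastic-Kriging predictor: as in (4), but with
    Sigma = Sigma_M + Sigma_eps and the sample means zbar as outputs;
    v_over_n holds the internal-noise variances V(x_i)/n_i."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    Si = np.linalg.inv(tau2 * np.exp(-theta * d2) + np.diag(v_over_n))
    s0 = tau2 * np.exp(-theta * ((X - x0) ** 2).sum(axis=1))
    one = np.ones(len(zbar))
    beta = (one @ Si @ zbar) / (one @ Si @ one)
    return beta + s0 @ Si @ (zbar - beta * one)

X = np.array([[0.0], [1.0]])
zbar = np.array([0.0, 1.0])
p_noisy = sk_predict(X, zbar, v_over_n=np.array([0.5, 0.5]), x0=np.array([0.0]))
p_exact = sk_predict(X, zbar, v_over_n=np.zeros(2), x0=np.array([0.0]))
# with internal noise the predictor smooths rather than interpolates
```

With zero internal noise the predictor interpolates the sample means; with positive noise it shrinks the prediction at an old point toward the overall trend.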

Intrinsic Kriging
In this section we explain IK, following Mehdad and Kleijnen (2015). We rewrite (1) in matrix form as

Y = Fβ + M  (7)

with Y = (Y(x_1), ..., Y(x_m))' and M = (M(x_1), ..., M(x_m))'. We do not assume that M is second-order stationary. Let Q be an m × m matrix such that QF = O, where O is an m × (p + 1) matrix with all elements zero. Together, Q and (7) give QY = QM. Consequently, the second-order properties of QY depend on QM and not on the regression function Fβ.
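The filtering idea behind (7) can be illustrated numerically. Any Q with QF = O removes the trend; one convenient choice (an illustration only, the theory merely requires QF = O) is the projector onto the orthogonal complement of the columns of F:

```python
import numpy as np

m = 5
x = np.linspace(0.0, 1.0, m)
F = np.column_stack([np.ones(m), x])        # p + 1 = 2 trend functions: 1 and x
# Residual projector: Q F = O by construction.
Q = np.eye(m) - F @ np.linalg.pinv(F)

beta = np.array([3.0, -2.0])                # arbitrary trend parameters
M = np.array([0.1, -0.2, 0.05, 0.0, 0.15])  # a realization of the process M
Y = F @ beta + M                            # eq. (7)
# QY equals QM: the filtered data no longer depend on the trend F beta
```

Whatever β is, QY is unchanged, which is exactly why the second-order properties of the filtered data involve only QM.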
To generalize the metamodel in (1), we need a stochastic process for which QM is second-order stationary; such processes are called intrinsically stationary processes. We assume that the f_j(x) (j = 1, ..., p + 1) are mixed monomials x_1^{i_1} ··· x_d^{i_d} with x = (x_1, ..., x_d) and nonnegative integers i_1, ..., i_d such that i_1 + ... + i_d ≤ k with k a given nonnegative integer. An intrinsic random function of order k (IRF-k) is a random process Y for which Σ_{i=1}^{m} λ_i Y(x_i + h) with x_i ∈ R^d is second-order stationary in the shift h, where λ = (λ_1, ..., λ_m)' is a generalized-increment vector of real numbers such that (f_j(x_1), ..., f_j(x_m))λ = 0 (j = 1, ..., p + 1).
IK is based on an IRF-k. Let M(x) be an IRF-k with mean zero and generalized covariance function K. Then the corresponding IK metamodel is

Y(x) = f(x)'β + M(x).  (8)

Cressie (1991, pp. 299-306) derives a linear predictor for the IRF-k metamodel defined in (8) with generalized covariance function K. So, given the old outputs Y, the linear predictor Ŷ(x_0) = λ'Y follows from minimizing the MSPE of the linear predictor, while IK should meet the condition

λ'F = f(x_0)'.  (9)

This condition is not introduced as the unbiasedness condition; instead the condition guarantees that the coefficients of the prediction error Ŷ(x_0) - Y(x_0) form a generalized increment, so the variance of this error is determined by K. The MSPE of the linear predictor is

MSPE(Ŷ(x_0)) = E[(λ'Y - Y(x_0))²].  (10)

In this section we assume that K is known, so the optimal linear predictor is obtained through minimization of (10) subject to (9). Hence, the IK predictor is

Ŷ(x_0) = f(x_0)'β̂ + k(x_0)'K^{-1}(Y - Fβ̂) with β̂ = (F'K^{-1}F)^{-1}F'K^{-1}Y,  (11)

where k(x_0) denotes the m-dimensional vector with the generalized covariances between the new point and the m old points, and K also denotes the m × m matrix with the generalized covariances between the old points. The IK variance, denoted by σ²_IK, is

σ²_IK(x_0) = K(0) - k(x_0)'K^{-1}k(x_0) + (f(x_0) - F'K^{-1}k(x_0))'(F'K^{-1}F)^{-1}(f(x_0) - F'K^{-1}k(x_0)).  (12)

Now we discuss properties of K. Obviously, K is symmetric. Moreover, K must be conditionally positive definite; i.e.,

λ'Kλ ≥ 0

where the condition (f_j(x_1), ..., f_j(x_m))λ = 0 must hold for j = 1, ..., p + 1. Following Mehdad and Kleijnen (2015) we choose K as the sum over dimensions of polynomial generalized covariances,

K(h) = Σ_{g=1}^{d} Σ_{j=0}^{k_g} (-1)^{j+1} θ_{j;g} |h_g|^{2j+1},  (13)

where θ = (θ_{0;1}, θ_{1;1}, θ_{0;2}, ..., θ_{0;d}, θ_{1;d})' ≥ 0. The function in (13) accepts a different k for each input dimension, so we have a vector of orders k = (k_1, ..., k_d)'.
In practice, K is unknown, so we estimate its parameters θ. For this estimation we use the restricted maximum likelihood (REML) estimator, which maximizes the likelihood of transformed data that do not contain the unknown parameters of the trend. This transformation is close to the concept of intrinsic Kriging. So we assume that Y in (8) is a Gaussian IRF-k. The REML estimator of θ is then found through minimization of the negative log-likelihood function

ℓ(θ) = ((m - q)/2) ln 2π + (1/2) ln|K(θ)| + (1/2) ln|F'K(θ)^{-1}F| + (1/2) Y'Π(θ)Y  (14)

where q = rank(F) and Π(θ) = K(θ)^{-1} - K(θ)^{-1}F(F'K(θ)^{-1}F)^{-1}F'K(θ)^{-1}. Finally, we replace K by K(θ̂) in (11) to obtain Ŷ(x_0) and in (12) to obtain σ̂²_IK. We could require REML to also estimate the optimal (integer) k*, but this requirement would make the optimization even more difficult; therefore we set k = 0. Mehdad and Kleijnen (2015) also extends IK to account for random simulation output with noise variances that change across the input space; the methodology is similar to the extension of Kriging to stochastic Kriging. The interpolating property of IK does not make sense for random simulation, which has sampling variability or internal noise besides the external noise or spatial uncertainty created by the fitted metamodel.
The extension of IK to account for internal noise with a constant noise variance has already been studied in the (geostatistics) literature as the so-called nugget effect; see Cressie (1991, p. 305). Mehdad and Kleijnen (2015) considers the case of heteroscedastic noise variances. Similar to SK, the stochastic IK (SIK) metamodel at replication r of the random output at x is

Z_r(x) = f(x)'β + M(x) + ε_r(x)  (15)

where ε_1(x), ε_2(x), ... denotes the internal noise at x. We again assume that ε(x) has a Gaussian distribution with mean zero and variance V(x) and that ε(x) is independent of M(x).
In stochastic simulation, the experimental design consists of the pairs (x_i, n_i), i = 1, ..., m, where n_i is the number of replications at x_i. These replications enable computing the classic unbiased estimators of the mean output and the internal variance:

Z̄(x_i) = (1/n_i) Σ_{r=1}^{n_i} Z_r(x_i) and s²(x_i) = (1/(n_i - 1)) Σ_{r=1}^{n_i} (Z_r(x_i) - Z̄(x_i))².  (16)

Because M(x) and ε(x) in (15) are assumed to be independent, the SIK predictor and its MSPE can be derived similarly to the IK predictor and MSPE in (11) and (12), except that K_M is replaced by K = K_M + K_ε, where K_ε is a diagonal matrix with V(x_i)/n_i on the main diagonal, and K_M still denotes the generalized covariance matrix of IK without internal noise. We also replace Y in (11) and (12) by the vector of sample means Z̄; this gives the SIK predictor (17) and its MSPE (18). We again use REML to estimate the parameters θ of the generalized covariance function, and replace K_M by K_M(θ̂). We also need to estimate the internal noise variances V, which are typically unknown. Let V̂(x_i) = s²(x_i) be the estimator of V(x_i), so we replace K_ε by K̂_ε = diag(V̂(x_1)/n_1, ..., V̂(x_m)/n_m). Finally, we plug these estimators into (17) and (18). Next we explain how we choose the number of replications n_i at each old point.
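The per-point estimators above are plain sample means and (unbiased) sample variances; a minimal sketch:

```python
import numpy as np

def point_estimators(replications):
    """Per-point sample mean zbar(x_i) and sample variance s^2(x_i);
    `replications` maps each point index i to its list of n_i outputs."""
    zbar = np.array([np.mean(z) for z in replications])
    s2 = np.array([np.var(z, ddof=1) for z in replications])  # unbiased (n_i - 1)
    return zbar, s2

reps = [[1.0, 3.0], [2.0, 2.0, 2.0]]   # n_1 = 2 and n_2 = 3 replications
zbar, s2 = point_estimators(reps)      # zbar = [2.0, 2.0], s2 = [2.0, 0.0]
```

Note the `ddof=1` divisor n_i - 1, which makes s²(x_i) the unbiased estimator required by the text.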
We are interested in an experimental design that gives a low integrated MSPE (IMSPE). Following Mehdad and Kleijnen (2015), we minimize

∫ MSPE(x_0, n) dx_0 subject to n'1_m ≤ N

where n = (n_1, ..., n_m)' with n_i ∈ N. We try to compute n*_i, which denotes the optimal allocation of the total number of replications N over the m old points. We then run into the problem that n*_i is determined by both K_M (external noise) and K_ε (internal noise). To solve this problem, we follow Ankenman et al. (2010) and ignore K_ε so K ≈ K_M, and we relax the integrality condition. Altogether we derive

n*_i = N √(V(x_i)C_i) / Σ_{i'=1}^{m} √(V(x_{i'})C_{i'})  (19)

where C_i is the integral over x_0 of a quadratic form determined by K_M and J_(ii), and J_(ii) is a (p + 1 + m) × (p + 1 + m) matrix with 1 in position (p + 1 + i, p + 1 + i) and zeros elsewhere. Note that in (19) both the internal noise variance V(x) and the external noise covariance function K_M affect the allocation.
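Given the integrals C_i, the IMSE-style allocation of Ankenman et al. (2010) that underlies (19) is proportional to √(V(x_i)C_i); a sketch with placeholder C_i values (not the actual integrals), followed by rounding that preserves the total budget:

```python
import numpy as np

def allocate(N, V, C):
    """Distribute N replications over m points proportionally to
    sqrt(V_i * C_i), rounding down and then handing out the leftover
    replications to the largest weights so that the total stays N."""
    w = np.sqrt(np.asarray(V, float) * np.asarray(C, float))
    n = np.floor(N * w / w.sum()).astype(int)
    for i in np.argsort(-w)[: N - n.sum()]:   # leftover goes to largest weights
        n[i] += 1
    return n

# noisier points (large V) and influential points (large C) get more budget:
n = allocate(100, V=[4.0, 1.0, 1.0], C=[1.0, 1.0, 4.0])
# weights sqrt(4):sqrt(1):sqrt(4) = 2:1:2
```

The rule rewards both a high internal-noise variance V(x_i) and a high sensitivity C_i of the IMSPE, mirroring the remark after (19).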

Efficient global optimization (EGO)
In this section we first summarize EGO and three variants; next we detail one variant, which is the focus of our article.

EGO and three variants
EGO is the global optimization algorithm developed by Jones et al. (1998) for deterministic simulation. It uses expected improvement (EI) as its criterion to balance local and global search; i.e., exploitation and exploration. EGO is a sequential method with the following steps.
1. Fit a Kriging metamodel to the old I/O data. Let f_min = min_i Y(x_i) be the minimum function value observed (simulated) so far.

2. Estimate the point x̂_0 that maximizes the expected improvement. Jones et al. (1998) derive

EI(x) = (f_min - Ŷ(x)) Φ((f_min - Ŷ(x))/σ̂(x)) + σ̂(x) φ((f_min - Ŷ(x))/σ̂(x))  (20)

where Ŷ(x) is defined in (4) and σ̂²(x) follows from (5) after substituting estimators for τ² and θ; Φ and φ denote the cumulative distribution function (CDF) and probability density function (PDF) of the standard normal distribution.
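The EI formula above can be sketched as follows (using the standard-normal CDF via the error function; f_min, the predictor, and its standard deviation are the inputs):

```python
import math

def expected_improvement(f_min, pred, sd):
    """EI of Jones et al. (1998): (f_min - pred) * Phi(u) + sd * phi(u)
    with u = (f_min - pred) / sd; defined as zero when sd = 0 (old points)."""
    if sd <= 0.0:
        return 0.0
    u = (f_min - pred) / sd
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))        # standard-normal CDF
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # standard-normal PDF
    return (f_min - pred) * Phi + sd * phi

# EI is zero at old points (sd = 0) and grows with predictive uncertainty:
ei_small = expected_improvement(0.0, 0.5, 0.1)
ei_large = expected_improvement(0.0, 0.5, 1.0)
```

Holding the predictor fixed, a larger Kriging standard deviation yields a larger EI, which is exactly the exploration incentive of step 2.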
3. Simulate the response at x̂_0 found in step 2. Fit a new Kriging metamodel to the old and new points. Return to step 1, unless the EI satisfies a given stopping criterion; e.g., EI is less than 1% of the current best function value.

Huang et al. (2006) adapts EI for random simulation. That article uses the metamodel defined in (15) and assumes that the noise variances are identical across the design space; i.e., V(x) = V. It introduces the following augmented EI (AEI) to balance exploitation and exploration:

AEI(x) = E[max(Ŷ(x*) - Y(x), 0)] (1 - σ_ε/√(σ̂²(x) + σ²_ε))  (21)

where x* is called the current 'effective best solution' and is defined as x* = argmin_{x_1,...,x_m} [Ẑ(x) + MSPE(Ẑ(x))]; the second factor on the right-hand side of (21) accounts for the diminishing returns of additional replications at the current best point.

Picheny et al. (2013) develops a quantile-based EI known as expected quantile improvement (EQI). This criterion lets the user specify the risk level; i.e., the higher the specified quantile, the more conservative the criterion becomes. The EQI accounts for a limited computing budget; moreover, to sample a new point, EQI also considers the noise variance at future (not yet sampled) points. However, EQI requires a known variance function for the noise, and Quan et al. (2013) claims that Picheny et al. (2013)'s algorithm is also computationally more complex.

Quan et al. (2013) shows that EI and AEI cannot be good criteria for random simulations with heteroscedastic noise variances. The argument is that an EGO-type framework for random simulation with heteroscedastic noise faces three challenges: (1) An effective procedure should locate the global optimum with a limited computing budget. (2) To balance exploration and exploitation in random simulation, a new procedure should be able to search globally without exhaustively searching a local region; a good estimator of f_min is necessary, especially when several points give outputs close to the global optimum. (3) With a limited computing budget, it is wise to obtain simulation observations in unexplored regions in the beginning of the search and, as the budget is being expended toward the end, to focus on improving the current best area. Quan et al. (2013) develops a two-stage algorithm that addresses these three challenges; we detail this algorithm next.

Quan et al. (2013)'s algorithm
Now we explain Quan et al. (2013)'s algorithm in detail. After the initial fit of a SK metamodel, each iteration of the algorithm consists of a search stage followed by an allocation stage. In the search stage, the modified expected improvement (MEI) criterion is used to select a new point; see (22) below. Next, in the allocation stage, OCBA distributes an additional number of replications over the sampled points. Altogether, the search stage locates the potential global minima, and the allocation stage reduces the noise caused by random variability at sampled points, to improve the metamodel in regions where local minima exist and finally to select the global optimum. Quan et al. (2013)'s algorithm has a 'division of allocation' heuristic. The computing budget per iteration is set as a constant, but the allocation of this budget between the search stage and the allocation stage changes as the algorithm progresses. In the beginning, most of the budget is invested in exploration (search stage). As the algorithm progresses, the focus moves to identifying the point with the lowest sample mean (allocation stage).
In the search stage, we compute

MEI(x) = E[max(Ẑ_min - Z_p, 0)]  (22)

where Ẑ_min is the predicted response at the sampled point with the lowest sample mean, and Z_p is a normal random variable with mean equal to Ẑ(x) and variance equal to the estimated MSPE(Ẑ(x)). The allocation stage addresses the random noise, so the search stage uses only MSPE(Ẑ(x)) with estimates of Σ = Σ_M instead of Σ = Σ_M + Σ_ε. This helps the search to focus on new points that reduce the spatial uncertainty of the metamodel. Ignoring the uncertainty caused by random variability, the MEI criterion assumes that the observations are made with infinite precision, so the same point is never selected again. This helps the algorithm to quickly escape from a local optimum, and makes the sampling behavior resemble the behavior of the original EI criterion with its balancing of exploration and exploitation.

The allocation stage reduces random variability through the allocation of additional replications among sampled points. Additional replications are distributed with the goal of maximizing the probability of correct selection (PCS) when selecting a sampled point as the global optimum. Suppose that we have m sampled points x_i (i = 1, ..., m) with sample means Z̄_i and sample variances V̂(x_i) = s²(x_i). Then the approximate PCS (APCS) is asymptotically maximized, as the available computing budget N tends to infinity, when

n_i/n_j = (s(x_i)/∆_{b,i})² / (s(x_j)/∆_{b,j})² (i ≠ j ≠ b)  (23)

and

n_b = s(x_b) √(Σ_{i≠b} n²_i/s²(x_i))  (24)

where n_i is the number of replications allocated to x_i, x_b is the point with the lowest sample mean, and ∆_{b,i} is the difference between the lowest sample mean and the sample mean at point x_i; Chen et al. (2000) proves (23) and (24). Given this allocation rule, at the end of the allocation stage, the sampled point with the lowest sample mean is selected as the estimated global optimum. We summarize Quan et al. (2013)'s algorithm in our Algorithm 1.
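The OCBA rule of Chen et al. (2000) can be sketched as follows (a simplified illustration that returns budget fractions and ignores integer-rounding details; it assumes a unique lowest sample mean):

```python
import numpy as np

def ocba_fractions(zbar, s):
    """Relative OCBA allocation: for non-best points i, the share is
    proportional to (s_i / Delta_{b,i})^2; for the best point b it is
    s_b * sqrt(sum over i != b of share_i^2 / s_i^2). Returns fractions."""
    zbar, s = np.asarray(zbar, float), np.asarray(s, float)
    b = int(np.argmin(zbar))                 # point with the lowest sample mean
    delta = zbar - zbar[b]                   # Delta_{b,i}, positive for i != b
    r = np.zeros_like(zbar)
    others = np.arange(len(zbar)) != b
    r[others] = (s[others] / delta[others]) ** 2
    r[b] = s[b] * np.sqrt(np.sum(r[others] ** 2 / s[others] ** 2))
    return r / r.sum()

# points whose sample means are close to the best one get more budget:
frac = ocba_fractions(zbar=[0.0, 1.0, 2.0], s=[1.0, 1.0, 1.0])
```

With equal variances, the point closest in mean to the best one receives more of the additional budget than a clearly inferior point, which matches the intuition behind maximizing the APCS.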
Before the algorithm begins, the user must specify T , B, m 0 , and r min where T is the total number of replications at the start, B is the number of replications available for each iteration, m 0 is the size of the initial space filling design, and r min is the minimum number of replications for a new point. The size of the initial design m 0 may be set to 10d where d is the number of dimensions (10d is proposed by Loeppky et al. (2009)); B and r min should be set such that there are sufficient replications available for the first allocation stage.
The starting parameters that determine the number of iterations, I = (T - m_0 B)/B, and the computing budget used per iteration, B, are set prior to collecting any data, so the starting parameter settings may turn out to be unsuitable for the problem. In step 2, leave-one-out cross-validation can provide feedback regarding the suitability of the initial parameters. If one or more design points fail the cross-validation test, then the computing budget may be insufficient to deal with the internal noise. Possible solutions include (i) increasing B, (ii) increasing the number of design points around the point(s) that fail the cross-validation test, and (iii) applying a logarithmic or inverse transformation to the simulation response.

Algorithm 1 Quan et al. (2013)'s algorithm
Step 1. Initialization: Run a space-filling design with m_0 points, with B replications allocated to each point.
Step 2. Validation: Fit a SK metamodel to the set of sample means. Use leave-one-out cross-validation to check the quality of the initial SK metamodel.
Step 3. Set i = 1 and r_A(0) = 0.
while i ≤ I do
Step 3a. Search stage: Sample a new point that maximizes the MEI criterion (22), with r_S(i) replications.
Step 3b. Allocation stage: Distribute r_A(i) additional replications over the sampled points, using OCBA.
Step 3c. Fit a SK metamodel to the set of sample means; set i = i + 1.
end while
The point with the lowest sample mean at the end estimates the global optimum.
After successful validation of the SK metamodel, the computing budget set aside for the allocation stage, r_A(i), increases by a block of (B - r_min)/I replications in every iteration, while r_S(i) for the search stage decreases by the same amount. This heuristic gives the algorithm the desirable characteristic of focusing on exploration at the start and on exploitation at the end of the search.
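One plausible reading of this linear schedule, sketched numerically (an assumption, since the text does not spell out the exact bookkeeping; here r_S(i) includes the r_min replications for the new point):

```python
def budget_split(B, r_min, I):
    """Per-iteration split of the budget B: the allocation stage grows by a
    block of (B - r_min) / I replications each iteration, and the search
    stage shrinks by the same block, ending at r_min in the last iteration."""
    block = (B - r_min) / I
    schedule = []
    for i in range(1, I + 1):
        r_A = round(i * block)   # allocation stage in iteration i
        r_S = B - r_A            # search stage in iteration i
        schedule.append((r_S, r_A))
    return schedule

sched = budget_split(B=40, r_min=10, I=3)   # blocks of 10 replications
```

Under this reading the search stage starts with most of the budget and ends with exactly r_min replications, matching the exploration-then-exploitation characteristic described above.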
Our variant of Quan et al. (2013)'s algorithm differs from Algorithm 1 as follows: (i) for the underlying metamodel we use SIK instead of SK; (ii) in the allocation stage, we use (19) instead of (23) and (24). We keep Quan et al. (2013)'s MEI criterion defined in (22).

Numerical experiments
We experiment with deterministic simulation and random simulation. In these experiments we use a zero-degree polynomial for the trend (so p = 0 in (1)), so UK becomes ordinary Kriging (OK). For deterministic simulation we study the performance of EGO with OK versus EGO with IK. For random simulation we study the performance of Quan et al. (2013)'s algorithm versus our variant with SIK (instead of SK) and the minimum IMSE allocation rule (instead of OCBA).
We tried to use the MATLAB code developed by Yin et al. (2011)-which is a building block in Quan et al. (2013)'s algorithm-to experiment with the Kriging variants (OK for deterministic simulation and SK for random simulation), but that MATLAB code crashed in experiments with d > 1. We therefore use the R package DiceKriging to implement OK and SK; for details on DiceKriging we refer to Roustant et al. (2012). We implement our code for IK and SIK in MATLAB, as Mehdad and Kleijnen (2015) also does. In all our experiments we select k = 0.
Furthermore, we select a set of m_c = 100d candidate points, and as the 'winning' point we pick the candidate point that maximizes EI or MEI. For d = 1 we select m_0 and m_c equispaced points; for d > 1 we select m_0 and m_c space-filling points through Latin hypercube sampling (LHS), implemented in the MATLAB function lhsdesign.
As the criterion for comparing the performance of different optimization algorithms, we use the number of simulated input combinations, say m, needed to estimate the optimal input combination. As the stopping criterion we let m reach a limit; namely, 11 for d = 1, 61 for the camel-back test function (d = 2), 65 for Hartmann-3 (d = 3), and 111 for Ackley-5 (d = 5). We select the number of starting points m_0 to be 3 for d = 1, 21 for d = 2, 30 for d = 3, and 51 for d = 5.
Deterministic simulation experiments
Next we experiment with several functions that are popular in optimization; see Dixon and Szego (1978) and http://www.sfu.ca/~ssurjano/index.html: (1) six-hump camel-back with d = 2; (2) Hartmann-3 with d = 3; (3) Ackley-5 with d = 5; we define these three test functions in the appendix.

Figure 1 illustrates EGO with IK for the d = 1 test function defined in (25). This function has a global minimum at x_opt = 0.5486 with output f(x_opt) = -0.869; also see the curves in the left panels of the figure, where the (blue) solid curve is the true function and the (red) dotted curve is the IK metamodel. We start with m = 3 old points, and stop after sequentially adding seven new points (shown by black circles); i.e., from top to bottom the number of points increases, starting with three points and ending with ten points. These plots show that a small m gives a poor metamodel. The right panels display EI as m increases. These panels show that EI = 0 at the old points.

Figure 2 displays f_min(m) = min f(x_i) (1 ≤ i ≤ m), which denotes the estimated optimal simulation output after m simulated input combinations; horizontal line segments mean that the most recent simulated point does not give a lower estimated optimal output than a preceding point. The square marker represents EGO with IK and the circle marker represents classic EGO with OK. The results show that in most experiments, EGO with IK performs better than classic EGO with OK; i.e., EGO with IK gives a better input combination after fewer simulated points.

Random simulation experiments
In this subsection we compare the performance of Quan et al. (2013)'s algorithm with our variant, which uses SIK as the metamodel and a different allocation rule. In both variants of the algorithm we select r_min = 10 and B = 40 (for d = 1), 130 (for d = 2 and 3), or 310 (for d = 5). In all our experiments for random simulation we augment the deterministic test functions defined for our deterministic experiments with the heteroscedastic noise variance V(x_i) = (1 + |y(x_i)|)². Figure 3 illustrates our variant for the d = 1 test function. We start with m = 3 old points, and stop after sequentially adding seven new points. With small m and high noise (as x increases), the metamodel turns out to be a 'poor' approximation; compare the (blue) solid curves and the (red) dashed curves. Note that in the beginning the algorithm searches the region far from the unknown global optimum (namely, x = 0.5486); after each iteration and careful allocation of added replications, the quality of the SIK fit in areas with high noise (when moving to the right) improves. The right panels display MEI as m increases.
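The heteroscedastic noise used in these experiments can be generated as follows; the one-dimensional test function below is a hypothetical stand-in, not the function (25) of the paper:

```python
import math
import random

def noisy(f, x, rng):
    """Simulation output: f(x) plus Gaussian noise with the heteroscedastic
    standard deviation 1 + |f(x)|, i.e., variance V(x) = (1 + |f(x)|)^2."""
    y = f(x)
    return y + rng.gauss(0.0, 1.0 + abs(y))

f = lambda x: math.sin(2.0 * math.pi * x)   # hypothetical stand-in test function
rng = random.Random(123)                    # fixed seed for reproducibility
sample = [noisy(f, 0.25, rng) for _ in range(2000)]
# at x = 0.25, f(x) = 1, so the noise standard deviation is 1 + |1| = 2
```

Averaged over many replications, the sample mean approaches f(x) while the sample standard deviation approaches 1 + |f(x)|, so the noise grows where |f| is large, exactly the regime in which the allocation stage spends most of its replications.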
Finally, we compare the two variants using 50 macro-replications. Figure 4 displays f̄_min(m) = min_i (1/50) Σ_{t=1}^{50} f_t(x_i) (1 ≤ i ≤ m), which denotes the estimated optimal simulation output averaged over the 50 macro-replications. The circle marker represents Quan et al. (2013)'s original algorithm, and the square marker represents our variant. The results show that the two variants are not significantly different at most sampled points, except for Hartmann-3, where the original variant performs significantly better for m = 32, ..., 45 and our variant performs significantly better for m = 31, 59, ..., 65. We note that this conclusion is confirmed by paired t-tests, which we do not detail here.

Conclusions
We adapted EGO by replacing OK with IK for deterministic simulation; for stochastic simulation we modified Quan et al. (2013)'s algorithm by replacing SK with a SIK metamodel and introducing a new allocation rule. We compared the performance through numerical experiments with several classic test functions. Our main conclusion is that in most experiments (i) in deterministic simulations, EGO with IK performs better than Jones et al. (1998)'s EGO with OK; (ii) in random simulation, there is no significant difference between the two variants of Quan et al. (2013)'s algorithm.
In future research we may further investigate the allocation of replications in random simulation analyzed by Kriging metamodels. Furthermore, we would like to see more practical applications of our methodology.

Figure 4: Estimated optimal output (y-axis) after m simulated input combinations (x-axis) for four test functions, for random simulation.

A Test functions with dimensionality d > 1
We define the following three types of test functions.