Optimization by moving ridge functions: Derivative-free optimization for computationally intensive functions

A novel derivative-free algorithm, optimization by moving ridge functions (OMoRF), for unconstrained and bound-constrained optimization is presented. This algorithm couples trust region methodologies with output-based dimension reduction to accelerate convergence of model-based optimization strategies. The dimension-reducing subspace is updated as the trust region moves through the design space, allowing OMoRF to be applied to functions with no known global low-dimensional structure. Furthermore, its low computational requirement allows it to make rapid progress when optimizing high-dimensional functions. Its performance is examined on a set of test problems of moderate to high dimension and a high-dimensional design optimization problem. The results show that OMoRF compares favourably to other common derivative-free optimization methods, particularly when very few function evaluations are available.


Introduction
Derivative-free optimization (DFO) methods seek to solve optimization problems using only function evaluations, that is, without the use of derivative information. These methods are particularly suited to cases where the objective function is a 'black-box' or computationally intensive (Conn, Scheinberg, and Vicente 2009). In such cases, computing gradients analytically or through algorithmic differentiation may be infeasible, and approximating gradients using finite differences may be intractable. Common applications of DFO methods include engineering design optimization (Kipouros et al. 2008) and hyper-parameter optimization in machine learning (Ghanbari and Scheinberg 2017), among others (Levina et al. 2009). Derivative-free trust region (DFTR) methods are an important class of DFO methods which iteratively create and optimize a local surrogate model of the objective in a small region of the search space, called the trust region. Unlike standard trust region methods, DFTR methods use interpolation or regression to construct the surrogate model, thereby avoiding the use of derivative information.
However, acquiring enough samples for surrogate model construction may be computationally prohibitive for problems of moderate to high dimension. Fortunately, it has been shown that many functions of interest vary primarily along a low-dimensional linear subspace of the input space. For example, efficiency and pressure ratio of turbomachinery models (Seshadri et al. 2018), merit functions of hyper-parameters of neural networks (Bergstra and Bengio 2012), and drag and lift coefficients of aerospace vehicles (Lukaczyk et al. 2014) have all been shown to have low-dimensional structure. Functions that have this structure are known as ridge functions (Pinkus 2015) and may be written as

f(x) = m(U^T x),   (1)

where f : D ⊆ R^n → R, m : proj_U(D) ⊆ R^d → R, U ∈ R^{n×d}, proj_U(D) denotes the d-dimensional projection of the domain D onto the subspace U, and d < n. If d ≪ n, then exploiting low-dimensional structure may lead to significant reductions in computational requirements. When using ridge function approximations, an assumption of a global subspace U is generally made. That is, it is assumed that the function f(x) varies along a constant linear subspace throughout the whole domain D. This assumption may limit the application of ridge function approximations to a few special cases.
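To make the ridge structure (1) concrete, the following minimal Python sketch constructs an exact ridge function and verifies that it is constant along directions orthogonal to U. The particular subspace U and profile m are arbitrary illustrative choices, not taken from any application in this article.

```python
import numpy as np

# A 10-dimensional function that only varies along d = 2 directions.
n, d = 10, 2
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal n x d basis

def m(y):
    # low-dimensional profile defined on the projected coordinates y = U^T x
    return y[0]**2 + np.sin(y[1])

def f(x):
    # ridge function f(x) = m(U^T x), as in (1)
    return m(U.T @ x)

# Perturbing x within the inactive subspace (orthogonal to U) leaves f unchanged.
x = rng.standard_normal(n)
z = rng.standard_normal(n)
z -= U @ (U.T @ z)                  # project out the active directions
print(np.isclose(f(x), f(x + z)))   # True: f is constant along inactive directions
```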
In this article, a novel DFTR method for unconstrained and bound-constrained nonlinear optimization of computationally intensive functions is presented. This algorithm leverages dimension reduction via local ridge function approximations which move through the function domain. This article has three main contributions. First, a new algorithm, called optimization by moving ridge functions (OMoRF), is presented. An open source Python implementation of this algorithm has been made available for public use. Second, a novel sampling strategy for constructing local ridge function surrogate models in DFTR methods is proposed. Finally, OMoRF is applied to a variety of test problems, including a high-dimensional aerodynamic design optimization problem.
The rest of this article is organized as follows. A brief introduction to trust region methods is provided in Section 2. In Section 3, algorithms for constructing ridge function approximations are explored, and some limitations of using globally defined subspaces are presented. The OMoRF algorithm is presented in Section 4. In Section 5, the algorithm is tested against other common DFO methods on a variety of test problems. Finally, a few concluding remarks are provided in Section 6.

Trust region methods
Trust region methods replace the unconstrained optimization problem

min_{x ∈ R^n} f(x)   (2)

with a sequence of trust region subproblems

min_{s} m_k(x_k + s) subject to ‖s‖ ≤ ∆_k,   (3)

where x_k is the current iterate, ∆_k is the trust region radius, and m_k is a simple model which approximates f in the trust region

B(x_k, ∆_k) = {x ∈ R^n : ‖x − x_k‖ ≤ ∆_k}.   (4)

The solution to the trust region subproblem (3) gives a step s_k, with x_k + s_k a candidate for the next iterate x_{k+1}. The ratio

r_k = [f(x_k) − f(x_k + s_k)] / [m_k(x_k) − m_k(x_k + s_k)]   (5)

is used to determine whether the candidate is accepted or rejected and whether the trust region radius is increased or reduced.
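A minimal sketch of the accept/reject logic driven by the ratio (5) is given below. The thresholds and expansion/contraction factors are illustrative values only, not those used by any particular solver discussed in this article.

```python
def trust_region_update(f, m_k, x_k, s_k, delta_k,
                        eta1=0.1, eta2=0.7, gamma_dec=0.5, gamma_inc=2.0):
    """One accept/reject step based on the ratio r_k in (5)."""
    actual = f(x_k) - f(x_k + s_k)           # actual reduction
    predicted = m_k(x_k) - m_k(x_k + s_k)    # reduction predicted by the model
    r_k = actual / predicted                 # assumes predicted > 0 (descent step)
    if r_k >= eta2:                          # very successful: accept and expand
        return x_k + s_k, gamma_inc * delta_k
    elif r_k >= eta1:                        # successful: accept, keep the radius
        return x_k + s_k, delta_k
    else:                                    # unsuccessful: reject and shrink
        return x_k, gamma_dec * delta_k
```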

Derivative-free trust region methods
For standard trust region methods, a common choice of model is the Taylor series expansion centred around x_k,

m_k(x_k + s) = f(x_k) + s^T ∇f(x_k) + (1/2) s^T B_k s,   (6)

where B_k is a symmetric matrix approximating the Hessian ∇^2 f(x_k). Constructing this Taylor quadratic clearly requires knowledge of the function derivatives. Alternatively, DFTR methods may use interpolation or regression to construct m_k. That is, using a set of p samples X = {x_1, x_2, ..., x_p} and q basis functions φ(x) = {φ_1(x), ..., φ_q(x)}, the model is defined as

m_k(x) = Σ_{j=1}^{q} α_j φ_j(x),   (7)

where α_j for j = 1, ..., q are the coefficients of the model. In the case of fully-determined interpolation, the number of samples is equal to the number of coefficients, i.e. p = q, so the coefficients may be determined by solving the linear system

M α = f,   (8)

where M_{ij} = φ_j(x_i), α = (α_1, ..., α_q)^T and f_i = f(x_i). Note that the current iterate x_k is generally included in the interpolation set, so x_k ∈ X.

Global convergence of trust region methods which use the Taylor quadratic (6) has been proven (Conn, Gould, and Toint 2000). These convergence properties rely heavily on the well-understood error bounds for Taylor series. For these same guarantees to hold for DFTR methods, one must ensure the surrogate models satisfy the Taylor-like error bounds

|f(x) − m_k(x)| ≤ κ_1 ∆_k^2,   (9)

‖∇f(x) − ∇m_k(x)‖ ≤ κ_2 ∆_k,   (10)

for all x ∈ B(x_k, ∆_k), where κ_1, κ_2 > 0 are independent of x_k and ∆_k. Models which satisfy these conditions are known as fully linear. To ensure a model is fully linear, certain geometric conditions on the sample set X must be satisfied (Conn, Scheinberg, and Vicente 2009). In the case of polynomial surrogates, Conn, Scheinberg, and Vicente (2009) proved that these conditions on X are equivalent to bounding the condition number of the matrix

M̂_{ij} = φ_j((x_i − x_k)/∆_k),   (11)

i.e. the interpolation matrix formed from the shifted and scaled samples. In particular, it was shown that, if the condition number of M̂ is sufficiently bounded, then polynomial models constructed from X = {x_1, x_2, ..., x_p} are fully linear. Numerous algorithms for improving the geometry of X for polynomial interpolation are presented by Conn, Scheinberg, and Vicente (2009) (see Chapter 6).
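As an illustration of the fully-determined case, the following sketch constructs a quadratic interpolation model in n = 2 dimensions by solving the linear system (8) directly. The basis ordering and sample choices are for demonstration only.

```python
import itertools
import numpy as np

def quadratic_basis(x):
    # Natural polynomial basis of degree 2: 1, x_i, then x_i * x_j with i <= j.
    terms = [1.0, *x]
    terms += [x[i] * x[j] for i, j in
              itertools.combinations_with_replacement(range(len(x)), 2)]
    return np.array(terms)

def fit_interpolation_model(X, fvals):
    """Solve M alpha = f as in (8); X holds p = (n+1)(n+2)/2 samples."""
    M = np.array([quadratic_basis(x) for x in X])
    alpha = np.linalg.solve(M, fvals)
    return lambda x: quadratic_basis(x) @ alpha

# Example: a quadratic is interpolated exactly with p = 6 points in n = 2.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))
fvals = X[:, 0]**2 + X[:, 0] * X[:, 1]
m_k = fit_interpolation_model(X, fvals)
print(np.isclose(m_k(np.array([0.3, -0.2])), 0.3**2 + 0.3 * -0.2))  # True
```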

Limitations of derivative-free trust region methods
Fully-determined polynomial interpolation of degree r in n dimensions requires p = C(n + r, r) function evaluations, where C(·, ·) denotes the binomial coefficient. DFTR methods cannot make progress until p points have been sampled, as there is no surrogate model m_k for the subproblem (3) before this point. In the case of quadratic interpolation, this leads to a requirement of p = (1/2)(n + 1)(n + 2) = O(n^2) sample points. When n is small, this requirement may be easily met; however, as n increases, it may become prohibitive, particularly in the case of functions which are expensive to evaluate. Many DFTR methods account for this computational burden by avoiding the use of fully-determined quadratic models. For instance, the optimization algorithm COBYLA (Powell 1994) uses linear models, requiring only n + 1 samples. Although this greatly reduces the effect of increased dimensionality, linear models generally do not capture the curvature of the true function, so convergence may be slow (Wendorff, Botero, and Alonso 2016). Other algorithms reduce the required number of samples by constructing under-determined quadratic interpolation models. For example, the bound-constrained optimization algorithm BOBYQA (Powell 2009) requires more samples than are necessary for linear models, but fewer than fully-determined quadratic models; typically 2n + 1 samples are used. Use of under-determined quadratic models has proven to be quite effective for many optimization algorithms (Powell 2009; Zhang, Conn, and Scheinberg 2010; Cartis et al. 2019). Nevertheless, this initial requirement may limit the efficacy of these approaches when the objective is both high-dimensional and computationally expensive.
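For concreteness, the initial sample requirements quoted above can be tabulated directly:

```python
from math import comb

# Samples needed before a DFTR method can build its first model:
# linear (n+1), under-determined quadratic (2n+1, BOBYQA's typical choice),
# and fully-determined quadratic (C(n+2, 2)).
for n in (10, 50, 100):
    print(f"n={n:4d}  linear={n + 1:4d}  2n+1={2 * n + 1:4d}  "
          f"full quadratic={comb(n + 2, 2):5d}")
```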

Ridge function approximations
Ridge function approximations allow one to reduce the effective dimensionality of a function by determining a low-dimensional representation which is a function of a few linear combinations of the high-dimensional input. These approximations can be determined using a number of methods (Constantine 2015;Diez, Campana, and Stern 2015;Hokanson and Constantine 2017). This article will focus on two approaches: 1) derivative-free active subspaces, and 2) polynomial ridge approximation.

Derivative-free active subspaces
The active subspace of a given function f(x) has been defined by Constantine, Dow, and Wang (2014) as the d-dimensional subspace of the inputs x ∈ R^n, spanned by the columns of U ∈ R^{n×d}, which corresponds to the directions of strongest variability of f. To see how one may discover U, consider a bounded probability density function ρ(x) and let f be square-integrable with respect to ρ. The active subspace of this function can be found using the covariance matrix

C = ∫ ∇f(x) ∇f(x)^T ρ(x) dx.   (12)

In the derivative-free context, the approximate covariance matrix

Ĉ = ∫ ∇f̂(x) ∇f̂(x)^T ρ(x) dx,   (13)

where f̂ is a surrogate model of f, may be used as a surrogate for C. This matrix is symmetric and positive semidefinite, so its real eigendecomposition is given by

Ĉ = W Λ W^T,   (14)

where Λ = diag(λ_1, ..., λ_d, ..., λ_n) and λ_1 ≥ ··· ≥ λ_d ≥ ··· ≥ λ_n ≥ 0. Partitioning W and Λ as

W = [U  V],   Λ = diag(Λ_1, Λ_2)   (15)

results in the active subspace U ∈ R^{n×d} and the inactive subspace V ∈ R^{n×(n−d)}. The reduced coordinates y = U^T x and z = V^T x are known as the active and inactive variables, respectively. By construction, f shows greater variability along y than z. Moreover, the sums of the partitioned eigenvalues Λ_1 and Λ_2 quantify this variation. This motivates the well-known heuristic of choosing the reduced dimension d as the index with the greatest log decay of the eigenvalues (Constantine 2015).
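In practice, Ĉ is usually estimated by Monte Carlo integration. The following sketch estimates (13) from surrogate gradient samples and extracts the partition (15); the exact-ridge test function and uniform sampling density are illustrative choices.

```python
import numpy as np

def estimate_active_subspace(grad_fhat, sample_inputs, d):
    """Monte Carlo estimate of the covariance matrix (13) and its partition (15).

    grad_fhat: callable returning the surrogate gradient at a point.
    sample_inputs: (M, n) array of points drawn from the density rho.
    """
    grads = np.array([grad_fhat(x) for x in sample_inputs])   # (M, n)
    C_hat = grads.T @ grads / len(grads)                      # approximates (13)
    eigvals, W = np.linalg.eigh(C_hat)                        # ascending order
    eigvals, W = eigvals[::-1], W[:, ::-1]                    # sort descending
    U, V = W[:, :d], W[:, d:]          # active and inactive subspaces
    return U, V, eigvals

# Example with a known ridge direction: f(x) = (a^T x)^2, grad f = 2 (a^T x) a.
rng = np.random.default_rng(2)
a = np.array([1.0, 2.0, 0.0, -1.0])
X = rng.uniform(-1, 1, size=(2000, 4))
U, V, eigvals = estimate_active_subspace(lambda x: 2 * (a @ x) * a, X, d=1)
print(eigvals)  # one dominant eigenvalue; U spans the direction of a
```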

Polynomial ridge approximation
Ridge function recovery allows one to find the subspace U and the coefficients of the ridge function surrogate model simultaneously. Hokanson and Constantine (2017) developed a method for doing this in the case where m is a polynomial of dimension d and degree r. Their approach uses variable projection to solve the minimization problem

min_{m ∈ P_r(R^d), U ∈ G(d, R^n)}  Σ_{i=1}^{M} [f(x_i) − m(U^T x_i)]^2,   (16)

where P_r(R^d) denotes the set of polynomials on R^d of degree r, G(d, R^n) denotes the Grassmann manifold of d-dimensional subspaces of R^n, and {x_i} is a set of M samples.
Defining the matrix M(U) with entries [M(U)]_{ij} = φ_j(U^T x_i) (as seen in (8)) for i = 1, ..., M allows one to formulate (16) as a nonlinear least squares problem in terms of the coefficients α,

min_{U, α} ‖f − M(U) α‖^2,   (17)

with f ∈ R^M such that f_i = f(x_i) and q = C(d + r, r). Using the fact that α may be easily recovered using the Moore-Penrose pseudoinverse, α = M(U)^† f, one may write (17) as the Grassmann manifold optimization problem

min_{U ∈ G(d, R^n)} ‖(I − M(U) M(U)^†) f‖^2   (18)

over U alone. To solve this problem, Hokanson and Constantine (2017) developed a novel Grassmann Gauss-Newton method for iteratively solving (18).
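The key step, eliminating α through the pseudoinverse so that the residual depends on U alone, can be illustrated in a few lines of Python. The basis construction and test function below are illustrative; the Grassmann Gauss-Newton solver itself is described by Hokanson and Constantine (2017).

```python
import itertools
import numpy as np

def ridge_basis(Y, r):
    """Vandermonde-type matrix M(U) of d-dimensional monomials up to degree r,
    evaluated at the projected samples (rows of Y = X U)."""
    M, d = Y.shape
    cols = [np.ones(M)]
    for deg in range(1, r + 1):
        for idx in itertools.combinations_with_replacement(range(d), deg):
            cols.append(np.prod(Y[:, idx], axis=1))
    return np.column_stack(cols)

def vp_residual(U, X, f, r=2):
    """Variable projection residual ||(I - M(U) M(U)^+) f|| from (18)."""
    MU = ridge_basis(X @ U, r)
    alpha = np.linalg.pinv(MU) @ f    # best coefficients for this fixed U
    return np.linalg.norm(f - MU @ alpha), alpha

# Example: data from an exact 1-dimensional quadratic ridge.
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 5))
u_true = np.array([[3.0], [4.0], [0.0], [0.0], [0.0]]) / 5.0
f = (X @ u_true).ravel() ** 2
res_true, _ = vp_residual(u_true, X, f)
res_wrong, _ = vp_residual(np.eye(5)[:, [4]], X, f)
print(res_true, res_wrong)  # ~0 at the true subspace, large otherwise
```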

Limitations of using a global subspace
Ridge function approximations have been extensively used for a variety of purposes, such as reduced order modelling, integration (Glaws et al. 2017), sensitivity studies, and optimization (Lukaczyk et al. 2014; Gross, Seshadri, and Parks 2020). However, most studies assume that the function of interest varies along a global subspace U. Unless the function is an exact ridge function, this assumption may lead to a significant amount of information loss when projecting onto the subspace. If one instead considers local subspaces {U_k}, with each U_k corresponding to a small region of interest in the function domain, this information loss can be reduced.
As an example, consider the 10-dimensional Styblinski-Tang function

f(x) = (1/2) Σ_{i=1}^{10} (x_i^4 − 16 x_i^2 + 5 x_i),   (19)

defined over the domain x ∈ [−100, 100]^10. In this example, 7 local regions have been defined by 10-dimensional hypercubes of radius 0.1 with centroids randomly scattered throughout the domain. Global and local subspaces of this function have been approximated using a well-known Monte Carlo sampling technique (Constantine 2015) with 100,000 gradient samples taken at uniformly distributed random locations within their respective domains of interest. The eigenvalues of the global and local subspaces are shown in Figure 1(a). From this figure, it is apparent that this function has no discernible global low-dimensional structure, as there is no large decay of the log of the eigenvalues for the global subspace. However, one can see a significantly stronger log decay in the eigenvalues from the first index for each of the local subspaces. This suggests that, for each of these subspaces, the direction defined by the first eigenvector captures a significant amount of the variation of f in its local region of interest. Moreover, from inspection of Figure 1(b), it is clear that these directions vary significantly, as the parameter weight values fluctuate widely for each region of interest. This observation motivates the use of local subspaces which are updated as the region of interest moves through the domain.
Figure 1. (a) Eigenvalue decay; (b) parameter weights from the first eigenvector.

Even when a global subspace captures most of the variability of f, there may still be benefits to periodically updating the subspace. To see this, consider a 2-dimensional function which combines a dominant trend along the direction U^T x, with U = [1, 1]^T, and a small additional term 0.01(x_1 − x_2)^2. Although this function varies in both dimensions x_1 and x_2, its most significant variation is in the direction U^T x. Using this as a global subspace, a very accurate 1-dimensional surrogate model may be constructed. Although this surrogate model may be very accurate, there are regions of interest where there is non-negligible variation in the second direction. This variation manifests as 'noise' when projected onto the 1-dimensional domain. In general, this noise may not be an issue if one is simply interested in reduced order modelling and the function strongly corresponds to a low-dimensional representation. However, this noise severely limits the use of ridge function approximations in model-based optimization, particularly for DFTR methods. This is because, as the region of interest shrinks, this noise begins to dominate until the error associated with it is of the same order of magnitude as the variation in the function. At this point, the optimizer may struggle to progress as it stalls in a suboptimal region. Figure 2 provides a demonstration of this phenomenon. In the small region of interest (shown as a black box in Figure 2(a)), the noise associated with the 0.01(x_1 − x_2)^2 term dominates. When using the subspace U in this region, the optimizer is likely to stall, as the noise is now of larger order of magnitude than the variation in the ridge function approximation m(U^T x). Alternatively, switching to the subspace V = [1, −1]^T and constructing a model m(V^T x) may allow the optimization algorithm to progress once again, as the function variation is now of larger magnitude than the noise.

OMoRF algorithm
The OMoRF algorithm is detailed in Algorithm 1. At each iteration, a local subspace U_k is determined and a quadratic ridge function m_k(U_k^T x) is constructed. To ensure the accuracy of this model, two separate interpolation sets are maintained: the set X_k is used to calculate the local subspace U_k, while X̂_k is used to determine the coefficients of m_k. Next, the trust region subproblem is solved to obtain a candidate solution x_k + s_k. Just as for standard trust region methods, the ratio r_k (5) is used to determine whether or not this candidate solution is accepted and whether the trust region radius is decreased. Before decreasing the trust region radius, checks on the quality of X_k and X̂_k are performed and, if necessary, their geometry is improved by calculating new sample points.

Remark 2. Just as in BOBYQA (Powell 2009), two trust region radii ∆_k and ρ_k are maintained, where ∆_k determines the size of the trust region and ρ_k is used as a lower bound when decreasing ∆_k.

Algorithm 1 OMoRF (excerpt)
...
14: Accept/reject step and update trust region radius.
15: Append x_k + s_k to X̂_k and X_k.
...
21: end if
22: end for

Remark 3. Lines 8-11 correspond to a 'safety' step (as is done by Powell (2009) and Cartis et al. (2019)). During this safety step, a check is performed on the step s_k to ensure it is sufficiently large before evaluating the candidate solution x_k + s_k.
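To convey how the pieces of Algorithm 1 fit together, the following deliberately simplified, runnable sketch implements a d = 1 analogue of the main loop: a fully linear model supplies the subspace, a 1-dimensional quadratic supplies m_k, and the subproblem is solved by enumeration. It is not the OMoRF implementation: the interpolation set management of Algorithms 2-3, the safety step, and the dual radii are omitted, and all parameter values are illustrative.

```python
import numpy as np

def mini_omorf(f, x0, delta=1.0, rho=1e-6, iters=50, eta1=0.1):
    n, x_k = len(x0), np.array(x0, dtype=float)
    for _ in range(iters):
        # n + 1 samples around x_k -> fully linear model -> 1-D subspace U_k
        X = x_k + delta * np.vstack([np.zeros(n), np.eye(n)])
        F = np.array([f(x) for x in X])
        A = np.column_stack([np.ones(n + 1), X])
        g = np.linalg.lstsq(A, F, rcond=None)[0][1:]   # inferred gradient
        U = g / (np.linalg.norm(g) + 1e-12)            # 1-D subspace
        # three samples along U -> quadratic ridge model m_k(U^T x)
        T = np.array([-delta, 0.0, delta])
        c = np.polyfit(T, [f(x_k + t * U) for t in T], 2)
        # trust region subproblem (3), solved by enumeration of step lengths
        t_grid = np.linspace(-delta, delta, 101)
        t_star = t_grid[np.argmin(np.polyval(c, t_grid))]
        s_k = t_star * U
        predicted = np.polyval(c, 0.0) - np.polyval(c, t_star)
        actual = f(x_k) - f(x_k + s_k)
        if predicted > 0 and actual / predicted >= eta1:
            x_k = x_k + s_k                    # successful: accept the step
        else:
            delta = max(0.5 * delta, rho)      # unsuccessful: shrink the region
    return x_k

# Converges to the minimizer (the all-ones vector) of a 10-D sphere function.
print(mini_omorf(lambda x: np.sum((x - 1.0) ** 2), np.zeros(10)))
```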

Interpolation set management
The accuracy of the ridge function models m_k depends on two sources of error: the information lost by projecting onto a subspace U_k and the response surface error of m_k. Ideally, a single interpolation set could be maintained and improved to reduce both sources of error. Unfortunately, such an approach would require a priori knowledge of the subspace U_k. Therefore, OMoRF maintains two separate interpolation sets: X_k, of n + 1 samples, for calculating U_k, and X̂_k, of (1/2)(d + 1)(d + 2) samples, for calculating the coefficients of m_k.
The set X_k is used to construct the subspace U_k using either derivative-free active subspaces or polynomial ridge approximation. In either case, the first step is to build a fully linear n-dimensional linear interpolation model (the importance of using a fully linear model is explored in Appendix A). From this model, a 1-dimensional subspace can be extracted using the approximate covariance matrix (13). If a greater subspace dimension is required, polynomial ridge approximation is employed, using X_k as the sample set for the variable projection algorithm discussed in Section 3.2. Note that the 1-dimensional subspace obtained from derivative-free active subspaces is still calculated in this case, as it is used to infer an initial starting point for the manifold optimization problem (18).
Once the subspace U_k is known, one may be tempted to use the projected points Y_k = {U_k^T x_i | x_i ∈ X_k} to calculate the coefficients of m_k. However, these projected samples generally do not sufficiently span the d-dimensional projected space, leading to poor surrogate models. Determining a more suitable set of (1/2)(d + 1)(d + 2) samples X̂_k is not only relatively cheap, but can dramatically improve the quality of the ridge function surrogate. To demonstrate this, Figure 3 provides an example of the 10-dimensional Styblinski-Tang function (19) projected onto a 2-dimensional subspace. In this example, 100,000 uniformly distributed random samples were used to calculate the R^2 values of both ridge function models. Although the set X_k is well suited for linear interpolation in 10 dimensions, the projected set Y_k clearly does not span the 2-dimensional space very well. In contrast, the projected set Ŷ_k = {U_k^T x̂_i | x̂_i ∈ X̂_k} spans this space much more effectively, which in turn gives a much more accurate ridge function model, as shown by the R^2 values in Figure 3.

Interpolation set updates
Algorithm 1 always incorporates new sample points as they become available by appending x_k + s_k to the interpolation sets X_k and X̂_k. When the iteration is successful, i.e. r_k ≥ η_1, a point is chosen from each set to be replaced before continuing the iteration. When the iteration is not successful, it is necessary to ensure the accuracy of the model before reducing the trust region radius ∆_k. In this case, not only is the previously calculated sample added, but another new geometry-improving point is calculated and subsequently added. Determining whether or not these sets need to be improved is done by checking the maximum distance of the samples to the current iterate. If this distance is too large, it signifies that the interpolation set has not been updated recently, so it may need improvement. The full details of this process are provided in Algorithm 2.
Algorithm 2 Interpolation set update for OMoRF (excerpt)
1: Let x_k be the current iterate, X_k be a set of at least n + 1 samples, X̂_k be a set of at least (1/2)(d + 1)(d + 2) samples, U_k be the current subspace, and both ∆_k and ρ_k be given.
2: Set values of algorithmic parameters 0 < α_1 < α_2 ...
...
7: Improve X̂_{k+1} and set X_{k+1} = X_k.
8: Calculate U_{k+1} using X_{k+1} and set ρ_{k+1} = ρ_k.
9: else
...

There are a few points to note about Algorithm 2. First, although other conditions may be used as a measure of the quality of an interpolation set, the maximum distance of the samples to the current iterate gives a quick and simple means of determining whether or not to improve it. Moreover, this approach has been successfully applied in other DFTR methods (Cartis et al. 2019). Second, to reduce the computational burden of each iteration, only a single geometry-improving sample point is calculated during Algorithm 2, and improvements to X̂_k are prioritized. This is because X̂_k has fewer samples than X_k, so X̂_k can be updated more rapidly as the trust region moves through the function domain, allowing for greater adaptability. Finally, a pivotal algorithm, modified from Algorithm 6.6 in Conn, Scheinberg, and Vicente (2009), is used both to choose the points to be replaced and to calculate new geometry-improving points. The details of this modified algorithm are given in Algorithm 3 in Appendix B.
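A minimal sketch of the distance-based staleness check described above is given below; the function name and the threshold factor α are illustrative assumptions, not the values used by OMoRF.

```python
import numpy as np

def needs_improvement(samples, x_k, delta_k, alpha=2.0):
    """Distance-based quality check used when a step is unsuccessful:
    if any sample lies further than alpha * delta_k from the current iterate,
    the set is considered stale and a geometry-improving point is computed."""
    dists = np.linalg.norm(np.asarray(samples) - x_k, axis=1)
    return dists.max() > alpha * delta_k

# X̂_k is checked (and improved) first, since its smaller size lets it track
# the moving trust region more rapidly than X_k.
```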

Choice of norm and extending for bound constraints
Unlike most trust region algorithms, the infinity norm ‖·‖_∞ is used in this implementation of OMoRF. That is, the trust region is defined as the hypercube

B(x_k, ∆_k) = {x ∈ R^n : ‖x − x_k‖_∞ ≤ ∆_k}.

Although the more common choice of the Euclidean norm ‖·‖_2 would also be suitable, this choice was made in order to simplify the extension of OMoRF to the bound-constrained optimization problem

min_{x ∈ R^n} f(x) subject to a ≤ x ≤ b.

To see how this choice simplifies matters, note that ‖x − x_k‖_∞ ≤ ∆_k is equivalent to

x_k − ∆_k 1 ≤ x ≤ x_k + ∆_k 1,

where 1 denotes the n-dimensional vector of ones. In the case of the infinity norm, the feasible region at iteration k is then simply the intersection of {x : x_k − ∆_k 1 ≤ x ≤ x_k + ∆_k 1} and {x : a ≤ x ≤ b}. Therefore, one may write this feasible region as

{x ∈ R^n : l_k ≤ x ≤ u_k}, where l_k = max(a, x_k − ∆_k 1) and u_k = min(b, x_k + ∆_k 1),

with the maximum and minimum taken elementwise. In the case of a Euclidean norm trust region, the feasible region is the intersection of the ball {x : ‖x − x_k‖_2 ≤ ∆_k} and the box {x : a ≤ x ≤ b}. The shape of this region does not lend itself to a simple formulation, so working with the Euclidean norm may be more cumbersome in the case of bound-constrained optimization problems. Note, some methods, such as BOBYQA (Powell 2009), handle the awkward shape of this feasible region by projecting the step obtained from a Euclidean trust region onto the hyperrectangle.
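In code, computing this feasible region reduces to elementwise maximum and minimum operations, as the sketch below shows (the function name is illustrative):

```python
import numpy as np

def feasible_box(x_k, delta_k, a, b):
    """Feasible region of the subproblem with an infinity-norm trust region:
    the intersection of {x : ||x - x_k||_inf <= delta_k} and {x : a <= x <= b}
    is itself a box [l_k, u_k], computed elementwise."""
    l_k = np.maximum(a, x_k - delta_k)
    u_k = np.minimum(b, x_k + delta_k)
    return l_k, u_k

# Example: the resulting box can be handed directly to any bound-constrained
# subproblem solver.
x_k = np.array([0.0, 0.9]); a = np.array([-1.0, -1.0]); b = np.array([1.0, 1.0])
print(feasible_box(x_k, 0.5, a, b))  # (array([-0.5, 0.4]), array([0.5, 1.0]))
```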

Testing methodology
Performance and data profiles (Moré and Wild 2009) have been used for comparing the algorithms on many of the following test problems. These profiles are defined in terms of three characteristics: the set of test problems P, the set of algorithms tested S, and a convergence test T. Given the convergence test, a problem p ∈ P and a solver s ∈ S, the number of function evaluations necessary to pass the convergence test T is used as the performance metric t_{p,s}. The convergence test used here is

f(x) ≤ f_L + τ (f(x_0) − f_L),

where τ > 0 is some tolerance, x_0 is the starting point, and f_L is the minimum attainable value of f for problem p within the given computational budget.
The performance profile is defined as

ρ_s(α) = (1/|P|) |{p ∈ P : t_{p,s} ≤ α min_{s̃ ∈ S} t_{p,s̃}}|.

In other words, ρ_s(α) is the proportion of problems in P for which solver s ∈ S attains a performance ratio of at most α. In particular, ρ_s(1) is the proportion of problems for which the solver performs best under the convergence criterion, and as α → ∞, ρ_s(α) represents the proportion of problems which can be solved within the computational budget. Data profiles are defined as

d_s(κ) = (1/|P|) |{p ∈ P : t_{p,s} ≤ κ (n_p + 1)}|,

where n_p is the dimension of problem p ∈ P. This represents the proportion of problems that can be solved, as measured by the convergence criterion, by solver s within κ(n_p + 1) function evaluations (or κ simplex gradients).
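Both profiles are straightforward to compute from the matrix of performance metrics t_{p,s}; a minimal sketch, assuming unsolved problems are marked with np.inf:

```python
import numpy as np

def performance_profile(t, alphas):
    """rho_s(alpha): fraction of problems p with t[p, s] <= alpha * best solver.

    t: (num_problems, num_solvers) array of evaluations needed to pass the
    convergence test (np.inf where a solver never passed it)."""
    best = t.min(axis=1, keepdims=True)                  # best solver per problem
    return np.array([(t <= a * best).mean(axis=0) for a in alphas])

def data_profile(t, n_p, kappas):
    """d_s(kappa): fraction of problems solved within kappa simplex gradients,
    i.e. within kappa * (n_p + 1) function evaluations."""
    budgets = (np.asarray(n_p) + 1)[:, None]             # (num_problems, 1)
    return np.array([(t <= k * budgets).mean(axis=0) for k in kappas])

# Example: 3 problems, 2 solvers; inf marks an unsolved problem.
t = np.array([[10.0, 20.0], [55.0, 44.0], [np.inf, 120.0]])
n_p = np.array([10, 10, 50])
print(performance_profile(t, alphas=[1.0, 2.0]))
print(data_profile(t, n_p, kappas=[1.0, 5.0, 10.0]))
```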

CUTEst problems
A subset of unconstrained and bound-constrained optimization problems from the CUTEst (Gould, Orban, and Toint 2015) test problem set was used to examine solver performance. From this set, two subsets were defined: 1) 70 problems of moderate dimension (10 ≤ n < 50), and 2) 61 problems of high dimension (50 ≤ n ≤ 100). A full list of these problems may be found in Appendix C. A computational budget of 10 simplex gradients (i.e. 10(n + 1) function evaluations) was specified, and these solvers were tested with a low accuracy requirement of τ = 10 −1 and a high accuracy requirement of τ = 10 −5 .

Moderate dimension problems
The data and performance profiles for the test set of problems of moderate dimension are shown in Figure 4. It is clear that all variants of OMoRF generally outperformed the other solvers when very few function evaluations were available. This is particularly true when fewer than 2 simplex gradients were available. In fact, both OMoRF (AS) and OMoRF (VP, d = 1) could solve over 70% of the problems to the convergence criterion for the low accuracy requirement of τ = 10 −1 within 2 simplex gradients.
This compares very favourably to BOBYQA (2n + 1), which could solve none of these problems within 2 simplex gradients. Even COBYLA and BOBYQA (n + 2) did not perform nearly as well in this low computational budget regime. Although these algorithms had lower initial cost, it is hypothesized that the surrogate models did not have as much adaptability as those of OMoRF, requiring significantly more samples before the quality of the interpolation set could be ensured. For the high accuracy requirement of τ = 10 −5 , the OMoRF solvers also generally outperformed the other solvers, with OMoRF (AS) and OMoRF (VP, d = 1) solving nearly twice as many problems as the other solvers within 2 simplex gradients. In addition, the performance profiles show that OMoRF generally had the most rapid convergence, with OMoRF (AS) and OMoRF (VP, d = 1) being the fastest solvers to reach the convergence criterion for the low accuracy requirement for over 60% of the problems; this was nearly triple the proportion of the next best solver. It is interesting to note the performance drop of OMoRF (VP, d = 2) when compared to the other two variants. This may be due to the larger comparable cost of updating the ridge function surrogate.
In particular, given a sufficiently accurate subspace, only 3 points need to be calculated (using Algorithm 2) to ensure the quality of the ridge function model when d = 1, but this number increases to 6 when d = 2. When the problem is only of moderate dimension, this increase can make a significant difference in performance.

High dimension problems
The data and performance profiles for the test set of problems of high dimension are shown in Figure 5. Just as for problems of moderate dimension, OMoRF generally had superior performance when very few function evaluations were available. Interestingly, as the problem dimension increased, the relative performance of OMoRF (VP, d = 2) also increased, with it solving nearly as many problems to the low accuracy convergence criterion as the other variants of OMoRF within 2 simplex gradients. This supports the earlier hypothesis that OMoRF (VP, d = 2) may be more suitable for high-dimensional problems than low or moderate-dimensional ones. Nevertheless, OMoRF (AS) was still generally the superior solver, with it being the fastest solver to achieve the low accuracy convergence criterion for over 70% of the test problems. Even for the high accuracy convergence criterion, OMoRF (AS) generally outperformed the other solvers, with it solving nearly 50% of the problems. In addition, it was the fastest solver to reach the high accuracy convergence criterion for over 30% of the problems. Although BOBYQA (2n + 1) had the greatest initial overhead, its performance improved for the high-dimensional problems, solving over 90% of the test problems by 10 simplex gradients to the low accuracy requirement. Moreover, it was often the fastest solver to reach the high accuracy convergence criterion, performing slightly better here than OMoRF (AS). Even so, the initial overhead in sample requirement may often be quite undesirable. In these cases, one may be tempted to use BOBYQA (n + 2) to achieve more rapid convergence. However, it is clear from inspection of Figure 5 that, although this solver performed well when seeking low accuracy solutions, its performance dropped significantly when looking for high accuracy solutions.

Aerodynamic design problem
To demonstrate the efficacy of OMoRF when optimizing high-dimensional functions which are computationally intensive, design optimization of the ONERA-M6 transonic wing, parameterized by 100 free-form deformation (FFD) design points, has been used as a test problem. The objective is to minimize inviscid drag subject to bound constraints on the FFD parameters. This problem has been adapted from an open source tutorial (Palacios and Kline 2017). Furthermore, it has been used for testing design optimization algorithms and approaches in multiple studies (Lukaczyk et al. 2014; Qiu et al. 2018). In this study, this problem has been formulated as

min_{x ∈ R^100} C_D(x) subject to a ≤ x ≤ b,

with x denoting the FFD parameters, C_D(x) the drag coefficient, and a and b the lower and upper bounds on the FFD parameters. The flight conditions are steady flight at a free-stream Mach number of 0.8395 and an angle of attack of 3.06°. The Euler solver provided by the open source computational fluid dynamics (CFD) simulation package SU2 (Palacios et al. 2013) was used to evaluate each design. A single CFD simulation required approximately 5 minutes on 8 CPU cores of a 3.7 GHz Ryzen 2700X desktop. Given this computational burden, a strict limit of 500 function evaluations was specified when optimizing this problem. It is noted that, although derivatives of this objective function may be obtained using algorithmic differentiation, this problem still provides a useful test case for high-dimensional, computationally intensive design optimization.

Figure 6 shows the convergence plot and Table 1 the final attained drag coefficients for this design optimization problem. Although BOBYQA (n + 2) shows very quick initial progress, achieving a drag coefficient of less than 3 × 10^−3 within 200 function evaluations, its progress afterwards stalls. In fact, OMoRF (VP, d = 2) outperforms it from 400 function evaluations onward, and from 450 function evaluations onward so do COBYLA and BOBYQA (2n + 1). This result is in agreement with the previous tests, where, although BOBYQA (n + 2) was able to quickly achieve low accuracy solutions, it struggled to achieve high accuracy ones. Not only does OMoRF (VP, d = 2) show very rapid progress, it ultimately outperforms all of the other solvers by achieving the smallest drag coefficient within the computational budget. Interestingly, although OMoRF (AS) and OMoRF (VP, d = 1) generally performed well on the previous test problems, their performance was significantly worse for this problem. It is hypothesized that, in this case, the underlying low-dimensional structure is best described using more than one dimension. This follows from what was found in previous studies (Lukaczyk et al. 2014; Qiu et al. 2018). In particular, Lukaczyk et al. (2014) discovered that the drag coefficient response of the ONERA-M6 wing was best described using at least a 2-dimensional subspace. Moreover, as discussed for the previous tests, the cost of constructing 2-dimensional ridge function surrogates is considerably less noticeable for high-dimensional problems, making OMoRF (VP, d = 2) generally a more competitive algorithm as the problem dimension increases. In this case, the benefits of capturing the underlying 2-dimensional behaviour clearly outweighed the disadvantage of requiring more sample points to update the ridge function surrogate models.

Conclusion
A novel DFTR method, which leverages output-based dimension reduction in a trust region framework, is presented. This approach is based upon the idea that, by reducing their effective dimension, functions of moderate to high dimension may be modelled using fewer samples. Using these reduced dimension surrogate models for model-based optimization may then lead to accelerated convergence. Although many functions cannot be modelled to sufficient accuracy by globally defined ridge functions, the use of local subspaces allows for greater flexibility, while also maintaining the computational benefits of dimension reduction. The efficacy of this algorithm was demonstrated on a number of test problems, including high-dimensional aerodynamic design optimization. Future work will focus on providing theoretical statements on the convergence properties of this algorithm, extending this method to the case of general nonlinear constraints, and applying this approach to other design optimization problems.

Appendices
Appendix A. Using fully linear models for derivative-free active subspaces

The efficacy of the derivative-free active subspaces approach is clearly dependent on the accuracy of the inferred gradients. To provide a theoretical guarantee of this statement, assume that

‖∇f(x) − ∇f̂(x)‖ ≤ ω_h for all x ∈ B,   (A1)

for some domain B, with ω_h independent of x and ω_h → 0 as h → 0, where f̂ is a surrogate model for f and h is some controllable parameter. Note, if f̂ is fully linear, then by definition ‖∇f(x) − ∇f̂(x)‖ ≤ κ_2 ∆_k, implying that fully linear models inherently satisfy this assumption with h = ∆_k as ∆_k → 0, i.e. as the trust region radius shrinks. Given this assumption, the following lemma (modified from Lemma 3.11 in Constantine (2015)) provides an error bound between the approximate covariance matrix Ĉ (13) and the true covariance matrix C (12).
Lemma A.1. Assume f is Lipschitz continuous with constant γ_f, so that ‖∇f(x)‖ ≤ γ_f for all x ∈ B. The norm of the difference between C and Ĉ is bounded by

‖C − Ĉ‖ ≤ ω_h (2γ_f + ω_h).   (A2)

Proof. Let ∇f denote ∇f(x) and ∇f̂ denote ∇f̂(x). First, observe that

∇f ∇f^T − ∇f̂ ∇f̂^T = ∇f (∇f − ∇f̂)^T + (∇f − ∇f̂) ∇f̂^T,

so that, using (A1) and ‖∇f̂‖ ≤ ‖∇f‖ + ‖∇f − ∇f̂‖ ≤ γ_f + ω_h,

‖∇f ∇f^T − ∇f̂ ∇f̂^T‖ ≤ ‖∇f‖ ‖∇f − ∇f̂‖ + ‖∇f − ∇f̂‖ ‖∇f̂‖ ≤ γ_f ω_h + ω_h (γ_f + ω_h) = ω_h (2γ_f + ω_h).

Finally,

‖C − Ĉ‖ = ‖ ∫ (∇f ∇f^T − ∇f̂ ∇f̂^T) ρ(x) dx ‖ ≤ ∫ ‖∇f ∇f^T − ∇f̂ ∇f̂^T‖ ρ(x) dx ≤ ω_h (2γ_f + ω_h).

Lemma A.1 demonstrates the importance of choosing a surrogate model f̂ which provides accurate inferred gradients when using derivative-free active subspaces. Moreover, if f̂ is fully linear, then the error bound (A2) is guaranteed, with ω_h = κ_2 ∆_k.

Appendix B. Modified pivotal algorithm for updating interpolation sets
In Algorithm 2, a mechanism for choosing samples to be replaced and for improving the geometry of interpolation sets was alluded to. This mechanism is presented in Algorithm 3 for clarity. This algorithm has been adapted from Algorithm 5.1 in Conn, Scheinberg, and Vicente (2008). It applies Gaussian elimination with row pivoting to M̂ to place an upper bound on ‖M̂^{−1}‖, in turn bounding the condition number of M̂. The pivot polynomial basis µ_i for i = 0, ..., q − 1 is related to the rows of the upper triangular matrix U in the LU factorization of M̂. Rows which have pivot values µ_i(x_i) of small absolute value make large contributions to ‖M̂^{−1}‖. Samples in S_k correspond to rows of M̂, so by replacing these samples with ones which have higher pivot values, the geometry of S_k can be improved. The aim of the search (B2) is to find and prioritize samples which have high pivot values over those which have low pivot values. The term |µ_i(x_j)| is divided by max(‖x_j − x_k‖^4 / ∆_k^4, 1) so that preference is given to points which are within the trust region. This idea has been borrowed from Powell (2009) and seems to work very well in practice. If the geometry needs to be improved, the optimization problem (B1) is solved, returning the point in the trust region of maximal absolute pivot value.
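The following stripped-down sketch illustrates the pivot-based selection idea; it omits the trust-region weighting of the search (B2) and the geometry-improving subproblem (B1), and the function name and threshold are illustrative assumptions.

```python
import numpy as np

def pivotal_select(candidates, basis, x_k, threshold=1e-4):
    """Simplified pivot-based point selection (cf. Algorithm 3): a Gaussian
    elimination over the pivot polynomials, choosing at each step the point
    with the largest absolute pivot value. Returns the selected points; a
    full implementation would also propose new points when all remaining
    pivot values fall below the threshold."""
    q = len(basis)
    # x_k is placed first; it is always selected at step 0 (constant pivot).
    pts = [np.asarray(x_k)] + [np.asarray(c) for c in candidates]
    mu = [np.eye(q)[i] for i in range(q)]   # pivot polys as coefficient vectors

    def eval_mu(coeff, x):
        return sum(c * phi(x) for c, phi in zip(coeff, basis))

    chosen = []
    for i in range(q):
        vals = [abs(eval_mu(mu[i], p)) for p in pts]  # candidate pivot values
        t = int(np.argmax(vals))
        if vals[t] < threshold:
            break                          # geometry cannot be fixed from here
        x_t = pts.pop(t)                   # find and remove x_t from S_k
        chosen.append(x_t)
        mu[i] = mu[i] / eval_mu(mu[i], x_t)           # normalize the pivot
        for j in range(i + 1, q):                     # eliminate from later pivots
            mu[j] = mu[j] - eval_mu(mu[j], x_t) * mu[i]
    return chosen

# Example: linear basis in 2-D; well-spread points are picked first.
basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[1]]
cands = [np.array([1.0, 0.0]), np.array([1.0, 0.1]), np.array([0.0, 1.0])]
print(pivotal_select(cands, basis, x_k=np.array([0.0, 0.0])))
```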
There are a few points to note about Algorithm 3. First, when choosing a point to be replaced, lines 5-6 in Algorithm 3 are omitted. Second, when performing geometry-improving steps, the conditional in line 5 is not activated until the last iteration (i.e. i = q − 1). This means that, at each iteration, only a single new geometry-improving sample point will be calculated, leading to incremental improvements in the quality of the interpolation set. Next, although the points which improve X̂_k differ from those which improve X_k, Algorithm 3 may be applied to both of these sets by a suitable choice of dimension m, degree r, and polynomial basis φ(x). In the case of X_k, the dimension n, degree 1, and the natural polynomial basis are used, whereas in the case of X̂_k, the dimension d, degree 2, and φ(U_k^T x), i.e. the natural polynomial basis defined on the subspace U_k, are used. Finally, although this algorithm can be applied to any set S_k, it has been found to be most effective when applied to the shifted and scaled set {(x_i − x_k)/∆_k : x_i ∈ S_k}.

Algorithm 3 Modified pivotal algorithm for m-dimensional polynomial interpolation of degree r (excerpt)
1: Let S_k = {x_k, x_2, ..., x_q, ...} be a set of at least q = C(m + r, r) samples.
2: Initialize the pivot polynomial basis with µ_0(x) = φ_0(x), where φ(x) = {φ_0(x), ..., φ_{q−1}(x)} is an m-dimensional polynomial basis of maximum degree r.
3: Set S_{k+1} = {x_k} and remove x_k from S_k.
4: for i = 1, ..., q − 1 do
5: if improving geometry then
...
Find and remove x_t from S_k.
...
11: Update the pivot polynomial basis: normalize µ_i(x) by µ_i(x_t) and update µ_j(x) for j = i + 1, ..., q − 1.
12: end for
13: return S_{k+1}

Appendix C. Test problems

Table C1. Details of test problems of dimension 10 ≤ n < 50 taken from the CUTEst (Gould, Orban, and Toint 2015) test set, including the problem dimension n, the initial value f(x_0) and the minimum attained value found by any of the available solvers f_L. Problems marked with a * have bound constraints.

Table C2. Details of test problems of dimension 50 ≤ n ≤ 100 taken from the CUTEst (Gould, Orban, and Toint 2015) test set, including the problem dimension n, the initial value f(x_0) and the minimum attained value found by any of the available solvers f_L. Problems marked with a * have bound constraints.