A Non-Secant Quasi-Newton Method for Unconstrained Nonlinear Optimization

Abstract The Secant equation has long been the foundation of quasi-Newton methods, as updated Hessian approximations satisfy the equation at each iteration. Several publications have lately focused on modified versions of the Secant relation, with promising results. This study builds on that idea by deriving a Secant-like modification that uses nonlinear quantities to construct updates of the Hessian (or its inverse) approximation. The method uses data from the two most recent iterations to provide an alternative to the Secant equation, with the goal of producing improved Hessian approximations that induce faster convergence to the objective function's optimal solution. The reported results provide strong evidence that the proposed method is promising and warrants further investigation.


Introduction
The techniques examined in this paper are those employed in the solution of problems of the form $\min f(x)$, where $f: \mathbb{R}^n \to \mathbb{R}$. Such problems are common in all types of engineering and manufacturing projects. The above problem is solved iteratively using quasi-Newton methods. For the current iteration $i$, the gradient of $f$ at $x_i$ is denoted $g_i$, and the matrix $B_i$ approximates $G(x_i)$, the actual Hessian of $f$. Traditionally, the next iteration's Hessian matrix approximation must satisfy the so-called Secant equation

PUBLIC INTEREST STATEMENT
Unconstrained optimization problems are common in industrial and engineering applications. Much research has been dedicated to devising efficient methods to solve such problems. Secant quasi-Newton methods constitute the basis for such methods due to their attractive properties. This paper derives a new nonlinear variant of the classical Secant methods that efficiently computes solutions to unconstrained optimization problems.

$B_{i+1} s_i = y_i. \quad (1)$

The new solution approximation is defined as

$x_{i+1} = x_i + \alpha_i p_i, \quad (2)$

where $p_i$ is the direction vector, obtained through solving

$B_i p_i = -g_i. \quad (3)$

The step length $\alpha_i$ in (2) is computed by means of some line search methodology so that the following Wolfe-Powell conditions are satisfied (Fletcher (1987)):

$f(x_i + \alpha_i p_i) \le f(x_i) + c_1 \alpha_i\, p_i^T g_i, \quad 0 < c_1 < 1, \quad (4)$

and

$s_i^T g_{i+1} \ge 0.9\, p_i^T g_i. \quad (5)$

Satisfying conditions of the type in (4) and (5) is necessary to guarantee convergence.
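As a concrete illustration, the line-search tests of type (4) and (5) can be sketched as follows. This is a minimal sketch: the sufficient-decrease constant `c1 = 1e-4` is the conventional choice (the paper's own constant is not specified), while `c2 = 0.9` is taken from condition (5); the quadratic test function is purely illustrative.

```python
import numpy as np

def wolfe_conditions(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the sufficient-decrease and curvature conditions for a
    step length alpha along the descent direction p."""
    g = grad(x)
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * (g @ p)
    curvature = grad(x_new) @ p >= c2 * (g @ p)
    return bool(sufficient_decrease and curvature)

# Example on the quadratic f(x) = 0.5 ||x||^2, where the full step
# alpha = 1 along p = -g reaches the exact minimizer.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, -2.0])
p = -grad(x)
print(wolfe_conditions(f, grad, x, p, alpha=1.0))   # full step accepted
print(wolfe_conditions(f, grad, x, p, alpha=1e-6))  # step too small: curvature fails
```

A step that is too short satisfies sufficient decrease but violates the curvature condition, which is precisely why the second test above rejects it.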
The derivation and coding of the actual Hessian matrix is susceptible to human error and demands considerable storage for large problems. The identity matrix (or some weighted version of it) is usually chosen as the initial approximation to the Hessian, and it is updated step-by-step using the most recent available computed data. It has been demonstrated that such methods exhibit superlinear convergence under reasonable assumptions on the objective functions (Fletcher, 1970, 1994; Dennis & Schnabel, 1979). When the line searches are exact, the methods converge in at most $n$ iterations on quadratic functions (Broyden, 1970).
The most numerically successful update formula used to approximate the Hessian on a stepwise basis is the BFGS update, given as

$B_{i+1} = B_i - \dfrac{B_i s_i s_i^T B_i}{s_i^T B_i s_i} + \dfrac{y_i y_i^T}{y_i^T s_i}.$

The global convergence of the BFGS method for convex objective functions has been established by several authors (see Dai et al., 2002; Dennis & Schnabel, 1979; Fletcher, 1987; Shanno, 1978; Shanno & Phua, 1978; Xiao et al., 2006; Yuan et al., 2017, 2018). Dai et al. (2002) demonstrate that when the line searches are inexact, the standard BFGS method may fail to converge on non-convex functions. Wei et al. (2004, 2006) show that even with exact line searches, the classical BFGS method can fail to converge. Li and Fukushima (2001) modified the classical BFGS formula to create a variant of the method that exhibits global convergence even when the objective function $f$ lacks convexity.
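One BFGS update step can be written in a few lines. The sketch below guards on the curvature product $s^T y$, which must be positive for the update to be well defined; the quadratic example is illustrative.

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B, given the step
    s = x_{i+1} - x_i and gradient difference y = g_{i+1} - g_i."""
    if s @ y <= 1e-12:            # curvature condition fails: skip update
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# On a quadratic f(x) = 0.5 x^T A x we have y = A s, and the updated
# matrix satisfies the Secant equation (1): B_new s = y.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([1.0, -1.0])
y = A @ s
B_new = bfgs_update(np.eye(2), s, y)
print(np.allclose(B_new @ s, y))  # True: Secant equation holds
```

By construction the rank-two correction cancels $B s$ and injects $y$, so the Secant equation holds exactly after every update, regardless of the previous approximation.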

Variants of the Secant equation
Much research has been conducted in order to develop methods that outperform the classical BFGS update numerically and are globally convergent under reasonable assumptions. In this section, we provide a brief overview of some of the successful methods that have outperformed the classical BFGS. The algorithms' development is primarily based on the introduction of modifications to the Secant equation (1). One particularly successful class of approaches documented in the literature is motivated by the concept of using more of the data available at each iteration in the update of the Hessian approximation, data that would otherwise be disregarded. These data include, though are not limited to, the readily computed step vectors as well as the gradient difference vectors in (1) from the $m$ latest iterations ($m > 1$). The standard quasi-Newton methods utilize data from just the most recent iteration. We next present a particularly successful class of methods known as the multi-step methods (Ford & Moghrabi, 1993, 1994, 1996).
Applying the Chain Rule to $g(x(\tau))$, for a differentiable path $X = \{x(\tau)\}$ in $\mathbb{R}^n$ (for $\tau \in \mathbb{R}$), and differentiating with respect to $\tau$, the result is

$G(x(\tau))\, x'(\tau) = g'(x(\tau)). \quad (6)$

In particular, if $\tau_m$ is chosen so that the path passes through the most recent point $x_{i+1}$ (so that $x(\tau_m) = x_{i+1}$, say), then equation (6), known as the "Newton equation" (see Al-Baali, 1985; Broyden, 1970; Dennis & Schnabel, 1979), presents an alternative relation that the Hessian matrix $G(x_{i+1})$ generally satisfies. Equation (6) may be regarded as a basis for the Secant equation when the path is taken to be the straight line joining the two most recent iterates (Broyden, 1970). In Ford and Moghrabi (1994), the path $X$ is constructed as the polynomial interpolating the $m+1$ most recent points $\{x_{i-m+k+1}\}_{k=0}^{m}$. The vector $g'(x(\tau_m))$ is approximated by differentiating the polynomial ($\hat{g}(\tau)$, say) that interpolates the corresponding available gradient values $\{g(x_{i-m+k+1})\}_{k=0}^{m}$.
The scalar values $\{\tau_k\}_{k=0}^{m}$ correspond to the points $\{x_{i-m+k+1}\}_{k=0}^{m}$ on the curve $X = \{x(\tau)\}$. If $B_{i+1}$ is a given approximation to $G(x_{i+1})$ in (6) and we make the definitions in (7) and (8) (where the vectors $s_i$ and $y_i$ are as defined in (1), and $w_i$ represents an approximation to $g'(x(\tau_m))$), it is not unreasonable (by (6)) to force the matrix $B_{i+1}$ to satisfy a relation similar to the one in (6). We thus have

$B_{i+1} r_i = w_i. \quad (9)$

There remains the issue of choosing values for the $\tau$ parameters. One possible choice is (Ford & Moghrabi, 1993)

$\tau_0 = -\left(\|s_i\|_2 + \|s_{i-1}\|_2\right), \quad \tau_1 = -\|s_i\|_2, \quad \tau_2 = 0.$

This choice has the merit of reflecting the distances among the iterates in the space of the variables, and it is updated for each new point.
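The $\tau$-values above depend only on the norms of the two most recent step vectors, so they are cheap to recompute at every iteration. A minimal sketch (the step vectors used in the example are illustrative):

```python
import numpy as np

def tau_values(s_curr, s_prev):
    """Choose {tau_0, tau_1, tau_2} from the two latest step vectors so
    that the parameterization reflects distances between the iterates
    (the choice of Ford & Moghrabi, 1993)."""
    tau2 = 0.0
    tau1 = -np.linalg.norm(s_curr)
    tau0 = -(np.linalg.norm(s_curr) + np.linalg.norm(s_prev))
    return tau0, tau1, tau2

s_i = np.array([3.0, 4.0])      # ||s_i||_2 = 5
s_im1 = np.array([0.0, 2.0])    # ||s_{i-1}||_2 = 2
print(tau_values(s_i, s_im1))   # (-7.0, -5.0, 0.0)
```

Note that $\tau_1 - \tau_0 = \|s_{i-1}\|_2$ and $\tau_2 - \tau_1 = \|s_i\|_2$: consecutive parameter gaps equal the consecutive step lengths, which is exactly the "distance-reflecting" property mentioned above.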
The multi-step BFGS Hessian update is given by

$B_{i+1} = B_i - \dfrac{B_i r_i r_i^T B_i}{r_i^T B_i r_i} + \dfrac{w_i w_i^T}{w_i^T r_i}. \quad (10)$

The main advantage of (10) is that it exploits several recent steps, as well as the available gradient vectors, in contrast to the standard BFGS method, which utilizes only the latest step and the corresponding gradient difference vector.
Other non-Secant methods are motivated by the fact that the classical BFGS formula is confined to using single gradient and step vectors in the construction of the Hessian approximation, while neglecting other potentially useful, readily computed data. For example, function values might prove valuable in the updating process. So, focusing on building better 'quality' updates, variants of the classical Secant relation (1) have been explored (see Ford & Moghrabi, 1996; Wei et al., 2004; Ortiz et al., 2004; Yuan et al., 2017, 2018; Zhang et al., 1999). Wei et al. (2006), using Taylor's series, derived a modification to (1) in which the vector $y_i$ is replaced by an augmented vector that incorporates the latest function values; other authors have used alternatives of a similar form. Some of the recent and well-known Secant-like methods motivated by the need to derive accelerated-convergence variants of the classical BFGS are summarized in Table 1.
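The function-value idea can be made concrete with a small sketch. The correction below is one commonly cited Taylor-series-based choice in the spirit of Wei et al. (2006); it is presented as an illustration of the technique, not as a verbatim reproduction of the published formula. A useful sanity check is that the correction term vanishes on quadratics, recovering the classical Secant vector.

```python
import numpy as np

def modified_secant_y(y, s, f_curr, f_next, g_curr, g_next):
    """Augment y_i with function-value information (Wei-type correction).
    theta vanishes on quadratics, recovering the classical vector y_i."""
    theta = 2.0 * (f_curr - f_next) + (g_curr + g_next) @ s
    return y + (theta / (s @ s)) * s

# On a quadratic, theta = 0, so the modified vector equals y exactly.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x
x0, x1 = np.array([1.0, 1.0]), np.array([0.5, 0.25])
s, y = x1 - x0, g(x1) - g(x0)
y_mod = modified_secant_y(y, s, f(x0), f(x1), g(x0), g(x1))
print(np.allclose(y_mod, y))  # True on quadratics
```

For non-quadratic objectives the correction is generally nonzero, which is precisely where the extra function-value information can improve the curvature estimate.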
The methods listed in Table 1 represent some of the well-known published Secant-variant formulae in which, in addition to gradient and step vectors, function values obtained from the latest iterate are exploited in the update of the Hessian approximation. In contrast, the methods derived in Moghrabi (2017) and Ford and Moghrabi (1993) utilize nonlinear interpolating polynomials and introduce a completely different Secant-like equation utilizing available data from the two/three most recent steps. The methods of Moghrabi (2017) are based on the idea of performing multiple updates on each iteration so that multiple Secant and Secant-variant conditions are observed. The updates are performed in a manner that ensures the first update obeys (1), while subsequent updates satisfy

$B_{i+1} v_t = u_t,$

where $v_t = r_{i-m+t}$ and $u_t = w_{i-m+t}$, for $r$ and $w$ as in (7) and (8).
A sample of the work done on the derivation of modified Secant equations is summarized in Table 1. For interested readers, there are many other similar methods available in the literature (see, e.g., Xiao et al., 2006; Yuan et al., 2017, 2018; Zhang et al., 1999). Several of the approaches mentioned here have had their convergence qualities investigated. Yuan et al. (2018), for example, demonstrate global convergence using a less stringent variant of the Powell-Wolfe line search criteria (4) and (5).
Such methods may prove useful in various domains in which the classical BFGS has been a serious contender. One possible domain is sustainable transportation planning, for dealing with the issue of minimizing air pollution, congestion, and noise (Sayyadi & Awasthi, 2018). Another possible application could be in the bi-objective inventory model created by Gharaei et al. (2020). The goal there is to optimize the lot-sizing of the replenishment while meeting stochastic limitations and calculating the optimal number of lots and the volume of each lot. Gharaei et al. (2015) aim to optimize incentives in single-machine scheduling in reward-driven systems such that total reward is maximized while limitations such as the total reward limitation for earliness and learning, independence from earliness and learning, and others are met. The goal is to maximize overall rewards in the mentioned system by using quadratic rewards for both learning and earliness, where non-Secant methods may prove useful. Gharaei et al. (2019) present a mathematical model that combines the buyers' and vendors' total costs (TC) in a supply chain under penalty, green, and quality assurance (QC) rules, as well as a VMI-CS agreement. The goal is to find the optimal batch-sizing policy in the integrated SC with the lowest TC, one that determines both the number of the vendor's batches for each of the shipped product lines and the size of the batches shipped to the buyers, in order to minimize the integrated SC's TC while satisfying the stochastic barriers. Due to the complexity of the optimization model, there is scope for employing the methods proposed in this paper to solve the integrated supply chain problem. Another attractive application might be in areas where quality control and green production policies are considered when designing a bi-objective, multi-product, limited and integrated economic production quantity model. There are stochastic constraints in the model mentioned above.
Such research aims to find the best total inventory cost and total profit while keeping stochastic restrictions in mind (see Gharaei et al. (2021)). Awasthi and Omrani (2019) propose a goal-oriented strategy for prioritizing sustainable mobility projects based on fuzzy axiomatic design. The affinity diagram technique is used to establish sustainability evaluation attributes. The criteria and alternatives are rated using the fuzzy Delphi technique. Fuzzy axiomatic design provides final rankings for urban transportation initiatives, and the best project(s) are picked for implementation. Sensitivity analysis is used to examine the impact of changes in input parameters on the model's stability, and this is a possible venue for the method proposed in this paper to prove its usefulness.
Next, a new method is developed, motivated by the same concept of incorporating data available from several of the latest iterations to derive a new variant of the Secant equation (1).

A New Non-Secant equation
We will use data from the three most recent iterations to derive the new Secant-like relationship that the updated Hessian (or its inverse) must fulfil, just as with the nonlinear multi-step quasi-Newton methods (equations (5)-(10)) discussed earlier, thus choosing $m = 2$ in (6). In particular, an interpolation of the points $x_{i-1}$, $x_i$ and $x_{i+1}$ is used. The corresponding objective function is modelled as $\omega(\tau) \coloneqq f(x(\tau))$, using a Taylor expansion around the point $\tau_2 = 0$ (which corresponds to $x_{i+1}$), for the $\tau$-values chosen earlier.
Using (12), we obtain (13), where $x'(\tau) \coloneqq dx/d\tau$ and $x''(\tau) \coloneqq d^2x/d\tau^2$. Choosing the Lagrange representation for $x(\tau)$ (obtained by setting $m = 2$ in (7) and (8)), we obtain expressions (14) and (15) for the quantities $x'(0)$ and $x''(0)$. For the $\tau$-values selected earlier, we have $\gamma = \|s_i\| / \|s_{i-1}\|$. It is not unreasonable, then, to introduce a generalization of $\gamma$ obtained by plugging in a scaling constant $\mu \ge 0$ (see Moghrabi, 2017). Such scaling gives an easier, more convenient mechanism for switching to the standard one-step Secant update method by setting $\gamma = 0$. Therefore, $\delta = \mu \|s_i\| / \|s_{i-1}\|$. Should $\tau_0$ be used in (12), and should the true Hessian at $x_{i+1}$ be substituted by its approximation $B_{i+1}$, as is the case in the implementation of the standard quasi-Newton methods, it is reasonable to require relation (18) or, alternatively, (19), for the quantities $x'(0)$ and $x''(0)$ as in (14) and (15), respectively. Now, from (13) and (19), we obtain (20), where $\vartheta$ and $\delta$ are as defined in (14) and (15), respectively.
Upon defining the quantities $r_i \coloneqq s_i - \rho s_{i-1}$, $u_i \coloneqq s_i - \gamma s_{i-1}$ and $\vartheta \coloneqq (\tau_0 + \tau_1)/\tau_1$, from (19) we obtain (21). Equation (21) can be rewritten as (22), for $z_i \coloneqq \vartheta r_i$ and $w_i \coloneqq y_i - \rho y_{i-1}$. The equation obtained in (22) suggests a new Secant-like equation of the form

$B_{i+1} z_i = \bar{v}_i, \quad (23)$

for $z_i$ as in (21). The computed search direction is downhill if $B_{i+1}$ is positive definite. By analogy with the standard Secant equation (1), $B_{i+1}$ is positive definite if and only if $B_i$ is already so and $z_i^T \bar{v}_i > 0$. As this cannot be guaranteed in the new formulation, (23) is accompanied by the safeguard requirement (24) that $z_i^T \bar{v}_i$ be sufficiently positive, governed by some positive constant $\gamma$. Relation (24) is an appropriate safeguard for (23). The resulting algorithm (NMBFGS) proceeds as follows:

(1) Given $x_0$ and an initial inverse Hessian approximation $H_0$, set $i = 0$.

(2) Stop if the size of the gradient satisfies $\|g_i\| \le \varepsilon$ (convergence threshold).

(3) Compute the search direction $p_i = -H_i g_i$ and a step length $\alpha_i$ satisfying (4) and (5).

(4) Compute the new iterate $x_{i+1} = x_i + \alpha_i p_i$. If (24) is satisfied, update $H_i$ using (10) with $w_i$ replaced by $\bar{v}_i$ and $r_i$ replaced by $z_i$ in (23); else set $z_i = s_i$ and $\bar{v}_i = y_i$ in (23) and update so as to satisfy the Secant equation (1) (an $O(n^2)$ operation).

(5) Set $i \leftarrow i + 1$ and return to step (2).
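To make the loop concrete, the following is a runnable sketch of an NMBFGS-style iteration under stated simplifying assumptions: a plain backtracking (Armijo) line search stands in for the Wolfe-Powell search, the mixing weight `rho` is taken as $\mu\|s_i\|/\|s_{i-1}\|$ purely for illustration (the exact scalars defining $z_i$ and $\bar{v}_i$ depend on the $\tau$-parameterization), and the fallback to the classical pair $(s_i, y_i)$ mirrors the safeguard described above.

```python
import numpy as np

def inv_bfgs_update(H, s, y):
    """Inverse-Hessian BFGS update applied to a generic pair (s, y)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def nmbfgs(f, grad, x0, mu=1.0, eps=1e-8, max_iter=500):
    """Sketch of a two-step, non-Secant BFGS-type loop: the pair
    (z, v) mixes the two latest steps; if z^T v is not safely
    positive, we fall back to the classical pair (s, y)."""
    x, H = x0.astype(float), np.eye(len(x0))
    s_prev = y_prev = None
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        p = -H @ g
        alpha = 1.0                      # backtracking (Armijo) line search
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        z, v = s, y                      # classical pair by default
        if s_prev is not None:
            rho = mu * np.linalg.norm(s) / np.linalg.norm(s_prev)
            z_try, v_try = s - rho * s_prev, y - rho * y_prev
            # safeguard: accept the two-step pair only if curvature is safe
            if z_try @ v_try > 1e-10 * np.linalg.norm(z_try) * np.linalg.norm(v_try):
                z, v = z_try, v_try
        if z @ v > 0:
            H = inv_bfgs_update(H, z, v)
        x, s_prev, y_prev = x_new, s, y
    return x

# Minimize a convex quadratic; the minimizer is the origin.
A = np.diag([1.0, 10.0])
x_star = nmbfgs(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                np.array([5.0, -3.0]))
print(np.linalg.norm(x_star) < 1e-6)
```

On a quadratic the two-step pair satisfies $v = A z$ exactly, so the safeguard always passes and every update is a well-posed BFGS update; the safeguarded fallback only matters on general nonlinear objectives.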
The overall cost of the above method is asymptotically the same as that of the standard quasi-Newton algorithm, though in practice the recorded savings in both iteration counts and function/gradient evaluation counts differentiate the new method from others in the same category.

Convergence properties
The convergence analysis presented next rests on the following assumptions:

A1.
The objective function $f$ is twice continuously differentiable on $D$, and in an open set $M$ containing $D$ there exists a constant $\zeta > 0$ such that $\|g(x) - g(y)\| \le \zeta \|x - y\|$ for all $x, y \in M$.

A2.
Given that the sequence $\{f_i\}$ is decreasing, the sequence $\{x_i\}$ generated by the new method belongs to $D$, and there exists a constant $f^*$ such that $f_i \ge f^*$ for all $i$.

A3.
The objective function $f$ is uniformly convex; that is, there are positive constants $l_1$ and $l_2$ such that $l_1 \|p\|^2 \le p^T G(x)\, p \le l_2 \|p\|^2$ holds for all $x \in D$ and $p \in \mathbb{R}^n$.
Theorem 1. Let $f$ satisfy Assumptions 1 and 2, let $\{x_i\}$ be generated by algorithm NMBFGS, and suppose there exist constants $m_1$ and $m_2$ such that $\|B_i s_i\| \le m_1 \|s_i\|$ and $s_i^T B_i s_i \ge m_2 \|s_i\|^2$ for all $i$. Then

$\liminf_{i \to \infty} \|g_i\| = 0. \quad (28)$

Proof. Suppose (27) holds. By the Powell-Wolfe condition (5) and Assumption 2, a lower bound on the step lengths is obtained. Using (26), one then bounds the decrease in $f$ at each iteration, and, by the Powell-Wolfe condition (4), the accumulated decrease is finite; consequently, (28) follows, given (27).
We now proceed to prove that algorithm NMBFGS is globally convergent.
Theorem 2. Let $f$ satisfy Assumptions 1 and 2 above and let $\{x_i\}$ be generated by algorithm NMBFGS. Then (29) holds.

Proof. By Theorem 1, it suffices to prove that (29) holds for all $i$. By contradiction, assume this is not the case. Then there exists a positive constant $\sigma$ such that the contrary bound (30) holds for all $i$. It is then straightforward to establish (31) from (25). Thus, from (30) and (31), we obtain the required bound for $\rho \coloneqq \theta^2 \bar{\gamma}^2 \sigma / 2$. By Theorem 1, for the sequence $\{B_i\}$ there exist constants $m_1$ and $m_2$ such that (29) applies for all $i$. The proof is completed on the basis of the following result (see Theorem 2.1 in Dai et al. (2002)).
If there are two positive constants $l_1$ and $l_2$ such that the stated bounds hold for all $i$, then for any positive integer $j$, (28) holds for no fewer than $\lceil j/2 \rceil$ of the iterations $i \in \{1, \ldots, j\}$.

Non-asymptotic Superlinear Convergence of the New Method
The analysis presented here closely follows that of Jin and Mokhtari (2021). The convergence results presented by those authors apply primarily to the standard BFGS method. We follow a similar approach to prove the superlinear convergence of NMBFGS. To do so, the following assumptions are needed:

A3.
The objective function $f$ is twice continuously differentiable. It is also strongly convex with convexity parameter $\sigma > 0$, and its gradient $g(x)$ is Lipschitz continuous with parameter $\rho > 0$. Thus, we have $\sigma I \preceq G(x) \preceq \rho I$ for all $x \in \mathbb{R}^n$.

A4. The Hessian matrix $G(x)$ of $f$ is Lipschitz continuous; that is,

$\|G(x) - G(y)\| \le \delta \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n,$

for some constant $\delta > 0$.

Theorem 3. Consider the method described in Algorithm NMBFGS and suppose the objective function $f$ satisfies the conditions in Assumptions A3 and A4. For any arbitrary $\alpha \in (0, 1)$, choose two real parameters $\epsilon > 0$ and $\theta > 0$ accordingly. Should the start point $x_0$ and the initial inverse Hessian approximation $H_0$ satisfy the required closeness relations, then the sequence $\{x_i\}_{i=0}^{\infty}$ generated by the new method converges to the optimal point $x^*$ superlinearly.

Proof. Similar to the proof in Jin and Mokhtari (2021).
Theorem 3 suggests that the sequence of points generated by NMBFGS advances to the optimal solution at a superlinear rate of the form $\left(\varepsilon_1/\sqrt{\delta} + \varepsilon_2/\delta\right)^{\delta}$ in a local neighbourhood of the optimal solution. The constants $\varepsilon_1$ and $\varepsilon_2$ depend on the parameters of the problem being solved. There is a compromise between the speed of superlinear convergence and the size of the local neighbourhood: increasing $\epsilon$ and $\theta$ enlarges the vicinity within which the method converges at a superlinear rate, but lowers the speed of convergence, since (32) increases. Therefore, by selecting different values for the parameters $\alpha$, $\epsilon$ and $\theta$, both the speed and the region of superlinear convergence change.

Numerical test results and discussions
Summaries of the numerical results are tabulated in this section. Table 2 lists the test problems. The chosen test set covers problems of varying dimensions, allowing the examination of various test problem sizes. The functions that were tested, and hence the results that were reported, relate to problems categorized by size/dimension. The problem sizes fall into four categories: low (2 ≤ n ≤ 20), medium (21 ≤ n ≤ 40), moderately high (41 ≤ n ≤ 1000) and high (n > 1000). The findings result from experiments on 17 different test problems with dimensions ranging from 2 to 100,000. Each of the problems in the list has been tested from four different perspectives, giving a total of 950 test instances. The test problems are taken from Moré et al. (1981), Fletcher (1987), Tajadod et al. (2016), and Xiao et al. (2006).
The new algorithm NMBFGS is compared to the method of Yuan et al. (2017) (see Table 1) and to the multi-step BFGS method (MSBFGS; see Ford & Moghrabi, 1993, 1994, 1996). The tests are run using Yuan's technique as a benchmark. Table 3 presents a summary of the overall findings. Tables 4, 5, 6 and 7 show the results for the small, medium, moderately large, and large problems, respectively. Because there is not enough room to provide the figures for each dimension separately, the findings are summarized by dimension category. The scores include iteration counts, function/gradient evaluations, and total execution times. The average optimality error is also reported for each dimension category. Figures 1, 2 and 3 provide a diagrammatic summary of function/gradient evaluations, iteration counts and timings, respectively. The implementation is in C++, run on a 64-bit machine with an i7-3770, 3.4 GHz CPU.
For all the methods tested here, the next point x iþ1 is obtained from x i by employing a line search algorithm that applies safeguarded cubic interpolation with step-doubling, such that the conditions in (4) are satisfied.
For the standard BFGS method, it is well known that a necessary and sufficient condition for ensuring that the generated updated matrices $\{H_i\}$ are positive definite, and hence that the computed search direction is downhill, is that $s_i^T y_i > 0$ (Broyden, 1970). By analogy, for the new method NMBFGS, condition (24) is imposed in order to guarantee that $\bar{v}_i^T z_i$ is "sufficiently" positive, circumventing any potential numerical instability in the construction of $H_{i+1}$. Should condition (24) fail to hold, the algorithm reverts to using an MSBFGS iteration on that particular instance. An initial scaling is applied to the starting inverse Hessian approximation using the methods introduced in Shanno and Phua (1978).
Figure 4 shows the graphical results for the test problems in terms of the most expensive component of the minimization process, the overall number of function/gradient evaluations. That is, we depict the proportion $p(\tau)$ of problems for which each method is within a factor $\tau$ of the best number of evaluations. The top curve corresponds to the algorithm that performs best in terms of evaluations. Figure 4 shows that NMBFGS outperforms the other two methods in terms of the number of iterations and evaluations, making it the fastest and best performer.
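The curves of the type shown in Figure 4 are straightforward to compute from a solver-by-problem table of evaluation counts. A minimal sketch of the construction of $p(\tau)$ follows (the evaluation counts below are made-up illustrative data, not the paper's results):

```python
import numpy as np

def performance_profile(costs, tau):
    """Fraction p(tau) of problems each solver handles within a factor
    tau of the best solver's cost. costs: (n_problems, n_solvers)."""
    best = costs.min(axis=1, keepdims=True)   # best cost per problem
    ratios = costs / best                     # performance ratios
    return (ratios <= tau).mean(axis=0)       # proportion per solver

# Rows: problems; columns: three hypothetical solvers.
costs = np.array([[10.0, 12.0, 30.0],
                  [20.0, 18.0, 25.0],
                  [15.0, 40.0, 16.0]])
print(performance_profile(costs, tau=1.0))   # fraction of outright wins
print(performance_profile(costs, tau=2.0))   # within a factor of 2
```

At $\tau = 1$ the profile gives each solver's share of outright wins; as $\tau$ grows, the curve of the most robust solver approaches 1 first, which is why the top curve in such a plot identifies the best performer.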

Conclusions
In this study, Secant-like approaches based on modifications to the classical Secant relation (1) are investigated, and a new non-Secant method is devised. To increase the quality of the computed Hessian approximations and, as a result, expedite convergence to a solution for a given problem, the approach employs nonlinear quantities that integrate data readily available from the most recent iterations. The algorithm's advantages have been justified both theoretically and practically. The primary motivation for exploring alternatives to the standard Secant equation has been to obtain possibly better Hessian approximations that expedite the convergence of quasi-Newton methods. The proposed algorithm's convergence behaviour has been investigated. The numerical results achieved above are promising, drawing attention to the method's potential practical significance. The findings demonstrate that past iteration data, such as function values, gradient points, and step vectors, yield practical performance gains, and hence provide incentives to explore them further.

Future work
Future research ought to concentrate on modified Secant methods for various τ-parameter choices in order to investigate the numerical behaviour of these methods' sensitivity to such choices. It would also be interesting to study these methods' ability to offer equivalent improvements when used to solve systems of non-linear equations, as well as their convergence characteristics.
Similar approaches are also being investigated, in which the computed search direction d i is adjusted before the line search is performed. This strategy looks to be effective. A variation of the technique is also being considered, in which the update of the Hessian approximation is skipped every other iteration (or more frequently). Also, the sensitivity of the algorithmic behaviour to the accuracy of the line search deserves investigation. The numerical performance of such methods when replacing the line search with Trust Region methods is also worth exploring.