Semismooth Newton-type method for bilevel optimization: Global convergence and extensive numerical experiments

We consider the standard optimistic bilevel optimization problem; in particular, upper- and lower-level constraints may be coupled. By means of the lower-level value function, the problem is transformed into a single-level optimization problem with a penalization of the value function constraint. For treating the latter problem, we develop a framework that does not rely on the direct computation of the lower-level value function or its derivatives. For each penalty parameter, the framework leads to a semismooth system of equations. This allows us to extend the semismooth Newton method to bilevel optimization. Besides global convergence properties of the method, we focus on achieving local superlinear convergence to a solution of the semismooth system. To this end, we formulate an appropriate CD-regularity assumption and derive sufficient conditions so that it is fulfilled. Moreover, we develop conditions to guarantee that a solution of the semismooth system is a local solution of the bilevel optimization problem. Extensive numerical experiments on 124 examples of nonlinear bilevel optimization problems from the literature show that this approach exhibits a remarkable performance, where only a few penalty parameters need to be considered.


INTRODUCTION
We consider the standard optimistic bilevel optimization problem

  min_{x,y} F(x, y)  s.t.  G(x, y) ≤ 0,  y ∈ S(x),   (1.1)

where S(x) denotes the set of optimal solutions of the lower-level problem

  min_y f(x, y)  s.t.  g(x, y) ≤ 0,   (1.2)

i.e., the lower-level solution map S : R^n ⇒ R^m is given by

  S(x) := argmin_y { f(x, y) | g(x, y) ≤ 0 }.   (1.3)

Throughout the paper, the functions F, f : R^n × R^m → R, G : R^n × R^m → R^p, and g : R^n × R^m → R^q are assumed to be twice continuously differentiable. As usual, we call F (resp. f) the upper-level (resp. lower-level) objective function, whereas G and g are called upper-level and lower-level constraint functions, respectively. Finally, x and y represent the upper-level and lower-level variables.
In order to derive optimality conditions for the bilevel optimization problem or to treat it numerically, two main approaches for reformulating (1.1) as a single-level problem exist. One approach is to replace the lower-level problem by its Karush-Kuhn-Tucker (KKT) conditions. This leads to a mathematical program with complementarity constraints (MPCC). However, quite strong assumptions are needed to show that a local (or global) optimal solution of the MPCC yields a local (or global) optimal solution of the corresponding bilevel optimization problem; for details, the reader is referred to [7]. Therefore, we do not pursue this approach here. Nevertheless, we would like to underline that solving an MPCC is challenging, and only a few Newton-type methods with global or local fast convergence exist [27,33]. Even if we disregard the discrepancies between (local or global) solutions of the bilevel problem and the MPCC for a moment, it is an open question which conditions in the context of bilevel problems are implied by assumptions needed for the local convergence analysis of a Newton-type method for the corresponding KKT reformulation.
The second main approach to transform a bilevel program into a single-level optimization problem is the lower-level value function (LLVF) reformulation [38,53], which provides the basis for the developments in this paper. In more detail, the LLVF ϕ : R^n → R is given by

  ϕ(x) := min_y { f(x, y) | g(x, y) ≤ 0 },   (1.4)

where, for the sake of simplicity, we assume throughout the paper that S(x) ≠ ∅ for all x ∈ R^n so that ϕ is indeed finite-valued on R^n. Then, w.r.t. local and global minimizers, the bilevel program (1.1) is equivalent to the optimization problem

  min_{x,y} F(x, y)  s.t.  G(x, y) ≤ 0,  g(x, y) ≤ 0,  f(x, y) ≤ ϕ(x).   (1.5)

In general, this is a nonconvex constrained optimization problem containing the typically nondifferentiable LLVF ϕ. Even if all the functions involved in (1.1) are fully convex (i.e., convex w.r.t. (x, y)), the feasible set of problem (1.5) is generally nonconvex. Several algorithms for computing a stationary point of the LLVF reformulation have already been designed and analyzed. For the case where the feasible set of the lower-level problem does not depend on the upper-level variable x, such methods can be found in [34,51,52]. The algorithms in [9,10] were suggested to solve special cases of problem (1.5), where relaxation schemes are used to deal with the value function (1.4). Finally, the authors of [31] proposed numerical methods to solve special bilevel programs by exploiting a connection between problem (1.5) and a generalized Nash equilibrium problem. In addition, we note that the LLVF reformulation has recently been used in methods for certain classes of mixed-integer bilevel programming problems; see, e.g., [22,35,49].
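To illustrate how nondifferentiability of ϕ arises even from smooth data, consider the following toy instance (not taken from the paper, for illustration only): for the lower-level problem min_{y ∈ [0,1]} xy, one obtains

```latex
\varphi(x) \;=\; \min_{0 \le y \le 1} x\,y
        \;=\; \min(0, x)
        \;=\; \begin{cases} x, & x \le 0,\\[2pt] 0, & x > 0, \end{cases}
```

so ϕ is piecewise linear and nondifferentiable at x = 0, although f(x, y) = xy is smooth (indeed bilinear). Note that ϕ is still directionally differentiable there, with ϕ'(0; d) = min_{y ∈ S(0)} dy = min(0, d), in line with the minimum-over-solutions structure of value-function formulas.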
In this paper, we develop a framework for solving problem (1.5) which does not rely on the direct computation of the LLVF (1.4) or its derivatives, as is the case in [34,51,52]. Thanks to this framework, we are able to extend the semismooth Newton method to bilevel optimization, for the first time in the literature. The ingredients used to construct the framework and to establish global convergence of the method are well known in the literature. At first, we use partial calmness [53] to move the value function constraint, i.e., f(x, y) ≤ ϕ(x), to the upper-level objective function by means of partial exact penalization. For the resulting problem, necessary KKT-type optimality conditions are derived and reformulated as a square semismooth system of nonsmooth equations which depends on the penalty parameter. For the aforementioned system of equations, global convergence of a semismooth Newton algorithm is established based on [4]; see also [39,40,41] for related results. Obtaining local superlinear convergence of a semismooth Newton algorithm usually relies on the nonsingularity of some generalized Jacobian of the system of equations; see [17] and more general results in [14,30,42]. For our setting, we will derive conditions that ensure an appropriate nonsingularity property. To this end, upper- and lower-level linear independence conditions as well as a certain SSOSC (strong second order sufficient condition)-type condition will come into play. In standard nonlinear optimization, a similar setup guarantees that the reference point, where these conditions are satisfied, also corresponds to a locally optimal solution [43]; this is known as the Robinson condition. We will derive a Robinson-type condition which guarantees that the reference point corresponds to a locally optimal solution of the penalized bilevel program. For this, the LLVF (1.4) is required to be second order directionally differentiable in the sense of [1,47].
For the algorithm studied in this paper, we have conducted detailed numerical experiments using the BOLIB (Bilevel Optimization LIBrary) [56], which comprises 124 examples of nonlinear bilevel optimization problems from the literature. The true optimal solutions are known for 70% of these problems. We were able to recover these solutions using a selection of just 9 values of the involved penalization parameter. However, it is important to emphasize that the primary goal of the method is to compute stationary points of problem (1.5). For each of the 9 values of the penalization parameter, the method converges for at least 87% of the problems. The algorithm also exhibits a good experimental order of convergence (at least 1.5 for about 70% of the problems) for the different values of the parameter. To the best of our knowledge, this is the first time that an algorithm for nonlinear bilevel optimization has been proposed with computational experiments at such a scale and with such a level of success.
The paper is organized as follows. In the next section, we recall some basic notions and properties, centered around the generalized first and second order differentiation of the LLVF (1.4). Section 3 discusses the stationarity concept that will be the basis for the semismooth system suggested later on. There, we recall some basic tools and the general framework for deriving optimality conditions for the LLVF reformulation (1.5). In Section 4, based on the reformulation in [17], we suggest a semismooth system of equations to rewrite the stationarity conditions and establish a semismooth Newton method for computing stationary points. In Section 5, we derive sufficient conditions for the CD-regularity of this semismooth system at a solution in order to guarantee superlinear or quadratic convergence. Section 6 is devoted to a Robinson-type condition ensuring that such a solution is a strict local optimal solution of the penalized problem. Finally, Section 7 presents results of a numerical study of the semismooth Newton method on the test problems in the BOLIB library [56].

PRELIMINARIES
We first introduce some basic notation that will be used throughout. Depending on the situation, both y and z will be used as lower-level variables. According to this, for some (x̄, ȳ) and (x̄, z̄), the index sets

  I_1 := { i ∈ {1, …, p} | G_i(x̄, ȳ) = 0 }   (2.1)

of the active upper-level constraints and

  I_2 := { j ∈ {1, …, q} | g_j(x̄, ȳ) = 0 },  I_3 := { j ∈ {1, …, q} | g_j(x̄, z̄) = 0 }   (2.2)

of the active lower-level constraints are defined. Moreover, given a multiplier ū ∈ R^p for the upper-level constraint function G and multipliers v̄, w̄ ∈ R^q, each for the lower-level constraint function g, the index sets

  η_1 := η_G(x̄, ȳ, ū),  θ_1 := θ_G(x̄, ȳ, ū),  ν_1 := ν_G(x̄, ȳ, ū)   (2.3)

(distinguishing strongly active, degenerate, and inactive constraints w.r.t. the given multiplier) are defined as subsets of the upper-level indices {1, …, p} and, similarly,

  η_2 := η_g(x̄, ȳ, v̄),  θ_2 := θ_g(x̄, ȳ, v̄),  ν_2 := ν_g(x̄, ȳ, v̄),
  η_3 := η_g(x̄, z̄, w̄),  θ_3 := θ_g(x̄, z̄, w̄),  ν_3 := ν_g(x̄, z̄, w̄)   (2.4)

are defined as subsets of the lower-level indices {1, …, q}. The lower-level Lagrangian function L : R^n × R^m × R^q → R is given by

  L(x, z, w) := f(x, z) + Σ_{j=1}^q w_j g_j(x, z).   (2.5)

For a function depending on upper-level and/or lower-level variables, we will often use the notations ∇_1 and ∇_2 to denote the gradients of this function w.r.t. the upper-level (resp. lower-level) variable, to avoid any potential confusion. Similarly, ∇²_1, ∇²_2, and ∇²_12 may be used when referring to second order derivatives. Since the optimal value function ϕ (1.4) is nondifferentiable in general, we need generalized concepts of differentiability. Let us first recall the usual notion of the directional derivative. For a function ψ : R^n → R, its directional derivative at x̄ ∈ R^n in the direction d ∈ R^n is the limit

  ψ'(x̄; d) := lim_{t↓0} (ψ(x̄ + td) − ψ(x̄)) / t,   (2.6)

provided it exists [45]. The optimal value function ϕ (1.4) can be directionally differentiable without being differentiable or convex. To underline this, we are going to recall a result from [23] that will play an important role in this paper. To proceed, we will say that the lower-level linear independence constraint qualification (LLICQ) holds at a point (x̄, z̄) if the family of vectors

  { ∇_2 g_j(x̄, z̄) | j ∈ I_3 }   (2.7)

is linearly independent.
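As a quick sanity check of the directional derivative notion, one can approximate ψ'(x̄; d) by a one-sided difference quotient. The toy value function below is hypothetical (not from the paper) and only illustrates the definition recalled above:

```python
# Forward-difference check of directional derivatives for the toy value
# function phi(x) = min_{0 <= y <= 1} x*y = min(0, x); hypothetical example.
def phi(x):
    return min(0.0, x)

def ddir(f, x, d, t=1e-8):
    """One-sided difference quotient approximating f'(x; d)."""
    return (f(x + t * d) - f(x)) / t

# phi is nondifferentiable at 0, yet directionally differentiable there:
# phi'(0; d) = min(0, d).
assert abs(ddir(phi, 0.0, 1.0) - 0.0) < 1e-6
assert abs(ddir(phi, 0.0, -1.0) + 1.0) < 1e-6
```

The asymmetry of the two one-sided derivatives at 0 is exactly what rules out (full) differentiability while leaving directional differentiability intact.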
For any (x, z), the set of lower-level Lagrange multipliers is given by

  Λ(x, z) := { w ∈ R^q | w ≥ 0, w_j g_j(x, z) = 0 for j = 1, …, q, ∇_2 L(x, z, w) = 0 }.   (2.8)

Also, let us define the set-valued map K : R^n ⇒ R^m by

  K(x) := { z ∈ R^m | g(x, z) ≤ 0 }.   (2.9)

If K is uniformly compact near x̄ and the LLICQ holds at (x̄, z) for all z ∈ S(x̄), then ϕ is directionally differentiable at x̄ with

  ϕ'(x̄; d) = min_{z ∈ S(x̄)} ∇_1 L(x̄, z, w_z)ᵀ d,   (2.10)

where, for any z ∈ S(x̄), w_z is the unique element in Λ(x̄, z).
Note that the set of lower-level Lagrange multipliers Λ(x, z) is a singleton for all z ∈ S(x) because the LLICQ is assumed for all these z. If the lower-level problem is convex (i.e., f and g_j, j = 1, …, q, are convex functions), a version of formula (2.10) is given in [25] without requiring uniqueness of the lower-level Lagrange multipliers. In the absence of convexity, corresponding formulas for ϕ'(x; d) not requiring single-valuedness of Λ are given in [44]. However, in the latter case, some second order assumptions are imposed on problem (1.2). Also, the uniform compactness of K can be replaced by other types of assumptions, e.g., the inner semicontinuity or compactness of the lower-level solution map S (1.3); see [37] for details.
As shown in [1], one can define a second order directional derivative for ψ : R^n → R at x̄ in directions d and e by

  ψ''(x̄; d, e) := lim_{t↓0} (ψ(x̄ + td + ½t²e) − ψ(x̄) − tψ'(x̄; d)) / (½t²),

provided the limit exists. Next, we recall a formula for the second order directional derivative of the LLVF ϕ (1.4) developed in [47]. For this purpose, a few more assumptions are needed. A point (x̄, z̄, w̄) is said to satisfy the lower-level strict complementarity condition (LSCC) if w̄_j > 0 for all j ∈ I_3. The lower-level submanifold property (LSMP) is said to hold at (x̄, d) if, for all z̄ ∈ S_1(x̄; d), the restriction of S(x̄) to a neighborhood of z̄ is a smooth submanifold of { z | g_j(x̄, z) = 0 for all j ∈ I_3 }. Finally, the lower-level second order sufficient condition (LSOSC) is said to hold at (x̄, d) if

  eᵀ ∇²_2 L(x̄, z, w_z) e > 0

for all z ∈ S_1(x̄; d) and all e ∈ [T_•(z)]^⊥ \ {0} such that ∇_2 g_j(x̄, z)ᵀ e = 0 for all j ∈ I_3. Here, T_•(z) denotes the tangent space [32] of S(x̄) at z. For further details and discussions on these assumptions and related results, including the following one, see [2,47]. The resulting formula (2.14) involves the set S_1(x̄; d) given in (2.13) and the function ξ_d(x̄, z) defined in (2.15). Note that ∇² stands for the Hessian w.r.t. the vector made of the first and second variables of the function (2.5). We can simplify the assumptions needed in this theorem if we impose that the lower-level problem (1.2) has a unique optimal solution for x := x̄; cf. the following theorem from [46].
Then, we have the corresponding expression for the second order directional derivative, where ξ_d(x̄, ȳ) is defined as in (2.15), with z = ȳ, and Z_d(x̄, ȳ) is given accordingly. To extend the concept of differentiability to a wider class of functions, generalized derivative notions were introduced in [3]: for a locally Lipschitz continuous function ψ : R^n → R,

  ∂ψ(x̄) := co ∂_B ψ(x̄),   (2.18)
  ∂_B ψ(x̄) := { v ∈ R^n | ∃ {x^k} ⊆ D_ψ : x^k → x̄, ∇ψ(x^k) → v },   (2.19)

where "co" stands for the convex hull, D_ψ represents the set of points where ψ is differentiable, and ∂_B ψ(x̄) is called the B-subdifferential of ψ at x̄. For vector-valued functions ψ, the notions ∂_B ψ(x̄) and ∂ψ(x̄) in (2.19) remain valid with ∇ψ denoting the Jacobian of ψ at points where ψ is differentiable. Then, the set ∂ψ(x̄) is called the generalized Jacobian of ψ at x̄.
To complete this section, we briefly review the concept of semismoothness [36], which is used for the convergence results of the Newton method to be discussed in this paper. Let a function ψ : R^n → R^m be Lipschitz continuous around x̄. Then, ψ is called semismooth at x̄ if the limit

  lim_{V ∈ ∂ψ(x̄ + td'), d' → d, t ↓ 0} V d'

exists for all d ∈ R^n. If, in addition,

  V d − ψ'(x̄; d) = O(‖d‖²)

holds for all V ∈ ∂ψ(x̄ + d) with d → 0, then ψ is said to be strongly semismooth at x̄.
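For intuition, the absolute value function is a classic example of a strongly semismooth function; the following sketch (illustrative only, not from the paper) approximates its B-subdifferential at 0 by collecting gradients at nearby points of differentiability:

```python
def grad_abs(x):
    """Derivative of |x| at points x != 0, where it is differentiable."""
    return 1.0 if x > 0 else -1.0

# Gradients along sequences approaching 0 from both sides recover the
# B-subdifferential of |.| at 0, namely {-1, +1}; its convex hull, the
# generalized subdifferential, is the interval [-1, 1].
limits = {grad_abs(t) for t in (1e-9, -1e-9)}
assert limits == {-1.0, 1.0}
```
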
The function ψ is said to be SC¹ if it is continuously differentiable and ∇ψ is semismooth. Also, ψ is called an LC² function if it is twice continuously differentiable and ∇²ψ is locally Lipschitz continuous.

NECESSARY CONDITIONS FOR OPTIMALITY
The standard approach to derive necessary optimality conditions for the LLVF reformulation (1.5) of the bilevel optimization problem (1.1) is to consider the partial penalization

  min_{x,y} F(x, y) + λ(f(x, y) − ϕ(x))  s.t.  G(x, y) ≤ 0,  g(x, y) ≤ 0,   (3.1)

where λ ∈ (0, ∞) denotes the penalization parameter. To deal with the fact that no standard constraint qualification holds for problem (1.5) [53], the authors of the latter paper introduced the partial calmness concept and showed its benefit for obtaining KKT conditions for (1.5). In more detail, problem (1.5) is said to be partially calm at one of its feasible points (x̄, ȳ) if there exist λ̄ ∈ (0, ∞) and a neighborhood U of (x̄, ȳ, 0) ∈ R^n × R^m × R such that

  F(x, y) − F(x̄, ȳ) + λ̄|u| ≥ 0

for all (x, y, u) ∈ U with G(x, y) ≤ 0, g(x, y) ≤ 0, and f(x, y) − ϕ(x) + u = 0. Based on this, the following result can be easily derived.
Using this theorem, we are now going to establish the necessary optimality conditions that will be the basis of the Newton method in this paper. To proceed, we first need two constraint qualifications. The upper-level Mangasarian-Fromovitz constraint qualification (UMFCQ) will be said to hold at (x̄, ȳ) if there exists d ∈ R^{n+m} such that

  ∇G_i(x̄, ȳ)ᵀ d < 0 for all i ∈ I_1  and  ∇g_j(x̄, ȳ)ᵀ d < 0 for all j ∈ I_2.   (3.3)

The lower-level Mangasarian-Fromovitz constraint qualification (LMFCQ) is satisfied at x̄ if, for any z ∈ S(x̄), there exists d ∈ R^m such that

  ∇_2 g_j(x̄, z)ᵀ d < 0 for all j with g_j(x̄, z) = 0.

Further, note that a real-valued function (x, y) → ψ(x, y) will be said to be fully convex if it is convex w.r.t. all of its variables. The lower-level problem is usually said to be convex if f and g_j (j = 1, …, q) are convex just w.r.t. the lower-level variable. One can easily check that if a function (x, y) → ψ(x, y) is fully convex, then it is convex w.r.t. x and it is convex w.r.t. y.
Theorem 3.2. Let (x̄, ȳ) be a local optimal solution of problem (1.5), which is assumed to be partially calm at (x̄, ȳ). The functions f and g_1, …, g_q are assumed to be fully convex. Furthermore, suppose that the LMFCQ holds at x̄ and that the UMFCQ holds at (x̄, ȳ). Then, there exist λ ∈ (0, ∞), u ∈ R^p, (v, w) ∈ R^q × R^q, and z ∈ R^m such that the following system holds for (x, y) = (x̄, ȳ): Proof. Note that similar proof techniques for related results can be found in [8,12,53]. Since the functions f, g_1, …, g_q involved in the lower-level problem are fully convex and sufficiently smooth, the optimal value function ϕ (1.4) is convex [50, Lemma 2.1] and, hence, locally Lipschitz continuous around x̄ (|ϕ(x)| < ∞ for any x was assumed throughout). Therefore, for any λ > 0, problem (3.1) is a Lipschitz continuous optimization problem. Moreover, under the partial calmness condition, Theorem 3.1 guarantees that there exists λ > 0 such that (x̄, ȳ) is a local optimal solution of problem (3.1). Now, applying the necessary optimality conditions for locally Lipschitz optimization problems to (3.1) and taking into account that the UMFCQ (3.3) holds at (x̄, ȳ), we obtain the existence of u ∈ R^p and v ∈ R^q such that (3.8), (3.9), and (3.11) hold for (x, y) = (x̄, ȳ). It is clear that (3.6) for (x, y) = (x̄, ȳ) follows from the last m components of this inclusion. Moreover, considering the first n components of (3.11), we obtain a condition involving the subdifferential of ϕ at x̄. Now, since the lower-level functions are fully convex and the LMFCQ holds at x̄, it follows from [50, Theorem 2.1] that we can find some z ∈ S(x̄) and w ∈ Λ(x̄, z) such that (3.5) holds for (x, y) = (x̄, ȳ). Furthermore, observe that the conditions (3.7) and (3.10) for (x, y) = (x̄, ȳ) result from the definition of w ∈ Λ(x̄, z); cf. (2.8).
Remark 3.3. In the literature, see, for example, [8,12] and references therein, an upper-level regularity condition is often used in combination with the LMFCQ to derive (3.8), (3.9), and (3.11). Here, we employ the UMFCQ instead, as problem (1.1) involves coupled upper-level constraints (i.e., constraints depending on both the upper- and lower-level variables), which is not the case in the aforementioned papers. Moreover, due to the convexity of the lower-level problem, the explicit use of the inclusion z ∈ S(x) is avoided, given that its fulfillment follows from the conditions (3.7) and (3.10).
Remark 3.4. If y = z in the optimality conditions of Theorem 3.2, then we arrive at another well-known type of conditions, consisting of (3.6)-(3.10) and (3.12), with z replaced by y in (3.7) and (3.10). Instead of the full convexity assumption, the inner semicontinuity concept can also be used to derive these conditions. To see this, note that if the LMFCQ and inner semicontinuity both hold at (x, z), then the Clarke subdifferential of ϕ can be estimated as in (3.12); see [8,12] for related details and references. Note that various other stationarity concepts for bilevel programs based on the LLVF reformulation are possible; see the latter references for related details. However, it is important to note that, for any of these conditions, the convexity assumption on the lower-level problem is still required in order for the inclusion y ∈ S(x) to be dropped from these conditions.
For conditions ensuring that the assumptions in Theorem 3.2, in particular partial calmness, hold, we refer to [8,12,13,53,54] and references therein. Keeping the penalty parameter λ > 0 fixed, the optimality conditions in Theorem 3.2 can be regarded as a mixed complementarity system and lead to an equivalent square system of nonsmooth equations which can be dealt with by a semismooth Newton method [4,17,30,39,40,42]. Dealing with other optimality conditions, like those in [8,12,53], in a similar way, one is led to more difficult nonsmooth systems for which more sophisticated Newton-type methods [15,19,20] might be helpful.

THE ALGORITHM
To apply the semismooth Newton method from [4] to the system (3.5)-(3.10), for some fixed λ > 0, the latter system is reformulated by means of the complementarity function φ : R² → R [17] given by

  φ(a, b) := √(a² + b²) − a − b.

It can be easily checked that φ(a, b) = 0 if and only if a ≥ 0, b ≥ 0, ab = 0. Therefore, to rewrite the complementarity systems (3.8)-(3.10), we first define the functions φ_G : R^n × R^m × R^p → R^p and φ_g : R^n × R^m × R^q → R^q componentwise via φ, applied to the pairs (−G_i(x, y), u_i) and (−g_j(x, y), v_j), respectively; the function φ_g is used once with the pair (y, v) and once with (z, w) in order to cover both (3.9) and (3.10). Furthermore, we introduce the Lagrange-type function L^λ (4.1) associated with problem (3.1) and its lower-level counterpart ℒ, where λ > 0 is the fixed penalty parameter. Based on these definitions, we now introduce the mapping Φ_λ, where ∇L^λ denotes the gradient of L^λ w.r.t. (x, y, z). Now, keeping (4.2) and (4.3) in mind, it can easily be seen that the optimality conditions (3.5)-(3.10) in Theorem 3.2 can equivalently be written as

  Φ_λ(ζ) = 0,  ζ := (x, y, z, u, v, w).   (4.4)

Obviously, this is a square system of N equations in N variables. Moreover, the mapping Φ_λ is strongly semismooth at any solution of (4.4). In particular, we can apply the semismooth Newton method from [4] with its favorable combination of global and local convergence properties. The latter means superlinear or quadratic convergence based on (strong) semismoothness of Φ_λ and a regularity property of the generalized Jacobians ∂_B Φ_λ or ∂Φ_λ at a solution ζ* of (4.4). Sufficient conditions for ∂Φ_λ(ζ*) containing only nonsingular matrices will be developed in Section 5. For global convergence, the complementarity function φ has the property that φ² is differentiable with Lipschitz continuous derivative. Due to this, the merit function

  Ψ_λ(ζ) := ½‖Φ_λ(ζ)‖²

is continuously differentiable. In particular, this enables us to overcome situations where the Newton direction for (4.4) does not exist or does not provide sufficient descent. For other complementarity and merit functions as well as their properties, we refer to [18,48], for example. We now present the semismooth Newton algorithm from [4] applied to equation (4.4) or, in other words, to the optimality conditions from Theorem 3.2.
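A complementarity function φ from [17] with the stated properties (φ(a, b) = 0 iff a ≥ 0, b ≥ 0, ab = 0, and φ² continuously differentiable) is commonly taken to be the Fischer-Burmeister function φ(a, b) = √(a² + b²) − a − b. A minimal sketch (assuming this concrete choice, not the paper's implementation) checks the characteristic property numerically:

```python
import math

# Fischer-Burmeister function, assumed here as the concrete choice of the
# complementarity function phi; illustrative sketch only.
def phi_fb(a, b):
    """Zero if and only if a >= 0, b >= 0 and a*b = 0."""
    return math.sqrt(a * a + b * b) - a - b

assert abs(phi_fb(0.0, 3.0)) < 1e-12   # complementary pair -> zero
assert abs(phi_fb(2.0, 0.0)) < 1e-12   # complementary pair -> zero
assert phi_fb(-1.0, 0.5) > 0.0         # infeasible pair -> nonzero
```

Its square is smooth with a locally Lipschitz gradient, which is exactly the property exploited above to obtain a continuously differentiable merit function.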
Some remarks on Algorithm 4.1 are in order. Compared to [4], we are now dealing with a complementarity system instead of a pure complementarity problem. The penalization parameter λ > 0 has to be chosen in Step 0 and is fixed throughout the algorithm. For the definition of the generalized Jacobian ∂ B Φ λ (ζ), see Section 2. If all matrices in ∂ B Φ λ (ζ) are nonsingular, the function Φ λ is called BD-regular at ζ. Following [4], global and local convergence properties of Algorithm 4.1 can be derived as follows.
Step 2: Choose W_k ∈ ∂_B Φ_λ(ζ^k) and compute a solution d^k of the system

  W_k d = −Φ_λ(ζ^k).
The proof of [4, Theorem 11] can easily be extended to the complementarity system Φ_λ(ζ) = 0. Moreover, it is known that the continuity of the second order derivatives of the problem functions F, G, f, and g (as assumed in Section 1) suffices instead of the semismoothness of these derivatives (as used in [4] for showing that eventually the unit stepsize α_k = 1 is accepted).
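The overall scheme (a Newton step on an element of the B-subdifferential combined with an Armijo line search on the smooth merit function ½‖Φ_λ‖²) can be sketched on a tiny stand-in system. Everything here is hypothetical: the toy problem, the Fischer-Burmeister choice of φ, and the function names only mirror the structure of the method, not the paper's actual Φ_λ:

```python
import math

def phi_fb(a, b):
    """Fischer-Burmeister complementarity function (assumed choice)."""
    return math.sqrt(a * a + b * b) - a - b

# Toy KKT system for: minimize (x-2)^2 s.t. x <= 1, with multiplier u.
#   Phi(x,u) = ( 2(x-2) + u,  phi_fb(1-x, u) ),  solution (x*, u*) = (1, 2).
def Phi(z):
    x, u = z
    return [2.0 * (x - 2.0) + u, phi_fb(1.0 - x, u)]

def jac(z):
    """One element of the B-subdifferential of Phi at z."""
    x, u = z
    a, b = 1.0 - x, u
    r = math.hypot(a, b)
    if r == 0.0:  # kink of phi_fb: pick the element along direction (1, 1)
        da = db = math.sqrt(0.5) - 1.0
    else:
        da, db = a / r - 1.0, b / r - 1.0
    # rows: derivatives w.r.t. (x, u); note da/dx = -1 for a = 1 - x
    return [[2.0, 1.0], [-da, db]]

def merit(z):
    f1, f2 = Phi(z)
    return 0.5 * (f1 * f1 + f2 * f2)

def newton(z, tol=1e-10, sigma=1e-4):
    for _ in range(100):
        F = Phi(z)
        if math.hypot(*F) <= tol:
            break
        W = jac(z)
        det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
        # solve W d = -F by Cramer's rule (2x2 toy system)
        d = [(-F[0] * W[1][1] + F[1] * W[0][1]) / det,
             (-F[1] * W[0][0] + F[0] * W[1][0]) / det]
        # Armijo backtracking on the smooth merit function 0.5*||Phi||^2
        alpha, m0 = 1.0, merit(z)
        while alpha > 1e-12 and \
                merit([z[0] + alpha * d[0], z[1] + alpha * d[1]]) > (1 - 2 * sigma * alpha) * m0:
            alpha *= 0.5
        z = [z[0] + alpha * d[0], z[1] + alpha * d[1]]
    return z

x, u = newton([0.0, 0.0])  # converges to the KKT point (1, 2) of the toy problem
```

The line search never rejects the full step close to the solution, which is the behavior reported for Algorithm 4.1 in the experiments (full Newton steps at termination for most examples).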
The assumption that Φ_λ is BD-regular at ζ̄, used in Theorem 4.2, can be replaced by the stronger CD-regularity of Φ_λ at ζ̄. The latter requires the nonsingularity of all matrices in ∂Φ_λ(ζ̄). In the next section, we focus on the derivation of sufficient conditions for CD-regularity. The latter also allows a nice connection to the Robinson condition that we are going to discuss in Section 6. Let us finally note that conditions ensuring that a stationary point of a merit function is a solution of the underlying equation have been extensively studied for complementarity problems. We do not want to dive into this subject but note that if just one element of ∂Φ_λ(ζ) is nonsingular, then ∇Ψ_λ(ζ) = 0 implies Φ_λ(ζ) = 0; see, for example, [14, Section 4].

CD-REGULARITY
To provide sufficient conditions guaranteeing that CD-regularity holds for Φ_λ (4.3), we first provide an upper estimate of the generalized Jacobian of Φ_λ in the sense of Clarke (2.18). Recall that the index sets η_i, ν_i, and θ_i with i = 1, 2, 3 are defined in (2.3) and (2.4).
Theorem 5.1. Let the functions F, G, f, and g be twice continuously differentiable at ζ := (x, y, z, u, v, w). If λ > 0, then Φ_λ is semismooth at ζ, and any matrix W_λ ∈ ∂Φ_λ(ζ) can take a block form in which the blocks A_ij and B_ij are defined accordingly. In the next result, we provide conditions ensuring that the function Φ_λ is CD-regular. To proceed, first note that, similarly to the UMFCQ (3.3) and analogously to the LLICQ (2.7), the upper-level linear independence constraint qualification (ULICQ) will be said to hold at (x̄, ȳ) if the corresponding family of gradients of the constraints active at (x̄, ȳ) is linearly independent. Furthermore, let us introduce the cone of feasible directions for problem (3.1), with d^{1,3} defined similarly. Recall that, for i = 1, 2, 3, ν_i is defined as in (2.3)-(2.4). By ∇²L^λ(ζ) and ∇²ℒ(ζ), we will denote the Hessians of the Lagrangian functions L^λ and ℒ w.r.t. (x, y) and (x, z), respectively.
Note that the strict complementarity condition imposed here is restricted to the lower-level problem and does not necessarily imply the local differentiability of the lower-level optimal solution map, as known from earlier results on the Newton method (see, e.g., [28,29]) or from the literature on nonlinear parametric optimization (see, e.g., [16]). As we also have the LLICQ, the differentiability of the lower-level optimal solution is usually guaranteed when a strong second order sufficient condition (SSOSC)-type condition restricted to the lower-level problem is satisfied as well. The LSCC here only allows us to deal with the minus sign in front of the lower-level value function in problem (1.5) and is responsible for many of the stationarity concepts for the problem; cf. [8,12,53]. Next, we discuss two possible scenarios to avoid imposing the LSCC. In the first case, we assume that the lower-level feasible set is unperturbed, i.e., that g is independent of x.
Proof. Considering the counterpart of (5.7)-(5.12) when g is independent of x and proceeding as in the proof of the previous theorem, we have (5.13), (5.14) and for j ∈ P 1 1 , j ∈ P 2 1 , and j ∈ P 3 1 , respectively. Here, c 1 j for j ∈ P 1 1 and c 2 j for j ∈ P 2 1 are defined as in (5.14) while c 3 j := − while taking into account (5.24) and the counterparts of (5.13) and (5.14). Hence, it follows from assumption (5.22) that d 1 = 0, d 2 = 0, d 3 = 0, d 4 j = 0 for j ∈ P 1 1 , d 5 j = 0 for j ∈ P 2 1 , and d 6 j = 0 for j ∈ P 3 1 . The rest of the proof then follows as that of Theorem 5.2.
For the next result, the LSCC is also not needed, and g does not necessarily have to be independent of the upper-level variable.

ROBINSON-TYPE CONDITION
For a standard nonlinear optimization problem with twice continuously differentiable functions, the Robinson condition [43] is said to hold at one of its KKT points if the LICQ and a strong second order sufficient condition (SSOSC) are satisfied. It was shown in [40] that if the data of a standard nonlinear optimization problem are SC¹ and the Robinson condition is satisfied, then the CD-regularity condition holds for the corresponding counterpart of the function Φ_λ (4.3). Considering the structure of the results from the previous section, it can be argued that the combination of assumptions in Theorems 5.2, 5.3, or 5.4 corresponds to an extension of Robinson's condition to the context of bilevel optimization. However, another implication of Robinson's condition, more precisely of the SSOSC, is that it ensures that a given point is a strict local optimal solution of the corresponding nonlinear programming problem.
The aim of this section is to strengthen the second order assumption from the previous section so that it guarantees that points computed by our algorithm are strict local optimal solutions. To proceed, we introduce the following cone of feasible directions, where {w} := {w(z)} = Λ(x, z) for a fixed z ∈ S(x), as the LLICQ (2.7) will be assumed to hold at (x, z) for all z ∈ S(x). Also note that, as in (5.5), d can be written as d := ((d¹)ᵀ, (d²)ᵀ)ᵀ. Furthermore, we will use the following modified version of the upper-level Lagrangian function (4.1):

  L^λ_κ(x, y, u, v) := κ(F(x, y) + λf(x, y)) + Σ_{i ∈ I_1(d)} u_i G_i(x, y) + Σ_{j ∈ I_2(d)} v_j g_j(x, y),

where the set I_1(d) (resp. I_2(d)) represents the set of indices i ∈ I_1 (resp. j ∈ I_2) such that ∇G_i(x̄, ȳ)ᵀ d = 0 (resp. ∇g_j(x̄, ȳ)ᵀ d = 0). Recall that I_1 and I_2 are given in (2.1) and (2.2). In the next result, we first provide a slightly more general SSOSC-type condition for problem (3.1).
Similarly to (6.21), the upper- and lower-level constraint functions satisfy the second order epiregularity conditions. Subsequently, condition (6.8) holds.
Assuming that the lower-level optimal solution mapping S (1.3) is single-valued at x̄ can lead to a much simpler result, closely aligned with Theorem 5.2, and subsequently to the derivation of a Robinson-type condition for problem (3.1). To proceed, we define C(x̄, ȳ) := Q(x̄, ȳ, ȳ); cf. (5.5).
It is clear that, under the framework of this theorem, the conclusion of Theorem 5.2 is valid, while it is also guaranteed that the resulting point (x̄, ȳ) is a strict local optimal solution of problem (3.1). Hence, Theorem 6.2 can be seen as an extension of the Robinson condition to bilevel optimization, in the sense discussed at the beginning of this section.
A common point between Theorems 6.1 and 6.2 is that the SSOSC-type conditions, i.e., (6.2) and (6.23), respectively, are the main assumptions, as the remaining ones (apart from the USCC and LSCC in Theorem 6.2) are mostly technical, helping to ensure that the LLVF ϕ (1.4) is second order directionally differentiable. Observe that the USCC and LSCC in Theorem 6.2 help to ensure that points satisfying (6.3)-(6.4) coincide with stationary points in the sense of (3.5)-(3.10).
For examples where the assumptions in Theorem 6.1, ensuring the second order differentiability of ϕ (1.4), hold without the uniqueness of the lower-level optimal solution, see, e.g., [47,2].
To close this section, we provide an example confirming that the SSOSC-type condition (6.23) is essential in guaranteeing that a point is locally optimal for problem (3.1), and is potentially also a necessary condition. In this example, the lower-level solution set-valued mapping is single-valued. We can, however, check that all the other assumptions of Theorem 6.2 are satisfied for (1, 1, 1, 2λ − 5, 2).
Further analysis of the necessity of condition (6.23), and of the corresponding condition in Theorem 6.1, will be carried out in future work. Different types of sufficient optimality conditions for bilevel optimization problems can be found in the papers [5,11].

NUMERICAL EXPERIMENTS
Based on our implementation of Algorithm 4.1 in MATLAB (R2018a), we report and discuss test results obtained for the 124 nonlinear bilevel programs from the current version of the BOLIB library [56].
Recall that the necessary optimality conditions (3.5)-(3.10) and their reformulation (4.4) as a nonsmooth system of equations contain the penalization parameter λ > 0. Based on the construction process of the system (see, e.g., [8,53]), there is no specific rule on how to choose the best value of this parameter. Rather, one may try all λ from a certain finite discrete set in (0, ∞), solve the corresponding optimality conditions, and then choose the best solution in terms of the upper-level objective function value. For our approach, it turned out that a small set of λ-values is sufficient to reach very good results. To be precise, for all our experiments, we just used the nine values of λ in Λ := {2^{−1}, 2^0, …, 2^6, 2^7}. In addition to the stopping criterion ‖Φ_λ(ζ^k)‖ ≤ ε used in Algorithm 4.1, a maximal number of iterations was used to terminate unsuccessful runs. Table 1 reports, for the test problems of the BOLIB library [56], values of the leader's objective function F. The column F_known shows the best known F-values from the literature. Such a value was not available for 6 of the test problems; this is marked by "unknown" in the column Status. For 83 examples, the best known F-value is even optimal (with status labelled as "optimal"). For the remaining 35 test problems, the known F-value might not be optimal and its status is just set to "known". The columns below F_new show the F-values obtained by Algorithm 4.1 for the nine penalization parameters λ.
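The grid-then-select strategy described above can be sketched as a simple loop. `solve_for_lam` and `upper_obj` are hypothetical stand-ins for Algorithm 4.1 and the upper-level objective F; only the loop structure mirrors the text:

```python
# Run the method once per penalty value and keep the result with the
# smallest upper-level objective value; failed runs are skipped.
LAMBDAS = [2.0 ** k for k in range(-1, 8)]  # {2^-1, 2^0, ..., 2^7}

def best_solution(solve_for_lam, upper_obj):
    best = None
    for lam in LAMBDAS:
        sol = solve_for_lam(lam)
        if sol is None:  # the run for this lambda failed
            continue
        if best is None or upper_obj(sol) < upper_obj(best):
            best = sol
    return best

# Mock demonstration: pretend each run returns x = 1/lambda.
sol = best_solution(lambda lam: 1.0 / lam, lambda x: (x - 0.1) ** 2)
```

Since λ stays fixed within each run, the nine runs are independent and could also be executed in parallel.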
Note that three examples contain a parameter that should be provided by the user. They are examples No 14, 39, and 40. The first one is associated with ρ ≥ 1, which separates the problem into 4 cases: (i) ρ = 1, (ii) 1 < ρ < 2, (iii) ρ = 2, and (iv) ρ > 2. The results presented in Table 1 correspond to case (i). For the other three cases, our method still produced the true global optimal solutions. Example No 39 has a unique global optimal solution. The result presented in Table 1 is for c = 0. We also tested our method when c = ±1, and obtained the unique optimal solutions as well. Example No 40 contains the parameter M > 1, and the results presented in Table 1 correspond to M = 1.5.
Let us first note that evaluating the performance of an algorithm for the bilevel optimization problem (1.1) is a difficult task, since deciding whether a computed point is (close to) a global solution of (1.1) basically requires computing the LLVF ϕ. Therefore, instead of doing this, we suggest the following way of comparing our results for Algorithm 4.1 with the results known from the literature for the test problems. For an approximate solution (x, y) obtained from Algorithm 4.1, we first compute a relative deviation based on F(x, y), where F_known, as above, is the best known F-value from the literature and f_known is the lower-level function value corresponding to F_known. Moreover, we set δ accordingly, also taking the lower-level objective values into account. In the latter case, δ can become negative. This means that both F(x, y) and f(x, y) are smaller than the values for the point with the best F-value known from the literature. In the last column of Table 1, marked by δ*, we provide the smallest δ-value among those obtained for all λ ∈ Λ for the corresponding test problem.

We then report the performance of Algorithm 4.1 for λ ∈ Λ. Let K denote the iteration number at which Algorithm 4.1 is terminated. The first row in Table 2 shows the number of examples for which Algorithm 4.1 failed, i.e., for which it did not reach ‖Φ_λ(ζ^K)‖ ≤ ε. Failures were observed only for a very small portion of the 124 examples. The second row in Table 2 reports for how many examples the last step was a full Newton step, i.e., α_K = 1. This behavior could be observed for more than 100 examples, regardless of λ ∈ Λ.
We also want to consider whether y^K ≈ z^K occurs. As mentioned in Remark 3.4, y = z generates another well-known type of stationarity conditions, i.e., (3.6)-(3.10) and (3.12). Moreover, while w is a multiplier associated with the lower-level constraint function g in connection with the lower-level problem, v is associated with the same function g, but in connection with the upper-level problem. The analogy between the complementarity systems (3.9) and (3.10) raises the question whether v and w coincide in some examples. In Table 2, the fifth and sixth rows list how many examples, for each λ, were solved with y^K ≈ z^K or v^K ≈ w^K. The symbol "≈" used here means that ‖y^K − z^K‖/max(1, ‖z^K‖) ≤ 0.01 and ‖w^K − v^K‖/max(1, ‖v^K‖) ≤ 0.01, respectively. More information about the test runs and the running times of Algorithm 4.1 is provided by the last two rows of Table 2. The average number of iterations and the average CPU time are less than 500 and 0.5 seconds, respectively, over all runs without failure. Finally, in order to estimate the local behaviour of Algorithm 4.1 on our test runs, Figure 1 reports on the experimental order of convergence (EOC). More details on the numerical results discussed here can be found in the supplementary material [21].
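The EOC formula itself is not reproduced in this text. A common choice in the semismooth Newton literature (an assumption here, not necessarily the paper's exact definition) is the ratio of the logarithms of the last two residual norms:

```python
import math

# Hypothetical EOC computation from the final two residual norms of a run;
# values around 2 indicate quadratic-type local behavior, values around 1
# indicate linear behavior.
def eoc(res_prev, res_last):
    """Both residual norms are assumed to lie in (0, 1)."""
    return math.log(res_last) / math.log(res_prev)
```

For example, a residual drop from 1e-4 to 1e-8 over the final step yields an EOC of 2 under this definition.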