Accelerated gradient methods with absolute and relative noise in the gradient

In this paper, we investigate accelerated first-order methods for smooth convex optimization problems under inexact information on the gradient of the objective. The noise in the gradient is assumed to be additive, with two possibilities: absolute noise bounded by a constant, and relative noise proportional to the norm of the gradient. We investigate the accumulation of the errors in the convex and strongly convex settings, with the main difference from most previous works being that the feasible set can be unbounded. The key to the latter is a bound on the trajectory of the algorithm. We also give a stopping criterion for the algorithm and consider extensions to stochastic optimization and composite nonsmooth problems.


Introduction
We consider a convex optimization problem on a closed convex (not necessarily bounded) set Q ⊆ ℝⁿ:

min_{x∈Q} f(x). (1)

We assume that the objective f is L_f-smooth and strongly convex with parameter µ ≥ 0, i.e., for all x, y ∈ Q:

‖∇f(x) − ∇f(y)‖₂ ≤ L_f‖x − y‖₂,  f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖₂².

This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.
In the convergence rate analysis of first-order methods, these assumptions are typically used in the form of upper and lower quadratic bounds [4,6,9,14,20,23,24,27,28,37,40,51,53] for the objective: for all x, y ∈ Q,

f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖₂² ≤ f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L_f/2)‖y − x‖₂². (2)

Note that the last relation is a consequence of the L_f-smoothness and, in general, is not equivalent to it [28,52]. In many applications, instead of access to the exact gradient ∇f(x), an algorithm has access only to its inexact approximation ∇̃f(x). Typical examples include gradient-free (or zeroth-order) methods, which use a gradient estimator based on finite differences [7,11,47], and optimization problems in infinite-dimensional spaces related to inverse problems [29,34]. The two most popular definitions of gradient inexactness in practice are [45] as follows: for all x ∈ Q it holds that

‖∇̃f(x) − ∇f(x)‖₂ ≤ δ (absolute error), (3)

or

‖∇̃f(x) − ∇f(x)‖₂ ≤ α‖∇f(x)‖₂, α ∈ [0, 1) (relative error). (4)
Under assumption (3), many results exist for non-accelerated and accelerated first-order methods, see, e.g., [1,10,12,45]. These results are in a sense pessimistic in general, with the explanation going back to the analysis in [44]. We can explain this by a very simple example. Consider the problem

min_{x∈ℝⁿ} f(x) = ½ ∑_{i=1}^n λᵢxᵢ²,

where 0 < µ = λ₁ ≤ λ₂ ≤ … ≤ λₙ = L_f and L_f ≥ 2µ. Clearly, the solution of this problem is x* = 0. Assume that the inexactness takes place only in the first component x₁, i.e., instead of ∂f(x)/∂x₁ = µx₁ we have access to the inexact value µx₁ − δ, where δ is the error. For simple gradient descent we can conclude that if x₁⁰ ≠ 0, then for all k ∈ ℕ large enough, i.e., k ≳ L_f/µ, it holds that x₁ᵏ ≈ δ/µ. Hence,

f(xᵏ) − f(x*) ≈ δ²/(2µ). (5)

From this result, we see that it may be problematic to approximate f(x*) with any desired accuracy, especially in the ill-conditioned setting when the strong convexity constant µ is smaller than the desired accuracy ε. For accelerated gradient methods the situation may be even worse, since they are more sensitive to gradient errors, and such errors may even be accumulated by the algorithm [15,21,28]. This drawback may be overcome by a certain stopping rule, so that the algorithm does not try to minimize below some threshold given by the gradient error, or by adding a strongly convex regularizer with coefficient µ of the same order as the desired accuracy ε, see [28,38,44,45]. Roughly speaking, for non-accelerated algorithms it was proved in [44,45] that if δ is of the order ε², then it is possible to reach ε-accuracy in the objective residual in almost the same number of iterations as in the exact case δ = 0 by applying a computationally convenient stopping rule.
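The error floor in this example is easy to reproduce numerically. The following minimal sketch is our own illustration, with hypothetical values µ = 10⁻³, L_f = 1, δ = 10⁻⁴: it runs gradient descent with the perturbed first partial derivative and compares the resulting objective gap with the predicted level δ²/(2µ).

```python
import numpy as np

# Quadratic f(x) = 0.5 * sum(lam_i * x_i^2); here mu = lam_1, L_f = lam_n.
mu, L_f = 1e-3, 1.0
lam = np.array([mu, 0.5, L_f])
delta = 1e-4                       # absolute error in the first component only

def grad_inexact(x):
    g = lam * x
    g[0] -= delta                  # we observe mu * x_1 - delta instead of mu * x_1
    return g

x = np.ones(3)
for _ in range(20_000):            # k >> L_f / mu iterations
    x = x - grad_inexact(x) / L_f  # gradient descent with step 1 / L_f

f_gap = 0.5 * np.sum(lam * x**2)   # f(x_k) - f(x*), since f(x*) = 0
print(f_gap, delta**2 / (2 * mu))  # the two numbers nearly coincide
```

The first coordinate stalls near δ/µ, so the objective gap saturates near δ²/(2µ) = 5·10⁻⁶ no matter how many further iterations are made.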
In this paper, we analyze an accelerated gradient method in both convex and strongly convex settings and estimate how the gradient error defined in (3) influences the convergence rate. An important part of our contribution is that our analysis is made without the assumption that the feasible set Q is bounded. The main key for this development is a recurrent estimate for the distance between the current iterates and the optimal solution closest to the starting point. In particular, our results imply that it is sufficient to assume that δ is of the order ε in order to obtain an objective residual of the order ε. We also present a stopping rule and prove that if it is satisfied at some iteration, the algorithm solves problem (1) with certain accuracy. Moreover, we prove that until this rule is fulfilled, the trajectory of the algorithm is bounded (which helps us to treat the setting of a possibly unbounded set Q) and that it is fulfilled for sure in a number of iterations which is optimal for the class of smooth convex optimization problems.
Under assumption (4), the non-accelerated gradient method for strongly convex problems is shown in [45] to have linear convergence with condition number O(1/(1−α) · L_f/µ), i.e., 1/(1−α) times worse than in the exact case. Yet, convergence to any small error is guaranteed, unlike the case of inexactness (3). This result also holds under the relaxed strong convexity assumption [28] known as the Polyak–Lojasiewicz or gradient domination condition. We are not aware of any such results for accelerated gradient methods.
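The contrast with the absolute-noise case can be seen in a small experiment. The sketch below is our own illustration, not taken from [45]: the step size (1−α)/L_f and the random noise model are assumptions made for simplicity. Each gradient is corrupted by a perturbation of relative size α = 0.3, yet the objective gap decreases toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.01, 0.5, 1.0])       # mu = 0.01, L_f = 1
L_f, alpha = 1.0, 0.3                  # relative error level alpha < 1

def grad_relative(x):
    g = lam * x
    e = rng.standard_normal(3)
    e *= alpha * np.linalg.norm(g) / np.linalg.norm(e)
    return g + e                       # ||grad_tilde - grad|| = alpha * ||grad||

x = np.ones(3)
for _ in range(20_000):
    x = x - (1 - alpha) / L_f * grad_relative(x)

f_gap = 0.5 * np.sum(lam * x**2)
print(f_gap)                           # arbitrarily small: no error floor here
```

Since the noise shrinks together with the gradient, the method converges to the exact optimum, only with a worse effective condition number.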
In this paper, we analyze an accelerated gradient method under inexact gradients satisfying (4) and answer the question of what is the maximum value of α such that the accelerated algorithm with inexact gradients converges with the same rate as the exact accelerated algorithm. For the case µ > 0 our answer is that α should satisfy α = O((µ/L_f)^{3/4}). We hypothesise that this bound can be improved to α = O(√(µ/L_f)) and that, for the case µ = 0, the iteration-dependent value α_k should satisfy α_k = O(1/k^{3/2}), where k is the iteration counter. Numerical experiments demonstrate that, in general, for α larger than the above thresholds the convergence may slow down significantly, up to divergence of the considered accelerated method.
Close results, with the bound α = O(√(µ/L_f)) in the case µ ∼ ε, were recently obtained by other techniques in stochastic optimization with decision-dependent distributions [17] and in policy evaluation in reinforcement learning via a reduction to a stochastic variational inequality with Markovian noise [36]. In [17,36], the authors assumed that, for all x ∈ Q,

‖∇̃f(x) − ∇f(x)‖₂ ≤ B‖x − x*‖₂. (7)

Since x* is a solution, when Q = ℝⁿ, we have ∇f(x*) = 0. Therefore, ‖∇̃f(x) − ∇f(x)‖₂ ≤ α‖∇f(x) − ∇f(x*)‖₂ ≤ αL_f‖x − x*‖₂. Thus, if (4) holds, then (7) also holds with B = αL_f.

Absolute noise
Important results on gradient error accumulation for first-order methods were developed in a series of works by O. Devolder, F. Glineur and Yu. Nesterov in 2011–2014 [13–16]. In these works, the authors were motivated by inequalities (2). Their idea was to relax (2), assuming inexactness in the gradient and introducing an inexact gradient ∇̃f(x) satisfying, for all x, y ∈ Q,

f(x) + ⟨∇̃f(x), y − x⟩ ≤ f(y) ≤ f(x) + ⟨∇̃f(x), y − x⟩ + (L/2)‖y − x‖₂² + δ. (8)

This assumption allows one to develop a theory of error accumulation for first-order methods. In particular, they obtained the following convergence rates for non-accelerated gradient methods:

f(x_k) − f(x*) = O(LR²/k + δ), (9)

and for accelerated methods:

f(x_k) − f(x*) = O(LR²/k² + kδ), (10)

where R is such that ‖x_start − x*‖₂ ≤ R, i.e., an estimate for the distance between the starting point x_start and a solution x*. If x* is not unique, one may take x* to be the closest point to x_start. Both of these bounds are unimprovable [15,16]. See also [14,21,35] for "intermediate" situations between accelerated and non-accelerated methods and extensions to stochastic optimization.
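Ignoring constant factors, the qualitative difference between the two regimes can be tabulated directly. The sketch below simply evaluates the O-bounds of the form LR²/k + δ (non-accelerated) and LR²/k² + kδ (accelerated) for sample values of k, with L = R = 1 and δ = 10⁻³ chosen purely for illustration: the non-accelerated bound stagnates at the level δ, while the accelerated bound eventually grows because of the kδ accumulation term.

```python
L, R, delta = 1.0, 1.0, 1e-3

for k in [10, 100, 1_000, 10_000]:
    basic = L * R**2 / k + delta         # non-accelerated: noise does not accumulate
    accel = L * R**2 / k**2 + k * delta  # accelerated: linear noise accumulation
    print(f"k={k:6d}  basic={basic:.2e}  accel={accel:.2e}")
```

For moderate k the accelerated bound is smaller, but past k of order (LR²/δ)^{1/3} the accumulated term kδ dominates, which is exactly why a stopping rule is needed for accelerated methods.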
Following [16], it is possible to reduce the "absolute noise" inexactness in the sense of (3) to the inexactness in the sense of (8) by setting L_(8) = 2L_f and δ_(8) = O(δ_(3)²/µ) when µ > 0. From this reduction, we see that when µ > 0, for non-accelerated methods the result (9) is almost the same as in the example (5). We also see that, if the error can be controlled, then to guarantee f(x_k) − f(x*) ≤ ε for the non-accelerated method when² µ = Ω(ε) we should set δ_(3) = O(ε), which is an expected result. Unfortunately, for accelerated methods, such a reduction leads to the bound δ_(3) = O(ε^{3/2}), which is worse than our bound indicated in Section 1. The key to our improvement is a more refined version of (8).
In the works [15,18,19,50,51], the following refined version of (8) is used: for all x, y ∈ Q,

f(x) + ⟨∇̃f(x), y − x⟩ − δ‖y − x‖₂ ≤ f(y) ≤ f(x) + ⟨∇̃f(x), y − x⟩ + (L/2)‖y − x‖₂² + δ‖y − x‖₂.

These inequalities lead to the following counterparts of (9) and (10): for non-accelerated gradient methods,

f(x_k) − f(x*) = O(LR²/k + δR̃), (13)

and for accelerated gradient methods [15,19],

f(x_k) − f(x*) = O(LR²/k² + δR̃), (14)

where R̃ is the maximum distance between the iterates generated by the algorithm and the solution x* closest to the starting point.
2 If µ ≲ ε, we can regularize the problem and guarantee that µ = Ω(ε), see [28]. Another advantage of strong convexity is the possibility to use the norm of the inexact gradient in the stopping criterion, see [28,44]. Yet, regularization requires [28] some prior knowledge about the distance to the solution. Since we typically do not have such information, the procedure becomes more involved and relies on the restart technique, see [27,28].
From (13) and (14), we see that if R̃ is bounded,³ then by setting δ = O(ε/R̃) we obtain the desired result: it is possible to guarantee f(x_k) − f(x*) ≤ ε. Previous works mainly rely on the assumption that R̃ is bounded. As we may see from example (5), in general, when the strong convexity parameter µ is small compared to the desired accuracy ε, only a pessimistic bound on R̃ is possible to obtain [28], which leads to very pessimistic estimates. Moreover, growth of R̃ caused by error accumulation is observed both in numerical experiments and in theoretical estimates. In our work, we investigate this problem and, in particular, propose an alternative to the regularization⁴ approach, based on "early stopping"⁵ of the considered iterative procedure via a properly developed stopping rule.

Relative noise
We now explain a way to reduce the relative inexactness in the sense of (4) to the inexactness in the sense of (8), which allows us to apply (10) when µ ≳ ε. Since f(x) has Lipschitz gradient, from (4) and (8) we can derive that after k iterations (where k is greater than √(L_f/µ) by a logarithmic factor log(L_f R²/ε), with ε being the desired accuracy in terms of the objective residual) the following restart condition holds: the objective residual is halved. When the restart condition holds, we restart the method. Then, after log(∆f/ε) restarts, we can guarantee the desired ε-accuracy in terms of the objective residual. In the ill-conditioned setting, i.e., when µ is small, the calculations are more involved. Yet, the main idea remains the same, and replacing √(L_f/µ) with k (cf. (10)) we obtain that the inequality α_k ≲ 1/k^{3/2} allows us to obtain the same convergence rate as in the case of exact gradients. Among the many types of accelerated gradient methods, we choose to consider methods with one projection (Similar Triangles Methods, STM), see [10,23,30,32,50] and references therein. We choose this type of accelerated methods since: 1) it is primal-dual [22,30]; 2) it is possible to bound R̃ in the absence of noise [30,40,50] and when the noise is present [32,33]; 3) it has previously been intensively investigated, see [23] and references therein.
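The restart idea can be sketched as follows. This is our own minimal illustration under simplifying assumptions: exact gradients, Q = ℝⁿ, a quadratic test function, and a similar-triangles-type accelerated scheme as the inner method; the inner budget of about √(8L/µ) steps is chosen so that the objective gap is at least halved between restarts.

```python
import numpy as np

lam = np.array([0.05, 0.5, 1.0])            # mu = 0.05, L = 1
L, mu = 1.0, 0.05
grad = lambda x: lam * x
f = lambda x: 0.5 * np.sum(lam * x * x)

def stm_convex(x0, n_steps):
    """Similar-triangles-type accelerated method for the convex case."""
    x = z = x0.astype(float)
    A = 0.0
    for _ in range(n_steps):
        a = (1 + np.sqrt(1 + 4 * L * A)) / (2 * L)  # solves L * a^2 = A + a
        A_new = A + a
        y = (A * x + a * z) / A_new                 # extrapolation point
        z = z - a * grad(y)
        x = (A * x + a * z) / A_new
        A = A_new
    return x

x = np.ones(3)
n_inner = int(np.ceil(np.sqrt(8 * L / mu)))  # enough steps to halve f - f*
for _ in range(10):                          # each restart at least halves the gap
    x = stm_convex(x, n_inner)
print(f(x))                                  # overall linear convergence
```

Each restart resets the momentum, so strong convexity converts the 1/k² rate of the inner method into a geometric decrease of the objective gap.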

Some motivation for inexact gradients
In this section, we describe two research directions, among many others, where inexact gradients play an important role. We emphasise that, although the results below are not new, the way they are presented is, in our opinion, of some value and can be useful to specialists in these areas.

Gradient-free methods
In this subsection, we consider the convex optimization problem (1), where Q is a convex and closed set. In some applications we do not have access to the gradient ∇f(x) of the objective function, but we can calculate the value⁶ of f(x) with accuracy δ_f [11], i.e., one can evaluate f̃(x) such that

|f̃(x) − f(x)| ≤ δ_f for all x ∈ Q.
An interesting question in this setting is as follows. If the accuracy δ_f of the approximation can be controlled, how should it be chosen in order to guarantee a desired accuracy ε when solving problem (1)? A related question is: what is the largest level of noise δ_f such that the algorithm can still achieve a desired accuracy ε?
In the considered setting, a number of options exist for approximating the gradient, see, e.g., [7] and references therein. We consider the following examples, assuming that f has L_p-Lipschitz p-th order derivatives w.r.t. the Euclidean norm.
• (p-th order finite differences). In this case, the gradient approximation is constructed via finite differences of the inexact values f̃(x), which, e.g., in the case of p = 2, leads to the following approximation of the i-th partial derivative:

(f̃(x + he_i) − f̃(x − he_i))/(2h),

where e_i is the i-th coordinate vector and h > 0 is a parameter. For general values of p, we have that (3) holds with δ = O(√n(L_p h^p + δ_f/h)), see [7]. The optimal choice of h guarantees that δ = O(√n · δ_f^{p/(p+1)}). From Section 1, we know that it is possible to solve problem (1) with accuracy ε = O(δ) in terms of the objective value. Hence, in order to guarantee ε-accuracy, we should choose δ_f = O((ε/√n)^{(p+1)/p}).
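A minimal sketch of this estimator follows; it is our own illustration with a hypothetical smooth test function and a noise level δ_f = 10⁻⁸, and the balancing h ~ δ_f^{1/3} corresponds to the case p = 2.

```python
import numpy as np

rng = np.random.default_rng(1)
delta_f = 1e-8                            # accuracy of function evaluations

def f_noisy(x):
    exact = 0.5 * np.dot(x, x) + np.sin(x[0])
    return exact + delta_f * (2 * rng.random() - 1)   # |f_noisy - f| <= delta_f

def fd_gradient(x, h):
    """Central finite differences built from inexact function values."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f_noisy(x + e) - f_noisy(x - e)) / (2 * h)
    return g

x = np.array([0.3, -0.7, 1.1])
true_grad = x + np.array([np.cos(x[0]), 0.0, 0.0])
# bias ~ h^2 (Taylor remainder) vs. noise ~ delta_f / h: balance at h ~ delta_f**(1/3)
g = fd_gradient(x, h=delta_f ** (1 / 3))
print(np.linalg.norm(g - true_grad))      # small, but cannot be driven to zero
```

Taking h too small amplifies the function-value noise, taking it too large inflates the Taylor bias; the printed error reflects the best achievable compromise.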
Unfortunately, such a simple idea does not allow one to reach the corresponding lower bound (16) on the admissible noise level in the class of algorithms with this sample complexity. Note that, instead of the finite-difference approximation approach, in some applications one can use the kernel approach [3,46], which has recently received renewed interest [2,42].

• (Gaussian smoothed gradients). In this case, the approximate gradient is formally defined via finite differences of f̃ along Gaussian directions, e.g.,

∇̃f(x) = E_e[(f̃(x + he) − f̃(x − he))/(2h) · e],

where e ∼ N(0, I_n) is a standard Gaussian random vector. This implies that (3) holds with an appropriate δ, see [7,41]. The optimal choice of h guarantees that δ = O((nδ_f)^{p/(p+1)}). Hence, in order to guarantee ε-accuracy, we should choose δ_f = O(ε^{(p+1)/p}/n). This bound does not match the lower bound (16) either. Moreover, here (and in the next approach) we have an additional difficulty, since the expectation defining ∇̃f(x) is, in general, not possible to evaluate exactly and only an inexact approximation is available, for example, by the Monte Carlo approach [7], which leads to an additional computational price for a better quality of approximation.
• (Sphere smoothed gradients). In this case, the approximate gradient is formally defined as

∇̃f(x) = E_e[n(f̃(x + he) − f̃(x − he))/(2h) · e],

where e is a random vector uniformly distributed on the unit sphere in ℝⁿ centered at 0. This implies that (3) holds with an appropriate δ, see [7]. The optimal choice of h guarantees δ = O((nδ_f)^{p/(p+1)}). Hence, in order to guarantee ε-accuracy, we should choose δ_f = O(ε^{(p+1)/p}/n). This bound does not match the lower bound (16) either. It may seem that this approach and the previous one are almost the same, but below we give a more accurate result for sphere smoothing. We are not aware of a way to obtain such a result for Gaussian smoothing. The result is as follows [15,47]: for the sphere smoothed gradient, (8) holds with

δ = O(√n δ_f/h), (17)

where L₀ is the Lipschitz constant of f, and in (8) L = min{L₁, 7L₀²/h} when p = 1 and L = 7L₀²/h when p = 0. The bound (17) is more accurate than the previous bounds, since it corresponds to the first part of the lower bound (16). Indeed, by choosing a proper h in (17) we obtain ε ∼ δ ∼ n^{1/4}δ_f^{1/2}. Hence, in order to guarantee ε-accuracy, we should choose δ_f = O(ε²/√n). The other part of the lower bound (16), i.e., the case when δ_f = O(ε/√n), is also tight, see [5]. Here we can also repeat the remark that the sphere smoothed gradient ∇̃f(x) is, in general, not available exactly and needs to be approximated by a stochastic inexact gradient. In Section 6, we describe an extension of our analysis of the accelerated gradient method with absolute noise in the gradient to the setting of stochastic gradients.
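The sphere-smoothing estimator is straightforward to simulate. The sketch below is our own illustration on a quadratic, where the single-sample estimator n(f̃(x+he) − f̃(x−he))/(2h) · e is unbiased assuming exact function values; averaging many samples recovers the true gradient up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 5, 1e-3
A = np.diag(np.arange(1.0, n + 1))         # f(x) = 0.5 * x^T A x
f = lambda x: 0.5 * x @ A @ x

def sphere_grad(x, n_samples):
    g = np.zeros(n)
    for _ in range(n_samples):
        e = rng.standard_normal(n)
        e /= np.linalg.norm(e)             # uniform direction on the unit sphere
        g += n * (f(x + h * e) - f(x - h * e)) / (2 * h) * e
    return g / n_samples

x = np.ones(n)
err = np.linalg.norm(sphere_grad(x, 50_000) - A @ x)
print(err)                                 # shrinks as the sample size grows
```

The factor n compensates for E[eeᵀ] = I/n on the sphere; in practice the per-sample variance is what forces the Monte Carlo averaging mentioned in the text.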
The bound (17) and its consequences additionally illustrate that the inexactness model and the algorithms we describe in Section 2 and develop below are tight (optimal) enough. Otherwise, it would not be possible to achieve the lower bound using the reduction of gradient-free methods to gradient methods with an inexact oracle and the proposed analysis of the error accumulation for gradient-type methods.

Inverse problems
Another rather important research direction where gradients are typically available only approximately is optimization in Hilbert spaces [54], arising, in particular, in inverse problems theory [34].
We start by recalling a way to calculate a derivative in a general Hilbert space. Let J(q) := J(q, u(q)), where u(q) is the unique solution of the equation G(q, u) = 0. Assume that the partial u-derivative G_u(q, u) of the operator G(q, u) is invertible. Then, we have G_q(q, u) + G_u(q, u)∇u(q) = 0, and ∇u(q) = −[G_u(q, u)]⁻¹G_q(q, u).
Therefore, ∇J(q) = J_q(q, u) + J_u(q, u)∇u(q) = J_q(q, u) − J_u(q, u)[G_u(q, u)]⁻¹G_q(q, u). The same result can be obtained by considering the Lagrange functional L(q, u; ψ) = J(q, u) + ⟨ψ, G(q, u)⟩ with L_u(q, u; ψ) = 0 and G(q, u) = 0, so that ∇J(q) = L_q(q, u; ψ). Indeed, by simple calculations, we can connect these two approaches by an appropriate choice of ψ. Next, we demonstrate this technique on an inverse problem based on an elliptic initial-boundary-value problem. Let u(x, y) be the solution of a problem which we refer to as (P); here we use subscripts x, y to denote the corresponding partial derivatives. The first two relations of (P) constitute the system of equations G(q, u) = 0, and the last two constitute the feasible set Q.
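For a finite-dimensional analogue, the two formulas can be checked numerically. In the sketch below (our own toy example, not from [34]) the state equation is G(q, u) = Bu − q = 0 with an invertible matrix B, the objective is J(q) = ‖Cu(q) − b‖₂², and the adjoint-based gradient ∇J(q) = L_q(q, u; ψ) is validated against a directional finite difference.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
B = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # invertible state operator
C = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def J(q):
    u = np.linalg.solve(B, q)              # state equation G(q, u) = B u - q = 0
    r = C @ u - b
    return r @ r

def grad_adjoint(q):
    u = np.linalg.solve(B, q)                            # 1) forward solve
    psi = np.linalg.solve(B.T, -2 * C.T @ (C @ u - b))   # 2) adjoint solve: G_u^* psi = -J_u
    return -psi                                          # 3) grad J = L_q = -psi (G_q = -I here)

q = rng.standard_normal(n)
g = grad_adjoint(q)
d = rng.standard_normal(n)
t = 1e-6
fd = (J(q + t * d) - J(q - t * d)) / (2 * t)  # directional derivative check
print(abs(fd - g @ d))                        # agreement up to rounding
```

One forward solve and one adjoint solve give the whole gradient, which is exactly the two-solve scheme described for problems (P) and (D) below.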
Hence, by (20), we reduce the calculation of ∇J(q)(y) to the solution of two correct initial-boundary-value problems for an elliptic equation on a square, namely problems (P) and (D) [34]. This result can also be interpreted in a slightly different manner if we introduce a linear operator A : q(y) := u(1, y) → u(0, y), where u(x, y) is the solution of problem (P). The explicit form of A and of the conjugate operator A* was obtained in [34]; here ψ(x, y) is the solution of the conjugate problem (D). Thus, considering J(q) = ‖Aq − b‖₂², we have ∇J(q) = 2A*(Aq − b), which completely corresponds to the scheme described above:

1. Based on q(y), we solve (P), obtain u(0, y) = Aq(y), and define p(y) = 2(u(0, y) − b(y)).
2. Based on p(y), we solve (D) and calculate ∇J(q)(y) = A*p(y) = ψ_x(1, y).

Summarizing, the inexactness in the gradient ∇J(q) arises since we can solve (P) and (D) only numerically, up to some accuracy.
The technique described above can be applied to many different inverse problems [34] and optimal control problems [54]. Note that, for optimal control problems, another strategy is widely used in practice. Namely, instead of approximate calculation of the gradient, the optimization problem is replaced by an approximate one (for example, by using finite-difference schemes). For this approximate (finite-dimensional) problem, the gradient is typically available exactly [25]. Moreover, in [25] the Lagrangian approach described above is used to explain the core of automatic differentiation, where the function calculation tree is represented as a system of explicitly solvable interlocking equations.

Absolute noise in the gradient
In this section, we consider problem (1) in the absolute noise setting (see (3)), i.e., we assume that the inexact gradient ∇̃f(x) satisfies, uniformly in x ∈ Q, the inequality ‖∇̃f(x) − ∇f(x)‖₂ ≤ δ. We underline that Q can be unbounded, for example ℝⁿ. Under this assumption, we present several important relations concerning "inexact smoothness" and "inexact strong convexity". Then, we present and analyze an accelerated gradient method, study its error accumulation, and propose a stopping rule.

Auxiliary facts
We start with some auxiliary facts and assumptions. Let x_start be some starting point for an algorithm and assume that there is a constant R such that ‖x_start − x*‖₂ ≤ R, where x* is a solution to problem (1). If x* is not unique, we take x* to be the solution closest to x_start. We assume that the function f has Lipschitz gradient with constant L_f, i.e., is L_f-smooth:

‖∇f(x) − ∇f(y)‖₂ ≤ L_f‖x − y‖₂ for all x, y ∈ Q.

This implies the inequality

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L_f/2)‖y − x‖₂². (23)

In what follows, we use the following simple lemma.
Let us introduce several constants which will be used below in this section, in particular L = 2L_f. From the L_f-smoothness assumption, we obtain the following upper bound for the objective through the inexact oracle.
Claim 4.1. For all x, y ∈ Q, the following estimate holds. Proof. The proof is given by a chain of inequalities. We also assume that f is strongly convex with parameter µ ≥ 0, where the case µ = 0 corresponds to plain convexity of f. This means that for all x, y ∈ Q:

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖₂². (24)

Based on this assumption and our assumption on the inexactness of the oracle, we can obtain two lower bounds for the objective. The first one is given by the following result.
Proof. Using the Cauchy–Schwarz inequality and (24), we obtain the first estimate. For the second estimate, we assume that µ ≠ 0.

Claim 4.3. For all x, y ∈ Q, if in (24) µ ≠ 0, the following estimate holds. Using Lemma 4.1, we obtain the claim. To unify the derivations based on Claims 4.2 and 4.3, we use the notation µ_τ, τ ∈ {1, 2}, where τ = 1 and µ₁ = µ correspond to the bound in Claim 4.2, and τ = 2 and µ₂ = µ/2 correspond to the bound in Claim 4.3 and the case when µ ≠ 0.

Similar Triangles Method and its properties
In this section, we introduce a variant of accelerated gradient method called the Similar Triangles Method (STM). The design of STM is similar to that of the algorithm in [30], with the main difference being that here we use an inexact gradient with absolute inexactness instead of the exact gradient. This change required us to modify the analysis accordingly, in order to take into account the presence of absolute inexactness in the gradient and the possible unboundedness of the feasible set Q.
or equivalently, 12: z_k = argmin_{y∈Q} ψ_k(y); 13: x_k = (A_{k−1}x_{k−1} + α_k z_k)/A_k; 14: end for; 15: Output: x_N. Equivalently, x_k = x̃_k + (α_k/A_k)(z_k − z_{k−1}), i.e., the triangles (z_{k−1}, x_{k−1}, z_k) and (x̃_k, x_{k−1}, x_k) are similar. When Q = ℝⁿ, the main step of the algorithm can be simplified to z_k = z_{k−1} − α_k∇̃f(x̃_k), using the first-order optimality condition in the definition of the point z_k. This method is quite simple to implement, since it requires only one projection, which can be eliminated in the absence of constraints, and it also has a geometric interpretation. The functions ψ_k(x) contain first-order information and are chosen in such a way that the inequalities guaranteed by convexity or strong convexity can be used to estimate the objective from below, providing a sequence of estimating functions. Moreover, since the functions ψ_k(x) accumulate the first-order information from the previous iterations, the update of the variable z_k can be seen as a momentum step that leads to the accelerated convergence rate. As will be seen in Remark 6.2 and Section 6.1, this method can be modified for composite nonsmooth optimization problems and stochastic problems.
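For Q = ℝⁿ and µ = 0, the method admits a compact implementation. The following is our own minimal sketch of the similar-triangles update with an inexact gradient; the quadratic test problem and the noise level δ = 10⁻⁹ are illustrative assumptions.

```python
import numpy as np

def stm(grad_f, x0, L, n_steps):
    """Similar Triangles Method sketch for Q = R^n, mu = 0, inexact gradient."""
    x = z = np.asarray(x0, dtype=float)
    A = 0.0
    for _ in range(n_steps):
        alpha = (1 + np.sqrt(1 + 4 * L * A)) / (2 * L)  # solves L * alpha^2 = A + alpha
        A_new = A + alpha
        x_tld = (A * x + alpha * z) / A_new   # "similar triangles" point
        z = z - alpha * grad_f(x_tld)         # unconstrained argmin of psi_k
        x = (A * x + alpha * z) / A_new
        A = A_new
    return x

# Example: quadratic objective with a small absolute gradient error.
rng = np.random.default_rng(4)
lam = np.array([0.1, 0.5, 1.0])
delta = 1e-9

def grad_f(x):
    e = rng.standard_normal(3)
    return lam * x + delta * e / np.linalg.norm(e)  # ||error|| = delta

x_N = stm(grad_f, np.ones(3), L=1.0, n_steps=300)
f_gap = 0.5 * np.sum(lam * x_N**2)            # f(x_N) - f(x*)
print(f_gap)
```

Only one "projection" (here a plain vector update for z_k) is needed per iteration, and A_N grows as N²/(4L), which yields the accelerated 1/N² rate up to the noise terms.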
In the analysis, we use the following identities, which easily follow from the construction of the algorithm:

A_k x_k = A_{k−1}x_{k−1} + α_k z_k,  A_k x̃_k = A_{k−1}x_{k−1} + α_k z_{k−1}. (25)

The following is the main technical result, which will be used later in the analysis.
Lemma 4.2. For all k ≥ 1, the following inequality holds. Proof. By the definition of ψ_k, we have the first relation. Further, by construction, the function ψ_{k−1} attains its minimum at the point z_{k−1}, which implies, by the optimality condition, the second relation. We also have the corresponding identity. Combining the above, applying this identity and the definition of the sequence {A_k}, we finally get the claimed bound.

Remark 4.1. In the case when µ = 0, we obtain the following particular case of the result of Lemma 4.2.

We finish this subsection with a series of technical results that estimate the growth of the sequence {A_k} and related sequences.

Claim 4.4. If µ ≠ 0, then for all k ≥ 1 the following inequality holds. Proof. Using the relation between α_k, A_k and A_{k−1}, we obtain a quadratic equation for A_k:

Solving this equation, we get
Using that, for 0 < x < 1, 1 + x ≥ e^{x/2}, we obtain the following result.
Claim 4.5. If µ ≠ 0, then for all k ≥ 1 the following inequality holds. Proof. Using the previous claim, we get A_k ≥ A_{k−j}λ^j_{µ_τ,L}, and, hence, A_{k−j}/A_k ≤ λ^{−j}_{µ_τ,L}. This gives the claim.

Claim 4.6. If µ = 0, then for all k ≥ 1 the following holds. Proof. If µ = 0, then A_k = Lα_k², and, solving the quadratic equation, we get an explicit expression for α_k. Then, by induction, it is easy to see the claimed bound. Proof. The proof follows from simple calculations, since {A_k} is non-decreasing:

Convergence rates under the absolute inexactness
In this section we obtain the main convergence rate results for Algorithm 1. We will use the following sequence:

R̃_k = max_{0≤j≤k} max{‖x_j − x*‖₂, ‖x̃_j − x*‖₂, ‖z_j − x*‖₂}. (29)

Proposition 4.4. The sequences {x_k}, {x̃_k}, {z_k} generated by Algorithm 1 satisfy, for all k ≥ 0, the stated inequality. Proof. We prove the result by induction. The induction basis for k = 0 follows from the facts that A₀ = α₀ = 1/L and x₀ = z₀, which imply the required relation by Claim 4.1. To make the induction step, we start from the following corollary of Claim 4.1. Using equations (25), this gives the next bound. By the induction hypothesis and since (µ_τ/2)‖z_k − x̃_k‖₂² ≥ 0, we further obtain a refined bound. Using Lemma 4.2, we get the result, which finishes the induction step and the proof.
Using the definition of {R̃_k} and {A_k}, we obtain the following simple corollary of the above proposition. We note that the above estimates hold both in the case µ = 0 and in the case µ ≠ 0.
The proof of the following result repeats verbatim the proof of Proposition 4.4, except for Claim 4.2 being replaced by Claim 4.3.
Proposition 4.6. Assume that the oracle error δ satisfies δ = 0 and that ‖x₀ − x*‖₂ ≤ R for some R. Then, for all k ≥ 1, R̃_k ≤ R.
Proof. We first prove that, for all k ≥ 0, ‖z_k − x*‖₂ ≤ R. Let us fix k ≥ 0. By Proposition 4.4, we have A_k f(x_k) ≤ ψ_k(z_k). Further, ψ_k(x) is strongly convex with constant at least 1. At the same time, by the strong convexity of f, we have the corresponding lower bound for the objective. Using these three facts and the definition of ψ_k(x), we obtain ‖z_k − x*‖₂ ≤ R. For the remaining two sequences, {x_k} and {x̃_k}, the proof is organized by induction. Clearly, ‖x̃₀ − x*‖₂ ≤ R. Since, by construction, x₀ = z₀, we have ‖x₀ − x*‖₂ ≤ R. Then, by the construction of the algorithm, the convexity of the norm, and the induction hypothesis, we have ‖x_k − x*‖₂ ≤ R. In the same way, we obtain ‖x̃_k − x*‖₂ ≤ R using the definition of x̃_k. Using the above results, we obtain the following convergence rate result for the STM algorithm.
If µ = 0, the sequences {x_k}, {x̃_k}, {z_k} generated by Algorithm 1 satisfy, for all N ≥ 0, the corresponding inequality, where the sequence {R̃_k} is defined in (29).
Proof. The proofs of the first and second inequalities are nearly the same, the only difference being that the proof of the first inequality is based on Proposition 4.4 and Claim 4.2, whereas the proof of the second inequality is based on Proposition 4.5 and Claim 4.3. Thus, we give only the proof of the first inequality. From (30), by the definition of {z_N} and {ψ_N(·)}, and Claim 4.2, we obtain the required relation. Using Claim 4.4 with Corollary 4.3 and Claim 4.5, we get the first inequality. Repeating the same steps and using Claims 4.6 and 4.7, we prove the third inequality.
Commenting on the results obtained in Theorem 4.7, we can conclude that, in the case of strong convexity and in the presence of absolute noise, STM converges in terms of the objective value up to some limiting accuracy. Namely, the convergence rate bound is the sum of the convergence rate of the optimal method for the class of strongly convex and Lipschitz-smooth problems and a term characterizing the limiting error caused by the noise. In the case when µ = 0, we obtain a weaker convergence rate statement, since in the estimate we see a linear accumulation of the noise in the term Nδ₂, as well as in the term 3R̃_Nδ₁ (note that Proposition 4.6 gives the estimate for R̃_N only in the absence of noise). This motivates us to use the regularization technique to reduce the convex case to the strongly convex case, which is considered in Remark 4.2 below.
Another way to deal with the noise accumulation is to introduce a stopping rule, which is done below in Section 4.4.
Remark 4.2. We can reduce the setting with µ = 0 to the setting with µ ≠ 0. Indeed, suppose that µ = 0 and consider the following regularized problem:

min_{x∈Q} f_µ(x) := f(x) + (µ/2)‖x − x₀‖₂².

Then, we have min_{x∈Q} f_µ(x) ≤ min_{x∈Q} f(x) + (µ/2)‖x₀ − x*‖₂². Clearly, f_µ(x) has Lipschitz gradient: for all x, y ∈ Q, ‖∇f_µ(x) − ∇f_µ(y)‖₂ ≤ (L_f + µ)‖x − y‖₂. Since µ ≤ L, we have that f_µ(x) is L_µ-smooth with L_µ = 4L_f = 2L. Moreover, f_µ(x) is strongly convex, and we can apply the derivations corresponding to the case τ = 2. Using Theorem 4.7 and setting x*_µ = argmin_{x∈Q} f_µ(x) and R_µ such that ‖x*_µ − x₀‖₂ ≤ R_µ, we obtain the corresponding inequalities for f_µ. Translating this to the original objective f and using the strong convexity of the function f_µ, we obtain the final convergence rate. To obtain an error ε in the r.h.s., we choose µ = 2ε/(3R²).
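A minimal numerical sketch of this reduction follows. It is our own illustration: f is a quadratic with a zero eigenvalue, hence convex but not strongly convex, and for brevity the strongly convex surrogate is minimized by plain gradient descent rather than by Algorithm 1.

```python
import numpy as np

lam = np.array([0.0, 0.5, 1.0])         # lam[0] = 0: f is convex, not strongly convex
f = lambda x: 0.5 * np.sum(lam * x**2)
grad_f = lambda x: lam * x

x0 = np.ones(3)
eps = 1e-6
R = np.linalg.norm(x0)                  # here x* = 0, so ||x0 - x*|| is known
mu = 2 * eps / (3 * R**2)               # regularization weight from Remark 4.2

grad_reg = lambda x: grad_f(x) + mu * (x - x0)   # gradient of f_mu
L_mu = 1.0 + mu

x = x0.copy()
for _ in range(5_000):
    x = x - grad_reg(x) / L_mu          # minimize the strongly convex surrogate

print(f(x))                             # f(x) - f(x*) = O(eps)
```

The bias introduced by the regularizer is at most (µ/2)R² = ε/3, so minimizing f_µ accurately yields an O(ε)-solution of the original problem.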

Stopping rule under the absolute inexactness
In this subsection, we consider the setting with τ = 1 and µ = 0. In this case, a possible drawback of the convergence rate obtained in Theorem 4.7 is that the sequence {R̃_N} may increase as N increases. To overcome this, we formulate a certain condition (stopping rule) and prove that if it is satisfied at iteration N, the algorithm solves problem (1) with certain accuracy, and if it is not satisfied at iteration N, then R̃_N ≤ R. Moreover, we estimate the maximum number of iterations needed to satisfy this condition.
Theorem 4.8. Consider the setting τ = 1 and µ = 0 and assume that ‖x₀ − x*‖₂ ≤ R for some R. Let ε > 0 be the desired solution accuracy, and let N be the first iteration at which inequality (32) holds. Then, for all k ∈ {0, …, N − 1}, we have that R̃_k ≤ R. Moreover, N ≤ N_max.

Proof. Fixing any k ≥ 0, applying Proposition 4.4, the fact that the 1-strongly convex function ψ_k(·) attains its minimum at the point z_k, the definition of this function, and Claim 4.2, we obtain the key estimate (34). Setting k = 0, since ‖x₀ − x*‖₂ ≤ R and, by the assumption of the theorem, inequality (32) does not hold for k ≤ N − 1, we obtain (using also that α₀ = A₀) that ‖z₀ − x*‖₂ ≤ R, and, since x₀ = z₀, that R̃₀ ≤ R. Let us now assume that, for some k ≤ N − 1, R̃_{k−1} ≤ R (see (29) for the definition of {R̃_k}). Then, by the definition of x̃_k in Algorithm 1 and the convexity of the norm, we have that ‖x̃_k − x*‖₂ ≤ R. Further, since k ≤ N − 1, inequality (32) does not hold. Thus, from (35), we have that ‖z_k − x*‖₂ ≤ R, and, by the definition of x_k and the convexity of the norm, that ‖x_k − x*‖₂ ≤ R. Hence, R̃_k ≤ R. In summary, we obtain by induction that, for all k ∈ {0, …, N − 1}, R̃_k ≤ R. This also implies that ‖x̃_N − x*‖₂ ≤ R. We now prove the second statement of the theorem. Assume the opposite, i.e., N > N_max. We use (34) with k = N − 1 and obtain, since R̃_{N−1} ≤ R and using Claim 4.6 together with N > N_max, that inequality (32) holds after N − 1 iterations. This contradicts the definition of N as the first iteration number for which this inequality holds, which finishes the proof.
Combining (32) with Claim 4.7 and the fact that N ≤ N_max, we obtain a bound on f(x̃_N) − f(x*). Thus, if we redefine ε → ε/3 and choose δ accordingly, we obtain an ε-accurate solution. In some situations we have at our disposal the value f(x*) or its estimate. For example, when solving systems of linear equations by reformulating them as minimization problems: if a solution exists, we have f* = f(x*) = 0. This allows us, based on (34), to change inequality (32) to a more adaptive counterpart (36), which can be checked online and which can be fulfilled much earlier than (32). If this inequality is not fulfilled at iteration k, we have that R̃_k ≤ R. Moreover, we also obtain that (36) holds after no more than N_max iterations.
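A sketch of the online variant with known f(x*) follows. This is our own simplified illustration: the certified threshold 3R̃δ and the use of plain gradient descent are assumptions made for brevity; the actual rule (36) involves the quantities A_k and ψ_k from Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = np.array([0.2, 0.5, 1.0])
delta, L_f = 1e-4, 1.0
f = lambda x: 0.5 * np.sum(lam * x**2)

def grad_inexact(x):
    e = rng.standard_normal(3)
    return lam * x + delta * e / np.linalg.norm(e)   # absolute error delta

x = np.ones(3)
f_star = 0.0                       # known optimal value (e.g., consistent linear system)
R = np.linalg.norm(x)              # known distance estimate to x* = 0
threshold = 3 * R * delta          # do not optimize below the noise-induced level

stopped_at = None
for k in range(10_000):
    if f(x) - f_star <= threshold: # "early stopping": the rule is checked online
        stopped_at = k
        break
    x = x - grad_inexact(x) / L_f

print(stopped_at, f(x) - f_star)
```

The loop terminates long before the iteration budget is exhausted, and while it runs, the iterates never need to leave the ball of radius R around x*, which is the point of the trajectory bound.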

Relative noise in the gradient
In this section, we consider problem (1) in the relative noise setting (see (4)), i.e., we assume that the inexact gradient ∇̃f(x) satisfies, uniformly in x ∈ Q, the inequality ‖∇̃f(x) − ∇f(x)‖₂ ≤ α‖∇f(x)‖₂. As in the previous section, we assume that f is L_f-smooth. We also assume that f is strongly convex with µ ≠ 0 and that Q = ℝⁿ. For this setting, we analyze a slightly different version of the accelerated gradient method, adopted from [50].
Since Q = ℝⁿ, the main step of the algorithm can be simplified accordingly. Combining Definition 3.3 of [50] with Claims 4.1 and 4.3 and the standard Euclidean choice of V[y](x), we obtain an inexact model with error δ₂ + δ₃, where we used that µ ≤ L_f and that µ₂ = µ/2. Further, L in Definition 3.3 of [50] can be set to L in our paper, and µ in Definition 3.3 of [50] can be set to µ₂ = µ/2 in our paper. Algorithm 2 is a particular case of Algorithm 2 in [50]. Since in this section we are in the setting of relative inexactness (4), in each iteration of this algorithm we have a different error δ_k = α‖∇f(y_k)‖₂, which gives the corresponding expression for δ_k in Algorithm 2 of [50]. Applying Theorem 3.4 of [50], we obtain a convergence rate valid for all N ≥ 0. Since we assumed that Q = ℝⁿ, we have that ∇f(x*) = 0 and that, for all x, ‖∇f(x)‖₂ ≤ L_f‖x − x*‖₂, where we used (23) and our definition L = 2L_f. Then, using the convergence rate for {u_k}, we obtain the corresponding bound.
Using the convexity of f and the definition of the sequence {y_k}, we obtain the next estimate. Our next goal is to bound the quantity α_{N+1}/A_{N+1} · L/(4(1 + A_k µ₂)) from above. Using the inequalities A_k/(1 + µ₂A_k) ≤ 1/µ₂ and √(x + y) ≤ √x + √y, together with the definition of the sequence {α_k}, we arrive at the following estimate, where we used that L/µ₂ ≥ 1 and the definition of κ.
Since f is L_f-smooth, L = 2L_f and ∇f(x*) = 0, we obtain a bound for any x. Whence, using the previous bound and introducing the notation λ, we obtain a recurrence, where we add the term corresponding to k = 0 to the sum to simplify the proof that will follow. Analyzing this recurrence, we obtain the following claim.
Claim 5.1. For all k ≥ 1 the stated bound holds. Proof. The induction basis k = 1 is obvious. The induction step gives the required inequality. This gives us the following result: by the definition of θ = 15α²/4, we obtain that, if we choose α according to condition (37), the bound simplifies. Combining this with Claim 4.4 and Corollary 4.3, we arrive at the following theorem.
Theorem 5.1. Assume that the objective f is L_f-smooth and strongly convex with µ > 0, that the inexactness in the gradient is described by (4), and that Q = Rⁿ. Also assume that α is chosen according to (37). Then, for all k ≥ 1, the sequence {y_k} generated by Algorithm 2 satisfies the stated convergence rate.
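To make the flavor of Theorem 5.1 concrete, the following sketch runs an accelerated scheme with relative gradient noise on a strongly convex quadratic and checks that convergence survives a small α. This is a minimal illustration, not the paper's Algorithm 2: it uses the textbook constant-momentum variant of Nesterov's method, and the quadratic, the noise direction, and all parameter values are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly convex quadratic f(x) = 0.5 * x^T diag(d) x; mu and L_f are exact.
d = np.array([0.1, 1.0, 5.0, 10.0])
mu, L_f = d.min(), d.max()

def f(x):
    return 0.5 * np.dot(d * x, x)

def noisy_grad(x, alpha):
    g = d * x                                    # exact gradient
    u = rng.standard_normal(g.shape)
    u /= np.linalg.norm(u)
    return g + alpha * np.linalg.norm(g) * u     # relative error model (4)

def agd(x0, alpha, iters):
    # constant-momentum Nesterov scheme for strongly convex objectives;
    # a stand-in for Algorithm 2 (STM), whose exact steps are in the paper
    q = np.sqrt(mu / L_f)
    beta = (1.0 - q) / (1.0 + q)
    x = y = x0.copy()
    for _ in range(iters):
        x_new = y - noisy_grad(y, alpha) / L_f
        y = x_new + beta * (x_new - x)
        x = x_new
    return x

x0 = np.ones(4)
gap_exact = f(agd(x0, alpha=0.0, iters=300))     # exact oracle
gap_noisy = f(agd(x0, alpha=0.05, iters=300))    # 5% relative noise
```

With α = 0.05, well below √(µ/L_f) = 0.1, the noisy run still drives the objective gap many orders of magnitude down, consistent with the linear-rate claim.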

Extensions
In this section, we extend the analysis of Algorithm 1 with absolute noise to two settings. The first is the stochastic optimization setting, where the error in the gradient has a stochastic nature. The second is the structured nonsmooth setting of composite minimization, where the objective is given as a sum of a smooth part with inexact gradient and a simple convex function. In both cases, the analysis mainly follows the lines of Section 4. Thus, we underline the differences and skip in the proofs some steps that are similar to the analysis in that section.

Random additive noise in the gradient
In this subsection, we extend the analysis of Algorithm 1 to the setting of random absolute noise in the gradient. We assume that the algorithm can use a stochastic gradient ∇f(x, ξ), which is assumed to have bounded variance for all, possibly random, x ∈ Q. Similarly to Section 4, we assume L_f-smoothness and µ-strong convexity of f, i.e., that (23), (24) hold. As before, we set L = 2L_f; the quantities involving division by µ are defined whenever µ > 0.
One of the main motivations for such stochastic problems is machine learning. For example, the Empirical Risk Minimization problem with a finite-sum structure of the objective can be considered as a stochastic optimization problem with a stochastic gradient in which ξ is a random subset of {1, …, M}. It should be noted that the error δ² can be reduced by the use of mini-batches: increasing the size of ξ from 1 to m decreases the variance from δ² to δ²/m. The first step of the analysis is to obtain the counterparts of Claims 4.1, 4.2, and 4.3 in the stochastic setting. Claim 6.1. Assume that x, y are random vectors. Then the corresponding inequality holds. Proof. Using the L_f-smoothness, we obtain the pointwise bound, where L = 2L_f. Taking the full expectation of both sides, we get the required result.
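The mini-batch variance reduction δ² → δ²/m can be checked numerically. A minimal sketch with synthetic per-example gradients (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example gradients g_i of a finite-sum objective with M terms.
M, dim = 200, 3
G = rng.standard_normal((M, dim))
g_full = G.mean(axis=0)                              # exact gradient

# Single-sample variance: delta^2 = E_i ||g_i - g_full||^2 (uniform sampling).
delta2 = np.mean(np.sum((G - g_full) ** 2, axis=1))

# Mini-batch gradients of size m (sampling with replacement): their variance
# around g_full should shrink to delta^2 / m.
m, trials = 10, 20000
idx = rng.integers(0, M, size=(trials, m))
batch_means = G[idx].mean(axis=1)                    # shape (trials, dim)
emp_var = np.mean(np.sum((batch_means - g_full) ** 2, axis=1))
```

With replacement sampling, the variance of the size-m batch mean is exactly δ²/m, so the empirical estimate concentrates around that value.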
Using the same steps as in the proof of Claim 6.1, we get the counterparts of Claims 4.2, 4.3. Claim 6.2. Assume that x, y are random vectors. Then the corresponding inequality holds. Claim 6.3. Assume that x, y are random vectors and that µ > 0. Then the corresponding inequality holds. The following sequence is the counterpart of the sequence {R_k}. Using the above, we obtain the following counterparts of Propositions 4.4 and 4.5 under the assumptions of this subsection.
Proposition 6.1. The sequences generated by Algorithm 1 satisfy for all k ≥ 0 the stated inequality. Proposition 6.2. If µ > 0, the sequences generated by Algorithm 1 satisfy for all k ≥ 0 the stated inequality. The proofs of these propositions repeat the same induction steps as the proofs of Propositions 4.4, 4.5, but use the new Claims 6.1, 6.2, and 6.3. Using the last two propositions, we finally obtain the following counterpart of the convergence Theorem 4.7 for the stochastic setting.
Then, if µ > 0, the sequence {x_N} generated by Algorithm 1 satisfies for all N ≥ 0 the stated inequalities. If µ = 0, the sequences {x_k}, {x̃_k}, {z_k} generated by Algorithm 1 satisfy for all N ≥ 0 the stated inequality. Here the sequence {B_k} is defined in (40).
As we see, Algorithm 1 has the same convergence rate in the stochastic setting as in the deterministic setting. The proof of the above theorem repeats the same steps as the proof of Theorem 4.7; thus, we omit it. Remark 6.1. Usually, in the context of stochastic optimization, the analysis of algorithms also relies on the assumption that the stochastic gradient is unbiased. Our analysis does not require this assumption.

Nonsmooth objective
In this subsection, we consider the problem of structured optimization, usually referred to as composite minimization. We assume that the function L(x) is L_f-smooth and µ-strongly convex (see (22), (24)), and that the function r(x) is convex and relatively simple. We further assume that an inexact gradient ∇L(x) with absolute noise (cf. (3)) is available for L. This setting is motivated, in particular, by machine learning problems, for example, the logistic regression loss minimization problem with ℓ₁ regularization and a dataset {(X_i, y_i)}_{i=1}^K, where y_i ∈ {0, 1} for i ∈ {1, …, K}. In the setting of composite minimization, Algorithm 1 requires only one change, in the definition of the function sequence {ψ_k(·)}. For the algorithm modified in this way, in the setting of absolute noise (3), the convergence result remains the same. However, some intermediate statements, such as Lemma 4.2, require a different analysis. Therefore, we carry out a different analysis to obtain an estimate in the spirit of Proposition 4.4.
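A small sketch of this composite setting: ℓ₁-regularized logistic regression on synthetic data, solved with a proximal gradient step whose smooth gradient carries absolute noise. This is a plain non-accelerated proximal step, not the modified Algorithm 1, and the dataset, λ, and noise level are our own choices; it only illustrates that the "relatively simple" part r reduces to soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data for the composite problem (42):
# Lsmooth(x) = mean logistic loss, r(x) = lam * ||x||_1.
K, dim, lam = 50, 5, 0.1
X = rng.standard_normal((K, dim))
w_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = (X @ w_true > 0).astype(float)

def smooth_grad(x):
    p = 1.0 / (1.0 + np.exp(-X @ x))             # sigmoid predictions
    return X.T @ (p - y) / K                     # gradient of the smooth part

def prox_l1(v, t):
    # proximal operator of t*lam*||.||_1: soft-thresholding, which is the
    # "relatively simple" computation required of r(x)
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

def composite_step(x, step, delta=0.0):
    g = smooth_grad(x)
    if delta > 0:                                # absolute gradient noise (3)
        u = rng.standard_normal(dim)
        g = g + delta * u / np.linalg.norm(u)
    return prox_l1(x - step * g, step)

def obj(x):
    return np.mean(np.log1p(np.exp(X @ x)) - y * (X @ x)) + lam * np.abs(x).sum()

L_f = np.linalg.norm(X, 2) ** 2 / (4 * K)        # smoothness of the logistic loss
x = np.zeros(dim)
for _ in range(500):
    x = composite_step(x, 1.0 / L_f, delta=1e-3)
```

Despite the δ = 10⁻³ noise in every gradient, the composite objective decreases well below its value at the starting point.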
Lemma 6.4 (auxiliary statement for ψ_k's). Under the assumptions of this subsection, for the modified sequence {ψ_k(·)}, we have the stated bound. Proof. The function ψ_k defined in (43) is (1 + µτA_k)/2-strongly convex. Thus, since z_k is its minimizer, we have the corresponding inequality. Using the recurrent definition of ψ_{k+1}, we obtain the required result by induction.
Using Lemma 6.4 instead of Lemma 4.2 and the convexity of the function r(x), we can obtain a result similar to Proposition 4.4. Proposition 6.5. The sequences {x_k}, {x̃_k}, {z_k} generated by Algorithm 1 modified for structured nonsmooth optimization satisfy for all k ≥ 0 the stated inequality. Proof. The induction basis k = 0 is obvious and repeats the proof of Proposition 4.4. Let us consider an iteration k > 0. Since r(x) is convex, by the definition of x_k, we get the first inequality. By this inequality, Claim 4.1 applied to L(x), and the definition of the sequences {x_k}, {x̃_k}, we obtain the next bound, where in the last inequality we used the equation in Step 6 of Algorithm 1 and Claim 4.2 applied to L. By the induction hypothesis and since (µτ/2)‖z_k − x_k‖₂² ≥ 0, we further obtain the desired estimate. Using Lemma 6.4, we can finish the proof in the same way as in the proof of Proposition 4.4.
We finally obtain the following counterpart of Theorem 4.7 for composite minimization problems. Theorem 6.6. Let the modified Algorithm 1 be applied to the composite problem (42), where the function L(x) is L_f-smooth and µ-strongly convex and the function r(x) is convex. If µ > 0, the sequences {x_k}, {x̃_k}, {z_k} generated by the modified Algorithm 1 satisfy for all N ≥ 0 the stated inequalities. If µ = 0, these sequences satisfy for all N ≥ 0 the stated inequality, where the sequence {R_k} is defined in (29).
As we see, for composite problems, modulo a small modification of the algorithm, the main result is the same as in the smooth case.

Conclusions and observations
In this section, we give a number of remarks to discuss the obtained results. The convergence rate results obtained so far explicitly include the oracle inexactness, and we can look at them from a slightly different angle, that of controlling the inexactness. In particular, if the oracle error can be controlled, we can estimate how small the oracle error should be if our goal is to obtain an ε-solution to the problem. Such a bound also gives an estimate for the largest tolerable error that does not prevent the algorithm from obtaining an ε-solution.
Remark 7.1. In Sections 6.1 and 6.2, we considered extensions of Algorithm 1 with absolute noise to the settings of stochastic optimization and structured nonsmooth optimization. We strongly believe that it is possible to combine these two extensions into one, since the analysis in both cases follows the same lines as the analysis in Section 4. We believe that the same can also be done for the analysis of Algorithm 2 under relative noise in the gradient (see the stochastic version of this condition in [55]). We leave these developments for future work.
Remark 7.2. The results of Theorem 4.7 and Proposition 4.6 are obtained for a possibly unbounded feasible set Q. If this set is compact, we can set R = diam(Q), i.e., the diameter of the set Q. This simplifies the results and derivations since, in this case, by the construction of Algorithm 1, R_k ≤ R for all k ≥ 0.

Remark 7.3. When considering the absolute noise in Section 4.1, we had two possibilities for dealing with "inexact strong convexity": according to Claim 4.2 when µ ≥ 0, and according to Claim 4.3 when µ > 0. This resulted in two different bounds in Theorem 4.7 in the setting when µ > 0. Recalling the relation between these quantities and comparing the two bounds in Theorem 4.7, we see that, under the recalled condition on L, the model corresponding to τ = 2, described in Claim 4.3, leads to a smaller error-accumulation term in the convergence rate bound than the model corresponding to τ = 1, described in Claim 4.2.

The above results are valid for uncontrolled and unknown values of the error δ in the model of absolute noise. At the same time, in some cases the error δ can be controlled and made as small as one desires. For example, in the setting of Section 3.2, the gradient can be approximated using a finite-difference solution of the primal and adjoint systems of equations, and δ can be decreased by decreasing the discretization step. In the setting of Section 6.1, the error δ can be made smaller by means of mini-batches of stochastic gradients. Thus, a natural question is how small the accuracy δ should be chosen if the goal is to find an ε-approximate solution, i.e., to guarantee f(x_k) − f(x*) ≤ ε. A similar question is as follows: given a target accuracy ε, how large an error δ can be tolerated by the algorithm while still guaranteeing the target accuracy ε?
This, in particular, allows one to compare the robustness of different algorithms with respect to the noise.In the following series of remarks, we address these questions by deriving the relations between δ and ε.
Thus, choosing δ as above, we obtain the required guarantee. Remark 7.5. Let us consider the setting of Remark 4.2, where we reduced the convex case µ = 0 to the strongly convex case by introducing a quadratic regularization with regularization parameter µ. Recall that this led to a bound in which R is such that ‖x₀ − x*‖₂ ≤ R. We choose the regularization parameter µ, the error level δ, and the number of iterations N such that each of the three terms in this bound is smaller than ε/3; this gives the required choices. Remark 7.6. Let us apply Theorem 4.8 to solving linear inverse problems. Let A ∈ R^{n×n} be such that det(A) ≠ 0, and consider the following linear system for finding x ∈ Rⁿ: Ax = b. Solving this problem is equivalent to solving the convex optimization problem of minimizing f(x) = ½‖Ax − b‖₂². If we solve the latter problem with accuracy ε = ε₀²/2, then we guarantee that ‖Ax − b‖₂ ≤ ε₀.
Let us assume that the solution x* satisfies ‖x*‖₂ ≤ R* and that Algorithm 1 starts from the point 0. Then we can take R = R*. According to Theorem 4.8, given a target accuracy ζ > 0, Algorithm 1 stops after N_stop iterations, since {A_j} is an increasing sequence. As shown in [26], for α below an explicit threshold in terms of the condition number χ = L/µ, the Triple Momentum Method converges with a linear rate as well. At the same time, its convergence rate depends on the noise level α, is no better than the accelerated rate, and equals it only in the case α = 0. Figure 2 illustrates the situation for two different values of the condition number χ. The black dashed line shows the threshold below which STM with relative inexactness in the gradient has a linear convergence rate similar to exact STM; the latter rate is denoted by the orange line. The green line shows the convergence rate of the gradient method. Finally, the blue line shows the dependence of the convergence rate in [26] on the inexactness level α. As we see, it can be even worse than that of the gradient method for large values of α. As our experiments show, STM is more robust in the relative noise setting: numerically estimating the largest possible α = α* for given problem parameters µ, L, we obtain a larger upper bound. More detailed information can be found in Section 8. This leads us to the hypothesis that the condition α ≤ O(µ/L) for inexact STM may be weakened.
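The linear-system reduction of Remark 7.6 can be sketched directly: since f* = 0 is known, the target f-accuracy ε = ε₀²/2 can be checked online and guarantees ‖Ax − b‖₂ ≤ ε₀ at the stopping point. The sketch below uses plain gradient descent in place of Algorithm 1, on a synthetic well-conditioned system of our own making.

```python
import numpy as np

rng = np.random.default_rng(3)

# A synthetic well-posed system Ax = b, recast as min_x f(x) = 0.5*||Ax - b||_2^2.
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T / n + np.eye(n)        # symmetric positive definite => det(A) != 0
x_true = rng.standard_normal(n)
b = A @ x_true

eps0 = 1e-6
eps = eps0 ** 2 / 2                # f-accuracy that guarantees ||Ax - b|| <= eps0
L_f = np.linalg.norm(A, 2) ** 2    # smoothness constant of f

# Plain gradient descent stands in for Algorithm 1 in this sketch;
# the stopping rule uses the known value f(x*) = 0.
x = np.zeros(n)
n_steps = 0
while 0.5 * np.linalg.norm(A @ x - b) ** 2 > eps:
    x -= A.T @ (A @ x - b) / L_f
    n_steps += 1

residual = np.linalg.norm(A @ x - b)
```

The loop exits precisely when f(x) ≤ ε₀²/2, so the residual norm at termination is at most ε₀.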

Numerical experiments
In this section, we provide a series of numerical experiments illustrating the practical performance of the considered algorithms under absolute and relative noise. The noise was generated as independent, unbiased, uniformly distributed random noise.
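Both noise models can be generated from a single isotropic direction sampler. The sketch below is one plausible implementation (the paper only states that the noise is independent, uniform, and unbiased; drawing the direction uniformly on the unit sphere makes the perturbation zero-mean and saturates the bounds (3) and (4) with equality):

```python
import numpy as np

rng = np.random.default_rng(4)

def unit_sphere(dim):
    # direction uniform on the unit sphere (isotropic => zero-mean noise)
    u = rng.standard_normal(dim)
    return u / np.linalg.norm(u)

def absolute_noise_grad(g, delta):
    # additive noise with ||noise||_2 = delta, matching (3) with equality
    return g + delta * unit_sphere(g.size)

def relative_noise_grad(g, alpha):
    # additive noise with ||noise||_2 = alpha*||g||_2, matching (4) with equality
    return g + alpha * np.linalg.norm(g) * unit_sphere(g.size)

g = np.array([3.0, 4.0])            # ||g||_2 = 5
err_abs = np.linalg.norm(absolute_noise_grad(g, 0.01) - g)
err_rel = np.linalg.norm(relative_noise_grad(g, 0.1) - g)
```

Here `err_abs` equals δ = 0.01 and `err_rel` equals α‖g‖₂ = 0.5, up to floating-point rounding.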
We start with experiments in the setting µ = 0, using the objective function described in [40, p. 69] and known as the worst-case function for first-order methods. The next two plots show the convergence of STM during the first 50 000 and 10 000 iterations, respectively, in the absolute noise setting for different values of δ. We observe that, as predicted by Theorem 4.7, the increasing third term in the convergence rate (31) at some point starts to outweigh the first, decreasing term.
We further compare the convergence in two different noise settings: absolute and relative. As expected from the theory, for sufficiently small α, the convergence of the inexact method is very close to that of the exact method. Since in this experiment the noise is stochastic, this effect can possibly be explained using the theoretical results obtained in [55]: under the strong growth condition (SGC) E_ξ‖∇f(x, ξ)‖₂² ≤ ρ‖∇f(x)‖₂², L_f-smoothness, and convexity, SGD with Nesterov's acceleration has a convergence rate similar to that of the deterministic method, despite the gradients being stochastic. The SGC can be translated into the relative noise condition (4), making the two related. Although a different method is used in our paper, the obtained results make it reasonable to expect convergence under relative noise similar to that in the absence of any noise.
The next plot illustrates the convergence of STM in the setting of µ = 0 and relative noise in the gradient for different values of the parameter α. As we see, for α ≤ 0.71 the convergence of the method does not deteriorate, and the value α* ≈ 0.71 can be seen as a threshold above which the method diverges.
We next explore the strongly convex setting with µ > 0 using the worst-case function from [40]. Next, similarly to the degenerate case µ = 0, we consider the behavior of the method for different values of the parameter α when relative noise is present in the gradient. Note that, in the strongly convex case, we observe a similar effect as in the degenerate case: the algorithm converges for values of α smaller than a certain threshold α*. Finally, we compare STM and the Triple Momentum Method. Figures 11 and 12 show that, for the same problem parameters, STM is capable of converging at a much higher noise level than the Triple Momentum algorithm.

Figure 1 illustrates the iterates of the algorithm and justifies the name Similar Triangles Method (STM): by construction, x_k − x̃_k = (α_k/A_k)(z_k − z_{k−1}), i.e., the triangles (z_{k−1}, x_{k−1}, z_k) and (x̃_k, x_{k−1}, x_k) are similar.

Here V[y](x) = ½‖x − y‖₂² is the standard Euclidean prox-function, with which δ in Definition 3.3 of [50] is set as described above.

Theorem 6.3 (Convergence rate of stochastic STM). Let ‖x₀ − x*‖₂ ≤ R for some R, let the function f be L_f-smooth and strongly convex with parameter µ ≥ 0, and let the stochastic gradient ∇f(x, ξ) satisfy the bounded-variance condition above.

Remark 7.7. Choosing δ sufficiently small relative to ε and R*, we guarantee that f(x_{N_stop}) − f(x*) ≤ ε and, hence, ‖Ax − b‖₂ ≤ ε₀. Moreover, the number of iterations to guarantee such a solution is bounded accordingly. In the setting of relative noise in the gradient, Theorem 5.1 says that whenever α ≤ O(µ/L), STM converges linearly in the same way as the accelerated gradient method in the exact setting, i.e., with the rate O((1 − √(µ/L))^k), which is faster than the convergence rate O((1 − µ/L)^k) of gradient descent. Here k is the iteration counter. The paper [26] considers, in particular, an accelerated method, called the Triple Momentum Method, in the presence of relative noise in the gradient, and shows that linear convergence is retained when α is below an explicit threshold of order O(µ/L).

Figure 2. Comparison of the convergence rate of Triple Momentum Method and STM

Figure 3. First test -the performance of STM for µ = 0 for the first 50 000 iterations.

Figure 4. First test -the performance of STM for µ = 0 for the first 10 000 iterations.

Figure 5. Second test - the performance of STM for µ = 0 with relative and absolute noise.

Figure 6. Third test - the performance of STM with relative noise and µ = 0 for different values of α.
We first consider the performance of STM with absolute noise for different values of δ.Dashed lines represent the corresponding theoretical bound.

Figure 8. Fifth test - mean over 30 runs: level of approximation and required number of steps.

Figure 9. Sixth test -the performance of STM with relative noise and µ > 0 for different values of α.

Figure 10. Seventh test - the performance of STM with relative noise and µ > 0 for different values of α.

Figure 12. Ninth test - threshold α* for different L and µ = 0.1, for the Triple Momentum Method.