An incremental mirror descent subgradient algorithm with random sweeping and proximal step

ABSTRACT We investigate the convergence properties of incremental mirror descent type subgradient algorithms for minimizing the sum of convex functions. In each step, we only evaluate the subgradient of a single component function and mirror it back to the feasible domain, which makes iterations very cheap to compute. The analysis is made for a randomized selection of the component functions, which yields the deterministic algorithm as a special case. Under supplementary differentiability assumptions on the function which induces the mirror map, we are also able to deal with the presence of another term in the objective function, which is evaluated via a proximal type step. In both cases, we derive convergence rates of O(1/k) in expectation for the kth best objective function value and illustrate our theoretical findings by numerical experiments in positron emission tomography and machine learning.


Introduction
We consider the problem of minimizing the sum of nonsmooth convex functions where C ⊆ R n is a nonempty, convex and closed set and, for every i = 1, . . . , m, the so-called component functions f i : R n → R := R ∪ {±∞} are assumed to be proper and convex and will be evaluated via their respective subgradients. Implicitly we will assume that m is large and it is therefore very costly to evaluate all component functions in each iteration. Consequently, we will examine algorithms which only use the subgradient of a single component function in each iteration. These so-called incremental algorithms, see [1,2], have been applied for large-scale problems arising in tomography [3], generalized assignment problems [2] or machine learning [4]. We refer also to [5] for a slightly different approach, where in the spirit of incremental algorithms only the gradient of one of the component functions is evaluated in each step, but gradients at old iterates are used for the other components. Both subgradient algorithms and incremental methods usually require decreasing stepsizes in order to converge, which makes them slow near an optimal solution. However, they provide a very small number of iterations a low accuracy optimal value and possess a rate of convergence which is almost independent of the dimension of the problem. We refer the reader to [6] for a subgradient algorithm designed for the minimization of a nonsmooth nonconvex function under the making use of proximal subgradients.
When solving optimization problems of type (1), one might want to capture in the formulation of the iterative scheme the geometry of the feasible set C. This can be done by a so-called mirror map, that mirrors each iterate onto the feasible set. The Bregman distance associated with the function that induces the mirror map plays an essential role in the convergence analysis and in the formulation of convergence rates results (see [7,8]). So-called mirror descent algorithms were first discussed in [9] and more recently in [8,10,11] in a very general framework, in [4,12] from a statistical learning point of view and in [13] for the case of dynamical systems. The mirror map can be viewed as a generalization of the ordinary orthogonal projection on C in Hilbert spaces (see Example 2.4), but allows also for a more careful consideration of the structure of the problems, as it is the case when the objective function is subdifferentiable only on the relative interior of the feasible set. In such a setting, one can design a mirror map which maps not onto the entire feasible set but only on a subset of it where the objective function is subdifferentiable (see Example 2.5).
There exists already a rich literature on incremental algorithms dealing with similar problems. In [1,2] incremental subgradient methods with a random selection of the component functions and even projections onto a feasible set are considered, but no mirror descent. Incremental subgradient algorithms utilizing mirror descent techniques are investigated in [3]; however, an additional projection onto the feasible set is required which thus excludes the case where dom f ⊇ C (this is taken care of in our case by the weak assumption that im(∇H * ) ⊆ dom f ). Furthermore, the results appearing in Section 4 discussing Bregman proximal steps appear to completely novel for this kind of problems and are only known from a forward-backward setting [7].
The basic concepts in the formulation of mirror descent algorithms are recalled in Section 2. We also provide some illustrating examples, which present some special cases, as the general setting might not be immediately intuitive. In Section 3 we formulate an incremental mirror descent subgradient algorithm with random sweeping of the component functions which we show to have a convergence rate of O(1/ √ k) in expectation for the kth best objective function value. In Section 4 we ask additionally for differentiability of the function which induces the mirror map and are then able to add another nonsmooth convex function to the objective function which is evaluated in the iterative scheme by a proximal type step. For the resulting algorithm, we show a similar convergence result. In the last section, we illustrate the theoretical findings by numerical experiments in positron emission tomography (PET) and machine learning.

Elements of convex analysis and the mirror descent algorithm
Throughout the paper, we assume that R n is endowed with the Euclidean inner product ·, · and corresponding norm · = √ ·, · . For a nonempty convex set C ⊆ R n we denote by ri C its relative interior, which is the interior of C relative to its affine hull. For a convex function f : and as ∂f (x) := ∅, otherwise. We will write f (x) for an arbitrary subgradient, i.e. an element of the subdifferential ∂f (x).

Problem 2.1: Consider the optimization problem
where C ⊆ R n is a nonempty, convex and closed set, f : R n → R is a proper and convex function, and H : R n → R is a proper, lower semicontinuous and σ -strongly convex function such that C = dom H and im(∇H * ) ⊆ int(dom f ).
We say that H : R n → R is σ -strongly convex for σ > 0, if for every x, x ∈ R n and every λ It is well known that, when H is proper, lower semicontinuous and σ -strongly convex, then its conjugate function H * : R n → R, H * (y) = sup x∈R n { y, x − H(x)}, is Fréchet differentiable (thus it has full domain) and its gradient ∇H * is 1/σ -Lipschitz continuous or, equivalently, H * is Fréchet differentiable and its gradient ∇H * is σ -cocoercive, which means that for every y, y ∈ R n it holds σ ∇H * (y) − ∇H * (y ) 2 ≤ y − y , ∇H * (y) − ∇H * (y ) . Recall that im(∇H * ) := {∇H * (y) : y ∈ R n }.
The following mirror descent algorithm has been introduced in [10] under the name dual averaging.

Algorithm 2.2:
Consider for some initial values x 0 ∈ int(dom f ), y 0 ∈ R n and sequence of positive stepsizes (t k ) k≥0 the following iterative scheme: We notice that, since the sequence (x k ) k≥0 is contained in the interior of the effective domain of f, the algorithm is well defined. The assumptions concerning the function H, which induces the mirror map ∇H * , are not consistent in the literature. Sometimes H is assumed to be a Legendre function as in [7], or strongly convex and differentiable as in [8,11]. In the following section, we will only assume that H is proper, lower semicontinuous and strongly convex.

Example 2.3:
For H = 1 2 · 2 we have that H * = 1 2 · 2 and thus ∇H * is the identity operator on R n . Consequently, Algorithm 2.2 reduces to the classical subgradient method:

Example 2.4:
For C ⊆ R n a nonempty, convex and closed set, take H(x) = 1 2 x 2 , for x ∈ C, and H(x) = +∞, otherwise. Then ∇H * = P C , where P C denotes the orthogonal projection onto C. In this setting, Algorithm 2.2 becomes: This iterative scheme is similar to, but different from the well-known subgradient projection algorithm, which reads:
The following result will play an important role in the convergence analysis that we will carry out in the next sections. Lemma 2.6: Let H : R n → R be a proper, lower semicontinuous and σ -strongly convex function, for σ > 0, x ∈ R n and y ∈ ∂H(x). Then it holds Proof: The functionH(·) := H(·) − (σ/2) · 2 is convex and y − σ x ∈ ∂H(x). Thus Rearranging the terms, leads to the desired conclusion.

A stochastic incremental mirror descent algorithm
In this section, we will address the following optimization problem.

Problem 3.1: Consider the optimization problem
where C ⊆ R n is a nonempty, convex and closed set, for every i = 1, . . . , m, the functions f i : R n → R are proper and convex, and H : R n → R is a proper, lower semicontinuous and σ -strongly convex function such that C = dom H and im(∇H In this section, we apply the dual averaging approach described in Algorithm 2.2 to the optimization problem (3) by only using the subgradient of a component function at a time. This incremental approach (see, also, [1,2]) is similar to but slightly different from the extension suggested in [8]. Furthermore, we introduce a stochastic sweeping of the component functions. For a similar strategy, but in the random selection of coordinates we refer to [14].

Algorithm 3.2: Consider for some initial values x
, y m,−1 ∈ R n and sequence of positive stepsizes (t k ) k≥0 the following iterative scheme: where i,k is a {0, 1} valued random variable for every i = 1, . . . , m and k ≥ 0, such that i,k is independent of ψ i−1,k and P( i,k = 1) = p i for every i = 1, . . . , m and k ≥ 0.
One can notice that in the above iterative scheme y i,k ∈ ∂H(ψ i,k ) for every i = 1, . . . , m and k ≥ 0.
In the convergence analysis of Algorithm 3.2 we will make use of the following Bregman-distancelike function associated to the proper and convex function H : R n → R and defined as We notice that d H (x, y, z) ≥ 0 for every (x, y) ∈ R n × dom H and every z ∈ ∂H(y), due to subgradient inequality. The function d H is an extension of the Bregman distance (see [4,7,11]), which is associated to a proper and convex function H : R n → R fulfilling dom ∇H := {x ∈ R n : H is differentiable at x} = ∅ and defined as For y outside this set the conclusion follows automatically.
For every i = 1, . . . , m and every k ≥ 0 it holds Rearranging the terms, this yields for every i = 1, . . . , m and every k ≥ 0 to From here we get for every i = 1, . . . , m and every k ≥ 0 which, by using that H is σ -strongly convex and Lemma 2.6, yields Using the fact that Since all the involved functions are measurable, we can take the expected value on both sides of the above inequality and get due to the assumed independence of i,k and ψ i−1,k for every i = 1, . . . , m and every k ≥ 0 Since E( i,k ) = p i , we get for every i = 1, . . . , m and every k ≥ 0 Summing the above inequality for i = 1, . . . , m and using that or, equivalently, Thus, for every k ≥ 0, On the other hand, by using the Lipschitz continuity of ∇H * it yields for every Combining (6) and (7) gives for every k ≥ 0 Since ψ m,k = x k+1 and y m,k = y 0,k+1 we get for every k ≥ 0 that By summing up this inequality from k = 0 to N−1, where N ≥ 1, we get and this finishes the proof.

Remark 3.4:
The set from which the variable y is chosen in the previous theorem might seem to be restrictive, however, we would like to recall that in many applications dom H is the set of feasible solutions of the optimization problem (3). Since im(∇H * ) = dom ∂H := {x ∈ R n : ∂H(x) = ∅} ⊆ dom H, the inequality in Theorem 3.3 is fulfilled for every y ∈ im(∇H * ).

Remark 3.5:
Note furthermore that so far we have not made any assumptions about the stepsizes in Theorem 3.3. It is, however, clear from the statement that in the case where y = x * for an optimal solution x * and the stepsizes (t k ) k∈N fulfil the classical condition that ∞ k=1 t k = +∞ and ∞ k=1 t 2 The optimal stepsize choice, which we provide in the following corollary, is a consequence of [8, Proposition 4.1], which states that the function . . , d} and D ∈ R d×d is a symmetric positive definite matrix, attains its minimum on R d ++ at z * = (2cσ /b T D −1 b)D −1 b and this provides 2c/σ b T D −1 b as optimal objective value.  (3) and (x k ) k≥0 be a sequence generated by Algorithm 3.2 with optimal stepsize Then for every N ≥ 1 it holds

Remark 3.7:
In the last step of the proof of Theorem 3.3, one could have chosen to use the following inequality given by the convexity of m i=1 f i (·) in order to prove convergence of the function values for the ergodic sequencex k := (1/ k i=0 t i ) k i=0 t i x i for all k ≥ 0. This would lead for every N ≥ 1 and every and for the optimal stepsize choice from Corollary 3.6 to and might be beneficial, as it does not require the computation of objective function values, which are by our implicit assumption of m being large expensive to compute.

A stochastic incremental mirror descent algorithm with Bregman proximal step
In this section, we add another nonsmooth convex function to the objective function of the optimization problem (3) and provide an extension of Algorithm 3.2, which evaluates in particular the new summand by a proximal type step. However, this asks for supplementary differentiability assumption on the function inducing the mirror map.

Problem 4.1: Consider the optimization problem
where C ⊆ R n is a nonempty, convex and closed set, for every i = 1, . . . , m, the functions f i : R n → R are proper and convex and g : R n → R is a proper, convex and lower semicontinuous function, and H : R m → R is a proper, σ -strongly convex and lower semicontinuous function such that For a proper, convex, lower semicontinuous function h : R n → R we define its Bregman-proximal operator with respect to the proper, σ -strongly convex and lower semicontinuous function H : R n → R as being Due to the strong convexity of H, the Bregman-proximal operator is well defined. For H = ( 1 2 ) · 2 it coincides with the classical proximal operator.
We are now in the position to formulate the iterative scheme we would like to propose for solving (9). In case g = 0, this algorithm gives exactly the incremental version of the iterative method in [8], actually suggested by the two authors in this paper.

Algorithm 4.2:
Consider for some initial value x 0 ∈ im(∇H * ) and sequence of positive stepsizes (t k ) k≥0 the following iterative scheme: where i,k is a {0, 1} valued random variable for every i = 1, . . . , m and k ≥ 0, such that i,k is independent from ψ i−1,k and P( i,k = 1) = p i for every i = 1, . . . , m and k ≥ 0. In what follows we prove that this is the case also for ψ 0,k , for every k ≥ 0. To this aim, it is enough to show that x k ∈ im(∇H * ) for every k ≥ 0. For k = 0 this statement is true by the choice of the initial value. For every k ≥ 0 we have that which, according to int(dom H) ∩ dom g = ∅, is equivalent to Thus, x k+1 ∈ dom ∂H = im(∇H * ) for every k ≥ 0 and this concludes the proof.

Example 4.4:
Consider the case when m = 1, 1,k = 1 for every k ≥ 0 and H(x) = 1 2 x 2 for x ∈ C, while H(x) = +∞ for x / ∈ C, where C ⊆ R n is a nonempty, convex and closed set. In this setting, ∇H * is equal to the orthogonal projection P C onto the set C. Algorithm 4.2 yields the following iterative scheme, which basically minimizes the sum f 1 + g over the set C: The difficulty in Example 4.4, assuming that it is reasonably possible to project onto the set C, lies in evaluating prox H t k g , for every k ≥ 0, as this itself is a constraint optimization problem When C = R n , the iterative scheme (10) becomes the proximal subgradient algorithm investigated in [15].
For y outside this set the conclusion follows automatically.
As in the first part of the proof of Theorem 3.3, we obtain instead of (8) the following inequality which holds for every i = 1, . . . , m and every k ≥ 0 As pointed out in the proof of Lemma 4.3, for every k ≥ 0 we have The three point identity leads to or, equivalently, for every k ≥ 0. Since the involved functions are measurable, we can take the expected value on both sides and obtain for every k ≥ 0 Combining (11) and (12) gives for every k ≥ 0 By adding and subtracting E( m i=1 f i (x k+1 )) and by using afterwards the Lipschitz continuity of By the triangle inequality, we obtain for every k ≥ 0 which, due to Young's inequality and the strong convexity of H, leads to Since we get for every k ≥ 0 that Young's inequality and the strong convexity of H imply that for every i = 1, . . . , m and every k ≥ 0 and thus Summing up this inequality from k = 0 to N−1, for N ≥ 1, we get This shows that and therefore finishes the proof.  (9) and (x k ) k≥0 be a sequence generated by Algorithm 4.2 with optimal stepsize Then for every N ≥ 1 it holds

Remark 4.7:
The same considerations as in Remark 3.7 about ergodic convergence are applicable also for the rates provided in Theorem 4.5 and Corollary 4.6.

Remark 4.8:
Note the straightforward dependence of the optimal stepsizes as well as the right-hand side of the convergence statement on the data, i.e. the distance of the initial point to optimality, the Lipschitz constants L f i and the probabilities p i . This backs up the intuition that the decreased gradient evaluation, i.e. smaller p i , does not come for free but at the cost of a worse constant in the convergence rate.

Applications
In the numerical experiments carried out in this section, we will compare three versions of the provided algorithms. First of all, the non-incremental version, which takes full subgradient steps with respect to the sum of all component functions instead of every single one individually. This can be viewed as a special case of the algorithms given, when m = 1 and 1,k = 1 for all k ≥ 0. Secondly, we discuss the non-stochastic incremental version, which uses the subgradient of every single component function in every iteration and thus corresponds to the case when i,k = 1 for every i = 1,...,m and every k ≥ 0. Lastly, we apply the algorithms as intended by evaluating the subgradients of the respective component functions incrementally with a probability different from 1.

Tomography
This application can be found in [3] and arises in the reconstruction of images in PET. We consider the following problem where := {x ∈ R n : n j=1 x j = 1, x ≥ 0} and r ij denotes for i = 1, . . . , m and j = 1, . . . , n the entry of the matrix R ∈ R m×n in the i-th row and j-th column and all of these are assumed to be strictly positive. Furthermore, y i denotes for i = 1, . . . , m the positive number of photons measured in the i-th bin. As discussed in Example 2.5 this can be incorporated into our framework with the mirror map H(x) = n i=1 x i log(x i ) for x ∈ and H(x) = +∞, otherwise. As initial value, we use the all ones vector divided by the dimension n.
We also want to point out that a similar example given in [8] in which the minimization of a convex function over the unit simplex somehow does not match the assumption made throughout the paper as the interior of is empty and the function H can therefore not be continuously differentiable in a classical sense. However, with the setting of Section 3 we are able to tackle this problem.
The bad performance, see Figure 1, of the deterministic incremental version of Algorithm 3.2 can be explained by the fact that many more evaluations of the mirror map are needed, which increases the overall computation time dramatically. The stochastic version, however, performs rather well, after only evaluating merely roughly a fifth of the total number of component functions, see Table 1.

Support vector machines
We deal with the classic machine learning problem of binary classification based on the well-known MNIST dataset, which contains 28 by 28 images of handwritten numbers on a grey-scale pixel map. For each of the digits, the dataset comprises around 6000 training images and roughly 1000 test images. In line with [4], we train a classifier to distinguish the numbers 6 and 7, by solving the following optimization problem where for i = 1, . . . , m, x i ∈ {0, 1, . . . , 255} 784 denotes the i-th training image and y i ∈ {−1, 1} denotes the label of the i-th training image. The 1-norm serves as a regularization term and λ > 0 balances the two objectives of minimizing the classification error and reducing the 1-norm of the classifier w. To incorporate this problem into our framework, we set H = 1 2 · 2 which leaves us with the identity as mirror map as this problem is unconstrained. The results comparing the three versions of Algorithm 4.2 discussed at the beginning of this section are illustrated in Figure 2. As initial value we simply use the all ones vector. All three versions show classical first-order behaviour, giving a fast decrease in objective function value first but then slowing down dramatically. More information about the performance can be seen in Table 2. All three algorithms result in a significant decrease in objective function after being run for only 4 s each. However, from a machine learning point of view, only the misclassification rate is of actual importance. In both regards, the stochastic incremental version clearly trumps the other two implementations. It is also interesting to note that it needs only a small fraction of the number of subgradient evaluations in comparison to the full non-incremental algorithm.

Conclusion
In this paper, we present two algorithms to solve nonsmooth convex optimization problems where the objective function is a sum of many functions which are evaluated by their respective subgradients under the implicit presence of a constraint set which is dealt with by a so-called mirror map. By allowing for a random selection of each component function to evaluate in each iteration, the proposed methods become suitable even for very large-scale problems. We prove a convergence order of O(1/ √ k) in expectation for the kth best objective function value, which is standard for subgradient methods. However, even for the case where all the objective functions are differentiable, it is not clear if better theoretical estimates can be achieved, due to the need of using diminishing stepsizes in order to obtain convergence in incremental algorithms. Future work could comprise the investigation of different stepsizes, such as constant or dynamic stepsizes as in [2]. Another possible extension of this would be to use different selection procedures such as random subsets of fixed size. Our framework, however, does not provide the right setting for such a batch approach as it would leave i,k and j,k dependent.

Disclosure statement
No potential conflict of interest was reported by the authors.