A minimax approach to duality for linear distributional sensitivity testing

We consider the problem of finding the maximum of $\mathbb{E}_{\nu}[f(X)]$ where $\nu$ is allowed to vary over all the probability measures on a Polish space $S$ for which $d_c(\mu,\nu)\leq \theta$, in which $d_c$ is an optimal transport distance, $f$ is a real-valued function on $S$ satisfying some regularity conditions, $\mu$ is a ``baseline'' measure and $\theta \geq 0$. Whereas some of the derivations of the dual version of this optimization problem rely on Fenchel duality, we impose compactness on $S$, which allows us to use K. Fan's minimax theorem instead; this theorem does not require vector space structure. This allows one to avoid the use of vector spaces of measures, or dual variables other than the Lagrange multiplier.


Introduction
Let $(S, d)$ be a Polish (i.e. complete and separable) metric space and $\mu$ a measure on $S$, called the baseline measure or distribution. Consider the optimization problem

$$\text{maximize} \quad \int_S f(x)\, d\nu(x) \quad \text{over } \nu \in B_r(\mu), \tag{1}$$

where $B_r(\mu)$, termed the uncertainty set, is a set of measures with transportation cost $\leq r$ from $\mu$. The transportation cost derives from a cost function $c : S \times S \to \mathbb{R}_+$. Assume $f$ is sufficiently regular such that a maximum exists and denote that maximum by $v_P$. The duality result, Equation (3), expresses $v_P$ through an optimization over a single Lagrange multiplier $\lambda \geq 0$ with an inner supremum over $S$.

Versions of the duality result are obtained under varying assumptions by, amongst others, Esfahani and Kuhn [9]; Gao and Kleywegt [8]; Blanchet and Murthy [3]; Bartl, Drapeau and Tangpi [1]; and Feng and Schlögl [7]. This duality reduces the problem from an infinite-dimensional to a finite-dimensional one. If, as often happens, the inner supremum in Equation (3) is analytically tractable, then the dual becomes a one-dimensional optimization problem that may itself be analytically solvable, as can be seen in several examples in the above-mentioned references. To give just two examples, option prices robust to changes of a certain magnitude in the risk-neutral distribution [2] and robust ruin probabilities [3] have been calculated using such duality results.

Esfahani and Kuhn [9] assume that the baseline measure $\mu$ is an empirical distribution, consider the Wasserstein distance for $p = 1$ only and assume a special structure for $f$ as a maximum of a finite number of convex functions. Gao and Kleywegt use lim inf and lim sup inequalities related to the set of minimizers of $\inf_{y \in S} \{\lambda c(x, y) - f(y)\}$, for the case $c(x, y) = d(x, y)^p$ with $p \geq 1$. Their proof of duality is based on, among other things, results analogous to Moreau-Yosida regularization. Feng and Schlögl sketch a proof using a Karush-Kuhn-Tucker argument. Blanchet and Murthy work with general cost functions $c(x, y)$ and base their result on Fenchel duality. Their dual variables are the Lagrange parameter $\lambda$ and a measurable function $\phi$. To use Fenchel duality, they consider a vector space of bounded continuous functions on $S \times S$ as well as a vector space of signed finite Borel measures on $S \times S$, equipped with the variation norm.
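For orientation, the dual referred to as Equation (3) is commonly stated in the cited literature (for instance [3] and [8]) in the form sketched below. This is written under the assumption that the notation matches that of problem (1); the regularity conditions imposed on $f$ and $c$ differ between the references.

```latex
v_P \;=\; \inf_{\lambda \ge 0} \left\{ \lambda r
      + \int_S \sup_{y \in S} \bigl[ f(y) - \lambda c(x, y) \bigr] \, d\mu(x) \right\}
```

The outer problem is over the single real variable $\lambda$, which is the sense in which the dual is one-dimensional once the inner supremum can be evaluated.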
In this paper we prove Equation (3) using K. Fan's minimax theorem instead of Fenchel duality. One benefit of this is that vector space structure is replaced by weaker convex-concavity requirements. Thus the minimax theorem allows one to proceed without having to introduce vector space structure, signed measures or different topologies. We use only the Lagrange parameter $\lambda$ as dual variable. The price we pay for this relative simplicity is that we assume $S$ is compact. We hope that, in spite of this restriction, this type of relatively simple proof may stimulate generalizations and also contribute to making the topic of distributional sensitivity testing more accessible to researchers already familiar with minimax arguments. It may also be possible to extend this argument to non-compact spaces, similarly to the compact-to-general extension step in [3].
For the interpretation of the problem, comparison of optimal transport distance with Kullback-Leibler divergence, and various applications, see the above-mentioned literature. In this paper we restrict ourselves to the minimax proof of Equation (3).
The structure of this paper is as follows: we define notation and recall the formulation of the problem in terms of transport plans, derive topological preliminaries allowing the application of minimax to the Lagrangian, and then apply minimax.

Formulation in terms of transport plans
For notational convenience we consider the problem obtained by replacing the maximization in (1) by minimization. Since the maximization problem for $f$ can be solved by solving the minimization problem for $-f$, there is no loss of generality in doing so.
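To define the ball of measures which will be the feasible set of the optimization problem, we review a few facts about optimal transport. Let $\mathcal{X}$ and $\mathcal{Y}$ be Polish spaces. If $\mu$ is any Borel measure on $\mathcal{X}$, and $T : \mathcal{X} \to \mathcal{Y}$ a Borel map, then $T_{\#}\mu$ will denote the image measure defined by $(T_{\#}\mu)(A) = \mu(T^{-1}(A))$ for Borel sets $A \subseteq \mathcal{Y}$. If $\gamma$ is a probability measure on $\mathcal{X} \times \mathcal{Y}$, its marginal, or projection, to $\mathcal{X}$ is the measure $(\mathrm{proj}_1)_{\#}\gamma$, where $\mathrm{proj}_1$ is the coordinate projection $\mathcal{X} \times \mathcal{Y} \to \mathcal{X} : (x, y) \mapsto x$. Equivalently, $((\mathrm{proj}_1)_{\#}\gamma)(A) = \gamma(A \times \mathcal{Y})$ for each Borel set $A \subseteq \mathcal{X}$. The marginal to $\mathcal{Y}$, namely $(\mathrm{proj}_2)_{\#}\gamma$, is defined similarly. Recall [10, Definition 1.1] that a transport plan, or a coupling, between a measure $\mu$ on $\mathcal{X}$ and a measure $\nu$ on $\mathcal{Y}$ is a measure $\pi$ on $\mathcal{X} \times \mathcal{Y}$ such that $(\mathrm{proj}_1)_{\#}\pi = \mu$ and $(\mathrm{proj}_2)_{\#}\pi = \nu$.

We denote the set of all Borel probability measures on a Polish space $\mathcal{X}$ by $P(\mathcal{X})$, which is topologized by weak convergence of probability measures. If $\mathcal{X}$ is Polish then $P(\mathcal{X})$ is Polish. In particular, the weak convergence of measures in $P(\mathcal{X})$ is metrizable. If $\mathcal{P}$ and $\mathcal{Q}$ are sets of measures satisfying $\mathcal{P} \subseteq P(\mathcal{X})$ and $\mathcal{Q} \subseteq P(\mathcal{Y})$, then the set of all transport plans from any $\mu \in \mathcal{P}$ to any $\nu \in \mathcal{Q}$ will be denoted by $\Pi(\mathcal{P}, \mathcal{Q})$.

A cost function is any lower semicontinuous (l.s.c.) $c : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$. A typical choice is $c(x, y) = d(x, y)^p$ for some $p \geq 1$. The optimal transport cost between measures $\mu, \nu$ on $\mathcal{X}$ is defined by

$$d_c(\mu, \nu) := \inf\left\{ \int_{\mathcal{X} \times \mathcal{X}} c(x, y)\, d\pi(x, y) : \pi \in \Pi(\{\mu\}, \{\nu\}) \right\}.$$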
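The definitions above can be made concrete on a finite space. The following is a minimal sketch, assuming the two-point space $S = \{0, 1\}$, measures stored as lists of point masses, and a coupling stored as a matrix `pi[x][y]`. The helper names (`marginals`, `plan_cost`, `couplings_2x2`, `transport_cost`) are hypothetical and exist only for this illustration; the infimum defining the transport cost is approximated by scanning a grid of couplings.

```python
# Finite sketch on S = {0, 1}; all helper names are hypothetical.

def marginals(pi):
    """Return (first marginal, second marginal) of a coupling matrix pi[x][y]."""
    first = [sum(row) for row in pi]
    second = [sum(pi[x][y] for x in range(len(pi))) for y in range(len(pi[0]))]
    return first, second

def plan_cost(pi, c):
    """Integral of the cost c with respect to the plan pi."""
    return sum(c[x][y] * pi[x][y]
               for x in range(len(pi)) for y in range(len(pi[0])))

def couplings_2x2(mu, nu, steps=1000):
    """Yield couplings of mu and nu on a two-point space along a grid.

    A 2x2 coupling is fixed by a = pi[0][0], which must satisfy
    max(0, mu[0] + nu[0] - 1) <= a <= min(mu[0], nu[0]).
    """
    lo = max(0.0, mu[0] + nu[0] - 1.0)
    hi = min(mu[0], nu[0])
    for k in range(steps + 1):
        a = lo + (hi - lo) * k / steps
        yield [[a, mu[0] - a], [nu[0] - a, mu[1] - (nu[0] - a)]]

def transport_cost(mu, nu, c, steps=1000):
    """Grid approximation of d_c(mu, nu) = inf over couplings of plan_cost."""
    return min(plan_cost(pi, c) for pi in couplings_2x2(mu, nu, steps))

mu = [0.5, 0.5]
nu = [0.2, 0.8]
c = [[0.0, 1.0], [1.0, 0.0]]          # c(x, y) = |x - y| on {0, 1}
product = [[mu[x] * nu[y] for y in range(2)] for x in range(2)]

print(marginals(product))             # the product coupling has marginals mu and nu
print(transport_cost(mu, nu, c))      # approximates d_c(mu, nu)
```

For this cost on two points the plan cost is linear in the free parameter, so the grid minimum sits at an endpoint and the printed transport cost should be close to $|0.5 - 0.2| = 0.3$.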
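Let $(S, d)$ be a Polish space and $f : S \to \mathbb{R}$ be l.s.c. For any $\mu \in P(S)$ and $r > 0$, we define

$$B_r(\mu) := \{\nu \in P(S) : d_c(\mu, \nu) \leq r\}.$$

We consider the optimization problem

$$\text{minimize} \quad \int_S f(x)\, d\nu(x) \quad \text{over } \nu \in B_r(\mu). \tag{4}$$

(By using the word "minimize" we do not imply that the minimum is attained.) The set of transport plans $X := \Pi(\{\mu\}, B_r(\mu))$ inherits the weak topology of probability measures from $P(S \times S)$.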
By translating the condition $\nu \in B_r(\mu)$ to a condition on the transport plan one obtains, similarly to arguments in [7, Section 3.1], that Problem (4) is equivalent, in the sense that the infima agree, to the following problem over transport plans:

$$\text{minimize} \quad \int_{S \times S} f(y)\, d\pi(x, y) \tag{5a}$$

$$\text{subject to} \quad \pi \in X \quad \text{and} \tag{5b}$$

$$\int_{S \times S} c(x, y)\, d\pi(x, y) \leq r. \tag{5c}$$

We will refer to this as the primal problem, and to the associated infimum value as $v_P$.

Compactness of transport plans when S is compact
Now we assume that $S$ is also compact. Then $P(S)$ is compact, for example by the Prokhorov theorem or [10, Remark 6.19]. Since $B_r(\mu)$ is a closed subset of $P(S)$, it too is compact.
Our main tool will be K. Fan's minimax theorem [6] as formulated by Borwein and Zhuang [4].
Definition 1. Let $X$ and $Y$ be sets, not necessarily having vector space structure. A function $K : X \times Y \to \mathbb{R}$ is said to be convex-concave like on $X \times Y$ if for all $t$ with $0 \leq t \leq 1$ we have
(a) for all $x_1, x_2 \in X$ there exists $x_3 \in X$ such that for all $y \in Y$, $K(x_3, y) \leq tK(x_1, y) + (1 - t)K(x_2, y)$; and
(b) for all $y_1, y_2 \in Y$ there exists $y_3 \in Y$ such that for all $x \in X$, $K(x, y_3) \geq tK(x, y_1) + (1 - t)K(x, y_2)$.
In our application $X = \Pi(\{\mu\}, B_r(\mu))$ is a set of transport (sometimes called transference) plans, which by application of the following lemma is compact in $P(S \times S)$.
Lemma 1. [10, Corollary 5.21] Let $\mathcal{X}$ and $\mathcal{Y}$ be Polish spaces, and let $c(x, y)$ be a real-valued continuous cost function with $\inf c > -\infty$. Let $K$ and $L$ be two compact subsets of $P(\mathcal{X})$ and $P(\mathcal{Y})$ respectively. Then the set of optimal transference plans $\pi$ whose marginals respectively belong to $K$ and $L$ is itself compact in $P(\mathcal{X} \times \mathcal{Y})$.
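We will need the following lemma, which is a variation of [10, Lemma 4.3] suitable for our purpose.
Lemma 2. Let $\mathcal{X}$ and $\mathcal{Y}$ be Polish spaces.
(1) If $g$ is a nonnegative l.s.c. real-valued function on $\mathcal{X}$ then the mapping $P(\mathcal{X}) \to \mathbb{R} : \nu \mapsto \int_{\mathcal{X}} g(x)\, d\nu(x)$ is l.s.c.
(2) If $g$ is a nonnegative l.s.c. real-valued function on $\mathcal{X} \times \mathcal{Y}$ then the mapping $P(\mathcal{X} \times \mathcal{Y}) \to \mathbb{R} : \gamma \mapsto \int_{\mathcal{X} \times \mathcal{Y}} g(x, y)\, d\gamma(x, y)$ is l.s.c.
Proof. (1) Let $\nu_k \to \nu$ weakly. Since $g$ is l.s.c. and nonnegative, we can use the theorem of Baire to obtain a sequence $(g_n)_{n \in \mathbb{N}}$ of continuous real-valued functions such that $0 \leq g_n \uparrow g$. By replacing $g_n$ with $\min\{g_n, n\}$ if necessary, we can assume $g_n$ is bounded. By monotone convergence, together with the weak convergence $\nu_k \to \nu$ and the bound $g_n \leq g$,

$$\int_{\mathcal{X}} g\, d\nu = \lim_{n \to \infty} \int_{\mathcal{X}} g_n\, d\nu = \lim_{n \to \infty} \lim_{k \to \infty} \int_{\mathcal{X}} g_n\, d\nu_k \leq \liminf_{k \to \infty} \int_{\mathcal{X}} g\, d\nu_k.$$

The proof of (2) is similar.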
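A small worked example, not taken from the source, may help to see why Lemma 2 asserts lower semicontinuity rather than continuity. It assumes the space $[0, 1]$, the function $g$ equal to the indicator of the relatively open set $(0, 1]$ (which is nonnegative and l.s.c.), and the point masses $\nu_k = \delta_{1/k}$.

```latex
g = \mathbf{1}_{(0,1]}, \qquad
\nu_k = \delta_{1/k} \to \nu = \delta_0 \ \text{weakly}, \qquad
\int g \, d\nu = 0 \;<\; 1 = \liminf_{k \to \infty} \int g \, d\nu_k .
```

The inequality of the lemma holds, but strictly, so the integral against an l.s.c. integrand can drop in the limit.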

Application of minimax
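Define the Lagrangian

$$L(\pi, \lambda) := \int_{S \times S} f(y)\, d\pi(x, y) + \lambda \left( \int_{S \times S} c(x, y)\, d\pi(x, y) - r \right), \qquad \pi \in X,\ \lambda \geq 0. \tag{7}$$

As is typical in the Lagrangian approach, the term $\sup_{\lambda \geq 0} \lambda \left( \int_{S \times S} c(x, y)\, d\pi(x, y) - r \right)$ is infinity if constraint (5c) is not satisfied, and 0 if it is satisfied. It follows that $v_P = \min_{\pi \in X} \sup_{\lambda \geq 0} L(\pi, \lambda)$.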
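The penalty behaviour of the multiplier term can be checked numerically on a finite space. The sketch below assumes the two-point space $S = \{0, 1\}$, plans stored as matrices `pi[x][y]`, and hypothetical example data (`f`, `c`, `r`, and the two plans); the function name `lagrangian` is likewise introduced only for illustration.

```python
# Finite sketch of the Lagrangian (7); names and data are hypothetical.

def lagrangian(pi, f, c, r, lam):
    """L(pi, lam) = sum f(y) pi(x, y) + lam * (sum c(x, y) pi(x, y) - r)."""
    n, m = len(pi), len(pi[0])
    objective = sum(f[y] * pi[x][y] for x in range(n) for y in range(m))
    cost = sum(c[x][y] * pi[x][y] for x in range(n) for y in range(m))
    return objective + lam * (cost - r)

f = [0.0, -1.0]                        # objective values on S = {0, 1}
c = [[0.0, 1.0], [1.0, 0.0]]           # c(x, y) = |x - y|
r = 0.25
feasible = [[0.8, 0.2], [0.0, 0.0]]    # transport cost 0.2 <= r
infeasible = [[0.5, 0.5], [0.0, 0.0]]  # transport cost 0.5 > r

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
print([lagrangian(feasible, f, c, r, lam) for lam in lams])
print([lagrangian(infeasible, f, c, r, lam) for lam in lams])
```

For the feasible plan the values decrease in $\lambda$, so the supremum is attained at $\lambda = 0$ and equals the objective; for the infeasible plan the values grow without bound as $\lambda$ increases.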
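Theorem 2. The conditions of the minimax Theorem 1 are satisfied for $X$ as defined above, $Y = \mathbb{R}_+$ and the function $K$ chosen as the Lagrangian $L$ of Equation (7). In particular:
1. $X$ is compact Hausdorff in the weak topology;
2. $L(\cdot, \lambda) : X \to \mathbb{R}$ is l.s.c. on $X$ for every $\lambda \in Y$;
3. $L$ is convex-concave like on $X \times Y$.
Proof. (1) The weak compactness of $X$ follows from Lemma 1. Since the weak topology on $P(S \times S)$ is metrizable, $P(S \times S)$, and hence $X$, is Hausdorff.
(2) Since $S$ is compact and $f$ is l.s.c., $f$ is bounded below, so Lemma 2(2) applies to the nonnegative l.s.c. function $(x, y) \mapsto f(y) - \inf_S f$ as well as to $c$. Because $\lambda \geq 0$, it follows that $\pi \mapsto L(\pi, \lambda)$ is l.s.c. on $X$.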
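(3) Although $X$ and $Y$ are not vector spaces and $L$ is not linear, $L$ does preserve convex combinations: if $\pi_1$ and $\pi_2$ are transport plans belonging to $X$ and $0 \leq t \leq 1$ then it is clear that for any $\lambda \in Y$,

$$L(t\pi_1 + (1 - t)\pi_2, \lambda) = tL(\pi_1, \lambda) + (1 - t)L(\pi_2, \lambda).$$

This enables us to derive the dual.

Theorem 3. The value $v_P$ of the primal problem (5) satisfies

$$v_P = \sup_{\lambda \geq 0} \left\{ \int_S \min_{y \in S} \bigl[ f(y) + \lambda c(x, y) \bigr]\, d\mu(x) - \lambda r \right\}. \tag{10}$$

The fact that the subproblem of optimizing over measures is solved by a measure concentrated on $\{(x, y) \in S \times S : y \in \arg\min_{z \in S} \{f(z) + \lambda c(x, z)\}\}$ is observed, subject to the obvious alterations needed to translate between minimization and maximization problems, in [3] and [7], amongst others.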
Proof. By Theorem 2 we have

$$\min_{\pi \in X} \sup_{\lambda \geq 0} L(\pi, \lambda) = \sup_{\lambda \geq 0} \min_{\pi \in X} L(\pi, \lambda). \tag{11}$$

Since $X$ is compact and $L(\cdot, \lambda)$ is l.s.c. on $X$ for every $\lambda \geq 0$, $L(\cdot, \lambda)$ attains a minimum at, say, $\pi^\star \in X$. In fact we can construct the minimizer. Since $S$ is compact and the mapping $y \mapsto f(y) + \lambda c(x, y)$ is l.s.c., for each $x \in S$ the set of integrand minimizers $m(x) := \arg\min_{y \in S} \{f(y) + \lambda c(x, y)\}$ is non-empty. Let $T : S \to S$ be a measurable map with $T(x) \in m(x)$ for each $x \in S$. The existence of such a map follows from a measurable selection result in [5], and the details are given in Appendix A. This allows us to define the deterministic [10, Definition 1.2] coupling $\pi^\star := (\mathrm{Id}, T)_{\#}\mu$. It follows from this definition that the support of $\pi^\star$ is a subset of $\{(x, T(x)) : x \in S\} \subseteq S \times S$.
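On a finite space the selection $T$ and the coupling $(\mathrm{Id}, T)_{\#}\mu$ can be written down directly, since no measurability issue arises. The following sketch assumes $S = \{0, 1, 2\}$ with $c(x, y) = |x - y|$ and hypothetical data `f`, `mu` and `lam`; the names `argmin_map`, `deterministic_coupling` and `integrand_value` are introduced here for illustration and do not come from the source.

```python
# Finite sketch of the argmin selection T and the coupling (Id, T)_# mu.
# All names and data are hypothetical.

def argmin_map(f, c, lam):
    """T[x] = an index y minimising f[y] + lam * c[x][y] (lowest index on ties)."""
    T = []
    for x in range(len(c)):
        values = [f[y] + lam * c[x][y] for y in range(len(f))]
        T.append(values.index(min(values)))
    return T

def deterministic_coupling(mu, T):
    """Matrix of (Id, T)_# mu: the mass mu[x] is placed at the pair (x, T[x])."""
    n = len(mu)
    pi = [[0.0] * n for _ in range(n)]
    for x in range(n):
        pi[x][T[x]] += mu[x]
    return pi

def integrand_value(pi, f, c, lam):
    """Integral of f(y) + lam * c(x, y) with respect to the plan pi."""
    n = len(pi)
    return sum((f[y] + lam * c[x][y]) * pi[x][y]
               for x in range(n) for y in range(n))

f = [0.0, -1.0, -3.0]
c = [[abs(x - y) for y in range(3)] for x in range(3)]
mu = [0.5, 0.3, 0.2]
lam = 1.6

T = argmin_map(f, c, lam)
pi_star = deterministic_coupling(mu, T)
identity_plan = deterministic_coupling(mu, [0, 1, 2])   # plan that moves no mass

print(T)
print(integrand_value(pi_star, f, c, lam), integrand_value(identity_plan, f, c, lam))
```

Every plan with first marginal `mu` integrates the pointwise minimum from above, so the value printed for `pi_star` should not exceed the value for any other such plan, including `identity_plan`.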
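Consider an arbitrary transport plan $\pi \in X$. Combining the above-mentioned fact with the marginalization properties of the transport plans $\pi^\star$ and $\pi$, we get

$$\int_{S \times S} \bigl[ f(y) + \lambda c(x, y) \bigr]\, d\pi(x, y) \geq \int_S \min_{z \in S} \bigl[ f(z) + \lambda c(x, z) \bigr]\, d\mu(x) = \int_{S \times S} \bigl[ f(y) + \lambda c(x, y) \bigr]\, d\pi^\star(x, y).$$

Combining with Equation (11) we get Equation (10).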
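The agreement between the primal value and the one-dimensional dual can be checked numerically in a toy case. The sketch below assumes the two-point space $S = \{0, 1\}$ with hypothetical data `mu`, `f`, `c` and `r`, and approximates both problems by grid search; the function names `primal_value` and `dual_value` are illustrative only. On this grid a plan satisfying the cost constraint (5c) automatically has its second marginal in $B_r(\mu)$, so only (5c) is checked.

```python
# Grid check of primal (5) against the dual sup over lambda on S = {0, 1}.
# All names and data are hypothetical.

def primal_value(mu, f, c, r, steps=100):
    """Grid approximation of problem (5): minimise sum f(y) pi(x, y)."""
    best = float("inf")
    for i in range(steps + 1):
        a = i / steps                  # mass moved from point 0 to point 1
        if a > mu[0]:
            break
        for j in range(steps + 1):
            b = j / steps              # mass moved from point 1 to point 0
            if b > mu[1]:
                break
            pi = [[mu[0] - a, a], [b, mu[1] - b]]
            cost = sum(c[x][y] * pi[x][y] for x in range(2) for y in range(2))
            if cost <= r + 1e-12:      # constraint (5c)
                value = sum(f[y] * pi[x][y] for x in range(2) for y in range(2))
                best = min(best, value)
    return best

def dual_value(mu, f, c, r, lam_max=5.0, steps=500):
    """Grid approximation of sup over lam of E_mu[min_y (f + lam c)] - lam r."""
    best = float("-inf")
    for k in range(steps + 1):
        lam = lam_max * k / steps
        inner = sum(mu[x] * min(f[y] + lam * c[x][y] for y in range(2))
                    for x in range(2))
        best = max(best, inner - lam * r)
    return best

mu = [0.7, 0.3]
f = [0.0, -1.0]
c = [[0.0, 1.0], [1.0, 0.0]]
r = 0.25

print(primal_value(mu, f, c, r), dual_value(mu, f, c, r))
```

In this instance the best feasible plan moves mass $0.25$ from point 0 to point 1, and the dual objective is piecewise linear in $\lambda$ with its peak at $\lambda = 1$; both grids contain these points, so the two printed values should coincide up to rounding.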
Remark 1. A growth condition similar to that of [3] or [8] will be needed in the extension of the result to non-compact $S$.

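Theorem 1. [4, Theorem A] Suppose that $X$ and $Y$ are non-empty sets with $K$ convex-concave like on $X \times Y$. Suppose that $X$ is compact and $K(\cdot, y)$ is l.s.c. on $X$ for each $y$ in $Y$. Then

$$\min_{x \in X} \sup_{y \in Y} K(x, y) = \sup_{y \in Y} \min_{x \in X} K(x, y).$$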

Therefore $\pi_3 := t\pi_1 + (1 - t)\pi_2$ yields equality in the first part of Definition 1. Equality in the second part of Definition 1 is similar.