Equality-Minded Treatment Choice

Abstract The goal of many randomized experiments and quasi-experimental studies in economics is to inform policies that aim to raise incomes and reduce economic inequality. A policy maximizing the sum of individual incomes may not be desirable if it magnifies economic inequality and post-treatment redistribution of income is infeasible. This article develops a method to estimate the optimal treatment assignment policy based on observable individual covariates when the policy objective is to maximize an equality-minded rank-dependent social welfare function, which puts higher weight on individuals with lower-ranked outcomes. We estimate the optimal policy by maximizing a sample analog of the rank-dependent welfare over a properly constrained set of policies. We show that the average social welfare attained by our estimated policy converges to the maximal attainable welfare at the $n^{-1/2}$ rate, uniformly over a large class of data distributions, when the propensity score is known. We also show that this rate is minimax optimal. We provide an application of our method using the data from the National JTPA Study. Supplementary materials for this article are available online.


Introduction
In causal inference studies analyzing experimental or quasi-experimental data, treatment response generally varies with individual observable characteristics. Learning about such heterogeneity from the data is essential for designing individualized treatment rules that assign treatments on the basis of individual observable characteristics. The optimal individualized treatment rule maximizes a social welfare criterion representing the policy maker's preferences over population distributions of post-treatment outcomes. The literature on statistical treatment choice initiated by Manski (2004) emphasizes this perspective of welfare-based empirical policy design and pursues statistically sound ways to estimate the optimal treatment assignment rule.
Research on statistical treatment rules typically focuses on the additive social welfare criterion (sometimes called "utilitarian") defined as the mean of the outcomes in the population, even though welfare economics offers a variety of alternative criteria. The additive social welfare criterion offers analytical and computational convenience because the optimal treatment rule then depends only on the conditional average treatment effect. Empirical researchers studying causal impacts of social programs have stressed the importance of evaluating distributional impacts, which are overlooked when only mean outcomes are considered (e.g., Bitler, Gelbach, and Hoynes 2006). The distributional impact of a policy is especially important when the policy maker is concerned about economic inequality in the population.
CONTACT Toru Kitagawa t.kitagawa@ucl.ac.uk Department of Economics, University College London, Gower Street, London, WC1E 6BT, UK. Supplementary materials for this article are available online. Please go to www.tandfonline.com/UBES.
We study the problem of treatment assignment that aims to maximize a rank-dependent social welfare function (SWF), which has the form

$$W \equiv \int Y_i \cdot \omega(\mathrm{Rank}(Y_i)) \, di, \qquad (1)$$

where $Y_i$ is individual $i$'s outcome, $\mathrm{Rank}(Y_i)$ is the outcome rank of $i$ from the bottom of the outcome distribution, and $\omega(\cdot)$ is a nonnegative weight assigned to each rank. The additive SWF is a special case of (1) with constant $\omega(\cdot)$. The class of generalized Gini SWFs proposed by Mehran (1976) and Weymark (1981) consists of SWFs of the form (1) with nonincreasing $\omega(\cdot)$. It closely relates to income inequality indices, including the widely used Gini index. Blackorby and Donaldson (1978) showed that, given a specification of $\omega(\cdot)$, the rank-dependent SWF can be written as a product of the average outcome and one minus a generalized relative index of inequality, for example, Gini. This implies that these SWFs generate a ranking of outcome distributions that is increasing in the average outcome and decreasing in the chosen index of inequality. While inequality measures are predominantly applied to net income, our analysis allows $Y_i$ to denote other outcome variables of interest, including functions of income, consumption, wealth, or human capital. We will therefore refer to $Y_i$ simply as "the outcome" in this article. The goal is to choose a treatment rule $\delta$ that assigns individuals to one of two treatments $d \in \{0, 1\}$ depending on their observable pretreatment covariates $X \in \mathcal{X}$. This choice is made after experimental data have been collected and analyzed.
We do not consider the problem of optimal experimental design in this article, taking the design as given. We assume that an individual's treatment outcome does not depend on treatments received by others. The policy-maker in our setup can only impact the distribution of outcomes through the choice of a treatment assignment rule and cannot combine it with other redistributive policies.
Finding a policy that maximizes a rank-dependent SWF is a nontrivial problem without a closed-form solution even if the conditional distributions of potential outcomes ($P(Y_0|X)$ and $P(Y_1|X)$) are known. (In a slight abuse of notation typical in the literature, $Y_i$ will denote individual $i$'s realization of the random variable $Y \equiv (1 - D)Y_0 + DY_1$, whereas $Y_0$, $Y_1$, and $Y_d$ will denote random variables for potential outcomes of treatments 0, 1, and $d \in \{0, 1\}$.) Under an additive SWF $\int u(Y_i)\,di$ (averaging either outcomes $Y_i$ or, more generally, their nonlinear transformations $u(Y_i)$), it is optimal to assign for each subgroup the treatment with the highest conditional mean $E(u(Y_d)|X)$. In contrast, a rank-dependent SWF is nondecomposable across subgroups, as the ranking of treatment assignment rules for a given covariate value may change depending on the treatment assignment of other subgroups (see Section B in the online supplement).
We show in Theorem 2.1 that an equality-minded rank-dependent SWF is always maximized by a non-randomized treatment rule (assigning the same treatment to all individuals with identical covariates). This result greatly simplifies the space of treatment rules that need to be considered. It also allows us to index treatment rules by their decision sets $G \subset \mathcal{X}$, denoting all values of the covariates $\{X \in G\}$ for which treatment 1 is assigned.
Our aim is to estimate from the sample data a treatment assignment rule G belonging to a constrained (but generally large) set of feasible policies G, which is a collection of nonrandomized treatment rules indexed by their decision sets. Policy makers often face legal, ethical, or political constraints that restrict how individual characteristics can be used to determine treatment assignment. One of the advantages of our framework is that it easily incorporates such exogenous restrictions. Our analytical results also require G to satisfy a certain complexity restriction (a finite VC-dimension) to prevent overfitting. Kitagawa and Tetenov (2018a) argue that this is not restrictive for many public policy applications and provide rich examples of treatment rule classes that satisfy this complexity restriction.
We propose estimating the treatment rule $\hat{G}$ by maximizing a sample analog of $W(G)$, the SWF evaluated at the population distribution of $Y$ that is realized if treatment assignment rule $G$ is implemented. The general idea of estimating a policy by maximizing an empirical welfare criterion is in line with the method developed by Kitagawa and Tetenov (2018a) for the additive welfare case; hence, we refer to the method proposed in the current article for rank-dependent SWFs as equality-minded empirical welfare maximization (EWM). However, the results of Kitagawa and Tetenov (2018a) cannot be directly applied in this setting. We had to construct a suitable sample analog of $W(G)$ using the whole distribution function and derive new methods to show that this distributional welfare analog (which is not a sample average, as in Kitagawa and Tetenov (2018a)) converges uniformly over a VC-class of policies. Additionally, we develop a novel proof technique to establish this convergence for unbounded outcome distributions using a weak tail condition, whereas the results in Kitagawa and Tetenov (2018a) for additive welfare require outcomes to be bounded.
We evaluate the statistical performance of $\hat{G}$ in terms of regret, $\sup_{G \in \mathcal{G}} W(G) - E_{P^n}[W(\hat{G})]$, which is the average welfare loss relative to the maximum welfare achievable in $\mathcal{G}$ with respect to the sampling distribution $P^n$ of a size $n$ sample. We derive a non-asymptotic and distribution-free upper bound on regret in terms of the sample size $n$ and a measure of complexity of $\mathcal{G}$, and show that it converges to zero at the $n^{-1/2}$ rate. We also show that this rate is minimax optimal over a minimally constrained class of population distributions, ensuring that no other data-driven treatment rule can lead to a faster welfare loss convergence rate uniformly over the class of distributions.

The remainder of this article is organized as follows. Section 1.1 provides an overview of related literature. Section 2 discusses the properties of equality-minded rank-dependent social welfare functions and their application to the analysis of treatment choice. In Section 3, we introduce the general analytical framework and show the convergence rate properties of the EWM rule for rank-dependent welfare. Section 4 provides extensions of the model that incorporate cost or capacity constraints and that allow the sampled population to be only a subset of the full population on which social welfare is defined. In Section 5, we apply our method to the experimental data from the National JTPA Study. Main proofs are collected in Appendix A. An online supplement contains additional proofs, examples, and extensions.

Related Literature
The analysis of statistical treatment rules was pioneered by Manski (2004), and is a growing area of research in econometrics. Important recent developments can be found in Dehejia (2005), Hirano and Porter (2009), Stoye (2009, 2012), Chamberlain (2011), Bhattacharya and Dupas (2012), Tetenov (2012), Kasy (2016, 2018), Kitagawa and Tetenov (2018a), Mbakop and Tabord-Meehan (2018), and Athey and Wager (2018), among others. All the existing works on treatment choice except for Kasy (2016) posit an additive welfare criterion as the objective function of the policy maker. Motivated by policy concerns about economic inequality, the current article instead analyzes the treatment choice problem for a class of rank-dependent social welfare functions that embody inequality aversion.
The main feature distinguishing the current analysis from the EWM approach for the additive welfare case considered in Kitagawa and Tetenov (2018a) is that the rank-dependent welfare criterion is nondecomposable. Computing the empirical welfare criterion then requires that the whole distribution of outcomes that would be generated by each policy is estimated first, before a nonlinear transformation is applied to this distribution estimate. This problem has not been previously considered in econometrics nor in the machine learning and statistics literatures on empirical risk minimization problems (Vapnik 1998), where the empirical risk criterion always takes the form of a sample average (with the exception of Wang et al. (2018), who maximized one specific quantile of the outcome distribution). Another novel technical contribution of this article is that we allow outcomes to be unbounded (which is important for analysis of economic outcomes like earnings) with only a weak restriction on the tail of the potential outcome distribution. Kasy (2016) analyzed treatment choice for a class of social welfare functions including rank-dependent social welfare. Our approach differs from his in several aspects. First, Kasy (2016) considered a linear approximation of the rank-dependent welfare function around a status-quo policy to discuss (partial) identification of a welfare-improving local policy change. We instead focus on a globally optimal policy without invoking approximations. Second, we assume that the welfare criterion is point-identified by the sampling process, while Kasy (2016) focused on partial identification of the welfare criterion and construction of the social planner's incomplete preference ordering over policies. Third, we study estimation of an optimal policy and examine optimality of the estimator in terms of welfare regret convergence rate, while Kasy (2016) studied asymptotically valid inference on the welfare rankings.
Aaberge, Havnes, and Mogstad (2013) estimated a rank-dependent social welfare function of two policy alternatives: with and without uniform implementation of the treatment. Firpo and Pinto (2016) estimated the impact of uniform implementation of the treatment on measures of inequality, including the Gini coefficient. In contrast, the focus of the current article is on estimating the optimal treatment rule from a large class of individualized assignment rules.
We consider social welfare functions that satisfy the axiom of anonymity, that is, social welfare functions that are functionals of the distribution of outcomes after treatment assignment and that are indifferent to reshuffling of the outcomes between individuals. Thus, our objective does not depend on the distribution of individual treatment effects $P(Y_1 - Y_0)$, which has also received attention recently in the program evaluation literature (Heckman, Smith, and Clements 1997; Firpo and Ridder 2019; Fan and Park 2010).
Building social welfare functions satisfying the Pigou-Dalton principle of transfers (that a transfer of income from a higher ranked individual to a lower ranked individual that does not change their ranks is always desirable) is one of the central themes in the literature of inequality measurement and welfare economics (see Cowell 1995, 2000; Lambert 2001). Currently, there are two widely used classes of social welfare functions that meet the Pigou-Dalton principle. The first is the class of Atkinson-type SWFs (Atkinson 1970),

$$W(F) \equiv \int U(y)\,dF(y),$$

where $F(y)$ is the cumulative distribution function (cdf) of the outcome and $U(\cdot)$ is a concave nondecreasing function. Since the Atkinson-type social welfare function is linear in $F$, the EWM approach of Kitagawa and Tetenov (2018a) can be readily applied by defining the outcome observations as $U(Y)$. The second class, which is this article's main focus, is the class of rank-dependent social welfare functions introduced by Mehran (1976), Blackorby and Donaldson (1978), and Weymark (1981) and axiomatized by Yaari (1988). The key axiom of Yaari (1988) that distinguishes rank-dependent social welfare from Atkinson-type social welfare is invariance under a rank-preserving lump-sum change of incomes at the upper tail, which means that the preference ordering between two income distributions $F$ and $\tilde{F}$ is invariant to any fixed lump-sum increase (decrease) in income of all those above (below) the $\tau$th quantile of $F$ and $\tilde{F}$ for any $\tau \in (0, 1)$. On the other hand, the key axiom that characterizes the Atkinson-type social welfare is the independence axiom: the preference ordering between $F$ and $\tilde{F}$ is invariant to any mixing with another common income distribution. As noted in Machina (1982), the rank-dependent social welfare generalizes the Atkinson-type social welfare exactly as rank-dependent expected utility generalizes classical von Neumann-Morgenstern expected utility (Machina 1982; Quiggin 1982) by relaxing the controversial independence axiom.
These rich and insightful works in welfare economics have not yet been well linked to econometrics and empirical analysis for policy design. One of the main aims of the current article is to establish a link between these two important literatures.

Treatment Choice With Equality-Minded Social Welfare Functions
We call a SWF equality-minded if it satisfies the Pigou-Dalton principle of transfers: a transfer of income from a higher ranked individual to a lower ranked individual is always desirable when it does not change their ranks in the income distribution. The equality-minded SWFs analyzed in this article are the rank-dependent SWFs with decreasing welfare weights (also called generalized Gini SWFs), introduced by Mehran (1976) and Weymark (1981) and axiomatized by Yaari (1988). An equality-minded rank-dependent SWF admits the following representation:

$$W(F) \equiv \int_0^\infty \Lambda(F(y))\,dy, \qquad (2)$$

where $\Lambda(\cdot) : [0, 1] \to [0, 1]$ is a nonincreasing, nonnegative function with $\Lambda(0) = 1$ and $\Lambda(1) = 0$. A rank-dependent SWF satisfies the Pigou-Dalton principle of transfers if and only if $\Lambda(\cdot)$ is convex. The term rank-dependent is due to an equivalent representation of (2) as a weighted sum of outcomes. Given that a convex $\Lambda(\cdot)$ is almost everywhere differentiable, we can apply integration by parts to equivalently express $W(F)$ as

$$W(F) = \int_0^1 F^{-1}(\tau)\,\omega(\tau)\,d\tau, \qquad (3)$$

where $F^{-1}(\tau) \equiv \inf\{y : F(y) \ge \tau\}$ is the $\tau$th quantile of the outcome distribution and $\omega(\tau) \equiv -\Lambda'(\tau)$ specifies the rank-specific welfare weight assigned to individuals with outcomes at the $\tau$th quantile. If the SWF is equality-minded, then $\Lambda(\cdot)$ is convex; hence, $\omega(\cdot)$ is nonincreasing and assigns larger welfare weights to individuals with lower outcomes.
Assumption 2.1 implies that the rank-specific weight function ω(·) defined in (3) does not asymptote at the origin, that is, the welfare weight assigned to the lowest rank is bounded. This restriction rules out SWFs that closely approximate the Rawlsian social welfare function.
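As a numerical illustration, the two equivalent forms of a rank-dependent SWF (the integral of the weighting function applied to the cdf, and the rank-weighted sum of outcomes) can be checked on an empirical distribution of n nonnegative outcomes. The sketch below is not part of the original analysis: the function names are hypothetical, and the weighting function `lam` is assumed to take the extended Gini form (1 − τ)^(k−1), which is consistent with the rank weights 2(1 − τ) for k = 3 and 5(1 − τ)^4 for k = 6 quoted in the empirical illustration.

```python
def extended_gini_lambda(k):
    """Assumed weighting function (1 - t)**(k - 1): k = 2 additive, k = 3 standard Gini."""
    return lambda t: (1.0 - t) ** (k - 1)


def welfare_integral(outcomes, lam):
    """Integral over y >= 0 of lam(F(y)), with F the empirical cdf of the outcomes."""
    ys = sorted(outcomes)
    n = len(ys)
    total, prev = 0.0, 0.0
    for j, y in enumerate(ys):
        # The empirical cdf equals j/n on the interval [prev, y).
        total += lam(j / n) * (y - prev)
        prev = y
    return total


def welfare_rank_weighted(outcomes, lam):
    """Rank-weighted sum: the j-th lowest outcome gets weight lam((j-1)/n) - lam(j/n)."""
    ys = sorted(outcomes)
    n = len(ys)
    return sum(y * (lam(j / n) - lam((j + 1) / n)) for j, y in enumerate(ys))


additive = extended_gini_lambda(2)
gini = extended_gini_lambda(3)

mean_welfare = welfare_integral([1, 3], additive)      # the plain mean
gini_welfare = welfare_integral([1, 3], gini)          # mean times (1 - Gini index)
after_transfer = welfare_integral([1.5, 2.5], gini)    # rank-preserving transfer
```

With two outcomes 1 and 3, the additive criterion returns the mean, while the Gini criterion returns the mean scaled down by the Gini index of the distribution. A rank-preserving transfer of 0.5 from the richer to the poorer individual leaves the additive welfare unchanged and raises the Gini welfare, which is the Pigou-Dalton property in a small example.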
We consider the problem of choosing a policy that assigns individuals to one of two treatments $d \in \{0, 1\}$ to maximize the chosen SWF. A treatment assignment rule $\delta : \mathcal{X} \to [0, 1]$ specifies the proportion of individuals with observable pretreatment covariates $X \in \mathcal{X} \subset \mathbb{R}^{d_x}$ who will be assigned to treatment 1 by the policy-maker. The policy randomly assigns individuals with covariates $X$ to the two treatments with probabilities $1 - \delta(X)$ and $\delta(X)$. The population distribution of outcomes induced by treatment rule $\delta$ has cdf

$$F_\delta(y) \equiv \int_{\mathcal{X}} \big[(1 - \delta(x))\,F_{Y_0|X=x}(y) + \delta(x)\,F_{Y_1|X=x}(y)\big]\,dP_X(x),$$

where $Y_0$ and $Y_1$ denote the potential outcomes of the two treatments with conditional distributions $F_{Y_0|X}$ and $F_{Y_1|X}$ given $X$, and $P_X$ is the marginal distribution of $X$.
If the population distribution of $(Y_0, Y_1, X)$ were known, the optimal policy maximizing the social welfare function (2) would be

$$\delta^* \in \arg\max_{\delta : \mathcal{X} \to [0,1]} W(F_\delta).$$

For the additive welfare function (the mean of $Y$), the welfare maximization problem simplifies to

$$\max_{\delta} \int_{\mathcal{X}} \big[(1 - \delta(x))\,E(Y_0|X=x) + \delta(x)\,E(Y_1|X=x)\big]\,dP_X(x).$$

This social welfare function is additive across covariates and depends on the outcome distributions only through their conditional means $E(Y_d|X)$. Then the optimal policy is

$$\delta^*(x) = 1\{E(Y_1|X=x) \ge E(Y_0|X=x)\}.$$

In contrast, the optimal rule for a rank-dependent welfare function (2) depends on the whole conditional distributions of potential outcomes $F_{Y_0|X}$ and $F_{Y_1|X}$, not only on their means. The optimal rule can differ from the one maximizing an additive welfare if there is no first-order stochastic dominance relationship between $F_{Y_0|X}$ and $F_{Y_1|X}$ for some covariate values. Even with the knowledge of the distribution of $(Y_0, Y_1, X)$, a simple characterization of the optimal rule does not seem available for rank-dependent SWFs. The following theorem mitigates this complication by substantially reducing the set of candidate treatment rules that need to be considered.
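Theorem 2.1. If $W(\cdot)$ satisfies Assumption 2.1, then for every measurable treatment rule $\delta : \mathcal{X} \to [0, 1]$:

$$W(F_\delta) \le \sup_{u \in [0,1]} W\big(F_{1\{\delta(X) \ge u\}}\big),$$

where $F_{1\{\delta(X) \ge u\}}$ is the outcome cdf induced by the non-randomized rule that assigns treatment 1 whenever $\delta(X) \ge u$. If all upper level sets of $\delta$ belong to a collection $\mathcal{G}$ of Borel subsets of $\mathcal{X}$:

$$\{x \in \mathcal{X} : \delta(x) \ge u\} \in \mathcal{G} \quad \text{for all } u \in [0, 1],$$

then the welfare of $\delta$ does not exceed the supremum of the welfare attained by non-randomized rules with decision sets in $\mathcal{G}$.

This theorem shows that a treatment assignment rule maximizing equality-minded rank-dependent welfare is nonrandomized (assigns all individuals with the same covariates to the same treatment). We can therefore restrict our search for an optimal policy to the set of non-randomized rules that can be succinctly characterized by their decision sets $G \subset \mathcal{X}$. Decision set $G$ determines the group of individuals $\{X \in G\}$ to whom treatment 1 is assigned. With abuse of notation, we denote the welfare of a non-randomized treatment rule with decision set $G$ by $W(G)$, suppressing the cumulative distribution function in its argument,

$$W(G) \equiv W(F_G), \qquad F_G(y) \equiv \int_{\mathcal{X}} \big[F_{Y_0|X=x}(y)\,1\{x \notin G\} + F_{Y_1|X=x}(y)\,1\{x \in G\}\big]\,dP_X(x). \qquad (9)$$

Our goal is to estimate from the sample data a treatment assignment rule that attains the maximum level of social welfare $\sup_{G \in \mathcal{G}} W(G)$ over the set of feasible policies $\mathcal{G} \equiv \{G \subset \mathcal{X}\}$, which is a collection of nonrandomized treatment rules (subsets of the covariate space $\mathcal{X}$). An important feature of our empirical welfare maximization approach is that the complexity of $\mathcal{G}$ is constrained by a finite Vapnik-Chervonenkis (VC) dimension (defined in Appendix A).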
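The claim of Theorem 2.1 that randomizing treatment cannot raise an equality-minded rank-dependent welfare can be checked numerically in a toy case. The sketch below is illustrative only: it assumes a single covariate cell, the standard Gini weighting function (1 − τ)², and two made-up potential outcome distributions; all names are hypothetical.

```python
def welfare(points, lam):
    """Integral over y >= 0 of lam(F(y)) for a discrete distribution.

    points: list of (outcome, probability) pairs with nonnegative outcomes.
    """
    total, prev, cdf = 0.0, 0.0, 0.0
    for y, p in sorted(points):
        total += lam(cdf) * (y - prev)  # F equals `cdf` on the interval [prev, y)
        cdf += p
        prev = y
    return total


def mix(f0, f1, delta):
    """Outcome distribution when a fraction delta is assigned to treatment 1."""
    return ([(y, (1 - delta) * p) for y, p in f0] +
            [(y, delta * p) for y, p in f1])


gini = lambda t: (1.0 - t) ** 2     # standard Gini weighting, assumed for illustration

f0 = [(2.0, 1.0)]                   # treatment 0: everyone receives 2
f1 = [(1.0, 0.5), (4.0, 0.5)]       # treatment 1: 1 or 4 with equal probability

w_control = welfare(mix(f0, f1, 0.0), gini)   # nobody treated
w_treated = welfare(mix(f0, f1, 1.0), gini)   # everybody treated
w_random = welfare(mix(f0, f1, 0.5), gini)    # half treated at random
```

In this example the randomized rule yields lower welfare than either non-randomized rule, even though treatment 1 has the higher mean outcome. The inequality-averse criterion penalizes the dispersion that randomization adds to the outcome distribution.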
Assumption 2.2 (VC). The class of decision sets G has a finite VC-dimension v < ∞.
The VC-dimension is a restriction on the complexity of the set of feasible policies. Without it, maximizing a sample analog of $W(G)$ over $\mathcal{G}$ can lead to arbitrarily complicated policies (overfitting) and prevent us from learning the optimal policy on the basis of a finite number of observations. It does not require $\mathcal{G}$ to be finite and allows for very large classes of treatment rules. For example, a class of treatment rules defined by a linear inequality in the covariates has a finite VC-dimension. See Kitagawa and Tetenov (2018a) for other examples of classes $\mathcal{G}$ that satisfy Assumption 2.2. An example of $\mathcal{G}$ that does not satisfy Assumption 2.2 is the class of all monotone treatment rules; classes of this kind are studied by Mbakop and Tabord-Meehan (2018) in the additive welfare case.

EWM for Equality-Minded Welfare
We proceed to propose our method of estimating the treatment rule in finite samples and analyze its properties.
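The population from which the sample is drawn is characterized by $P$, a joint distribution of $(Y_0, Y_1, D, X)$. The data are a size $n$ random sample from $P$ of observations $Z_i = (Y_i, D_i, X_i)$, $i = 1, \dots, n$, where $Y_i = (1 - D_i)Y_{0i} + D_i Y_{1i}$ is the observed outcome. Based on these data, the policy-maker has to choose a conditional treatment rule $G \in \mathcal{G}$ that determines whether individuals with covariates $X$ in the target population will be assigned to treatment 0 or to treatment 1. The following are our maintained assumptions about the class $\mathcal{P}$ of population distributions of $(Y_0, Y_1, D, X)$:

Assumption 3.1.
(UCF) Unconfoundedness: $(Y_0, Y_1) \perp D \mid X$.
(SO) Strict overlap: There exists $\kappa \in (0, 1/2]$ such that the propensity score satisfies $e(x) \in [\kappa, 1 - \kappa]$ for all $x \in \mathcal{X}$.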
These assumptions generally hold if the data come from an experiment with randomized treatment assignment. In observational studies, on the other hand, unconfoundedness rules out situations in which the observed treatment assignments depend on subjects' unobserved characteristics that can be associated with their potential outcomes. Strict overlap can also be violated in an observational study if only one of the treatments is assigned in the sampling process for some covariate values. We do not constrain any feature of the joint distribution of (Y 0 , Y 1 , X) except that the distributions of Y 0 and Y 1 satisfy the tail condition (TC). A sufficient condition for (TC) is that for some > 0. The outcome variable and the covariates can be discrete, continuous, or their combination, and the support of X does not have to be bounded. Throughout the main text, we maintain the assumption that the propensity score e(X) ≡ P(D = 1|X) is known, as is usually the case in experimental data. Section C in the online supplement extends the analysis to observational data for which the propensity score is unknown and needs to be estimated.
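We estimate the treatment rule by maximizing a sample analog of the population SWF. The equality-minded EWM treatment rule $\hat{G}$ maximizes a sample analog $\hat{W}(G)$ of the welfare criterion over the set of feasible rules $G \in \mathcal{G}$. The unknown outcome distribution $F_G$ induced by treatment rule $G$ in (9) could be estimated by

$$\hat{F}_G(y) \equiv 1 - \frac{1}{n}\sum_{i=1}^{n}\left[\frac{D_i\,1\{X_i \in G\}}{e(X_i)} + \frac{(1 - D_i)\,1\{X_i \notin G\}}{1 - e(X_i)}\right] 1\{Y_i > y\}. \qquad (12)$$

Under Assumption 3.1 (UCF), $\hat{F}_G(y)$ is an unbiased estimator of $F_G(y)$.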
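The sample analog of welfare (Equation (2)) is defined as

$$\hat{W}(G) \equiv \int_0^\infty \Lambda\big(\hat{F}_G(y) \vee 0\big)\,dy, \qquad (13)$$

and the equality-minded EWM treatment rule is then

$$\hat{G} \in \arg\max_{G \in \mathcal{G}} \hat{W}(G).$$

The maximum ($\vee$) of $\hat{F}_G(y)$ and 0 is taken in (13) because $\hat{F}_G(y)$ may take values smaller than 0, for which $\Lambda(\cdot)$ is not defined. The summands in (12) are nonnegative, so $\hat{F}_G(y) \le 1$ for all $y$, and $\hat{F}_G(y) = 1$ for all $y \ge \max_{1 \le i \le n} Y_i$.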

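$\hat{F}_G$ may not be a proper cdf because

$$\lim_{y \to -\infty} \hat{F}_G(y) = 1 - \frac{1}{n}\sum_{i=1}^{n} w_i(G), \qquad w_i(G) \equiv \frac{D_i\,1\{X_i \in G\}}{e(X_i)} + \frac{(1 - D_i)\,1\{X_i \notin G\}}{1 - e(X_i)},$$

may be either below or above zero in finite samples. We also consider properties of the normalized equality-minded EWM rule using a normalized cdf sample analog

$$\hat{F}^R_G(y) \equiv 1 - \frac{\sum_{i=1}^{n} w_i(G)\,1\{Y_i > y\}}{\sum_{i=1}^{n} w_i(G)}, \qquad \hat{W}^R(G) \equiv \int_0^\infty \Lambda\big(\hat{F}^R_G(y)\big)\,dy,$$

which always yields a proper cdf. The ranking of treatment rules by the normalized criterion $\hat{W}^R(G)$ is invariant to positive affine transformations of outcomes $Y$, whereas the ranking by $\hat{W}(G)$ is not.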
Note that This is similar to the idea of normalizing propensity score weights recommended for the estimation of the average treatment effect (Imbens 2004).
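The estimation steps of this section can be sketched in code. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it builds the inverse-propensity-weighted cdf analog (one minus the weighted share of observations with outcomes above y), clips it at zero, integrates the weighting function over it, and searches a simple hypothetical policy class of threshold rules "treat if x ≤ t" on a scalar covariate. The `normalize` flag divides by the sum of the weights rather than by n. All function names, the threshold class, and the toy data are assumptions made for illustration.

```python
def ipw_weights(d, x, e, in_G):
    """Inverse-propensity weight of each observation under decision set G."""
    return [di / ei if in_G(xi) else (1 - di) / (1 - ei)
            for di, xi, ei in zip(d, x, e)]


def empirical_welfare(y, d, x, e, in_G, lam, normalize=False):
    """Sample analog of rank-dependent welfare for the rule 'treat iff in_G(x)'."""
    w = ipw_weights(d, x, e, in_G)
    scale = sum(w) if normalize else len(y)
    if scale == 0:
        raise ValueError("no observation carries weight under this rule")
    tail = sum(w) / scale                 # weighted share of observations with Y_i > y
    total, prev = 0.0, 0.0
    for yi, wi in sorted(zip(y, w)):
        # Clip the cdf analog at 0; the upper clip only guards against rounding.
        cdf = min(1.0, max(0.0, 1.0 - tail))
        total += lam(cdf) * (yi - prev)   # cdf analog is constant on [prev, yi)
        tail -= wi / scale
        prev = yi
    return total


def ewm_threshold_rule(y, d, x, e, lam, normalize=False):
    """Maximize empirical welfare over 'treat nobody' and rules 'treat iff x <= t'."""
    best = None
    for t in [None] + sorted(set(x)):     # None encodes the empty decision set
        rule = (lambda xi: False) if t is None else (lambda xi, t=t: xi <= t)
        value = empirical_welfare(y, d, x, e, rule, lam, normalize)
        if best is None or value > best[1]:
            best = (t, value)
    return best


gini = lambda t: (1.0 - t) ** 2           # standard Gini weighting, assumed

# Toy experimental sample with known propensity score 1/2.
y = [4, 2, 1, 3]
d = [1, 0, 1, 0]
x = [0, 0, 1, 1]
e = [0.5] * 4
best_threshold, best_value = ewm_threshold_rule(y, d, x, e, gini)
```

In the toy sample the search selects the rule that treats only individuals with x ≤ 0. The normalized variant matters when the weights do not sum to n, as in an unbalanced sample: shifting all outcomes by a constant then shifts the normalized criterion by the same constant for every rule, so the ranking of rules is preserved.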

Rate Optimality of EWM
The next theorem derives a uniform upper bound of the average welfare loss of the EWM rule.
for all n > 1, and for all n > C R where P is the class of all distributions satisfying Assumption 3.1 and C, C R 1 , C R 2 , C R 3 > 0 are universal constants.
Proof. The proof of (17) is in Appendix A. The proof of (18) is in the online supplement.
This theorem shows that for a large class of data generating processes characterized by Assumption 3.1, the welfare of the equality-minded EWM rule is guaranteed to converge to the maximal attainable welfare no slower than at the $n^{-1/2}$ rate (the second term in bound (18) is of a lower order). The uniform convergence rate of $n^{-1/2}$ coincides with that of the EWM rule for the additive welfare shown in Theorem 2.1 of Kitagawa and Tetenov (2018a). This is a nontrivial result, given that the rank-dependent welfare function depends on the whole conditional distributions of the potential outcomes given covariates, rather than only on their conditional means, as is the case for the additive welfare criterion.
The next theorem provides a universal lower bound for the worst-case average welfare loss of any treatment rule.
Theorem 3.2. Suppose that Assumptions 2.1 and 2.2 hold with v ≥ 2, then for any non-randomized treatment choice rule G that is a function of the sample, and for any τ * ∈ (0, 1] at which where P is the class of all distributions satisfying Assumption 3.1.
Proof. See the online supplement.
Since $\Lambda(\cdot)$ is convex and $|\Lambda'(0)| > 0$, there also exists some $\tau^* > 0$ for which $|\Lambda'(\tau^*)| > 0$. Hence, the bound (19) is always positive for some $\tau^* > 0$. A comparison of the lower bound of this theorem with the welfare loss upper bound of the EWM rule obtained in Theorem 3.1 shows that the EWM rule is minimax rate optimal over the class of data generating processes satisfying Assumption 3.1. We can therefore claim that in the absence of any additional restrictions other than Assumption 3.1, no other data-driven procedure for obtaining a nonrandomized rule can outperform the EWM rule in terms of the uniform convergence rate over $\mathcal{P}$. This optimality claim is analogous to that of the EWM rule for the additive welfare case (Kitagawa and Tetenov 2018a, Theorems 2.1 and 2.2), and the minimax optimal rate attained by the equality-minded EWM rule is the same as the optimal rate in the additive welfare case. It is remarkable that even in the absence of any analytical characterization of the true optimal assignment rule in terms of the population distribution of $(Y_0, Y_1, X)$, maximizing the empirical welfare leads to a policy that, if implemented, is guaranteed to reach the maximum attainable social welfare at the minimax optimal rate.
It is also worth noting that the VC-dimension of G appears in the same order both in the upper and lower bound expressions of Theorems 3.1 and 3.2. Since these bounds are non-asymptotic, we can let v increase with the sample size, and we can still conclude the minimax rate optimality of the equality-minded EWM rule. This insight is similar to the EWM rule for the additive welfare case (Kitagawa and Tetenov 2018a, Remark 2.6).

Social Welfare is Defined on a Population Larger than the Sampled Population
One of the distinguishing features of rank-dependent social welfare is that it is not additive over subpopulations (see Section B in the online supplement for an illustration). If the subpopulation for which the policy intervention takes place (e.g., unemployed workers) is only a subset of the whole population on which the rank-dependent SWF is defined (e.g., the population of a country), it is important to explicitly take into account the outcome distribution for the rest of the population in estimating the optimal assignment rule. Suppose that the social welfare function is defined on a population with distribution $J$ that is a mixture of two subpopulations with distributions $F$ and $H$:

$$J(y) = \eta F(y) + (1 - \eta) H(y).$$

Let $F$ be the outcome distribution on the targeted subpopulation from which the experimental data are sampled and on which the estimated treatment policy is to be implemented. Let $H$ be the outcome distribution for the rest of the population (excluded from the sampling process and unaffected by the chosen treatment assignment rule). The mixture weight $\eta$ represents the size of subpopulation $F$. For simplicity, we assume that $\eta$ and $H$ are known to the social planner. We also assume that the outcome distribution $H$ is invariant to the treatment assignment policy applied to subpopulation $F$, for example, there are no spillover or general equilibrium effects across $F$ and $H$. Implementing treatment assignment rule $\{X \in G\}$ on subpopulation $F$ leads to full population welfare equal to

$$W(J_G) = \int_0^\infty \Lambda\big(\eta F_G(y) + (1 - \eta) H(y)\big)\,dy,$$

where $F_G(\cdot)$ is the cdf defined in (9). The empirical welfare maximization method in this case consists of maximizing a sample analog of $W(J_G)$,

$$\int_0^\infty \Lambda\big(\eta\,(\hat{F}_G(y) \vee 0) + (1 - \eta) H(y)\big)\,dy,$$

where $\hat{F}_G$ is defined in (12).
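The role of the rest of the population can be illustrated with a small numerical sketch. It is illustrative only and assumes discrete outcome distributions, the standard Gini weighting function (1 − τ)², and made-up numbers; the function names are hypothetical. Two candidate policies for the targeted subpopulation are compared, first in isolation and then as one tenth of a population whose remaining members all have a high outcome.

```python
def welfare(points, lam):
    """Integral over y >= 0 of lam(J(y)) for a discrete distribution.

    points: list of (outcome, probability) pairs with nonnegative outcomes.
    """
    total, prev, cdf = 0.0, 0.0, 0.0
    for y, p in sorted(points):
        total += lam(cdf) * (y - prev)  # the cdf equals `cdf` on [prev, y)
        cdf += p
        prev = y
    return total


def full_population_welfare(f_points, h_points, eta, lam):
    """Welfare of the mixture eta * F + (1 - eta) * H."""
    j_points = ([(y, eta * p) for y, p in f_points] +
                [(y, (1 - eta) * p) for y, p in h_points])
    return welfare(j_points, lam)


gini = lambda t: (1.0 - t) ** 2          # standard Gini weighting, assumed

f_a = [(1.0, 0.5), (3.0, 0.5)]           # policy A: higher mean, more dispersion
f_b = [(1.9, 1.0)]                       # policy B: lower mean, no dispersion
h = [(10.0, 1.0)]                        # rest of the population, unaffected

alone_a = full_population_welfare(f_a, h, 1.0, gini)
alone_b = full_population_welfare(f_b, h, 1.0, gini)
mixed_a = full_population_welfare(f_a, h, 0.1, gini)
mixed_b = full_population_welfare(f_b, h, 0.1, gini)
```

Evaluated on the targeted subpopulation alone, policy B ranks above policy A. Once the subpopulation is embedded at the bottom of a larger population, the welfare weights across its members become nearly equal, and policy A (with the higher mean) ranks above policy B. This is why the outcome distribution of the untreated population enters the criterion.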
The uniform convergence proof of Theorem 3.1 can be easily extended to this case, the only change being the proportionality of the bound to η.
where C > 0 is a universal constant defined in Theorem 3.1.

Cost of Treatment
In the preceding sections, we did not take into account the cost of treatment even though cost differences among treatments are often an important consideration in practice. In this section, we discuss how to take the cost of treatment into account in the estimation of welfare maximizing treatment assignment policies. Let $0 \le c(x) < \infty$, $x \in \mathcal{X}$, be the cost of treatment 1 for a subject whose observable characteristics are $x$. We assume that treatment 0 is cost-free and $c(\cdot)$ is known. For the additive social welfare function, we can easily incorporate treatment costs into the EWM criterion by subtracting the per-capita cost of treatment $C(G) \equiv \int_G c(x)\,dP_X(x)$ from $W(G)$. Because the additive social welfare criterion depends only on the average treatment cost, it is invariant to assumptions about who pays the cost. For rank-dependent social welfare this invariance does not hold; hence, we have to be explicit about who bears the cost in the construction of the social welfare criterion. We illustrate this using two cost allocation scenarios.
In the first scenario, assume that the outcome variable is income and the cost of treatment is self-financed by each recipient of the treatment. Specifically, the income of individuals assigned to treatment 1 (individuals with $X \in G$) will be reduced by the full cost of their own treatment $c(X)$. The transformed potential outcomes in this scenario are $\tilde{Y}_{1i} \equiv Y_{1i} + \bar{c} - c(X_i)$ and $\tilde{Y}_{0i} \equiv Y_{0i} + \bar{c}$. We add the constant $\bar{c} \equiv \sup_{x \in \mathcal{X}} c(x) < \infty$ to all outcomes to keep them nonnegative in line with Assumption 3.1 (TC). The welfare ranking of policies is unchanged when a constant is added uniformly to all outcomes.
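The rank-dependent SWF of policy $G$ with self-financed treatment cost is

$$W^{sf}(G) \equiv \int_0^\infty \Lambda\left(\int_{\mathcal{X}} \big[F_{\tilde{Y}_0|X=x}(y)\,1\{x \notin G\} + F_{\tilde{Y}_1|X=x}(y)\,1\{x \in G\}\big]\,dP_X(x)\right) dy,$$

where $F_{\tilde{Y}_0|X=x}(\cdot)$ and $F_{\tilde{Y}_1|X=x}(\cdot)$ are the cdfs of the transformed potential outcomes. An empirical analog for $W^{sf}(G)$ can be obtained by replacing $\hat{F}_G(y)$ in (13) by

$$1 - \frac{1}{n}\sum_{i=1}^{n}\left[\frac{D_i\,1\{X_i \in G\}}{e(X_i)} + \frac{(1 - D_i)\,1\{X_i \notin G\}}{1 - e(X_i)}\right] 1\{\tilde{Y}_i > y\},$$

where $\tilde{Y}_i \equiv Y_i + \bar{c} - D_i \cdot c(X_i)$. Since this modification does not affect the validity of Assumption 3.1, the EWM rule with self-financed treatment cost attains the uniform welfare loss upper bounds of Theorem 3.1 with $\Upsilon + \bar{c}$ in place of $\Upsilon$.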
In the second scenario, suppose that the treatment cost is financed by all of the population members equally via a lump-sum transfer. The average per-capita treatment cost $C(G)$ is subtracted from every individual's income regardless of their covariates and assigned treatment. Using representation (3), the rank-dependent SWF with equal lump-sum treatment costs can be expressed as

$$W^{ls}(G) \equiv \int_0^1 \big(F_G^{-1}(\tau) - C(G) + \bar{c}\big)\,\omega(\tau)\,d\tau = W(G) - C(G) + \bar{c},$$

using the fact that $\int_0^1 \omega(\tau)\,d\tau = \Lambda(0) - \Lambda(1) = 1$ and adding $\bar{c}$ to ensure nonnegative outcomes. Per-capita treatment cost of policy $G$ could be estimated using its sample analog $\hat{C}(G) \equiv \frac{1}{n}\sum_{i=1}^{n} c(X_i) \cdot 1\{X_i \in G\}$, and the EWM rule is obtained by

$$\hat{G} \in \arg\max_{G \in \mathcal{G}} \big\{\hat{W}(G) - \hat{C}(G)\big\}.$$

In this article, we do not consider the joint optimization of the treatment assignment and cost allocation. However, the comparison of $W^{sf}(G)$ and $W^{ls}(G)$ shows that the allocation of treatment costs across the population can be used as an additional vehicle of policy intervention to increase a rank-dependent SWF.
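The difference between the two cost allocation scenarios can be made concrete with a small sketch. It is illustrative only: it works with a realized list of equally weighted outcomes and the cost each individual would bear under self-financing, assumes the weighting function (1 − τ)^(k−1) with k = 2 (additive) and k = 3 (standard Gini), and uses hypothetical function names and made-up numbers.

```python
def welfare(outcomes, lam):
    """Integral over y >= 0 of lam(F(y)), with F the empirical cdf of the outcomes."""
    ys = sorted(outcomes)
    n = len(ys)
    total, prev = 0.0, 0.0
    for j, y in enumerate(ys):
        total += lam(j / n) * (y - prev)   # the empirical cdf equals j/n on [prev, y)
        prev = y
    return total


def self_financed_welfare(outcomes, costs, c_bar, lam):
    """Each individual pays the cost of their own treatment (zero if untreated)."""
    return welfare([y + c_bar - c for y, c in zip(outcomes, costs)], lam)


def lump_sum_welfare(outcomes, costs, c_bar, lam):
    """Everyone pays the same per-capita share of the total treatment cost."""
    return welfare(outcomes, lam) - sum(costs) / len(costs) + c_bar


additive = lambda t: 1.0 - t          # k = 2, assumed form
gini = lambda t: (1.0 - t) ** 2       # k = 3, assumed form

outcomes = [1.0, 3.0]                 # incomes before costs are deducted
costs = [0.0, 1.0]                    # only the higher-income individual is treated
c_bar = 1.0                           # upper bound on the treatment cost

sf_additive = self_financed_welfare(outcomes, costs, c_bar, additive)
ls_additive = lump_sum_welfare(outcomes, costs, c_bar, additive)
sf_gini = self_financed_welfare(outcomes, costs, c_bar, gini)
ls_gini = lump_sum_welfare(outcomes, costs, c_bar, gini)
```

Under the additive criterion the two scenarios give the same welfare, since only the average cost matters. Under the Gini criterion they differ: here self-financing by the higher-income recipient compresses the income distribution and yields higher welfare than the equal lump-sum charge.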

Capacity-Constrained Treatment
Another practical concern ruled out in the preceding sections is a capacity constraint limiting the proportion of population that can be assigned to treatment. Suppose that the proportion of the target population that could receive treatment 1 cannot exceed $K \in (0, 1)$. If $P_X$ is unknown, then policies that seem to satisfy the capacity constraint based on the sample estimates of $P_X(G)$ may not actually satisfy it in the population. The analysis of the welfare loss needs to take into account what happens if the proposed policy is infeasible. For tractability, we continue to restrict attention only to non-randomized treatment rules (the result in Theorem 2.1 need not hold with a capacity constraint). For the additive welfare case, Kitagawa and Tetenov (2018a) proposed a capacity-constrained EWM procedure assuming that if a proposed treatment rule $G$ violates the capacity constraint ($P_X(G) > K$) then the scarce treatment is randomly rationed to a fraction $K / P_X(G)$ of individuals with $X \in G$ independently of $(Y_0, Y_1, X)$. This random rationing approach can be straightforwardly extended to the EWM for the rank-dependent social welfare.
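With the capacity constraint and random rationing, individuals with $X \in G$ receive treatment with probability $\min\{1, K/P_X(G)\}$, so the cdf of outcomes generated by policy $G$ can be written as $F^K_G(y) \equiv \int_{\mathcal{X}} [F_{Y_0|X=x}(y) \cdot (1 - \min\{1, K/P_X(G)\} \cdot 1\{x \in G\}) + F_{Y_1|X=x}(y) \cdot \min\{1, K/P_X(G)\} \cdot 1\{x \in G\}]\,dP_X(x)$. Hence, the social welfare under the capacity constraint and random rationing is $W^K(G) \equiv \int_0^\infty \Lambda(F^K_G(y))\,dy$. Its sample analog can be constructed by replacing $\hat{F}_G(y)$ in (13) with an estimate of $F^K_G(y)$ that uses $P_{X,n}(G) \equiv \frac{1}{n} \sum_{i=1}^n 1\{X_i \in G\}$, a sample analog of $P_X(G)$, in place of $P_X(G)$.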
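As an illustration of random rationing, the sketch below computes rationing-adjusted weights and evaluates a weighted empirical cdf at a point. The function names are hypothetical, the propensity score is assumed known and constant, and the weighting scheme (scaling the treated weight inside $G$ by the rationing fraction and assigning the remainder to the untreated weight) is an assumption of this sketch rather than the article's exact estimator.

```python
import numpy as np

def rationed_weights(d, in_g, e, capacity):
    """Inverse-propensity weights when treatment inside G is rationed at random."""
    share = in_g.mean()                                    # P_{X,n}(G)
    rho = min(1.0, capacity / share) if share > 0 else 1.0
    p_treat = rho * in_g                                   # chance of actually being treated
    return d * p_treat / e + (1 - d) * (1 - p_treat) / (1 - e)

def cdf_hat(y0, y, w):
    """Weighted empirical cdf at y0: 1 - (1/n) * sum_i w_i * 1{y_i > y0}."""
    return 1.0 - float(np.mean(w * (y > y0)))

# Toy data (hypothetical): the rule G covers everyone, but capacity is K = 0.5.
y = np.array([10.0, 20.0, 30.0, 40.0])
d = np.array([1.0, 0.0, 1.0, 0.0])
in_g = np.ones(4)
w_rationed = rationed_weights(d, in_g, 0.5, capacity=0.5)
w_unconstrained = rationed_weights(d, in_g, 0.5, capacity=1.0)
print(w_rationed, cdf_hat(15.0, y, w_rationed))
print(w_unconstrained, cdf_hat(15.0, y, w_unconstrained))
```

With the binding constraint, half of the mass in $G$ is drawn from untreated outcomes, so treated and untreated observations receive equal weight in this example; without it, only treated observations are used.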
Proof. See the online supplement.

Empirical Illustration
To illustrate equality-minded empirical treatment choice, we apply our method to the experimental data from the National Job Training Partnership Act (JTPA) Study. A detailed description of the study and an assessment of average program effects for five large subgroups of the target population is found in Bloom et al. (1997). The study randomized whether applicants were eligible to receive a mix of training, job-search assistance, and other services for a period of 18 months. It collected background information about the applicants prior to random assignment, as well as administrative and survey data on applicants' earnings in the 30-month period following assignment. Our sample consists of 9223 observations with available data on years of education and preprogram earnings from the sample of adults (22 years and older) used in the original evaluation of the program and in subsequent studies (Bloom et al. 1997; Heckman, Ichimura, and Todd 1997; Abadie, Angrist, and Imbens 2002). The probability of being assigned to the treatment was two-thirds in this sample. For this illustration, total individual earnings in the 30-month period following program assignment serve as the measure of income. We use three social welfare functions from the extended Gini family (4) with parameters $k \in \{2, 3, 6\}$. The value $k = 2$ corresponds to the additive social welfare, which is not equality-minded. The value $k = 3$ corresponds to the standard Gini SWF with welfare weights $\omega_3(\tau) = 2(1 - \tau)$, and $k = 6$ corresponds to an extended Gini SWF with welfare weights $\omega_6(\tau) = 5(1 - \tau)^4$, which places even greater weight on low-ranked outcomes.
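The welfare weights quoted above are consistent with $\omega_k(\tau) = (k-1)(1-\tau)^{k-2}$, that is, with $\Lambda_k(\tau) = (1-\tau)^{k-1}$. Under that reading (inferred here from the three stated cases), the following sketch computes the representative income of a plain unweighted sample by applying the integrated weights to the order statistics. Function names and the toy incomes are hypothetical.

```python
import numpy as np

def omega(tau, k):
    """Extended Gini welfare weight on rank tau: (k - 1) * (1 - tau)**(k - 2)."""
    return (k - 1) * (1.0 - tau) ** (k - 2)

def rank_weights(n, k):
    """Integral of omega_k over each rank interval ((j-1)/n, j/n],
    equal to Lambda_k((j-1)/n) - Lambda_k(j/n) with Lambda_k(t) = (1 - t)**(k - 1)."""
    grid = np.arange(n + 1) / n
    lam = (1.0 - grid) ** (k - 1)
    return lam[:-1] - lam[1:]

def representative_income(incomes, k):
    """Rank-dependent welfare of an unweighted sample: weights applied to sorted incomes."""
    y_sorted = np.sort(np.asarray(incomes, dtype=float))
    return float(np.dot(rank_weights(len(y_sorted), k), y_sorted))

unequal = [10.0, 20.0, 30.0, 40.0]   # hypothetical incomes with mean 25
equal = [25.0, 25.0, 25.0, 25.0]
for k in (2, 3, 6):
    print(k, representative_income(unequal, k), representative_income(equal, k))
```

For the equal distribution the representative income is 25 for every $k$, while for the unequal distribution it falls from 25 (k = 2) to 18.75 (k = 3) and about 12.70 (k = 6), which shows how larger $k$ penalizes inequality more heavily.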
For simplicity, we consider only the distribution of earnings in the population sampled for the experiment in the social welfare function. This embodies concerns about inequality within the study population (JTPA-eligible economically disadvantaged adults). In practice, policy makers are more likely to be concerned with inequality in the overall population, which also includes individuals outside of the experiment's sampling frame. Then the social welfare function should be evaluated on the income distribution of the whole population of interest.
Pretreatment variables on which we consider conditioning treatment assignment are the individual's years of education and earnings in the year prior to assignment. We do not use race, sex, or age to condition treatment assignment. Though treatment effects may vary with these characteristics, using them to condition treatment assignment is often socially unacceptable and illegal. Education and earnings are verifiable characteristics, which is also important for conditioning treatment assignment. The performance of treatment rules that condition on unverifiable characteristics is hard to evaluate if individuals change their self-reported characteristics to obtain their desired treatment assignment (either in the experiment or when the policy is implemented).
Table 1 compares empirical estimates of social welfare measures (representative income) for a few alternative treatment rules. First, we consider simple treatment rules that either assign no one or everyone to treatment. Second, we consider empirically optimal rules from the class of quadrant treatment rules. This class of treatment eligibility rules is easily implementable and is often used in practice. To be assigned to treatment according to such rules, an individual's education and preprogram earnings both have to be above (or below) some specific thresholds. Third, we consider empirically optimal rules from the class of linear treatment rules.
The first column in Table 1 displays the estimated average income under each treatment rule. The second column shows the standard Gini social welfare, expressed in terms of the representative income of the policy (the income distribution generated by the policy is valued as much as an equal income distribution with the representative income). The third column shows the representative income under an extended Gini SWF with k = 6. The fourth column lists the proportion of the target population assigned to treatment by each policy.
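A brute-force search over quadrant rules can be sketched as follows. This is only an illustration under stated assumptions, not the computational procedure used for Table 1: the parametrization $1\{s_1(x_1 - t_1) > 0,\ s_2(x_2 - t_2) > 0\}$ with $s_1, s_2 \in \{-1, 1\}$ is one possible way to encode "above (or below) some thresholds", the welfare function is the extended Gini form $\Lambda_k(\tau) = (1-\tau)^{k-1}$ with unnormalized inverse-propensity weights, and all names and data are hypothetical.

```python
import itertools
import numpy as np

def welfare(y, w, k):
    """Integral of Lambda_k(F(y)) with 1 - F(y) = (1/n) * sum_i w_i * 1{y_i > y}."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    upper = np.cumsum(w[::-1])[::-1] / len(y)
    gaps = np.diff(np.concatenate(([0.0], y)))
    return float(np.sum(gaps * upper ** (k - 1)))

def quadrant(x1, x2, s1, t1, s2, t2):
    """Indicator of the quadrant; s = +1 means 'above threshold', s = -1 'below'."""
    return ((s1 * (x1 - t1) > 0) & (s2 * (x2 - t2) > 0)).astype(float)

def best_quadrant_rule(x1, x2, y, d, e, k):
    """Exhaustive search over signs and thresholds placed at observed covariate values."""
    def cuts(x):
        return np.concatenate(([x.min() - 1.0], np.unique(x), [x.max() + 1.0]))
    best_value, best_rule = -np.inf, None
    for s1, s2 in itertools.product((-1, 1), repeat=2):
        for t1, t2 in itertools.product(cuts(x1), cuts(x2)):
            in_g = quadrant(x1, x2, s1, t1, s2, t2)
            w = d * in_g / e + (1 - d) * (1 - in_g) / (1 - e)
            value = welfare(y, w, k)
            if value > best_value:
                best_value, best_rule = value, (s1, t1, s2, t2)
    return best_value, best_rule

# Toy data (hypothetical): treatment helps only those with zero prior earnings.
educ = np.array([10.0, 10.0, 14.0, 14.0, 10.0, 10.0, 14.0, 14.0])
earn = np.array([0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0])
d = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
y = np.array([20.0, 10.0, 20.0, 10.0, 25.0, 30.0, 25.0, 30.0])
value, rule = best_quadrant_rule(educ, earn, y, d, 0.5, k=3)
print(value, rule)
```

The search cost grows with the product of the numbers of distinct covariate values, so this exhaustive approach is practical only for small grids.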
Figure 1 compares the quadrant treatment rules maximizing the average income, the standard Gini SWF, and an extended Gini SWF (k = 6). Figure 2 compares the linear treatment rules maximizing the same three criteria. The size of black dots shows the number of individuals with different covariate values. Many individuals would be assigned to treatment by treatment rules maximizing any of the considered welfare functions, but there are also notable differences. Treatment rules maximizing the standard Gini SWF target a smaller proportion of the population, focusing on individuals with lowest preprogram earnings. Treatment rules maximizing the more equality-minded extended Gini SWF assign even fewer individuals to treatment. The estimated treatment rules change discontinuously with the Gini parameter k. For example, the linear treatment rule maximizing the additive SWF (k = 2) is also optimal for a range of other Gini parameters (k = 2.25, 2.5, 2.75), whereas the linear rule maximizing the extended Gini SWF (k = 6) is also optimal for larger parameters (k = 7, 8, 9).
Figure 3 explores the trade-off between treatment rules maximizing different social welfare functions. We compute the income distributions generated by $G_{Add}$, the quadrant treatment rule maximizing average income, and by $G_{Gini}$, the quadrant rule maximizing the Gini SWF. The left panel displays the difference between the quantile functions of the income distributions generated by these treatment rules at each quantile. The average-maximizing treatment rule $G_{Add}$ generates an income distribution in which top quantiles (0.8 and higher) are substantially higher than in the income distribution generated by the Gini treatment rule. However, the distribution produced by the Gini treatment rule is better at midrange quantiles (0.4-0.8). The additive welfare criterion equally weights changes of all quantiles, hence, it favors $G_{Add}$.
The standard Gini welfare criterion, in contrast, uses decreasing welfare weights $\omega_3(\tau) = 2(1 - \tau)$. The right panel of Figure 3 displays the same quantile differences between the two income distributions weighted by $\omega_3(\tau)$. With these equality-minded welfare weights, the gains offered by treatment rule $G_{Add}$ at top quantiles get a lower welfare weight than the gains offered by $G_{Gini}$ in the middle of the income distribution, hence, $G_{Gini}$ is preferred under the Gini SWF.
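The reversal of the ranking under the two criteria can be reproduced with made-up numbers. In the sketch below, the quantile differences are hypothetical (they are not the JTPA estimates behind Figure 3) and are assumed to be constant on ten equal rank intervals, with the sign convention "first rule minus second rule". The function name is also an assumption.

```python
import numpy as np

def welfare_difference(quantile_diff, k):
    """Approximate integral of omega_k(tau) * quantile_diff(tau) over [0, 1],
    evaluating omega_k(tau) = (k - 1) * (1 - tau)**(k - 2) at interval midpoints."""
    m = len(quantile_diff)
    tau = (np.arange(m) + 0.5) / m
    weights = (k - 1) * (1.0 - tau) ** (k - 2)
    return float(np.mean(weights * quantile_diff))

# Hypothetical differences: no change at the bottom, losses at ranks 0.4-0.8,
# larger gains at ranks 0.8-1.0.
diff = np.array([0, 0, 0, 0, -300, -300, -300, -300, 900, 900], dtype=float)
additive = welfare_difference(diff, k=2)   # equal weights on all quantiles
gini = welfare_difference(diff, k=3)       # weights 2 * (1 - tau)
print(additive, gini)
```

With these numbers the additive criterion favors the first rule (a difference of 60), while the standard Gini weights reverse the sign (a difference of -60), because the gains at the top ranks are discounted more heavily than the midrange losses.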

Conclusion
This article develops the first method for individualized treatment choice when the policy maker's objective is to maximize an equality-minded rank-dependent SWF. We showed that the average social welfare obtained by the estimated policy converges at the minimax-optimal $n^{-1/2}$ rate. The key restriction underlying these rate results is the complexity restriction (Assumption 2.2 (VC)) imposed on the set of feasible policies. This complexity restriction still allows for rich classes of individualized treatment rules and offers a flexible and convenient way to incorporate exogenous constraints that policy makers face in realistic settings of policy design. Our analytical results cover a general class of equality-minded rank-dependent SWFs, and the shown regret bounds are valid even when the complexity of policies grows with the sample size. Efficient computation for the equality-minded EWM and a data-driven choice of complexity for policies (e.g., as proposed by Mbakop and Tabord-Meehan (2018) for the additive welfare case) remain open questions.

Appendix A: Lemmas and Proofs
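Proof of Theorem 2.1. Denote an upper level set of $\delta(x)$ at level $u \in [0, 1]$ by $G(u) \equiv \{x \in \mathcal{X} : \delta(x) \ge u\}$. By noting that $\delta(x) = \int_0^1 1\{\delta(x) \ge u\}\,du$, we can rewrite $F_\delta(y)$ defined in (6) as $F_\delta(y) = \int_0^1 F_{G(u)}(y)\,du$, where $F_{G(u)}(y)$ is the distribution of outcomes induced by treatment rule $\delta_{G(u)} \equiv 1\{x \in G(u)\}$. By convexity of $\Lambda(\cdot)$, we obtain $\Lambda(F_\delta(y)) \le \int_0^1 \Lambda(F_{G(u)}(y))\,du$, and this leads to $W(F_\delta) \le \int_0^1 W(F_{G(u)})\,du \equiv \bar{W}$. Suppose that $\bar{W} - W(F_{G(u)}) > 0$ for all $u \in [0, 1]$. Then the integral of this function over the set $u \in [0, 1]$ of positive measure must also be strictly positive, which is a contradiction, since this integral equals zero by the definition of $\bar{W}$. Therefore, there exists $u^* \in [0, 1]$ for which $W(F_{G(u^*)}) \ge \bar{W}$, hence, $W(F_{G(u^*)}) \ge W(F_\delta)$. If all upper level sets $G(u)$ of $\delta$ belong to $\mathcal{G}$, then also $G(u^*) \in \mathcal{G}$.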
The following five lemmas will be used in the proof of Theorem 3.1. The first lemma establishes a quadratic upper bound for the function $t^{-1/2}$ for $t \ge 1$.
Then g(t) ≤ h(t) for t = 0 and for all t ≥ 1.

Proof of Lemma A.1.
Now consider the function (h − g)(t) and its derivatives for t ≥ 1: First, we will show that (h − g)(t) ≥ 0 for t ∈ [1, t 0 ]. The function is positive at t = 1: We will next show that (h − g)(t) ≥ 0 between t = 1 and t = t 0 .
The second derivative of $(h - g)$ is positive at $t = t_0$, because $t_0 > 1$ by assumption, hence, $t_0^{-1/2} < 1$. Since the third derivative is positive on $[1, t_0]$, it follows that the second derivative is either positive everywhere on $[1, t_0]$, or it is first negative on some interval $[1, t_2)$ and then positive on $(t_2, t_0]$. The first derivative of $(h - g)$ equals zero at $t = t_0$. If the second derivative is positive everywhere on $[1, t_0]$, then the first derivative must be negative everywhere on $[1, t_0)$. If the second derivative changes sign from negative to positive, then the first derivative must either be negative on $[1, t_0)$ or it could switch sign from positive on some interval $[1, t_1)$ to negative on $(t_1, t_0)$. In every case, $(h - g)$ is either decreasing on $[1, t_0]$ or first increasing and then decreasing, so its minimum over $[1, t_0]$ is attained at an endpoint. Since $(h - g)(1) > 0$ and $(h - g)(t_0) = 0$, it follows that $(h - g)(t) \ge 0$ on $[1, t_0]$.
Second, consider $t > t_0$. At $t = t_0$, $(h - g)(t_0) = 0$, $(h - g)'(t_0) = 0$, and the second derivative is positive for all $t > t_0$ because it is positive at $t = t_0$ (A.4) and the third derivative is positive for all $t \ge t_0$. It follows that $(h - g)(t) > 0$ for all $t > t_0$.
The second lemma applies the bound in Lemma A.1 to the expectation of the function g(·) of a binomial variable.
Lemma A.2. Suppose that random variable $B \sim \text{Binomial}(n, p)$ with $np > 1$ and $g(\cdot)$ is the function defined in (A.2). Then $E[g(B)] \le 2(np)^{-1/2}$, where the last step of the argument holds because $np > 1$ implies $p > 0$ and $(np)^{-1} < (np)^{-1/2}$.
Let $x^l \equiv \{x_1, \ldots, x_l\}$ be a finite set with $l \ge 1$ points in $\mathcal{X}$. Given a class of subsets $\mathcal{G}$ in $\mathcal{X}$, define $N(x^l) = |\{x^l \cap G : G \in \mathcal{G}\}|$ as the number of different subsets of $x^l$ picked out by $G \in \mathcal{G}$. The VC-dimension $v \ge 1$ of $\mathcal{G}$ is defined by the largest $l$ such that $\sup_{x^l} N(x^l) = 2^l$ holds (Vapnik 1998). See Vapnik (1998), Dudley (1999, chap. 4), and van der Vaart and Wellner (1996) for extensive discussions. Note that the VC-dimension is smaller by one compared to the VC-index used to measure the complexity of a class of sets in empirical process theory, for example, van der Vaart and Wellner (1996).
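As a numerical illustration of the binomial bound in Lemma A.2, in the form in which it is applied later in the proof of Lemma A.5, $E[g(N_y)] \le 2/(\sqrt{n}\sqrt{P(Y > y)})$, the sketch below computes the expectation exactly from the binomial probability mass function and compares it with the bound. It uses the definition $g(0) = 0$ and $g(t) = t^{-1/2}$ for $t \ge 1$ quoted in that proof. The function names and the chosen $(n, p)$ pairs are hypothetical.

```python
from math import comb, sqrt

def g(t):
    """g(0) = 0 and g(t) = t**(-1/2) for integer t >= 1."""
    return 0.0 if t == 0 else t ** -0.5

def expected_g(n, p):
    """Exact E[g(B)] for B ~ Binomial(n, p), summing over the probability mass function."""
    return sum(comb(n, b) * p**b * (1 - p) ** (n - b) * g(b) for b in range(n + 1))

def lemma_bound(n, p):
    """Upper bound 2 / sqrt(n * p), stated for n * p > 1."""
    return 2.0 / sqrt(n * p)

for n, p in [(10, 0.2), (50, 0.1), (100, 0.5), (200, 0.01)]:
    print(n, p, expected_g(n, p), lemma_bound(n, p))
```

Such a check cannot replace the proof, but it gives a sense of how much slack the factor 2 leaves for moderate values of $np$.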
The third lemma is reproduced from Kitagawa and Tetenov (2018b, Lemma A.1). It establishes a link between the VC-dimension of a class of subsets in the covariate space X and the VC-dimension of a class of subgraphs of functions on Z.
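Lemma A.3. Let $\mathcal{G}$ be a VC-class of subsets of $\mathcal{X}$ with VC-dimension $v < \infty$. Let $g$ and $h$ be two given functions from $\mathcal{Z}$ to $\mathbb{R}$. Then the set of functions from $\mathcal{Z}$ to $\mathbb{R}$, $\mathcal{F} = \{f_G(z) = g(z) \cdot 1\{x \in G\} + h(z) \cdot 1\{x \notin G\} : G \in \mathcal{G}\}$, is a VC-subgraph class of functions with VC-dimension less than or equal to $v$.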
The fourth lemma, reproduced from Kitagawa and Tetenov (2018b, Lemma A.4), is a maximal inequality that bounds the mean of a supremum of a centered empirical process indexed by a VC-subgraph class of functions.
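Lemma A.4. Let $\mathcal{F}$ be a class of uniformly bounded functions, that is, there exists $\bar{F} < \infty$ such that $\|f\|_\infty \le \bar{F}$ for all $f \in \mathcal{F}$. Assume that $\mathcal{F}$ is a VC-subgraph class with VC-dimension $v < \infty$. Then, there is a universal constant $C_1$ such that $E_{P^n}\left[\sup_{f \in \mathcal{F}} |E_n(f) - E_P(f)|\right] \le C_1 \bar{F} \sqrt{v/n}$ holds for all $n \ge 1$.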
The last novel lemma allows us to prove Theorem 3.1 for unbounded outcomes.
Lemma A.5. Let F be a class of uniformly bounded functions, that is, holds for all n ≥ 1.
Proof of Lemma A.5. We start by deriving upper bounds for each value of y, y > 0, on First, consider values of y for which nP(Y > y) ≤ 1. Due to the envelope condition, 1 Also, It follows that where the last inequality holds because nP(Y > y) ≤ 1.
Second, we consider values of $y$ for which $nP(Y > y) > 1$. Define random variables $N_y \equiv \sum_{i=1}^n 1\{Y_i > y\}$ for the number of observations in the data with $Y_i > y$. If $N_y \ge 1$, note that $\{Z_i\}_{i: Y_i > y}$ is an iid sample of size $N_y$ from the conditional distribution $P(Z|Y > y)$. We next apply the bound in Lemma A.4 for each value of $N_y \ge 1$. Combining inequality (A.8) with bound (A.9), and using the definition $g(t) = t^{-1/2}$ for $t \ge 1$ from (A.2), we obtain a bound on the conditional expectation of $\xi_n(y)$ for $N_y \ge 1$: $E_{P^n}[\xi_n(y) \mid N_y] \le \bar{F} \left| \frac{N_y}{n} - P(Y > y) \right| + P(Y > y) C_1 \bar{F} \sqrt{v}\, g(N_y)$. (A.10) For $N_y = 0$, the corresponding calculation uses the definition $g(0) = 0$ from (A.2) in its last equality. Therefore, the conditional expectation bound (A.10) also holds for $N_y = 0$.
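The unconditional expectation of $\xi_n(y)$ is then bounded by $E_{P^n}[\xi_n(y)] \le \bar{F} E_{P^n}\left| \frac{N_y}{n} - P(Y > y) \right| + P(Y > y) C_1 \bar{F} \sqrt{v}\, E_{P^n}[g(N_y)]$. (A.11)
The random variable $N_y$ has a Binomial$(n, P(Y > y))$ distribution, hence, $E_{P^n}\left| \frac{N_y}{n} - P(Y > y) \right| \le \sqrt{\mathrm{var}\left(\frac{N_y}{n}\right)} = \sqrt{\frac{P(Y > y)(1 - P(Y > y))}{n}} \le \sqrt{\frac{P(Y > y)}{n}}$.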
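Since $nP(Y > y) > 1$, applying Lemma A.2 yields $E_{P^n}[g(N_y)] \le \frac{2}{\sqrt{n}\sqrt{P(Y > y)}}$. Combining this inequality with (A.11) and $v \ge 1$ we obtain $E_{P^n}[\xi_n(y)] \le \bar{F} \sqrt{\frac{P(Y > y)}{n}} \left(1 + 2 C_1 \sqrt{v}\right) \le C_T \bar{F} \sqrt{\frac{v}{n}} \sqrt{P(Y > y)}$, (A.12) where $C_T \equiv 2(1 + C_1)$. This bound is higher than the bound (A.7) derived for $y$ such that $nP(Y > y) \le 1$, hence, bound (A.12) holds for all $y \ge 0$. The last step is to integrate the bound (A.12) over $y$ and apply (A.5) to bound $\int_0^\infty E_{P^n}[\xi_n(y)]\,dy$.
By Lemma A.3, the class of functions $\mathcal{W}$ is a VC-subgraph class with VC-dimension of at most $v$.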
Assumption 3.1 (SO) implies that $w_G(Z_i) \in [0, 1/\kappa]$, hence, functions in $\mathcal{W}$ are uniformly bounded by $1/\kappa$. Since $F_G(y) = 1 - E_P[w_G(Z) \cdot 1\{Y > y\}]$ and $\hat{F}_G(y)$ It follows from Assumption 3.1 (TC) and (A.17) Applying Lemma A.5 to (A.15) yields Setting $C = 4C_T$ completes the proof of (17). The proof of (18) is found in the online supplement.

Supplementary Materials
Supplementary Materials, available on the journal's website, include the dataset and replication code for the empirical application, an online appendix containing an illustrative example of the properties of rank-dependent SWFs, proofs of Theorems 3.1, 3.2, and 4.1, and an extension of the method with estimated propensity score.