Multiply robust estimation for average treatment effect among treated

We propose a multiply robust estimator for the Average Treatment Effect Among the Treated (ATT). The proposed estimation procedure can simultaneously accommodate multiple working models for both the propensity score and the conditional mean of the counterfactual outcome given covariates. In addition, it can explicitly balance a set of user-specified moments of the covariate distributions between the treatment groups. The resulting estimator is consistent if any working model is correctly specified. With the data generating process typically unknown for observational studies, the proposed method provides substantial robustness against possible model misspecifications compared to existing estimators of the ATT. Simulation results show the excellent finite sample performance of the proposed estimator.


Introduction
In causal inference, since we only observe what happens to an individual under the treatment condition they actually receive, it is generally impossible to estimate the causal effects for individuals.The causal effects one typically considers involve summary statistics of the individual effects across populations or sub-populations of interest.Two widely considered causal summaries are the average treatment effect (ATE) and the average treatment effect among the treated (ATT).The ATE of a treatment relative to a control is the comparison of the mean outcome which had the entire population been treated versus had the entire population been the control.The ATT is the comparison of the mean outcome under treatment among those who are treated with the mean outcome which had the treated subjects received control instead.A study can estimate both ATE and ATT, but one or the other may be better suited for a particular situation.The ATE may be of more interest if each treatment can potentially be offered to every member of the population.Conversely, if the research question focuses on the effectiveness of an alternative treatment were it to replace the standard treatment, and then the ATT may be of more interest because it measures the relative effectiveness of the two treatment options on the population that is receiving the standard treatment.
There has been a large literature on estimating the ATE and the ATT for observational studies, where a typical consideration is confounding in the sense that individual characteristics are related to both the treatment assignment and the outcome of interest.Propensity score (PS) based methods, where the PS is the probability of receiving a treatment given covariates (Rosenbaum & Rubin, 1983), are commonly used to deal with confounding and to achieve balance of covariate distributions between different treatment groups.Such methods include PS matching (e.g., Abadie & Imbens, 2006;Rosenbaum & Rubin, 1985) and weighting (e.g., Hirano et al., 2003).Refer to Imbens and Rubin (2015) and Hernán and Robins (2018) for more details.Parametric modelling of the PS is common, especially when the dimension of covariates is moderate to large.Misspecification of the PS model is usually a major concern as it may lead to substantial estimation bias.
To mitigate the impact of misspecification of the PS model, substantial interests have been given to doubly robust estimators, which involve models for both the PS and the conditional mean of the counterfactual outcome, or outcome regression (OR), and remain consistent if either model is correctly specified.The original doubly robust estimator was constructed through augmented inverse probability weighting (AIPW) in the missing data context (Robins et al., 1994).Since then, a large number of doubly robust estimators have been proposed in both missing data and causal inference settings (e.g., Bang & Robins, 2005;Cao et al., 2009;Han, 2012;Kang & Schafer, 2007;Qin et al., 2008;Qin & Zhang, 2007;Rotnitzky et al., 2012;Tan, 2010;van der Laan & Gruber, 2010).When both the PS model and the OR model are wrong, these estimators are in general no longer consistent.
As an improvement over double robustness, in the missing data context, Han and Wang (2013) proposed a multiply robust estimation procedure that can simultaneously accommodate multiple working models for both PS and OR, and the resulting estimators are consistent if any of these models is correctly specified.Since the data generating process for observational studies is typically unknown, it is common that several candidate models all seem reasonable yet none rules out the possibility of others, especially when the dimension of covariates is moderate to large.In such a case, the multiply robust method in Han and Wang (2013) provides a useful tool for data analysis with more protection on estimation consistency.Such a method has attracted considerable interest in both missing data and causal inference research (e.g., Chan & Yam, 2014;Chan et al., 2016;S. Chen & Haziza, 2017;Duan & Yin, 2017;Han, 2014aHan, , 2014bHan, , 2016aHan, , 2016b;;Han et al., 2019;Li et al., 2020).Especially, Wang (2019) extended the multiply robust method to the estimation of the ATE.It is worth pointing out that the term " multiply robust" has also been used by other authors in different settings with different meanings.For example, in Molina et al. (2017), it refers to estimation consistency when various combinations of the components of a factorized likelihood are correctly modelled, while in Wang and Tchetgen Tchetgen (2018) and Shi et al. (2020) it refers to estimation consistency being achieved across the union of three different observed data models.We use " multiply robust" to refer to the property that estimation consistency is achieved if one of the multiple working models for the same quantity is correctly specified.
Despite the success in dealing with missing data problems and in estimating the ATE, multiply robust estimators have not been developed for estimating the ATT.In this paper, we construct such a desirable estimator for ATT.In addition to being multiply robust, the proposed estimator can easily achieve certain level of balancing of covariate distributions between treatment groups.Covariate balancing is a highly desired property for causal effect estimation.For the estimation of ATT, Hainmueller (2012) proposed the entropy balancing (EB) method by imposing a set of balance constraints so that certain moments of covariate distributions for different treatment groups match exactly.Zhao and Percival (2017) found that EB implicitly fits a logistic linear regression model for the PS and a linear regression model for the OR, and when either model is correctly specified the EB estimator is consistent.The estimator we propose preserves the same balance of covariate distributions as EB, and in addition it accounts for multiple models for PS and OR simultaneously so that consistency of the resulting estimator is guaranteed if any one model is correctly specified.
The rest of the paper is organized as follows.Section 2 gives the setup and a brief review of some existing methods.Section 3 presents our proposed multiply robust estimator for the ATT.Simulation studies are provided in Section 4, followed by some discussion in Section 5.

Setup and some existing methods
Let A denote the treatment indicator, where A = 1 for the treatment of interest and A = 0 for control.Let Y 1 and Y 0 denote the counterfactual outcomes under treatment and under control, respectively.Let Y denote the observed outcome in the data, for which we make the consistency assumption that Y = AY 1 + (1 − A)Y 0 .Let X denote the pre-treatment covariates in the dataset, including potential confounders.The potential full data is (A, Y 1 , Y 0 , X) whereas the observed data is (A, Y, X).The causal effect we are interested in is the ATT τ = E(Y 1 − Y 0 | A = 1), which is also τ 1,1 − τ 1,0 , where τ a,b is the mean outcome for subjects who receive treatment a had they instead received treatment b, i.e., To ensure the identifiability of τ from the observed data, we make some assumptions following the main literature (e.g., Rosenbaum & Rubin, 1983).First, we assume that all potential confounders are included in X, or in other words, the treatment assignment process and the counterfactual outcome are independent given all the covariates measured.Second, we assume that every subject has a positive probability of being assigned to either the treatment or the control group.These two assumptions are formalized as follows.
Let π(X) = P(A = 1 | X) denote the PS for treatment assignment.In the following we give a brief review of some widely used estimators of the ATT.Note that τ 1,1 = E(Y 1 | A = 1) can be consistently estimated by the sample average τ1,1 = ( n i=1 A i Y i )/( n i=1 A i ) over the treatment group.Therefore, the estimation of the ATT reduces to the estimation of τ 1,0 = E(Y 0 | A = 1), where the counterfactual outcome Y 0 is not observable for individuals in the A = 1 group.
One straightforward way to estimate τ 1,0 is to impute Y 0 for those individuals in the A = 1 group.Suppose from Assumption 1 and Y 0 is fully observed in the A = 0 group, γ can be estimated by γ based on individuals in the control group alone.Also from Assumption 1, we have E(Y 0 | X) = E(Y 0 | X, A = 1), and thus the unobserved Y 0 in the treatment group can be imputed by d 0 ( γ ).Therefore, an outcome regression estimator of τ 1,0 is τ1,0reg = A widely used alternative method is inverse probability weighting (IPW).From Assumption 1, since where f (•) is a generic notation for a probability density function, we have Therefore, an estimator of τ 1,0 = E(Y 0 | A = 1) can be constructed by properly weighting the observed Y 0 in the A = 0 group, and this leads to an IPW estimator where π(X) is the estimated value of π(X).Consistency of τ1,0reg and τ1,0ipw requires correct modelling of E(Y 0 | X) and π(X), respectively.To improve the robustness against possible model misspecifications, the AIPW method combines the models for E(Y 0 | X) and π(X) so that estimation consistency is guaranteed if either model is correctly specified but not necessarily both.The AIPW estimator for τ 1,0 is given as To achieve certain covariate balancing, the EB method (Hainmueller, 2012) considers matching covariate moments between the treatment and the control groups.Specifically, let w i be a set of positive weights assigned to the control subjects {i : A i = 0}.The EB method imposes covariate balancing constraints on w i , where h(X) contains some user-specified moments of X.These constraints ensure that certain moments of X exactly match between the two groups.The EB estimator of τ 1,0 is τ1,0eb = {i:A i =0} ŵeb,i Y i where ŵeb,i minimizes {i:A i =0} w i log w i subject to the constraints in (3).Although no parametric models are explicitly fitted by the EB method, Zhao and Percival (2017) showed that, implicitly, the EB method fits linear regression models for logit{π(X)} and E(Y 0 | X) with components of h(X) as regressors.The EB estimator τ1,0eb is consistent if either model is correctly specified, and thus is doubly robust.

The proposed multiply robust estimator for ATT
Our goal is to construct an easy-to-implement estimator of the ATT that is multiply robust and achieves explicit covariate balancing as the EB estimator.Let P = {π (j) (α (j) ) : j = 1, . . ., J} denote a set of multiple parametric models for π(X) and D 0 = {d (k) 0 (γ (k) ) : k = 1, . . ., K} a set of multiple parametric models for E(Y 0 | X), where α (j) and γ (k) are the corresponding parameters.α (j) is typically estimated by α(j) that maximizes the binomial likelihood and γ (k) is typically estimated by γ (k) based on individuals in the control group because from Assumption 1 and Y 0 is fully observed in the control group.
Note that here ĝ( α, γ ) can also contain components based on taking b(X) to be h(X), a vector of user-specified moments of X.This can be used to achieve certain degree of covariate balancing between the treatment and control groups, similar to the EB method.Subject to the constraints in (8), we consider the weights ŵmr,i on the subjects in the control group {i : A i = 0} that maximize the empirical likelihood {i:A i =0} w i and propose to estimate τ 1,0 by τ1,0mr = {i:A i =0} ŵmr,i Y i .
Following the standard empirical likelihood technique (e.g., Qin & Lawless, 1994), we have where ρ solves For implementation, directly solving (9) for ρ is not the ideal way because, as pointed out in Han (2014a), ( 9) typically has multiple roots but only one of them makes the ŵmr,i positive.We recommend calculating ρ by minimizing F(ρ) ≡ − {i:A i =0} log{1 + ρ ĝi ( α, γ )}, which is the negative antiderivative of the left-hand side of (9).As shown in Han (2014a), this is a convex minimization that always has a unique minimizer when 0 is inside the convex hull of {ĝ i ( α, γ ) : A i = 0}, which indeed holds at least when n is large because of the moment equality ( 7) and ( 8) being an empirical version of ( 7).The unique minimizer of F(ρ) is ρ needed for calculating ŵmr,i and can be easily found by a Newton-Raphson-type algorithm.Refer to J. Chen et al. (2002) and Han (2014a) for such an algorithm.
In the following we provide arguments and explanations for the multiple robustness of τ1,0mr .We do not include detailed technical conditions and mathematical proofs for these arguments so that the main ideas are easier to follow.To see the consistency of τ1,0mr when P contains a correctly specified model for π(X), say π (1) (α (1) ) without loss of generality, let and define λ in such a way that ρ1 = ( λ1 + 1)/( θ − 1) and ρ−1 = λ−1 /( θ − 1), where ρ1 and λ1 are the first component of ρ and λ, respectively, and ρ−1 and λ−1 are vectors of the rest components.Then some simple algebra shows that , for i satisfying A i = 0 and (9) becomes an equation for λ From the Z-estimator theory (e.g., van der Vaart, 1998) and because of the moment equality (7), we must have λ = O p (n −1/2 ), which leads to In other words, when P contains a correctly specified model for π(X), this correct model is implicitly accounted for by the empirical likelihood procedure to make ŵmr,i the form of the IPW weight as used in (2), and this leads to the consistency of τ1,0mr : where the second last equality follows from that can be shown by taking b(X) = {1 − π(X)}/π(X) in ( 5), and the last equality follows from (1).
In summary, we have the following result on the multiple robustness property for the proposed estimator of the ATT.
Proposition: If P contains a correctly specified model for π(X) or D 0 contains a correctly specified model for E{Y 0 |X}, The proposed estimator τ1,0mr and the EB estimator τ1,0eb are both weighted averages of the observed outcomes for the control group subjects, and both are calibration-type estimators originally considered in survey sampling (Deville & Särndal, 1992) that were later extended to missing data analysis and causal inference (e.g., Chan & Yam, 2014;S. Chen & Haziza, 2017;Han & Wang, 2013;Kim, 2010;Kim & Park, 2010;Qin & Zhang, 2007;Tan, 2010;Wu & Sitter, 2001;Zhang et al., 2022).Some of the constraints in (8) based on the h(X) component in ĝ( α, γ ) are actually the constraints (3) used by the EB method, and thus our proposed method achieves the same degree of covariate balancing between the treatment and the control groups.The other constraints in (8) based on parametric models are for multiple robustness purpose.
Note that the constraints in ( 8) are for estimating ATT and are determined by ( 7), which balance the control group with the treatment group.This is different from estimating ATE where the constraints balance the control group with the full sample or the treatment group with the full sample (e.g., Fan et al., 2023).Adding the latter type of constraints into (8) would not work, since the weight matching control to treatment is different from the weight matching control to the full sample.
Standard error of the proposed estimator is needed to make inference, such as constructing confidence intervals.Due to the presence of multiple models and the lack of knowledge of which one is correct, deriving the asymptotic distribution of the proposed estimator as an approximation to the finite sample distributions is challenging.Therefore, we recommend using bootstrap to calculate the standard error.The excellent performance of the bootstrap method for multiply robust estimators in missing data context has been demonstrated through comprehensive simulation studies (e.g., Han, 2014aHan, , 2014b)), and its effectiveness for the proposed estimator will be shown in the next section.

Simulation studies
In this section we conduct simulation studies to evaluate the finite sample performance of the proposed multiply robust estimator of the ATT.The simulation setting mimics that in Han and Wang (2013).The data are generated in the following way: Comparison of different estimators based on 1000 replications and n = 300.Each digit of the four-digit number inside an estimator's name, from left to right, indicates if π (1) (α (1) ), π (2) (α (2) ), d (1) 0 (γ (1) ) and d (2) 0 (γ (2) ) are used, respectively.The results have been multiplied by 100.
logit{π(X)} and d 0 (X) with regressors X and X 2 , and this makes the estimator EB-2 consistent under the first three data generating processes and inconsistent under the last one where π(X) = 1 − exp[− exp{0.5 + 0.5X − 0.3 exp(X)}] and {d 0 (X), d 1 (X)} = {1 + 2X + 3 exp(X), 2 + 3X + 3 exp(X)}.This theoretical conclusion is well confirmed by inspecting the bias of EB-2.The estimator MR-2 with exponential tilting replaced by empirical likelihood is inconsistent under all four data generating processes.Table 3 contains a summary of the performance of the bootstrap method for standard error calculation.Due to similarity of the performance under different settings, we only include the results for π(X) = {1 + exp(0.8+ 0.5X − 0.3X 2 )} −1 and {d 0 (X), d 1 (X)} = (1 + 2X + 3X 2 , 2 + 3X + 3X 2 ) with n = 300.The results are summarized based on 1000 replications with the bootstrap resampling size 200.It is seen that the average of the standard errors based on boostrap is very close to the empirical standard error.In addition, for the estimators that are consistent under this data generating process, i.e., MR-1000, MR-0010, MR-1100, MR-1010, MR-1001, MR-0110, MR-0011, MR-1110, MR-1101, MR-1011, MR-0111 and MR-1111, the coverage percentage of the 95% confidence intervals constructed based on bootstrap standard errors is close to the nominal level.Both observations show that the bootstrap method is reliable to calculate the standard errors for the proposed estimators.

Discussion
In this paper we have proposed a multiply robust estimator for the causal effect ATT.The estimator provides more protection on estimation consistency compared to existing doubly robust estimators.It is worth pointing out that, although the proposed method can simultaneously account for multiple working models, there does not have to be multiple models to apply this method.The construction of constraints is very flexible and the method can be applied when only a single working model is available.Therefore, the proposed method provides a unified framework as an alternative to various existing methods, including the IPW and AIPW methods.
A major difference between the proposed method and the EB method is the objective function being optimized.The EB method minimizes the Shannon entropy {i:A i =0} w i log w i , or equivalently the exponential tilting function Table 3. Performance of the bootstrap method in calculating the standard errors for the proposed estimators in the setting of π(X) = {1 + exp(0.8+ 0.5X − 0.3X 2 )} −1 and {d 0 (X), d 1 (X)} = (1 + 2X + 3X 2 , 2 + 3X + 3X 2 ).Results are summarized based on 1000 replications with n = 300 and the bootstrap resampling size is 200.Each digit of the four-digit number inside an estimator's name, from left to right, indicates if π (1) (α (1) ), π (2) (α (2) ), d (1) 0 (γ (1) ) and d (2) 0 (γ (2) ) are used, respectively.in the empirical likelihood literature (e.g., Newey & Smith, 2004), whereas we proposed to maximize the empirical likelihood {i:A i =0} w i .The advantage of considering an empirical likelihood is manifold.First, it improves the robustness of estimation consistency by using constraints based on parametric working models.Although the same constraints can be used by the EB method as well, a correct model does not make the EB estimator consistent, and consistency of the EB estimator is achieved only when logit{π(X)} or E(Y 0 | X) is a linear model with regressors h(X).Second, by dividing the constraints into two parts, one based on parametric working models and one based on moments of X, the proposed method is more flexible when the dimension of X is large.With π(X) and/or E(Y 0 | X) depending on a high dimensional X, consistency of the EB estimator requires a large number of constraints to include all the regressors for logit{π(X)} and/or E(Y 0 | X), and this may jeopardize the numerical performance in practice.The proposed method, on the contrary, has a separate model building step where complex working models can be built and only the ones that are deemed to be close to the truth are used in the constraints.In this case, the parametric working models help to achieve consistency, and thus the other part of constraints based on moments of X can be chosen exclusively for the goal of achieving covariate balancing.This may considerably reduce the number of constraints compared to the EB method and thus improve the numerical performance.Third, it is well known in the empirical likelihood literature (e.g., Newey & Smith, 2004) that the empirical likelihood has smaller higher-order bias, which translates to a better numerical performance under a finite sample size, compared to exponential tilting and other alternatives.As an empirical version of the moment equality (7), the constraints in (8) are legitimate, at least when the sample size is large.Therefore, the 'model misspecification' problem that typically exists in the empirical likelihood literature under which the performance of empirical likelihood may be worse than that of the exponential tilting (Schennach, 2007) is not of a concern here.
: empirical standard error.se-bp: averaged bootstrap standard error.cp-95%: coverage percentage of the 95% confidence interval constructed based on the bootstrap standard error.MR: multiply robust.