Convolutional neural networks for valid and efficient causal inference

Convolutional neural networks (CNN) have been successful in machine learning applications. Their success relies on their ability to capture space-invariant local features. We consider the use of CNN to fit nuisance models in the semiparametric estimation of the average causal effect of a treatment. In this setting, nuisance models are functions of pre-treatment covariates that need to be controlled for. In an application where we want to estimate the effect of early retirement on a health outcome, we propose to use CNN to control for time-structured covariates. Thus, CNN are used when fitting nuisance models explaining the treatment and the outcome. These fits are then combined into an augmented inverse probability weighting estimator, yielding efficient and uniformly valid inference. Theoretically, we contribute by providing rates of convergence for CNN equipped with the rectified linear unit activation function, and compare them to an existing result for feedforward neural networks. We also show when those rates guarantee uniformly valid inference. A Monte Carlo study is provided where the performance of the proposed estimator is evaluated and compared with other strategies. Finally, we give results on a study of the effect of early retirement on hospitalization using data covering the whole Swedish population.


Introduction
Convolutional Neural Networks (CNN) have been found successful in discovering location-invariant patterns in speech, images, and time series data (LeCun et al., 1995). In particular, they have been shown to have a universal approximation property and to be more efficient, in terms of number of hidden layers, than fully-connected multi-layer networks in high-dimensional situations (Zhou, 2020). In this paper we show how CNN can be useful in controlling for confounding information when using rich observational databases in order to perform semiparametric inference on a low dimensional causal parameter: we focus on average causal effects of a binary treatment on an outcome of interest, although our results are relevant for the semiparametric estimation of other low dimensional parameters of interest (see, e.g., Chernozhukov et al., 2018).
Augmented Inverse Probability Weighting (AIPW) estimators (also called Double Robust (DR) estimators, Robins et al., 1994) attain the semiparametric efficiency bound and yield uniformly valid inference as long as the nuisance functions of the confounding covariates are fitted consistently with fast enough convergence rates; e.g., all nuisance functions are estimated at rate n^{-1/4} (Belloni et al., 2014; Farrell, 2015; Kennedy, 2016; Moosavi et al., 2021). In this paper, we contribute to this theory by showing that CNN fits of nuisance functions achieve the n^{-1/4} convergence rate required to obtain uniformly valid inference on causal parameters. To show this we use a result obtained by Farrell et al. (2021) for general Rectified Linear Unit (ReLU)-based Feed-forward Neural Networks (FNN). They show that, for large samples, the estimation error of FNN is bounded, with a probability increasing exponentially in γ, by a term of the form

(complexity penalty) × log n / n + (log log n + γ)/n + approximation rate.   (1)

We deduce the above approximation rate for CNN architectures inspired by earlier work by Zhou (2020). However, in contrast to the latter paper, we consider a larger number of free parameters by considering multi-channel convolutional neural networks, so as to achieve a trade-off between the complexity penalty and the approximation rate in (1), and thereby obtain the convergence rate n^{-1/4} for the CNN fits of the nuisance functions.

In the next section we formally define the causal parameters of interest using the potential outcome framework (Rubin, 1974), and introduce the assumptions yielding identification, and locally efficient and uniformly valid inference, when using AIPW estimators. We also introduce the convolutional network architectures with which we propose to fit the nuisance functions used by AIPW, followed by our main theoretical results, including conditions to obtain uniformly valid inference when using CNN-based AIPW estimation. Section 3 presents numerical experiments illustrating the finite sample behaviour of this estimation strategy under different data generating mechanisms. The proposed estimator is compared to AIPW, Targeted Maximum Likelihood Estimation (TMLE) (van der Laan and Rose, 2011) and Outcome Regression (OR) estimation (Tan, 2007; Moosavi et al., 2021) using fully-connected feed-forward ReLU-based neural networks (Multilayer Perceptron (MLP) in Farrell et al., 2021) and lasso to fit the nuisance functions. In Section 4, we study the effect of early retirement (at 62 years old), compared to retiring later in life, on morbidity and mortality outcomes (Barban et al., 2020). We use population-wide Swedish register data and follow cohorts born in 1946 and 1947, for which we have a rich reservoir of potential pre-treatment confounders, including hospitalization and income histories. CNN allow us to consider that such life histories may contain location-invariant patterns that confound the causal effects of the treatment (the decision to retire early).
2 Theory and method

Causal parameters and uniformly valid inference
The Average Causal Effect (ACE) and Average Causal Effect among the Treated (ACET) of a binary treatment T are parameters defined using potential outcomes (Rubin, 1974), respectively:

τ := E[Y(1) − Y(0)]   and   τ_t := E[Y(1) − Y(0) | T = 1],

where Y(1) and Y(0) are the outcomes that would be observed if T = 1 and T = 0, respectively. For a given individual, only one of these potential outcomes can be observed. This intrinsic missing data problem implies that assumptions need to be made to identify τ and τ_t. For this purpose, and given a vector of observed pre-treatment covariates X, we assume: Assumption 1.
a. No unmeasured confounders: (i) Y(0) ⊥ T | X; (ii) Y(1) ⊥ T | X.
b. Overlap: (i) P(T = 1 | X) < 1; (ii) P(T = 1 | X) > 0.
c. Consistency (SUTVA): Y = T Y(1) + (1 − T) Y(0), and there is no interference between individuals.

Note that the ACET is identified if only 1a(i), 1b(i) and 1c hold. Assumption 1a requires that the observed vector X includes all confounders. Assumption 1b requires that, for any value of X, both treatment levels have non-zero probability of occurring, and by Assumption 1c one of the potential outcomes is observed for each individual, and its value is not affected by the treatment received by other individuals in the sample.
We aim at uniformly (over the class of Data Generating Processes (DGPs) for which Assumption 2 below holds) valid inference while using semiparametric estimation (Moosavi et al., 2021). Therefore, the following AIPW estimator of the ACE is considered (Robins et al., 1994; Scharfstein et al., 1999):

τ̂ = (1/n) Σ_{i=1}^n [ μ̂_1(x_i) − μ̂_0(x_i) + t_i (y_i − μ̂_1(x_i)) / p̂(x_i) − (1 − t_i)(y_i − μ̂_0(x_i)) / (1 − p̂(x_i)) ],

where p̂(x) is an estimator of the propensity score p(x) := P[T = 1 | X = x] and μ̂_t(x) is an estimator of μ_t(x) := E[Y | T = t, X = x]. For the ACET, we use the estimator

τ̂_t = (1/n_1) Σ_{i=1}^n [ t_i (y_i − μ̂_0(x_i)) − (1 − t_i) p̂(x_i)(y_i − μ̂_0(x_i)) / (1 − p̂(x_i)) ],

where n_1 = Σ_{i=1}^n t_i. We use the following assumptions on the DGPs.
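To fix ideas, the AIPW estimators can be sketched in a few lines of numpy. This is an illustrative implementation only (the function names are ours, not from any package referenced here); the nuisance values p̂(x_i), μ̂_0(x_i), μ̂_1(x_i) are taken as inputs, fitted by whichever method is preferred.

```python
import numpy as np

def aipw_ace(y, t, p_hat, mu1_hat, mu0_hat):
    """AIPW (double robust) estimate of the ACE plus an
    influence-function-based standard error."""
    psi = (mu1_hat - mu0_hat
           + t * (y - mu1_hat) / p_hat
           - (1 - t) * (y - mu0_hat) / (1 - p_hat))
    tau = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(y))
    return tau, se

def aipw_acet(y, t, p_hat, mu0_hat):
    """AIPW estimate of the ACET: treated residuals minus
    propensity-weighted control residuals, averaged over the treated."""
    n1 = t.sum()
    psi = t * (y - mu0_hat) - (1 - t) * p_hat * (y - mu0_hat) / (1 - p_hat)
    return psi.sum() / n1
```

With the true nuisance functions plugged in, both estimators are unbiased for their targets; in practice the nuisance values are cross-fitted or fitted on the same sample under the rate conditions discussed below.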
b. Let u = Y(t) − µ_t(X). There is some r > 0 for which E[|µ_t(x_i) µ_{t′}(x_i)|^{1+r}] and E[|u_i|^{4+r}] are bounded, for given values of t and t′.
The estimators of the nuisance functions have yet to be introduced for these AIPW estimation strategies. In order to obtain the desired results, the proposed estimators should be well behaved. More precisely, the following consistency and rate conditions are considered for the nuisance function estimators.

Assumption 3. Let p̂(x) be an estimator of P[T = 1 | X = x] which depends only on {x_i, t_i}_{i=1}^n (the assumption of "no additional randomness", Farrell, 2015). Moreover, for a given t:
(a) E_n[(μ̂_t(x_i) − μ_t(x_i))²] = o_P(1) and E_n[(p̂(x_i) − p(x_i))²] = o_P(1);
(b) E_n[(μ̂_t(x_i) − μ_t(x_i))²]^{1/2} E_n[(p̂(x_i) − p(x_i))²]^{1/2} = o_P(n^{−1/2});
(c) E_n[(μ̂_t(x_i) − μ_t(x_i)){1 − 1{t_i = t}/P[T = t | X = x_i]}] = o_P(n^{−1/2}),
where E_n denotes the empirical average over i = 1, . . . , n.

The following proposition describes uniform validity results obtained by Farrell (2018, Corollaries 2 and 3) under the above regularity conditions.

Proposition 1. For each n, let P_n be the set of distributions obeying Assumptions 1a(i), 1b(i), 1c and 2a. Further, assume Assumption 2b holds for t = t′ = 0, and let p̂(x) and μ̂_0(x) fulfill Assumption 3. Then the ACET estimator τ̂_t is uniformly asymptotically normal over P_n, and Wald confidence intervals based on its estimated influence-function variance are uniformly valid. Let also Assumptions 1a(ii) and 1b(ii) be fulfilled, let Assumption 2b hold for t, t′ ∈ {0, 1}, and assume μ̂_1(x) fulfills Assumption 3. Then the analogous statement holds for the ACE estimator τ̂.

Remark 1. A multiplicative rate condition such as Assumption 3(b) is weaker than separate conditions on the two nuisance model estimators: it only requires that one of the nuisance functions is estimated at a faster rate if the other one is estimated at a slower rate. However, using the regularity conditions in this paper and in Farrell et al. (2021), the rate o_P(n^{−1/4}) is obtained for each of the nuisance estimators separately. This ensures that Assumption 3(c) for μ̂ is also fulfilled. This assumption can, however, be dropped by considering sample splitting (Chernozhukov et al., 2018).
Note, finally, that because the AIPW estimator is based on the efficient influence function, its asymptotic variance equals the semiparametric efficiency bound (Tsiatis, 2006).

Convolutional neural networks
We consider a specific CNN architecture with parallel hidden layers structured as follows. Let the input column vector be denoted by h⁰ := (x_1, . . . , x_d)ᵀ ∈ Ω, where Ω ⊆ R^d. Let L be the number of hidden layers, in each of which we have E parallel vectors (channels). Let σ be the ReLU function defined on the space of real numbers as σ(z) = max(0, z) for z ∈ R. The vectors in each hidden layer l ∈ {1, . . . , L} have size d_l and are defined by

h_l^e = σ(W_l^e h_{l−1}^e + b_l^e),

where e ∈ {1, . . . , E}, h_0^e = h⁰, b_l^e is a vector called the bias vector in the neural network literature, and W_l^e is a (d_{l−1} + S) × d_{l−1} weight matrix with entries

(W_l^e)_{i,j} = w^e_{i−j,l} if 0 ≤ i − j ≤ S, and 0 otherwise,

where the weight parameter w^e_{i,l} is the i-th element of the filter mask (w^e_{0,l}, . . . , w^e_{S,l}). S is considered fixed through all values of l and e. We have d_l = d_0 + Sl, where d_l is the size of the vectors in the l-th layer and d_0 = d.
The structure of a CNN as defined above provides us with the following function space on Ω:

C := { x ↦ Σ_{e=1}^E (a^e)ᵀ h_L^e(x) : a^e ∈ R^{d_L}, together with the weights and biases above }.

Similar to Zhou (2020) (who, however, uses E = 1), we assume that for all values of e in {1, . . . , E} the bias vector b_l^e has identical components in its middle d_l − 2S entries. Therefore, the total number of parameters Q (duplicate/shared weights are counted more than once) for the neural network fulfills Q ≲ E L d_L, where the notation x_n ≲ y_n for two sequences means that x_n ≤ C y_n for some constant C > 0 and all n.
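As a concrete check of the weight-matrix definition, the following numpy sketch (our own illustration, not the paper's implementation) builds the Toeplitz-type matrix W_l^e from a filter mask and runs one parallel channel forward; note that W h coincides with the full (zero-padded) convolution of the filter with h, which is why each layer grows the vector by S entries.

```python
import numpy as np

def conv_weight_matrix(w, d_in):
    """Toeplitz-type weight matrix of shape (d_in + S, d_in) with
    W[i, j] = w[i - j] for 0 <= i - j <= S and 0 otherwise."""
    S = len(w) - 1
    W = np.zeros((d_in + S, d_in))
    for i in range(d_in + S):
        for j in range(d_in):
            if 0 <= i - j <= S:
                W[i, j] = w[i - j]
    return W

def cnn_channel_forward(x, filters, biases):
    """Forward pass of one channel: h_l = relu(W_l h_{l-1} + b_l)."""
    h = x
    for w, b in zip(filters, biases):
        h = np.maximum(0.0, conv_weight_matrix(w, len(h)) @ h + b)
    return h
```

With input dimension d_0 and L layers of filter length S + 1, the output has dimension d_0 + SL, matching d_l = d_0 + Sl above.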

Main results
In this section, we provide results that allow us to use the CNN architectures described above to fit the nuisance models needed for AIPW estimation of the ACE and ACET, and to obtain uniformly valid inference. In particular, we need to show that the rate conditions in Assumption 3 are fulfilled.
Assumption 5. Let ℓ be a loss function that, for any functions f and g and any realization z of Z, fulfills

|ℓ(f, z) − ℓ(g, z)| ≤ C_ℓ |f(x) − g(x)| (Lipschitz continuity), and

c_1 E[(f(X) − f*(X))²] ≤ E[ℓ(f, Z)] − E[ℓ(f*, Z)] ≤ c_2 E[(f(X) − f*(X))²] (curvature),

for constants C_ℓ, c_1, c_2 > 0, where f* = arg min_f E[ℓ(f, Z)].
Further, let F be an arbitrary set of functions. We define the approximation error

ε_{W,F} := sup_{f* ∈ W} inf_{f ∈ F} ∥f − f*∥_∞,   (4)

where ∥·∥_∞ is the supremum norm. The following result is a special case of Theorem 2(b) in Farrell et al. (2021), and gives the rate of convergence for our estimate f̂_C of a target f* (a nuisance function). Prior to presenting the theorem, the space in which we seek the nuisance functions needs to be defined. We use the notation W^{β,∞}(Ω) for the Sobolev space with smoothness β ∈ N_+. We consider W to be the unit ball of the Sobolev space defined on Ω:

W := { f ∈ W^{β,∞}(Ω) : max_{|α| ≤ β} ess sup_{x ∈ Ω} |D^α f(x)| ≤ 1 },

where α is a multi-index and D^α f is the weak derivative.
Theorem 1. Let f* ∈ W, and assume that Assumptions 4-5 hold. For an absolute constant M > 0 and M′ = 2M, let the approximation error ε_{W,C} be defined as in (4). With probability at least 1 − e^{−γ}, and for large enough n,

E_n[(f̂_C(x_i) − f*(x_i))²] ≤ C ( (Q L log Q) log n / n + (log log n + γ)/n + ε²_{W,C} ),   (5)

where the constant C > 0 is independent of n, but may depend on d, M, and other fixed constants.
Our next step is to bound the term ε²_{W,C} in the above inequality on the estimation error. The following result is similar to Theorem 1 in Zhou (2020). However, we allow here the number of parallel channels to grow with n, which translates into a faster-growing function space for our specific CNN architectures, and thereby a faster decay rate for the approximation error.
where c is an absolute constant. Finally, our main result gives conditions for the CNN nuisance function fits to fulfill Assumption 3, and therefore conditions for AIPW estimation based on CNN fits to yield uniformly valid inference for the ACE and ACET (Proposition 1).
Theorem 3. Let the conditions in Theorem 1 hold for u_i = x_i, v_i = y_i, and f* = µ_t. Moreover, let the same conditions be fulfilled for u_i = x_i, v_i = t_i, and f* = p, and let s and d fulfill the assumptions of Theorem 2. Then, the rate conditions in Assumption 3 are fulfilled for the nuisance function estimators (3) with F = C.

Remark 2. The rate result for CNN-based estimators in Theorem 3 corresponds to similar results for MLP in Farrell et al. (2021, Theorem 3). However, the conditions on smoothness are less strict in our case. The rate of growth for the number of parameters considered for MLP-based estimators in Farrell et al. (2021, Theorem 1) is n^{d/(β+d)} log^5 n, while we have considered the growth rate n^{(d+1)/(2d+4)} log^4 n. These rates are similar up to a log factor if d is large and β is close to d. Notice, however, that these are not necessarily the minimal rates required for valid inference on the ACE(T).
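To see why a parameter growth rate of order n^{(d+1)/(2d+4)} (up to logs) is compatible with the required o_P(n^{-1/2}) rate for the squared estimation error, the complexity term can be checked directly. The sketch below assumes, as in the bound above, that the complexity penalty scales like Q L log Q with the depth L at most polylogarithmic in n:

```latex
With $Q \asymp n^{\frac{d+1}{2d+4}} \log^4 n$, the complexity term satisfies
\[
\frac{Q L \log Q \cdot \log n}{n}
\;\lesssim\; n^{\frac{d+1}{2d+4} - 1}\,\mathrm{polylog}(n)
\;=\; n^{-\frac{d+3}{2d+4}}\,\mathrm{polylog}(n)
\;=\; o\!\left(n^{-1/2}\right),
\]
since $2(d+3) = 2d + 6 > 2d + 4$ implies $\tfrac{d+3}{2d+4} > \tfrac12$ for every $d \ge 1$.
```

The remaining work, handled by Theorem 2, is to verify that the approximation error term decays at least as fast under the same growth rate.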

Simulation Study
We perform a simulation study to evaluate the CNN-based AIPW estimation strategy of average causal effects proposed above. We focus here on τ_t (the ACET), but similar results are obtained for τ. Comparisons are made with other algorithms to fit the nuisance functions, including MLP (Farrell et al., 2021) and post-lasso estimators (Farrell, 2015). Moreover, we also include alternative strategies to AIPW, namely OR estimation (Tan, 2007) and targeted learning (van der Laan and Rose, 2011).
Setting 1 Potential outcomes and treatment assignment are generated as follows: In this setting, polynomial functions of the difference of some neighboring variables in the sequence are used to generate the nuisance models.
Setting 2 Potential outcomes and treatment assignment are generated as follows: where

l_1(x, y, z, w) = 10 if y/x, z/y, and w/z > 1.15, and 0 otherwise,

l_2(x, y, z, w) = 5 if y/x, z/y, and w/z < 1.05, and 0 otherwise.

Here, the outputs of l_1 and l_2 are non-zero only if an increase by a factor bigger than 1.15, respectively a decrease by a factor bigger than 1.05, is found in all consecutive pairs of covariates in the input of size four. The value of l_3 is non-zero if an increase by a factor bigger than 1.1 is followed by a decrease and then by another increase by a factor bigger than 1.1, or if the oscillation has the opposite direction.
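Our reading of the definitions of l_1 and l_2 can be encoded directly; the snippet below is an illustrative transcription of the thresholds as stated (the function names follow the text):

```python
def l1(x, y, z, w):
    # 10 if all three consecutive ratios exceed 1.15 (a steady increase), else 0
    return 10.0 if (y / x > 1.15 and z / y > 1.15 and w / z > 1.15) else 0.0

def l2(x, y, z, w):
    # 5 if all three consecutive ratios are below 1.05, else 0
    return 5.0 if (y / x < 1.05 and z / y < 1.05 and w / z < 1.05) else 0.0
```

Such piecewise-constant functions of ratios of neighboring covariates are non-smooth, which is what makes Setting 2 harder for methods relying on polynomial expansions.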
The following algorithms to estimate the nuisance functions are evaluated: CNN A convolutional neural network using the ReLU activation function, with the architecture illustrated in Figure 1. Note that this architecture is slightly more flexible than the one considered in Section 2.2; hence, we expect it to perform at least as well in terms of approximation rate. Both for the potential outcome model µ_0 and the propensity score p, the number of channels in the input layer is one. For the outcome model, 128 channels in the first hidden layer and 16 channels in the second hidden layer are used, while 32 channels in the first hidden layer and 8 channels in the second hidden layer are used for fitting the propensity score. We have implemented the CNN-based AIPW estimation strategy in the package DNNCausal, which is available at https://github.com/stat4reg/DNNCausal.
MLP A ReLU-based feedforward neural network with two hidden layers. Two different feedforward architectures are used to estimate µ_0 and the propensity score p. The network for fitting the outcome model has 128 and 80 nodes in the two hidden layers, respectively. For the propensity score, the two layers contain 32 and 8 nodes, respectively.
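The economy of weight sharing can be illustrated by rough parameter counts for architectures of the sizes above. The counting functions below are our own simplification (edge effects, pooling and flattening layers are ignored, and biases are counted as one per output channel), so the numbers are indicative only:

```python
def mlp_param_count(d_in, widths):
    """Weights plus biases of a fully connected ReLU network."""
    total, prev = 0, d_in
    for w in widths:
        total += prev * w + w  # dense weight matrix + bias vector
        prev = w
    return total

def cnn_param_count(n_channels, filter_sizes):
    """Free parameters of stacked convolutional layers: one filter of
    length S + 1 per (input channel, output channel) pair, plus one
    bias per output channel (a simplification)."""
    total, prev = 0, 1
    for ch, s in zip(n_channels, filter_sizes):
        total += prev * ch * (s + 1) + ch
        prev = ch
    return total
```

For example, a dense 128/80 network on a 10-dimensional input uses 11,728 parameters, while a two-layer convolutional stack with 128 and 16 channels and short filters uses substantially fewer, independently of the input length.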
post-lasso Nuisance models are fitted in two steps, where all higher order terms up to order three are considered. First, a lasso variable selection step is performed using the R package hdm (Chernozhukov et al., 2016), and then maximum likelihood is used to fit the models with the selected covariates (those with coefficients not shrunk to zero by lasso). Variable selection is performed using two different strategies: double selection, which takes the union of the variables selected when regressing the outcome and the treatment, and uses it to refit both models (Belloni et al., 2014; Moosavi et al., 2021); and single selection, which keeps the two sets of variables selected when regressing the outcome and the treatment, and refits each model using its corresponding set (Farrell, 2015).
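The selection logic can be sketched schematically in numpy (this is not the hdm implementation; the lasso here is fitted by plain proximal gradient, and for illustration a linear model is used for the treatment as well, whereas in practice a logistic lasso would be used for a binary treatment):

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Lasso coefficients via proximal gradient descent (ISTA)."""
    n, p = X.shape
    beta = np.zeros(p)
    # step size = 1 / Lipschitz constant of the smooth part
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = beta - step * grad
        # soft-thresholding (proximal operator of the l1 penalty)
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
    return beta

def selected(beta, tol=1e-8):
    """Indices of covariates with non-zero lasso coefficients."""
    return set(np.flatnonzero(np.abs(beta) > tol))
```

Double selection then refits both models on `selected(beta_y) | selected(beta_t)` (the union), while single selection refits the outcome model on `selected(beta_y)` and the treatment model on `selected(beta_t)`.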
We use the above nuisance function estimators in several estimators of τ t , as follows: AIPW is implemented using CNN, MLP, and post-lasso with both single and double selection (denoted respectively: DRcnn, DRmlp, DRss, DRds).
OR estimator with post-lasso and double selection (denoted ORds).
TMLE using the R-package tmle and the function SL.nnet to fit all nuisance functions (denoted TMLE).
Full details with code for this simulation study are available at the GitHub page: https://github.com/stat4reg/CausalCNN/blob/main/README.md. We set the sample sizes to n = 5000 and 10000, and run 1000 replications. Bias, standard errors (both estimated and Monte Carlo), mean squared errors (MSE, bias squared plus variance) and empirical coverages for 95% confidence intervals are reported in Tables 1 and 2.

The DGP of Setting 1 (Table 1) is the least complex one, and its polynomial form is favorable to the post-lasso methods, which are based on higher order terms. It is therefore expected that the post-lasso methods ORds, DRss and DRds have low bias. DRcnn has larger bias, but decreasing with n. All methods have similar MSEs, even though DRcnn has the lowest MSE for the largest sample size. Empirical coverages are close to the nominal 95% level for all methods.

The DGP of Setting 2 (Table 2) is less smooth, and we observe that the DRcnn and DRmlp methods have the lowest biases and MSEs, with DRcnn performing slightly better. TMLE and post-lasso both have such a large bias that the confidence interval coverages are far from the nominal level. It should be noted, however, that TMLE uses a built-in feedforward neural network that is not tuned to the generated datasets, so the comparison between TMLE and the DR estimators is not a fair one.

Effect of early retirement on health
The existing theoretical and empirical evidence on the effects of early retirement on health is mixed, and both positive and negative results may be expected and have been reported in different situations; see Barban et al. (2020) and the references therein. In this context, we add to the empirical evidence by studying the effect of early retirement on health using a database linking, at the individual level, a collection of socio-economic and health registers on the whole Swedish population (Lindgren et al., 2016). This allows us to follow until 2017 two complete cohorts born in 1946 and 1947 and residing in Sweden in 1990. As health outcome we observe the number of days in hospital per year after retirement. To study the effects of early retirement we consider those who were still alive at age 62, and either retire at age 62 (T = 1, treatment) or retire later (T = 0, control group). For the cohorts studied, retirement pensions are accessible from age 61, although the usual age of retirement is 65 (see Figure 2 for descriptive statistics on age of retirement). Therefore, retiring at age 62 is considered early retirement, since it decreases one's annual pension transfers compared to later retirement; see, e.g., Barban et al. (2020) for details on the Swedish pension system. Barban et al. (2020) also contains a study of the effects of early retirement, although based on fewer measurements of time-dependent covariates and using a matching design. This study provides new evidence by taking into account richer histories of the time-dependent covariates, and by incorporating CNN to address complex dynamics in pre-treatment confounding covariates.

More precisely, the design of the study is as follows. An individual alive at age 62 is considered as taking early retirement at that age if his or her pension transfers become larger than income from work at that age for the first time (i.e., they were never so earlier). For replicability, the exact definitions using income and transfer variables from the Swedish registers are available at https://github.com/stat4reg/CausalCNN/blob/main/population_description.md. The health outcomes of interest are the numbers of hospitalization days per year during the five years following early retirement. We first check, however, whether early retirement has an effect on survival during the first five years following early retirement. There were no (or hardly significant) such effects at the 5% level (results reported in the appendix, Table 4), and we therefore focus the analysis on the survivors when looking at effects on hospitalization. The following covariates are used as input in the CNN architectures to fit the treatment and outcome nuisance models. Besides the birth year, we include the following (pre-treatment) covariates measured at 61 years of age: marriage status, municipality, education level, spouse's education level, and number of biological children. Moreover, we consider measurements of the following covariates for each of the ten years preceding retirement: days of hospitalization and open health care, annual income from labour, annual income from pension, annual income from unemployment, annual income from early retirement and sickness benefit, annual compensation for illness, and spouse's retirement status. Thus, the covariates include eight time(-structured) series of ten observations each per individual. We do separate analyses for men and women, which gives two samples of approximately 100,000 individuals each.
The code giving details on the CNN architectures and tuning parameters can be found at https://github.com/stat4reg/CausalCNN/blob/main/population_description.md. We also apply the other methods evaluated in the simulation study above.
Results are presented in Table 3. Based on the naive estimation not controlling for confounders, early retirement has a clear beneficial effect on health (a reduction of between 0.167 and 0.396 days in hospital on average) over the five years of follow-up considered. These naive effects are larger for women at the beginning of the follow-up. The effects disappear when controlling for the considered covariates. This is seen most clearly with CNN, which yields effects smaller than 0.1 days (in absolute value) in all cases but one. Confidence intervals at the 95% level cover zero in most cases. Thus, while there appeared to be a beneficial effect of early retirement on health, this effect was probably due to confounding.

Discussion
We have proposed and studied a semiparametric estimation strategy for average causal effects which combines convolutional neural networks with AIPW estimation. As long as the conditions given are met, this strategy yields locally efficient and uniformly valid inference. CNN have the advantage over fully-connected feed-forward neural networks of being more economical in the number of weights (free parameters) used. A main contribution of the paper is to show under which conditions CNN fits of nuisance functions achieve the n^{-1/4} convergence rate required to obtain uniformly valid inference on an ACE. Using the result in Farrell et al. (2021), in which the convergence rate of a ReLU-based feed-forward network is shown to follow Equation (1), we show that the rate conditions in Assumption 3 are fulfilled. Specifically, we use CNN with a given complexity, for which we show that the approximation rate in (1) is small enough. A key component of the result in Farrell et al. (2021) is the upper bound on empirical process terms found in Bartlett et al. (2005), based on Rademacher averages, which are measures of function space complexity. Global Rademacher averages do not provide fast rates as in (1). To derive this fast rate, a localization analysis is employed, which considers only the intersection of the function space with a ball around the target function, reflecting that in practice the algorithm only searches in a neighborhood of the true function. Moreover, the tight bound on the pseudo-dimension of deep nets found in Bartlett et al. (2019) is used.
We also present numerical experiments showing the performance of the estimation strategy in finite sample and compare it to other machine learning based estimation strategies.The applicability of the method is illustrated through a population wide observational study of the effect of early retirement on hospitalization in Sweden.

Appendix
In this section we also consider the Sobolev space H^β(Ω), which is defined using the L² norm instead of the L^∞ norm. Moreover, let ∥f∥_{L²(Ω)} denote the L² norm of a function f defined on Ω.

Proof of Theorem 3
Proof. The rate conditions (a) and (b) of Assumption 3 are fulfilled if both E_n[(μ̂_t(x_i) − µ_t(x_i))²] and E_n[(p̂(x_i) − p(x_i))²] are o_P(n^{−1/2}). For this, the three terms in the upper bound (5) in Theorem 1 must be o_P(n^{−1/2}). The first term is o_P(n^{−1/2}) under the assumed growth rates of the architecture. The second term is also o_P(n^{−1/2}) if γ = o(n^{1/2}). For the third term, Theorem 2 gives that the squared approximation error is of small enough order. Therefore, we can conclude that the rate conditions hold under the stated conditions.

Early retirement effects on survival
Additional results on the effect of early retirement on survival can be found in Table 4.

Figure 1 :
Figure 1: Illustration of the convolutional neural network used in the simulations and the real data study. Each layer l contains a number E_l of channels, where E_0 (not shown in the figure) is the number of time series in the set of input covariates. Each channel (feature map) in layer l (e.g., the blue vector in the second layer) is formed by first applying E_{l−1} kernels (convolution filters) to the vectors in layer l − 1 (as demonstrated by the blue arrow in the set of kernels applied to the first layer) and then summing the results. Some covariates are not time series; their inclusion in the network is not illustrated in the figure for simplicity. A fully connected feed-forward structure is used for those covariates, and its output is combined with the output from the time-series data in the flattening layer shown above.

Figure 2 :
Figure 2: Retirement age for men and women belonging to the cohorts born in Sweden in 1946 and 1947. The last bar in each histogram displays those who retired at age 71 or later.

To prove condition (c) of Assumption 3, we use the proof of Farrell et al. (2021, Lemma 10). There it is shown that, for an arbitrary class of feedforward neural networks, with probability at least 1 − exp(−n^{(d+3)/(4d+8)} log^6 n),

E_n[(μ̂_{t,C}(x_i) − µ_t(x_i))(1 − 1{t_i = t}/P[T = t | X = x_i])]

is of the required order o_P(n^{−1/2}).

Table 1 :
Setting 1 DGP: Bias, standard errors (both estimated, Est sd, and Monte Carlo, MC sd), MSEs and empirical coverages for 95% confidence intervals over 1000 replications.

Table 3 :
Effect of early retirement on number of days in hospital during five years of follow up, for the early retirees, and 95% confidence intervals.