Visually Communicating and Teaching Intuition for Influence Functions

Received 08 Oct 2018
Accepted 24 Dec 2019
Accepted author version posted online: 21 Jan 2020
Accepted author version

Abstract

Estimators based on influence functions (IFs) have been shown to be effective in many settings, especially when combined with machine learning techniques. By focusing on estimating a specific target of interest (e.g., the average effect of a treatment), rather than on estimating the full underlying data generating distribution, IF-based estimators are often able to achieve asymptotically optimal mean-squared error. Still, many researchers find IF-based estimators to be opaque or overly technical, which makes their use less prevalent and their benefits less available. To help foster understanding and trust in IF-based estimators, we present tangible, visual illustrations of when and how IF-based estimators can outperform standard “plug-in” estimators. The figures we show are based on connections between IFs, gradients, linear approximations, and Newton-Raphson.

Disclaimer

As a service to authors and researchers we are providing this version of an accepted manuscript (AM). Copyediting, typesetting, and review of the resulting proofs will be undertaken on this manuscript before final publication of the Version of Record (VoR). During production and pre-press, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal relate to these versions also.

1 Introduction

Influence functions (IFs) are a core component of classic statistical theory, and have emerged as a popular framework for incorporating machine learning algorithms in inferential tasks (van der Laan and Rose, 2011; Kennedy et al., 2017; Chernozhukov et al., 2018). Estimators based on IFs have been shown to be effective in causal inference and missing data (Robins et al., 1994; Robins and Rotnitzky, 1995; van der Laan and Robins, 2003), regression (van der Laan, 2006; Williamson et al., 2017), and several other areas (Bickel and Ritov, 1988; Kandasamy et al., 2014).

Unfortunately, the technical theory underlying IFs intimidates many researchers away from the subject. This lack of approachability slows both the theoretical progress within the IF literature, and the dissemination of results.

One typical approach for partially explaining intuition for IF-based estimators is to describe properties that can be easily seen in their formulas. For example, IFs can be used to estimate average treatment effects from observational data, after first modeling the process by which individuals are assigned to treatment, and the outcome process that the treatment is thought to affect. The resulting IF-based estimates have been described as “doubly robust (DR)” in the sense that they remain consistent if either the treatment model or the outcome model is correctly specified up to a parametric form (van der Laan and Robins, 2003; Bang and Robins, 2005; Kang et al., 2007). While the DR property can sometimes be checked by simply observing an estimator’s formula, it does not necessarily provide intuition for the underlying theory of IF-based estimators. Furthermore, the DR property often does not capture an arguably more important benefit of these estimators, which is that they can attain parametric rates of convergence even when constructed based on flexible nonparametric estimators that themselves converge at slower rates. Unlike the DR explanation, the notion of faster convergence rates with no parametric assumptions can also extend to applications of IFs beyond the goal of treatment effect estimation (Bickel and Ritov, 1988; Birgé and Massart, 1995; Kandasamy et al., 2014; Williamson et al., 2017).

This paper visually demonstrates a general intuition for IFs, based on a connection to linear approximations and Newton-Raphson. Our target audiences are statisticians and statistics students who have some familiarity with multivariate calculus. Our hope is that these illustrations can be similarly useful to illustrations of the standard derivative as the “slope at a point,” or illustrations of the integral as the “area under a curve.” For these calculus topics, a guiding intuition can be visualized in minutes, even though formal study typically takes over a semester of coursework.

In Section 2 we introduce notation. We also review “plug-in” estimators, which will serve as a baseline for comparison. In Sections 3 & 4 we show figures illustrating why nonparametric, IF-based estimators can asymptotically outperform plug-in estimators, but may underperform with small samples. We avoid heuristic 2-D or 3-D representations of an infinite-dimensional distribution space, and instead show literal, specific 1-dimensional paths through that space. In Section 5 we briefly discuss connections to semiparametric models, higher order IFs, and robust statistics. Our overall goal is to facilitate discussion and teaching of IF-based estimators so that their benefits can be more widely developed and applied.

2 Setup: target functionals and “plug-in” estimates

Suppose we observe a sample $z_1, z_2, \dots, z_n$ representing $n$ independent and identically distributed draws of a random vector $Z$ following an unknown distribution $P$. For ease of notation, we will generally assume that $Z$ is continuous, unless otherwise specified in particular examples. We consider the setting where we wish to estimate a particular 1-dimensional “target” description of the distribution $P$, also known as an estimand. Any such “target” can be written as a functional of a distribution function, using notation such as $T(P)$. The term “functional” simply indicates that the input to $T$ is itself a (distribution) function. For example, if $Z = (Z_1, Z_2)$ is bivariate, we may consider the mean of $Z_j$, denoted by $T_{\text{mean},j}(P) := E_P(Z_j)$; the covariance of $Z_1$ and $Z_2$, denoted by $T_{\text{cov}}(P) := E_P(Z_1 Z_2) - E_P(Z_1)E_P(Z_2)$; or the conditional expectation of $Z_1$, denoted by $T_{\text{cond},z_2}(P) := E_P(Z_1 \mid Z_2 = z_2)$, where $E_P$ is the expectation function with respect to the distribution $P$.

One intuitive approach for estimating functionals $T(P)$ is to simply “plug in” the empirical distribution. This produces the estimate $T(\hat{P})$, where $\hat{P}$ is the distribution placing probability mass $1/n$ at each observed sample point $z_1, \dots, z_n$. While plugging in $\hat{P}$ will suffice for certain estimation targets, such as the mean of a scalar variable $Z$, it is unreliable for other targets, such as the density of a continuous, scalar variable $Z$ at a previously unobserved value $z_{\text{new}}$. The conditional expectation functional described above, $T_{\text{cond},z_2}(P) = E_P(Z_1 \mid Z_2 = z_2)$, poses a similar challenge in the bivariate setting. If the value $z_2$ has not been previously observed, then some form of interpolation beyond $\hat{P}$ will be required. Of course, the “plug-in” approach easily extends to allow this. Rather than using $\hat{P}$, any smoothed or parametric estimate $\tilde{P}$ of the distribution $P$ can be plugged in to estimate $T(P)$ as $T(\tilde{P})$. Further, if $\tilde{P}$ is a parametric, maximum likelihood estimate (MLE) of $P$, then $T(\tilde{P})$ is an MLE as well, and enjoys similar optimality properties when the likelihood assumptions are correct (by the invariance property of the MLE; see Casella and Berger, 2002).
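As a rough illustration of plugging in the empirical distribution, the sketch below computes the mean and covariance functionals from this section under the empirical distribution. The simulated bivariate normal data and the helper names `t_mean` and `t_cov` are assumptions made for this example and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: n draws of a bivariate Z = (Z1, Z2).
z = rng.multivariate_normal(mean=[1.0, 2.0], cov=[[1.0, 0.6], [0.6, 2.0]], size=n)

# The empirical distribution places probability mass 1/n on each observed point.
weights = np.full(n, 1.0 / n)

def t_mean(points, w, j):
    """Mean functional: E(Z_j) under the discrete distribution (points, w)."""
    return np.sum(w * points[:, j])

def t_cov(points, w):
    """Covariance functional: E(Z1 Z2) - E(Z1) E(Z2)."""
    cross = np.sum(w * points[:, 0] * points[:, 1])
    return cross - t_mean(points, w, 0) * t_mean(points, w, 1)

plug_in_mean = t_mean(z, weights, 0)
plug_in_cov = t_cov(z, weights)
print(plug_in_mean, plug_in_cov)
```

The same recipe has no direct analogue for a density or a conditional expectation at an unobserved value, because the empirical distribution places no mass there; that is where a smoothed or parametric estimate is needed.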

The focus of this paper is on estimation techniques that weaken the likelihood assumptions required for plug-in MLEs. Specifically, we will see that estimates based on influence functions allow us to use flexible estimates for P, and to make asymptotic statements about estimator performance, without strict parametric assumptions. Importantly, these IF-based estimates adapt to the particular target of interest T, whereas likelihood-based approaches ignore the choice of T (see discussion in Section 1.4 of van der Laan and Rose, 2011). When likelihood assumptions do not hold, estimators based on influence functions will often converge more quickly than simpler plug-in estimates.

3 First order bias-corrections: visualizing influence functions for estimands

Influence functions (IFs) were originally introduced as a description of estimator stability, namely, of how much an estimator changes in response to a slight perturbation in the sample distribution (Hampel, 1974; see Section 5.3, below). In the case of plug-in estimators, IFs can also address the parallel, more optimistic question: “how would the plug-in estimate T(P˜) change in response to a slight improvement in our estimate P˜?” Remarkably, this question can be informed even without directly observing a more accurate version of P˜, as we illustrate in the remainder of this section.

To clarify what we mean by a “slight improvement” in $\tilde{P}$, we define a set of distribution estimates indexed by their accuracy. Specifically, let $p$ and $\tilde{p}$ be probability densities for $P$ and $\tilde{P}$ respectively. As in the previous section, $\tilde{P}$ here denotes a smoothed or parametric estimate of $P$. Let $P_\epsilon$ be the distribution with density

$$p_\epsilon(z) := (1-\epsilon)\,p(z) + \epsilon\,\tilde{p}(z) \tag{3.1}$$

for $\epsilon \in [0,1]$, where the accuracy of $P_\epsilon$ improves as $\epsilon$ approaches zero. Distributions of this form are sometimes written with the shorthand $P_\epsilon := P + \epsilon(\tilde{P} - P)$. We now refer to the set $\mathcal{P} := \{P_\epsilon\}_{\epsilon \in [0,1]}$ as a path within the space of possible distribution functions that connects $\tilde{P}$ to $P$. For each distribution $P_\epsilon$ along this path, there exists a corresponding value for $T(P_\epsilon)$, though note that in practice the functional can only be computed at the end point $\epsilon = 1$.

We illustrate an example of such a set of distributions in Figure 1-A, and illustrate the values $T(P_\epsilon)$ along this path in Figure 1-B. As a working example for our illustrations, we will use the functional of the integrated squared density, $T(P) = \int p(z)^2\,dz$, for a 1-dimensional variable $Z$ (Bickel and Ritov, 1988; Birgé and Massart, 1995; Laurent, 1996; Giné and Nickl, 2008; Robins et al., 2009). This choice is purely for the purposes of coding an example figure, however. The technical discussion below does not assume $T(P) = \int p(z)^2\,dz$. In the appendix, we additionally illustrate the special case where $Z$ is discrete, and where we can show the space of all possible distributions in a 2-dimensional figure.

Our ultimate goal is to find the y-intercept of the curved, solid line in Figure 1-B. We denote this line by the function $v$, where $v(\epsilon) := T(P_\epsilon)$ and the y-intercept of interest is $v(0) = T(P_0) = T(P)$. Fortunately, although the solid curve $v(\epsilon)$ is unknown and can only be evaluated at $\epsilon = 1$, we will see shortly that it is still possible to approximate this curve, and to find the y-intercept of our approximation. Specifically, we will see that we can estimate the slope of $v(\epsilon)$ at $\epsilon = 1$, denoted here by $v'(1) := \frac{\partial}{\partial\epsilon}T(P_\epsilon)\big|_{\epsilon=1}$. This, in turn, lets us approximate the curve $v(\epsilon)$ linearly at $\epsilon = 1$. The y-intercept for our approximation of $v$ is then equal to $T(P_1) - d \cdot v'(1)$ (shown as “1-step” in Figure 1-B), where $d = 1$ is the distance between $P_1$ and $P_0$ in terms of $\epsilon$. Thus, an ideal estimator for $T(P_0)$ might resemble $\{T(P_1) - v'(1)\}$, motivated by how our plug-in estimate ($T(P_1)$) would change if our initial distribution estimate ($P_1$) became infinitesimally more accurate ($v'(1)$). Before considering how $v'(1)$ may be estimated, we discuss two interpretations of this “1-step” approach (see also Bickel, 1975; Kraft and van Eeden, 1972 for early examples of 1-step estimators).
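To make the curve and its linear approximation concrete, here is a minimal numerical sketch for the integrated squared density. It assumes a "true" density N(0, 1) and an initial estimate N(0.5, 1.2²); both choices, the integration grid, and the function names are illustrative assumptions rather than the settings used for the paper's figures. Because both densities are known here, the slope at ε = 1 can be computed directly; in practice it must be estimated.

```python
import numpy as np

grid = np.linspace(-12.0, 12.0, 24001)
dz = grid[1] - grid[0]

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = normal_pdf(grid, 0.0, 1.0)        # assumed "true" density
p_tilde = normal_pdf(grid, 0.5, 1.2)  # assumed initial estimate

def T(density):
    """Integrated squared density, approximated by a Riemann sum."""
    return np.sum(density ** 2) * dz

def v(eps):
    """Value of the functional along the path p_eps = (1 - eps) p + eps p_tilde."""
    return T((1 - eps) * p + eps * p_tilde)

# Slope at eps = 1: d/d(eps) of the integral of p_eps^2 is 2 * integral of p_eps (p_tilde - p).
v_prime_1 = 2 * np.sum(p_tilde * (p_tilde - p)) * dz

plug_in = v(1.0)                          # T(P_1)
one_step_intercept = v(1.0) - v_prime_1   # y-intercept of the linear approximation
print(v(0.0), plug_in, one_step_intercept)
```

In this particular setup the intercept of the tangent line lands closer to v(0) than the plug-in value does, which is the behavior sketched in Figure 1-B.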

One understanding of the “1-step” approach comes from an analogy to Newton-Raphson, an iterative procedure for finding the roots of a real function $f$. Given an initial guess $x_0 \in \mathbb{R}$ of a root of $f$ (a value $x_{\text{root}}$ satisfying $f(x_{\text{root}}) = 0$), Newton-Raphson attempts to improve on this guess by approximating $f$ linearly at $x_0$. The root of this linear approximation is taken as an updated guess for a root of $f$, and the procedure is iterated until convergence. When $v$ (defined above) is invertible, finding the value of $T(P) = v(0)$ is equivalent to a root-finding problem for $v^{-1}$, and the “1-step” method described above is equivalent to 1 step of Newton-Raphson for the function $v^{-1}$ (see Pfanzagl, 1982).
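The Newton-Raphson analogy can be checked on a toy curve. The sketch below uses a hypothetical strictly decreasing quadratic in place of v (its coefficients are arbitrary assumptions) so that the inverse has a closed form, and confirms that one Newton-Raphson update for the inverse function, started at v(1), matches v(1) − v′(1).

```python
import numpy as np

# Hypothetical smooth, strictly decreasing curve on [0, 1] standing in for v.
a, b, c = 0.28, -0.08, 0.03

def v(eps):
    return a + b * eps + c * eps ** 2

def v_prime(eps):
    return b + 2 * c * eps

def v_inverse(t):
    """Inverse of v on [0, 1]: the eps solving v(eps) = t."""
    return (-b - np.sqrt(b ** 2 - 4 * c * (a - t))) / (2 * c)

def newton_step(x):
    """One Newton-Raphson update for a root of v_inverse."""
    eps = v_inverse(x)
    slope_of_inverse = 1.0 / v_prime(eps)  # inverse function theorem
    return x - eps / slope_of_inverse

x0 = v(1.0)                       # plays the role of the plug-in value T(P_1)
x1 = newton_step(x0)              # one Newton-Raphson update
one_step = v(1.0) - v_prime(1.0)  # the "1-step" formula from the text

# Iterating further is only possible here because this toy v is fully known.
x = x0
for _ in range(6):
    x = newton_step(x)
print(x1, one_step, x, v(0.0))
```

Further iterations converge to v(0) in this toy example, but they require evaluating v away from ε = 1, which is exactly what is unavailable in the estimation problem; this is why only a single step is taken.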

The “1-step” approach can also be motivated from the Taylor expansion of the function $v$:

$$T(P_0) = v(0) = v(1) + v'(1)(0-1) - R_2 = T(P_1) + \frac{\partial}{\partial\epsilon}T(P_\epsilon)\Big|_{\epsilon=1}(0-1) - R_2, \tag{3.2}$$

where $R_2 = -(1/2)\,v''(\bar{\epsilon}) = -(1/2)\frac{\partial^2}{\partial\epsilon^2}T(P_\epsilon)\big|_{\epsilon=\bar{\epsilon}}$ for some value $\bar{\epsilon} \in [0,1]$ by Taylor’s theorem (Serfling, 1980).¹ The first two terms in Eq. (3.2) are equal to $T(\tilde{P}) - v'(1)$, reproducing the “1-step approach” described above, and the remaining $R_2$ term can typically be shown to be small. Formally studying $R_2$ via Taylor’s Theorem requires that $v'$ and $v''$ are finite, and that $v'$ is continuous, although these conditions are not necessary if the $R_2$ term can instead be studied directly (see Section 4; and Serfling, 1980). Because our 1-step approach $T(P_1) - v'(1)$ uses only the first derivative of $v(\epsilon) = T(P_\epsilon)$, we refer to it as a first order bias-correction. We refer to this derivative as a pathwise derivative along $\mathcal{P}$. We now turn to the task of estimating this derivative, which is precisely where IFs will prove useful.

We start with the case when $Z$ is a discrete random variable, as this makes estimation of $v'(1) = \frac{\partial}{\partial\epsilon}T(P_\epsilon)\big|_{\epsilon=1}$ appear relatively straightforward. Let $\{z_1, \dots, z_K\}$ be the set of values that $Z$ may take. With some abuse of notation, we can determine the derivative $\frac{\partial}{\partial\epsilon}T(P_\epsilon)\big|_{\epsilon=1}$ from the partial derivatives of $T(P_\epsilon)$ with respect to each value of the probability mass function $p_\epsilon(z_k)$, using the multivariate chain rule:

$$\frac{\partial}{\partial\epsilon}T(P_\epsilon)\Big|_{\epsilon=1} = \sum_{k=1}^{K}\frac{\partial T(P_\epsilon)}{\partial p_\epsilon(z_k)}\,\frac{\partial p_\epsilon(z_k)}{\partial\epsilon}\Big|_{\epsilon=1} \tag{3.3}$$

$$= \sum_{k=1}^{K}\frac{\partial T(P_\epsilon)}{\partial p_\epsilon(z_k)}\Big|_{\epsilon=1}\{\tilde{p}(z_k) - p(z_k)\}. \tag{3.4}$$

Eq. (3.3) states that the change in $T(P_\epsilon)$ depends on how $T(P_\epsilon)$ changes with each probability mass $p_\epsilon(z_k)$, and on how each probability mass changes with $\epsilon$. However, the above equation is an abuse of notation in the sense that marginal increases to $p_\epsilon(z_k)$ result in $p_\epsilon$ no longer being a valid probability mass function (its total mass will not equal 1), which can cause the partial derivatives $\frac{\partial T(P_\epsilon)}{\partial p_\epsilon(z_k)}$ to be ill-defined. Any marginal additional mass at $p(z_k)$ must instead be accompanied by an equal decrease in mass elsewhere in the distribution.

This shortcoming of the partial derivatives of T motivates us to replace them with the influence function for T, defined below (see Kandasamy et al., 2014, and Section 6.3.1 of Serfling, 1980).

Definition 3.1. For a given functional $T$, the influence function for $T$ is the function $\mathrm{IF}$ satisfying

$$\frac{\partial T(G + \epsilon(Q - G))}{\partial\epsilon}\Big|_{\epsilon=0} = \int \mathrm{IF}(z, G)\{q(z) - g(z)\}\,dz \tag{3.5}$$

and $\int \mathrm{IF}(z, G)\,g(z)\,dz = 0$ for any two distributions $G$ and $Q$ with densities $g$ and $q$. Above, $G + \epsilon(Q - G)$ denotes the distribution with density function $g(z) + \epsilon(q(z) - g(z))$, as defined in Eq. (3.1).

Roughly speaking, the left-hand side of Eq. (3.5) is the change in T(G) that would occur if we were to “mix” G with an infinitesimal portion of the distribution Q. This quantity is known as the Gâteaux derivative (Serfling, 1980), and can be interpreted as the sensitivity of T(G) to small changes in the underlying distribution G, in the “direction” of Q.

The IF in Eq. (3.5) has a similar interpretation to the partial derivative in Eq. (3.4). To see this, we can isolate the IF term $\mathrm{IF}(z, G)$ by setting $Q$ equal to the point mass distribution at $z$, denoted by $\delta_z$ (see Hampel, 1974; van der Vaart, 2000). Here, Eq. (3.5) reduces to

$$\frac{\partial T(G + \epsilon(\delta_z - G))}{\partial\epsilon}\Big|_{\epsilon=0} = \mathrm{IF}(z, G). \tag{3.6}$$

The left-hand side is the change in T(G) that would occur in response to infinitesimally upweighting z, analogous to the interpretation of the partial derivative in Eq. (3.4) (see also Section 6.3.1 of Serfling, 1980). With this analogy in mind, note the similarity between the right-hand sides of Eq. (3.4) and Eq. (3.5). Roughly speaking, the IF lets us apply the “multivariate chain rule” approach from Eq. (3.4), but remains well defined even when the partial derivatives in Eq. (3.4) are not.
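Definition 3.1 can be checked numerically in the discrete case. The sketch below uses the discrete analogue of the integrated squared density, T(G) = Σₖ gₖ², whose influence function is 2{gₖ − T(G)} (a standard result that is assumed here rather than derived in this section). The probability vectors are arbitrary, and the Gâteaux derivatives are approximated by finite differences.

```python
import numpy as np

def T(g):
    """Discrete analogue of the integrated squared density: sum_k g_k^2."""
    return np.sum(g ** 2)

def influence(g):
    """IF(z_k, G) for this T at every support point: 2 * (g_k - T(G))."""
    return 2.0 * (g - T(g))

g = np.array([0.5, 0.3, 0.2])  # hypothetical G
q = np.array([0.2, 0.2, 0.6])  # hypothetical Q

eps = 1e-6
# Left-hand side of Eq. (3.5): finite-difference Gateaux derivative toward Q.
lhs = (T(g + eps * (q - g)) - T(g)) / eps
# Right-hand side of Eq. (3.5): sum_k IF(z_k, G) * {q_k - g_k}.
rhs = np.sum(influence(g) * (q - g))

# Eq. (3.6): mixing G with a point mass at the first support point.
point_mass = np.array([1.0, 0.0, 0.0])
lhs_point = (T(g + eps * (point_mass - g)) - T(g)) / eps

print(lhs, rhs, lhs_point, influence(g)[0])
print(np.sum(influence(g) * g))  # mean-zero property of the IF under G
```

Note that every perturbed vector g + ε(q − g) still sums to one, which is what keeps these derivatives well defined when the raw partial derivatives are not.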

A common alternative (though in many cases equivalent) “score-based” definition of the IF is presented in the Appendix (see Bickel et al., 1993; Tsiatis, 2006). This definition allows the IF to directly describe derivatives along more general pathways of distributions, extending beyond pathways of the form $G + \epsilon(Q - G)$. Such pathways become of particular interest in cases where prior knowledge restricts the space of distributions that we consider possible, and where this restricted space is not closed under mixture of distributions (see discussion in Section 5.1).

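Returning to our example of the pathway $\mathcal{P}$, we can now use the IF to derive an empirical estimate of $\frac{\partial}{\partial\epsilon}T(P_\epsilon)\big|_{\epsilon=1}$ (e.g., the dashed line in Figure 1). Because moving from $\tilde{P}$ toward $P$ corresponds to decreasing $\epsilon$, applying Eq. (3.5) with $G = \tilde{P}$ and $Q = P$ gives the negative of this slope. We have²

$$-\frac{\partial}{\partial\epsilon}T(P_\epsilon)\Big|_{\epsilon=1} = \int \mathrm{IF}(z, \tilde{P})\{p(z) - \tilde{p}(z)\}\,dz \tag{3.7}$$

$$= \int \mathrm{IF}(z, \tilde{P})\,p(z)\,dz \quad \Big(\text{from } \int \mathrm{IF}(z, \tilde{P})\,\tilde{p}(z)\,dz = 0\Big) \;\approx\; \frac{1}{n}\sum_{i=1}^{n}\mathrm{IF}(z_i, \tilde{P}). \tag{3.8}$$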

In this way, IFs can provide estimates (Eq. (3.8)) of distributional derivatives (Eq. (3.7), which corresponds to the dashed line in Figure 1). Studying these estimates is fairly straightforward if $\tilde{P}$ can be treated as fixed, for instance, if $\tilde{P}$ is estimated a priori or using sample splitting. In such cases, we can treat Eq. (3.8) as a simple sample average. Alternatively, if we allow the current dataset $\{z_1, \dots, z_n\}$ to inform the selection of $\tilde{P}$ as well as the calculation of the summation in Eq. (3.8), then formal study of the estimator in Eq. (3.8) is still possible as long as $\tilde{P}$ is selected from a sufficiently regularized class (e.g., a Donsker class). In this case, the bias and variance of $\sum_{i=1}^{n}\mathrm{IF}(z_i, \tilde{P})$ can be studied using empirical process theory (van der Vaart, 2000). Hereafter, we assume the simpler case where $\tilde{P}$ is estimated a priori, and can be treated as fixed.

Combining the results from Eqs. (3.2) and (3.8), we can approximate $T(P)$ using our dataset, as

$$T(P) \approx T(\tilde{P}) + \frac{1}{n}\sum_{i=1}^{n}\mathrm{IF}(z_i, \tilde{P}) - R_2, \tag{3.9}$$

where the approximation symbol captures the fact that we are using a sample average. This motivates the “1-step” estimator

$$\hat{T}_{\text{1-step}} := T(\tilde{P}) + \frac{1}{n}\sum_{i=1}^{n}\mathrm{IF}(z_i, \tilde{P}).$$

Conditions under which the R2 term converges to zero are discussed in the next section.
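The 1-step estimator defined above can be sketched in a few lines for the integrated squared density. This is a minimal illustration under several assumptions that are not from the paper: simulated N(0, 1) data, sample splitting into two halves, a hand-written Gaussian kernel density estimate with a deliberately oversmoothed bandwidth as the initial estimate, and the influence function 2{p̃(z) − T(P̃)} for this functional.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=4000)        # hypothetical data; true P is N(0, 1)
train, held_out = sample[:2000], sample[2000:]  # sample splitting

def kde(points, data, bandwidth):
    """Gaussian kernel density estimate built from `data`, evaluated at `points`."""
    u = (points[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

bandwidth = 0.5  # deliberately oversmoothed so that the plug-in bias is visible
grid = np.linspace(-8.0, 8.0, 1601)
dz = grid[1] - grid[0]

# Plug-in estimate: integrate the squared density estimate over a grid.
plug_in = np.sum(kde(grid, train, bandwidth) ** 2) * dz

# Influence function values at the held-out points, then the 1-step correction.
if_values = 2.0 * (kde(held_out, train, bandwidth) - plug_in)
one_step = plug_in + if_values.mean()

truth = 1.0 / (2.0 * np.sqrt(np.pi))  # closed form of the target for N(0, 1)
print(plug_in - truth, one_step - truth)
```

Fitting the density on one half and averaging the influence function over the other half is one simple way to treat the initial estimate as fixed, in line with the simplifying assumption made in the text.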

We can see from Figure 1 that when $R_2$ is in fact negligible, the only challenge remaining is to estimate the slope $\frac{\partial}{\partial\epsilon}T(P_\epsilon)\big|_{\epsilon=1}$, which can be done in an unbiased and efficient way via Eq. (3.8). It should not be surprising then that the estimator $\hat{T}_{\text{1-step}}$, which takes precisely this approach, has optimal mean-squared error (MSE) properties when $R_2$ is small. More specifically, given no parametric assumptions on $P$, it can be shown that no estimator of $T(P)$ can have an MSE uniformly lower than $n^{-1}\mathrm{Var}(\mathrm{IF}(z, P))$. We refer to van der Vaart (2000, 2002) for more details on this minimax lower bound result. In practice, the variance bound $n^{-1}\mathrm{Var}(\mathrm{IF}(z, P))$ can be approximated by $n^{-1}\mathrm{Var}(\mathrm{IF}(z, \tilde{P})) = \mathrm{Var}(\hat{T}_{\text{1-step}})$. Thus, when $R_2$ is negligible and $\mathrm{Var}(\mathrm{IF}(z, \tilde{P}))$ approximates $\mathrm{Var}(\mathrm{IF}(z, P))$ well, estimating the slope through $\tilde{P}$ yields an approximately unbiased and efficient estimator.

Fig. 1 Linear approximation of P - Given $P$ and $\tilde{P}$, Panel A shows a subset of the distributions in $\mathcal{P}$ as we vary $\epsilon \in [0,1]$ (see Eq. (3.1)). When $\epsilon = 0$ we have $p_\epsilon = p$, and when $\epsilon = 1$ we have $p_\epsilon = \tilde{p}$. In Panel B, the solid line shows the target functional value (y-axis) as we vary $\epsilon$ (x-axis). The dotted line shows the slope of $T(P_\epsilon)$ with respect to $\epsilon$ at $\epsilon = 1$. This slope is calculated using the IF (see Eq. (3.7), and the Appendix). Because $\|P - P_\epsilon\|_2 = \epsilon\,\|P - \tilde{P}\|_2$ (see Section 4, and the Appendix), the x-axis can equivalently be expressed either in terms of $\|P - P_\epsilon\|_2$ or in terms of $\epsilon$. Reflecting this, we show the distributional distance $\|P - P_\epsilon\|_2$ on a secondary horizontal axis at the top of the figure.

4 Visualizing the residual R2, and the sensitivity to the choice of initial estimator P˜

Formal study of the $R_2$ term is often done on a case-by-case basis by algebraically simplifying the residual $E_P\{\hat{T}_{\text{1-step}} - T(P)\}$, and so Taylor’s Theorem is often not needed to describe the $R_2$ term (Eq. (3.2)). In many cases, the $R_2$ term reveals itself to be a quadratic combination of one or more error terms. For example, for the integrated squared density functional $T(P) = \int p(z)^2\,dz$, the $R_2$ term can be shown to be exactly equal to the negative of $\int \{p(z) - \tilde{p}(z)\}^2\,dz$ (see the Appendix). When the error term $p(z) - \tilde{p}(z)$ converges (uniformly) to zero, the 2nd degree exponent implies that $R_2$ converges to zero even more quickly.
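For readers who want the intermediate algebra, the identity for the integrated squared density can be sketched as follows. The derivation assumes the influence function $\mathrm{IF}(z, \tilde{P}) = 2\{\tilde{p}(z) - T(\tilde{P})\}$ for this functional (a standard result that is not derived in this section) and treats $\tilde{P}$ as fixed, so that the expectation is taken only over the sample average.

```latex
\begin{aligned}
E_P(\hat{T}_{\text{1-step}}) - T(P)
  &= T(\tilde{P}) + \int 2\{\tilde{p}(z) - T(\tilde{P})\}\,p(z)\,dz - T(P) \\
  &= \int \tilde{p}(z)^2\,dz + 2\int \tilde{p}(z)\,p(z)\,dz
     - 2\int \tilde{p}(z)^2\,dz - \int p(z)^2\,dz \\
  &= -\int \{p(z) - \tilde{p}(z)\}^2\,dz .
\end{aligned}
```

The second line uses $\int p(z)\,dz = 1$ and $T(\tilde{P}) = \int \tilde{p}(z)^2\,dz$; the last line completes the square.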

A similar result can be shown for the general case of smooth functionals T. Here, R2 will turn out to depend on two pieces of information that make the problem difficult: the underlying distributional distance between P˜ and P, which is typically assumed to converge to zero as sample size grows, and the “smoothness” of T (defined below). In the remainder of this section we visually illustrate this result (Figure 2), and review this result formally.

Fig. 2 Linear approximations overlaid for several paths - Panel A overlays the same illustration as Figure 1-A, but for several alternative initial distribution estimates $\tilde{P}^{(1)}, \dots, \tilde{P}^{(K)}$. For each distribution $\tilde{P}^{(k)}$, a path $\mathcal{P}^{(k)}$ connecting $P$ to $\tilde{P}^{(k)}$ can be defined in the same way as $\mathcal{P}$. Panel B shows the values of the target parameter at each point $P_\epsilon^{(k)}$ along each path $\mathcal{P}^{(k)}$, as well as a linear approximation of each path. For each value of $k \in \{1, \dots, K\}$, we show the distribution $\tilde{P}^{(k)}$ (Panel A) and pathway $\mathcal{P}^{(k)}$ (Panel B) in the same color. On the x-axis in Panel B, we plot each distribution’s distance from $P$, in order to show several paths simultaneously. The y-intercept of each linear approximation corresponds to a different 1-step estimator, and the accuracy of this estimator will depend on the distance $\|P - \tilde{P}^{(k)}\|_2$.

Figure 2 shows how Figure 1 would change if we had selected an initial distribution estimate different from $\tilde{P}$. Figure 2-A shows several alternative distribution estimates, denoted by $\tilde{P}^{(k)}$ for $k = 1, \dots, K$. For each initial estimate $\tilde{P}^{(k)}$, we define the path $\mathcal{P}^{(k)}$ as the set of distributions $P_\epsilon^{(k)} = (1-\epsilon)P + \epsilon\tilde{P}^{(k)}$ for $\epsilon \in [0,1]$, analogous to $\mathcal{P}$. Figure 2-B shows each of these $K$ paths, as well as the 1-step estimators corresponding to each path. We can see that the 1-step estimators are generally more effective when $\tilde{P}^{(k)}$ is “closer” to $P$ (defined formally below). We can also see that, as in Figure 1, the performance of 1-step estimators depends on the smoothness of $T(P_\epsilon^{(k)})$ with respect to $\epsilon$.

Quite informally, we can think of Figure 2-B as a “Magician’s Tablecloth Pull-Plot.” To see this analogy, try to imagine the functional T as a hyper-surface over the space of possible distributions. (In the Appendix, we illustrate a special case where this hyper-surface reduces to a standard 3-dimensional surface.) Then, imagine a magician pinching this surface at the point P, and pulling the surface to one side as one might dramatically pull a tablecloth from a table, with the unpinched fabric folding in on itself as it billows in the air. As we watch this pulling action (e.g., from a neighboring table), all of the dimensionality of the hyper-surface folds into 1 dimension: how far each point on the surface (or “fabric”) is from the distribution P (the point the magician is pulling from). In Figure 2-B, we can imagine the intersection point on the left-hand side as the point from which the magician is pulling the tablecloth.

To formalize the notion of how “far” two distributions $G$ and $Q$ are, we use the $L_2$ distance $\|G - Q\|_2 := \sqrt{\int [g(z) - q(z)]^2\,dz}$, where $g$ and $q$ are the densities of $G$ and $Q$ respectively.

This distance measure is useful in part because it lets us visually overlay several paths with a common, meaningful x-axis (Figure 2), and in part because it helps us formally compare the “smoothness” of $T$ along paths that stretch over different distances. Recall that the path $\{P_\epsilon\}_{\epsilon \in [0,1]}$ connects the two distributions $\tilde{P}$ and $P$, which are a distance of $\|\tilde{P} - P\|_2$ from each other. One approach for describing the smoothness of $T$ is to consider how quickly $T(P_\epsilon)$ changes in response to changes in $\epsilon$, but this notion of smoothness is highly sensitive to our choice of $\tilde{P}$, the starting point of our pathway. For example, if we were to move $\tilde{P}$ closer to $P$, then $T$ would appear to be smoother. In order to describe the smoothness of $T$ in a way that is not sensitive to the choice of $\tilde{P}$, we consider the following reindexing of $P_\epsilon$. Let

$$P_\Delta^{\text{rescaled}} := P + \left(\frac{\Delta}{\|\tilde{P} - P\|_2}\right)(\tilde{P} - P), \quad \text{for } \Delta \in [0, \|\tilde{P} - P\|_2]. \tag{4.1}$$

This definition produces the same pathway as in Eq. (3.1), as $P_\epsilon = P_\Delta^{\text{rescaled}}$ when $\epsilon = \Delta / \|\tilde{P} - P\|_2$. However, it can be shown that $\Delta$ tells us the absolute distance $\Delta = \|P_\Delta^{\text{rescaled}} - P\|_2$, whereas $\epsilon$ tells us the relative distance $\epsilon = \|P_\epsilon - P\|_2 / \|\tilde{P} - P\|_2$ (see the Appendix). In this way, the information represented by $\Delta$ is less dependent on the choice of $\tilde{P}$.
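The relationship between the two indexings of the path can be verified numerically. The sketch below assumes two normal densities for the true distribution and the initial estimate (arbitrary illustrative choices) and approximates the L2 distance with a Riemann sum on a grid.

```python
import numpy as np

grid = np.linspace(-12.0, 12.0, 24001)
dz = grid[1] - grid[0]

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = normal_pdf(grid, 0.0, 1.0)        # assumed "true" density
p_tilde = normal_pdf(grid, 0.5, 1.2)  # assumed initial estimate

def l2_distance(g, q):
    """L2 distance between two densities tabulated on the grid."""
    return np.sqrt(np.sum((g - q) ** 2) * dz)

total = l2_distance(p_tilde, p)  # distance between the initial estimate and the truth

eps = 0.3
p_eps = (1 - eps) * p + eps * p_tilde             # path indexed by eps, Eq. (3.1)
delta = eps * total                               # matching absolute distance
p_rescaled = p + (delta / total) * (p_tilde - p)  # path indexed by delta, Eq. (4.1)

print(l2_distance(p_eps, p), delta)
```

The two parameterizations trace out the same densities; only the label attached to each point changes, from a fraction of the way along the path to a distance from the truth.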

We can now describe the smoothness of T more formally, using the following condition on its jth derivative with respect to the distance-adjusted parameter Δ.

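Condition 4.1. ($j$th order smoothness from all directions) For a given value of $j$, and for any choice of $\tilde{P}$, the function $T(P_\Delta^{\text{rescaled}})$ is $j$-times differentiable with respect to $\Delta$, and $\frac{\partial^j}{\partial\Delta^j}T(P_\Delta^{\text{rescaled}})\big|_{\Delta=\bar{\Delta}} = O(1)$ as $\bar{\Delta} \to 0$.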

For $j = 1$, Condition 4.1 bounds the degree to which $T(P)$ can change in response to any small change to $P$. In Figure 2, this means that curves cannot deviate too far from flat lines as they approach the leftmost region. For $j = 2$, Condition 4.1 bounds the degree to which $T(P)$ can change nonlinearly in response to any small change in $P$. That is, curves cannot get “too squiggly” as they approach the leftmost region of Figure 2. Note that, for notational convenience, we have suppressed the dependence of $P_\Delta^{\text{rescaled}}$ on $\tilde{P}$ in Eq. (4.1) & Condition 4.1.

The connection between Condition 4.1 and estimator performance can be formalized as follows.

Remark 1. (Asymptotic bias of plug-in and 1-step estimators) If $\tilde{P}$ is fixed in advance (for example, from sample splitting), and if Condition 4.1 holds for $j = 2$, then the bias for $\hat{T}_{\text{1-step}}$ is equal to

$$R_2 = E_P(\hat{T}_{\text{1-step}}) - T(P) = O(\|P - \tilde{P}\|_2^2). \tag{4.2}$$

Similarly, if $\tilde{P}$ is fixed and Condition 4.1 holds for $j = 1$, then the error of the plug-in estimate is equal to

$$T(\tilde{P}) - T(P) = O(\|P - \tilde{P}\|_2). \tag{4.3}$$

Since we treat T(P˜) as fixed, given P˜, the error of T(P˜) (Eq. (4.3)) is also equal to the bias of T(P˜).

To explain in words, as $\tilde{P}$ approaches $P$, the biases of plug-in estimators and 1-step estimators are both guaranteed to converge to zero. However, the worst-case rate of convergence for 1-step estimators is substantially faster than that of plug-in estimators ($O(\|P - \tilde{P}\|_2^2)$ relative to $O(\|P - \tilde{P}\|_2)$). The proof of Remark 1 follows from Taylor’s Theorem (see the Appendix, as well as Eq. (1) of Robins et al., 2008 for a similar discussion).
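The two rates in Remark 1 can be seen numerically for the integrated squared density. The sketch below assumes a true N(0, 1) density and a sequence of hypothetical initial estimates N(0, (1 + δ)²) with shrinking δ; it also assumes the influence function 2{p̃(z) − T(P̃)} for this functional, and computes the expected 1-step estimate by integrating against the true density rather than by sampling.

```python
import numpy as np

grid = np.linspace(-12.0, 12.0, 24001)
dz = grid[1] - grid[0]

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def T(density):
    """Integrated squared density, approximated by a Riemann sum."""
    return np.sum(density ** 2) * dz

p = normal_pdf(grid, 0.0, 1.0)  # assumed "true" density
truth = T(p)

results = []
for delta in [0.4, 0.2, 0.1, 0.05]:
    p_tilde = normal_pdf(grid, 0.0, 1.0 + delta)  # hypothetical estimate approaching P
    distance = np.sqrt(np.sum((p - p_tilde) ** 2) * dz)
    plug_in_error = T(p_tilde) - truth
    # Expected 1-step estimate: plug-in value plus the mean of the IF under the true density.
    expected_one_step = T(p_tilde) + np.sum(2.0 * (p_tilde - T(p_tilde)) * p) * dz
    one_step_bias = expected_one_step - truth
    results.append((distance, plug_in_error, one_step_bias))
    print(f"distance={distance:.4f}  plug-in error={plug_in_error:.5f}  1-step bias={one_step_bias:.6f}")
```

Each time the distance roughly halves, the plug-in error roughly halves as well, while the 1-step bias shrinks by roughly a factor of four, in line with Eqs. (4.2) and (4.3).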

Results similar to Remark 1 are often expressed by instead defining the influence function as the unique function $\mathrm{IF}$ satisfying

$$T(\tilde{P}) - T(P) = \int \mathrm{IF}(z, \tilde{P})\,d(\tilde{P}(z) - P(z)) + R_2(\tilde{P}, P), \tag{4.4}$$

and $E_P[\mathrm{IF}(z, P)] = 0$ for any two distributions $\tilde{P}, P$, where $R_2$ satisfies either $R_2(\tilde{P}, P) = O(\|P - \tilde{P}\|_2^2)$ or a similar condition. Eq. (4.4) is often referred to either as the distributional Taylor expansion of $T$, or as the von Mises expansion of $T$ (von Mises, 1947; Serfling, 1980; Robins et al., 2008, 2009; Fernholz, 1983; Carone et al., 2014; Robins et al., 2017). The expansion is analogous to the standard Taylor expansion in Eq. (3.2), but plugs in the integral term from Eq. (3.7).

To obtain a more complete view of 1-step estimators, we must consider the convergence rate of $R_2$ in combination with the convergence rate of the sample average in Eq. (3.9). Whichever of these two rates is slower will determine the asymptotic behavior of the 1-step estimator. To see why, recall from Eq. (3.9) that the error of the 1-step estimator is equal to

$$\hat{T}_{\text{1-step}} - T(P) = \left[\frac{1}{n}\sum_{i=1}^{n}\mathrm{IF}(z_i; \tilde{P}) - E_P\,\mathrm{IF}(Z; \tilde{P})\right] + R_2, \tag{4.5}$$

where the bracketed term is a centered sample average that is asymptotically normal after $\sqrt{n}$ scaling. Here we have implicitly assumed that sample splitting has been used to estimate $\tilde{P}$; if not, then the bracketed term can be rearranged and studied using empirical process theory.³ The $R_2$ term is the second-order remainder described in Eqs. (3.2) and (4.2), which depends on the smoothness of $T$ and the accuracy of $\tilde{P}$. Finite-sample bounds (e.g., using concentration inequalities on the bracketed term, and functional-specific bounds on $R_2$) could be used to construct confidence intervals valid for any $n$. However, this would require precise knowledge of the error in $\tilde{P}$ as well as bounds on, or the variance of, the IF, and such intervals may be quite wide in realistic examples. The most common approach in practice is therefore to assume that the $R_2$ term (and any empirical process terms) are negligible, and that the bracketed term in Eq. (4.5) can be well-approximated by a normal distribution with appropriate variance. If $R_2 = o_P(1/\sqrt{n})$ then this will often be a reasonable approximation at least with large sample sizes, where the specific meaning of “large” could be assessed via simulations. However, if $R_2 = O_P(1/n^{\alpha})$ for some $\alpha < 1/2$, then such an approximation will not even be asymptotically valid: the first-order correction is not enough in this case, and instead either sensitivity analyses or higher-order corrections are required (see Section 5.2, and Robins et al., 2008, 2009; Carone et al., 2014; Robins et al., 2017).
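The normal approximation described above leads to a simple Wald-type interval. The sketch below is an illustration only: it assumes simulated N(0, 1) data, a hypothetical initial estimate N(0, 1.2²) fixed in advance, the influence function 2{p̃(z) − T(P̃)} for the integrated squared density, and the closed form 1/(2σ√π) for the integrated squared density of a normal distribution with standard deviation σ. The interval ignores the R2 term entirely, so it is only trustworthy when that remainder is small relative to the standard error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
z = rng.normal(0.0, 1.0, size=n)  # hypothetical sample; true P is N(0, 1)

# Hypothetical initial estimate N(0, 1.2^2), treated as fixed a priori.
sigma_tilde = 1.2

def p_tilde(points):
    return np.exp(-0.5 * (points / sigma_tilde) ** 2) / (sigma_tilde * np.sqrt(2 * np.pi))

# Closed form of the integrated squared density for a normal density.
plug_in = 1.0 / (2.0 * sigma_tilde * np.sqrt(np.pi))

if_values = 2.0 * (p_tilde(z) - plug_in)
one_step = plug_in + if_values.mean()

# Standard error of the sample average of the influence function.
standard_error = if_values.std(ddof=1) / np.sqrt(n)
half_width = 1.96 * standard_error  # 1.96 approximates the 97.5% normal quantile
interval = (one_step - half_width, one_step + half_width)
print(one_step, interval)
```

Whether an interval like this covers the true value in repeated samples depends on the size of the ignored remainder, which could be assessed by simulation as the text suggests.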

In summary, the performance of 1-step estimators depends on the sample size (via the sample average in Eq. (3.8)), the smoothness of the functional of interest ($T$), and the quality of the initial distribution estimate ($\tilde{P}$). Graphically, we can visualize the smoothness of $T$ by the bumpiness of the paths shown in Figure 2-B. We show the quality of the initial distribution estimate ($\tilde{P}$) by the x-axis in Figure 2-B.⁴ Reasonably accurate initial estimates $\tilde{P}$ land us in the leftmost region of Figure 2-B, where bias corrections are especially effective. Inaccurate initial estimates (e.g., those with slow convergence rates due to high dimensionality) land us in the rightmost area of Figure 2-B, where linear corrections based on IFs are least effective.

5 Discussion

In this section we briefly review extensions and other uses of IFs. For deeper treatments of IFs and related topics, interested readers can see (Serfling, 1980; Pfanzagl, 1982; Bickel et al., 1993; van der Vaart, 2000; van der Laan and Robins, 2003; Tsiatis, 2006; Huber and Ronchetti, 2009; Kennedy, 2016; Maronna et al., 2019).

5.1 Semiparametric models

Thus far, we have considered so-called nonparametric models, in which no a priori knowledge or restrictions are assumed about the distribution $P$. In certain cases though, we may already know certain parameters of the probability distribution. For example, we may know the process by which patients are assigned to different treatments in a particular cohort, but may not know the distribution of health outcomes under each treatment. This more general framework is known as a semiparametric model, with the nonparametric model forming a special case of no a priori knowledge.

When some parameters of $P$ are known, the distributions along the path $\mathcal{P}$ may not all satisfy the restrictions enforced by that knowledge. We can encode these restrictions in the form of a likelihood assumption, and focus our attention only on pathways of distributions concordant with this likelihood. Because we only need to consider derivatives along allowed pathways, the function $\mathrm{IF}$ no longer needs to be valid for all distributions $G$ and $Q$ (see Definition 3.1), and can instead be defined in terms of the score function for the likelihood (see the Appendix). This relaxed criterion for the influence function will now be met not just by a single function $\mathrm{IF}$, but by a set ($S$) of functions. Of these, if we can identify the “efficient influence function” $\mathrm{IF}^{\star}$ equal to $\arg\min_{\widetilde{\mathrm{IF}} \in S} \mathrm{Var}(\widetilde{\mathrm{IF}}(Z, P))$, then we can more efficiently estimate the derivatives along allowed pathways. We can also show that no unbiased estimator may have a variance lower than $n^{-1}\mathrm{Var}(\mathrm{IF}^{\star}(Z, P))$, which is equal to or lower than the nonparametric bound described above ($n^{-1}\mathrm{Var}(\mathrm{IF}(z, P))$). Determining $\mathrm{IF}^{\star}$ requires a projection operation that is usually the focus of figures illustrating the theory of influence functions (see Sections 2.3 & 3.4 of Tsiatis, 2006), but this operation is beyond the scope of this paper.

5.2 Higher order influence functions

The approach of Section 3 amounts to approximating $T(P_\epsilon)$ as a linear function of $\epsilon$, but several alternative approximations of $T(P_\epsilon)$ exist as well. For example, the standard “plug-in” estimator $T(\tilde{P})$ can be thought of as approximating $T(P_\epsilon)$ as a constant function of $\epsilon$, and extrapolating this approximation to estimate $T(P_0)$. Given that the linear approximation often gives improved estimates over the constant approximation, we might expect that a more sophisticated approximation of $T(P_\epsilon)$ would improve accuracy even further. Indeed, for the special case of the squared density functional $T(P)=\int p(z)^2\,dz$ shown in Figures 1 & 2, a second degree polynomial approximation of $T(P_\epsilon)$ fully recovers the original function with no approximation error. In general, deriving higher order polynomial approximations requires that we are able to calculate higher order derivatives of $T(P_\epsilon)$, which forms part of the motivation for recent work on higher order influence functions.
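As a small numerical sketch of this point (not part of the original analysis), consider a discrete analogue of the squared density functional, T(P) = Σ_i P(Z = z_i)^2. The probability vectors `p` and `p_tilde` below are hypothetical values chosen only for illustration, and the derivatives are computed directly along the path rather than through a higher order influence function. Under these assumptions, the constant, linear, and quadratic extrapolations from ϵ = 1 to ϵ = 0 can be compared as follows.

```python
# Hypothetical 3-point distributions, chosen only for illustration.
p = [0.6, 0.3, 0.1]        # target distribution P (equal to P_0)
p_tilde = [0.5, 0.3, 0.2]  # initial estimate P-tilde (equal to P_1)


def T(probs):
    """Discrete analogue of the squared density functional."""
    return sum(x * x for x in probs)


# Direction of the path P_eps = P + eps * (P_tilde - P).
diff = [b - a for a, b in zip(p, p_tilde)]

# Exact derivatives of T(P_eps) with respect to eps, evaluated at eps = 1.
first_deriv = sum(2 * b * d for b, d in zip(p_tilde, diff))
second_deriv = sum(2 * d * d for d in diff)

# Extrapolate from eps = 1 to eps = 0 with polynomials of degree 0, 1, 2.
step = 0 - 1
constant = T(p_tilde)
linear = constant + first_deriv * step
quadratic = linear + 0.5 * second_deriv * step ** 2

for name, value in [("constant", constant), ("linear", linear),
                    ("quadratic", quadratic)]:
    print(name, value, "error:", value - T(p))
```

Because T(P_ϵ) is exactly quadratic in ϵ for this functional, the degree-2 extrapolation matches T(P) up to floating point rounding, while the other two do not.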

Interestingly, it turns out that using higher-order influence functions is not as straightforward as the first-order case, simply because higher-order influence functions do not exist for most functionals of interest (e.g., the integrated density squared, average treatment effect, etc.). In other words, although there is often a function IF satisfying

$$T(\tilde{P}) - T(P) = \int IF(z,\tilde{P})\, d(\tilde{P}(z) - P(z)) + R_2(\tilde{P},P),$$

for an appropriate second-order term $R_2(\tilde{P},P)$ (though not always; see for example Kennedy et al. (2017)), there is typically no function $IF_2$ satisfying

$$T(\tilde{P}) - T(P) = \int IF(z,\tilde{P})\, d(\tilde{P}(z) - P(z)) + \frac{1}{2}\int\!\!\int IF_2(z^{(1)},z^{(2)},\tilde{P}) \prod_{j=1}^{2} d(\tilde{P}(z^{(j)}) - P(z^{(j)})) + R_3(\tilde{P},P),$$

for an appropriate third-order term R3(P˜,P). This has led to groundbreaking work by, for example, Robins et al. (2008, 2009); Carone et al. (2014); Robins et al. (2017), aimed at finding approximate higher-order influence functions that can be used for extra bias correction beyond linear/first-order corrections discussed here. There are many open problems in this domain.

5.3 Robust statistics, and influence functions for estimators

IFs were first proposed to describe the stability of different estimators in cases where outliers are present, or where a portion of the sample deviates from parametric assumptions (Hampel, 1974); see also (Hampel et al., 1986; Huber and Ronchetti, 2009; Maronna et al., 2019). To see how IFs achieve these goals, consider the plug-in estimate that takes the empirical distribution of the data P̂ as input. If we substitute G with P̂ in Definition 3.1, the resulting Eq. (3.5) tells us how our plug-in estimate T(P̂) would change in response to a portion of the sample (P̂) being replaced with data from a noise distribution Q. Making the same substitution in Eq. (3.6), we see that the IF for T also describes how the estimate T(P̂) would change in response to an upweighting of any outlying sample point z. Thus, in order to produce plug-in estimates that are robust to noise contamination and outliers, a common approach is to derive functionals with bounded IFs.
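The contrast between bounded and unbounded IFs can be sketched numerically. The example below is an illustration under stated assumptions rather than the construction used in the references above: it uses a finite-sample sensitivity curve (the rescaled change in an estimate when one extra point z is added) as a stand-in for the derivative in Eq. (3.6), and the `sample` values and the helper `sensitivity` are hypothetical.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.3, 3.6, 3.8]


def sensitivity(estimator, data, z):
    """Rescaled change in the estimate when one extra point z is added.

    Adding z gives it weight eps = 1 / (n + 1) in the empirical
    distribution, so dividing the change by eps mimics the derivative
    with respect to upweighting z.
    """
    eps = 1.0 / (len(data) + 1)
    return (estimator(data + [z]) - estimator(data)) / eps


for z in [10.0, 100.0, 1000.0]:
    print(z,
          sensitivity(statistics.mean, sample, z),
          sensitivity(statistics.median, sample, z))
```

For the sample mean the rescaled change equals z minus the sample mean, which grows without bound as z moves away from the data. For the sample median it stops changing once z lies beyond the bulk of the sample, which is the behavior that a bounded IF describes.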

Several extensions and related uses of IFs exist for studying estimators in the form of functionals of the sample distribution. La Vecchia et al. (2012) extend IFs to describe higher order approximations of an estimator’s sensitivity to sample perturbations, analogous to the approximations discussed in Section 5.2. The authors also present a visual illustration of how IFs, and higher order IFs, can approximately capture robustness (see their Figure 1, which is similar to our Figure 1). Because the L2 norm used in Section 4 is relatively unaffected by the presence of outliers, an alternative choice of norm can be useful when studying robustness (see Hampel, 1971; Chapter 2 of Huber, 1981; and pages 4-5 of Clarke, 2000). IFs can also capture the asymptotic stability of an estimator (see Chapter 5 of van der Vaart, 2000).

IFs for estimators have also recently gained traction in the machine learning literature. Xu et al. (2018) and Belagiannis et al. (2015) use bounded loss functions when fitting a neural network, in order to reduce the influence of outliers and to improve generalization error. Christmann and Steinwart (2004) derive conditions under which the IF for a classifier is bounded. Koh and Liang (2017) compare the influence of different sample points on the predictions produced by a black box model, in order to understand what information contributed to each prediction. Efron (2014) and Wager et al. (2014) use IFs, referred to as “directional derivatives,” to study the sampling variance of bagged estimators. Similarly, Giordano et al. (2019) propose using linear approximations of how a model will change in response to a change in the training weights, as a computationally tractable alternative to bootstrapping or cross-validation.

Conclusion

For many quantitative methods, visualizations have proved to be valuable tools for communicating results and establishing intuition (e.g., for gradient descent, Lagrange multipliers, and graphical models). In this paper we provide similar tools for illustrating IFs, based on a connection to linear approximations and Newton-Raphson. Our overall goal is to make these methods more intuitive and accessible.

The growing field of IF research shows great promise for estimating targeted quantities with higher precision, and delivering stronger scientific conclusions. Progress has been made in diverse functional estimation problems, ranging from density estimation to regression to causal inference. The approach also naturally encourages interdisciplinary collaboration, as the selection of the target parameter (T) benefits from deep subject area knowledge, and the initial distribution estimate (P˜) is often attained using powerful, flexible machine learning methods. There are many opportunities for new researchers to tackle theoretical, applied, computational, and conceptual challenges, and to push this exciting field even further.

A Illustrations for the discrete case

Figures 3 and 4 show alternate versions of Figures 1 and 2 for the special case where Z can take only 3 discrete values: z1, z2, and z3. In this case, any probability distribution for Z can be fully described by the probability it assigns to z1 and z2. This simplicity allows us to depict the full space of possible distributions, and the value of T for each distribution, in a 2-dimensional figure.

Fig. 3 Linear approximation of P (discrete case) - Here we show a special case where Z can take only 3 discrete values: $z_1$, $z_2$, and $z_3$. Panel A shows the space of all possible distributions for Z, indexed (along the x and y axes) by the probability assigned to $z_1$ and $z_2$. For each possible distribution P, the value of $T(P)=\sum_{i=1}^{3}P(Z=z_i)^2$ is shown via shading. The upper-right triangle of the figure is left blank, as this region corresponds to invalid distributions with total mass greater than 1. Within the space of valid distributions, we show the path $\mathcal{P}$ as a straight line. As $\epsilon$ moves from 1 to 0, we move from $\tilde{P}$ to P (see Eq. (3.1)). Panel B follows the same format as Figure 1-B. The solid line shows the target functional value $T(P_\epsilon)$ (y-axis) as we vary $\epsilon$ (x-axis). The dotted line shows the slope of $T(P_\epsilon)$ with respect to $\epsilon$ at $\epsilon = 1$. As in Figure 1-B, we show the distributional distance on a secondary horizontal axis at the top of the figure. In this case though, distributional distance $\|P-P_\epsilon\|_2=\sqrt{\sum_{i=1}^{3}\{P(Z=z_i)-P_\epsilon(Z=z_i)\}^2}$ can also be visually approximated by Euclidean distance in Panel A (ignoring the third summation term $\{P(Z=z_3)-P_\epsilon(Z=z_3)\}^2$).

Fig. 4 Linear approximations overlaid for several paths (discrete case) - Above, we overlay the same illustrations as in Figure 3, but for several alternative initial distribution estimates $\tilde{P}^{(1)},\dots,\tilde{P}^{(K)}$. The result is analogous to Figure 2, for the special case where Z is discrete. Here, Panel A shows several paths through the space of distributions, each defined in the same way as in Eq. (3.1), but starting from a different initial estimate $\tilde{P}^{(k)}$. Panel B shows the values of the target parameter at each point $\tilde{P}^{(k)}_\epsilon$ along each path $\mathcal{P}^{(k)}$, as well as a linear approximation of each path. The x-axis shows distributional distance from P, the dotted lines show linear approximations, and the y-intercepts of each dotted line correspond to different 1-step estimators. Again, we see that the accuracy of each estimator will depend on the distance $\|P-\tilde{P}^{(k)}\|_2$. This distance can also be approximated from Euclidean distance in Panel A.

Note that Figures 3-B and 4-B are essentially unchanged from Figures 1-B and 2-B. This is because it is always possible to visualize 1-dimensional paths through the space of possible distributions, regardless of the dimensionality of that space. In other words, we can visualize paths through the space of distributions regardless of whether we can visualize the space itself (as in Figures 3 & 4).
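The Panel A coordinates described in the captions above can be sketched in a few lines. This is a minimal illustration, not the code used to produce the figures: the helper `T_from_panel` and the endpoint values `p` and `p_tilde` are hypothetical. It evaluates T at a point (P(Z = z1), P(Z = z2)), flags the blank upper-right region, and compares the 2-dimensional Euclidean distance in Panel A with the full 3-term distributional distance along one path.

```python
import math


def T_from_panel(p1, p2):
    """Value of T(P) = sum_i P(Z = z_i)^2 at the Panel A coordinates (p1, p2).

    Returns None where the implied mass at z3 would be negative,
    which includes the blank upper-right region where p1 + p2 > 1.
    """
    p3 = 1.0 - p1 - p2
    if p1 < 0 or p2 < 0 or p3 < 0:
        return None
    return p1 ** 2 + p2 ** 2 + p3 ** 2


# Hypothetical endpoints of one path, chosen only for illustration.
p = (0.6, 0.3)        # (P(Z = z1), P(Z = z2)) under P
p_tilde = (0.5, 0.3)  # the same coordinates under the initial estimate

for eps in [1.0, 0.5, 0.0]:
    # Point on the straight-line path P_eps = P + eps * (P_tilde - P).
    p1 = p[0] + eps * (p_tilde[0] - p[0])
    p2 = p[1] + eps * (p_tilde[1] - p[1])
    d3 = (1 - p1 - p2) - (1 - p[0] - p[1])  # difference in mass at z3
    panel_dist = math.hypot(p1 - p[0], p2 - p[1])  # Euclidean, Panel A only
    full_dist = math.sqrt((p1 - p[0]) ** 2 + (p2 - p[1]) ** 2 + d3 ** 2)
    print(eps, T_from_panel(p1, p2), panel_dist, full_dist)
```

The two distances differ only by the omitted third term, so the Panel A distance never exceeds the full distance.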

B Score-based definition of the IF

An alternative definition of the IF describes derivatives along paths not necessarily of the form $G+\epsilon(Q-G)$. This can be especially beneficial when prior knowledge restricts the space of distributions that we consider possible, and when this allowed distribution space is not closed under convex combinations of the form $G+\epsilon(Q-G)$ (see Section 5.1). We can define a more general pathway as simply the set of distributions consistent with a certain likelihood model $\mathcal{L}(z;e)$, with scalar parameter $e \in [0,1]$. Let $w_e(z)$ be the density associated with the likelihood function $\mathcal{L}(z;e)$, and let $W_e$ be the associated distribution function. With this notation, we can now give an alternate definition for the IF (see Bickel et al., 1993; Tsiatis, 2006).

Definition B.1. (“score-based” IF) The influence function for T is the function IF satisfying

$$\frac{\partial T(W_e)}{\partial e}\Big|_{e=0} = E_{W_0}\left[IF(Z,W_0)\,s_0(Z)\right], \tag{B.1}$$

and $E_{W_0}IF(Z,W_0)=0$ for any likelihood $W_e$, where $s_e$ is the score function $s_e(z)=\frac{\partial}{\partial e}\log w_e(z)$, with $w_e$ being the density of $W_e$.

It is fairly straightforward to show that Definition B.1 implies Definition 3.1. That is, if a function satisfies Definition B.1, it must also satisfy Definition 3.1 (in the case of no prior restrictions on the space of allowed distributions). To see this, note that for any two distributions G and Q we can define a likelihood $W_e := G+e(Q-G)$ with score function

$$s_0(z)=\frac{\partial}{\partial e}\log\left[g(z)+e\{q(z)-g(z)\}\right]\Big|_{e=0}=\frac{q(z)-g(z)}{g(z)}.$$

Definition B.1 now implies that

$$\frac{\partial T(W_e)}{\partial e}\Big|_{e=0}=\int IF(z,W_0)\,s_0(z)\,w_0(z)\,dz=\int IF(z,G)\left\{\frac{q(z)-g(z)}{g(z)}\right\}g(z)\,dz=\int IF(z,G)\{q(z)-g(z)\}\,dz,$$

which shows that IF satisfies Definition 3.1.
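The identity in Eq. (B.1) can be checked numerically in a simple setting. The sketch below assumes a discrete analogue of the squared density functional, T(G) = Σ_i g_i^2, with influence function 2(g_i − T(G)) (the counting-measure version of the formula derived in Appendix C), and uses the path W_e = G + e(Q − G). The probability vectors `g` and `q` and the step size `h` are hypothetical choices made only for illustration.

```python
# Hypothetical 3-point distributions, chosen only for illustration.
g = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]


def T(w):
    """Discrete analogue of the squared density functional."""
    return sum(x * x for x in w)


def influence(w):
    """Discrete analogue of IF(z, W) = 2 (w(z) - T(W)), one value per category."""
    t = T(w)
    return [2.0 * (x - t) for x in w]


def w_e(e):
    """Probabilities under W_e = G + e (Q - G), for e in [0, 1]."""
    return [a + e * (b - a) for a, b in zip(g, q)]


# Score at e = 0: s_0(z) = (q(z) - g(z)) / g(z).
score0 = [(b - a) / a for a, b in zip(g, q)]

# Right-hand side of Eq. (B.1): expectation under W_0 = G of IF * s_0.
rhs = sum(i * s * a for i, s, a in zip(influence(g), score0, g))

# Left-hand side: forward finite difference of T(W_e) at e = 0.
h = 1e-6
lhs = (T(w_e(h)) - T(w_e(0.0))) / h

# Mean-zero condition: E_{W_0} IF(Z, W_0) = 0.
mean_if = sum(i * a for i, a in zip(influence(g), g))

print(lhs, rhs, mean_if)
```

The finite difference agrees with the expectation up to the O(h) discretization error, and the mean-zero condition holds up to rounding.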

C Derivation of IF and R2 term for the squared integrated density functional

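Let G and Q be defined as in Definition 3.1, with densities g and q that are dominated by an integrable function ν. For $T(G)=\int g(z)^2\,dz$, the influence function is equal to $IF(z,G)=2(g(z)-T(G))$ (Bickel and Ritov, 1988; Robins et al., 2008). To see this, we first show Eq. (3.5).

$$\begin{aligned}
\frac{\partial T(G+\epsilon(Q-G))}{\partial\epsilon}\Big|_{\epsilon=0}
&= \frac{\partial}{\partial\epsilon}\int\left[g(z)+\epsilon\{q(z)-g(z)\}\right]^2 dz\,\Big|_{\epsilon=0} \\
&= \int\frac{\partial}{\partial\epsilon}\left[g(z)+\epsilon\{q(z)-g(z)\}\right]^2 dz\,\Big|_{\epsilon=0} && \text{Dominated Convergence Thm} \\
&= \int 2\left[g(z)+\epsilon\{q(z)-g(z)\}\right]\left[q(z)-g(z)\right] dz\,\Big|_{\epsilon=0} \\
&= \int 2\left[g(z)-T(G)\right]\left[q(z)-g(z)\right] dz && \text{from } \int T(G)\left[q(z)-g(z)\right]dz=0 \\
&= \int IF(z,G)\left[q(z)-g(z)\right] dz.
\end{aligned}$$

This, in combination with the fact that

$$\int IF(z,G)\,g(z)\,dz = 2\int\left\{g(z)^2-T(G)\,g(z)\right\}dz = 0,$$

establishes that $IF(z,G)=2(g(z)-T(G))$ is the influence function for $T(G)=\int g(z)^2\,dz$.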

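Given a fixed distribution estimate $\tilde{P}$, the bias ($R_2$ term) of $\hat{T}_{\text{1-step}}$ is equal to

$$\begin{aligned}
E_P(\hat{T}_{\text{1-step}})-T(P)
&= \left\{T(\tilde{P})+\int IF(z,\tilde{P})\,p(z)\,dz\right\}-T(P) \\
&= T(\tilde{P})+2\int\tilde{p}(z)\,p(z)\,dz-2T(\tilde{P})-T(P) \\
&= -T(\tilde{P})+2\int\tilde{p}(z)\,p(z)\,dz-T(P) \\
&= -\int\{\tilde{p}(z)-p(z)\}^2\,dz.
\end{aligned}$$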
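The bias formula above can be illustrated with a discrete analogue, in which the integral is replaced by a sum over three categories. This is a sketch under stated assumptions: the probability vectors `p` and `p_tilde`, the helper names, the random seed, and the sample size are all hypothetical, and the sampling step is included only to show how the 1-step estimate would be formed from data.

```python
import random

p = [0.6, 0.3, 0.1]        # hypothetical true probabilities of z1, z2, z3
p_tilde = [0.5, 0.3, 0.2]  # hypothetical fixed initial estimate


def T(probs):
    """Discrete analogue of the squared density functional."""
    return sum(x * x for x in probs)


def influence(i, probs):
    """Discrete analogue of IF(z, G) = 2 (g(z) - T(G)), at category index i."""
    return 2.0 * (probs[i] - T(probs))


def one_step(sample, probs):
    """Plug-in estimate plus the sample average of the influence function."""
    return T(probs) + sum(influence(i, probs) for i in sample) / len(sample)


# Exact expectation of the 1-step estimator under P, for fixed p_tilde.
expected = T(p_tilde) + sum(influence(i, p_tilde) * p[i] for i in range(3))
bias = expected - T(p)
minus_sq_dist = -sum((a - b) ** 2 for a, b in zip(p_tilde, p))

# A simulated sample of category indices drawn from P.
rng = random.Random(0)
sample = rng.choices(range(3), weights=p, k=10_000)
print(bias, minus_sq_dist, one_step(sample, p_tilde) - T(p))
```

The exact bias matches the negative squared distance between the two probability vectors, while the simulated error additionally contains sampling noise.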

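D Showing distance results for $P_\epsilon$ and $P^{\text{rescaled}}_\Delta$

To show $\|P-P_\epsilon\|_2/\|P-\tilde{P}\|_2=\epsilon$, we have

$$\|P-P_\epsilon\|_2=\sqrt{\int\left[p(z)-p_\epsilon(z)\right]^2dz}=\sqrt{\int\left[p(z)-(1-\epsilon)p(z)-\epsilon\tilde{p}(z)\right]^2dz}=\sqrt{\epsilon^2\int\left[p(z)-\tilde{p}(z)\right]^2dz}=\epsilon\|P-\tilde{P}\|_2. \tag{D.1}$$

The fact that $\|P^{\text{rescaled}}_\Delta-P\|_2=\Delta$ now follows from

$$\|P-P^{\text{rescaled}}_\Delta\|_2=\left\|P-P_{\Delta/\|P-\tilde{P}\|_2}\right\|_2=\frac{\Delta}{\|P-\tilde{P}\|_2}\,\|P-\tilde{P}\|_2=\Delta,$$

where the first equality follows from the definition of $P^{\text{rescaled}}_\Delta$, and the second equality comes from Eq. (D.1).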
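Both distance results can be checked numerically in a discrete analogue, where the integral in Eq. (D.1) becomes a sum. The vectors `p` and `p_tilde` and the helper names below are hypothetical, chosen only to illustrate the two identities.

```python
import math

p = [0.6, 0.3, 0.1]        # hypothetical target distribution P
p_tilde = [0.5, 0.3, 0.2]  # hypothetical initial estimate P-tilde


def l2(a, b):
    """Discrete analogue of the L2 distance between two distributions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def path(eps):
    """P_eps = (1 - eps) P + eps P_tilde."""
    return [(1 - eps) * a + eps * b for a, b in zip(p, p_tilde)]


def rescaled_path(delta):
    """P_delta^rescaled, defined as P_eps with eps = delta / ||P - P_tilde||_2."""
    return path(delta / l2(p, p_tilde))


for eps in [0.25, 0.5, 1.0]:
    print("eps", eps, "ratio", l2(p, path(eps)) / l2(p, p_tilde))

for delta in [0.01, 0.05, 0.1]:
    print("delta", delta, "distance", l2(p, rescaled_path(delta)))
```

The printed ratio reproduces ϵ and the printed distance reproduces Δ, up to floating point rounding.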

E Proof of Remark 1

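We begin with Eq. (4.3), which we will show using Taylor’s Theorem and Condition 4.1 for j = 1. Taylor’s Theorem implies that there exists a value $\bar{\epsilon}\in[0,1]$ such that

$$T(P_1)-T(P_0)=\frac{\partial}{\partial\epsilon}T(P_\epsilon)\Big|_{\epsilon=\bar{\epsilon}}. \tag{E.1}$$

In order to study the right-hand side, we introduce a function to help map between distributions in the form of $P_\epsilon$ and $P^{\text{rescaled}}_\Delta$. Let $D(\epsilon):=\epsilon\|\tilde{P}-P\|_2$, with inverse function $D^{-1}(\Delta):=\Delta/\|\tilde{P}-P\|_2$, such that $P_\epsilon=P^{\text{rescaled}}_{D(\epsilon)}$ and $P^{\text{rescaled}}_\Delta=P_{D^{-1}(\Delta)}$. (For notational convenience, we omit the dependence on $\tilde{P}$ when writing $D$, $D^{-1}$, $P^{\text{rescaled}}_\Delta$, and $\bar{\epsilon}$.) Returning to Eq. (E.1), we have

$$\begin{aligned}
\frac{\partial T(P_\epsilon)}{\partial\epsilon}
&= \frac{\partial T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial\epsilon} \\
&= \left\{\frac{\partial T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)}\right\}\left\{\frac{\partial D(\epsilon)}{\partial\epsilon}\right\} && \text{by the chain rule} \\
&= \left\{\frac{\partial T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)}\right\}\|\tilde{P}-P\|_2.
\end{aligned} \tag{E.2}$$

Plugging this into Eq. (E.1), we have

$$T(\tilde{P})-T(P)=\frac{\partial T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)}\Big|_{\epsilon=\bar{\epsilon}}\|\tilde{P}-P\|_2=\frac{\partial T(P^{\text{rescaled}}_\Delta)}{\partial\Delta}\Big|_{\Delta=D(\bar{\epsilon})}\|\tilde{P}-P\|_2=O(1)\times\|\tilde{P}-P\|_2 \tag{E.3}$$

$$=O(\|\tilde{P}-P\|_2), \tag{E.4}$$

where the limits in Eq. (E.3) & Eq. (E.4) are taken as $\|\tilde{P}-P\|_2\to 0$. To arrive at Eq. (E.3), note that when $\|\tilde{P}-P\|_2\to 0$ we have $D(\bar{\epsilon})=\bar{\epsilon}\|\tilde{P}-P\|_2\le\|\tilde{P}-P\|_2\to 0$, and therefore $\frac{\partial T(P^{\text{rescaled}}_\Delta)}{\partial\Delta}\big|_{\Delta=D(\bar{\epsilon})}=O(1)$ by Condition 4.1 (with j = 1).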

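Turning to Eq. (4.2), the first equality follows from Eq. (3.2) and Eq. (3.8).

We can show the second equality of Eq. (4.2) by again applying Taylor’s Theorem and Condition 4.1, this time with j = 2. Taylor’s Theorem implies that there exists a value $\bar{\epsilon}\in[0,1]$ satisfying $R_2=(1/2)\frac{\partial^2}{\partial\epsilon^2}T(P_\epsilon)\big|_{\epsilon=\bar{\epsilon}}$, as discussed in the text following Eq. (3.2). To study this second derivative of $T(P_\epsilon)$, we will show that, for finite j,

$$\frac{\partial^j T(P_\epsilon)}{\partial\epsilon^j}=\left\{\frac{\partial^j T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)^j}\right\}\|\tilde{P}-P\|_2^{\,j}. \tag{E.5}$$

The proof of Eq. (E.5) is by induction. We have already shown the base case of j = 1 in Eq. (E.2). For the induction step, given that Eq. (E.5) holds for j − 1, we can show that Eq. (E.5) holds for j as follows.

$$\begin{aligned}
\frac{\partial^j T(P_\epsilon)}{\partial\epsilon^j}
&= \frac{\partial}{\partial\epsilon}\left\{\frac{\partial^{j-1}T(P_\epsilon)}{\partial\epsilon^{j-1}}\right\} \\
&= \left[\frac{\partial}{\partial\epsilon}\left\{\frac{\partial^{j-1}T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)^{j-1}}\|\tilde{P}-P\|_2^{\,j-1}\right\}\right] && \text{by Eq. (E.5) for } j-1 \\
&= \left[\frac{\partial}{\partial D(\epsilon)}\left\{\frac{\partial^{j-1}T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)^{j-1}}\|\tilde{P}-P\|_2^{\,j-1}\right\}\right]\left[\frac{\partial D(\epsilon)}{\partial\epsilon}\right] && \text{by the chain rule} \\
&= \frac{\partial^j T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)^j}\|\tilde{P}-P\|_2^{\,j}.
\end{aligned}$$

Finally, applying Eq. (E.5), we have

$$R_2=\frac{1}{2}\frac{\partial^2}{\partial\epsilon^2}T(P_\epsilon)\Big|_{\epsilon=\bar{\epsilon}}=\frac{1}{2}\left\{\frac{\partial^2T(P^{\text{rescaled}}_{D(\epsilon)})}{\partial D(\epsilon)^2}\right\}\|\tilde{P}-P\|_2^2\,\Big|_{\epsilon=\bar{\epsilon}}=\frac{1}{2}\left\{\frac{\partial^2T(P^{\text{rescaled}}_\Delta)}{\partial\Delta^2}\right\}\Big|_{\Delta=D(\bar{\epsilon})}\|\tilde{P}-P\|_2^2=O(\|\tilde{P}-P\|_2^2). \tag{E.6}$$

As in Eq. (E.3), the limit in Eq. (E.6) is taken as $\|\tilde{P}-P\|_2\to 0$. Eq. (E.6) comes from the fact that when $\|\tilde{P}-P\|_2\to 0$ we have $D(\bar{\epsilon})=\bar{\epsilon}\|\tilde{P}-P\|_2\le\|\tilde{P}-P\|_2\to 0$, and therefore $\frac{\partial^2T(P^{\text{rescaled}}_\Delta)}{\partial\Delta^2}\big|_{\Delta=D(\bar{\epsilon})}=O(1)$ by Condition 4.1 (with j = 2).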
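The two rates can be illustrated numerically with a discrete analogue of the squared density functional. This is a sketch under stated assumptions: `p`, `direction`, and the helper `scaled_errors` are hypothetical, and R2 is computed as the exact bias of the 1-step estimator for a fixed initial estimate, as in Appendix C. As the initial estimate approaches P along a fixed direction, the plug-in error divided by the distance stays bounded, and R2 divided by the squared distance stays bounded.

```python
import math

p = [0.6, 0.3, 0.1]           # hypothetical target distribution P
direction = [-1.0, 0.0, 1.0]  # sums to zero, so p + t * direction sums to one


def T(probs):
    """Discrete analogue of the squared density functional."""
    return sum(x * x for x in probs)


def scaled_errors(t):
    """Return (plug-in error / distance, R2 / distance^2) for P_tilde = p + t * direction."""
    p_tilde = [a + t * d for a, d in zip(p, direction)]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_tilde, p)))
    plug_in_error = T(p_tilde) - T(p)
    # Expected 1-step estimate under P, using IF(z_i, G) = 2 (g_i - T(G)).
    t_tilde = T(p_tilde)
    one_step_mean = t_tilde + sum(2 * (b - t_tilde) * a
                                  for a, b in zip(p, p_tilde))
    r2 = one_step_mean - T(p)
    return plug_in_error / dist, r2 / dist ** 2


for t in [0.1, 0.01, 0.001]:
    print(t, scaled_errors(t))
```

For this particular functional the second ratio equals −1 for every t (up to rounding), since the bias is exactly the negative squared distance.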

Notes

1 Absorbing a negative sign into the definition of R2 will help to simplify residual terms later on.

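2 To apply Eq. (3.5) in Eq. (3.7), we rearrange $\frac{\partial}{\partial\epsilon}T(P_\epsilon)\big|_{\epsilon=1}$ as $\frac{\partial}{\partial\epsilon}T(P+\epsilon(\tilde{P}-P))\big|_{\epsilon=1}=-\frac{\partial}{\partial a}T(\tilde{P}+a(P-\tilde{P}))\big|_{a=0}$.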

3 To account for estimation of $\tilde{P}$, the bracketed term in Eq. (4.5) can be written as

$$\frac{1}{n}\sum_{i=1}^{n}\left[\{IF(z_i,\tilde{P})-IF(z_i,P)\}-E_P\{IF(Z,\tilde{P})-IF(Z,P)\}\right]+\frac{1}{n}\sum_{i=1}^{n}\left[IF(z_i,P)-E_P\,IF(Z,P)\right].$$

Note that both summations are centered around their expectation. The first summation can be studied using empirical process theory, and the second summation can be studied as a simple sample average (see, for example, van der Laan and Rubin, 2006; van der Vaart, 2000).

4 Also see Figure 4 in the Appendix.


    References

  • Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973. 
  • Belagiannis, V., Rupprecht, C., Carneiro, G., and Navab, N. (2015). Robust optimization for deep regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 2830–2838. 
  • Bickel, P. J. (1975). One-step Huber estimates in the linear model. Journal of the American Statistical Association, 70(350):428–434.
  • Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press, Baltimore.
  • Bickel, P. J. and Ritov, Y. (1988). Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, 50(3). 
  • Birgé, L. and Massart, P. (1995). Estimation of integral functionals of a density. The Annals of Statistics, 23(1):11–29. 
  • Carone, M., Díaz, I., and van der Laan, M. J. (2014). Higher-order targeted minimum loss-based estimation. 
  • Casella, G. and Berger, R. L. (2002). Statistical inference, volume 2. Duxbury Pacific Grove, CA. 
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21:C1–C68. 
  • Christmann, A. and Steinwart, I. (2004). On robustness properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5(Aug):1007–1034. 
  • Clarke, B. R. (2000). A review of differentiability in relation to robustness with application to seismic data analysis. Proceedings of the Indian National Science Academy, 66(5):467–482. 
  • Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007. 
  • Fernholz, L. T. (1983). Von Mises calculus for statistical functionals. Lecture Notes in Statistics (Springer-Verlag). 
  • Giné, E. and Nickl, R. (2008). A simple adaptive estimator of the integrated square of a density. Bernoulli, 14(1):47–61. 
  • Giordano, R., Stephenson, W., Liu, R., Jordan, M., and Broderick, T. (2019). A Swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1139–1147.
  • Hampel, F. R. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics, pages 1887–1896. 
  • Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393.
  • Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust statistics. Wiley Online Library. 
  • Huber, P. J. (1981). Robust statistics. Wiley. 
  • Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley. 
  • Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., and Robins, J. M. (2014). Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. arXiv preprint arXiv:1411.4342.
  • Kang, J. D. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539.
  • Kennedy, E. H. (2016). Semiparametric theory and empirical processes in causal inference. In Statistical Causal Inferences and Their Applications in Public Health Research, pages 141–167. Springer. 
  • Kennedy, E. H., Ma, Z., McHugh, M. D., and Small, D. S. (2017). Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B, 79(4):1229–1245. 
  • Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR.org.
  • Kraft, C. H. and van Eeden, C. (1972). Asymptotic efficiencies of quick methods of computing efficient estimates based on ranks. Journal of the American Statistical Association, 67(337):199–202. 
  • La Vecchia, D., Ronchetti, E., and Trojani, F. (2012). Higher-order infinitesimal robustness. Journal of the American Statistical Association, 107(500):1546–1557. 
  • Laurent, B. (1996). Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659–681. 
  • Maronna, R. A., Martin, R. D., Yohai, V. J., and Salibián-Barrera, M. (2019). Robust statistics: theory and methods (with R). John Wiley & Sons. 
  • Pfanzagl, J. (1982). Contributions to a general asymptotic statistical theory. Springer Science & Business Media. 
  • Robins, J., Li, L., Tchetgen, E., and van der Vaart, A. W. (2009). Quadratic semiparametric von Mises calculus. Metrika, 69(2-3):227–247.
  • Robins, J. M., Li, L., Mukherjee, R., Tchetgen Tchetgen, E., and van der Vaart, A. W. (2017). Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987. 
  • Robins, J. M., Li, L., Tchetgen Tchetgen, E. J., and van der Vaart, A. W. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman.
  • Robins, J. M. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129. 
  • Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866. 
  • Serfling, R. J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons. 
  • Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. New York: Springer. 
  • van der Laan, M. J. (2006). Statistical inference for variable importance. The International Journal of Biostatistics, 2(1). 
  • van der Laan, M. J. and Robins, J. M. (2003). Unified methods for censored longitudinal data and causality. Springer Science & Business Media. 
  • van der Laan, M. J. and Rose, S. (2011). Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media. 
  • van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1). 
  • van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press. 
  • van der Vaart, A. W. (2002). Part III: Semiparametric statistics. Lectures on Probability Theory and Statistics, pages 331–457.
  • von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18(3):309–348.
  • Wager, S., Hastie, T., and Efron, B. (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal of Machine Learning Research, 15(1):1625–1651. 
  • Williamson, B. D., Gilbert, P. B., Simon, N., and Carone, M. (2017). Nonparametric variable importance assessment using machine learning techniques. UW Biostatistics Working Paper Series, Working Paper 422. 
  • Xu, Y., Zhu, S., Yang, S., Zhang, C., Jin, R., and Yang, T. (2018). Learning with non-convex truncated losses by SGD. arXiv preprint arXiv:1805.07880.
 
